Kubernetes
Overview
Kubernetes clusters run as managed VMs inside a VPC, fronted by a managed Load Balancer that exposes the Kubernetes API server. Each cluster ships with:
- Single or HA control plane -- 1 node (cost-optimised) or 3 nodes (survives single-node failure)
- Manual or autoscaled workers -- set a fixed count, or pin min/max and let the cluster autoscaler add nodes when pods are pending
- Integrated Cloud Controller Manager (CCM) -- Service: type=LoadBalancer in your manifests creates a real managed Load Balancer on the platform
- Rolling upgrades -- one-click Kubernetes minor/patch upgrades for both control plane and workers, with drain + surge protection
- Per-cluster security groups -- separate rule scopes for the LB (API exposure), control plane, and workers
- Custom TLS on the API endpoint -- bind a Let's Encrypt or custom certificate to a domain you point at the API LB
Each cluster lives inside one VPC and exposes its API server through a dedicated public + private endpoint (or private-only).
Admin Configuration
Before users can create clusters, an administrator must enable Kubernetes per region and register at least one supported version.
Enabling Kubernetes on a Region
- Navigate to Compute → Hypervisor Groups
- Open the hypervisor group (region) you want to enable
- Tick Enable Kubernetes service in this group
- Make sure the same group has VPC and Load Balancer enabled (Kubernetes needs both)
- Save
Screenshot: Admin > Hypervisor Groups > Edit page showing the "Enable Kubernetes service in this group" toggle.
The region's slug is used as the Kubernetes region label baked into node hostnames. The slug is locked once any cluster exists in the group.
Registering Supported Versions
The supported-versions catalogue is the curated list of Kubernetes versions users can pick when creating or upgrading a cluster.
- Navigate to Kubernetes → Supported Versions in the admin sidebar
- Click Register Version
- Fill in:
- Semantic Version -- e.g. 1.31.0
- State -- Active (selectable), Deprecated (selectable with warning), or EOL (hidden from new clusters)
- Control Plane Image -- image baked for control plane nodes (purpose = Kubernetes Control Plane)
- Worker Image -- image baked for worker nodes (purpose = Kubernetes Worker)
- Min CPU Cores / Min RAM (MB) -- minimum plan requirements (used to filter the plan picker on the Create wizard)
- EOL Date -- optional date the version reaches end-of-life
- Upgrade From -- comma-separated list of versions that may be upgraded to this version (e.g. 1.30.0, 1.30.1)
- Bundled Components -- optional JSON describing co-packaged components (etcd, coredns, cilium, etc.)
- Save
Screenshot: Admin > Kubernetes > Supported Versions page showing the registered versions table and the "Register Version" modal.
You bake the control plane and worker images yourself. See the Kubernetes Image Baking guide for the full procedure -- everything from base OS prerequisites to which container images to pre-pull.
Admin Cluster Management
Admins can view and operate on all user clusters from Kubernetes → Clusters in the admin sidebar. The admin cluster page mirrors the user view and adds destructive controls:
- Force destroy -- terminate the cluster even when normal delete fails
- Force evict -- forcibly remove a single node that won't drain cleanly
- Suspend / Resume -- suspend a cluster (stops billing for nodes) without deleting it
- Reset state -- clear a stuck Provisioning/Upgrading state to Running (use after manual intervention)
- Cancel task -- cancel a stuck task
All destructive actions are audit-logged.
Prerequisites
Before creating a cluster you need:
- A region with Kubernetes enabled -- the region must have VPC, Load Balancer, and Kubernetes enabled by an administrator
- A VPC in that region -- the cluster will run inside it
- A NAT Gateway attached to the VPC -- workers need outbound internet to pull images and reach pkgs.k8s.io
- At least one private subnet -- control plane nodes must sit in a private subnet
- Sufficient credits -- clusters are billed hourly per node (control plane + workers + control plane LB)
If any prerequisite is missing, the Create page surfaces a warning telling you exactly what to fix.
Screenshot: User panel > Kubernetes > Create page showing the Region & VPC step with NAT Gateway warning when a VPC has no NAT GW attached.
Creating a Cluster
- Navigate to Kubernetes → Clusters in the user sidebar
- Click Create Cluster
- Walk through the six-step wizard described below
Screenshot: User panel > Kubernetes > Create page showing the six-step wizard with progress indicator at the top.
Step 1 -- Region & Networking
- Region -- only regions with VPC + Load Balancer + Kubernetes enabled appear
- VPC -- only VPCs in the selected region appear; the panel shows whether the VPC has a NAT Gateway attached
Workers need outbound internet to pull container images. Create a NAT Gateway on the VPC before deploying the cluster.
Step 2 -- Subnets
- Control plane subnet -- must be a private subnet. The API server is reached through the public LB, never directly from the internet.
- Worker subnet -- defaults to the control plane subnet. Pick a different one to isolate workers.
Step 3 -- Cluster Identity
- Name -- display name (up to 50 characters)
- Slug -- lowercase DNS-compatible identifier used in node hostnames (up to 63 chars, [a-z0-9-])
- Description -- optional free-text notes
- Kubernetes Version -- pick from the admin-curated list of supported versions
- Endpoint Mode:
- Private -- API server reachable only from inside the VPC
- Public & Private -- API server reachable from both the internet and the VPC (defaults to public)
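The slug constraints above can be checked locally before submitting the wizard. The following is a minimal Python sketch, not the platform's own validator: the function name `is_valid_slug` is hypothetical, and the rule that a slug must start and end with a letter or digit is an assumption borrowed from standard DNS label rules rather than something this guide states.

```python
import re

# Stated in the guide: lowercase [a-z0-9-], up to 63 characters.
# Assumption (standard DNS label rules, not confirmed for this platform):
# the slug must start and end with a letter or digit.
SLUG_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def is_valid_slug(slug: str) -> bool:
    """Return True if the slug fits the documented character set and length."""
    return SLUG_RE.fullmatch(slug) is not None


print(is_valid_slug("prod-cluster-1"))  # True
print(is_valid_slug("Prod_Cluster"))    # False: uppercase and underscore
```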
Step 4 -- Control Plane
- Control Node Count:
- 1 node (Single) -- lower cost, no redundancy. Recommended for dev / staging.
- 3 nodes (HA) -- survives single-node failure. Recommended for production.
- Control Plane Plan -- CPU / RAM / disk for each control plane VM
- Control Plane Load Balancer Plan -- resource size for the managed LB fronting the API server
Step 5 -- Workers
The wizard creates the cluster's default node pool. You can add more pools after the cluster is up (see Node Pools for mixed-shape clusters, GPU pools, etc.).
- Worker Plan -- CPU / RAM / disk for each worker VM (plan-defined; per-pool, not changeable on existing nodes)
- Worker Count -- initial number of workers in the default pool (1--100)
- Enable cluster autoscaler -- when on, the autoscaler grows/shrinks the default pool within bounds
- Min Size -- floor (autoscaler will never scale below this)
- Max Size -- ceiling
- Pod CIDR -- in-cluster pod IP range (default 10.244.0.0/16)
- Service CIDR -- ClusterIP service range (default 10.96.0.0/12)
- Pod Security Admission Default -- cluster-wide default PSA profile (privileged, baseline, or restricted)
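Kubernetes generally expects the pod and service ranges not to overlap each other or the network the nodes sit in. Whether the wizard enforces this is not stated here, so it can be worth checking by hand. A minimal Python sketch using the standard `ipaddress` module follows; the function name and the VPC subnet value are illustrative assumptions.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of range names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    ]


# Wizard defaults plus a hypothetical VPC subnet (assumed value).
ranges = {
    "pod": "10.244.0.0/16",
    "service": "10.96.0.0/12",
    "vpc-subnet": "10.100.0.0/24",
}
print(find_overlaps(ranges))  # [('service', 'vpc-subnet')]
```

Here the assumed subnet falls inside 10.96.0.0/12, so either the subnet or the Service CIDR would need to change.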
Step 6 -- Review & Create
The review step shows every selection and an estimated hourly cost (when hourly billing is enabled). Click Create Cluster to launch.
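As a rough cross-check of the estimate, the hourly cost is the sum of the control plane nodes, the workers, and the control plane LB. The Python sketch below shows that arithmetic. Every price in it is a made-up placeholder, and `hourly_cost` is a hypothetical helper, not a platform API.

```python
from decimal import Decimal


def hourly_cost(cp_nodes, cp_price, workers, worker_price, lb_price):
    """Sum the hourly price of control plane nodes, workers, and the API LB."""
    return cp_nodes * cp_price + workers * worker_price + lb_price


# All prices are placeholders for illustration, not real platform rates.
estimate = hourly_cost(
    cp_nodes=3, cp_price=Decimal("0.04"),
    workers=5, worker_price=Decimal("0.06"),
    lb_price=Decimal("0.02"),
)
print(estimate)  # 0.44
```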
Provisioning typically takes 5--12 minutes depending on cluster size. The cluster page shows a live progress overlay throughout.
Screenshot: Cluster show page during creation showing the progress overlay with phase name, percentage, and an optional "Show details" log expansion.
Cluster Overview Page
Once provisioning starts, the cluster detail page is the single place to monitor and manage the cluster.
Screenshot: User panel > Kubernetes > Cluster show page with the metadata card at the top (name, status badge, region, version, CP count, worker count, public/private endpoint URLs) and the tab bar below.
The header shows:
- State badge -- Provisioning, Running, Upgrading, Suspended, Failed
- Region, version, control plane count, worker count -- quick stat tiles
- Public Endpoint -- the URL of the API server LB (public). Copy with the clipboard button.
- Private Endpoint -- the URL reachable inside the VPC
- Kubeconfig button -- downloads a kubeconfig file you can plug into kubectl
- Redeploy CCM button -- re-applies the in-cluster cloud-controller-manager (needed only if your master domain changes)
- Delete button -- destroys the cluster and all backing nodes
Kubeconfig
Click Kubeconfig to download a YAML file. Save it and export the path:
export KUBECONFIG=~/Downloads/cluster-mycluster.yaml
kubectl get nodes
You should see the control plane node(s) and workers listed as Ready.
The downloaded kubeconfig points at the public endpoint by default if the cluster has a public LB, otherwise the private endpoint. To reach a private-only cluster, run kubectl from a VM inside the same VPC.
Automatic Certificate Renewal
The cluster's internal certificates and the in-cluster controller tokens (cluster autoscaler, cloud controller manager) renew automatically. You do not need to schedule renewals or run any commands.
What rotates and when:
| Asset | When | Action you need to take |
|---|---|---|
| Control-plane and etcd PKI | 30 days before expiry | None |
| Admin kubeconfig (the one you download from the panel) | Reissued with the PKI above | Re-download from the cluster page after the renewal email |
| Cluster Autoscaler controller token | 30 days before expiry, from inside the cluster | None |
| Cloud Controller Manager token | 30 days before expiry, from inside the cluster | None |
You receive an email 30 days before renewal as a heads-up, and a second email immediately after renewal telling you to re-download the kubeconfig. The cluster page also shows a banner with a one-click download until you acknowledge it.
In-cluster workloads, ingress, the autoscaler, and the CCM are not interrupted during renewal. Only kubectl sessions using the old admin kubeconfig need a refresh.
The cluster autoscaler manifest you apply from the Workers tab is applied once. The autoscaler rotates its own controller token from inside the cluster, so you never need to reapply it.
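To anticipate when a renewal will happen, subtract the 30-day lead time in the table from the asset's expiry date. The Python sketch below shows that date arithmetic. The expiry date is a hypothetical value and the helper names are illustrative.

```python
from datetime import date, timedelta

RENEWAL_LEAD = timedelta(days=30)  # "30 days before expiry" from the table


def renewal_date(expiry: date) -> date:
    """Date on which automatic renewal is expected to start."""
    return expiry - RENEWAL_LEAD


def renewal_due(expiry: date, today: date) -> bool:
    """True once today falls inside the 30-day renewal window."""
    return today >= renewal_date(expiry)


expiry = date(2026, 3, 1)  # hypothetical certificate expiry
print(renewal_date(expiry))  # 2026-01-30
```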
Tabs
The cluster page is organised into tabs across the top of the body.
Screenshot: Cluster show page tab bar with Nodes, Tasks, Workers, Security, SSL & Domains.
Nodes Tab
Lists every VM backing the cluster (control plane + workers) with role, hostname, status, and creation time. Click a row to open a side drawer with:
- CPU and memory donut gauges -- requested vs. allocatable
- Top pods -- highest CPU and memory consumers on that node
- Pod list -- all pods scheduled on the node with their status
Use this view to confirm a node went Ready and to find pods crowding a node before scaling.
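The "requested vs. allocatable" gauge is a ratio of summed pod requests to the node's allocatable capacity. The Python sketch below reproduces that calculation for CPU using Kubernetes quantity notation (`500m` is half a core). The helper names are hypothetical, and the panel's exact rounding is not documented here.

```python
def cpu_to_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '500m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def requested_percent(requests: list[str], allocatable: str) -> float:
    """Percentage of allocatable CPU claimed by the given pod requests."""
    total = sum(cpu_to_cores(r) for r in requests)
    return round(100 * total / cpu_to_cores(allocatable), 1)


# Three pods requesting 0.5, 0.5 and 1 core on a node with 4 allocatable cores.
print(requested_percent(["500m", "500m", "1"], "4"))  # 50.0
```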
Tasks Tab
Shows every operation performed against the cluster (create, scale, upgrade, delete, etc.) with status, progress, and per-task logs. Click a task row to expand its log stream.
Useful when:
- A provisioning task fails -- the log shows the exact phase and error
- An upgrade is in flight -- watch each wave complete
- You want a per-step audit trail of who did what
Filter logs by source (Master / Slave / Cluster / Controller) using the dropdown.
Workers Tab
The workhorse tab for day-2 worker operations.
Screenshot: Workers tab showing autoscaling toggle with min/max bounds, "Scale Workers" form, "Upgrade Workers" form, "Upgrade Control Plane" form, "Worker Labels & Taints" editor, and the Worker Nodes table at the bottom.
Autoscaling
Toggle the cluster autoscaler on/off and set min/max worker bounds. Workers will scale within these bounds in response to pending pods and node utilisation.
Scale Workers
Manually set a desired worker count. An optional reason string is logged for auditing.
Upgrade Workers
Roll workers to a newer Kubernetes version using a surge-replace strategy:
- Target version -- pick from the supported versions list
- Max surge -- how many extra workers to launch at once (1 means one at a time, so at most one extra worker runs during the upgrade)
- Drain grace (sec) -- how long to allow pods to terminate gracefully before forcing removal
The upgrade adds a new worker on the target version, drains an old one, terminates it, and repeats until every worker is on the new version.
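The surge-replace loop can be sketched as a simple wave plan. This is an illustrative Python model, assuming each wave replaces up to Max surge workers. It is not the platform's scheduler, and `plan_waves` is a hypothetical name.

```python
def plan_waves(worker_count: int, max_surge: int) -> list[int]:
    """Return how many workers are replaced in each upgrade wave."""
    if worker_count < 0 or max_surge < 1:
        raise ValueError("worker_count must be >= 0 and max_surge >= 1")
    waves = []
    remaining = worker_count
    while remaining > 0:
        batch = min(max_surge, remaining)  # the last wave may be smaller
        waves.append(batch)
        remaining -= batch
    return waves


print(plan_waves(10, 3))       # [3, 3, 3, 1]
print(len(plan_waves(10, 1)))  # 10
```

A higher Max surge shortens the upgrade at the cost of more temporary nodes per wave.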
Upgrade Control Plane
Rolling upgrade for the control plane using surge=1 -- one new CP node provisioned per wave, then one old CP drained. The number of waves equals the control plane node count (1 or 3).
Worker scale and worker upgrade are blocked while a control plane upgrade is running. Wait for the CP upgrade to finish before scaling workers.
Worker Labels & Taints
Apply Kubernetes labels (key=value) and taints (key=value:effect) to all workers. Taint effects: NoSchedule, PreferNoSchedule, NoExecute. Click Apply to push the changes to the cluster.
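Before pasting taints into the editor, it can help to check that each one follows the key=value:effect shape with one of the three effects listed above. A minimal Python sketch follows. `parse_taint` is a hypothetical helper, and it only accepts the form this page documents.

```python
from typing import NamedTuple

VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


class Taint(NamedTuple):
    key: str
    value: str
    effect: str


def parse_taint(text: str) -> Taint:
    """Parse 'key=value:effect' into its parts, validating the effect."""
    spec, sep, effect = text.rpartition(":")
    if not sep or effect not in VALID_EFFECTS:
        raise ValueError(f"invalid or missing taint effect in {text!r}")
    key, eq, value = spec.partition("=")
    if not key or not eq:
        raise ValueError(f"expected key=value before the effect in {text!r}")
    return Taint(key, value, effect)


print(parse_taint("dedicated=gpu:NoSchedule"))
```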
Worker Nodes table
Per-node row with hostname, status, creation time, and a Delete button to cordon, drain, and remove a single worker.
Security Tab
Per-cluster security group rules grouped into three scopes:
- LB -- rules for the public LB fronting the API server (covers who can reach the K8s API)
- Control Plane -- rules for the control plane nodes themselves
- Worker -- rules for worker nodes
Each scope has Inbound and Outbound sub-tabs. Traffic in either direction is denied by default unless a rule explicitly allows it.
Screenshot: Security tab showing the three SG scope cards (LB / CP / Worker), inbound/outbound sub-tabs, and an "Add Rule" modal.
To add a rule:
- Pick the scope (LB / CP / Worker) and direction (Inbound / Outbound)
- Click Add Rule
- Fill in:
- Protocol -- TCP, UDP, ICMP, ICMPv6, or Any
- Port Range -- min / max (disabled for ICMP / Any)
- IP Version -- IPv4 or IPv6
- CIDR -- e.g. 0.0.0.0/0 for any, 203.0.113.10/32 for one host
- Description -- free text
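The default-deny behaviour can be modelled as "allow only if at least one rule matches". The Python sketch below illustrates that evaluation for a single scope and direction. The `Rule` fields mirror the form above, but the class and function are hypothetical and do not describe how the platform implements security groups.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    protocol: str  # "tcp", "udp", "icmp", "icmpv6", or "any"
    cidr: str
    port_min: int = 0
    port_max: int = 0


def is_allowed(rules, protocol, source_ip, port=0):
    """Default deny: return True only if some rule matches the traffic."""
    addr = ipaddress.ip_address(source_ip)
    for rule in rules:
        if rule.protocol not in ("any", protocol):
            continue
        # Port ranges only apply to TCP and UDP rules.
        if rule.protocol in ("tcp", "udp") and not (
            rule.port_min <= port <= rule.port_max
        ):
            continue
        if addr in ipaddress.ip_network(rule.cidr):
            return True
    return False


# One inbound rule on the LB scope: API port from a single host.
lb_inbound = [Rule("tcp", "203.0.113.10/32", 6443, 6443)]
print(is_allowed(lb_inbound, "tcp", "203.0.113.10", 6443))  # True
print(is_allowed(lb_inbound, "tcp", "198.51.100.7", 6443))  # False
```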
SSL & Domains Tab
Bind a TLS certificate to the cluster API load balancer so you can reach the API on a custom domain (e.g. k8s.example.com).
- Point a DNS CNAME from your domain to the public LB hostname shown in the info banner
- Click Add Certificate
- Pick a source:
- Let's Encrypt -- the platform issues + auto-renews a free certificate. Requires DNS to already resolve to the LB.
- Custom -- paste your own PEM certificate chain + private key
Once issued, the certificate is bound to the API LB and kubectl against your custom domain will work without --insecure-skip-tls-verify.
Screenshot: SSL & Domains tab showing the info banner with the CNAME target, the SSL Certificates table, and the Add Certificate modal with issuance method selection.
Using kubectl
After downloading the kubeconfig:
export KUBECONFIG=~/Downloads/cluster-mycluster.yaml
# Smoke checks
kubectl get nodes
kubectl get pods -A
# Deploy nginx
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=LoadBalancer
# Wait for external IP to populate
kubectl get svc nginx -w
The Service: type=LoadBalancer triggers the in-cluster Cloud Controller Manager to create a managed Load Balancer on the platform automatically. The Service's EXTERNAL-IP will be the LB's public IP once provisioning completes (typically 30--90 seconds).
For the full list of supported Service annotations (custom plans, SSL termination, source-range filtering, health checks, blue/green and canary traffic splitting, etc.) see the Service LoadBalancer annotation reference.
Cluster Autoscaler
When enabled, the cluster autoscaler runs as a Deployment inside the cluster and:
- Scales up when there are unschedulable pods that would fit if a new worker existed
- Scales down when a worker has been underutilised for 10 minutes and its pods can move elsewhere
Bounds (min / max) are enforced strictly. The autoscaler will never go below min even if all pods could be evicted, and never above max even if more pods are pending.
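Strict bounds amount to clamping whatever worker count the autoscaler would like into the [min, max] interval. A minimal Python sketch of that rule follows, with a hypothetical function name. The real autoscaler's decision logic is more involved.

```python
def clamp_worker_count(desired: int, min_size: int, max_size: int) -> int:
    """Clamp a desired worker count into the configured [min, max] bounds."""
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return max(min_size, min(desired, max_size))


print(clamp_worker_count(1, min_size=3, max_size=10))   # 3: never below min
print(clamp_worker_count(14, min_size=3, max_size=10))  # 10: never above max
```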
To tweak autoscaling behaviour:
- Go to the Workers tab
- Toggle Enable cluster autoscaler on/off
- Adjust Minimum workers and Maximum workers
- Click Save bounds
Set min to your baseline (what handles steady-state traffic) and max to your absolute ceiling -- not just a comfortable peak. The autoscaler will only add nodes when actually needed, so a high max is free until you have pending pods.
Node Pools
Workers can be split across multiple node pools, each with its own instance plan, labels, taints, autoscaling bounds, and drain policy. Use this to mix general / memory / compute / GPU shapes in one cluster, or to taint-isolate noisy workloads from system pods.
The cluster wizard creates a single default pool. Open the Pools tab on the cluster page to add, edit, or delete pools.
For per-pool field reference, scheduling examples (nodeSelector / tolerations), and rate-limit tuning, see the Node Pools guide.
Rolling Upgrades
Both control plane and workers are upgraded using a surge strategy: provision new node(s) on the target version, drain the old node(s), terminate, repeat.
Worker Upgrade
- Workers tab → Upgrade Workers card
- Pick a Target version
- Set Max surge (how many new workers to provision per wave; 1 = at most one extra worker at a time, higher = faster)
- Set Drain grace (sec) (default 30; pods get this long to exit cleanly before force-removal)
- Click Upgrade
The upgrade runs in waves until every worker is on the target version. Watch progress on the Tasks tab.
Control Plane Upgrade
- Workers tab → Upgrade Control Plane card
- Pick a Target version
- Set Drain grace (sec)
- Click Upgrade CP
CP upgrade always uses surge=1: one new CP node per wave, one old CP drained per wave. Total waves = control plane node count.
Even with surge + drain, briefly cycling control plane nodes can introduce small API server hiccups visible to kubectl and Service controllers. Schedule CP upgrades when downstream traffic is low.
Skip-Version Upgrades
Kubernetes does not support skipping minor versions (e.g. 1.30 → 1.32 in one hop). The dropdown only shows versions that are valid upgrade targets from your current version, as configured by the administrator.
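The filtering behind the dropdown can be pictured as two checks: the target's Upgrade From list must include the current version, and the target may be at most one minor version ahead. The Python sketch below models this under those assumptions. The catalogue contents are invented for illustration and the function names are hypothetical.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'major.minor.patch' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def skips_minor(current: str, target: str) -> bool:
    """True if the target is more than one minor ahead or changes the major."""
    cur, tgt = parse_version(current), parse_version(target)
    return tgt[0] != cur[0] or tgt[1] - cur[1] > 1


def valid_targets(current: str, catalogue: dict[str, list[str]]) -> list[str]:
    """catalogue maps each version to its admin-configured Upgrade From list."""
    return sorted(
        (
            version
            for version, sources in catalogue.items()
            if current in sources and not skips_minor(current, version)
        ),
        key=parse_version,
    )


# Hypothetical catalogue, for illustration only.
catalogue = {
    "1.30.1": ["1.30.0"],
    "1.31.0": ["1.30.0", "1.30.1"],
    "1.32.0": ["1.31.0"],
}
print(valid_targets("1.30.0", catalogue))  # ['1.30.1', '1.31.0']
```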
Backups & Snapshots
Etcd snapshots are taken automatically on a schedule configured at the admin level. Restoring from a snapshot is currently an admin-initiated operation -- contact support if you need to roll back cluster state.
Lifecycle Operations
From the cluster header you can:
- Redeploy CCM -- re-applies the in-cluster cloud-controller-manager. Use after the platform's master domain changes (rare).
- Delete -- destroys the cluster, all CP and worker VMs, the API LB, and the cluster's SGs. Irreversible.
- Force Cleanup -- visible if a previous delete failed; force-removes whatever residual records remain. Use only after a normal delete fails.
Deleting a cluster removes every node, the API LB, certificates, and all in-cluster data. Take a backup of anything you need first.
Troubleshooting
Cluster stuck in Provisioning
- Check the Tasks tab -- the failing task's log shows the phase and error
- Verify the VPC's NAT Gateway is healthy and has internet -- workers will not finish bootstrap without outbound DNS / package fetch
- Verify the chosen Kubernetes version still has its CP and Worker images registered (admin task)
kubectl connection refused
- Confirm the cluster is Running on the overview page
- For a public endpoint: confirm the SG on the LB scope allows your source IP on the API port (defaults to TCP 6443)
- For a private endpoint: confirm you're running kubectl from a host inside the VPC
Pods stuck in Pending
- Open the Nodes tab and check whether any node is under heavy resource pressure
- If autoscaling is enabled: confirm max hasn't been hit
- If autoscaling is disabled: scale workers manually from the Workers tab
Service: type=LoadBalancer stuck without EXTERNAL-IP
- Check kubectl describe svc <name> for a CreateLoadBalancerFailed Event -- it'll name a specific error like missing_annotation or unknown_lb_plan
- Verify the annotation service.beta.kubernetes.io/managed-loadbalancer-plan references an LB plan enabled in the cluster's region
- See the Service LoadBalancer annotation reference for every supported annotation
Worker upgrade stalls
- One pod with a strict PodDisruptionBudget can block drain indefinitely
- Identify the blocking pod with kubectl get pods --field-selector status.phase=Running -A -o wide, filtered to the draining node (for example, by adding spec.nodeName=<node> to the field selector)
- Either relax the PDB, scale the workload temporarily, or increase Drain grace (sec)