Beta Release Version v2.2.3
Version v2.2.3 is a major feature release headlined by Managed Kubernetes, a fully integrated Kubernetes-as-a-Service offering that runs alongside Instances, Volumes, Load Balancers, and Databases. Customers can spin up a control plane (single-node or 3-node HA), attach workers in per-purpose pools, expose Kubernetes Services through the bundled in-cluster cloud controller manager, autoscale workloads end to end with the cluster autoscaler, and roll the cluster forward to a newer Kubernetes version, all without touching the slave host. The release also ships a redesigned master backup pipeline with pluggable storage drivers and Grandfather-Father-Son retention, a new System Health dashboard widget, scheduled task health tracking, an app-wide timezone setting, team-member permissions for Kubernetes resources, retry for failed cluster creates, and a long list of reliability and performance improvements including a 70% reduction in peak load for the hot-path cron loop that runs every 30 seconds against the entire fleet.
- [Feature] Managed Kubernetes - Create production-grade Kubernetes clusters directly from the control panel. Choose single-node or 3-node HA control plane, pick instance plans and subnets for control plane and workers separately, and bring up the cluster with a bundled HAProxy load balancer for the Kubernetes API. Real-time progress streams to the cluster show page via WebSocket; downloaded kubeconfig points at the right private or public endpoint automatically.
- [Feature] Worker Node Pools - Each cluster has a default worker pool and supports unlimited additional pools, each with their own instance plan, labels, taints, autoscaling bounds, and drain settings. Useful for GPU nodes, memory-optimized workloads, or isolating tenants in a single cluster.
- [Feature] Cluster Autoscaler - Bundled cluster autoscaler binary speaks the Hypervisor API directly. Policy-driven scaling on CPU + memory pressure of pending pods, per-pool aware, and respecting each pool's min/max bounds. Manifest generated on demand from the cluster show page, customers grab the YAML and apply with
kubectl apply -f -. Controller token refreshes on a rolling schedule so long-lived clusters never need a manual re-issue. - [Feature] In-Cluster Cloud Controller Manager - Services of type LoadBalancer provision and tear down a real Hypervisor load balancer per service. Service annotations control listener port, backend mode (TCP / HTTP / per-port hybrid), session stickiness, multi-cert SNI, routing rules, and traffic split between subset endpoints.
- [Feature] Worker and Control Plane Rolling Upgrades - Upgrade Workers card on the Workers tab provisions new workers at the target version, drains old ones, repeats. Upgrade Control Plane card does the same for CPs via surge-replace strategy, etcd-quorum-safe at every step. Cluster card shows CP version and worker baseline as two distinct lines with a "mid-upgrade" badge when they diverge.
- [Feature] Retry Failed Cluster Create - A new "Retry create" button on the cluster page tears down partially provisioned artifacts and re-runs the bootstrap on the same row. Cluster name, slug, and identity certificates are preserved so any kubeconfig the user already downloaded stays valid. No more delete-and-recreate after a transient quota or capacity precondition fails.
- [Feature] Cluster Security Groups - Three auto-managed security groups per cluster (LB-only, CP-only, worker-only). Default rules expose the Kubernetes API on
:443via the LB and lock down direct access to CP nodes':6443from outside the cluster. Admins and users layer additional rules through a familiar Inbound / Outbound sub-tabbed interface. - [Feature] Restricted Kubeconfig - Downloaded kubeconfig issued at cluster create exposes only worker nodes to
kubectl get nodes. Control-plane VMs are hidden from end users in the Compute list, billing reports, monitoring tiles, and the cluster Nodes tab. - [Feature] Master Backup Pipeline Redesign - Service-oriented orchestrator with pluggable storage drivers (Local, S3-compatible, Rsync over SSH, NFS), a singleton lock that survives long uploads, Grandfather-Father-Son retention, email + webhook notifications, and a configurable cron expression. Multiple destinations supported per install. Admin pages cover Destinations, Runs, Settings, and Scheduler Health.
- [Feature] Scheduled Task Health Tracking - Every scheduled task is observed via a unified health surface. Per-task tracking of last run, duration, exit code, and consecutive failures. Compact admin Scheduler page with a slide-in drawer per task, friendly task names, and a daily prune to keep the audit table compact. Dashboard tile shows healthy / degraded / failed scheduled-task counts at the top of every page.
- [Feature] System Health Dashboard Widget - Single compact strip on the admin dashboard showing four critical metrics at a glance: most recent successful master backup, scheduled-task health rollup, in-flight long-running tasks, and queue worker failed-job count. Replaces two separate tiles from earlier releases.
- [Feature] Application Timezone Setting - Pick any IANA timezone from a new dropdown under Admin > System > Settings > General. Applied app-wide on boot (Carbon, model date casts, scheduler firing times, direct PHP date functions). Default for customers signing up via self-registration and billing-API user creations, unless explicitly overridden. Existing users keep their own timezone selection.
- [Feature] Kubernetes Team Permissions - New
kubernetes.*permission family with three tiers (view, manage, delete) granted through the existing team-member invitation flow. Predefined roles get sensible defaults from the migration. Custom roles need to be granted the new permissions explicitly. - [Feature] Admin Destructive Controls for Clusters - Dedicated section for safe escape hatches when a cluster has gone wrong. Suspend locks out the customer while preserving forensics. Reset State clears stuck-operation flags. Force Cleanup bypasses normal teardown for clusters with zombie resources. Separate rate limits keep destructive (5/hour) and recovery (20/hour) actions distinct.
- [Feature] AWS-Style Node Drill-Down - Clicking a node on the cluster Nodes tab opens a side drawer with capacity gauges (CPU / RAM / disk), pod listing with pagination and search, taints section, and modern dark/light surface styling.
- [Feature] Cluster-Managed Resource Lockdown - Worker instances and the CP load balancer carry a clear "Cluster-managed" badge and a read-only banner in the user's Compute and Load Balancers lists. Direct power cycle, plan change, or LB rule edit is blocked at the controller. Manage them through the cluster page instead.
- [Feature] Live Load Balancer Filtering - User-side Load Balancers index now supports AJAX live filtering by name, status, and VPC. Useful for customers running dozens of LBs across multiple VPCs.
- [Feature] Cluster Activity Feed - User dashboard activity feed now translates Kubernetes audit-log actions into friendly sentences ("Created cluster prod-01", "Upgraded workers to 1.35.0") alongside the other resource types.
- [Feature] Pre-Flight Quota and Capacity Guards - Cluster create form rejects at submit time when load balancer quota is exhausted, when the chosen VPC has no NAT Gateway (needed for control-plane image pulls), or when the CP subnet is not private. Clear messages name the limit and point at the affected field instead of failing deep in the bootstrap chain.
- [Improvement] Performance at Fleet Scale - A round of profiling against simulated 1000-instance / 100-hypervisor / 100-cluster fleets surfaced several hot paths in the 30-second and 2-minute cron loops. Hypervisor instance statistics collector now bulk-loads instances and bandwidth rows per hypervisor instead of one query per VM and per interface, dropping query count from roughly 5000 per tick to about 200. NAT-gateway monitor caches hypervisor health checks in-process and in Redis with a 25-second TTL, so 100 NAT gateways concentrated on 10 hypervisors fires 10 probes per tick worst case instead of 100, zero in steady state. HA monitor evacuation memoizes target-hypervisor selection per resource spec for the duration of a failure. Self-provisioning bandwidth monitor eager-loads users with their pack instances and bandwidth state. S3 bandwidth collector uses
sum by (bucket)PromQL aggregation, dropping request count from about 30,000 calls per five-minute cycle to about 20. Cloud Service billing run executes its eleven services in parallel (capped at six concurrent) with per-service DB reconnection, collapsing wall-clock from "sum of all services" to "max of the slowest one". Combined effect at fleet scale is roughly a 70% reduction in peak database and slave-RPC load during the 30-second window. - [Improvement] Failed Task Failure Reasons Visible - Cluster Tasks tab now renders the failure reason inline on any failed task card. Previously the underlying error message was stored on the task row but never displayed, users had to message support to find out why a create or upgrade failed.
- [Improvement] Cluster Network Reliability - Per-cluster network namespace routing for slave-side kubectl proxy, HTTP timeout tuned for slave queue pickup latency, atomic VPC IP allocation hardened against parallel creates against the same subnet via row-level locking.
- [Improvement] Kubeconfig Download Hygiene - Strips byte-order marks and leading whitespace that previously broke
kubectlparsing on some platforms. Always uses the restricted user-scoped kubeconfig, never falls back to the admin kubeconfig (which would have exposed control-plane visibility and cluster-wide privileges). - [Improvement] Step-Based Cluster Create Wizard - Cluster create rebuilt as a step-based wizard with AJAX-driven Select2 controls for region, VPC, subnet, instance plan, LB plan, and version pickers. Progressive disclosure with required-field asterisks, info banners explaining trade-offs (HA vs single-node, public vs private endpoint), and a cost summary tile before submission.
- [Improvement] Cluster Pages Modernization - Cluster index uses status dots, sticky headers, tabular numbers, and a refined empty state with a CTA. Cluster show pages get state-coloured accent bars, breadcrumbs, tab counts, spinner badges on in-flight states, and consistent empty-state cards. Nodes tab shows instance names instead of UUIDs.
- [Improvement] Tasks Tab Pagination and Inline Logs - Long-running clusters no longer hit a wall at task #15. Tasks tab loads in pages with inline log expansion under the selected task and a source filter (master / slave / cluster).
- [Improvement] Worker Node Role Label - Workers automatically receive the standard
node-role.kubernetes.io/workerlabel sokubectl get nodesshows the worker role instead of<none>. Re-applies idempotently and skips clusters whose labels are already stable. - [Improvement] User Scripts Sidebar Reorg - Moved User Scripts from the Security section to the Compute section in the sidebar, where it logically belongs alongside Instances and Images.
- [Improvement] Customer-Safe Idempotency for Kubernetes Writes - Every Kubernetes write endpoint accepts an idempotency key and replays the original response from a 24-hour cache instead of double-acting on retries. Standard
X-RateLimit-*headers on every response, 429 with a clear retry-after when exceeded, isolated per-user. - [Improvement] Backup Notification Email Collapse - Backup failures used to fan out one email per admin recipient, accumulating addresses across iterations and causing duplicate inbox copies. Notifications are now collapsed into a single message with admins on CC. (Carryover fix for v2.2.2's instance-backup notification path.)
- [Improvement] Settings Encryption Double-Encrypt Fix - First-time setting writes no longer end up ciphertext-of-ciphertext (which broke later reads). The mutator now runs exactly once on both create and update paths. (Carryover fix for v2.2.2's expanded settings surface.)
- [Improvement] Schedule Cron Expression Validation - An invalid expression saved into the settings table (corrupted ciphertext, hand-edited typo) used to take the entire scheduler down for that tick. The expression is now validated at boot, invalid values log a warning and fall back to the legacy frequency default. (Carryover fix for v2.2.2's backup schedule customization.)
- [Improvement] VNC Password Decoding Hardening - Per-instance VNC password accessor now returns an empty string instead of throwing when the stored ciphertext is unreadable, so a single corrupt row no longer breaks the entire instance page.
- [Improvement] Download Task Heartbeats - Image-download tasks now stay alive while the slave is still actively reporting progress, instead of being failed by the stuck-task janitor mid-transfer on slow links.
- [Improvement] Install Auto-Generates Slave Callback Signing Key - Fresh installs and updates auto-generate the slave callback signing key if one is not already configured, eliminating a manual
.envstep from the install runbook. - [Improvement] Hostname Allocator Atomic - The
h<N>instance-name allocator now uses an atomic increment, eliminating the race where two clusters created in parallel could pick the same VM name and one would fail at deploy. - [Fix] Stuck Cluster Destroy - Clusters whose state column became empty (partial-create wedge) can now be force-destroyed without the state machine refusing the transition.
- [Fix] Phantom Upgrade Waves - The CP rolling upgrade orchestrator no longer matches pre-existing target-version nodes as "this wave's new CP", which previously caused old CPs to be torn down before the actual new ones had joined.
- [Fix] Reconciler vs Orchestrator Race - The stuck-CP reconciler now skips clusters that still have an active orchestration task or operation lock in flight, preventing it from tearing down a healthy CP that simply took longer than 15 minutes to bootstrap.
- [Fix] CCM Rollout Target Name - Cluster bootstrap previously hung at the CCM install phase because the bundled in-cluster controller binary was watching a stale deployment name. Rebuilt binary published as part of this release. Existing deployments that hit
ccm_install fail rollout_timeouton cluster create should retry the create after upgrading. - [Fix] Cluster Index List Auto-Refresh - User cluster index now refreshes via the same WebSocket channel the show page uses, instead of relying on stale Inertia hydration timing.
- [Fix] NAT-Gateway-Owned IPs Skipped - Atomic IP allocation queries now skip IPs reserved by NAT gateways, preventing a fresh cluster from racing the NAT GW for the same address on a small subnet.
- [Fix] Slave Kubectl Authentication Header - Master-issued kubectl requests use the correct slave authentication header. Previously failed with 401 in some VPC topologies.
- [Fix] Worker Node Ready Watchdog - The workers-ready wait no longer hangs on nodes that fall into the Unknown state. Orchestrator now fails fast after a two-minute Unknown window so the customer sees a real error instead of a 30-minute hang.
- [Fix] Endpoint URL Clobbered on Finalize - The final cluster-create phase no longer overwrites the LB-derived endpoint with a broken primary-IP placeholder.
- [Fix] Cluster Delete IP Cleanup - Load balancer IP lookups during cluster delete now span all VPC subnets instead of only the CP subnet, so destroyed clusters fully release every IP they ever claimed.
- [Fix] VNC Port Allocation During HA Evacuation - Per-instance VNC port lookups during a mass evacuation no longer race when many instances target the same destination hypervisor.
- [Fix] Slave Version Cache TTL - Cached the slave version with a TTL that overlaps the hourly refresh cron, eliminating the "Unknown" state that appeared between refreshes.
- [Fix] Admin License Page Render Crash - Resolved a
Cannot read properties of undefined (reading 'call')from the webpack manifest caused by mixing new and legacy Inertia packages on a page resolved by the legacy app entry. - [Fix] Admin Dashboard Hypervisor Stats - Fixed a Select2 initialization race on heavy installs with many slaves that previously produced
( undefined ) - undefinedin the dropdown and anUncaught TypeErrorfrom a missing field. - [Fix] AI Middleware Truthiness - The AI middleware now accepts loose-truthy values (
1,"1",true,"true") for the AI-enabled setting, fixing a 403 on the admin AI conversations page when the setting was stored as a string.
Managed Kubernetes
Kubernetes is now a first-class resource type in the control panel. Users can create, scale, upgrade, and destroy production-grade clusters from the same UI they already use for instances and load balancers, while admins retain full operational visibility through a parallel admin-side view.
Cluster Lifecycle
The cluster create flow is a step-based wizard with progressive disclosure. Customers pick a region, name and slug, version, control-plane sizing, networking, and worker sizing. Required fields surface as asterisks. A cost summary tile shows recurring credit cost before submission.
Choose between a 1-node control plane (cheap, suitable for development) or a 3-node high-availability control plane (production-grade, etcd-quorum-safe) at creation time. HA mode is enforced at 3 nodes to keep etcd quorum math sound.
Every cluster gets a bundled HAProxy load balancer for the Kubernetes API, deployed automatically on cluster create. Customers can pick the LB plan from a dropdown so they can size it for the expected control-plane traffic. The LB exposes a private VPC IP by default, and optionally a public IP if the endpoint mode is set to public-and-private. The downloaded kubeconfig is automatically rewritten to point at the right hostname.
Control plane and workers can live on different VPC subnets, both auto-validated against the cluster's chosen region. The cluster show page in both admin and user panels shows live state badges, a hero tile with the Kubernetes logo, region, and version, and a tab-based navigation (Overview, Nodes, Pools, Security, SSL, Upgrades, Tasks).
Real-time progress streams via WebSocket. Every cluster bootstrap phase, every worker join, every upgrade wave, every node label, and every load-balancer change broadcasts to the show page as it happens. The Tasks tab paginates the full audit history. Selecting a task expands an inline log panel with source filtering (master / slave / cluster).
Retry Failed Cluster Create
When a cluster create errors out (most often because a quota or capacity precondition trips mid-bootstrap, or because a slave is temporarily unreachable), users no longer need to delete and recreate. A new Retry create button on the cluster page tears down any partially provisioned artifacts (control-plane VMs, the bundled load balancer, security groups, reserved IPs) and re-runs the bootstrap on the same row.
The cluster name, slug, and CA bundle are preserved. Any kubeconfig the user already downloaded mid-flight stays trusted across the retry. The button is visible only when the cluster is in the error state, sits between Redeploy CCM and Delete, and shows the previous failure's reason inline in the confirmation modal.
Worker Pools and Autoscaling
Every cluster has one default worker pool created automatically on cluster create. Customers can add additional pools with their own instance plan, labels, taints, autoscaling bounds, and drain settings. Useful for workloads that need GPU nodes, memory-optimized nodes, or isolated tenants in a single cluster.
Each pool exposes its own min/max bounds and current target count. Editing target count in the pool modal manually scales workers up or down. Toggling autoscaling on the pool hands control over to the cluster autoscaler.
The bundled cluster autoscaler binary speaks the Hypervisor API directly. Scaling is policy-driven on CPU and memory pressure of pending pods, per-pool aware, and respects each pool's min/max bounds. The autoscaler manifest is generated on demand from the cluster show page. Customers grab the YAML and apply it with kubectl apply -f -. The autoscaler refreshes its controller token on a rolling schedule via the same broker the in-cluster cloud-controller-manager uses, so long-lived clusters never need a manual token re-issue.
Per-pool labels and taints are applied to workers in that pool at join time and re-applied on every reconcile, so a node accidentally relabelled via kubectl heals back to the platform spec. When a worker is replaced during a surge upgrade or capacity churn, its original creation time follows it onto the replacement node, so FIFO scale-down policies pick genuinely older workloads instead of the latest replacement.
Rolling Upgrades
The Upgrade Workers card on the Workers tab takes a target Kubernetes version, max surge, and drain grace period. The orchestrator provisions a new worker, waits for Ready, cordons and drains the oldest in-version worker, and repeats until every worker is at the target version. Wave count, current wave, and per-wave progress stream live to the UI.
The Upgrade Control Plane card on the same tab uses a fixed surge=1 strategy. New CPs join the etcd quorum, get verified as voting members, then old CPs are gracefully removed. Worker drain is deliberately skipped for control-plane nodes: they run static pods that exit cleanly when the underlying VM is destroyed.
Workers can be upgraded only up to the cluster's current control-plane version. Cross-CP upgrades surface a clear error pointing the user at the CP-upgrade card. Single-minor jumps only, major-version jumps are blocked at the form level.
The cluster card shows the control-plane version and worker baseline as two distinct lines, plus a "mid-upgrade" badge when they diverge during a staged upgrade. No more confusion about which half of the cluster is on which version.
Networking and Security
Every cluster gets three auto-managed security groups: an LB-only group, a CP-only group, and a worker-only group. Default rules expose the Kubernetes API on :443 via the LB and lock down direct access to the control-plane nodes' :6443 from outside the cluster. Admins and users layer additional rules through a familiar Inbound / Outbound sub-tabbed interface, with cluster-managed rules visibly separated from user-added ones.
Each cluster's bundled LB exposes a per-port SSL tab where the user can attach cluster-managed Let's Encrypt or custom certificates without touching the LB directly.
Worker instances and the CP load balancer are flagged as cluster-managed. They appear in the user's Compute and Load Balancers lists with a clear "Cluster-managed" badge and a read-only banner. Direct power cycle, network detach, plan change, or LB rule edit on cluster-managed resources is blocked at the controller. Manage them through the cluster page instead.
Control-plane VMs do not appear in the user's Compute list, billing reports, monitoring tiles, or cluster Nodes tab. A restricted kubeconfig issued at cluster create exposes only worker nodes to kubectl get nodes. The full admin kubeconfig is never served to end users.
Clusters provision into the customer's existing VPC and subnets. No NAT layer to manage, cluster traffic uses the same private routing already in place for the customer's instances. A NAT Gateway is required in the chosen VPC because control-plane bootstrap needs egress for image pulls, and the cluster create form refuses up front if one is missing.
Service:LoadBalancer
Services of type LoadBalancer provision and tear down a real Hypervisor load balancer per service through the in-cluster controller. Service annotations control listener port, backend mode (TCP / HTTP / per-port hybrid), session stickiness, multi-cert SNI, routing rules, and traffic split between subset endpoints.
LBs created this way show up in the user's regular Load Balancers list under their cluster column with a type filter to separate them from manually created LBs. Editing the underlying LB directly is blocked: service annotations are the source of truth, and an out-of-band change would be reverted on the next reconcile.
Audit, Rate Limiting, and Team Permissions
Every Kubernetes write endpoint accepts an idempotency key, and customers may safely retry a POST or PATCH and the platform replays the original response from the 24-hour cache instead of double-charging or double-acting.
Per-user rate limits on Kubernetes mutations are broken out by operation class (apply labels, delete node, upgrade, redeploy CCM). Standard X-RateLimit-* headers on every response, 429 with a clear retry-after when exceeded, isolated per-user so noisy tenants do not drown out quiet ones.
Every cluster action (create, delete, upgrade, scale, label, suspend) logs to the user dashboard activity feed with friendly action names.
Team / subuser permissions now cover Kubernetes end to end. A new kubernetes.* permission family with three tiers (view, manage, delete) is granted through the existing team-member invitation flow. Predefined roles get sensible defaults from the migration. Custom roles need to be granted the new permissions explicitly if they should be able to manage clusters. Subusers with view-only permissions still get the same restricted kubeconfig as the cluster owner: no privilege escalation through downloaded credentials.
Admin Destructive Controls
A dedicated admin section gives operators safe escape hatches for clusters that have gone wrong without forcing a full destroy.
Suspend locks out the customer while preserving the cluster row for forensics. User-side mutation routes refuse with a clear "suspended by admin" message, admin-side mutation still works.
Reset State clears the cluster's stuck-operation flag when a master-side orchestrator crash leaves a cluster pinned in starting or destroying forever.
Force Cleanup is a nuclear destroy that ignores the normal phase chain. Used when the cluster has zombie resources on the slave that the regular destroy cannot reach.
Destructive endpoints cap at 5/hour per admin, recovery endpoints (reset state) cap at 20/hour, so operators chasing a flapping cluster do not lock themselves out.
Reconciliation Safety Nets
Background reconcilers catch the edge cases that the synchronous orchestrator misses:
- Stuck-cluster reconciler runs every five minutes and promotes worker nodes whose
kubectl get nodesreports Ready but whose internal state never flipped (usually because a bootstrap callback was dropped mid-flight). - Stuck control-plane reconciler does the same for control-plane VMs after a longer threshold, with a guard that refuses to act on clusters that still have an active orchestration task in flight, so a long-running create with a slow first CP does not get torn down from under the orchestrator.
- Worker role label stamper ensures workers carry the standard
node-role.kubernetes.io/workerlabel sokubectl get nodesshows the worker role instead of<none>. Re-applies idempotently and skips clusters whose labels are already stable, so the cost across hundreds of clusters in steady state is effectively zero. - Pending-deletion processor handles cordon then drain then delete in batches without blocking the rest of the cluster from operating.
Master Backup Pipeline Redesign
The master self-backup system has been rebuilt from scratch around a service-oriented orchestrator with pluggable storage drivers, a singleton lock that survives long uploads, and a real scheduler-health surface.
Storage Drivers
Four drivers ship in v2.2.3:
- Local writes snapshots straight to a local path. Best for staging or single-host installs.
- S3-compatible streams multipart uploads through the platform's filesystem layer. Works with RustFS, AWS S3, Cloudflare R2, Backblaze B2, and any MinIO-compatible endpoint.
- Rsync pushes to a remote SSH host with proper port and bandwidth-limit flag handling. Ideal for off-site backup-server topologies.
- NFS writes to a mounted NFS export.
Multiple destinations are supported per install, each configured as its own entry on the new Backup Destinations page.
Operational Pipeline
The snapshot builder dumps the database in a single transaction (no inconsistency between tables), tars the relevant config and storage paths, and emits an immutable result manifest with byte counts, file counts, duration, and SHA-256 checksum.
A singleton lock with a 2-hour lifetime prevents two backups from running simultaneously while still surviving long uploads on slow links.
Grandfather-Father-Son retention keeps daily, weekly, and monthly buckets pruned on configurable thresholds. The retention calculator runs after every successful upload.
Email and webhook notifications fire on partial or failed runs. Email content is collapsed into a single CC'd message per failure event so admin inboxes do not fill up with duplicates (a regression from v2.2.2's instance-backup notification path, now fixed for both surfaces).
The schedule cron expression is configurable (defaults to daily at 02:00, the legacy frequency setting still works as a fallback). Invalid expressions saved into the settings table no longer take the scheduler down for that tick.
Admin UI
The new admin pages live under their own section:
- Destinations lists every configured destination with a driver-conditional add/edit modal so unused fields stay hidden.
- Runs shows backup history with a slide-in drawer per run that exposes per-step timings, transferred bytes, and any error messages.
- Settings consolidates schedule expression, retention windows, and notification recipients in one place.
- Scheduler Health surfaces backup runs alongside every other scheduled task on the platform.
Scheduled Task Health Tracking
Every Laravel-managed scheduled task (built-in, custom, vendor) is now observed via a unified health surface.
Per-task tracking of last run time, duration, exit code, and consecutive failures. Visible at a glance from the admin Scheduler page with a compact table layout, one row per task, a slide-in drawer for full run history, and stat tiles at the top for fleet-wide visibility.
Friendly task names map the raw command strings to human-readable labels so the table reads like a status board, not a stack trace. A daily prune keeps the audit table compact while preserving enough history to spot recurring failures.
A dashboard tile on the admin home shows the count of healthy / degraded / failed scheduled tasks at the top of every page.
System Health Dashboard Widget
The admin dashboard now has a single, compact System Health strip showing four critical metrics at a glance:
- Most recent successful master backup (and how long ago)
- Scheduled-task health rollup
- In-flight long-running tasks across the platform
- Queue worker failed-job count
The two earlier separate tiles (backup status and cron status) are removed in favour of this consolidated widget. Click any metric to navigate to its detail page.
Application Timezone Setting
A new Application Timezone setting lives under Admin > System > Settings > General. Pick any IANA timezone identifier (Asia/Kolkata, America/New_York, Europe/Berlin, and so on).
Applied app-wide on boot: Carbon, model date casts, scheduler firing times, and direct PHP date functions all honour the configured zone.
Customers signing up through self-registration inherit the configured timezone. Billing-API user creations honour it too unless the integrating system explicitly supplies one. Existing users keep their own timezone selection so the change is non-disruptive.
Invalid identifiers (typo, deprecated zone) are rejected at boot with a clear warning in the logs, and the platform falls back to the safe .env default rather than refusing to start.
Performance at Fleet Scale
A round of profiling against simulated fleets of 1000 instances, 100 hypervisors, and 100 Kubernetes clusters surfaced several hot paths in the 30-second and 2-minute cron loops. The biggest wins:
The hypervisor instance statistics collector (runs every 30 seconds) now bulk-loads instances and bandwidth rows per hypervisor instead of one query per VM and per interface. At 100 hypervisors with 1000 instances total, query count drops from roughly 5000 per tick to about 200.
Hypervisor health checks are shared between the HA monitor and the VPC NAT-gateway monitor through an in-process + Redis cache with a 25-second TTL. 100 NAT gateways concentrated on 10 hypervisors fires 10 probes per tick worst case instead of 100, and zero in steady state.
The HA monitor evacuation flow memoizes target-hypervisor selection results per resource spec for the duration of a single failure. 100 instances with the same instance plan now share a single resource-availability lookup instead of recomputing the same answer 100 times.
The self-provisioning bandwidth monitor eager-loads pack users with their pack instances and bandwidth state in a single query batch per chunk, dropping the per-user query count by an order of magnitude.
The S3 bandwidth collector uses a single sum by (bucket) PromQL query per server instead of three queries per bucket. At 10 servers with 100 buckets each, the request count drops from about 30,000 calls per five-minute cycle to about 20.
The Cloud Service billing run executes its eleven billing services in parallel (capped at six concurrent processes) with per-service database reconnection on fork. Wall-clock time per tick collapses from "sum of all service runtimes" to "max of the slowest one", comfortably under the 2-minute cron window even with hundreds of NAT gateways and load balancers.
Targeted database indexes on instance-name lookups, bandwidth-by-date scans, and NAT-gateway-by-hypervisor queries eliminate full-table scans in the affected hot paths.
The combined effect at the simulated 100-hypervisor / 1000-instance scale is roughly a 70% reduction in peak database and slave-RPC load during the 30-second window.
Upgrade Notes
- Refresh dependencies and rebuild frontend bundles. The new Kubernetes cluster pages, node pool modals, master backup admin pages, and System Health widget are all Vue components. The bundles must be rebuilt before any of the new UI surfaces appear.
- Run database migrations.
- Restart queue workers so the new event broadcasters, console commands, and queues are picked up.
- Restart the WebSocket broadcaster so the new Kubernetes channels and master-backup channels are registered.
- Slave deployment: each hypervisor needs the matching slave update with the new Kubernetes per-cluster proxy, the new cluster bootstrap agent, and the volume snapshot/backup engine bridges. Trigger a slave update from the admin panel's hypervisor manage page. The asynchronous update flow introduced in v2.2.2 makes this safe to run live.
- Kubernetes setup:
- Mark at least one hypervisor group as Kubernetes-enabled.
- Register at least one Kubernetes supported version, with the matching control-plane and worker images uploaded.
- Optionally register the cluster-autoscaler image so customers can grab the autoscaler manifest from the cluster show page.
- Optionally configure the in-cluster cloud-controller-manager image, the platform falls back to a sensible default if none is configured.
- Application Timezone: review the new General-tab Application Timezone setting and pick the zone you want new self-registered and billing-API users to inherit. Existing users keep their own timezone.
- Master Backup: register at least one Backup Destination on the new admin page if you want master self-backups. Configure schedule and retention through the Settings page in the same section.
- Team permissions: existing predefined roles get sensible Kubernetes-permission defaults from the migration. Custom team roles need to be reviewed and granted the new permissions explicitly if they should be able to manage clusters.
- First-time Kubernetes cluster creates require a VPC with an active NAT Gateway in the chosen region, since control-plane bootstrap needs egress for image pulls. Customers without one are blocked at the create form with a clear remediation message.