High Availability
The High Availability (HA) feature provides automated monitoring, failure detection, and recovery for your hypervisor infrastructure. When a hypervisor fails, the system automatically evacuates its instances to healthy hosts to minimize downtime for your virtual machines.
Overview
The HA system continuously monitors hypervisor health and takes automated actions when failures are detected:
- Health Monitoring: Every 30 seconds, the system checks all HA-enabled hypervisors in parallel
- Failure Detection: Combines ping tests and API validation to detect unhealthy hosts
- IPMI Fencing: Automatically powers off failed hypervisors to prevent split-brain scenarios
- Instance Evacuation: Migrates instances from failed hypervisors to healthy alternatives
- Event Logging: Maintains comprehensive audit trail of all HA actions
Architecture
Monitoring Process
The HA monitor runs as a scheduled task (ha:monitor) every 30 seconds with the following workflow:
┌─────────────────────────────────────────────────────────────┐
│                HA Monitor (Every 30 seconds)                │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
           ┌───────────────────────────────────────┐
           │ Load HA-enabled hypervisors           │
           │ (HypervisorGroup or CsLocation HA)    │
           └───────────────────────────────────────┘
                               │
                               ▼
           ┌───────────────────────────────────────┐
           │ Parallel Health Checks (10 workers)   │
           │ - Ping test                           │
           │ - API validation (/health endpoint)   │
           └───────────────────────────────────────┘
                               │
                   ┌───────────┴───────────┐
                   ▼                       ▼
             ┌───────────┐          ┌─────────────┐
             │  Healthy  │          │   Failed    │
             └───────────┘          └─────────────┘
                                           │
                       ┌───────────────────┼───────────────────┐
                       ▼                   ▼                   ▼
                ┌──────────────┐    ┌─────────────┐     ┌─────────────┐
                │ Increment    │    │ IPMI Fence  │     │ Evacuate    │
                │ Failures     │    │ (Power Off) │     │ Instances   │
                └──────────────┘    └─────────────┘     └─────────────┘
                       │                   │                   │
                       └───────────────────┴───────────────────┘
                                           │
                                           ▼
                                    ┌──────────────┐
                                    │  Log Event   │
                                    └──────────────┘
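The parallel health-check stage of this workflow can be sketched in Python as follows. This is an illustration only, not the product's implementation: the function names and the `Probe` type are hypothetical, the probes are injected as callables so that no real ping or HTTP request is made, and the rule that a host is healthy only when both probes pass is an assumption about how the two checks are combined.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

# Hypothetical signature: a probe takes a hypervisor address and
# returns True when that check passes.
Probe = Callable[[str], bool]


def is_healthy(host: str, ping: Probe, api_check: Probe) -> bool:
    """Assumption: a host is healthy only if both ping and API probes pass."""
    return ping(host) and api_check(host)


def run_health_checks(hosts: Iterable[str], ping: Probe, api_check: Probe,
                      workers: int = 10) -> Dict[str, bool]:
    """Check all hosts using a pool of worker threads; return {host: healthy}."""
    host_list = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with host_list.
        results = pool.map(lambda h: is_healthy(h, ping, api_check), host_list)
        return dict(zip(host_list, results))


if __name__ == "__main__":
    # Fake probes standing in for the real ping test and /health request.
    unreachable = {"hv2"}
    status = run_health_checks(
        ["hv1", "hv2", "hv3"],
        ping=lambda h: h not in unreachable,
        api_check=lambda h: True,
    )
    print(status)
```

Running the checks in a bounded pool keeps one slow or unreachable hypervisor from delaying the checks for all the others.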
Prerequisites
1. IPMI Configuration
Each hypervisor must have IPMI (Intelligent Platform Management Interface) configured for remote power management:
- IPMI enabled in BIOS/UEFI
- Network connectivity to IPMI interface from master server
- Valid credentials (username/password) for IPMI access
- ipmitool installed on master server:
# Debian/Ubuntu
apt-get install ipmitool
# CentOS/RHEL
yum install ipmitool
2. Network Requirements
- Master server must be able to reach all hypervisor IPMI interfaces
- Hypervisors must be able to reach each other for migrations
- Reliable network connection between master and all hypervisors
3. Storage Configuration
For automatic evacuation to work effectively:
- Shared storage (Ceph/RBD datastores) recommended for instant migrations
- Local storage instances will require full disk transfer during evacuation
- Sufficient storage capacity on destination hypervisors
Configuration
Enable HA at Location Level
HA can be enabled for all hypervisors in a Cloud Service location:
- In the admin navigation menu, navigate to Cloud Service → Locations
- Edit the desired location
- Configure HA settings:
- HA Enabled: Toggle to enable HA for this location
- Failure Threshold: Number of consecutive failures before taking action (default: 3)
- Evacuation Enabled: Enable automatic instance evacuation on failure
- Click Save
All hypervisors assigned to this location will be monitored for HA.
Enable HA at Hypervisor Group Level
HA can also be enabled for specific hypervisor groups:
- Navigate to Compute → Hypervisor Groups
- Edit the desired group
- Configure HA settings:
- HA Enabled: Toggle to enable HA for this group
- Failure Threshold: Number of consecutive failures before taking action (default: 3)
- Evacuation Enabled: Enable automatic instance evacuation on failure
- Click Save
All hypervisors in this group will be monitored for HA.
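The Failure Threshold setting, used at both the location and the group level, counts consecutive failed health checks. A minimal sketch of that counting logic is shown here. The class and method names are hypothetical, and resetting the counter on the first successful check is an assumption based on the word "consecutive".

```python
from dataclasses import dataclass


@dataclass
class FailureTracker:
    """Tracks consecutive failed health checks for one hypervisor."""
    threshold: int = 3              # matches the documented default
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True once the threshold is reached."""
        if healthy:
            # Assumption: any successful check resets the streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


if __name__ == "__main__":
    tracker = FailureTracker(threshold=3)
    for result in (False, False, True, False, False, False):
        print(result, tracker.record(result))
```

In this sketch a single passing check between failures restarts the count, so brief network blips do not accumulate toward the threshold.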
Configure IPMI Credentials
Each hypervisor requires IPMI credentials for fencing:
- Navigate to Admin Panel → Hypervisors
- Edit the hypervisor
- Scroll to IPMI Configuration section:
- IPMI Host: IP address or hostname of IPMI interface
- IPMI Port: Port number (default: 623)
- IPMI Username: IPMI username
- IPMI Password: IPMI password
- Click Save
Test IPMI connectivity from the master server:
ipmitool -I lanplus -H <ipmi-host> -U <username> -P <password> power status
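If you want to script this connectivity test across many hypervisors, a small wrapper around `ipmitool` is one option. The sketch below is a suggestion rather than part of the product; the function names are hypothetical. It uses the `-E` option, which makes `ipmitool` read the password from the `IPMI_PASSWORD` environment variable instead of the command line, and `-p` to pass the IPMI port.

```python
import os
import subprocess
from typing import List

# Power actions accepted by this sketch (a subset of ipmitool's power commands).
ALLOWED_ACTIONS = {"status", "on", "off", "cycle", "reset", "soft"}


def build_ipmi_command(host: str, username: str, action: str = "status",
                       port: int = 623) -> List[str]:
    """Build the ipmitool argument list for a power action.

    -E tells ipmitool to read the password from IPMI_PASSWORD, which
    keeps it out of the process list.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return ["ipmitool", "-I", "lanplus", "-H", host, "-p", str(port),
            "-U", username, "-E", "power", action]


def power_status(host: str, username: str, password: str,
                 port: int = 623) -> str:
    """Run `power status` and return ipmitool's output; raises on failure."""
    env = dict(os.environ, IPMI_PASSWORD=password)
    result = subprocess.run(
        build_ipmi_command(host, username, "status", port),
        env=env, capture_output=True, text=True, timeout=15, check=True,
    )
    return result.stdout.strip()
```

Calling `power_status(...)` requires `ipmitool` on the master server and a reachable IPMI interface; `build_ipmi_command(...)` can be inspected without either.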
Configure Per-Instance HA
Instances can opt in to or out of HA evacuation:
- Navigate to instance details page
- Edit instance settings
- Toggle HA Enabled field
- Configure Max Restart Attempts: Number of restart attempts before evacuation (default: 3)
When a hypervisor fails:
- HA Enabled = Yes: Instance will be evacuated to another hypervisor
- HA Enabled = No: Instance will remain on the failed hypervisor (manual intervention required)
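The per-instance rule above amounts to partitioning the instances on a failed hypervisor by their HA flag. A minimal sketch, with a hypothetical `Instance` type and function name:

```python
from typing import Iterable, List, NamedTuple, Tuple


class Instance(NamedTuple):
    """Hypothetical, minimal view of an instance for this example."""
    name: str
    ha_enabled: bool


def split_for_evacuation(
    instances: Iterable[Instance],
) -> Tuple[List[Instance], List[Instance]]:
    """Return (to_evacuate, left_in_place) for a failed hypervisor."""
    to_evacuate: List[Instance] = []
    left_in_place: List[Instance] = []
    for inst in instances:
        if inst.ha_enabled:
            to_evacuate.append(inst)
        else:
            # These stay on the failed host and need manual intervention.
            left_in_place.append(inst)
    return to_evacuate, left_in_place


if __name__ == "__main__":
    move, stay = split_for_evacuation(
        [Instance("web-1", True), Instance("batch-7", False)]
    )
    print([i.name for i in move], [i.name for i in stay])
```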
Monitoring and Events
Event Types
The HA system logs four types of events:
| Event Type | Description | Trigger |
|---|---|---|
| hypervisor_up | Hypervisor recovered from failure | Health checks pass after being down |
| hypervisor_down | Hypervisor marked as failed | Consecutive failures reached threshold |
| hypervisor_fenced | IPMI power-off executed | Before evacuation begins |
| instance_evacuated | Instance migrated to new host | Evacuation completed successfully |
Viewing Events
Access the HA event log:
- Navigate to System → HA Events
Each event includes:
- Timestamp
- Event type
- Affected hypervisor/instance
- Additional metadata (destination hypervisor, failure count, etc.)
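When exporting or post-processing HA events, it can help to model them explicitly. The sketch below mirrors the four event types and the fields listed above. The class names, field names, and metadata keys are hypothetical and do not describe the product's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict


class HAEventType(str, Enum):
    """The four documented HA event types."""
    HYPERVISOR_UP = "hypervisor_up"
    HYPERVISOR_DOWN = "hypervisor_down"
    HYPERVISOR_FENCED = "hypervisor_fenced"
    INSTANCE_EVACUATED = "instance_evacuated"


@dataclass(frozen=True)
class HAEvent:
    """Hypothetical record shape: type, affected object, metadata, timestamp."""
    event_type: HAEventType
    subject: str                                   # hypervisor or instance
    metadata: Dict[str, Any] = field(default_factory=dict)
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


if __name__ == "__main__":
    event = HAEvent(
        HAEventType.INSTANCE_EVACUATED,
        "vm-101",
        {"destination_hypervisor": "hv2"},  # illustrative metadata key
    )
    print(event.event_type.value, event.subject, event.metadata)
```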
Best Practices
1. Set Appropriate Thresholds
- Failure Threshold:
- Too low (1-2): May trigger false positives during network blips
- Too high (5+): Delays response to genuine failures
- Recommended: 3-4 consecutive failures (90-120 seconds at the 30-second check interval)
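The time figures follow from multiplying the threshold by the 30-second monitoring interval. A small helper (hypothetical name) makes the trade-off easy to tabulate:

```python
def detection_time_seconds(failure_threshold: int,
                           check_interval: int = 30) -> int:
    """Approximate time until action: threshold x monitoring interval."""
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    return failure_threshold * check_interval


if __name__ == "__main__":
    for threshold in (1, 3, 4, 5):
        print(threshold, detection_time_seconds(threshold))
    # prints:
    # 1 30
    # 3 90
    # 4 120
    # 5 150
```

Treat the result as an estimate: the real delay also depends on when the failure occurs relative to the check cycle and on how long each failed check takes to time out.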
2. Test IPMI Regularly
- Verify IPMI credentials are current
- Test power operations periodically
- Monitor IPMI network connectivity
- Keep IPMI firmware updated
3. Plan Evacuation Capacity
- Ensure sufficient resources on remaining hypervisors for evacuated instances
- Consider N+1 or N+2 redundancy when sizing infrastructure
- Monitor resource utilization to avoid capacity issues during evacuation
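A rough N+1 check is to ask whether the load of any single hypervisor fits into the spare capacity of the others. The sketch below does this for one resource (memory in GB). The function name and input shape are hypothetical, and the check is an aggregate one: it ignores per-instance placement, so a passing result is necessary but not sufficient.

```python
from typing import Dict


def survives_single_failure(used_gb: Dict[str, int],
                            capacity_gb: Dict[str, int]) -> bool:
    """True if each host's used memory fits in the others' combined spare."""
    for failed_host, load in used_gb.items():
        spare = sum(
            capacity_gb[host] - used_gb[host]
            for host in used_gb
            if host != failed_host
        )
        if load > spare:
            return False
    return True


if __name__ == "__main__":
    capacity = {"hv1": 64, "hv2": 64, "hv3": 64}
    print(survives_single_failure({"hv1": 40, "hv2": 40, "hv3": 40}, capacity))
    print(survives_single_failure({"hv1": 60, "hv2": 60, "hv3": 60}, capacity))
```

The first call has 48 GB spare for a 40 GB failure and passes; the second has only 8 GB spare for a 60 GB failure and does not.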
4. Use Shared Storage When Possible
- Ceph/RBD datastores: Near-instant evacuation (no disk transfer)
- Local storage: Full disk transfer required (slower evacuation)
- Mixed environments are supported; the storage type is detected automatically
5. Monitor Event Log
- Review HA events regularly
- Investigate frequent failures (may indicate hardware/network issues)
- Alert on fencing events for immediate attention
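These review steps can be partly automated once events are exported from the log. The sketch below assumes a hypothetical export format of `(event_type, hypervisor)` pairs, and the `min_count` cutoff of 3 is an arbitrary example value. It flags hypervisors that go down repeatedly and pulls out fencing events for alerting.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical export format: (event_type, hypervisor) pairs.
Event = Tuple[str, str]


def frequent_failures(events: Iterable[Event],
                      min_count: int = 3) -> Dict[str, int]:
    """Hypervisors with at least min_count hypervisor_down events."""
    counts = Counter(hv for etype, hv in events if etype == "hypervisor_down")
    return {hv: n for hv, n in counts.items() if n >= min_count}


def fencing_events(events: Iterable[Event]) -> List[Event]:
    """All hypervisor_fenced events, in their original order."""
    return [e for e in events if e[0] == "hypervisor_fenced"]


if __name__ == "__main__":
    log = [
        ("hypervisor_down", "hv1"), ("hypervisor_up", "hv1"),
        ("hypervisor_down", "hv1"), ("hypervisor_fenced", "hv1"),
        ("hypervisor_down", "hv1"), ("hypervisor_down", "hv2"),
    ]
    print(frequent_failures(log))
    print(fencing_events(log))
```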