High Availability

The High Availability (HA) feature provides automated monitoring, failure detection, and recovery for your hypervisor infrastructure. When a hypervisor fails, the system automatically evacuates instances to healthy hosts, ensuring maximum uptime for your virtual machines.

Overview

The HA system continuously monitors hypervisor health and takes automated actions when failures are detected:

Health Monitoring: Every 30 seconds, the system checks all HA-enabled hypervisors using parallel processing
Failure Detection: Combines ping tests and API validation to detect unhealthy hosts
IPMI Fencing: Automatically powers off failed hypervisors to prevent split-brain scenarios
Instance Evacuation: Migrates instances from failed hypervisors to healthy alternatives
Event Logging: Maintains comprehensive audit trail of all HA actions

Architecture

Monitoring Process

The HA monitor runs as a scheduled task (ha:monitor) every 30 seconds with the following workflow:

┌─────────────────────────────────────────────────────────────┐
│  HA Monitor (Every 30 seconds)                              │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
        ┌───────────────────────────────────────┐
        │  Load HA-enabled hypervisors          │
        │  (HypervisorGroup or CsLocation HA)   │
        └───────────────────────────────────────┘
                            │
                            ▼
        ┌───────────────────────────────────────┐
        │  Parallel Health Checks (10 workers)  │
        │  - Ping test                          │
        │  - API validation (/health endpoint)  │
        └───────────────────────────────────────┘
                            │
                ┌───────────┴───────────┐
                ▼                       ▼
        ┌───────────┐           ┌─────────────┐
        │  Healthy  │           │   Failed    │
        └───────────┘           └─────────────┘
                                        │
                    ┌───────────────────┼───────────────────┐
                    ▼                   ▼                   ▼
            ┌──────────────┐    ┌─────────────┐    ┌─────────────┐
            │ Increment    │    │ IPMI Fence  │    │  Evacuate   │
            │ Failures     │    │ (Power Off) │    │  Instances  │
            └──────────────┘    └─────────────┘    └─────────────┘
                    │                   │                   │
                    └───────────────────┴───────────────────┘
                                        │
                                        ▼
                                ┌──────────────┐
                                │  Log Event   │
                                └──────────────┘

Prerequisites

1. IPMI Configuration

Each hypervisor must have IPMI (Intelligent Platform Management Interface) configured for remote power management:

IPMI enabled in BIOS/UEFI
Network connectivity to IPMI interface from master server
Valid credentials (username/password) for IPMI access
ipmitool installed on master server:

# Debian/Ubuntu
apt-get install ipmitool

# CentOS/RHEL
yum install ipmitool

2. Network Requirements

Master server must be able to reach all hypervisor IPMI interfaces
Hypervisors must be able to reach each other for migrations
Reliable network connection between master and all hypervisors

3. Storage Configuration

For automatic evacuation to work effectively:

Shared storage (Ceph/RBD datastores) recommended for instant migrations
Local storage instances will require full disk transfer during evacuation
Sufficient storage capacity on destination hypervisors

Configuration

Enable HA at Location Level

HA can be enabled for all hypervisors in a Cloud Service location:

Admin Navigation menu Navigate to Cloud Service → Locations
Edit the desired location
Configure HA settings:
- HA Enabled: Toggle to enable HA for this location
- Failure Threshold: Number of consecutive failures before taking action (default: 3)
- Evacuation Enabled: Enable automatic instance evacuation on failure
Click Save

All hypervisors assigned to this location will be monitored for HA.

Enable HA at Hypervisor Group Level

HA can also be enabled for specific hypervisor groups:

Navigate to Compute → Hypervisor Groups
Edit the desired group
Configure HA settings:
- HA Enabled: Toggle to enable HA for this group
- Failure Threshold: Number of consecutive failures before taking action (default: 3)
- Evacuation Enabled: Enable automatic instance evacuation on failure
Click Save

All hypervisors in this group will be monitored for HA.

Configure IPMI Credentials

Each hypervisor requires IPMI credentials for fencing:

Navigate to Admin Panel → Hypervisors
Edit the hypervisor
Scroll to IPMI Configuration section:
- IPMI Host: IP address or hostname of IPMI interface
- IPMI Port: Port number (default: 623)
- IPMI Username: IPMI username
- IPMI Password: IPMI password
Click Save

Testing IPMI

Test IPMI connectivity from the master server:

ipmitool -I lanplus -H <ipmi-host> -U <username> -P <password> power status

Configure Per-Instance HA

Instances can opt-in/opt-out of HA evacuation:

Navigate to instance details page
Edit instance settings
Toggle HA Enabled field
Configure Max Restart Attempts: Number of restart attempts before evacuation (default: 3)

When a hypervisor fails:

HA Enabled = Yes: Instance will be evacuated to another hypervisor
HA Enabled = No: Instance will remain on the failed hypervisor (manual intervention required)

Monitoring and Events

Event Types

The HA system logs four types of events:

Event Type	Description	Trigger
`hypervisor_up`	Hypervisor recovered from failure	Health checks pass after being down
`hypervisor_down`	Hypervisor marked as failed	Consecutive failures reached threshold
`hypervisor_fenced`	IPMI power-off executed	Before evacuation begins
`instance_evacuated`	Instance migrated to new host	Evacuation completed successfully

Viewing Events

Access the HA event log:

Navigate to System → HA Events

Each event includes:

Timestamp
Event type
Affected hypervisor/instance
Additional metadata (destination hypervisor, failure count, etc.)

Best Practices

1. Set Appropriate Thresholds

Failure Threshold:
- Too low (1-2): May trigger false positives during network blips
- Too high (5+): Delays response to genuine failures
- Recommended: 3-4 consecutive failures (90-120 seconds)

2. Test IPMI Regularly

Verify IPMI credentials are current
Test power operations periodically
Monitor IPMI network connectivity
Keep IPMI firmware updated

3. Plan Evacuation Capacity

Ensure sufficient resources on remaining hypervisors for evacuated instances
Consider N+1 or N+2 redundancy when sizing infrastructure
Monitor resource utilization to avoid capacity issues during evacuation

4. Use Shared Storage When Possible

Ceph/RBD datastores: Near-instant evacuation (no disk transfer)
Local storage: Full disk transfer required (slower evacuation)
Mixed environments supported with automatic detection

5. Monitor Event Log

Review HA events regularly
Investigate frequent failures (may indicate hardware/network issues)
Alert on fencing events for immediate attention

Overview​

Architecture​

Monitoring Process​

Prerequisites​

1. IPMI Configuration​

2. Network Requirements​

3. Storage Configuration​

Configuration​

Enable HA at Location Level​

Enable HA at Hypervisor Group Level​

Configure IPMI Credentials​

Configure Per-Instance HA​

Monitoring and Events​

Event Types​

Viewing Events​

Best Practices​

1. Set Appropriate Thresholds​

2. Test IPMI Regularly​

3. Plan Evacuation Capacity​

4. Use Shared Storage When Possible​

5. Monitor Event Log​