High Availability

The High Availability (HA) feature provides automated monitoring, failure detection, and recovery for your hypervisor infrastructure. When a hypervisor fails, the system automatically evacuates its instances to healthy hosts, which keeps downtime for your virtual machines to a minimum.

Overview

The HA system continuously monitors hypervisor health and takes automated actions when failures are detected:

  1. Health Monitoring: Every 30 seconds, the system checks all HA-enabled hypervisors using parallel processing
  2. Failure Detection: Combines ping tests and API validation to detect unhealthy hosts
  3. IPMI Fencing: Automatically powers off failed hypervisors to prevent split-brain scenarios
  4. Instance Evacuation: Migrates instances from failed hypervisors to healthy alternatives
  5. Event Logging: Maintains a comprehensive audit trail of all HA actions
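
The order of these actions can be sketched as follows. This is a minimal illustration, not the product's implementation: `handle_failed_host` and the `fence`, `evacuate`, and `log` callables are hypothetical names, and skipping evacuation when fencing cannot be confirmed is an assumption made here to reflect the split-brain concern.

```python
from typing import Callable, List


def handle_failed_host(
    host: str,
    fence: Callable[[str], bool],           # hypothetical: returns True if power-off succeeded
    evacuate: Callable[[str], List[str]],   # hypothetical: returns names of moved instances
    log: Callable[[str, str], None],        # hypothetical: log(event_type, subject)
    evacuation_enabled: bool = True,
) -> List[str]:
    """Fence a failed host, then evacuate it, logging each step."""
    log("hypervisor_down", host)

    # Assumption: never evacuate a host that could still be running,
    # otherwise the same instance might run in two places.
    if not fence(host):
        return []
    log("hypervisor_fenced", host)

    if not evacuation_enabled:
        return []

    moved = evacuate(host)
    for instance in moved:
        log("instance_evacuated", instance)
    return moved
```

The event names passed to `log` are the ones listed under Event Types; everything else in the sketch is illustrative.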

Architecture

Monitoring Process

The HA monitor runs as a scheduled task (ha:monitor) every 30 seconds with the following workflow:

┌───────────────────────────────────────┐
│     HA Monitor (Every 30 seconds)     │
└───────────────────┬───────────────────┘
                    │
                    ▼
┌───────────────────────────────────────┐
│      Load HA-enabled hypervisors      │
│  (HypervisorGroup or CsLocation HA)   │
└───────────────────┬───────────────────┘
                    │
                    ▼
┌───────────────────────────────────────┐
│  Parallel Health Checks (10 workers)  │
│  - Ping test                          │
│  - API validation (/health endpoint)  │
└───────────────────┬───────────────────┘
                    │
        ┌───────────┴───────────┐
        ▼                       ▼
  ┌───────────┐          ┌─────────────┐
  │  Healthy  │          │   Failed    │
  └───────────┘          └──────┬──────┘
                                │
            ┌───────────────────┼───────────────────┐
            ▼                   ▼                   ▼
     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
     │  Increment  │     │ IPMI Fence  │     │  Evacuate   │
     │  Failures   │     │ (Power Off) │     │  Instances  │
     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
            │                   │                   │
            └───────────────────┼───────────────────┘
                                │
                                ▼
                         ┌─────────────┐
                         │  Log Event  │
                         └─────────────┘
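
The parallel health-check stage of the workflow can be sketched with a thread pool. This is an illustrative sketch under stated assumptions: `run_health_checks`, `ping`, and `api_ok` are hypothetical names, the probes are passed in as plain callables, and treating a host as healthy only when both probes pass is an assumption (the page says only that the two checks are combined).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_health_checks(
    hosts: Iterable[str],
    ping: Callable[[str], bool],      # hypothetical probe: True if the host answers ping
    api_ok: Callable[[str], bool],    # hypothetical probe: True if /health responds correctly
    workers: int = 10,                # matches the 10 workers shown in the diagram
) -> Dict[str, bool]:
    """Return {host: healthy} after probing all hosts concurrently."""
    host_list = list(hosts)

    def check(host: str) -> bool:
        # A probe that raises counts as a failed check rather than
        # aborting the whole monitoring cycle.
        try:
            return ping(host) and api_ok(host)
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, host_list))
    return dict(zip(host_list, results))
```

Threads suit this stage because each probe spends its time waiting on the network, so ten workers can cover many hosts within one 30-second cycle.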

Prerequisites

1. IPMI Configuration

Each hypervisor must have IPMI (Intelligent Platform Management Interface) configured for remote power management:

  • IPMI enabled in BIOS/UEFI
  • Network connectivity to IPMI interface from master server
  • Valid credentials (username/password) for IPMI access
  • ipmitool installed on master server:

      # Debian/Ubuntu
      apt-get install ipmitool

      # CentOS/RHEL
      yum install ipmitool

2. Network Requirements

  • Master server must be able to reach all hypervisor IPMI interfaces
  • Hypervisors must be able to reach each other for migrations
  • Reliable network connection between master and all hypervisors

3. Storage Configuration

For automatic evacuation to work effectively:

  • Shared storage (Ceph/RBD datastores) recommended for instant migrations
  • Local storage instances will require full disk transfer during evacuation
  • Sufficient storage capacity on destination hypervisors

Configuration

Enable HA at Location Level

HA can be enabled for all hypervisors in a Cloud Service location:

  1. Navigate to Cloud Service > Locations
  2. Edit the desired location
  3. Configure HA settings:
    • HA Enabled: Toggle to enable HA for this location
    • Failure Threshold: Number of consecutive failures before taking action (default: 3)
    • Evacuation Enabled: Enable automatic instance evacuation on failure
  4. Click Save

All hypervisors assigned to this location will be monitored for HA.

Enable HA at Hypervisor Group Level

HA can also be enabled for specific hypervisor groups:

  1. Navigate to Compute > Hypervisor Groups
  2. Edit the desired group
  3. Configure HA settings:
    • HA Enabled: Toggle to enable HA for this group
    • Failure Threshold: Number of consecutive failures before taking action (default: 3)
    • Evacuation Enabled: Enable automatic instance evacuation on failure
  4. Click Save

All hypervisors in this group will be monitored for HA.
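
The Failure Threshold counts consecutive failed checks. A minimal sketch of that counting follows, assuming (this page does not confirm it) that a single passing check resets the counter and that a host is marked down only once per outage. `HostState` and `record_check` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostState:
    consecutive_failures: int = 0
    down: bool = False


def record_check(state: HostState, healthy: bool, threshold: int = 3) -> Optional[str]:
    """Update the state for one health check and return an event name, if any."""
    if healthy:
        recovered = state.down
        state.consecutive_failures = 0   # assumption: one success resets the count
        state.down = False
        return "hypervisor_up" if recovered else None

    state.consecutive_failures += 1
    if not state.down and state.consecutive_failures >= threshold:
        state.down = True
        return "hypervisor_down"
    return None
```

With the default threshold of 3, two failed checks followed by a success produce no event, which is how a short network blip avoids triggering fencing.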

Configure IPMI Credentials

Each hypervisor requires IPMI credentials for fencing:

  1. Navigate to Admin Panel > Hypervisors
  2. Edit the hypervisor
  3. Scroll to IPMI Configuration section:
    • IPMI Host: IP address or hostname of IPMI interface
    • IPMI Port: Port number (default: 623)
    • IPMI Username: IPMI username
    • IPMI Password: IPMI password
  4. Click Save

Testing IPMI

Test IPMI connectivity from the master server:

ipmitool -I lanplus -H <ipmi-host> -U <username> -P <password> power status
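
To run the same check from a script, the command can be wrapped as below. This is a sketch, not part of the product: `ipmi_power_status_cmd` and `ipmi_power_status` are hypothetical helper names. The `-p` flag passes the IPMI port and corresponds to the IPMI Port field above.

```python
import subprocess
from typing import List


def ipmi_power_status_cmd(host: str, username: str, password: str, port: int = 623) -> List[str]:
    """Build the ipmitool argument list for a power status query."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host,
        "-p", str(port),
        "-U", username,
        "-P", password,
        "power", "status",
    ]


def ipmi_power_status(host: str, username: str, password: str,
                      port: int = 623, timeout: float = 10.0) -> str:
    """Run the query and return ipmitool's output; raises if the command fails."""
    result = subprocess.run(
        ipmi_power_status_cmd(host, username, password, port),
        capture_output=True,
        text=True,
        timeout=timeout,   # avoid hanging on an unreachable BMC
        check=True,        # non-zero exit raises CalledProcessError
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems with passwords. Note that `-P` exposes the password in the process list; ipmitool's `-E` option, which reads it from the `IPMI_PASSWORD` environment variable, may be preferable on shared hosts.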

Configure Per-Instance HA

Individual instances can opt in to or out of HA evacuation:

  1. Navigate to the instance details page
  2. Edit the instance settings
  3. Toggle the HA Enabled field
  4. Set Max Restart Attempts: the number of restart attempts made before the instance is evacuated (default: 3)

When a hypervisor fails:

  • HA Enabled = Yes: Instance will be evacuated to another hypervisor
  • HA Enabled = No: Instance will remain on the failed hypervisor (manual intervention required)
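
The effect of the per-instance flag during a host failure can be sketched as a simple partition. The `Instance` type and `split_for_evacuation` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Instance:
    name: str
    ha_enabled: bool


def split_for_evacuation(instances: Iterable[Instance]) -> Tuple[List[Instance], List[Instance]]:
    """Return (to_evacuate, left_on_failed_host) based on each instance's HA flag."""
    items = list(instances)
    to_evacuate = [i for i in items if i.ha_enabled]
    left_behind = [i for i in items if not i.ha_enabled]   # need manual intervention
    return to_evacuate, left_behind
```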

Monitoring and Events

Event Types

The HA system logs four types of events:

Event Type          Description                        Trigger
hypervisor_up       Hypervisor recovered from failure  Health checks pass after being down
hypervisor_down     Hypervisor marked as failed        Consecutive failures reached threshold
hypervisor_fenced   IPMI power-off executed            Before evacuation begins
instance_evacuated  Instance migrated to new host      Evacuation completed successfully

Viewing Events

Access the HA event log:

  1. Navigate to System > HA Events

Each event includes:

  • Timestamp
  • Event type
  • Affected hypervisor/instance
  • Additional metadata (destination hypervisor, failure count, etc.)
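
For scripts that process exported events, a record with these fields might be modeled as below. The event type values come from the table above; the class and field names (`HAEventType`, `HAEvent`, `subject`, `metadata`) are assumptions, and the actual storage format is not described on this page.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict


class HAEventType(str, Enum):
    HYPERVISOR_UP = "hypervisor_up"
    HYPERVISOR_DOWN = "hypervisor_down"
    HYPERVISOR_FENCED = "hypervisor_fenced"
    INSTANCE_EVACUATED = "instance_evacuated"


@dataclass(frozen=True)
class HAEvent:
    event_type: HAEventType
    subject: str                                             # affected hypervisor or instance
    metadata: Dict[str, Any] = field(default_factory=dict)   # e.g. destination, failure count
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def needs_immediate_attention(event: HAEvent) -> bool:
    """Flag fencing events, which Best Practices suggests alerting on."""
    return event.event_type is HAEventType.HYPERVISOR_FENCED
```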

Best Practices

1. Set Appropriate Thresholds

  • Failure Threshold:
    • Too low (1-2): May trigger false positives during network blips
    • Too high (5+): Delays response to genuine failures
    • Recommended: 3-4 consecutive failures (90-120 seconds)
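
The 90-120 second figure follows from the 30-second check interval. The sketch below makes the arithmetic explicit; `detection_window` is a hypothetical helper, and it ignores the time each probe takes to run or time out.

```python
from typing import Tuple


def detection_window(threshold: int, interval_seconds: int = 30) -> Tuple[int, int]:
    """Return (best_case, worst_case) seconds from failure to reaching the threshold.

    Best case: the host fails just before a check, so the first failed
    check is almost immediate and threshold - 1 further intervals follow.
    Worst case: the host fails just after a check, so nearly a full
    interval passes before the first failed check.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    return (threshold - 1) * interval_seconds, threshold * interval_seconds
```

For thresholds of 3 and 4 the worst cases are 90 and 120 seconds, matching the recommendation above.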

2. Test IPMI Regularly

  • Verify IPMI credentials are current
  • Test power operations periodically
  • Monitor IPMI network connectivity
  • Keep IPMI firmware updated

3. Plan Evacuation Capacity

  • Ensure sufficient resources on remaining hypervisors for evacuated instances
  • Consider N+1 or N+2 redundancy when sizing infrastructure
  • Monitor resource utilization to avoid capacity issues during evacuation
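
A rough N+1 check can be scripted against your own inventory numbers. The sketch below is a simplification under stated assumptions: `hosts_without_headroom` is a hypothetical helper, capacity is reduced to a single number per host (for example RAM in GB), and it compares totals only, so it does not prove that each individual instance fits on some host.

```python
from typing import Dict, List, Tuple


def hosts_without_headroom(hosts: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return hosts whose load could not be absorbed by the others.

    `hosts` maps a host name to (capacity, used) in the same unit.
    """
    at_risk = []
    for name, (_capacity, used) in hosts.items():
        # Free capacity on every host except the one assumed to fail.
        spare_elsewhere = sum(c - u for n, (c, u) in hosts.items() if n != name)
        if used > spare_elsewhere:
            at_risk.append(name)
    return at_risk
```

An empty result means the cluster can, in aggregate, absorb the loss of any single host.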

4. Use Shared Storage When Possible

  • Ceph/RBD datastores: Near-instant evacuation (no disk transfer)
  • Local storage: Full disk transfer required (slower evacuation)
  • Mixed environments supported with automatic detection
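
The difference between the two storage types can be put into numbers with a back-of-the-envelope estimate. The sketch assumes an ideal, uncontended link and decimal gigabytes, so real transfers will take longer; `estimated_copy_seconds` and its parameters are hypothetical.

```python
def estimated_copy_seconds(disk_gb: float, link_gbps: float, shared_storage: bool) -> float:
    """Lower-bound estimate of disk transfer time during evacuation."""
    if shared_storage:
        return 0.0            # Ceph/RBD: the disk is already reachable from the new host
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return disk_gb * 8 / link_gbps   # gigabytes -> gigabits, divided by link speed
```

For example, a 100 GB local disk over a 10 Gbit/s link needs at least 80 seconds, while the same instance on shared storage needs no disk copy at all.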

5. Monitor Event Log

  • Review HA events regularly
  • Investigate frequent failures (may indicate hardware/network issues)
  • Alert on fencing events for immediate attention