[toygres] Automatic failover monitoring via instance actor #26

@affandar

Repository

affandar/toygres

Concept

The instance actor (the long-running orchestration that manages an instance's lifecycle) monitors the health of the primary and automatically triggers failover once consecutive health-check failures exceed a configured threshold.

Health Monitoring

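use std::time::Duration;

/// Settings for the instance actor's health monitor.
pub struct HealthMonitorConfig {
    /// Time between health checks.
    pub check_interval: Duration,
    /// Number of consecutive failed checks that triggers failover.
    pub failure_threshold: u32,
    /// Time limit for a single health check.
    pub check_timeout: Duration,
    /// Maximum replication lag, in bytes, for a replica to be eligible as a failover target.
    pub max_eligible_lag_bytes: u64,
}

/// Result of a single health check.
pub struct HealthCheck {
    pub connectivity: bool,
    pub query_responsive: bool,
    pub replication_healthy: bool,
    pub disk_space_ok: bool,
}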
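
The issue does not say how the four probe results combine into a single pass/fail verdict. A minimal sketch, assuming a check counts as passed only when every probe succeeds (the `passed` helper is hypothetical and not part of the proposal):

```rust
/// Mirrors the `HealthCheck` struct from the proposal.
#[derive(Debug, Clone, Copy)]
pub struct HealthCheck {
    pub connectivity: bool,
    pub query_responsive: bool,
    pub replication_healthy: bool,
    pub disk_space_ok: bool,
}

impl HealthCheck {
    /// Hypothetical helper: the check passes only if all probes succeeded.
    /// A real policy might weight probes differently (assumption).
    pub fn passed(&self) -> bool {
        self.connectivity
            && self.query_responsive
            && self.replication_healthy
            && self.disk_space_ok
    }
}
```

Whether a failing `disk_space_ok` alone should count toward failover is a policy question the proposal leaves open; the all-or-nothing rule here is only the simplest choice.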

Instance Actor Integration

  • Use durable timers for health check scheduling
  • Track consecutive failures
  • When the failure threshold is exceeded, select the best replica and trigger automatic failover
  • If no replica is eligible, send a critical alert
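
The failure-tracking step above could be sketched as a small counter that resets on any successful check. `FailureTracker` and `MonitorAction` are hypothetical names, and the sketch assumes failover triggers once the count reaches `failure_threshold` (use `>` instead if "exceeded" is meant strictly). Durable-timer scheduling is left to the orchestration framework and is not shown.

```rust
/// Hypothetical outcome of recording one health check result.
#[derive(Debug, PartialEq, Eq)]
pub enum MonitorAction {
    /// Keep monitoring; schedule the next check.
    Continue,
    /// Enough consecutive failures: start target selection and failover.
    TriggerFailover,
}

/// Hypothetical state the instance actor would carry between checks.
#[derive(Debug, Default)]
pub struct FailureTracker {
    consecutive_failures: u32,
}

impl FailureTracker {
    /// Records one check result and decides what the actor should do next.
    pub fn record(&mut self, check_passed: bool, failure_threshold: u32) -> MonitorAction {
        if check_passed {
            // Any success breaks the streak.
            self.consecutive_failures = 0;
            return MonitorAction::Continue;
        }
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        if self.consecutive_failures >= failure_threshold {
            MonitorAction::TriggerFailover
        } else {
            MonitorAction::Continue
        }
    }
}
```

Resetting on success means only an unbroken run of failures can trigger failover, so a single flaky check cannot accumulate toward the threshold over time.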

Failover Target Selection

Select replica with:

  • State == Streaming
  • Replication lag within the max_eligible_lag_bytes threshold
  • Minimum lag among eligible replicas
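
The selection rule above could be expressed as a filter followed by a minimum. The `Replica` and `ReplicaState` types are hypothetical stand-ins (the issue only names the `Streaming` state), and lag is assumed to be measured in bytes to match `max_eligible_lag_bytes`.

```rust
/// Hypothetical replica states; only `Streaming` is named in the proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReplicaState {
    Streaming,
    CatchingUp,
    Disconnected,
}

/// Hypothetical view of a replica as seen by the instance actor.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Replica {
    pub name: String,
    pub state: ReplicaState,
    pub lag_bytes: u64,
}

/// Returns the streaming replica with the lowest lag that is within the
/// eligibility threshold, or `None` if no replica qualifies.
pub fn select_failover_target(
    replicas: &[Replica],
    max_eligible_lag_bytes: u64,
) -> Option<&Replica> {
    replicas
        .iter()
        .filter(|r| r.state == ReplicaState::Streaming)
        .filter(|r| r.lag_bytes <= max_eligible_lag_bytes)
        .min_by_key(|r| r.lag_bytes)
}
```

A `None` result corresponds to the "no eligible replica" case, where the actor would send a critical alert instead of failing over. On ties, `min_by_key` returns the first replica in iteration order.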

Safeguards

  • Cooldown period between automatic failovers
  • Manual override to disable automatic failover
  • Quorum check to prevent false positives from network partition
  • Notification to on-call before/during failover
  • Audit log of all automatic failover decisions
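
Three of the safeguards above (manual override, cooldown, quorum) can be sketched as a single gate evaluated before any automatic failover. All names here are hypothetical, and the quorum rule assumes a strict majority of independent observers must report the primary as down; the issue does not define how quorum is measured. Notification and audit logging are omitted.

```rust
use std::time::{Duration, Instant};

/// Hypothetical safeguard settings and state.
pub struct FailoverGuards {
    /// Manual override: `false` disables automatic failover entirely.
    pub auto_failover_enabled: bool,
    /// Minimum time between two automatic failovers.
    pub cooldown: Duration,
    /// When the last automatic failover happened, if any.
    pub last_failover: Option<Instant>,
}

/// Hypothetical result of evaluating the safeguards.
#[derive(Debug, PartialEq, Eq)]
pub enum GuardDecision {
    Proceed,
    Disabled,
    CoolingDown,
    NoQuorum,
}

impl FailoverGuards {
    /// `now` is passed in so the caller controls the clock, which matters
    /// if the surrounding orchestration has to stay deterministic.
    pub fn evaluate(
        &self,
        now: Instant,
        observers_reporting_down: usize,
        total_observers: usize,
    ) -> GuardDecision {
        if !self.auto_failover_enabled {
            return GuardDecision::Disabled;
        }
        if let Some(last) = self.last_failover {
            if now.saturating_duration_since(last) < self.cooldown {
                return GuardDecision::CoolingDown;
            }
        }
        // Strict majority required (assumption); guards against a
        // partitioned monitor acting on its own view of the primary.
        if observers_reporting_down * 2 <= total_observers {
            return GuardDecision::NoQuorum;
        }
        GuardDecision::Proceed
    }
}
```

Each non-`Proceed` variant names the safeguard that blocked the failover, which would make the decision straightforward to record in the audit log.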

See: proposals/toygres-improvements.md

Labels: toygres (Toygres test application)
