[Feature Request] Unified node reliability: join backfill + sandbox failover + snapshot DR backup #578

@mj37yhyy

Problem / Motivation

CubeSandbox currently has two reliability gaps around node lifecycle:

  1. New node join gap
    Template distribution is mostly event-driven, happening at the time of each distribution event. A newly joined node can miss historical distribution events, so it may lack the required local rootfs/template replicas, which reduces its schedulability.

  2. Node failure gap
    When a node fails, all running sandboxes on that node are lost. There is no unified automatic failover path to reschedule/restart them on healthy nodes.

At the same time, Cube’s local cubecow fast path is a major performance advantage and should be preserved.

Proposed Solution

Introduce a unified reliability design with three parts:

  1. Join Backfill (new node warmup)

    • Add WARMING state for newly joined nodes.
    • Backfill active/pinned templates to local replicas before the node becomes READY.
    • Gate readiness by a minimum coverage threshold (for example, 80%).
  2. Sandbox Failover Controller

    • Detect node NOT_READY / failure.
    • Enumerate the sandboxes that were running on that node.
    • Auto-reschedule and restart them on healthy nodes:
      • Prefer existing local replicas.
      • Fall back to restoring from backup artifacts when local replicas are missing.
    • Track per-sandbox failover jobs with retry/backoff and observability.
  3. Snapshot DR Backup (side path, async)

    • Keep local cubecow as hot path.
    • Add async backup upload to object storage after snapshot/template becomes locally ready.
    • Use backup only for disaster/failover scenarios (not as default read path).
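
To make the readiness gate and the failover restore order concrete, here is a minimal sketch in Python. It is illustrative only: the names `Node`, `NodeState`, `coverage`, `update_readiness`, and `plan_restore` are hypothetical and do not correspond to existing CubeSandbox APIs, and the 0.8 default simply mirrors the 80% example above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set, Tuple


class NodeState(Enum):
    WARMING = "WARMING"      # joined, still backfilling templates
    READY = "READY"          # schedulable
    NOT_READY = "NOT_READY"  # failed or unreachable


@dataclass
class Node:
    # Hypothetical node record; a real implementation would carry more fields.
    name: str
    state: NodeState = NodeState.WARMING
    local_templates: Set[str] = field(default_factory=set)


def coverage(node: Node, required: Set[str]) -> float:
    """Fraction of required (active/pinned) templates with a local replica."""
    if not required:
        return 1.0
    return len(node.local_templates & required) / len(required)


def update_readiness(node: Node, required: Set[str],
                     threshold: float = 0.8) -> NodeState:
    """Promote a WARMING node to READY once coverage reaches the threshold."""
    if node.state is NodeState.WARMING and coverage(node, required) >= threshold:
        node.state = NodeState.READY
    return node.state


def plan_restore(template: str, healthy: List[Node],
                 backed_up: Set[str]) -> Tuple[str, Optional[str]]:
    """Choose how a sandbox from a failed node is restarted.

    Prefers a READY node that already holds a local replica; otherwise falls
    back to a backup artifact restored onto any READY node.
    """
    ready = [n for n in healthy if n.state is NodeState.READY]
    for n in ready:
        if template in n.local_templates:
            return ("local", n.name)
    if template in backed_up and ready:
        return ("backup", ready[0].name)
    return ("unrecoverable", None)
```

In this sketch the backup path is consulted only after every READY node has been checked for a local replica, which keeps object storage off the default read path.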

Principles

  • Preserve current hot-path latency for local snapshot/rollback/create.
  • Add a reliable bypass path for node-loss recovery.
  • Support gradual rollout with feature flags and per-cluster enablement.
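
One possible shape for the per-cluster enablement is sketched below in Python. The flag names, the `ReliabilityFlags` type, and the `flags_for` helper are assumptions made for illustration, not an existing configuration schema.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReliabilityFlags:
    # Hypothetical flags, one per proposed part; all off by default so that
    # clusters without an explicit override keep today's behavior.
    join_backfill: bool = False
    sandbox_failover: bool = False
    snapshot_backup: bool = False


DEFAULT_FLAGS = ReliabilityFlags()


def flags_for(cluster: str,
              overrides: Dict[str, ReliabilityFlags]) -> ReliabilityFlags:
    """Return the flags for a cluster, falling back to the all-off default."""
    return overrides.get(cluster, DEFAULT_FLAGS)


# Example rollout: only a canary cluster enables backfill and backup.
overrides = {
    "canary": ReliabilityFlags(join_backfill=True, snapshot_backup=True),
}
canary = flags_for("canary", overrides)
prod = flags_for("prod", overrides)
```

Defaulting every flag to off means a cluster opts in to each part independently, which matches the gradual rollout described above.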

The proposed hybrid model keeps Cube’s performance strengths while adding practical disaster recovery for node failures.
