Problem / Motivation
CubeSandbox currently has two reliability gaps around node lifecycle:
- New node join gap
  Template distribution is mostly event-time based. A newly joined node can miss historical distribution events, so it may not have required local rootfs/template replicas, which reduces schedulability.
- Node failure gap
  When a node fails, all running sandboxes on that node are lost. There is no unified automatic failover path to reschedule/restart them on healthy nodes.
At the same time, Cube’s local cubecow fast path is a major performance advantage and should be preserved.
Proposed Solution
Introduce a unified reliability design with three parts:
- Join Backfill (new node warmup)
  - Add WARMING state for newly joined nodes.
  - Backfill active/pinned templates to local replicas before the node becomes READY.
  - Gate readiness by a minimum coverage threshold (for example, 80%).
- Sandbox Failover Controller
  - Detect node NOT_READY / failure.
  - Enumerate the sandboxes that were running on that node.
  - Auto-reschedule and restart them on healthy nodes:
    - Prefer existing local replicas.
    - Fall back to restoring from backup artifacts when local replicas are missing.
  - Track per-sandbox failover jobs with retry/backoff and observability.
- Snapshot DR Backup (side path, async)
  - Keep local cubecow as hot path.
  - Add async backup upload to object storage after snapshot/template becomes locally ready.
  - Use backup only for disaster/failover scenarios (not as default read path).
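The Join Backfill readiness gate from the first part could look roughly like the sketch below. It is illustrative only: `NodeState`, `Template`, `backfill_targets`, `coverage`, and `next_state` are hypothetical names, not existing CubeSandbox APIs, and the 0.8 default simply mirrors the 80% example above.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    WARMING = "WARMING"
    READY = "READY"


@dataclass(frozen=True)
class Template:
    # Hypothetical shape; real template metadata may differ.
    template_id: str
    active: bool = False
    pinned: bool = False


def backfill_targets(templates):
    """IDs of templates a warming node should replicate (active or pinned)."""
    return {t.template_id for t in templates if t.active or t.pinned}


def coverage(required_ids, local_ids):
    """Fraction of required templates that already have a local replica."""
    if not required_ids:
        return 1.0  # nothing to backfill, so the node counts as covered
    return len(required_ids & local_ids) / len(required_ids)


def next_state(templates, local_ids, min_coverage=0.8):
    """Keep the node in WARMING until coverage reaches the threshold."""
    required = backfill_targets(templates)
    if coverage(required, set(local_ids)) >= min_coverage:
        return NodeState.READY
    return NodeState.WARMING
```

With five active or pinned templates, a node holding four of them would become READY under this sketch, while a node holding three would stay in WARMING and keep backfilling.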
Principles
- Preserve current hot-path latency for local snapshot/rollback/create.
- Add a reliable bypass path for node-loss recovery.
- Support gradual rollout with feature flags and per-cluster enablement.
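One way to keep the hot path unaffected while still producing backups is to hand uploads to a background worker. The sketch below is a minimal illustration under assumptions: `BackupUploader` and its methods are hypothetical names, and the `upload` callback stands in for whatever object-storage client is actually used.

```python
import queue
import threading


class BackupUploader:
    """Uploads locally ready artifacts off the hot path (illustrative only)."""

    def __init__(self, upload):
        self._upload = upload          # assumed callback: upload(artifact_id)
        self._queue = queue.Queue()
        self.failed = []               # artifact IDs whose upload raised
        threading.Thread(target=self._worker, daemon=True).start()

    def on_locally_ready(self, artifact_id):
        """Called once a snapshot/template is locally ready; returns immediately."""
        self._queue.put(artifact_id)

    def _worker(self):
        while True:
            artifact_id = self._queue.get()
            try:
                self._upload(artifact_id)
            except Exception:
                # A failed backup must never affect the local fast path.
                self.failed.append(artifact_id)
            finally:
                self._queue.task_done()

    def wait_idle(self):
        """Block until every queued upload has been attempted."""
        self._queue.join()
```

Here the local create/snapshot path only pays for an in-memory enqueue. A real implementation would likely retry failed uploads and persist the queue across restarts, which this sketch leaves out.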
The proposed hybrid model keeps Cube’s performance strengths while adding practical disaster recovery for node failures.
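The Sandbox Failover Controller behavior (prefer a local replica, otherwise restore from backup, with retry/backoff per sandbox) can be sketched as follows. All names here (`FailoverJob`, `pick_target`, `backoff_seconds`, `run_job`) and the `restart` callback are assumptions for illustration, not part of any existing CubeSandbox interface.

```python
import time
from dataclasses import dataclass


@dataclass
class FailoverJob:
    # Hypothetical per-sandbox job record.
    sandbox_id: str
    template_id: str
    attempts: int = 0
    status: str = "PENDING"


def pick_target(template_id, healthy_nodes):
    """Return (node, source) for a restart.

    healthy_nodes maps node name -> set of template IDs with a local replica.
    A node that already holds the template is preferred; otherwise the first
    healthy node is used and the sandbox is restored from backup.
    """
    for node, local_templates in healthy_nodes.items():
        if template_id in local_templates:
            return node, "local_replica"
    fallback = next(iter(healthy_nodes), None)
    if fallback is None:
        return None, None
    return fallback, "backup_restore"


def backoff_seconds(attempt, base=1.0, cap=60.0):
    """Exponential backoff: base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


def run_job(job, healthy_nodes, restart, max_attempts=5, sleep=time.sleep):
    """Try to restart the sandbox until it succeeds or attempts run out.

    restart(sandbox_id, node, source) is an assumed callback returning True
    on success.
    """
    while job.attempts < max_attempts:
        node, source = pick_target(job.template_id, healthy_nodes)
        if node is not None and restart(job.sandbox_id, node, source):
            job.status = "SUCCEEDED"
            return node, source
        sleep(backoff_seconds(job.attempts))
        job.attempts += 1
    job.status = "FAILED"
    return None, None
```

The `status` and `attempts` fields are the natural place to hang the observability mentioned in the proposal, for example by exporting them as per-job metrics.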