[Feature Request] Unified node reliability: join backfill + sandbox failover + snapshot DR backup #578

## Problem / Motivation

CubeSandbox currently has two reliability gaps around node lifecycle:

1. **New node join gap**  
   Template distribution mostly happens at event time, so a node that joins later can miss earlier distribution events. It may then lack required local rootfs/template replicas, which reduces its schedulability.

2. **Node failure gap**  
   When a node fails, all running sandboxes on that node are lost. There is no unified automatic failover path to reschedule/restart them on healthy nodes.

At the same time, Cube’s local cubecow fast path is a major performance advantage and should be preserved.

## Proposed Solution

Introduce a unified reliability design with three parts:

1. **Join Backfill (new node warmup)**  
   - Add `WARMING` state for newly joined nodes.  
   - Backfill active/pinned templates to local replicas before the node becomes `READY`.  
   - Gate readiness by a minimum coverage threshold (for example, 80%).

2. **Sandbox Failover Controller**  
   - Detect node `NOT_READY` / failure.  
   - Enumerate running sandboxes previously on that node.  
   - Auto-reschedule and restart them on healthy nodes:
     - Prefer existing local replicas.
     - Fall back to restoring from backup artifacts when local replicas are missing.
   - Track per-sandbox failover jobs with retry/backoff and observability.

3. **Snapshot DR Backup (side path, async)**  
   - Keep local cubecow as hot path.  
   - Add async backup upload to object storage after snapshot/template becomes locally ready.  
   - Use backup only for disaster/failover scenarios (not as default read path).
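To make the proposal concrete, here is a minimal Python sketch of the readiness gate and the failover placement decision described above. It is illustrative only: `Node`, `coverage`, `maybe_promote`, `plan_failover`, and `backoff_delay` are hypothetical names, not existing CubeSandbox APIs, and the 80% threshold and backoff parameters are example values.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    WARMING = "WARMING"
    READY = "READY"
    NOT_READY = "NOT_READY"


class RestoreSource(Enum):
    LOCAL_REPLICA = "local_replica"  # hot path: template already on the target node
    BACKUP = "backup"                # DR path: restore from object storage


@dataclass
class Node:
    # Hypothetical node record; real scheduler state will differ.
    name: str
    state: NodeState = NodeState.WARMING
    local_templates: set[str] = field(default_factory=set)


def coverage(node: Node, required: set[str]) -> float:
    """Fraction of required (active/pinned) templates present on the node."""
    if not required:
        return 1.0
    return len(node.local_templates & required) / len(required)


def maybe_promote(node: Node, required: set[str], threshold: float = 0.8) -> NodeState:
    """Move a WARMING node to READY once backfill coverage reaches the threshold."""
    if node.state is NodeState.WARMING and coverage(node, required) >= threshold:
        node.state = NodeState.READY
    return node.state


def plan_failover(template: str, nodes: list[Node], backed_up: set[str]):
    """Pick a target node and restore source for one sandbox from a failed node.

    Returns (node, source), or None when no placement is possible right now
    (the failover job would then be retried with backoff).
    """
    ready = [n for n in nodes if n.state is NodeState.READY]
    for n in ready:
        if template in n.local_templates:
            return n, RestoreSource.LOCAL_REPLICA
    if ready and template in backed_up:
        return ready[0], RestoreSource.BACKUP
    return None


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Capped exponential backoff (seconds) between failover job retries."""
    return min(cap, base * (2 ** attempt))
```

In this sketch the backup is consulted only when no `READY` node holds a local replica, which matches the intent of keeping object storage off the default read path.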

### Principles

- Preserve current hot-path latency for local snapshot/rollback/create.
- Add a reliable bypass path for node-loss recovery.
- Support gradual rollout with feature flags and per-cluster enablement.
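As a rough illustration of per-cluster enablement, the three parts could be gated independently so each can be rolled out on its own. The flag names and the `flags_for` helper below are assumptions made for this sketch, not an existing configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReliabilityFlags:
    # Hypothetical flags; everything defaults to off so current behavior is unchanged.
    join_backfill: bool = False
    sandbox_failover: bool = False
    snapshot_backup: bool = False


DEFAULTS = ReliabilityFlags()


def flags_for(cluster: str, overrides: dict[str, ReliabilityFlags]) -> ReliabilityFlags:
    """Return the flags for a cluster, falling back to the all-off defaults."""
    return overrides.get(cluster, DEFAULTS)


# Example: enable backup first on a canary cluster, then add failover later.
overrides = {"canary": ReliabilityFlags(snapshot_backup=True)}
canary = flags_for("canary", overrides)
prod = flags_for("prod", overrides)
```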

The proposed hybrid model keeps Cube’s performance strengths while adding practical disaster recovery for node failures.

