[Feature Request] Support cordon compute node #563

## Problem / Motivation

In multi-node CubeSandbox clusters, **CubeMaster** continuously schedules new sandbox instances onto any **Cubelet** (compute node) that has available capacity. Today there is no supported way to tell the scheduler "stop placing new sandboxes on this node, but leave the ones already running alone."

This makes routine cluster operations risky and disruptive. Operators are affected whenever they need to:

- **Perform node maintenance**: kernel/host upgrades, hardware repair, Cubelet/CubeHypervisor version rollouts. Without cordon, new sandboxes keep landing on a node you are trying to drain, so the node never quiesces.
- **Decommission a node**: gradually remove a machine from the pool by letting its existing sandboxes age out naturally instead of force-killing them.
- **Investigate a degraded node**: isolate a node showing high error rates, disk pressure, or network issues from new scheduling while keeping current workloads observable.
- **Canary / partial rollout**: keep a subset of nodes from receiving new sandboxes during a staged upgrade.

The only current workarounds (stopping the Cubelet service, or firewalling the node) are destructive — they kill or orphan the running sandboxes on that node, which defeats the purpose of a graceful operation.

## Proposed Solution

Add a **cordon / uncordon** capability for compute nodes, analogous to `kubectl cordon` / `kubectl uncordon` in Kubernetes.

Behavior:
- A cordoned node is marked **unschedulable**: CubeMaster excludes it from placement decisions for new sandbox creation requests.
- **Existing sandboxes on the node continue running** and remain fully usable (exec, snapshot, network) — cordon only affects *new* placement.
- `uncordon` returns the node to the schedulable pool.
- The unschedulable state should be **persisted** in cluster state so it survives CubeMaster restarts and is not silently reset.
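The behavior above can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Node` fields, the "most free slots" placement policy, and the JSON state file are assumptions made for illustration, not CubeMaster's actual scheduler or state store. The point is only that cordon is a persisted per-node flag that the placement filter checks, and that it never touches sandboxes already running.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Node:
    node_id: str
    free_slots: int
    running: int = 0          # sandboxes already on the node (hypothetical field)
    schedulable: bool = True  # False once the node is cordoned


def pick_node(nodes):
    """Pick the schedulable node with the most free slots, or None."""
    candidates = [n for n in nodes if n.schedulable and n.free_slots > 0]
    return max(candidates, key=lambda n: n.free_slots, default=None)


def set_schedulable(nodes, node_id, schedulable, state_path):
    """Flip the flag and persist it so a restart does not silently reset it."""
    for n in nodes:
        if n.node_id == node_id:
            n.schedulable = schedulable
            break
    else:
        raise KeyError(node_id)
    with open(state_path, "w") as f:
        json.dump([asdict(n) for n in nodes], f)


def load_nodes(state_path):
    """Rebuild node state (including the cordon flag) after a restart."""
    with open(state_path) as f:
        return [Node(**d) for d in json.load(f)]
```

Cordoning is `set_schedulable(nodes, "a", False, path)` and uncordoning is the same call with `True`. The `running` count is never modified by either call, which mirrors the requirement that existing sandboxes are left alone.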

Suggested surface (open to maintainer direction):

- **Admin API on CubeMaster**, e.g.:
  - `POST /api/v1/nodes/{node_id}/cordon`
  - `POST /api/v1/nodes/{node_id}/uncordon`
  - Node listing (`GET /api/v1/nodes`) exposes a `schedulable: true|false` field.
- A CLI convenience wrapper, e.g. `cube-cli node cordon <node_id>` / `cube-cli node uncordon <node_id>`, if a cluster CLI exists.
- Optional: a reason/annotation string recorded with the cordon for auditing.
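To make the suggested surface concrete, here is a rough, framework-free sketch of how the three endpoints could behave. The routes and the `schedulable` field come from the list above; the `handle` function, the in-memory `nodes` dict, the `cordon_reason` field, and the `reason` request key are assumptions for illustration only and do not describe any existing CubeMaster API.

```python
import re

# Matches the two suggested admin routes.
ROUTE = re.compile(r"^/api/v1/nodes/(?P<node_id>[^/]+)/(?P<action>cordon|uncordon)$")


def handle(method, path, nodes, body=None):
    """Return (status, json_body) for the suggested admin endpoints.

    `nodes` maps node_id -> {"schedulable": bool, "cordon_reason": str or None}.
    """
    if method == "GET" and path == "/api/v1/nodes":
        return 200, [{"node_id": nid, **info} for nid, info in sorted(nodes.items())]

    m = ROUTE.match(path)
    if method != "POST" or m is None:
        return 404, {"error": "not found"}

    node = nodes.get(m["node_id"])
    if node is None:
        return 404, {"error": "unknown node"}

    if m["action"] == "cordon":
        node["schedulable"] = False
        # Optional audit annotation; absent if the caller sends no reason.
        node["cordon_reason"] = (body or {}).get("reason")
    else:
        node["schedulable"] = True
        node["cordon_reason"] = None
    return 200, {"node_id": m["node_id"], **node}
```

In this sketch, repeating a cordon or uncordon call leaves the node in the same schedulable state, which keeps the operation safe to retry from scripts.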

A natural follow-up (could be a separate issue) is a **drain** operation built on top of cordon: cordon the node, then proactively migrate or wait out the existing sandboxes before maintenance.
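As a sketch of how such a drain might sit on top of cordon, the loop below covers only the "wait out" variant, not migration. The `drain` function and its `cordon` and `count_sandboxes` callables are hypothetical stand-ins for whatever CubeMaster would expose; the timeout and poll interval are arbitrary defaults.

```python
import time


def drain(node_id, cordon, count_sandboxes,
          timeout_s=3600.0, poll_s=5.0,
          sleep=time.sleep, clock=time.monotonic):
    """Cordon the node, then wait until it has no sandboxes left.

    Returns True if the node emptied before the timeout, False otherwise.
    The node stays cordoned in both cases.
    """
    cordon(node_id)  # stop new placement first, so the count can only fall
    deadline = clock() + timeout_s
    while True:
        if count_sandboxes(node_id) == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Cordoning before the first count is the important ordering: without it, new sandboxes could land between polls and the loop might never terminate.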

## Alternatives Considered

- **Stop the Cubelet service on the node** — works to halt new placement, but immediately terminates all running sandboxes on that node. Not graceful.
- **Firewall / network-isolate the node** — orphans running sandboxes and breaks CubeProxy routing; also destructive and hard to reason about.
- **Set node capacity/weight to 0 via config** — if such tuning exists, it is static config rather than a runtime, per-node, reversible operation, and would still require a restart/reload. Cordon is intended to be an online, instantly reversible scheduling flag.

A first-class cordon flag in CubeMaster's scheduler is preferred because it is non-destructive, runtime-controllable, reversible, and matches the mental model operators already have from Kubernetes.

## Additional Context

- Components involved: **CubeMaster** (scheduler / cluster state), with a read-only `schedulable` field surfaced for **Cubelet** nodes.
- Prior art: Kubernetes node cordon/uncordon/drain (https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#cordon).
- Environment: multi-node CubeSandbox cluster deployment (CubeMaster + multiple Cubelets). Relevant to any production cluster that needs zero-downtime node maintenance.

