Description
Summary
Model: qwen3-next-80B-A3.
On an 8‑node cluster (8×GPU per node, ≈1.5 TB RAM per node), Ray’s OOM prevention repeatedly kills freshly scheduled vLLM actors during engine initialization (collective_rpc → offload_states). This happens even with NVMe object spilling configured. Moderate reductions in parallelism did not help; the only stable setting so far is GRPO_GROUP=1 (a single inference actor), which suggests the node‑level kill is triggered by concurrent actor initialization spikes rather than steady‑state memory use.
Environment
• Cluster: 8 nodes; 8× GPUs per node
• RAM per node: ≈1.5 TB physical (Slurm memory limit: 1200 GB)
• Orchestration: Slurm + Singularity/Apptainer (containers)
• Python: 3.11
• PyTorch: 2.8.0 (CUDA 12.8)
• vLLM: 0.10.2
• Ray: (cluster started by the orchestrator; can provide exact version if needed)
• Storage: local NVMe used for Ray temp + object spill
• Ray object spilling: filesystem to NVMe (directory redacted)
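For concreteness, the spill setup is roughly equivalent to the sketch below. Directory paths are placeholders (the real ones are redacted), and in our deployment the same settings are passed when the orchestrator runs `ray start` rather than set in Python:

```python
import json
import ray

# Roughly what our spilling setup amounts to; paths are placeholders.
# _system_config is only honored when starting the head node.
ray.init(
    _temp_dir="/mnt/nvme/ray_tmp",  # Ray temp dir on local NVMe
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem",
             "params": {"directory_path": "/mnt/nvme/ray_spill"}}
        )
    },
)
```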
What happened
During pipeline startup, multiple vLLM inference actors are created nearly concurrently. While the engines call offload_states, node memory briefly spikes (actor RSS + Ray object store reservation + page cache). Ray’s node memory monitor then kills the most recently scheduled actor at around 92–93% node RAM usage (default RAY_memory_usage_threshold=0.90).
Sanitized excerpt:
Exception: Task was killed due to the node running low on memory.
Memory on the node was 1399.68GB / 1510.89GB (0.926...), which exceeds the memory usage threshold of 0.9.
Ray killed this worker because it was the most recently scheduled task.
Top memory users:
~12–19 GB each: ray::RayWorkerWrapper.execute_method (several)
~0.6–0.7 GB: ray::ActorWorker.initialize / reward worker
Stack (abbrev.):
ray::ActorWorker.initialize()
→ roll.distributed.strategy.vllm_strategy: offload_states
→ vllm.llm: collective_rpc
→ vllm.v1.engine.core_client: call_utility
→ concurrent.futures ... "Task was killed due to the node running low on memory"
Steps to reproduce (high‑level)
1. Start a Ray cluster inside containers across 8 nodes (8×GPU each), with Ray temp and object spill on local NVMe.
2. Launch a rollout/RL‑style pipeline that spins up multiple vLLM inference actors concurrently (e.g., GRPO/RLVR).
3. During vLLM engine initialization (collective_rpc → offload_states), node memory usage spikes into the low‑90% range.
4. Ray kills the most recently scheduled vLLM actor as above.
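We cannot share the pipeline itself, but the memory pattern is roughly captured by the stand-in below: many actors created at once, each with a large transient allocation during __init__ and a much smaller steady state. The actor class, spike size, and counts are illustrative, not taken from the real workload.

```python
import numpy as np
import ray

ray.init(address="auto")

@ray.remote(num_gpus=1)
class FakeInferenceActor:
    """Stand-in for a vLLM inference actor: the __init__ spike mimics the
    transient host-memory usage we see during collective_rpc -> offload_states."""

    def __init__(self, spike_gb: int = 16):
        buf = np.empty(spike_gb * 1024**3, dtype=np.uint8)
        buf[::4096] = 1   # touch pages so RSS actually grows
        del buf           # steady-state memory is far below the init spike

    def ping(self):
        return "ready"

# Creating many actors nearly simultaneously reproduces the bursty,
# node-level memory pattern that trips the memory monitor.
actors = [FakeInferenceActor.remote() for _ in range(8)]
ray.get([a.ping.remote() for a in actors])
```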
What we tried
• Raise Ray’s kill threshold: RAY_memory_usage_threshold from 0.90 → 0.97; RAY_memory_monitor_refresh_ms from 1000 → 2000 (see the sketch after this list).
→ Fewer kills, but still fragile during the bursty init phase.
• Move Ray temp & spill to NVMe: effective for object spill, but does not change the node‑level kill behavior.
• Reduce parallelism (these did not stabilize init):
• GRPO_GROUP: 8 → 4
• VP_SIZE: 12 → 8
• ROLLOUT_BS: 64 → 48 (→ 32)
• Current workaround (stable): GRPO_GROUP=1 (single inference actor). With one actor, the concurrent init spike disappears and the job runs.
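For the first bullet, the settings we changed are Ray’s documented OOM-prevention environment variables. A minimal sketch, assuming the variables reach the environment of the `ray start` processes on every node (in our case via the Slurm job script), since the memory monitor runs in the raylet and driver-only settings have no effect:

```python
import os

# Mirrors what we export before `ray start` in the Slurm script.
os.environ["RAY_memory_usage_threshold"] = "0.97"      # was 0.90 by default on our version
os.environ["RAY_memory_monitor_refresh_ms"] = "2000"   # was 1000 here; 0 disables the monitor
```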
Object store sizing
We understand the Ray object store defaults to reserving a large fraction of node memory (≈30%), which is quite big on 1.5 TB nodes. In our setup, Ray is started by the orchestrator and we have not been able to set --object-store-memory yet. We can adopt it if there’s a recommended value or a supported way to pass it through in containerized, orchestrated clusters.
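If there is a recommended cap, we would apply it either via `ray start --object-store-memory=<bytes>` in the orchestrator’s launch command or, where we control `ray.init`, as in the sketch below. The 128 GiB figure is purely illustrative, not a value we have validated:

```python
import ray

# Hypothetical cap on the object store (in bytes). On a 1.5 TB node the ~30%
# default reserves hundreds of GB up front; a fixed cap would shrink the
# "baseline used" memory that the monitor sees.
ray.init(object_store_memory=128 * 1024**3)
```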
Questions
1. Recommended configuration for large‑RAM nodes suffering from short init spikes:
• Practical ranges for RAY_memory_usage_threshold and RAY_memory_monitor_refresh_ms to avoid premature kills while staying safe from OS OOM?
2. Sizing the object store on very large RAM nodes:
• Is the 30% heuristic appropriate here? Any rule of thumb (e.g., cap at 128–256 GiB) to reduce the “baseline used” and avoid false positives?
3. Accounting vs page cache:
• Does the memory monitor count Linux page cache as “used”? Is there a way to base decisions on MemAvailable to avoid penalizing transient cache?
4. Initialization bursts:
• Best practices to coordinate/stagger actor startup or temporarily relax the monitor during known burst phases (e.g., vLLM offload_states)?
5. Alternative controls:
• Would per‑actor memory requests/limits or scheduling hints help stagger actor creation? (A rough sketch of what we have in mind follows this list.)
• Any plans for a built‑in grace period before killing newly scheduled tasks?
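To make questions 4 and 5 concrete, this is roughly what we imagine, using a placeholder actor class: a logical per‑actor memory request so the scheduler spreads actors across nodes, plus naive wave‑based creation so the init spikes do not overlap. We have not validated that either actually prevents the kills.

```python
import ray

@ray.remote
class InferenceActor:  # placeholder for the real vLLM actor class
    def ready(self):
        return True

# (a) Logical memory request: used for scheduling/admission, not a hard limit,
#     but it keeps too many actors from landing on one node at once.
actor_cls = InferenceActor.options(num_gpus=1, memory=64 * 1024**3)

# (b) Wave-based creation: start a few actors, wait until they are initialized,
#     then start the next wave, so the per-actor init spikes are staggered.
actors = []
for _ in range(4):                               # 4 waves of 2 actors = 8 total
    wave = [actor_cls.remote() for _ in range(2)]
    ray.get([a.ready.remote() for a in wave])    # block until this wave is up
    actors.extend(wave)
```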
Expected behavior
Avoid killing actors during short, predictable initialization spikes when the system would be healthy after the burst (especially with object spill enabled), or provide clear guidance to tune thresholds and object store sizing for such workloads.
Actual behavior
Freshly scheduled vLLM actors are killed by Ray’s OOM prevention at ~92–93% node usage during initialization bursts, leading to repeated restarts unless we reduce to GRPO_GROUP=1 (single actor).
Additional context
• We can provide ray --version and raylet.out snippets if helpful.
• All paths, IPs, and job identifiers here are redacted.