Add SLURM script for launching multi-node Ray clusters with Singularity #1269
Conversation
…ty/Apptainer Signed-off-by: Federico D'Ambrosio <[email protected]>
Greptile Summary
This PR adds deployment infrastructure for running NeMo Curator on SLURM clusters with Singularity/Apptainer in air-gapped environments. The script provides a solid foundation for HPC users to deploy NeMo Curator pipelines efficiently.
Confidence Score: 4/5
Sequence Diagram
sequenceDiagram
participant User
participant SLURM
participant Script as ray-singularity-sbatch.sh
participant HeadNode as Ray Head Node
participant WorkerNodes as Ray Worker Nodes
participant Container as Singularity Container
User->>SLURM: sbatch with RUN_COMMAND env var
SLURM->>Script: Allocate nodes and start job
Script->>Script: Detect resources (CPU/GPU per node)
Script->>Script: Create temp directories (ray_tmp, ray_spill, etc.)
Script->>Script: Set up environment variables
Script->>HeadNode: srun on first node
HeadNode->>Container: singularity exec with --nv --bind
Container->>Container: ray start --head --block &
Note over Container: Exposes GCS, Dashboard, Client ports
Script->>Script: Wait HEAD_STARTUP_WAIT seconds
loop For each worker node
Script->>WorkerNodes: srun on worker node
WorkerNodes->>Container: singularity exec with --nv --bind
Container->>HeadNode: ray start --address HEAD_IP:GCS_PORT
Container->>Container: Register with Ray cluster
end
Script->>Script: Wait WORKER_STARTUP_WAIT seconds
Script->>HeadNode: srun RUN_COMMAND on head node
HeadNode->>Container: singularity exec bash -c RUN_COMMAND
Container->>Container: Execute user Python script
Note over Container: Script uses ray.init() or XennaExecutor
Container->>HeadNode: Return results
HeadNode->>Script: Command finished
Script->>Script: Trigger cleanup trap
Script->>Script: Remove temp directories
Script->>SLURM: Job complete
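To make that flow concrete, here is a minimal bash sketch of the head/worker launch pattern described above. It is an illustration only: the variable names, default ports, wait times, and bind paths are assumptions, not the actual contents of ray-singularity-sbatch.sh.

```bash
#!/bin/bash
#SBATCH --job-name=ray-singularity
#SBATCH --nodes=2
#SBATCH --exclusive

# Illustrative defaults; the real script detects resources and ports itself.
CONTAINER_CMD=${CONTAINER_CMD:-singularity}       # or "apptainer"
CONTAINER_IMAGE=${CONTAINER_IMAGE:-nemo-curator.sif}
GCS_PORT=${GCS_PORT:-6379}

# Resolve the allocated nodes and the head node's IP address.
NODES=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
HEAD_NODE=${NODES[0]}
HEAD_IP=$(srun --nodes=1 --ntasks=1 -w "$HEAD_NODE" hostname --ip-address)

# Start the Ray head inside the container on the first node (backgrounded).
srun --nodes=1 --ntasks=1 -w "$HEAD_NODE" \
  "$CONTAINER_CMD" exec --nv --bind "$PWD" "$CONTAINER_IMAGE" \
  ray start --head --port="$GCS_PORT" --block &
sleep "${HEAD_STARTUP_WAIT:-30}"

# Start one Ray worker per remaining node, pointing at the head's GCS port.
for NODE in "${NODES[@]:1}"; do
  srun --nodes=1 --ntasks=1 -w "$NODE" \
    "$CONTAINER_CMD" exec --nv --bind "$PWD" "$CONTAINER_IMAGE" \
    ray start --address="$HEAD_IP:$GCS_PORT" --block &
done
sleep "${WORKER_STARTUP_WAIT:-30}"

# Run the user command on the head node; --overlap lets this step share the
# node with the still-running head step. Temp-dir cleanup happens in a trap.
srun --overlap --nodes=1 --ntasks=1 -w "$HEAD_NODE" \
  "$CONTAINER_CMD" exec --nv --bind "$PWD" "$CONTAINER_IMAGE" \
  bash -c "$RUN_COMMAND"
```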
2 files reviewed, 2 comments
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: federico-dambrosio <[email protected]>
2 files reviewed, no comments
Description
This PR adds a sample SLURM script and accompanying documentation for running NeMo Curator pipelines on a multi-node Ray cluster using Singularity / Apptainer.
Specifically, it:
- Adds ray-singularity-sbatch.sh, a generic SLURM batch script that:
  - launches a multi-node Ray cluster (head node plus workers) inside Singularity/Apptainer containers, with the container runtime selectable via the CONTAINER_CMD knob;
  - sets HF_HUB_OFFLINE=1 so jobs can run on air-gapped compute nodes.
- No existing code paths are modified; this is an example script + documentation intended to make it easier for users to run NeMo Curator on SLURM-based HPC systems.
Similar to #1168 but for Slurm clusters with Singularity and no internet connection on compute nodes.
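As a rough illustration of the CONTAINER_CMD and HF_HUB_OFFLINE knobs mentioned above (a sketch, not the script's exact lines):

```bash
# Select the container runtime; "apptainer" is a drop-in alternative.
CONTAINER_CMD=${CONTAINER_CMD:-singularity}

# Keep Hugging Face Hub access offline so jobs run on air-gapped compute nodes.
export HF_HUB_OFFLINE=1
```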
Usage
On the SLURM side, the corresponding submission looks like:
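A minimal sketch of such a submission, assuming the RUN_COMMAND environment variable and the script name from this PR (the node count and pipeline path are placeholders):

```bash
# sbatch exports the caller's environment by default, so RUN_COMMAND reaches the job;
# --export=ALL,RUN_COMMAND=... is an equivalent, more explicit form.
RUN_COMMAND="python /workspace/my_curator_pipeline.py" \
  sbatch --nodes=4 ray-singularity-sbatch.sh
```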
Checklist