Conversation

@federico-dambrosio

Description

This PR adds a sample SLURM script and accompanying documentation for running NeMo Curator pipelines on a multi-node Ray cluster using Singularity / Apptainer.

Specifically, it:

  • Adds ray-singularity-sbatch.sh, a generic SLURM batch script that:
    • Starts a Ray head on the first SLURM node and Ray workers on the remaining nodes (see the sketch after this list).
    • Runs a user-provided Python command inside a NeMo Curator container on the head node.
    • Supports both Singularity and Apptainer via a CONTAINER_CMD knob.
    • Is safe for air-gapped clusters by default via HF_HUB_OFFLINE=1.
  • Adds a README documenting:
    • Prerequisites (NeMo Curator container, SLURM, Singularity/Apptainer).
    • How the script works and how to customize SBATCH directives.
    • All relevant environment knobs (ports, HF cache, scratch paths, mounts, etc.).
    • Example usage patterns for NeMo Curator pipelines.
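
Below is a minimal sketch of the launch pattern the script implements. It is illustrative only: variable names, wait times, and the port are assumptions and may differ from the actual ray-singularity-sbatch.sh.

#!/bin/bash
#SBATCH --nodes=2
set -euo pipefail

CONTAINER_CMD="${CONTAINER_CMD:-singularity}"   # or "apptainer"
export SINGULARITYENV_HF_HUB_OFFLINE=1          # keep Hugging Face Hub offline on air-gapped nodes

# Resolve the allocated nodes; the first one hosts the Ray head
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node="${nodes[0]}"
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Ray head on the first node, inside the container
srun --nodes=1 --ntasks=1 -w "$head_node" \
  "$CONTAINER_CMD" exec --nv "$IMAGE" \
  ray start --head --node-ip-address="$head_ip" --port=6379 --block &
sleep 30

# Ray workers on the remaining nodes
for node in "${nodes[@]:1}"; do
  srun --nodes=1 --ntasks=1 -w "$node" \
    "$CONTAINER_CMD" exec --nv "$IMAGE" \
    ray start --address="${head_ip}:6379" --block &
done
sleep 30

# User-provided command on the head node
srun --nodes=1 --ntasks=1 -w "$head_node" \
  "$CONTAINER_CMD" exec --nv "$IMAGE" bash -c "$RUN_COMMAND"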

No existing code paths are modified; this is an example script + documentation intended to make it easier for users to run NeMo Curator on SLURM-based HPC systems.

Similar to #1168, but for SLURM clusters that use Singularity and have no internet connection on the compute nodes.

Usage

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna.executor import XennaExecutor

# Define your pipeline
pipeline = Pipeline(...)
pipeline.add_stage(...)

# Use the XennaExecutor to run on the Ray cluster started by the sbatch script
executor = XennaExecutor()
results = pipeline.run(executor=executor)

On the SLURM side, the corresponding submission looks like:

export IMAGE=/path/to/nemo-curator_25.09.sif

RUN_COMMAND="python curator_pipeline.py" \
sbatch --nodes=2 --gres=gpu:4 ray-singularity-sbatch.sh
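
A submission can also override the other documented knobs at submit time. The example below is a sketch: HEAD_STARTUP_WAIT and WORKER_STARTUP_WAIT are taken from the review discussion further down, so treat the exact knob names as assumptions and confirm them against the README.

# Example submission overriding several knobs without editing the script.
# HEAD_STARTUP_WAIT and WORKER_STARTUP_WAIT are assumed names; check the README.
export IMAGE=/path/to/nemo-curator_25.09.sif
export CONTAINER_CMD=apptainer       # use Apptainer instead of Singularity
export HEAD_STARTUP_WAIT=60          # seconds to wait for the Ray head
export WORKER_STARTUP_WAIT=60        # seconds to wait for the Ray workers

RUN_COMMAND="python curator_pipeline.py" \
sbatch --nodes=4 --gres=gpu:4 --time=04:00:00 ray-singularity-sbatch.sh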

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot

copy-pr-bot bot commented Nov 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps bot commented Nov 24, 2025

Greptile Overview

Greptile Summary

This PR adds deployment infrastructure for running NeMo Curator on SLURM clusters with Singularity/Apptainer in air-gapped environments.

Key additions:

  • ray-singularity-sbatch.sh: Production-ready SLURM batch script that orchestrates multi-node Ray clusters with automatic resource detection and comprehensive error handling
  • Comprehensive README with detailed documentation on configuration, environment variables, and usage patterns
  • Support for both Singularity and Apptainer container runtimes
  • Air-gapped cluster support with HF_HUB_OFFLINE=1 by default
  • Flexible configuration through environment variables without script modification

Implementation highlights:

  • Proper error handling with set -euo pipefail
  • Automatic cleanup of temporary directories via trap handlers
  • Resource auto-detection from SLURM environment variables
  • Environment variable propagation into containers using SINGULARITYENV_*
  • Background process management for Ray head and worker nodes

The script provides a solid foundation for HPC users to deploy NeMo Curator pipelines efficiently.
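
For readers unfamiliar with these patterns, the sketch below shows how they typically combine in a SLURM batch script. It is illustrative, not the literal contents of ray-singularity-sbatch.sh; the HF_HOME path is a placeholder.

#!/bin/bash
set -euo pipefail                          # fail fast on errors, unset variables, and pipe failures

# Per-job scratch directory, cleaned up on exit via a trap handler
RAY_TMP=$(mktemp -d)
cleanup() { rm -rf "$RAY_TMP"; }
trap cleanup EXIT

# Resource auto-detection from SLURM environment variables
CPUS_PER_NODE="${SLURM_CPUS_ON_NODE:-1}"
GPUS_PER_NODE="${SLURM_GPUS_ON_NODE:-0}"

# Environment variable propagation into the container via SINGULARITYENV_*
export SINGULARITYENV_HF_HUB_OFFLINE=1
export SINGULARITYENV_HF_HOME=/path/to/offline/hf_cache   # pre-populated Hugging Face cache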

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk: it adds self-contained deployment scripts without modifying existing code paths.
  • The score of 4 reflects a well-structured implementation with proper error handling, comprehensive documentation, and no changes to existing code. The script follows bash best practices with set -euo pipefail, proper variable quoting in most places, and cleanup handlers. Previous review comments about unquoted variables on lines 171, 175, 216, and 220 have been noted but do not affect the core functionality.
  • No files require special attention - both files are new additions that don't affect existing functionality.

Important Files Changed

File Analysis

Filename | Score | Overview
tutorials/deployment/slurm/ray-singularity-sbatch.sh | 4/5 | Adds a comprehensive SLURM batch script for a multi-node Ray cluster with Singularity/Apptainer. The script is well structured, with proper error handling and environment variable propagation.
tutorials/deployment/slurm/README.md | 5/5 | Comprehensive documentation covering prerequisites, usage patterns, environment variables, and troubleshooting. Clear examples with appropriate warnings for air-gapped clusters.

Sequence Diagram

sequenceDiagram
    participant User
    participant SLURM
    participant Script as ray-singularity-sbatch.sh
    participant HeadNode as Ray Head Node
    participant WorkerNodes as Ray Worker Nodes
    participant Container as Singularity Container

    User->>SLURM: sbatch with RUN_COMMAND env var
    SLURM->>Script: Allocate nodes and start job
    
    Script->>Script: Detect resources (CPU/GPU per node)
    Script->>Script: Create temp directories (ray_tmp, ray_spill, etc.)
    Script->>Script: Set up environment variables
    
    Script->>HeadNode: srun on first node
    HeadNode->>Container: singularity exec with --nv --bind
    Container->>Container: ray start --head --block &
    Note over Container: Exposes GCS, Dashboard, Client ports
    
    Script->>Script: Wait HEAD_STARTUP_WAIT seconds
    
    loop For each worker node
        Script->>WorkerNodes: srun on worker node
        WorkerNodes->>Container: singularity exec with --nv --bind
        Container->>HeadNode: ray start --address HEAD_IP:GCS_PORT
        Container->>Container: Register with Ray cluster
    end
    
    Script->>Script: Wait WORKER_STARTUP_WAIT seconds
    
    Script->>HeadNode: srun RUN_COMMAND on head node
    HeadNode->>Container: singularity exec bash -c RUN_COMMAND
    Container->>Container: Execute user Python script
    Note over Container: Script uses ray.init() or XennaExecutor
    Container->>HeadNode: Return results
    
    HeadNode->>Script: Command finished
    Script->>Script: Trigger cleanup trap
    Script->>Script: Remove temp directories
    Script->>SLURM: Job complete
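
After the workers have registered, one way to verify the cluster before launching the user command is to query Ray from the head node. This is a suggested check rather than something the PR's script necessarily does; it reuses the illustrative variable names from the sketch in the description above.

# Hypothetical verification step, run on the head node inside the container
srun --nodes=1 --ntasks=1 -w "$head_node" \
  "$CONTAINER_CMD" exec --nv "$IMAGE" \
  ray status --address="${head_ip}:6379"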

Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments


@federico-dambrosio changed the title from "Add SLURM script for launching multi-node Ray clusters with Singulari…" to "Add SLURM script for launching multi-node Ray clusters with Singularity" on Nov 24, 2025
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: federico-dambrosio <[email protected]>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments
