rahimftd
Contributor

Problem

The SkypilotExecutor cannot launch SkyPilot managed jobs, which support features such as automatic retries and recovery from spot preemptions.

Managed jobs use a different SDK than regular jobs, so the SkypilotExecutor cannot be used to launch both types of jobs.
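
For readers unfamiliar with what "automatic retries and recovery from spot preemptions" means in practice, here is a minimal conceptual sketch in plain Python. It is not SkyPilot's implementation and calls no SkyPilot API. The names `Preempted`, `run_with_recovery`, and `make_flaky_job` are hypothetical, introduced only to model a controller that relaunches a job after a preemption.

```python
class Preempted(Exception):
    """Hypothetical signal that a spot instance was reclaimed mid-run."""


def run_with_recovery(job, max_restarts=3):
    """Run `job`, relaunching it after each preemption.

    Returns a (result, restarts) tuple. Re-raises Preempted once the
    job has already been restarted `max_restarts` times.
    """
    restarts = 0
    while True:
        try:
            return job(), restarts
        except Preempted:
            if restarts >= max_restarts:
                raise
            restarts += 1


def make_flaky_job(preemptions):
    """Build a job that is preempted `preemptions` times, then succeeds."""
    remaining = {"count": preemptions}

    def job():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise Preempted()
        return "done"

    return job


if __name__ == "__main__":
    result, restarts = run_with_recovery(make_flaky_job(2))
    print(result, restarts)
```

A managed job moves this kind of loop out of the user's script and into a controller, which is presumably why it is exposed through a separate SDK entry point.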

Solution

This PR adds a SkypilotJobsExecutor and a SkypilotJobsScheduler, which use the jobs SDK to launch managed jobs. The executor works with both local and remote SkyPilot API servers.

Example usage

import nemo_run as run

# `recipe` and `experiment_name` are assumed to be defined earlier in the script.
executor = run.SkypilotJobsExecutor(
    gpus="H100",
    launcher="torchrun",
    gpus_per_node=8,
    env_vars={},
    num_nodes=4,
    container_image="nvcr.io/nvidia/nemo:dev",
    infra="kubernetes",
    idle_minutes_to_autostop=10,
    autodown=True,
    packager=run.GitArchivePackager(subpath="nemo_training"),
)
run.run(recipe, executor=executor, name=experiment_name, log_level="DEBUG")

Testing Strategy

  • Tested against both a remote and a local SkyPilot API server
  • Unit tests

@rahimftd
Contributor Author

@romilbhardwaj

Signed-off-by: Rahim Dharssi <[email protected]>
@hemildesai
Contributor

hemildesai commented Sep 29, 2025

Thanks for the amazing contribution. It looks like the only failing check is the code coverage check (78.07% against a minimum of 80%). You can take a look at the missed lines here - https://app.codecov.io/gh/NVIDIA-NeMo/Run/pull/343?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=checks&utm_campaign=pr+comments&utm_term=NVIDIA-NeMo

(You can ignore the other failures)

Signed-off-by: Rahim Dharssi <[email protected]>
@rahimftd
Contributor Author

@hemildesai Added some unit tests. Thanks!

Contributor

@hemildesai hemildesai left a comment

LGTM 🚢

@hemildesai hemildesai merged commit 6bc4319 into NVIDIA-NeMo:main Sep 30, 2025
19 of 22 checks passed