
Version based GPU configuration and QoS addition #1092


Merged

Conversation

kvenkateshan-meta (Contributor)

Summary:
Slurm 24.11.0rc1 and later no longer support GRES requests per task, so we need to pass `--gpus-per-node` to sbatch to ensure failure-free allocation.

https://github.com/SchedMD/slurm/blob/master/CHANGELOG/slurm-24.11.md

Changes here

  1. Introduced Slurm-version-based GPU request configuration

  2. Introduced an optional QoS parameter that can be used to control job priority (both changes are sketched below)

Differential Revision: D78778304
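
For context, here is a minimal sketch in Python (the language of torchx's Slurm scheduler) of how a version gate plus an optional QoS directive could look when generating an sbatch script. Everything here is illustrative, not the PR's actual implementation: the helper names, the use of `sinfo --version` for version detection, the pre-24.11 `--gpus-per-task` fallback, and the `qos` parameter name are assumptions.

```python
# Minimal sketch (not this PR's implementation) of version-gated GPU flags
# plus an optional QoS directive for a generated sbatch script.
import subprocess
from typing import List, Optional, Tuple


def get_slurm_version() -> Tuple[int, int]:
    """Return (major, minor) parsed from `sinfo --version`, e.g. (24, 11)."""
    out = subprocess.run(
        ["sinfo", "--version"], capture_output=True, text=True, check=True
    ).stdout
    # Output looks like "slurm 24.11.0" (or "slurm 24.11.0rc1").
    major, minor = out.strip().split()[-1].split(".")[:2]
    return int(major), int(minor)


def gpu_directives(gpus_per_task: int, tasks_per_node: int) -> List[str]:
    """Slurm >= 24.11 drops per-task GRES requests, so request GPUs per node."""
    if get_slurm_version() >= (24, 11):
        return [f"#SBATCH --gpus-per-node={gpus_per_task * tasks_per_node}"]
    # Assumed pre-24.11 behavior: keep the per-task request.
    return [f"#SBATCH --gpus-per-task={gpus_per_task}"]


def qos_directive(qos: Optional[str]) -> List[str]:
    """Emit `--qos` only when the caller opts in, so default behavior is unchanged."""
    return [f"#SBATCH --qos={qos}"] if qos else []
```

Keeping the QoS directive optional means jobs that never set it are submitted exactly as before, while clusters that use QoS for priority can opt in per job.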

facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jul 23, 2025
facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D78778304

kvenkateshan-meta added a commit to kvenkateshan-meta/torchx that referenced this pull request Jul 23, 2025
kvenkateshan-meta force-pushed the export-D78778304 branch 2 times, most recently from f67d4b6 to d06b519 on July 23, 2025 at 20:25
kvenkateshan-meta added a commit to kvenkateshan-meta/torchx that referenced this pull request Jul 23, 2025

kvenkateshan-meta added a commit to kvenkateshan-meta/torchx that referenced this pull request Jul 23, 2025
kvenkateshan-meta added a commit to kvenkateshan-meta/torchx that referenced this pull request Jul 28, 2025

kvenkateshan-meta added a commit to kvenkateshan-meta/torchx that referenced this pull request Jul 28, 2025
kvenkateshan-meta added a commit to kvenkateshan-meta/torchx that referenced this pull request Jul 29, 2025
Summary:
Pull Request resolved: pytorch#1092

Slurm 24.11.0rc1 and later no longer support GRES requests per task, so we need to pass `--gpus-per-node` to sbatch to ensure failure-free allocation.

https://github.com/SchedMD/slurm/blob/master/CHANGELOG/slurm-24.11.md

# Changes here

1. Introduced Slurm-version-based GPU request configuration

2. Introduced an optional QoS parameter that can be used to control job priority; an illustrative sbatch header follows below

Reviewed By: kiukchung

Differential Revision: D78778304
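
For illustration only, composing the hypothetical helpers sketched earlier in this thread would produce an sbatch header like the following on a Slurm 24.11+ cluster. The GPU counts and the `high` QoS name are made up for the example.

```python
# Reuses the hypothetical gpu_directives()/qos_directive() helpers sketched above.
header = ["#!/bin/bash"]
header += gpu_directives(gpus_per_task=2, tasks_per_node=4)  # 8 GPUs per node on >= 24.11
header += qos_directive("high")                              # omitted entirely if qos is None
print("\n".join(header))
# Expected output on Slurm >= 24.11:
#   #!/bin/bash
#   #SBATCH --gpus-per-node=8
#   #SBATCH --qos=high
```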

facebook-github-bot merged commit ae55901 into pytorch:main on Jul 29, 2025
23 of 24 checks passed
Labels: CLA Signed, fb-exported
3 participants