Skip to content

Refactor script to use 'overwrites' variable for command-line arguments in training scripts #1473

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 31, 2025

Conversation

idoh
Copy link
Contributor

@idoh idoh commented Jul 28, 2025

The goal of this PR is to add support for command line arguments to the bash training scripts. The run_train.sh had support for overrides, however, the multinode_trainer.slurm script did not. This overrides flag add supports for commands like: sbatch multinode_trainer.slurm --job.description="TEST_RUN"

However, there is a problem with the current overrides implementation, when passing arguments with space such as "TEST RUN" instead of "TEST_RUN" then the variable job.description would only get TEST as input and the training script throws an error for unrecognizing the argument RUN which is passed in a different line. To address this I simplify the code and directly pass the additional overrides through $@. This solves the issue for commands such as: sbatch multinode_trainer.slurm --job.description="TEST RUN"

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 28, 2025
Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks nice, had one comment

@@ -34,7 +34,7 @@ export LOGLEVEL=INFO
export FI_PROVIDER="efa"
# Ensure that P2P is available
# export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
# export NCCL_IB_DISABLE=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why disabling this? cc @lessw2020 if you have more context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience, I always had to comment this out as new severs typically support IB, which is probably similar experience to others who use torchtitan for pre-training.

Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems quite reasonable.

@tianyu-l tianyu-l merged commit b1dc330 into pytorch:main Jul 31, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants