Initial Blueprint G4 #4685
nadig-google
left a comment
LGTM
cboneti
left a comment
Hi @LAVEEN, I noticed that some of your pre-commit checks are failing. You also need to add documentation for this blueprint to the examples/README.md file.
/gemini review
Code Review
This pull request introduces a new ml-slurm-g4 blueprint, which is a valuable addition for users looking to deploy GPU-based machine learning workloads on Slurm. The configuration is mostly well-structured. My review includes a few suggestions to improve it. The most important ones are to ensure that the Slurm controller and login nodes use the same custom OS image as the compute nodes for a consistent environment. I've also pointed out a couple of opportunities to clean up the configuration by removing an extraneous key and some verbose network settings. Overall, these are good changes that will provide a useful example to users.
2c44dca to d9f8e90
nadig-google
left a comment
SGTM
Add ml-slurm-g4 blueprint for GPU-based ML workloads.

This change introduces a new example blueprint, ml-slurm-g4, to deploy a Slurm cluster optimized for machine learning workloads on Google Cloud.

The blueprint provisions the following resources:
- [...] (g4_partition) with g4-standard-48 machine types.
- [...] slurm-gcp-6-11-ubuntu-2204-lts-nvidia-570 image family.

This provides users with a reference architecture for setting up a performant, GPU-enabled HPC environment for machine learning applications.
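
For readers unfamiliar with the blueprint layout, the following is a minimal sketch of how the pieces named above might fit together. It is not the blueprint from this PR. Only ml-slurm-g4, g4_partition, g4-standard-48, and the image family name come from the description; the module ids other than g4_partition, the partition name, the node count, the region and zone, and the image project are assumptions made for illustration. It also shows one way to apply the review suggestion that the controller and login nodes use the same image as the compute nodes, by defining the image once in vars.

```yaml
# Illustrative sketch only; values marked "assumption" are not from the PR.
blueprint_name: ml-slurm-g4

vars:
  project_id: my-project          # assumption: replace with your project ID
  deployment_name: ml-slurm-g4    # assumption
  region: us-central1             # assumption: pick a region that offers G4
  zone: us-central1-b             # assumption
  # Defined once so compute, login, and controller nodes share one image.
  instance_image:
    family: slurm-gcp-6-11-ubuntu-2204-lts-nvidia-570
    project: $(vars.project_id)   # assumption: image lives in your own project
  instance_image_custom: true     # needed when not using a stock Slurm-GCP image

deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  - id: g4_nodeset                # assumption: hypothetical module id
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      machine_type: g4-standard-48
      node_count_dynamic_max: 4   # assumption
      instance_image: $(vars.instance_image)
      instance_image_custom: $(vars.instance_image_custom)

  - id: g4_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [g4_nodeset]
    settings:
      partition_name: g4          # assumption
      is_default: true

  - id: slurm_login               # assumption: hypothetical module id
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [network]
    settings:
      instance_image: $(vars.instance_image)
      instance_image_custom: $(vars.instance_image_custom)

  - id: slurm_controller          # assumption: hypothetical module id
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use: [network, g4_partition, slurm_login]
    settings:
      instance_image: $(vars.instance_image)
      instance_image_custom: $(vars.instance_image_custom)
```

Keeping the image in a single vars entry means a later change to the image family only has to be made in one place, which avoids the controller, login, and compute environments drifting apart.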