a3high single blueprint to use the tcpx patched kernel #4821
/gcbrun
/gemini review
Code Review
This pull request refactors the a3-highgpu-8g Slurm cluster deployment by consolidating three separate blueprints into a single, more streamlined blueprint. This new blueprint also automates the process of building a custom image with a TCPx-patched kernel for enhanced network performance. The changes significantly improve the user experience by simplifying the deployment process. The review identifies a critical issue where configuration variables are swapped, a high-severity issue in the documentation that could cause user commands to fail, and a couple of medium-severity issues related to misleading comments in the new blueprint file. Overall, this is a valuable improvement, and addressing the identified issues will make it even better.
examples/machine-learning/a3-highgpu-8g/a3high-slurm-blueprint.yaml
cboneti
left a comment
I left a few comments.
examples/machine-learning/a3-highgpu-8g/a3high-slurm-blueprint.yaml
samskillman
left a comment
Looks good, some sections can be removed / altered.
This change builds a custom image for a3high Slurm that installs the TCPx-patched kernel along with compatible NVIDIA and CUDA drivers, then layers the latest slurm-gcp version on top of it.
It also combines the three earlier blueprints into a single blueprint that spins up the base, builds the image, and then deploys the cluster.
Added integration tests based on the new blueprint. The integration test has been run against this change.
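For readers unfamiliar with the layout, the base / image / cluster flow described above could be sketched roughly as follows. This is an illustrative outline only, not the blueprint from this PR: the blueprint name, module ids, image family, region, zone, and node count are hypothetical placeholders, and the module paths and setting names reflect my understanding of Cluster Toolkit conventions, so they should be checked against the actual a3high-slurm-blueprint.yaml.

```yaml
# Illustrative sketch only; NOT the blueprint added in this PR.
# All names and values below are placeholders or assumptions.
blueprint_name: a3high-slurm-sketch    # hypothetical

vars:
  project_id: my-project               # placeholder
  deployment_name: a3high-sketch       # placeholder
  region: us-central1                  # placeholder
  zone: us-central1-a                  # placeholder
  final_image_family: slurm-a3high-sketch  # hypothetical

deployment_groups:
# Group 1: base infrastructure (previously its own blueprint)
- group: base
  modules:
  - id: sysnet
    source: modules/network/vpc

# Group 2: build the custom image (previously its own blueprint).
# The kernel and driver installation steps are omitted here.
- group: image
  modules:
  - id: image
    source: modules/packer/custom-image
    kind: packer
    use: [sysnet]
    settings:
      image_family: $(vars.final_image_family)

# Group 3: deploy the Slurm cluster on the image built above
- group: cluster
  modules:
  - id: a3_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [sysnet]
    settings:
      machine_type: a3-highgpu-8g
      node_count_static: 2             # placeholder
      instance_image_custom: true
      instance_image:
        family: $(vars.final_image_family)
        project: $(vars.project_id)

  - id: a3_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [a3_nodeset]
    settings:
      partition_name: a3               # placeholder

  - id: controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use: [sysnet, a3_partition]
```

Keeping the three stages as deployment groups in one file means a single deployment walks through them in order, rather than requiring three separate blueprints to be deployed by hand.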