@bytetwin bytetwin commented Nov 4, 2025

Builds a custom image for a3high Slurm that installs the TCPx kernel and compatible NVIDIA and CUDA drivers, then layers the latest slurm-gcp version on top of it.

The change also combines the earlier three blueprints into a single blueprint that spins up the base, builds the image, and then deploys the cluster.

Added integration tests based on the new blueprint. The integration tests have been run.

@bytetwin bytetwin force-pushed the a3h-bp branch 3 times, most recently from 7a30b2c to c4b7cb7 November 5, 2025 13:00
@bytetwin bytetwin added the release-key-new-features and release-breaking-changes labels Nov 5, 2025
@bytetwin bytetwin force-pushed the a3h-bp branch 4 times, most recently from 0d665d0 to a216e5a November 6, 2025 15:32
@bytetwin bytetwin marked this pull request as ready for review November 6, 2025 15:54
@bytetwin bytetwin requested review from a team and samskillman as code owners November 6, 2025 15:54

bytetwin commented Nov 6, 2025

/gcbrun

bytetwin commented Nov 7, 2025

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the a3-highgpu-8g Slurm cluster deployment by consolidating three separate blueprints into a single, more streamlined blueprint. This new blueprint also automates the process of building a custom image with a TCPx-patched kernel for enhanced network performance. The changes significantly improve the user experience by simplifying the deployment process. The review identifies a critical issue where configuration variables are swapped, a high-severity issue in the documentation that could cause user commands to fail, and a couple of medium-severity issues related to misleading comments in the new blueprint file. Overall, this is a valuable improvement, and addressing the identified issues will make it even better.

@cboneti cboneti left a comment

I left a few comments.

@samskillman samskillman left a comment

Looks good, some sections can be removed / altered.

@bytetwin bytetwin merged commit 061f7c5 into GoogleCloudPlatform:develop Nov 13, 2025
10 of 64 checks passed

Labels

release-breaking-changes: Prevents "smooth" re-deploy across versions
release-key-new-features: Added to release notes under the "Key New Features" heading.

3 participants