LAVEEN (Contributor) commented on Sep 22, 2025

Add ml-slurm-g4 blueprint for GPU-based ML workloads.

This change introduces a new example blueprint, ml-slurm-g4, to deploy a Slurm cluster optimized for machine learning workloads on Google Cloud.

The blueprint configures the following:

  • A Slurm partition (g4_partition) that uses the g4-standard-48 machine type.
  • A Slurm controller and login nodes.
  • The slurm-gcp-6-11-ubuntu-2204-lts-nvidia-570 image family.

This provides users with a reference architecture for setting up a performant, GPU-enabled HPC environment for machine learning applications.
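
For readers unfamiliar with the layout, a blueprint along these lines might look roughly like the sketch below. It is not the merged ml-slurm-g4 file. The module sources follow the layout of other Cluster Toolkit Slurm v6 examples, and everything except the blueprint name, the g4_partition id, the machine type, and the image family is an assumption: the variable values, the other module ids, the node count, the partition name, and the image project.

```yaml
# Illustrative sketch only, not the blueprint added by this PR.
blueprint_name: ml-slurm-g4

vars:
  project_id:  ## Set GCP Project ID Here ##
  deployment_name: ml-slurm-g4
  region:  ## Set a region with G4 capacity ##
  zone:  ## Set a zone with G4 capacity ##
  # One image definition shared by every node type (hypothetical variable name).
  slurm_image:
    family: slurm-gcp-6-11-ubuntu-2204-lts-nvidia-570
    project: schedmd-slurm-public  # assumption: project that hosts the image family

deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  - id: g4_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      machine_type: g4-standard-48
      node_count_dynamic_max: 4  # assumption: size this to your quota
      instance_image: $(vars.slurm_image)

  - id: g4_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [g4_nodeset]
    settings:
      partition_name: g4  # assumption

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [network]
    settings:
      instance_image: $(vars.slurm_image)

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use: [network, g4_partition, slurm_login]
    settings:
      instance_image: $(vars.slurm_image)
```

Referencing a single image variable from the nodeset, login, and controller modules is one way to keep the software environment identical across the cluster.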

LAVEEN requested review from a team and samskillman as code owners on September 22, 2025 16:21
LAVEEN added the release-improvements label (Added to release notes under the "Improvements" heading.) on Sep 22, 2025
nadig-google (Contributor) previously approved these changes on Sep 29, 2025

LGTM

cboneti (Member) previously requested changes on Nov 6, 2025

Hi @LAVEEN, I noticed that some pre-commit checks are failing. You also need to add documentation for this blueprint to the examples/README.md file.

cboneti (Member) commented on Nov 6, 2025

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a new ml-slurm-g4 blueprint, a valuable addition for users who want to deploy GPU-based machine learning workloads on Slurm. The configuration is mostly well structured, and my review includes a few suggestions to improve it. The most important is to make the Slurm controller and login nodes use the same custom OS image as the compute nodes, so that the environment is consistent across the cluster. I've also pointed out a couple of opportunities to clean up the configuration by removing an extraneous key and some verbose network settings. Overall, these are good changes that will provide a useful example to users.

LAVEEN force-pushed the g4new branch 7 times, most recently from 2c44dca to d9f8e90, on November 11, 2025 20:29
LAVEEN requested a review from nadig-google on November 11, 2025 20:30
LAVEEN enabled auto-merge on November 11, 2025 20:38
nadig-google (Contributor) left a comment

SGTM

LAVEEN dismissed cboneti’s stale review on November 12, 2025 04:30

Resolved all comments

LAVEEN merged commit 6e4cf2b into GoogleCloudPlatform:develop on Nov 12, 2025
10 of 64 checks passed
