
Conversation

@sahilgoogle (Collaborator) commented Nov 26, 2025

Adds the optimized Helm chart for Llama 3.1 405B on A4X (GB200) machines.

  value: VERSION
- name: NCCL_ALGO
  value: "Ring,Tree"
- name: NCCL_NET_GDR_LEVEL
@suffiank commented Nov 27, 2025

Where did these NCCL variables get inherited from? I think these are supposed to be set by a script that the host networking team ships.

@sahilgoogle (Collaborator, Author) replied:

We have these set in our internal scripts, and training fails with communication errors without them. It looks like they're not set by the host networking script?
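For reference, a sketch of the NCCL env block under discussion as it might sit in the chart's container spec; the diff above cuts off the value of NCCL_NET_GDR_LEVEL, so the "PIX" shown here is an illustrative assumption only:

```yaml
env:
  - name: NCCL_ALGO
    value: "Ring,Tree"        # from the diff: restrict NCCL to the Ring and Tree algorithms
  - name: NCCL_NET_GDR_LEVEL  # distance threshold for using GPUDirect RDMA with the NIC
    value: "PIX"              # assumed for illustration; the actual value is truncated above
```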

targetNodepools: null
tasSettings:
  topologyRequest:
    kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname


Is this right? Does it need to require SBRG and then prefer a certain number of VMs within a rack?
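If rack-level placement is the intent, a minimal sketch of the alternative using Kueue's Topology Aware Scheduling annotations; the topology label below is an assumed name for the SBRG (sub-block/rack) domain, not something this PR confirms:

```yaml
tasSettings:
  topologyRequest:
    # Require all pods in the podset to land within a single SBRG domain,
    # rather than merely preferring hostname-level locality.
    # The label is an assumed example; use whatever topology label the
    # A4X node pools actually expose for a rack/sub-block.
    kueue.x-k8s.io/podset-required-topology: cloud.google.com/gce-topology-subblock
```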

@@ -0,0 +1,207 @@
"""Nemo2 pretraining recipe for Llama 3.1 405B model."""


Nothing in this script hard-codes FP8-CS or GBS 2048, I think? So I'm curious whether that needs to be in the filename.

A collaborator replied:

It should follow the same format as, e.g.:

llama3-1-8b/nemo-pretraining-gke/16node-FP8CS-GBS128/recipe
