Add llama3-1-405b 16node recipe on A4x #53
  value: VERSION
- name: NCCL_ALGO
  value: "Ring,Tree"
- name: NCCL_NET_GDR_LEVEL
Where did these (NCCL variables) get inherited from? I think these are supposed to be set by a script that the host networking team ships.
We have these set in our internal scripts, and the training fails without them due to communication errors. It looks like they are not set by the host networking script?
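One way to settle this is to check, inside a running pod before training starts, which of these variables are already present in the environment. A minimal sketch, assuming only the two variable names visible in the diff (`NCCL_ALGO`, `NCCL_NET_GDR_LEVEL`); the helper name `report_nccl_env` is hypothetical and not part of the recipe or of any host networking script:

```python
import os

# Variable names taken from the diff under review; extend the tuple as needed.
NCCL_VARS = ("NCCL_ALGO", "NCCL_NET_GDR_LEVEL")


def report_nccl_env(names=NCCL_VARS, environ=None):
    """Return {name: value or None} for each variable (hypothetical helper)."""
    # Allow a custom mapping so the check can be exercised without touching
    # the real process environment.
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in names}


if __name__ == "__main__":
    for name, value in report_nccl_env().items():
        status = f"set to {value!r}" if value is not None else "NOT set"
        print(f"{name}: {status}")
```

If the variables show up as set in a pod launched without the explicit `env` entries, the host networking script is providing them and the chart entries are redundant; if not, the explicit entries are doing the work.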
targetNodepools: null
tasSettings:
  topologyRequest:
    kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname
Is this right? Does it need to require SBRG and then prefer a certain number of VMs within a rack?
@@ -0,0 +1,207 @@
"""Nemo2 pretraining recipe for Llama 3.1 405B model."""
Nothing in this script hard-codes FP8-CS or GBS 2048, I think? So I'm curious whether those need to be in the filename.
It should follow the same format like
Adds the optimal Helm chart for llama405b on A4X (GB200) machines.