[QUESTION] How Do NCCL_ALGO and Flash Attention Affect Deterministic Training in Megatron? #1102

jinzhuer · 2024-07-11T10:01:09Z

Issue Description:

I read the reproducibility notes, which say that deterministic training requires --deterministic-mode together with setting NCCL_ALGO, setting NVTE_ALLOW_NONDETERMINISTIC_ALGO=0, and not using --use-flash-attn.

I tested Megatron on two nodes with eight A800 GPUs each (TP=2, PP=2), training for 50 iterations. I ran this configuration several times and checked whether the saved models were identical across runs by comparing the parameters one by one. Setting NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 alone was enough to produce identical model parameters across runs, so in my tests it appears to be the only setting that matters for reproducibility. Without this environment variable, the saved model parameters differed from run to run.
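The parameter-by-parameter check described above can be sketched roughly as follows. This is only an illustration, not the script used in the tests: the helper names `param_digests` and `differing_params` are hypothetical, and it assumes each run's parameters have already been loaded into a mapping from parameter name to raw bytes (how those bytes are extracted from a checkpoint is left out).

```python
import hashlib


def param_digests(params):
    """Map each parameter name to a SHA-256 digest of its raw bytes."""
    return {name: hashlib.sha256(raw).hexdigest() for name, raw in params.items()}


def differing_params(run_a, run_b):
    """Return the sorted names whose bytes differ or that exist in only one run."""
    digests_a = param_digests(run_a)
    digests_b = param_digests(run_b)
    all_names = digests_a.keys() | digests_b.keys()
    return sorted(n for n in all_names if digests_a.get(n) != digests_b.get(n))


# Toy data standing in for two saved models (hypothetical names and values).
run_a = {"layer.weight": b"\x00\x01", "layer.bias": b"\x02"}
run_b = {"layer.weight": b"\x00\x01", "layer.bias": b"\x03"}
print(differing_params(run_a, run_b))  # ['layer.bias']
```

An empty list means the two runs are bitwise identical for every parameter, which is the criterion used here for "deterministic".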

Questions:

1. Under what conditions do NCCL_ALGO and --use-flash-attn cause non-deterministic training results?
2. In my environment, NCCL_ALGO defaults to None. In this case, how does NCCL choose the algorithm, and how can I know which algorithm is being selected?

Environment Details:

Hardware: Eight A800 GPUs per node
Setup: TP=2, PP=2
Training iterations: 50
Deterministic setting used: NVTE_ALLOW_NONDETERMINISTIC_ALGO=0

Thank you for your assistance.

yaox12 (Collaborator) · 2024-07-16T06:29:18Z

Flash attention has provided a deterministic flag since v2.4. For FA versions >= 2.4, NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 sets this flag automatically. For FA versions < 2.4, you need to disable flash attention.
NCCL_ALGO=NVLS is only supported on platforms with NVLink switches. You can set NCCL_DEBUG=INFO to check which algorithm is selected.
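A minimal sketch of how these two points might be applied in a launch script. All helper names here are hypothetical; the version rule simply encodes the ">= 2.4" cutoff stated above, and the log filter assumes only that NCCL debug lines contain the text "NCCL INFO". It does not assume any particular format for the line that reports the algorithm.

```python
import os


def fa_supports_deterministic(version):
    """True if a flash-attn version string is at least 2.4 (the cutoff given above)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (2, 4)


def deterministic_env(base=None):
    """Copy an environment and add the two variables mentioned in this thread."""
    env = dict(os.environ if base is None else base)
    env["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0"  # request deterministic kernels
    env["NCCL_DEBUG"] = "INFO"                     # make NCCL log what it selects
    return env


def nccl_info_lines(log_text):
    """Keep only the log lines that NCCL emitted at INFO level."""
    return [line for line in log_text.splitlines() if "NCCL INFO" in line]


print(fa_supports_deterministic("2.3.6"))  # False
print(fa_supports_deterministic("2.4.0"))  # True
```

The environment returned by `deterministic_env()` could be passed to whatever launcher starts the training processes; the captured output can then be narrowed down with `nccl_info_lines` before looking for the selected algorithm.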

yangbofun · 2024-11-05T12:49:32Z

Is the NCCL algorithm deterministic?
