Commit ea021c9

Reapply "[Dev] fix(megatron-fsdp): Resolve hang caused by non-deterministic reduce-scatter (NVIDIA#2252)"
This reverts commit 7b8e39e.
1 parent 7b8e39e commit ea021c9

1 file changed: +3 -0 lines changed

megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py

Lines changed: 3 additions & 0 deletions
@@ -2782,6 +2782,9 @@ def reduce_gradients(
             outer_fsdp_group_grad_reduce (bool, optional): Whether to reduce gradients
                 across outer-DP groups. Defaults to False.
         """
+        # Sort parameters by their bucket IDs to ensure a deterministic processing order.
+        # Performing reduce-scatter operations out of order can lead to hangs.
+        params = sorted(list(params), key=lambda x: self.buffer.param_to_param_group[x])
         for param in params:
             bucket_id = self.buffer.param_to_param_group[param]
             param_group = self.buffer.parameter_groups[bucket_id]
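
The effect of the added sort can be illustrated without a distributed setup. The following is a minimal sketch, not the Megatron-FSDP implementation: `Param`, `param_to_bucket`, and `collective_order` are hypothetical stand-ins, and the model assumes one reduce-scatter per bucket. It simulates two ranks that receive the same parameters in different iteration orders and shows that sorting by bucket ID makes both ranks issue their collectives in the same sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    """Hypothetical stand-in for a model parameter (not a Megatron-FSDP type)."""
    name: str


def collective_order(params, param_to_bucket, sort_by_bucket):
    """Return the sequence of bucket IDs a rank would reduce-scatter, in call order."""
    if sort_by_bucket:
        # Same idea as the patch: order parameters by bucket ID before iterating.
        params = sorted(params, key=lambda p: param_to_bucket[p])
    order = []
    for p in params:
        bucket_id = param_to_bucket[p]
        if bucket_id not in order:  # toy assumption: one collective per bucket
            order.append(bucket_id)
    return order


a, b, c = Param("a"), Param("b"), Param("c")
param_to_bucket = {a: 0, b: 1, c: 2}

# Two simulated ranks see the same parameters in different iteration orders,
# as can happen when the caller passes an unordered collection.
rank0_params = [a, b, c]
rank1_params = [c, a, b]

unsorted_orders = (
    collective_order(rank0_params, param_to_bucket, sort_by_bucket=False),
    collective_order(rank1_params, param_to_bucket, sort_by_bucket=False),
)
sorted_orders = (
    collective_order(rank0_params, param_to_bucket, sort_by_bucket=True),
    collective_order(rank1_params, param_to_bucket, sort_by_bucket=True),
)

print("unsorted:", unsorted_orders)
print("sorted:  ", sorted_orders)
```

In the unsorted case the two ranks would enter collectives for different buckets at the same step, which is the kind of mismatch that can leave real collective calls waiting on each other. With the sort, both ranks agree on the sequence regardless of input order.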
