
Commit c6e2b29

xuwchen, shjwudp, and yanring authored
[Dev] fix(megatron-fsdp): Resolve hang caused by non-deterministic reduce-scatter (#2252)
Co-authored-by: Jianbin Chang <[email protected]>
Co-authored-by: Zijie Yan <[email protected]>
1 parent a4fce1d commit c6e2b29

1 file changed: +3 -0 lines changed


megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py

Lines changed: 3 additions & 0 deletions
@@ -2782,6 +2782,9 @@ def reduce_gradients(
                 outer_fsdp_group_grad_reduce (bool, optional): Whether to reduce gradients
                     across outer-DP groups. Defaults to False.
         """
+        # Sort parameters by their bucket IDs to ensure a deterministic processing order.
+        # Performing reduce-scatter operations out of order can lead to hangs.
+        params = sorted(list(params), key=lambda x: self.buffer.param_to_param_group[x])
         for param in params:
             bucket_id = self.buffer.param_to_param_group[param]
             param_group = self.buffer.parameter_groups[bucket_id]
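
The effect of the added sort can be sketched outside Megatron. Collective operations such as reduce-scatter generally must be issued in the same order on every rank of a process group. If the order in which `params` is iterated differs between ranks, the collectives can be mismatched and block indefinitely. The following is a minimal, hypothetical simulation, not the Megatron-FSDP implementation: the parameter names, the bucket IDs, and the `bucket_order` helper are invented for illustration, and no real communication takes place.

```python
# Hypothetical stand-in for self.buffer.param_to_param_group:
# maps each parameter (here just a name) to its bucket ID.
param_to_param_group = {"w_embed": 0, "w_attn": 1, "w_mlp": 2, "w_out": 3}


def bucket_order(params, sort_by_bucket):
    """Return the sequence of bucket IDs in which collectives would be issued."""
    if sort_by_bucket:
        # Same idea as the commit: order parameters by bucket ID before looping.
        params = sorted(params, key=lambda p: param_to_param_group[p])
    return [param_to_param_group[p] for p in params]


# Assumption for illustration: two ranks receive the same parameters,
# but in different iteration orders.
rank0_params = ["w_out", "w_attn", "w_embed", "w_mlp"]
rank1_params = ["w_mlp", "w_embed", "w_out", "w_attn"]

unsorted_orders = [bucket_order(p, False) for p in (rank0_params, rank1_params)]
sorted_orders = [bucket_order(p, True) for p in (rank0_params, rank1_params)]

print("unsorted:", unsorted_orders)  # ranks disagree on the bucket sequence
print("sorted:  ", sorted_orders)    # ranks agree on the bucket sequence
```

Without the sort, the two simulated ranks would launch their per-bucket operations in different sequences. With the sort, both walk the buckets in ascending ID order regardless of how `params` was ordered on input.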
