Description
The EXPERT_MODEL_PARALLEL_GROUP process group is created without a timeout parameter in megatron/core/parallel_state.py, causing it to use PyTorch's default timeout (10 minutes for NCCL) instead of the user-specified --distributed-timeout-minutes value.
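For context, a minimal sketch of where that value is supposed to come from: initialize_model_parallel converts the flag into a timedelta that the other create_group calls receive. The variable names below mirror parallel_state.py, but treat the snippet as illustrative rather than a verbatim excerpt:

```python
from datetime import timedelta

# Sketch: --distributed-timeout-minutes is parsed from the CLI and
# turned into a timedelta inside initialize_model_parallel(...).
distributed_timeout_minutes = 45  # example user-specified value
timeout = timedelta(minutes=distributed_timeout_minutes)

# Every process-group creation is expected to forward this value;
# the EXPERT_MODEL_PARALLEL_GROUP call shown below omits it.
```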
Location
File: megatron/core/parallel_state.py
Lines: 1077-1083
Current Code (Buggy)

```python
for ranks in expert_decoder_rank_generator.get_ranks('ep'):
    group = create_group(
        ranks,
        pg_options=get_nccl_options("ep", nccl_comm_cfgs),
        group_desc="EXPERT_MODEL_PARALLEL_GROUP",
    )  # ❌ Missing: timeout=timeout
```

Expected Code (Fixed)
```python
for ranks in expert_decoder_rank_generator.get_ranks('ep'):
    group = create_group(
        ranks,
        timeout=timeout,  # ✅ Add this line
        pg_options=get_nccl_options("ep", nccl_comm_cfgs),
        group_desc="EXPERT_MODEL_PARALLEL_GROUP",
    )
```
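The same fallback can be reproduced outside Megatron-LM with plain torch.distributed. A minimal single-process sketch (the gloo backend, addresses, and 45-minute value are illustrative, and the exact default a group falls back to, roughly 10 minutes for NCCL and 30 for gloo, can vary by PyTorch version):

```python
import os
from datetime import timedelta

import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Initialize the default group with a generous user-specified timeout.
dist.init_process_group("gloo", rank=0, world_size=1,
                        timeout=timedelta(minutes=45))

# A group created WITHOUT timeout= does not inherit the 45 minutes
# above; it silently falls back to the backend's built-in default.
group_default = dist.new_group(ranks=[0])

# Passing timeout= explicitly, as the fix does, keeps all groups
# consistent with the user-specified value.
group_fixed = dist.new_group(ranks=[0], timeout=timedelta(minutes=45))

dist.destroy_process_group()
```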