Missing timeout parameter in EXPERT_MODEL_PARALLEL_GROUP #2238

@potatowarriors

Description

The EXPERT_MODEL_PARALLEL_GROUP process group is created without a timeout parameter in megatron/core/parallel_state.py. As a result, it falls back to PyTorch's default timeout (10 minutes for NCCL) instead of honoring the user-specified --distributed-timeout-minutes value, while the other process groups in the same file do forward the timeout.

Location

File: megatron/core/parallel_state.py
Lines: 1077-1083

Current Code (Buggy)

for ranks in expert_decoder_rank_generator.get_ranks('ep'):
    group = create_group(
        ranks,
        pg_options=get_nccl_options("ep", nccl_comm_cfgs),
        group_desc="EXPERT_MODEL_PARALLEL_GROUP",
    )  # ❌ Missing: timeout=timeout

Expected Code (Fixed)

for ranks in expert_decoder_rank_generator.get_ranks('ep'):
    group = create_group(
        ranks,
        timeout=timeout,  # ✅ Add this line
        pg_options=get_nccl_options("ep", nccl_comm_cfgs),
        group_desc="EXPERT_MODEL_PARALLEL_GROUP",
    )
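For context, here is a minimal sketch of how the CLI flag is expected to reach the group constructor: --distributed-timeout-minutes is converted to a `datetime.timedelta`, which is then passed as the `timeout` argument when each process group is created. The helper name below is hypothetical, not verbatim from the repo:

```python
from datetime import timedelta

# Hypothetical mirror of how --distributed-timeout-minutes becomes the
# `timeout` argument forwarded to create_group (and ultimately to
# torch.distributed.new_group).
def make_pg_timeout(distributed_timeout_minutes: int) -> timedelta:
    # NCCL's default collective timeout is 10 minutes. If `timeout` is not
    # forwarded, as in the buggy code above, the expert-parallel group
    # silently falls back to that default regardless of the CLI flag.
    return timedelta(minutes=distributed_timeout_minutes)

print(make_pg_timeout(30))  # 0:30:00
```

With the fix applied, a run launched with --distributed-timeout-minutes 30 would give the expert-parallel group the same 30-minute timeout as every other group, instead of the 10-minute NCCL default.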
