Description
The EXPERT_MODEL_PARALLEL_GROUP process group is created without a timeout parameter in megatron/core/parallel_state.py, causing it to use PyTorch's default timeout (10 minutes for NCCL) instead of the user-specified --distributed-timeout-minutes value.
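For context, a minimal sketch of where that value is supposed to come from: initialize_model_parallel converts the flag into a timedelta that the other create_group calls receive. The variable names below mirror parallel_state.py, but treat the snippet as illustrative rather than a verbatim excerpt:

```python
from datetime import timedelta

# Sketch: --distributed-timeout-minutes is parsed from the CLI and
# turned into a timedelta inside initialize_model_parallel(...).
distributed_timeout_minutes = 45  # example user-specified value
timeout = timedelta(minutes=distributed_timeout_minutes)

# Every process-group creation is expected to forward this value;
# the EXPERT_MODEL_PARALLEL_GROUP call shown below omits it.
```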
Location
File: megatron/core/parallel_state.py
Lines: 1077-1083
Current Code (Buggy)

```python
for ranks in expert_decoder_rank_generator.get_ranks('ep'):
    group = create_group(
        ranks,
        pg_options=get_nccl_options("ep", nccl_comm_cfgs),
        group_desc="EXPERT_MODEL_PARALLEL_GROUP",
    )  # ❌ Missing: timeout=timeout
```

Expected Code (Fixed)
```python
for ranks in expert_decoder_rank_generator.get_ranks('ep'):
    group = create_group(
        ranks,
        timeout=timeout,  # ✅ Add this line
        pg_options=get_nccl_options("ep", nccl_comm_cfgs),
        group_desc="EXPERT_MODEL_PARALLEL_GROUP",
    )
```
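The same fallback can be reproduced outside Megatron-LM with plain torch.distributed. A minimal single-process sketch (the gloo backend, addresses, and 45-minute value are illustrative, and the exact default a group falls back to, roughly 10 minutes for NCCL and 30 for gloo, can vary by PyTorch version):

```python
import os
from datetime import timedelta

import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Initialize the default group with a generous user-specified timeout.
dist.init_process_group("gloo", rank=0, world_size=1,
                        timeout=timedelta(minutes=45))

# A group created WITHOUT timeout= does not inherit the 45 minutes
# above; it silently falls back to the backend's built-in default.
group_default = dist.new_group(ranks=[0])

# Passing timeout= explicitly, as the fix does, keeps all groups
# consistent with the user-specified value.
group_fixed = dist.new_group(ranks=[0], timeout=timedelta(minutes=45))

dist.destroy_process_group()
```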