
fix: resolve multi-node training hanging in Kubernetes environments #6377


Open · wants to merge 2 commits into main

Conversation

@amyanger commented on Aug 5, 2025

Description

Addresses issue #6349 where multi-node training gets stuck during distributed initialization when using torchrun in Kubernetes.

Root Cause

  • Missing rendezvous backend configuration in torchrun
  • No master node readiness checks in K8s pod startup
  • Insufficient timeout configuration for container networking
  • Lack of Kubernetes-specific networking setup

Solution

Enhanced Initialization (colossalai/initialize.py)

  • Add master node readiness checks for non-master ranks (see the sketch after this list)
  • Implement configurable timeouts via environment variables
  • Provide detailed error messages with troubleshooting guidance
  • Add robust error handling for distributed process group init
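
A minimal sketch of the readiness check and configurable timeouts, assuming illustrative names (the helper and the COLOSSALAI_* environment variables below are placeholders, not the exact API this PR adds):

import os
import socket
import time

def _wait_for_master(host: str, port: int) -> None:
    # Timeouts are opt-in and configurable via environment variables.
    timeout = float(os.environ.get("COLOSSALAI_MASTER_WAIT_TIMEOUT", "300"))
    interval = float(os.environ.get("COLOSSALAI_MASTER_POLL_INTERVAL", "5"))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # The master is ready once its rendezvous port accepts TCP connections.
            with socket.create_connection((host, port), timeout=interval):
                return
        except OSError:
            time.sleep(interval)
    raise TimeoutError(
        f"Master {host}:{port} unreachable after {timeout}s. "
        "Check the headless service DNS name and that the master pod is Running."
    )

Non-master ranks would call a check like this before torch.distributed.init_process_group, so they fail with an actionable message instead of hanging in the NCCL/Gloo handshake.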

Kubernetes Utilities (colossalai/utils/k8s_distributed.py)

  • Environment variable validation with helpful errors
  • Automatic K8s networking configuration (NCCL, Gloo; see the sketch after this list)
  • YAML generation for headless services and training jobs
  • Comprehensive diagnostics and troubleshooting tools
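
A sketch of the networking setup, using standard NCCL/Gloo environment variables (the helper name is hypothetical, and these are common container-networking defaults rather than the PR's confirmed choices):

import os

def configure_k8s_networking() -> None:
    defaults = {
        # Pods typically expose a single interface named eth0.
        "NCCL_SOCKET_IFNAME": "eth0",
        "GLOO_SOCKET_IFNAME": "eth0",
        # InfiniBand is usually unavailable inside vanilla K8s pods.
        "NCCL_IB_DISABLE": "1",
        # Surface NCCL setup problems in pod logs.
        "NCCL_DEBUG": "INFO",
    }
    for key, value in defaults.items():
        # setdefault keeps any value the user has already exported.
        os.environ.setdefault(key, value)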

Documentation & Examples

  • Complete K8s multi-node training guide
  • Minimal 2-node test setup for validation (see the example after this list)
  • Working example with distributed operations testing
  • Test suite for validation
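
For the 2-node test, the generated headless Service could look like the sketch below; the helper name, job name, and labels are illustrative, not the exact output of k8s_distributed.py:

import yaml  # pip install pyyaml

def headless_service_manifest(job_name: str, port: int = 29500) -> str:
    # A headless Service (clusterIP: None) gives the master pod a stable
    # DNS name, so MASTER_ADDR resolves before rendezvous starts.
    manifest = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{job_name}-master"},
        "spec": {
            "clusterIP": "None",
            "selector": {"app": job_name, "role": "master"},
            "ports": [{"name": "c10d", "port": port}],
        },
    }
    return yaml.safe_dump(manifest, sort_keys=False)

print(headless_service_manifest("colossalai-train"))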

Usage

Replace the basic torchrun invocation with the enhanced rendezvous configuration:

torchrun --nnodes=4 --nproc_per_node=8 --node_rank=$NODE_RANK \
  --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  --rdzv_id=$JOB_ID --rdzv_conf="timeout=1800,read_timeout=120" \
  scripts/diffusion/train.py
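
Here --rdzv_backend=c10d selects torchrun's built-in TCP-store rendezvous, so no extra infrastructure (such as etcd) is needed; --rdzv_id must be identical on every node so all ranks join the same job; and the --rdzv_conf timeouts relax the defaults for container networking, where pods may take minutes to schedule (read_timeout controls how long c10d store reads may block).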

Backward Compatibility

- 100% backward compatible: no breaking changes
- Enhanced error messages guide users to solutions
- New features are opt-in via environment variables

Testing

- Logic validated with the included unit tests
- Minimal 2-node test configuration provided for cluster validation

Fixes #6349

@amyanger requested a review from a team as a code owner on August 5, 2025, 21:05