Release Notes
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.
NVIDIA Resiliency Extension v0.5.0
Highlights
- In-job restarts
  - PRs (185, 190, 201) improve the scalability, profiling, and performance of in-job restarts through improvements to the rendezvous operation
  - Key scaling and fault-tolerance improvements:
    - The new barrier-based rendezvous operation is a substantial redesign that addresses several limitations of the previous dynamic rendezvous implementation. It provides more predictable, stable, and scalable in-job behavior
  - Faster termination path:
    - The worker termination timeout (--workers-stop-timeout) has been reduced from 30 seconds to 15 seconds, improving failure recovery latency and overall job responsiveness
  - New flag for infra-aligned rank assignment:
    - A new flag, --ft-use-infra-group-rank, allows in-job scaling to follow the infrastructure scheduler’s rank assignment, preserving topology-aware placement decisions
  - Migration guidance:
    - While the previous dynamic rendezvous-based implementation (v1) remains supported, users are strongly encouraged to adopt barrier-based rendezvous (v2) for improved reliability, stability, and performance
- Enhanced GPU and NVLink health checks
  - PR 145 introduces several improvements to the health check module, including:
    - Refactored GPUHealthCheck to support device-specific monitoring
    - New NVLHealthCheck class for NVLink health validation
    - Automatic health check chaining in the Wrapper class, with ChainedGPUHealthCheck and ChainedNVLHealthCheck for in-process use
    - Single GPU health check API for individual device validation, and an updated trace collector that uses the new GPU health check API
- Checkpointing
  - PRs (108, 138, 154, 169, 170, 193, 197, 199) improve the stability of checkpointing by deprecating the use of fork in asynchronous checkpointing and by simplifying error propagation and shutdown cleanup logic
    - Introduced an option to use multithreaded file I/O instead of multiprocess file I/O, which simplifies error propagation logic, improves shutdown cleanup, and enhances overall stability
    - Made the persistent async checkpoint worker the default (except for local checkpointing) and fixed cross-call state pollution
    - Added the ability to abort the async checkpoint process
- Fault attribution (new module introduced in v0.5)
  - PR 141 introduces the base attribution class, which can be used to define any attribution module. It also allows multiple attribution modules to be combined asynchronously
  - PR 172 improves error attribution by dumping NCCL traces from PyTorch for collective analysis on hang or watchdog timeout
    - This is an experimental module that identifies ranks interrupting workload progress by analyzing Flight Recorder traces. It detects GPU errors, host issues, and GIL locks
    - PyTorch’s watchdog is currently configured to include the training process’s stack trace when generating Flight Recorder traces. This can lead to a deadlock if the trainer fails inside a routine that performs collectives while holding the GIL, since capturing the stack trace requires reacquiring the GIL. A new environment variable, TORCH_INCLUDE_STACK_TRACE=False (default: True), has been added to PyTorch main to avoid this issue. This change will be included in the NGC PyTorch 25.11 container
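The stack-trace workaround described above could be applied from a training script roughly as follows. This is a minimal sketch, not part of the package: the helper name `disable_watchdog_stack_traces` is hypothetical, and it assumes a PyTorch build that already recognizes TORCH_INCLUDE_STACK_TRACE. Exactly when PyTorch reads the variable is not stated in these notes, so the sketch sets it before anything imports torch.

```python
import os

# Variable name and value come from the release notes above.
STACK_TRACE_ENV = "TORCH_INCLUDE_STACK_TRACE"


def disable_watchdog_stack_traces(env=os.environ):
    """Request Flight Recorder dumps without Python stack traces.

    Hypothetical helper. Returns the previous value (or None) so the
    caller can restore it later if needed.
    """
    previous = env.get(STACK_TRACE_ENV)
    env[STACK_TRACE_ENV] = "False"
    return previous


# Run this before importing torch or creating a process group, so the
# setting is in place however early PyTorch reads it (an assumption).
previous_value = disable_watchdog_stack_traces()
```

Exporting the variable in the job launch script has the same effect and avoids any dependence on import order.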
Known Issues & Limitations
- Spare-node support
  - Spare nodes are not supported by either dynamic rendezvous or barrier-based rendezvous in the current release.
  - The earlier dynamic rendezvous technically supported spare nodes, but only when infra group rank assignment was not used. That mode isn't viable in real deployments because bypassing the infrastructure's topology-aware rank assignment leads to degraded performance and inconsistent scaling behavior. Because of this, spare-node support isn't available in this release.
  - With barrier-based rendezvous, we've aligned fully with infra-assigned ranks to ensure correctness and performance. Spare-node support for barrier-based rendezvous is planned for a future update.
- CUDA 12 and Ubuntu 22.04 users are advised to build from source, since the PyPI wheel for v0.5 defaults to CUDA 13
- In-process restart requires NCCL < 2.28.3 or >= 2.28.9 due to a segmentation fault issue
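The NCCL constraint above can be checked at job start before enabling in-process restart. The sketch below is illustrative only: the function name `nccl_supports_in_process_restart` and the constants are hypothetical, and the affected range is read directly from the note above (2.28.3 inclusive up to, but not including, 2.28.9).

```python
from typing import Sequence

# Versions the release notes flag as affected by the segmentation fault:
# 2.28.3 <= version < 2.28.9.
AFFECTED_MIN = (2, 28, 3)
AFFECTED_MAX_EXCLUSIVE = (2, 28, 9)


def nccl_supports_in_process_restart(version: Sequence[int]) -> bool:
    """Return True if (major, minor, patch) lies outside the affected range.

    Hypothetical helper. Only the first three components are compared,
    so a trailing suffix element is ignored.
    """
    major_minor_patch = tuple(version[:3])
    return not (AFFECTED_MIN <= major_minor_patch < AFFECTED_MAX_EXCLUSIVE)
```

In recent PyTorch builds, `torch.cuda.nccl.version()` returns a version tuple that could be passed to this function; treat that as an assumption to verify against the PyTorch version in use.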