Speeding up distributed tests #1095
PhysicsNeMo Pull Request
This PR is meant to enable a fast path for distributed tests, for checks on numerical accuracy and functionality of functions such as `scatter_v` (a differentiable scatter collective), as well as `ShardTensor` layers. PhysicsNeMo tests now come in three categories:

- `pytest.mark.multigpu_static` marks tests that can use a static, init-once-and-reuse test environment.
- `pytest.mark.multigpu_dynamic` marks tests that need a fresh distributed environment at run time.

## Motivation
The distributed test path currently runs every distributed test with per-test process spawning, which takes several seconds to spawn processes and coordinate torch distributed. For one test that overhead is not a big deal; across thousands of tests it makes unit testing of distributed tools impractically slow. Because it's slow and impractical, many pieces of distributed functionality have tests that aren't run regularly, and this has led to functionality regressions.

Some tests have to have dynamic start-up, to accommodate testing 2D parallelism as well as the distributed manager itself, etc. Other tests, such as validating numerical accuracy of sharded layers, can and should use static parallelism: with just one process launch, these tests run very quickly.
For static-sized distributed tests, such as layer checks, we can use just one `torchrun` (or similar) command to spawn all process groups.

## Tooling
Several items are needed to enable this functionality:

- Convert existing `multigpu` marks to `multigpu_dynamic` marks.
- With `--multigpu-static` on the command line, the `pytest_configure` function will initialize via `DistributedManager`.
- Add `distributed_mesh` and `distributed_mesh_2d` fixtures for use in static tests.

## Changes
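A rough sketch of the `pytest_configure` initialization described above, as a `conftest.py` hook. The `--multigpu-static` option name comes from this PR, but the `DistributedManager` import path and call are assumptions:

```python
# Hypothetical conftest.py sketch; the DistributedManager API usage is assumed.

def pytest_addoption(parser):
    parser.addoption(
        "--multigpu-static",
        action="store_true",
        default=False,
        help="Initialize one shared distributed environment for static tests.",
    )

def pytest_configure(config):
    if config.getoption("--multigpu-static"):
        # Init once per process; every multigpu_static test reuses this state.
        from physicsnemo.distributed import DistributedManager
        DistributedManager.initialize()
```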
- Static tests skip `modify_environment`, since that's only needed for starting up the distributed manager.

## Pain points
- All static tests share one `multigpu_static` job. If it breaks, that whole pipeline breaks.
- The `multigpu_dynamic` pipeline should really run first and be a prerequisite in CI.

## Wishlist
## Performance

In testing, I saw double-digit speedups (minutes down to seconds) from removing the launch and init overhead.
## How to run it
Previously, tests were run like this:
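The exact command was stripped from this description; as an assumption, the old flow was a single pytest invocation that spawned processes for every distributed test:

```shell
# Hypothetical: all distributed tests, with per-test process spawning.
pytest -m multigpu test/
```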
Now, it would be a two-step process:
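A sketch of the two-step flow, assuming the markers and `--multigpu-static` flag from this PR (the GPU count and test directory are placeholders):

```shell
# Step 1 (hypothetical): dynamic tests still spawn their own processes.
pytest -m multigpu_dynamic test/

# Step 2 (hypothetical): static tests share one torchrun launch.
torchrun --nproc-per-node=2 -m pytest -m multigpu_static --multigpu-static test/
```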
## Description

## Checklist

## Dependencies