Speeding up distributed tests #1095
PhysicsNeMo Pull Request
This PR is meant to enable a fast path for distributed tests, for checks on numerical accuracy and functionality of functions such as `scatter_v` (a differentiable scatter collective), as well as `ShardTensor` layers. PhysicsNeMo tests now come in three categories:

- `pytest.mark.multigpu_static` marks tests that can use a static, init-once-and-reuse test environment.
- `pytest.mark.multigpu_dynamic` marks tests that need a fresh distributed environment at run time.

## Motivation
The distributed test path currently runs every distributed test with per-test process spawning, which takes several seconds to spawn processes and coordinate torch distributed. For one test that overhead is not a big deal; across thousands of tests it makes unit testing of distributed tools impractically slow. Because it's slow and impractical, many pieces of distributed functionality have tests that aren't run regularly, and this has led to functionality regressions.

Some tests have to have dynamic start-up, to accommodate testing 2D parallelism as well as the distributed manager itself, etc. Other tests, such as validating numerical accuracy of sharded layers, can and should use static parallelism: with just one process launch, these tests run very quickly.
For static-sized distributed tests, such as layer checks, we can use just one `torchrun` (or similar) command to spawn all process groups.

## Tooling
Several items are needed to enable this functionality:

- Convert existing `multigpu` marks to `multigpu_dynamic` marks.
- With `--multigpu-static` on the command line, the `pytest_configure` function will initialize via `DistributedManager`.
- Add `distributed_mesh` and `distributed_mesh_2d` fixtures for use in static tests.

## Changes
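A rough sketch of the `pytest_configure` initialization described above, as a `conftest.py` hook. The `--multigpu-static` option name comes from this PR, but the `DistributedManager` import path and call are assumptions:

```python
# Hypothetical conftest.py sketch; the DistributedManager API usage is assumed.

def pytest_addoption(parser):
    parser.addoption(
        "--multigpu-static",
        action="store_true",
        default=False,
        help="Initialize one shared distributed environment for static tests.",
    )

def pytest_configure(config):
    if config.getoption("--multigpu-static"):
        # Init once per process; every multigpu_static test reuses this state.
        from physicsnemo.distributed import DistributedManager
        DistributedManager.initialize()
```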
- Static tests skip `modify_environment`, since that's only needed for starting up the distributed manager.

## Pain points
- All static tests share one `multigpu_static` job. If it breaks, that whole pipeline breaks.
- The `multigpu_dynamic` pipeline should really run first and be a prerequisite in CI.

## Wishlist
## Performance

In testing, I saw double-digit speedups (minutes down to seconds) from removing the launch and init overhead.
## How to run it
Previously, tests were run like this:
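The exact command was stripped from this description; as an assumption, the old flow was a single pytest invocation that spawned processes for every distributed test:

```shell
# Hypothetical: all distributed tests, with per-test process spawning.
pytest -m multigpu test/
```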
Now, it would be a two-step process:
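A sketch of the two-step flow, assuming the markers and `--multigpu-static` flag from this PR (the GPU count and test directory are placeholders):

```shell
# Step 1 (hypothetical): dynamic tests still spawn their own processes.
pytest -m multigpu_dynamic test/

# Step 2 (hypothetical): static tests share one torchrun launch.
torchrun --nproc-per-node=2 -m pytest -m multigpu_static --multigpu-static test/
```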
## Description

## Checklist

## Dependencies