Conversation

coreyjadams
Collaborator

PhysicsNeMo Pull Request

This PR enables a fast path for distributed tests: checks on the numerical accuracy and functionality of collectives such as scatter_v (the differentiable scatter collective), as well as ShardTensor layers. PhysicsNeMo tests now fall into 3 categories:

  • pytest.mark.multigpu_static marks tests that can reuse a static, init-once distributed environment.
  • pytest.mark.multigpu_dynamic marks tests that need a fresh distributed environment at run time.
  • Single-GPU tests are everything that isn't marked as one of the multi-GPU types.
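The three categories map directly onto pytest marks. A minimal sketch, with illustrative test names (not the actual PhysicsNeMo tests):

```python
import pytest

@pytest.mark.multigpu_static
def test_scatter_v_numerics():
    """Runs in the shared, init-once distributed environment (under torchrun)."""
    ...

@pytest.mark.multigpu_dynamic
def test_distributed_manager_startup():
    """Needs a fresh distributed environment, so it sets one up itself."""
    ...

def test_unmarked_single_gpu():
    """No multigpu mark: runs only in the default single-GPU pass."""
    ...
```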

Motivation

The distributed test path currently runs every distributed test with process spawning, which takes several seconds per test to spawn processes and coordinate torch.distributed. For a single test that overhead is negligible; across thousands of tests it makes unit testing of distributed tools impractical. Because the suite is slow and impractical, many pieces of distributed functionality have tests that aren't run regularly, and this has led to functionality regressions.

Some tests require dynamic startup, to accommodate testing 2D parallelism, the distributed manager itself, and so on. Other tests, such as validating the numerical accuracy of sharded layers, can and should use static parallelism. With just one process launch, these tests run very quickly.

For static-sized distributed tests, such as layer checks, a single torchrun (or similar) command can spawn all process groups.
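Under torchrun, each worker process receives its coordinates via environment variables, which an init-once setup (such as a pytest_configure hook initializing DistributedManager) can read directly. A minimal sketch, with a single-process fallback so it is runnable outside torchrun; the helper name is illustrative:

```python
import os

def distributed_coords():
    """Read the coordinates torchrun exports for each worker process.

    Falls back to single-process defaults so this sketch also runs
    outside torchrun.
    """
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return rank, world_size, local_rank
```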

Tooling

Several items are needed to enable this functionality:

  • New markers for static and dynamic multigpu tests, and migration of most existing multigpu marks to multigpu_dynamic.
  • A distributed printing plugin, to ensure we only print once per rank in multigpu runs.
  • Several updates to conftest.py, plus a distributed conftest.py to enable a new fixture:
    • When using --multigpu-static on the command line, the pytest_configure function will initialize via DistributedManager.
    • The collection modification will now sort tests into groups based on markings:
      • distributed tests will only run under their marks.
      • single-GPU tests are everything else, and won't run in distributed runs.
    • The distributed test folder has distributed_mesh and distributed_mesh_2d fixtures to use in static tests.

Changes

  • multigpu-static tests no longer need modify_environment, since that was only needed for starting up the distributed manager.
  • They also no longer need torch.multiprocessing; the tests run directly, without worker functions.

Pain points

  • The DistributedManager needs to be initialized at the start of the multigpu_static job. If it breaks, that whole pipeline breaks.
    • So ... the multigpu-dynamic pipeline should really run first, as a prerequisite, in CI.
  • It's probably advisable to separate the tests by file (not mix dynamic and static in one file). Mixing would probably work, but could cause confusion when debugging.

Wishlist

  • The single-gpu tests should probably get accelerated with pytest-xdist if we can get reliable access to multiple GPUs.
  • The coverage calculation does not include distributed tests. It'd be nice to include them.
  • The ops supported by ShardTensor should all get their tests ported into a testing tool here, in a folder under distributed.
    • They currently live elsewhere and need to get ported.

Performance

In testing, I saw double-digit speedups (minutes down to seconds) from removing the launch and init overhead.

How to run it

Previously, tests were run like this:

```shell
py.test [--multigpu] test/
```

Now, it would be a two-step process:

```shell
py.test --multigpu-dynamic test/                               # run the dynamic tests
torchrun --nproc-per-node 2 -m pytest --multigpu-static test/  # run the fast-path tests
```

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@coreyjadams coreyjadams marked this pull request as draft August 26, 2025 17:25
@coreyjadams coreyjadams marked this pull request as ready for review August 29, 2025 21:46
@coreyjadams
Collaborator Author

/blossom-ci
