Skip to content

Conversation

Contributor

@atulkulk atulkulk commented Nov 5, 2025

Details

Work item: Sub-task of LWPCOMMLIBS-713

What were the changes?

  • Adds a Python-based test runner for RCCL with hierarchical JSON configuration support, replacing shell-based test execution with a maintainable and extensible framework that supports GTest, performance tests, and custom executables.

  • Includes integrated LLVM code coverage reporting, MPI multi-rank/multi-node test execution, flexible test filtering, automated CMake build integration, and environment variable management with path expansion.

  • Provides clean output, comprehensive logging, and configuration inheritance via an "extends" directive for easy test suite organization and reusability (a short config-inheritance sketch follows below).
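As a rough illustration of the "extends" mechanism, here is a minimal sketch of config inheritance with environment-variable expansion. The key names and merge rules below are illustrative assumptions only; see README.md and the JSON files in this PR for the actual schema.

    # Illustrative sketch only -- the "extends" key and merge rules here are
    # assumptions for explanation, not the runner's actual schema.
    import json
    import os

    def load_config(path):
        """Load a JSON config, recursively merging any parent named in 'extends'."""
        with open(path) as f:
            config = json.load(f)

        parent_name = config.pop("extends", None)
        if parent_name:
            # Resolve the parent relative to the child config's directory;
            # child keys override inherited parent keys.
            parent = load_config(os.path.join(os.path.dirname(path), parent_name))
            parent.update(config)
            config = parent

        # Expand environment variables such as $HOME in string values.
        return {key: os.path.expandvars(value) if isinstance(value, str) else value
                for key, value in config.items()}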

Why were the changes made?

  • Replaces shell-based test execution with a Python framework that provides better extensibility, hierarchical JSON configuration for easier test management, and integrated LLVM code coverage reporting.

  • Enables better test organization through configuration inheritance and environment variable management with path expansion, and supports multiple test types (GTest, performance, custom) with flexible filtering and automated build integration.

How was the outcome achieved?

  • Implemented a modular Python runner with three core components (ArgumentParser, TestConfigProcessor, TestExecutor) that parse JSON configurations with hierarchical inheritance, orchestrate CMake builds, execute MPI-based tests, and integrate LLVM coverage tools (see the structural sketch below).
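For orientation, a minimal structural sketch of that flow is below. The class bodies and method names are placeholders, not the actual implementation in this PR.

    # Structural sketch only; class bodies are placeholders, not the PR's code.
    import argparse
    import json

    class TestConfigProcessor:
        """Placeholder: resolves the hierarchical JSON configuration."""
        def __init__(self, path):
            self.path = path

        def resolve(self):
            with open(self.path) as f:
                return json.load(f)

    class TestExecutor:
        """Placeholder: builds and runs the configured tests."""
        def __init__(self, config, verbose=False):
            self.config = config
            self.verbose = verbose
            self.test_results = []

        def run_tests(self):
            # The real executor drives CMake builds, MPI launches, and
            # LLVM coverage collection here.
            pass

    def main():
        parser = argparse.ArgumentParser(description="RCCL test runner (sketch)")
        parser.add_argument("--config", required=True, help="Path to JSON test config")
        parser.add_argument("--verbose", action="store_true")
        args = parser.parse_args()

        config = TestConfigProcessor(args.config).resolve()
        TestExecutor(config, verbose=args.verbose).run_tests()

    if __name__ == "__main__":
        main()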

Additional Documentation:
Please see README.md for more information.

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

@venksubramd

Very comprehensive. Nice work.

A few minor observations:

In test_runner.py

    # Check environment
    if not executor.check_environment():
        return

Should we report before exiting? Especially if verbose is set, but even otherwise?
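For example, something along these lines (just a sketch; the logging call, message, and exit code are suggestions, and executor comes from the surrounding test_runner.py code):

    import logging
    import sys

    # Suggestion sketch, not existing code: report the failure (always, not
    # only under --verbose) and exit non-zero instead of returning silently.
    if not executor.check_environment():
        logging.error("Environment check failed; aborting test run.")
        sys.exit(1)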

In test_runner.py

Unless I'm reading this wrong, it appears that the following code is redundant:
    # Return based on results
    if executor.test_results:
        failed = executor.test_results.count(executor.RESULT_FAILED)
        timeout = executor.test_results.count(executor.RESULT_TIMEOUT)
        if failed > 0 or timeout > 0:
            return

Was there instead an intent to log something if failed or timeout are > 0?
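If so, maybe something like this (again just a sketch; the print/sys.exit usage and executor.EXIT_FAILURE reference are suggestions, not existing code):

    import sys

    # Sketch of one possible intent: summarize and exit non-zero when any
    # tests failed or timed out.
    if executor.test_results:
        failed = executor.test_results.count(executor.RESULT_FAILED)
        timeout = executor.test_results.count(executor.RESULT_TIMEOUT)
        if failed or timeout:
            print(f"{failed} test(s) FAILED, {timeout} test(s) TIMEOUT")
            sys.exit(executor.EXIT_FAILURE)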

In test_executor.py

Would it be better to turn the following into two separate enums?

# Exit codes
EXIT_SUCCESS = 0
EXIT_FAILURE = 1
EXIT_TIMEOUT = 124

# Test results
RESULT_PASSED = "PASSED"
RESULT_FAILED = "FAILED"
RESULT_TIMEOUT = "TIMEOUT"
RESULT_SKIPPED = "SKIPPED"
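For instance, a sketch using the standard-library enum module (class names here are illustrative; the values are carried over from the constants above):

    from enum import Enum, IntEnum

    class ExitCode(IntEnum):
        SUCCESS = 0
        FAILURE = 1
        TIMEOUT = 124

    class TestResult(str, Enum):
        PASSED = "PASSED"
        FAILED = "FAILED"
        TIMEOUT = "TIMEOUT"
        SKIPPED = "SKIPPED"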

