Skip to content

Conversation

Contributor

@atulkulk atulkulk commented Nov 5, 2025

Details

Work item: Sub-task of LWPCOMMLIBS-713

What were the changes?

  • Adds a Python-based test runner for RCCL with hierarchical JSON configuration support, replacing shell-based test execution with a maintainable and extensible framework that supports GTest, performance tests, and custom executables.

  • Includes integrated LLVM code coverage reporting, MPI multi-rank/multi-node test execution, flexible test filtering, automated CMake build integration, and environment variable management with path expansion.

  • Provides clean output, comprehensive logging, and configuration inheritance via an "extends" directive for easy test suite organization and reusability (a short config-inheritance sketch follows below).
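As a rough illustration of the "extends" mechanism, here is a minimal sketch of config inheritance with environment-variable expansion. The key names and merge rules below are illustrative assumptions only; see README.md and the JSON files in this PR for the actual schema.

    # Illustrative sketch only -- the "extends" key and merge rules here are
    # assumptions for explanation, not the runner's actual schema.
    import json
    import os

    def load_config(path):
        """Load a JSON config, recursively merging any parent named in 'extends'."""
        with open(path) as f:
            config = json.load(f)

        parent_name = config.pop("extends", None)
        if parent_name:
            # Resolve the parent relative to the child config's directory;
            # child keys override inherited parent keys.
            parent = load_config(os.path.join(os.path.dirname(path), parent_name))
            parent.update(config)
            config = parent

        # Expand environment variables such as $HOME in string values.
        return {key: os.path.expandvars(value) if isinstance(value, str) else value
                for key, value in config.items()}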

Why were the changes made?

  • Replaces shell-based test execution with a Python framework that provides better extensibility, hierarchical JSON configuration for easier test management, and integrated LLVM code coverage reporting.

  • Enables better test organization through configuration inheritance and environment variable management with path expansion, and supports multiple test types (GTest, performance, custom) with flexible filtering and automated build integration.

How was the outcome achieved?

  • Implemented a modular Python runner with three core components (ArgumentParser, TestConfigProcessor, TestExecutor) that parse JSON configurations with hierarchical inheritance, orchestrate CMake builds, execute MPI-based tests, and integrate LLVM coverage tools (see the structural sketch below).
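For orientation, a minimal structural sketch of that flow is below. The class bodies and method names are placeholders, not the actual implementation in this PR.

    # Structural sketch only; class bodies are placeholders, not the PR's code.
    import argparse
    import json

    class TestConfigProcessor:
        """Placeholder: resolves the hierarchical JSON configuration."""
        def __init__(self, path):
            self.path = path

        def resolve(self):
            with open(self.path) as f:
                return json.load(f)

    class TestExecutor:
        """Placeholder: builds and runs the configured tests."""
        def __init__(self, config, verbose=False):
            self.config = config
            self.verbose = verbose
            self.test_results = []

        def run_tests(self):
            # The real executor drives CMake builds, MPI launches, and
            # LLVM coverage collection here.
            pass

    def main():
        parser = argparse.ArgumentParser(description="RCCL test runner (sketch)")
        parser.add_argument("--config", required=True, help="Path to JSON test config")
        parser.add_argument("--verbose", action="store_true")
        args = parser.parse_args()

        config = TestConfigProcessor(args.config).resolve()
        TestExecutor(config, verbose=args.verbose).run_tests()

    if __name__ == "__main__":
        main()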

Additional Documentation:
Please see README.md for more information.

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

@venksubramd

Very comprehensive. Nice work.

A few minor observations:

In test_runner.py

    # Check environment
    if not executor.check_environment():
        return

Should we report before exiting? Especially if verbose is set, but even otherwise?
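For example, something along these lines (just a sketch; the logging call, message, and exit code are suggestions, and executor comes from the surrounding test_runner.py code):

    import logging
    import sys

    # Suggestion sketch, not existing code: report the failure (always, not
    # only under --verbose) and exit non-zero instead of returning silently.
    if not executor.check_environment():
        logging.error("Environment check failed; aborting test run.")
        sys.exit(1)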

In test_runner.py

Unless I'm reading this wrong, it appears that the following code is redundant:
    # Return based on results
    if executor.test_results:
        failed = executor.test_results.count(executor.RESULT_FAILED)
        timeout = executor.test_results.count(executor.RESULT_TIMEOUT)
        if failed > 0 or timeout > 0:
            return

Was there instead an intent to log something if failed or timeout are > 0?
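If so, maybe something like this (again just a sketch; the print/sys.exit usage and executor.EXIT_FAILURE reference are suggestions, not existing code):

    import sys

    # Sketch of one possible intent: summarize and exit non-zero when any
    # tests failed or timed out.
    if executor.test_results:
        failed = executor.test_results.count(executor.RESULT_FAILED)
        timeout = executor.test_results.count(executor.RESULT_TIMEOUT)
        if failed or timeout:
            print(f"{failed} test(s) FAILED, {timeout} test(s) TIMEOUT")
            sys.exit(executor.EXIT_FAILURE)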

In test_executor.py

Would it be better to turn the following into two separate enums?

# Exit codes
EXIT_SUCCESS = 0
EXIT_FAILURE = 1
EXIT_TIMEOUT = 124

# Test results
RESULT_PASSED = "PASSED"
RESULT_FAILED = "FAILED"
RESULT_TIMEOUT = "TIMEOUT"
RESULT_SKIPPED = "SKIPPED"
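For instance, a sketch using the standard-library enum module (class names here are illustrative; the values are carried over from the constants above):

    from enum import Enum, IntEnum

    class ExitCode(IntEnum):
        SUCCESS = 0
        FAILURE = 1
        TIMEOUT = 124

    class TestResult(str, Enum):
        PASSED = "PASSED"
        FAILED = "FAILED"
        TIMEOUT = "TIMEOUT"
        SKIPPED = "SKIPPED"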

