Conversation

@speriaswamy-amd commented Dec 5, 2025

Summary

This PR implements the architecture for Aorta distributed training benchmarks, using the Docker SDK over SSH for container orchestration. It introduces the following:

  • A modular runner-parser architecture as a new pattern for benchmark integration: a clean separation of concerns between benchmark execution (runners) and result analysis (parsers), making it easier to add new benchmarks in the future.
  • Docker SDK over SSH: Programmatic container orchestration with streaming output (see the sketch after this list)
  • Pydantic Validation: Fail-fast config validation before benchmark execution, plus result validation after it completes
  • TraceLens Integration: Accurate per-rank GPU timeline analysis
  • Override Layer Pattern: Clean exposure of Aorta's internal configs (RCCL, environment, training)
  • Real-time Streaming: Live output from long-running container commands
  • Threshold Validation: Configurable performance thresholds (compute ratio, iteration time, etc.)
  • Final Report: A consolidated report of the processed data, generated on the head node
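
A rough sketch of the Docker-SDK-over-SSH flow with streaming output (the node address, image, and command are placeholders, not the actual runner values):

import docker

# Connect to the remote Docker daemon over SSH (requires paramiko);
# "user@node-0" is a placeholder address
client = docker.DockerClient(base_url="ssh://user@node-0")

# Launch the workload detached so its output can be streamed
container = client.containers.run(
    "rocm/pytorch:latest",      # placeholder image
    command="python train.py",  # placeholder command
    detach=True,
)

# Stream live output line by line while the command runs
for line in container.logs(stream=True, follow=True):
    print(line.decode().rstrip())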

New dependencies added:

docker >= 7.0.0          # Docker SDK for container orchestration
tracelens                # TraceLens for PyTorch trace analysis (optional)
orjson                   # Fast JSON parsing required by tracelens (optional)
openpyxl                 # Excel output for TraceLens (optional)

Usage

# Run Aorta benchmark
pytest -vvv tests/benchmark/test_aorta.py \
    --cluster_file input/cluster_file/cluster.json \
    --config_file input/aorta_benchmark.yaml

# Run only validation (fast)
pytest tests/benchmark/test_aorta.py -k "validate"

Files Created

  • runners/_base_runner.py: Base classes and configs for benchmark runners
  • runners/aorta.py: Aorta benchmark runner with Docker SDK over SSH
  • parsers/schemas.py: Pydantic models for configs, metrics, and results (see the sketch after this list)
  • parsers/tracelens.py: TraceLens integration for PyTorch profiler analysis
  • tests/benchmark/test_aorta.py: Pytest-based Aorta benchmark tests with validation
  • input/aorta_benchmark.yaml: Benchmark configuration with override layer pattern
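
A minimal sketch of the fail-fast validation pattern, assuming Pydantic v2; the model and field names below are illustrative, not the actual schemas:

from pydantic import BaseModel, Field

# Illustrative config model; the real ones live in parsers/schemas.py
class AortaConfig(BaseModel):
    image: str
    num_nodes: int = Field(ge=1)
    iteration_time_threshold_ms: float = Field(gt=0)

# Raises pydantic.ValidationError before any container is launched
config = AortaConfig.model_validate({
    "image": "rocm/pytorch:latest",
    "num_nodes": 2,
    "iteration_time_threshold_ms": 150.0,
})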

Notes

  • This architecture gives us a clean template for developing new benchmarks: create a runner, create parsers, create tests, validate (see the skeleton after this list).
  • The Pydantic config validation should be extended to other benchmarks so they fail early when something is missing.
  • Multi-node Aorta benchmarks are supported by the same architecture with minor tweaks (currently under test).
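
A minimal skeleton of that runner template (class and method names are illustrative assumptions, not the actual _base_runner.py API):

from abc import ABC, abstractmethod

# Illustrative base class; the real one lives in runners/_base_runner.py
class BaseRunner(ABC):
    @abstractmethod
    def setup(self) -> bool:
        """Prepare nodes and launch containers."""

    @abstractmethod
    def run(self) -> dict:
        """Execute the workload and return raw results for a parser."""

class MyBenchmarkRunner(BaseRunner):
    def setup(self) -> bool:
        # e.g. connect Docker clients, pull images, start containers
        return True

    def run(self) -> dict:
        # e.g. exec the training command and collect raw metrics
        return {"iteration_time_ms": 123.4}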

For more information, please refer to AICOMRCCL-271.

@speriaswamy-amd self-assigned this Dec 5, 2025
@speriaswamy-amd added the enhancement label Dec 5, 2025
return False

# Launch container
container = self._launch_container(client, node)

@speriaswamy-amd, is this call blocking? If yes, since the setup function iterates over all the servers and launches containers serially, this will take a long time on a scaled setup. Can we parallelize using ThreadPoolExecutor or gevent? NOTE: pssh uses the gevent coroutine library.
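
A sketch of the suggested ThreadPoolExecutor approach (the _launch_all helper is hypothetical; _launch_container and the client/node arguments mirror the snippet above, and error handling is simplified):

from concurrent.futures import ThreadPoolExecutor, as_completed

# Launch containers on all nodes concurrently instead of serially;
# assumes _launch_container is safe to call from multiple threads
def _launch_all(self, client, nodes):
    containers = {}
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = {pool.submit(self._launch_container, client, node): node
                   for node in nodes}
        for future in as_completed(futures):
            containers[futures[future]] = future.result()  # re-raises launch errors
    return containers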
