Conversation

@speriaswamy-amd commented Dec 5, 2025

Summary

This PR implements the architecture for Aorta distributed training benchmarks, using the Docker SDK over SSH for container orchestration. It introduces the following:

  • A modular runner-parser architecture as a new pattern for benchmark integration: a clean separation of concerns between benchmark execution (runners) and result analysis (parsers), making it easier to add new benchmarks in the future.
  • Docker SDK over SSH: Programmatic container orchestration with streaming output (see the sketch after this list)
  • Pydantic Validation: Fail-fast config validation before benchmark execution, plus result validation after it completes
  • TraceLens Integration: Accurate per-rank GPU timeline analysis
  • Override Layer Pattern: Clean exposure of Aorta's internal configs (RCCL, environment, training)
  • Real-time Streaming: Live output from long-running container commands
  • Threshold Validation: Configurable performance thresholds (compute ratio, iteration time, etc.)
  • Final Report: A consolidated report of the processed data, generated on the head node
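
A rough sketch of the Docker-SDK-over-SSH flow with streaming output (the node address, image, and command are placeholders, not the actual runner values):

import docker

# Connect to the remote Docker daemon over SSH (requires paramiko);
# "user@node-0" is a placeholder address
client = docker.DockerClient(base_url="ssh://user@node-0")

# Launch the workload detached so its output can be streamed
container = client.containers.run(
    "rocm/pytorch:latest",      # placeholder image
    command="python train.py",  # placeholder command
    detach=True,
)

# Stream live output line by line while the command runs
for line in container.logs(stream=True, follow=True):
    print(line.decode().rstrip())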

New dependencies added:

docker >= 7.0.0          # Docker SDK for container orchestration
tracelens                # TraceLens for PyTorch trace analysis (optional)
orjson                   # Fast JSON parsing required by tracelens (optional)
openpyxl                 # Excel output for TraceLens (optional)

Usage

# Run Aorta benchmark
pytest -vvv tests/benchmark/test_aorta.py \
    --cluster_file input/cluster_file/cluster.json \
    --config_file input/aorta_benchmark.yaml

# Run only validation (fast)
pytest tests/benchmark/test_aorta.py -k "validate"

Files Created

  • runners/_base_runner.py: Base classes and configs for benchmark runners
  • runners/aorta.py: Aorta benchmark runner with Docker SDK over SSH
  • parsers/schemas.py: Pydantic models for configs, metrics, and results (see the sketch after this list)
  • parsers/tracelens.py: TraceLens integration for PyTorch profiler analysis
  • tests/benchmark/test_aorta.py: Pytest-based Aorta benchmark tests with validation
  • input/aorta_benchmark.yaml: Benchmark configuration with override layer pattern
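
A minimal sketch of the fail-fast validation pattern, assuming Pydantic v2; the model and field names below are illustrative, not the actual schemas:

from pydantic import BaseModel, Field

# Illustrative config model; the real ones live in parsers/schemas.py
class AortaConfig(BaseModel):
    image: str
    num_nodes: int = Field(ge=1)
    iteration_time_threshold_ms: float = Field(gt=0)

# Raises pydantic.ValidationError before any container is launched
config = AortaConfig.model_validate({
    "image": "rocm/pytorch:latest",
    "num_nodes": 2,
    "iteration_time_threshold_ms": 150.0,
})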

Notes

  • This architecture gives us a clean template for developing new benchmarks: create a runner, create parsers, create tests, validate (see the skeleton after this list).
  • The Pydantic config validation should be extended to other benchmarks so they fail early when something is missing.
  • Multi-node Aorta benchmarks are supported by the same architecture with minor tweaks (currently under test).
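
A minimal skeleton of that runner template (class and method names are illustrative assumptions, not the actual _base_runner.py API):

from abc import ABC, abstractmethod

# Illustrative base class; the real one lives in runners/_base_runner.py
class BaseRunner(ABC):
    @abstractmethod
    def setup(self) -> bool:
        """Prepare nodes and launch containers."""

    @abstractmethod
    def run(self) -> dict:
        """Execute the workload and return raw results for a parser."""

class MyBenchmarkRunner(BaseRunner):
    def setup(self) -> bool:
        # e.g. connect Docker clients, pull images, start containers
        return True

    def run(self) -> dict:
        # e.g. exec the training command and collect raw metrics
        return {"iteration_time_ms": 123.4}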

For more information, please refer to AICOMRCCL-271.

@speriaswamy-amd self-assigned this Dec 5, 2025
@speriaswamy-amd added the enhancement label Dec 5, 2025
return False

# Launch container
container = self._launch_container(client, node)

@speriaswamy-amd, is this call blocking? If yes, since the setup function iterates over all the servers and launches containers serially, this will take a long time on a scaled setup. Can we parallelize using ThreadPoolExecutor or gevent? NOTE: pssh uses the gevent coroutine library.
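
A sketch of the suggested ThreadPoolExecutor approach (the _launch_all helper is hypothetical; _launch_container and the client/node arguments mirror the snippet above, and error handling is simplified):

from concurrent.futures import ThreadPoolExecutor, as_completed

# Launch containers on all nodes concurrently instead of serially;
# assumes _launch_container is safe to call from multiple threads
def _launch_all(self, client, nodes):
    containers = {}
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = {pool.submit(self._launch_container, client, node): node
                   for node in nodes}
        for future in as_completed(futures):
            containers[futures[future]] = future.result()  # re-raises launch errors
    return containers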
