
M4jupitercannon/Rocm-agent


ROCm-Agent

A two-tool optimization pipeline for AMD ROCm GPUs: an Analyzer (profiling and bottleneck detection) and an Optimizer (AITER-first kernel replacement, custom Triton kernels, and operator fusion).

Architecture

User Model (PyTorch / vLLM)
         │
    ┌────▼──────┐
    │ Analyzer  │  env setup → baseline → profiling → roofline analysis
    └────┬──────┘
         │  bottlenecks.json, analysis_summary.json
    ┌────▼──────┐
    │ Optimizer │  AITER match → fusion detect → Triton fallback → integrate
    └───────────┘

Analyzer

  • Environment setup (Docker rocm/vllm-dev or venv)
  • Baseline benchmarking (vllm bench serve or torch.profiler)
  • Kernel trace collection (--enforce-eager + torch_profiler_record_shapes)
  • TraceLens roofline analysis (achieved TFLOP/s, TB/s, arithmetic intensity, bound type)
  • Bottleneck ranking with optimization strategy tags
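The roofline step above boils down to comparing each kernel's arithmetic intensity against the GPU's ridge point. A minimal sketch of that classification, assuming per-kernel FLOP and byte counts are available from the trace; the peak numbers are the MI300X specs from the hardware table below, and `classify_kernel` is an illustrative name, not the project's actual API:

```python
PEAK_TFLOPS_BF16 = 708.0   # MI300X BF16 peak (see Supported Hardware)
PEAK_TB_S = 5.3            # MI300X HBM bandwidth

def classify_kernel(flops: float, bytes_moved: float, time_s: float) -> dict:
    """Compute arithmetic intensity and roofline bound type for one kernel."""
    ai = flops / bytes_moved                                 # FLOP per byte
    ridge = (PEAK_TFLOPS_BF16 * 1e12) / (PEAK_TB_S * 1e12)   # ridge point (FLOP/byte)
    bound = "compute" if ai >= ridge else "memory"
    # Efficiency is measured against whichever roofline ceiling applies.
    if bound == "compute":
        efficiency = (flops / time_s / 1e12) / PEAK_TFLOPS_BF16
    else:
        efficiency = (bytes_moved / time_s) / (PEAK_TB_S * 1e12)
    return {"ai": ai, "bound": bound, "efficiency": efficiency}

# A typical elementwise kernel: ~1 FLOP per 4 bytes moved, far below the ridge.
r = classify_kernel(flops=1e9, bytes_moved=4e9, time_s=1e-3)
print(r["bound"])  # memory
```

Memory-bound kernels with low bandwidth efficiency and a high share of total time are the ones the ranker tags for optimization first.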

Optimizer

  • Goal 1 — Kernel optimization: Target high time-proportion and low roofline-efficiency kernels
  • Goal 2 — Kernel fusion: Detect and apply operator fusion patterns
  • AITER-first: Check ROCm/aiter for pre-optimized AMD kernels before writing custom code
  • Triton fallback: @triton.jit with @triton.autotune for remaining targets
  • Integration: vLLM CustomOp.register_oot() or VLLM_ROCM_USE_AITER env vars
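The AITER-first flow can be sketched as a catalog lookup: each ranked bottleneck is matched against an operator_map-style table before any custom Triton code is written. The catalog entries and field names here are illustrative (the real mapping lives in aiter_catalog/operator_map.json):

```python
import json

# Hypothetical excerpt of an operator_map-style catalog.
CATALOG_JSON = """
{
  "rms_norm":     {"aiter_op": "aiter.rms_norm",     "strategy": "replace"},
  "silu_and_mul": {"aiter_op": "aiter.silu_and_mul", "strategy": "replace"}
}
"""

def match_bottlenecks(bottlenecks: list) -> list:
    """Tag each bottleneck with an AITER op, or mark it for Triton fallback."""
    catalog = json.loads(CATALOG_JSON)
    plans = []
    for b in bottlenecks:
        entry = catalog.get(b["kernel"])
        if entry:
            plans.append({**b, "action": "aiter", "op": entry["aiter_op"]})
        else:
            plans.append({**b, "action": "triton_fallback", "op": None})
    return plans

plans = match_bottlenecks([
    {"kernel": "rms_norm", "time_pct": 12.4},
    {"kernel": "fused_rope", "time_pct": 6.1},
])
print(plans[0]["action"], plans[1]["action"])  # aiter triton_fallback
```

Anything that falls through to triton_fallback becomes a candidate for a generated @triton.jit kernel.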

Quick Start

# 1. Analyze a vLLM model
bash scripts/analyze.sh --model Qwen/Qwen3-8B --output /tmp/rocm_agent_qwen

# 2. Optimize based on analysis results
bash scripts/optimize.sh --project /tmp/rocm_agent_qwen

# 3. Or run the full pipeline
bash scripts/run_pipeline.sh --model Qwen/Qwen3-8B --output /tmp/rocm_agent_qwen

Project Structure

ROCm-Agent/
├── analyzer/              # Tool 1: profiling + bottleneck analysis
│   ├── SKILL.md           # Agent instructions
│   ├── env_setup.py       # Docker/venv environment setup
│   ├── baseline_runner.py # vLLM or PyTorch baseline benchmarks
│   ├── profiler.py        # Trace collection
│   ├── trace_analyzer.py  # TraceLens integration + roofline
│   ├── bottleneck_ranker.py # Bottleneck ranking + strategy tags
│   └── platform_specs.py  # AMD GPU peak specs
│
├── optimizer/             # Tool 2: AITER + Triton optimization
│   ├── SKILL.md           # Agent instructions
│   ├── aiter_matcher.py   # Map bottlenecks to AITER operators
│   ├── fusion_analyzer.py # Detect fusion opportunities
│   ├── triton_optimizer.py # Custom Triton kernel generation
│   ├── kernel_tester.py   # Correctness + benchmark
│   └── integrator.py      # vLLM CustomOp / AITER env vars
│
├── aiter_catalog/         # AITER operator knowledge base
│   ├── operator_map.json  # Bottleneck → AITER mapping
│   └── templates/         # Ready-to-use integration code
│
├── agent_workdir/         # CUDA-Agent style workspace template
│   ├── SKILL.md           # ROCm-adapted agent instructions
│   ├── model.py           # Original model template
│   ├── model_new.py       # Optimized model template
│   └── kernels/           # Custom Triton kernels
│
├── utils/                 # Shared utilities
│   ├── compile.py         # Triton compilation helpers
│   ├── verification.py    # Output correctness checks
│   ├── profiling.py       # Performance comparison
│   └── gpu_info.py        # AMD GPU detection
│
├── scripts/               # CLI entry points
│   ├── analyze.sh
│   ├── optimize.sh
│   └── run_pipeline.sh
│
└── examples/              # Working examples
    ├── rmsnorm/
    └── vllm_qwen/

Supported Hardware

GPU      Memory BW   BF16 TFLOPS   FP8 TFLOPS
MI300X   5.3 TB/s    708           1273
MI325X   6.0 TB/s    843           1519
MI355X   8.0 TB/s    1686          3567
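As a quick sanity check on the table, the BF16 ridge point each GPU implies (the arithmetic intensity at which a kernel stops being memory-bound) is just peak TFLOP/s divided by bandwidth in TB/s. Pure arithmetic on the published peaks above:

```python
# Peak specs copied from the table above.
specs = {
    "MI300X": {"bw_tb_s": 5.3, "bf16_tflops": 708},
    "MI325X": {"bw_tb_s": 6.0, "bf16_tflops": 843},
    "MI355X": {"bw_tb_s": 8.0, "bf16_tflops": 1686},
}

def bf16_ridge(gpu: str) -> float:
    s = specs[gpu]
    return s["bf16_tflops"] / s["bw_tb_s"]  # TFLOP/s ÷ TB/s = FLOP/byte

for gpu in specs:
    print(f"{gpu}: BF16 ridge ≈ {bf16_ridge(gpu):.0f} FLOP/byte")
```

Kernels well below these intensities (most normalization and activation ops) are bandwidth-limited on every listed GPU, which is why they are the main fusion and AITER-replacement targets.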

Dependencies

  • Python 3.10+
  • PyTorch 2.x with ROCm support
  • vLLM (ROCm build)
  • AITER (AI Tensor Engine for ROCm; auto-cloned if not found)
  • TraceLens (auto-cloned if not found)
  • Triton (included with PyTorch ROCm)
