Two-tool optimization pipeline for AMD ROCm GPUs: Analyzer (profiling + bottleneck detection) and Optimizer (AITER-first kernel replacement + Triton custom kernels + fusion).
User Model (PyTorch / vLLM)
│
┌────▼────┐
│ Analyzer │ env setup → baseline → profiling → roofline analysis
└────┬────┘
│ bottlenecks.json, analysis_summary.json
┌────▼─────┐
│ Optimizer │ AITER match → fusion detect → Triton fallback → integrate
└──────────┘
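The handoff between the two tools goes through the JSON artifacts shown in the diagram. A minimal sketch of what a `bottlenecks.json` record could look like and how the Optimizer might rank targets — the field names here are illustrative assumptions, not the tool's actual schema:

```python
import json

# Hypothetical bottleneck records, illustrating the kind of data the
# Analyzer could hand to the Optimizer (field names are assumptions).
bottlenecks = [
    {"kernel": "rms_norm_kernel", "time_pct": 18.4,
     "bound": "memory", "roofline_efficiency": 0.41,
     "strategy": "aiter_replace"},
    {"kernel": "fused_add_rms_norm", "time_pct": 9.1,
     "bound": "memory", "roofline_efficiency": 0.55,
     "strategy": "triton_custom"},
]

# Rank targets by their share of total GPU time, highest first:
targets = sorted(bottlenecks, key=lambda b: b["time_pct"], reverse=True)
print(json.dumps(targets, indent=2))
```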
- Environment setup (Docker `rocm/vllm-dev` or venv)
- Baseline benchmarking (`vllm bench serve` or `torch.profiler`)
- Kernel trace collection (`--enforce-eager` + `torch_profiler_record_shapes`)
- TraceLens roofline analysis (TFLOP/s, TB/s, arithmetic intensity, bound type)
- Bottleneck ranking with optimization strategy tags
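The roofline step above can be illustrated with a small helper that classifies a kernel as compute- or memory-bound from its arithmetic intensity. This is a sketch of the idea, not `trace_analyzer.py`'s actual code; the default peaks are the MI300X BF16 numbers from the spec table at the end of this README:

```python
def classify_kernel(flops, bytes_moved, peak_tflops=708.0, peak_tbps=5.3):
    """Classify a kernel against a simple roofline model.

    flops: floating-point ops performed; bytes_moved: DRAM traffic in bytes.
    Peaks default to MI300X BF16 (708 TFLOP/s, 5.3 TB/s).
    """
    ai = flops / bytes_moved                            # arithmetic intensity, FLOP/byte
    ridge = peak_tflops / peak_tbps                     # ridge point, FLOP/byte
    bound = "compute" if ai >= ridge else "memory"
    attainable = min(peak_tflops, ai * peak_tbps)       # roofline ceiling, TFLOP/s
    return bound, ai, attainable

# An RMSNorm-style kernel does few FLOPs per byte moved, so it lands
# far left of the ridge point (~134 FLOP/byte on MI300X BF16):
bound, ai, attainable = classify_kernel(flops=2e9, bytes_moved=1e9)
print(bound, ai, attainable)  # memory 2.0 10.6
```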
- Goal 1 — Kernel optimization: Target high time-proportion and low roofline-efficiency kernels
- Goal 2 — Kernel fusion: Detect and apply operator fusion patterns
- AITER-first: Check ROCm/aiter for pre-optimized AMD kernels before writing custom code
- Triton fallback: `@triton.jit` with `@triton.autotune` for remaining targets
- Integration: vLLM `CustomOp.register_oot()` or `VLLM_ROCM_USE_AITER` env vars
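The AITER-first decision can be sketched as a lookup against an operator map before falling back to Triton. The mapping entries below are hypothetical examples for illustration only; the real mappings live in `aiter_catalog/operator_map.json`:

```python
# Hypothetical bottleneck-name → AITER-operator map (illustrative entries;
# the actual knowledge base is aiter_catalog/operator_map.json).
OPERATOR_MAP = {
    "rms_norm": "aiter.rms_norm",
    "paged_attention": "aiter.pa_fwd",
}

def pick_strategy(kernel_name):
    """Return (strategy, target) for a profiled kernel name."""
    for pattern, aiter_op in OPERATOR_MAP.items():
        if pattern in kernel_name:
            return "aiter", aiter_op     # a pre-optimized AMD kernel exists
    return "triton", None                # fall back to a custom Triton kernel

print(pick_strategy("vllm::rms_norm_kernel"))   # → ('aiter', 'aiter.rms_norm')
print(pick_strategy("custom_gather_scatter"))   # → ('triton', None)
```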
```bash
# 1. Analyze a vLLM model
bash scripts/analyze.sh --model Qwen/Qwen3-8B --output /tmp/rocm_agent_qwen

# 2. Optimize based on analysis results
bash scripts/optimize.sh --project /tmp/rocm_agent_qwen

# 3. Or run the full pipeline
bash scripts/run_pipeline.sh --model Qwen/Qwen3-8B --output /tmp/rocm_agent_qwen
```

ROCm-Agent/
├── analyzer/ # Tool 1: profiling + bottleneck analysis
│ ├── SKILL.md # Agent instructions
│ ├── env_setup.py # Docker/venv environment setup
│ ├── baseline_runner.py # vLLM or PyTorch baseline benchmarks
│ ├── profiler.py # Trace collection
│ ├── trace_analyzer.py # TraceLens integration + roofline
│ ├── bottleneck_ranker.py # Bottleneck ranking + strategy tags
│ └── platform_specs.py # AMD GPU peak specs
│
├── optimizer/ # Tool 2: AITER + Triton optimization
│ ├── SKILL.md # Agent instructions
│ ├── aiter_matcher.py # Map bottlenecks to AITER operators
│ ├── fusion_analyzer.py # Detect fusion opportunities
│ ├── triton_optimizer.py # Custom Triton kernel generation
│ ├── kernel_tester.py # Correctness + benchmark
│ └── integrator.py # vLLM CustomOp / AITER env vars
│
├── aiter_catalog/ # AITER operator knowledge base
│ ├── operator_map.json # Bottleneck → AITER mapping
│ └── templates/ # Ready-to-use integration code
│
├── agent_workdir/ # CUDA-Agent style workspace template
│ ├── SKILL.md # ROCm-adapted agent instructions
│ ├── model.py # Original model template
│ ├── model_new.py # Optimized model template
│ └── kernels/ # Custom Triton kernels
│
├── utils/ # Shared utilities
│ ├── compile.py # Triton compilation helpers
│ ├── verification.py # Output correctness checks
│ ├── profiling.py # Performance comparison
│ └── gpu_info.py # AMD GPU detection
│
├── scripts/ # CLI entry points
│ ├── analyze.sh
│ ├── optimize.sh
│ └── run_pipeline.sh
│
└── examples/ # Working examples
├── rmsnorm/
└── vllm_qwen/
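`utils/verification.py` checks that an optimized kernel's output matches the baseline. A minimal sketch of that kind of check, using NumPy as a stand-in (the real checks would compare torch tensors, and `outputs_match` is an illustrative name, not the module's API):

```python
import numpy as np

def outputs_match(baseline, optimized, rtol=1e-3, atol=1e-3):
    """Elementwise closeness check in the spirit of utils/verification.py.

    Optimized kernels are usually verified with loose tolerances rather
    than bit-exactness, since reordered reductions change the rounding.
    """
    baseline = np.asarray(baseline, dtype=np.float32)
    optimized = np.asarray(optimized, dtype=np.float32)
    return bool(np.allclose(baseline, optimized, rtol=rtol, atol=atol))

ref = np.random.default_rng(0).standard_normal(1024)
print(outputs_match(ref, ref + 1e-5))   # tiny numeric drift passes
print(outputs_match(ref, ref + 1.0))    # real divergence fails
```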
| GPU | Memory BW | BF16 TFLOPS | FP8 TFLOPS |
|---|---|---|---|
| MI300X | 5.3 TB/s | 708 | 1273 |
| MI325X | 6.0 TB/s | 843 | 1519 |
| MI355X | 8.0 TB/s | 1686 | 3567 |
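These peaks determine the roofline ridge points the Analyzer classifies against: peak compute divided by peak bandwidth gives the arithmetic intensity above which a kernel can be compute-bound. Computed directly from the table (BF16):

```python
# Ridge point = peak compute / peak bandwidth (TFLOP/s ÷ TB/s = FLOP/byte).
# Numbers taken from the spec table above.
specs = {
    "MI300X": (5.3, 708),
    "MI325X": (6.0, 843),
    "MI355X": (8.0, 1686),
}
for gpu, (tbps, bf16_tflops) in specs.items():
    ridge = bf16_tflops / tbps
    print(f"{gpu}: BF16 ridge = {ridge:.2f} FLOP/byte")
```

Kernels below their GPU's ridge point (most normalization and attention kernels) are memory-bound and benefit from fusion and traffic reduction; kernels above it benefit from compute-side optimization.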