This file provides guidance for AI agents working with the InferenceX codebase.
InferenceX is an open-source, automated benchmarking system that continuously tracks LLM inference performance across different hardware platforms (NVIDIA B200/H100/H200/GB200, AMD MI300X/MI325X/MI355X) and software stacks (vLLM, SGLang, TensorRT-LLM, ATOM). Results are published to https://inferencex.com/.
```
├── benchmarks/                          # Shell scripts for running benchmarks
│   ├── benchmark_lib.sh                 # Shared benchmarking/eval utilities
│   ├── dsr1_*.sh                        # DeepSeek R1-specific benchmark scripts
│   └── gptoss_*.sh                      # gptoss-specific benchmark scripts
├── runners/                             # Launch scripts for different hardware
│   ├── launch_b200/h100/h200-*.sh       # NVIDIA launcher scripts
│   └── launch_mi*.sh                    # AMD launcher scripts
├── utils/                               # Python utilities
│   ├── matrix_logic/                    # Config generation and validation
│   │   ├── generate_sweep_configs.py    # CLI for generating benchmark matrix
│   │   ├── validation.py                # Pydantic validation models
│   │   └── test_*.py                    # Unit tests
│   ├── bench_serving/                   # Benchmark serving client (upstreamed from vLLM)
│   │   ├── benchmark_serving.py         # Main benchmark client script
│   │   ├── backend_request_func.py      # Backend-specific request functions
│   │   └── benchmark_utils.py           # Utility functions
│   ├── evals/                           # Eval task definitions for lm-eval
│   │   ├── EVALS.md                     # Evals documentation
│   │   ├── gsm8k.yaml
│   │   └── gpqa_diamond.yaml
│   ├── collect_eval_results.py          # Aggregates eval results
│   ├── process_result.py                # Post-processes benchmark results
│   ├── process_changelog.py             # Processes perf-changelog.yaml
│   └── summarize.py                     # Generates markdown summaries
├── .github/
│   ├── workflows/                       # GitHub Actions CI/CD
│   │   ├── run-sweep.yml                # Main performance sweep
│   │   ├── e2e-tests.yml                # End-to-end testing
│   │   ├── benchmark-tmpl.yml           # Benchmark job template
│   │   └── collect-evals.yml            # Eval results collection
│   └── configs/                         # Master configuration files
│       ├── nvidia-master.yaml
│       ├── amd-master.yaml
│       └── runners.yaml
└── perf-changelog.yaml                  # Triggers benchmarks on changes
```
- STP (Single Token Prediction): Standard autoregressive decoding where one token is generated per forward pass. No speculative decoding or MTP (Multi-Token Prediction) is used. When a benchmark is labeled "STP only", it means vanilla decoding without any speculation.
- MTP (Multi-Token Prediction): A technique where the model predicts multiple tokens per forward pass, typically using speculative decoding methods like EAGLE or NEXTN.
- Python 3.13: Core automation and config generation
- Pydantic: Configuration validation (V2 with strict mode)
- Bash: Benchmark execution and infrastructure orchestration
- YAML: Configuration files
- GitHub Actions: CI/CD workflows
- Evals: lm-eval validation of benchmark results
- pytest: Testing framework
```bash
cd utils
python -m pytest matrix_logic/ -v
```

```bash
# Full sweep with all configs
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml

# Filter by model prefix (dsr1 or gptoss)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --model-prefix dsr1

# Filter by framework (sglang, trt, vllm, atom, dynamo-trt, dynamo-sglang)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --framework sglang

# Filter by precision (fp4, fp8)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --precision fp8

# Filter by runner type (b200, h100, h200, gb200, mi300x, mi325x, mi355x)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --runner-type b200
```

```bash
python utils/process_result.py
python utils/summarize.py
```

When working with benchmark configurations, use these valid values:
Models (`model-prefix`):

- `dsr1` - DeepSeek-R1-0528
- `gptoss` - GPT-OSS-120B

Precisions:

- `fp4`
- `fp8`

Frameworks:

- `sglang` - SGLang inference engine
- `trt` - TensorRT-LLM
- `vllm` - vLLM inference engine
- `atom` - AMD ATOM framework
- `dynamo-trt` - NVIDIA Dynamo with TensorRT-LLM backend
- `dynamo-sglang` - NVIDIA Dynamo with SGLang backend
- `sglang-disagg` - SGLang disaggregated inference

Runners (NVIDIA):

- `b200` - NVIDIA B200 GPU
- `b200-trt` - NVIDIA B200 with TensorRT
- `h100` - NVIDIA H100 GPU
- `h200` - NVIDIA H200 GPU
- `gb200` - NVIDIA GB200 (multi-node)

Runners (AMD):

- `mi300x` - AMD MI300X GPU
- `mi325x` - AMD MI325X GPU
- `mi355x` - AMD MI355X GPU

Sequence Lengths (ISL/OSL):

- `1k1k` - 1024 input / 1024 output
- `8k1k` - 8192 input / 1024 output
- Use type hints: `list[str]`, `dict`, `Optional[int]`
- Pydantic models for validation with `extra='forbid'`
- Field aliases for YAML compatibility: `Field(alias="model-prefix")`
- Docstrings for functions
- Kebab-case for field names: `model-prefix`, `conc-start`, `dp-attn`
- Master configs define all benchmark configurations
- `perf-changelog.yaml` triggers which configs to benchmark
- Source shared utilities: `source benchmark_lib.sh`
- Functions: `check_env_vars()`, `wait_for_server_ready()`, `run_benchmark_serving()`, `run_eval()`, `append_lm_eval_summary()`
- Parameters passed via environment variables
- Conventional commit messages
- Use `[skip-sweep]` in the commit message to skip benchmarks
- Changes to `perf-changelog.yaml` trigger benchmark runs
- Add an entry to `.github/configs/nvidia-master.yaml` or `amd-master.yaml`
- Add a corresponding entry to `perf-changelog.yaml` to trigger the benchmark
- Run validation: `python utils/matrix_logic/generate_sweep_configs.py full-sweep ...`
- Add the runner to `.github/configs/runners.yaml`
- Create a launcher script in the `runners/` directory
- Update the relevant master config with the new runner type
For disaggregated multi-node configurations (dynamo-sglang, dynamo-trt), recipes are stored in the external srtslurm repository. To stage these recipes in InferenceX:
1. Locate source recipes in srtslurm:

```bash
# Example: H200 sglang disagg recipes
ls /path/to/srtslurm/recipes/h200/
# 1k1k/  8k1k/
```

2. Analyze recipe structure. Each recipe YAML contains:

- `name`: Recipe identifier
- `model`: Model path/container info
- `resources`: GPU type, prefill/decode node/worker counts
- `backend.sglang_config`: Prefill and decode configuration (tp-size, dp-size, ep-size, dp-attention, etc.)
- `benchmark`: ISL/OSL and concurrency settings
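A recipe with these fields might look like the sketch below. The exact nesting and key names are assumptions based on the field list above; check the actual files under `recipes/` in srtslurm before relying on the structure:

```yaml
# Hypothetical srtslurm recipe sketch (field names per the list above;
# verify against the real recipes before use).
name: bs128-agg-tp
model: deepseek-ai/DeepSeek-R1-0528
resources:
  gpu_type: h200
  prefill_workers: 1
  decode_workers: 0
backend:
  sglang_config:
    prefill:
      tp-size: 8
      ep-size: 1
      enable-dp-attention: false
benchmark:
  isl: 1024
  osl: 1024
  concurrencies: "1,4,16,32,64,128"
```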
3. Add config to nvidia-master.yaml:

```yaml
dsr1-fp8-h200-dynamo-sglang:
  image: lmsysorg/sglang:v0.5.8-cu130-runtime
  model: deepseek-ai/DeepSeek-R1-0528
  model-prefix: dsr1
  runner: h200-multinode-slurm
  precision: fp8
  framework: dynamo-sglang
  multinode: true
  disagg: true
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - conc-list: [1, 4, 16, 32, 64, 128, 256, 512]
          prefill:
            num-worker: 1
            tp: 8
            ep: 1
            dp-attn: false
            additional-settings:
              - "CONFIG_FILE=recipes/h200/1k1k/bs128-agg-tp.yaml"
          decode:
            num-worker: 0
            tp: 8
            ep: 1
            dp-attn: false
```

4. Key mapping from srtslurm to nvidia-master.yaml:
| srtslurm field | nvidia-master.yaml field |
|---|---|
| `resources.prefill_workers` | `prefill.num-worker` |
| `resources.decode_workers` | `decode.num-worker` |
| `sglang_config.prefill.tp-size` | `prefill.tp` |
| `sglang_config.prefill.ep-size` | `prefill.ep` |
| `sglang_config.prefill.enable-dp-attention` | `prefill.dp-attn` |
| `benchmark.concurrencies` (parsed) | `conc-list` |
| Recipe file path | `additional-settings: CONFIG_FILE=...` |
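The mapping table above can be expressed as a small translation helper. This is a hypothetical sketch: the input dict shape follows the table's srtslurm field paths, which have not been verified against the srtslurm repo:

```python
# Hypothetical translator for the field mapping above (recipe dict shape
# assumed from the table, not verified against srtslurm).
def srtslurm_to_master(recipe: dict) -> dict:
    prefill_cfg = recipe["backend"]["sglang_config"]["prefill"]
    return {
        "prefill": {
            "num-worker": recipe["resources"]["prefill_workers"],
            "tp": prefill_cfg["tp-size"],
            "ep": prefill_cfg["ep-size"],
            "dp-attn": prefill_cfg["enable-dp-attention"],
        },
        "decode": {"num-worker": recipe["resources"]["decode_workers"]},
        # benchmark.concurrencies is parsed into conc-list
        "conc-list": [int(c) for c in
                      str(recipe["benchmark"]["concurrencies"]).split(",")],
    }

example = {
    "resources": {"prefill_workers": 1, "decode_workers": 0},
    "backend": {"sglang_config": {"prefill": {
        "tp-size": 8, "ep-size": 1, "enable-dp-attention": False}}},
    "benchmark": {"concurrencies": "1,4,16,32"},
}
mapped = srtslurm_to_master(example)
```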
5. Common patterns:

- Aggregated (AGG): Single node, `num-worker: 1` for prefill, `num-worker: 0` for decode
- TEP (Tensor-Expert Parallel): `dp-attn: false`, `ep: 1`
- DEP (Data-Expert Parallel): `dp-attn: true`, `ep: 8` (typically)
- Low latency: More decode workers (e.g., 9), lower concurrencies
- High throughput: Fewer decode workers, higher concurrencies
6. Add perf-changelog entry:

```yaml
- config-keys:
    - dsr1-fp8-h200-dynamo-sglang
  description:
    - "Add DSR1 FP8 H200 Dynamo SGLang disaggregated multinode configuration"
    - "Image: lmsysorg/sglang:v0.5.8-cu130-runtime"
    - "Recipes sourced from srtslurm repo (recipes/h200/)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

7. Validate configuration:

```bash
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --framework dynamo-sglang
```

When upgrading Docker images in benchmark scripts and master config `.yaml` files:
- Update the image tag in the relevant `.github/configs/*-master.yaml` and/or `benchmarks/*.sh` script(s)
- Update any related environment variables or configuration parameters
- MUST: Add an entry to `perf-changelog.yaml`, for example:

```yaml
- config-keys:
    - dsr1-fp8-*-vllm  # Use wildcards to match multiple configs
  description:
    - "Update vLLM image from v0.11.2 to v0.13.0"
    - "Add VLLM_MXFP4_USE_MARLIN=1 environment variable"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

- This triggers benchmarks for affected configs and tracks performance changes
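The wildcard in `config-keys` appears to follow shell-style globbing (as the `dsr1-fp8-*-vllm` example suggests); a minimal sketch of how such a pattern selects configs, assuming that semantics:

```python
# Sketch of wildcard config-key matching (assumed shell-style globbing,
# as the dsr1-fp8-*-vllm example suggests; not the repo's actual code).
from fnmatch import fnmatch

def affected_configs(all_keys: list[str], pattern: str) -> list[str]:
    """Return the config keys a changelog wildcard entry would trigger."""
    return [k for k in all_keys if fnmatch(k, pattern)]

keys = ["dsr1-fp8-h100-vllm", "dsr1-fp8-h200-vllm", "dsr1-fp4-b200-trt"]
hits = affected_configs(keys, "dsr1-fp8-*-vllm")
```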
- Check the GitHub Actions logs for the failed job
- Look at the environment variables passed to the benchmark script
- Review the benchmark script in the `benchmarks/` directory
- Check `wait_for_server_ready()` logs for server startup issues
Evals run optional accuracy checks to ensure model outputs aren't degraded by inference optimizations. They can run alongside benchmarks or independently in eval-only mode.
Evals are off by default (RUN_EVAL=false). When enabled, they run at two concurrency levels per configuration group:
- Highest concurrency per (model, runner, framework, precision, ISL, OSL, spec-decoding, dp-attn)
- Lower-median concurrency per (model, runner, framework, precision, ISL, OSL, spec-decoding, dp-attn)
This selection logic lives in `mark_eval_entries()` in `utils/matrix_logic/generate_sweep_configs.py`.
Note: Evals only run on the 8k1k sequence length.
The default eval framework is lm-evaluation-harness (`lm-eval`).
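The two-level selection rule can be sketched as below. This is a simplification of `mark_eval_entries()`, and "lower-median" is interpreted here as the lower middle element of the sorted concurrency list, which is one plausible reading:

```python
# Simplified sketch of eval concurrency selection (the real logic is
# mark_eval_entries(); "lower-median" interpretation is an assumption).
def pick_eval_concurrencies(concs: list[int]) -> set[int]:
    """Pick the highest and the lower-median concurrency for one
    (model, runner, framework, precision, ISL, OSL, ...) group."""
    s = sorted(set(concs))
    return {s[-1], s[(len(s) - 1) // 2]}

picked = pick_eval_concurrencies([1, 4, 16, 32, 64, 128, 256, 512])
```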
```bash
# Generate configs (evals marked by default on 8k1k subset)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml

# Generate throughput-only configs (skip evals)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --no-evals

# Generate ONLY the eval subset (excludes non-eval configs)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --evals-only
```

All benchmark scripts in `benchmarks/` follow one of two flows:
```bash
# Combined mode (benchmark + eval):
# 1. Start server
# 2. wait_for_server_ready
# 3. run_benchmark_serving (throughput)
# 4. Conditionally run evals:
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Eval-only mode (EVAL_ONLY=true):
# 1. Compute expanded context via compute_eval_context_length
# 2. Start server with expanded context (--context-length or --max-model-len)
# 3. wait_for_server_ready
# 4. run_benchmark_serving returns immediately (skipped)
# 5. run_eval + append_lm_eval_summary
```

| Function | Description |
|---|---|
| `run_eval` | Unified entrypoint; dispatches to framework-specific runner |
| `run_lm_eval` | Runs lm-eval harness against the OpenAI-compatible endpoint |
| `append_lm_eval_summary` | Writes meta_env.json and moves eval artifacts to workspace |
| `_install_lm_eval_deps` | Installs lm-eval dependencies |
| `_patch_lm_eval` | Patches lm-eval for reasoning tokens and TRT compatibility |
| `compute_eval_context_length` | Computes eval context length (5x benchmark context, capped at model native max) |
| `get_native_max_context_length` | Extracts model's native max context length from HF config |
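The `compute_eval_context_length` rule from the table reduces to one expression. A Python sketch for clarity (the real implementation is a bash function in `benchmarks/benchmark_lib.sh`):

```python
# Python sketch of the compute_eval_context_length rule: 5x the benchmark
# context, capped at the model's native maximum context length.
def eval_context_length(benchmark_ctx: int, native_max: int) -> int:
    return min(5 * benchmark_ctx, native_max)
```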
Eval results are collected by `.github/workflows/collect-evals.yml`:

- Downloads all `eval_*` artifacts
- Runs `utils/collect_eval_results.py` to aggregate results
- Outputs `agg_eval_<exp_name>.json` with all eval metrics
- Publishes a summary table to the GitHub Step Summary
```bash
# Download eval results artifact
gh run download <RUN_ID> --repo SemiAnalysisAI/InferenceX -n eval_results_all -D ./evals

# View eval summary
cat ./evals/agg_eval_all.json | jq -r '
  .[] | [.hw, .framework, .precision, .tp, .conc, .task, (.score * 100 | round | . / 100)]
  | @tsv' | column -t

# Filter to specific hardware
cat ./evals/agg_eval_all.json | jq '[.[] | select(.hw == "B200")]'
```

| Field | Description |
|---|---|
| `score` | Primary metric (exact match for GSM8K) |
| `em_strict` | Strict exact match (requires `####` format) |
| `em_flexible` | Flexible extraction (looser number matching) |
| `n_eff` | Number of samples evaluated |
| `task` | Eval task name (e.g., gsm8k) |
| Variable | Default | Description |
|---|---|---|
| `RUN_EVAL` | `false` | Enable eval after throughput benchmark |
| `EVAL_ONLY` | `false` | Skip throughput, only run evals (set by workflow) |
| `EVAL_FRAMEWORK` | `lm-eval` | Eval framework to use |
| `EVAL_TASKS_DIR` | `utils/evals/gsm8k.yaml` | Path to lm-eval task YAML |
| `EVAL_RESULT_DIR` | `/tmp/eval_out-*` | Output directory for eval results |
| `EVAL_MAX_MODEL_LEN` | `16384` | Max context for eval (set by `compute_eval_context_length`) |
| `EVAL_CONCURRENT_REQUESTS` | `64` | Concurrent requests during eval |
- Create a task YAML in `utils/evals/` (follow the lm-eval task format)
- Set `EVAL_TASKS_DIR=utils/evals/<your_task>.yaml` when running benchmarks
- Update `utils/collect_eval_results.py` if new metrics need extraction
The codebase includes patches for lm-eval compatibility (`_patch_lm_eval`):

- Reasoning token handling: Extracts `reasoning_content` when `message.content` is empty
- TRT compatibility: Avoids injecting `{"type": "text"}` for non-HF tokenizers

These patches are applied via `sitecustomize.py` on `PYTHONPATH`.
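The reasoning-token fallback amounts to a small shim. A minimal sketch, assuming an OpenAI-style message dict; the actual patch is applied inside lm-eval by `_patch_lm_eval` and may differ in detail:

```python
# Minimal sketch of the reasoning-token fallback described above (assumed
# message shape; the real patch lives in _patch_lm_eval).
def extract_text(message: dict) -> str:
    """Prefer content; fall back to reasoning_content when content is
    empty, so reasoning models don't return blank completions to lm-eval."""
    return message.get("content") or message.get("reasoning_content") or ""
```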
- `utils/matrix_logic/validation.py` - Defines all configuration schemas
- `utils/matrix_logic/generate_sweep_configs.py` - Config generation logic
- `utils/bench_serving/benchmark_serving.py` - Benchmark client for measuring serving performance
- `.github/configs/nvidia-master.yaml` - NVIDIA benchmark definitions
- `.github/workflows/run-sweep.yml` - Main CI/CD workflow
- `.github/workflows/collect-evals.yml` - Eval results collection workflow
- `benchmarks/benchmark_lib.sh` - Shared benchmark/eval utilities
- `utils/evals/` - Eval task definitions (gsm8k.yaml, math500.yaml)
- `utils/collect_eval_results.py` - Aggregates eval results into JSON/table
Tests are located in `utils/matrix_logic/`:

- `test_validation.py` - Pydantic model validation tests
- `test_generate_sweep_configs.py` - Config generation tests
- `test_process_result.py` - Result processing tests

Run with: `python -m pytest utils/matrix_logic/ -v`

Markers available: `slow`, `integration`
- Make sure no new directories are created in `/workspace` during the benchmark. Files are OK.
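One way to verify the rule above is a before/after snapshot. A hypothetical check (not an existing repo utility); the demo uses a temp dir standing in for `/workspace`:

```python
# Hypothetical pre/post check for "no new directories under /workspace":
# snapshot subdirectories before and after a benchmark and diff them.
import tempfile
from pathlib import Path

def dir_snapshot(root: str) -> set[str]:
    """Record every subdirectory under root, recursively."""
    return {str(p) for p in Path(root).rglob("*") if p.is_dir()}

def new_dirs(before: set[str], after: set[str]) -> set[str]:
    """Directories created between two snapshots; should be empty."""
    return after - before

# Demo on a temp dir standing in for /workspace:
root = tempfile.mkdtemp()
before = dir_snapshot(root)
Path(root, "offender").mkdir()          # simulated violation
created = new_dirs(before, dir_snapshot(root))
```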
When asked to analyze benchmark results from a GitHub Actions run URL, use the `gh` CLI.

```bash
# List artifacts for a run
gh api /repos/SemiAnalysisAI/InferenceX/actions/runs/<RUN_ID>/artifacts --jq '.artifacts[].name'

# Download aggregated results
gh run download <RUN_ID> --repo SemiAnalysisAI/InferenceX -n results_bmk -D ./results
```

The results JSON can be large, with many decimal places, so avoid dumping the raw JSON. Use `jq` to extract and round only what you need, for example:
```bash
# Count total results
cat ./results/results_bmk/*.json | jq 'length'

# List unique hardware/framework combinations
cat ./results/agg_bmk.json | jq -r '[.[] | "\(.hw)/\(.framework)"] | unique | .[]'

# Summary table: hw, model, isl/osl, throughput (rounded)
cat ./results/agg_bmk.json | jq -r '
  .[] | [.hw, .infmax_model_prefix, "\(.isl)/\(.osl)", (.tput_per_gpu | round)]
  | @tsv' | column -t

# Filter to specific model
cat ./results/agg_bmk.json | jq '[.[] | select(.infmax_model_prefix == "gptoss")]'

# Get single best result by throughput
cat ./results/agg_bmk.json | jq 'max_by(.tput_per_gpu)'

# Compact view with rounded values
cat ./results/agg_bmk.json | jq '
  .[] | {
    hw, framework, model: .infmax_model_prefix,
    isl, osl, tp, ep, conc,
    tput: (.tput_per_gpu | round),
    ttft_p99: (.p99_ttft | .*100 | round | ./100),
    e2e_mean: (.mean_e2el | .*100 | round | ./100)
  }'
```

| Field | Description |
|---|---|
| `tput_per_gpu` | Total throughput per GPU (tokens/sec) |
| `output_tput_per_gpu` | Output token throughput |
| `mean_ttft` / `p99_ttft` | Time to first token |
| `mean_tpot` | Time per output token |
| `mean_e2el` | End-to-end latency |
| Pattern | Contents |
|---|---|
| `results_bmk` | Aggregated benchmark results, `agg_bmk.json` |
| `results_all` | All results aggregated; might not exist |
| `eval_results_all` | Eval results, `agg_eval_all.json`; might not exist |
| `run-stats` | `run_stats.json`: run stats, including which nodes ran and succeeded |