This file provides guidance for AI agents working with the InferenceX codebase.
InferenceX is an open-source, automated benchmarking system that continuously tracks LLM inference performance across different hardware platforms (NVIDIA B200/H100/H200/GB200, AMD MI300X/MI325X/MI355X) and software stacks (vLLM, SGLang, TensorRT-LLM, ATOM). Results are published to https://inferencex.com/.
```
├── benchmarks/                          # Shell scripts for running benchmarks
│   ├── benchmark_lib.sh                 # Shared benchmarking/eval utilities
│   ├── dsr1_*.sh                        # DeepSeek R1-specific benchmark scripts
│   └── gptoss_*.sh                      # gptoss-specific benchmark scripts
├── runners/                             # Launch scripts for different hardware
│   ├── launch_b200/h100/h200-*.sh       # NVIDIA launcher scripts
│   └── launch_mi*.sh                    # AMD launcher scripts
├── utils/                               # Python utilities
│   ├── matrix_logic/                    # Config generation and validation
│   │   ├── generate_sweep_configs.py    # CLI for generating benchmark matrix
│   │   ├── validation.py                # Pydantic validation models
│   │   └── test_*.py                    # Unit tests
│   ├── bench_serving/                   # Benchmark serving client (upstreamed from vLLM)
│   │   ├── benchmark_serving.py         # Main benchmark client script
│   │   ├── backend_request_func.py      # Backend-specific request functions
│   │   └── benchmark_utils.py           # Utility functions
│   ├── evals/                           # Eval task definitions for lm-eval
│   │   ├── EVALS.md                     # Evals documentation
│   │   ├── gsm8k.yaml
│   │   └── gpqa_diamond.yaml
│   ├── collect_eval_results.py          # Aggregates eval results
│   ├── process_result.py                # Post-processes benchmark results
│   ├── process_changelog.py             # Processes perf-changelog.yaml
│   └── summarize.py                     # Generates markdown summaries
├── .github/
│   ├── workflows/                       # GitHub Actions CI/CD
│   │   ├── run-sweep.yml                # Main performance sweep
│   │   ├── e2e-tests.yml                # End-to-end testing
│   │   ├── benchmark-tmpl.yml           # Benchmark job template
│   │   └── collect-evals.yml            # Eval results collection
│   └── configs/                         # Master configuration files
│       ├── nvidia-master.yaml
│       ├── amd-master.yaml
│       └── runners.yaml
└── perf-changelog.yaml                  # Triggers benchmarks on changes
```
- STP (Single Token Prediction): Standard autoregressive decoding where one token is generated per forward pass. No speculative decoding or MTP (Multi-Token Prediction) is used. When a benchmark is labeled "STP only", it means vanilla decoding without any speculation.
- MTP (Multi-Token Prediction): A technique where the model predicts multiple tokens per forward pass, typically using speculative decoding methods like EAGLE or NEXTN.
- Python 3.13: Core automation and config generation
- Pydantic: Configuration validation (V2 with strict mode)
- Bash: Benchmark execution and infrastructure orchestration
- YAML: Configuration files
- GitHub Actions: CI/CD workflows
- Evals: lm-eval validation of benchmark results
- pytest: Testing framework
```bash
cd utils
python -m pytest matrix_logic/ -v
```

```bash
# Full sweep with all configs
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml

# Filter by model prefix (dsr1 or gptoss)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --model-prefix dsr1

# Filter by framework (sglang, trt, vllm, atom, dynamo-trt, dynamo-sglang)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --framework sglang

# Filter by precision (fp4, fp8)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --precision fp8

# Filter by runner type (b200, h100, h200, gb200, mi300x, mi325x, mi355x)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --runner-type b200
```

```bash
python utils/process_result.py
python utils/summarize.py
```

When working with benchmark configurations, use these valid values:
Models (`model-prefix`):

- `dsr1` - DeepSeek-R1-0528
- `gptoss` - GPT-OSS-120B

Precisions:

- `fp4`
- `fp8`

Frameworks:

- `sglang` - SGLang inference engine
- `trt` - TensorRT-LLM
- `vllm` - vLLM inference engine
- `atom` - AMD ATOM framework
- `dynamo-trt` - NVIDIA Dynamo with TensorRT-LLM backend
- `dynamo-sglang` - NVIDIA Dynamo with SGLang backend
- `sglang-disagg` - SGLang disaggregated inference

Runners (NVIDIA):

- `b200` - NVIDIA B200 GPU
- `b200-trt` - NVIDIA B200 with TensorRT
- `h100` - NVIDIA H100 GPU
- `h200` - NVIDIA H200 GPU
- `gb200` - NVIDIA GB200 (multi-node)

Runners (AMD):

- `mi300x` - AMD MI300X GPU
- `mi325x` - AMD MI325X GPU
- `mi355x` - AMD MI355X GPU

Sequence Lengths (ISL/OSL):

- `1k1k` - 1024 input / 1024 output
- `8k1k` - 8192 input / 1024 output
- Use type hints: `list[str]`, `dict`, `Optional[int]`
- Pydantic models for validation with `extra='forbid'`
- Field aliases for YAML compatibility: `Field(alias="model-prefix")`
- Docstrings for functions
- Kebab-case for field names: `model-prefix`, `conc-start`, `dp-attn`
- Master configs define all benchmark configurations
- `perf-changelog.yaml` triggers which configs to benchmark
- Source shared utilities: `source benchmark_lib.sh`
- Functions: `check_env_vars()`, `wait_for_server_ready()`, `run_benchmark_serving()`, `run_eval()`, `append_lm_eval_summary()`
- Parameters passed via environment variables
- Conventional commit messages
- Use `[skip-sweep]` in the commit message to skip benchmarks
- Changes to `perf-changelog.yaml` trigger benchmark runs
- Add an entry to `.github/configs/nvidia-master.yaml` or `amd-master.yaml`
- Add a corresponding entry to `perf-changelog.yaml` to trigger the benchmark
- Run validation: `python utils/matrix_logic/generate_sweep_configs.py full-sweep ...`
- Add the runner to `.github/configs/runners.yaml`
- Create a launcher script in the `runners/` directory
- Update the relevant master config with the new runner type
For disaggregated multi-node configurations (dynamo-sglang, dynamo-trt), recipes are stored in the external srtslurm repository. To stage these recipes in InferenceX:
1. Locate source recipes in srtslurm:

```bash
# Example: H200 sglang disagg recipes
ls /path/to/srtslurm/recipes/h200/
# 1k1k/  8k1k/
```

2. Analyze recipe structure. Each recipe YAML contains:

- `name`: Recipe identifier
- `model`: Model path/container info
- `resources`: GPU type, prefill/decode node/worker counts
- `backend.sglang_config`: Prefill and decode configuration (tp-size, dp-size, ep-size, dp-attention, etc.)
- `benchmark`: ISL/OSL and concurrency settings
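A recipe with these fields might look like the sketch below. The exact nesting and key names are assumptions based on the field list above; check the actual files under `recipes/` in srtslurm before relying on the structure:

```yaml
# Hypothetical srtslurm recipe sketch (field names per the list above;
# verify against the real recipes before use).
name: bs128-agg-tp
model: deepseek-ai/DeepSeek-R1-0528
resources:
  gpu_type: h200
  prefill_workers: 1
  decode_workers: 0
backend:
  sglang_config:
    prefill:
      tp-size: 8
      ep-size: 1
      enable-dp-attention: false
benchmark:
  isl: 1024
  osl: 1024
  concurrencies: "1,4,16,32,64,128"
```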
3. Add config to nvidia-master.yaml:

```yaml
dsr1-fp8-h200-dynamo-sglang:
  image: lmsysorg/sglang:v0.5.8-cu130-runtime
  model: deepseek-ai/DeepSeek-R1-0528
  model-prefix: dsr1
  runner: h200-multinode-slurm
  precision: fp8
  framework: dynamo-sglang
  multinode: true
  disagg: true
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - conc-list: [1, 4, 16, 32, 64, 128, 256, 512]
          prefill:
            num-worker: 1
            tp: 8
            ep: 1
            dp-attn: false
            additional-settings:
              - "CONFIG_FILE=recipes/h200/1k1k/bs128-agg-tp.yaml"
          decode:
            num-worker: 0
            tp: 8
            ep: 1
            dp-attn: false
```

4. Key mapping from srtslurm to nvidia-master.yaml:
| srtslurm field | nvidia-master.yaml field |
|---|---|
| `resources.prefill_workers` | `prefill.num-worker` |
| `resources.decode_workers` | `decode.num-worker` |
| `sglang_config.prefill.tp-size` | `prefill.tp` |
| `sglang_config.prefill.ep-size` | `prefill.ep` |
| `sglang_config.prefill.enable-dp-attention` | `prefill.dp-attn` |
| `benchmark.concurrencies` (parsed) | `conc-list` |
| Recipe file path | `additional-settings: CONFIG_FILE=...` |
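The mapping table above can be expressed as a small translation helper. This is a hypothetical sketch: the input dict shape follows the table's srtslurm field paths, which have not been verified against the srtslurm repo:

```python
# Hypothetical translator for the field mapping above (recipe dict shape
# assumed from the table, not verified against srtslurm).
def srtslurm_to_master(recipe: dict) -> dict:
    prefill_cfg = recipe["backend"]["sglang_config"]["prefill"]
    return {
        "prefill": {
            "num-worker": recipe["resources"]["prefill_workers"],
            "tp": prefill_cfg["tp-size"],
            "ep": prefill_cfg["ep-size"],
            "dp-attn": prefill_cfg["enable-dp-attention"],
        },
        "decode": {"num-worker": recipe["resources"]["decode_workers"]},
        # benchmark.concurrencies is parsed into conc-list
        "conc-list": [int(c) for c in
                      str(recipe["benchmark"]["concurrencies"]).split(",")],
    }

example = {
    "resources": {"prefill_workers": 1, "decode_workers": 0},
    "backend": {"sglang_config": {"prefill": {
        "tp-size": 8, "ep-size": 1, "enable-dp-attention": False}}},
    "benchmark": {"concurrencies": "1,4,16,32"},
}
mapped = srtslurm_to_master(example)
```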
5. Common patterns:

- Aggregated (AGG): Single node, `num-worker: 1` for prefill, `num-worker: 0` for decode
- TEP (Tensor-Expert Parallel): `dp-attn: false`, `ep: 1`
- DEP (Data-Expert Parallel): `dp-attn: true`, `ep: 8` (typically)
- Low latency: More decode workers (e.g., 9), lower concurrencies
- High throughput: Fewer decode workers, higher concurrencies
6. Add perf-changelog entry:

```yaml
- config-keys:
    - dsr1-fp8-h200-dynamo-sglang
  description:
    - "Add DSR1 FP8 H200 Dynamo SGLang disaggregated multinode configuration"
    - "Image: lmsysorg/sglang:v0.5.8-cu130-runtime"
    - "Recipes sourced from srtslurm repo (recipes/h200/)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

7. Validate configuration:

```bash
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --framework dynamo-sglang
```

When upgrading Docker images in benchmark scripts and master config `.yaml` files:
- Update the image tag in the relevant `.github/configs/*-master.yaml` and/or `benchmarks/*.sh` script(s)
- Update any related environment variables or configuration parameters
- MUST: Add an entry to `perf-changelog.yaml`, for example:

```yaml
- config-keys:
    - dsr1-fp8-*-vllm  # Use wildcards to match multiple configs
  description:
    - "Update vLLM image from v0.11.2 to v0.13.0"
    - "Add VLLM_MXFP4_USE_MARLIN=1 environment variable"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

- This triggers benchmarks for affected configs and tracks performance changes
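The wildcard in `config-keys` appears to follow shell-style globbing (as the `dsr1-fp8-*-vllm` example suggests); a minimal sketch of how such a pattern selects configs, assuming that semantics:

```python
# Sketch of wildcard config-key matching (assumed shell-style globbing,
# as the dsr1-fp8-*-vllm example suggests; not the repo's actual code).
from fnmatch import fnmatch

def affected_configs(all_keys: list[str], pattern: str) -> list[str]:
    """Return the config keys a changelog wildcard entry would trigger."""
    return [k for k in all_keys if fnmatch(k, pattern)]

keys = ["dsr1-fp8-h100-vllm", "dsr1-fp8-h200-vllm", "dsr1-fp4-b200-trt"]
hits = affected_configs(keys, "dsr1-fp8-*-vllm")
```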
- Check the GitHub Actions logs for the failed job
- Look at the environment variables passed to the benchmark script
- Review the benchmark script in the `benchmarks/` directory
- Check `wait_for_server_ready()` logs for server startup issues
Evals run optional accuracy checks to ensure model outputs aren't degraded by inference optimizations. They can run alongside benchmarks or independently in eval-only mode.
Evals are off by default (RUN_EVAL=false). When enabled, they run at two concurrency levels per configuration group:
- Highest concurrency per (model, runner, framework, precision, ISL, OSL, spec-decoding, dp-attn)
- Lower-median concurrency per (model, runner, framework, precision, ISL, OSL, spec-decoding, dp-attn)
This selection logic lives in `mark_eval_entries()` in `utils/matrix_logic/generate_sweep_configs.py`.
Note: Evals only run on the 8k1k sequence length.
The default eval framework is lm-evaluation-harness (`lm-eval`).
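The two-level selection rule can be sketched as below. This is a simplification of `mark_eval_entries()`, and "lower-median" is interpreted here as the lower middle element of the sorted concurrency list, which is one plausible reading:

```python
# Simplified sketch of eval concurrency selection (the real logic is
# mark_eval_entries(); "lower-median" interpretation is an assumption).
def pick_eval_concurrencies(concs: list[int]) -> set[int]:
    """Pick the highest and the lower-median concurrency for one
    (model, runner, framework, precision, ISL, OSL, ...) group."""
    s = sorted(set(concs))
    return {s[-1], s[(len(s) - 1) // 2]}

picked = pick_eval_concurrencies([1, 4, 16, 32, 64, 128, 256, 512])
```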
```bash
# Generate configs (evals marked by default on 8k1k subset)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml

# Generate throughput-only configs (skip evals)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --no-evals

# Generate ONLY the eval subset (excludes non-eval configs)
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/nvidia-master.yaml \
    --evals-only
```

All benchmark scripts in `benchmarks/` follow one of two flows:
```bash
# Combined mode (benchmark + eval):
# 1. Start server
# 2. wait_for_server_ready
# 3. run_benchmark_serving (throughput)
# 4. Conditionally run evals:
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Eval-only mode (EVAL_ONLY=true):
# 1. Compute expanded context via compute_eval_context_length
# 2. Start server with expanded context (--context-length or --max-model-len)
# 3. wait_for_server_ready
# 4. run_benchmark_serving returns immediately (skipped)
# 5. run_eval + append_lm_eval_summary
```

| Function | Description |
|---|---|
| `run_eval` | Unified entrypoint; dispatches to framework-specific runner |
| `run_lm_eval` | Runs lm-eval harness against the OpenAI-compatible endpoint |
| `append_lm_eval_summary` | Writes meta_env.json and moves eval artifacts to workspace |
| `_install_lm_eval_deps` | Installs lm-eval dependencies |
| `_patch_lm_eval` | Patches lm-eval for reasoning tokens and TRT compatibility |
| `compute_eval_context_length` | Computes eval context length (5x benchmark context, capped at model native max) |
| `get_native_max_context_length` | Extracts model's native max context length from HF config |
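The `compute_eval_context_length` rule from the table reduces to one expression. A Python sketch for clarity (the real implementation is a bash function in `benchmarks/benchmark_lib.sh`):

```python
# Python sketch of the compute_eval_context_length rule: 5x the benchmark
# context, capped at the model's native maximum context length.
def eval_context_length(benchmark_ctx: int, native_max: int) -> int:
    return min(5 * benchmark_ctx, native_max)
```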
Eval results are collected by `.github/workflows/collect-evals.yml`:

- Downloads all `eval_*` artifacts
- Runs `utils/collect_eval_results.py` to aggregate results
- Outputs `agg_eval_<exp_name>.json` with all eval metrics
- Publishes a summary table to the GitHub Step Summary
```bash
# Download eval results artifact
gh run download <RUN_ID> --repo SemiAnalysisAI/InferenceX -n eval_results_all -D ./evals

# View eval summary
cat ./evals/agg_eval_all.json | jq -r '
  .[] | [.hw, .framework, .precision, .tp, .conc, .task, (.score * 100 | round | . / 100)]
  | @tsv' | column -t

# Filter to specific hardware
cat ./evals/agg_eval_all.json | jq '[.[] | select(.hw == "B200")]'
```

| Field | Description |
|---|---|
| `score` | Primary metric (exact match for GSM8K) |
| `em_strict` | Strict exact match (requires `####` format) |
| `em_flexible` | Flexible extraction (looser number matching) |
| `n_eff` | Number of samples evaluated |
| `task` | Eval task name (e.g., gsm8k) |
| Variable | Default | Description |
|---|---|---|
| `RUN_EVAL` | `false` | Enable eval after throughput benchmark |
| `EVAL_ONLY` | `false` | Skip throughput, only run evals (set by workflow) |
| `EVAL_FRAMEWORK` | `lm-eval` | Eval framework to use |
| `EVAL_TASKS_DIR` | `utils/evals/gsm8k.yaml` | Path to lm-eval task YAML |
| `EVAL_RESULT_DIR` | `/tmp/eval_out-*` | Output directory for eval results |
| `EVAL_MAX_MODEL_LEN` | `16384` | Max context for eval (set by `compute_eval_context_length`) |
| `EVAL_CONCURRENT_REQUESTS` | `64` | Concurrent requests during eval |
- Create a task YAML in `utils/evals/` (follow the lm-eval task format)
- Set `EVAL_TASKS_DIR=utils/evals/<your_task>.yaml` when running benchmarks
- Update `utils/collect_eval_results.py` if new metrics need extraction
The codebase includes patches for lm-eval compatibility (`_patch_lm_eval`):

- Reasoning token handling: Extracts `reasoning_content` when `message.content` is empty
- TRT compatibility: Avoids injecting `{"type": "text"}` for non-HF tokenizers

These patches are applied via `sitecustomize.py` on `PYTHONPATH`.
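The reasoning-token fallback amounts to a small shim. A minimal sketch, assuming an OpenAI-style message dict; the actual patch is applied inside lm-eval by `_patch_lm_eval` and may differ in detail:

```python
# Minimal sketch of the reasoning-token fallback described above (assumed
# message shape; the real patch lives in _patch_lm_eval).
def extract_text(message: dict) -> str:
    """Prefer content; fall back to reasoning_content when content is
    empty, so reasoning models don't return blank completions to lm-eval."""
    return message.get("content") or message.get("reasoning_content") or ""
```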
- `utils/matrix_logic/validation.py` - Defines all configuration schemas
- `utils/matrix_logic/generate_sweep_configs.py` - Config generation logic
- `utils/bench_serving/benchmark_serving.py` - Benchmark client for measuring serving performance
- `.github/configs/nvidia-master.yaml` - NVIDIA benchmark definitions
- `.github/workflows/run-sweep.yml` - Main CI/CD workflow
- `.github/workflows/collect-evals.yml` - Eval results collection workflow
- `benchmarks/benchmark_lib.sh` - Shared benchmark/eval utilities
- `utils/evals/` - Eval task definitions (gsm8k.yaml, math500.yaml)
- `utils/collect_eval_results.py` - Aggregates eval results into JSON/table
Tests are located in `utils/matrix_logic/`:

- `test_validation.py` - Pydantic model validation tests
- `test_generate_sweep_configs.py` - Config generation tests
- `test_process_result.py` - Result processing tests

Run with: `python -m pytest utils/matrix_logic/ -v`

Markers available: `slow`, `integration`
- Make sure no new directories are created in `/workspace` during the benchmark. Files are OK.
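One way to verify the rule above is a before/after snapshot. A hypothetical check (not an existing repo utility); the demo uses a temp dir standing in for `/workspace`:

```python
# Hypothetical pre/post check for "no new directories under /workspace":
# snapshot subdirectories before and after a benchmark and diff them.
import tempfile
from pathlib import Path

def dir_snapshot(root: str) -> set[str]:
    """Record every subdirectory under root, recursively."""
    return {str(p) for p in Path(root).rglob("*") if p.is_dir()}

def new_dirs(before: set[str], after: set[str]) -> set[str]:
    """Directories created between two snapshots; should be empty."""
    return after - before

# Demo on a temp dir standing in for /workspace:
root = tempfile.mkdtemp()
before = dir_snapshot(root)
Path(root, "offender").mkdir()          # simulated violation
created = new_dirs(before, dir_snapshot(root))
```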
When asked to analyze benchmark results from a GitHub Actions run URL, use the `gh` CLI.

```bash
# List artifacts for a run
gh api /repos/SemiAnalysisAI/InferenceX/actions/runs/<RUN_ID>/artifacts --jq '.artifacts[].name'

# Download aggregated results
gh run download <RUN_ID> --repo SemiAnalysisAI/InferenceX -n results_bmk -D ./results
```

The results JSON can be large, with many decimal places, so avoid dumping the raw JSON. Use `jq` to extract and round only what you need, for example:
```bash
# Count total results
cat ./results/results_bmk/*.json | jq 'length'

# List unique hardware/framework combinations
cat ./results/agg_bmk.json | jq -r '[.[] | "\(.hw)/\(.framework)"] | unique | .[]'

# Summary table: hw, model, isl/osl, throughput (rounded)
cat ./results/agg_bmk.json | jq -r '
  .[] | [.hw, .infmax_model_prefix, "\(.isl)/\(.osl)", (.tput_per_gpu | round)]
  | @tsv' | column -t

# Filter to specific model
cat ./results/agg_bmk.json | jq '[.[] | select(.infmax_model_prefix == "gptoss")]'

# Get single best result by throughput
cat ./results/agg_bmk.json | jq 'max_by(.tput_per_gpu)'

# Compact view with rounded values
cat ./results/agg_bmk.json | jq '
  .[] | {
    hw, framework, model: .infmax_model_prefix,
    isl, osl, tp, ep, conc,
    tput: (.tput_per_gpu | round),
    ttft_p99: (.p99_ttft | .*100 | round | ./100),
    e2e_mean: (.mean_e2el | .*100 | round | ./100)
  }'
```

| Field | Description |
|---|---|
| `tput_per_gpu` | Total throughput per GPU (tokens/sec) |
| `output_tput_per_gpu` | Output token throughput |
| `mean_ttft` / `p99_ttft` | Time to first token |
| `mean_tpot` | Time per output token |
| `mean_e2el` | End-to-end latency |
| Pattern | Contents |
|---|---|
| `results_bmk` | Aggregated benchmark results, `agg_bmk.json` |
| `results_all` | All results aggregated; might not exist |
| `eval_results_all` | Eval results, `agg_eval_all.json`; might not exist |
| `run-stats` | `run_stats.json`: run stats, including which nodes ran and succeeded |