This repository layout provides a reproducible workflow to:
- Create an isolated Python environment with OpenVINO GenAI + Optimum-Intel
- Export Hugging Face LLMs to OpenVINO IR with quantization/group size variants
- Smoke‑test exported model directories (simple text generation)
- Generate fixed-length prompt files per model for consistent benchmarking
- Run multi‑model, multi‑device performance benchmarks (CPU / GPU / NPU)
- Aggregate JSON benchmark outputs into a consolidated CSV report
- (Optional) Fast subset iteration with `--limit-models` for quick validation
```bash
cd src
# One-time setup (first run only)
chmod +x setup.sh && ./setup.sh
# For subsequent runs, just activate py env
source ov-genai-env/bin/activate
./export-models.sh
# Smoke test (no benchmark)
python run-llm-bench-batch.py --models-root ov-models --devices CPU --prompt "The goal of AI is " --max-new-tokens 32
# Create fixed-length prompts for benchmarking
python create-prompts.py --models-root ov-models --prompt-length 64 --device CPU
# Full benchmark
python run-llm-bench-batch.py --models-root ov-models --benchmark -pf prompt_64_tokens.jsonl --bench-iters 3 --all-devices
# (Optional) generate-bench-summary.py runs automatically after run-llm-bench-batch.py.
# Aggregate results. Replace <timestamp> with the actual directory printed during benchmarking.
python generate-bench-summary.py --reports-dir benchmark-reports-<timestamp>
```

Run once (adjust the Python path if needed):

```bash
cd src
chmod +x setup.sh
./setup.sh
```

This creates the `ov-genai-env/` virtual environment and installs:
- PyTorch (CPU wheels by default, edit for GPU if required)
- `optimum-intel` (latest from GitHub)
- `openvino-genai`
- llm_bench dependencies (sanitized `requirements-bench.txt`)
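A quick way to confirm the install, run inside the activated environment (a hypothetical check, not part of the repo):

```python
# Probe the key packages installed by setup.sh; any ImportError means the
# corresponding dependency did not install correctly.
import importlib

for module in ("torch", "openvino", "openvino_genai", "optimum.intel"):
    try:
        mod = importlib.import_module(module)
        print(f"{module}: OK ({getattr(mod, '__version__', 'unknown version')})")
    except ImportError as exc:
        print(f"{module}: MISSING ({exc})")
```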
Activate later with:

```bash
source ov-genai-env/bin/activate
```

- Edit `export-models.sh` to add more `MODEL_IDS` / group sizes / weight formats (see the sketch below).
- Exported models are saved in `ov-models/` (change `OUT_DIR_ROOT` in the script if desired).
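For orientation, the export matrix those variables drive boils down to one `optimum-cli export openvino` call per model/variant. A rough Python sketch of that loop follows; the model IDs and group sizes are placeholders, and the flag names should be checked against `optimum-cli export openvino --help` for the installed optimum-intel version:

```python
# Rough sketch of the export matrix that export-models.sh iterates over.
# MODEL_IDS and group sizes are illustrative placeholders.
import subprocess

MODEL_IDS = ["meta-llama/Llama-3.2-1B-Instruct"]  # placeholder list
ASYM_GROUP_SIZES = [64, 128]
SYM_GROUP_SIZES = [128]
OUT_DIR_ROOT = "ov-models"

for model_id in MODEL_IDS:
    name = model_id.split("/")[-1]
    for sym_label, sym_flag, group_sizes in [
        ("asym", [], ASYM_GROUP_SIZES),        # asymmetric assumed to be the default (no flag)
        ("sym", ["--sym"], SYM_GROUP_SIZES),
    ]:
        for g in group_sizes:
            out_dir = f"{OUT_DIR_ROOT}/{name}#int4#{sym_label}#g_{g}#ov"  # matches the naming convention
            subprocess.run(
                ["optimum-cli", "export", "openvino", "--model", model_id,
                 "--weight-format", "int4", "--group-size", str(g), *sym_flag, out_dir],
                check=True,
            )
```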
```bash
chmod +x export-models.sh
# Default: exports to the ov-models directory
./export-models.sh 2>&1 | tee export-models-log.log
# OR pass the export dir
# ./export-models.sh ov-models-test 2>&1 | tee export-models-log.log
```

Use the batch runner in non-benchmark mode (omit `--benchmark`) for a quick generation sanity check over all model directories:
```bash
python run-llm-bench-batch.py --models-root ov-models \
    --devices CPU \
    --max-new-tokens 32 \
    --prompt "The goal of AI is "
```

This prints a truncated sample output per model (and per device if multiple are specified). Failures (load/generation errors) are counted and summarized.
Speed up quick validations during development by limiting how many model directories are processed:
```bash
python run-llm-bench-batch.py --models-root ov-models \
    --devices CPU \
    --limit-models 2 \
    --prompt "The goal of AI is " --max-new-tokens 16
```

`--limit-models` (alias `-lm`) applies an alphabetical slice after discovering valid IR folders; useful for sanity checks before full runs.
Generates a JSONL like `prompt_64_tokens.jsonl` inside every valid model folder.

```bash
python create-prompts.py --models-root ov-models --prompt-length 64
```

Resulting file per model: `prompt_<N>_tokens.jsonl` with structure:

```json
{"prompt": "<text truncated to N tokens>"}
```

If the base essay prompt is too short for `--prompt-length`, you'll get an error; pick a smaller value or extend `BASE_PROMPT`.
From repo root:
```bash
python run-llm-bench-batch.py \
    --models-root ov-models \
    --benchmark \
    -pf prompt_64_tokens.jsonl \
    --bench-iters 3 \
    --all-devices
```

Key options:
- `--benchmark` switch => use full `benchmark.py` instead of simple generation
- `-pf` prompt file basename (must exist inside each model dir)
- `--bench-iters` passed as `-n` iterations to the underlying tool (first iteration may be warmup depending on config)
- Device control: `--devices CPU` (single), `--devices CPU,GPU` (list), or `--all-devices`
- Subset run: `--limit-models 3` (first 3 model dirs alphabetically)
- Report directory auto-created: `benchmark-reports-<timestamp>/`
- `--report-format json|csv` (default json)
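Device selection and report-directory naming can be pictured roughly like this (a conceptual sketch; `--all-devices` is assumed to cover CPU/GPU/NPU per the device list above, and the timestamp format is inferred from the example directory name below):

```python
# Conceptual sketch of device expansion and the timestamped report directory name.
from datetime import datetime

def resolve_devices(devices_arg: str | None, all_devices: bool) -> list[str]:
    if all_devices:
        return ["CPU", "GPU", "NPU"]  # assumed expansion of --all-devices
    return [d.strip() for d in (devices_arg or "CPU").split(",")]

report_dir = f"benchmark-reports-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
print(resolve_devices("CPU,GPU", all_devices=False), report_dir)
```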
Each successful run produces one JSON file (or CSV file) per model/device.
Example produced file:
`Llama-3.2-1B-Instruct#int4#asym#g_128#ov#CPU.json`

Contains fields:
- `perfdata.results[0]` (iteration 0 metrics; we take `input_size`, `infer_count`)
- `perfdata.results_averaged` (averaged latencies & timing fields for the final row)
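As a rough guide to what the aggregator pulls from each report (a sketch based on the field names above; verify the nesting against a real file, since it may vary between llm_bench versions):

```python
# Sketch: extract the fields the CSV summary uses from one benchmark report JSON.
import json
import sys

with open(sys.argv[1]) as f:                 # e.g. a file under benchmark-reports-<timestamp>/
    perfdata = json.load(f)["perfdata"]

first_iter = perfdata["results"][0]          # iteration 0 metrics: input_size, infer_count
averaged = perfdata["results_averaged"]      # averaged latency / timing fields for the final row
print(first_iter["input_size"], first_iter["infer_count"])
print(averaged)
```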
From repo root:
```bash
python generate-bench-summary.py --reports-dir benchmark-reports-20250903-230104
```

Outputs (by default): `benchmark-reports-20250903-230104.csv`

Columns:

```
model_name, weight_format, sym_label, group_size, framework, device,
input_size, infer_count, generation_time, latency, first_latency, second_avg_latency,
first_infer_latency, second_infer_avg_latency, tokenization_time, detokenization_time
```
Warnings are printed for malformed filenames or JSON structures and those files are skipped.
- Add more models: edit `MODEL_IDS` in `export-models.sh`.
- Add more quantization/group size combos: extend the `ASYM_GROUP_SIZES` / `SYM_GROUP_SIZES` arrays.
- Change report format: `--report-format csv`.
- Single device only: omit `--all-devices` and use `--devices GPU`.
- Skip prompt permutation control: add `--keep-prompt-permutation`.
- Faster iteration: `--limit-models N` to exercise only the first N model directories (alphabetical by name).
| Issue | Cause / Fix |
|---|---|
| Models root not found | Path typo; verify with `ls` and adjust `--models-root` |
| Missing prompt file warning | Run `create-prompts.py` or ensure correct `-pf` basename |
| Negative latency values in JSON | Upstream tool anomaly / clock source; verify environment & rerun |
| No GPU / NPU runs | Ensure drivers + OpenVINO plugin installed and visible (check `benchmark.py -h`) |
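For the GPU / NPU case, a quick way to see which devices the installed OpenVINO runtime can actually enumerate (independent of the benchmark scripts):

```python
# List devices visible to the OpenVINO runtime; GPU/NPU only appear when the
# corresponding drivers and plugins are installed.
import openvino as ov

print(ov.Core().available_devices)  # e.g. ['CPU', 'GPU']; a missing entry means no plugin/driver
```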
When prototyping changes (e.g., adjusting prompt length or verifying new export parameters), running the entire model matrix can be time-consuming. Use:
```bash
python run-llm-bench-batch.py --models-root ov-models --benchmark -pf prompt_64_tokens.jsonl \
    --bench-iters 2 --devices CPU,GPU --limit-models 1
```

Notes:
- Ordering is deterministic (sorted directory names); to test a specific model, ensure its name sorts early or temporarily move it to a separate root.
- The reported summary still reflects only the processed subset.
- Combine with fewer `--bench-iters` for an even faster smoke test of the benchmark pipeline.
- Add Markdown summary generation from CSV
- Track git commit + environment metadata inside each JSON / CSV row
- Add charts (latency vs group size)
| File | Purpose |
|---|---|
| `setup.sh` | Creates virtualenv, installs dependencies (PyTorch CPU wheels, optimum-intel, openvino-genai, llm_bench requirements) |
| `export-models.sh` | Batch exports HF models into OpenVINO IR (quantized variants) using `optimum-cli export openvino` |
| `create-prompts.py` | Builds a fixed-length (N tokens) prompt JSONL inside each model folder |
| `run-llm-bench-batch.py` | Standalone batch benchmark driver (calls upstream `benchmark.py`) |
| `generate-bench-summary.py` | Standalone aggregation of JSON results to CSV (auto-runs after benchmarks) |
| `benchmark-reports-<timestamp>/` | Auto-created directory containing per-model JSON (or CSV) benchmark outputs |
Exported model directory naming convention (produced by `export-models.sh`):

```
${MODEL_NAME}#${WEIGHT_FORMAT}#${SYM_LABEL}#g_${GROUP_SIZE}#ov
```

Benchmark output file naming convention (adds the device and extension):

```
${MODEL_NAME}#${WEIGHT_FORMAT}#${SYM_LABEL}#g_${GROUP_SIZE}#ov#${DEVICE}.json
```

`generate-bench-summary.py` auto-runs after a benchmark session (and can be run manually) to split filenames on `#` and emit a CSV summary.
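A hypothetical illustration of that split, with the field order taken from the convention above:

```python
# Sketch: split a benchmark report filename into the naming-convention fields.
from pathlib import Path

def parse_report_name(path: str) -> dict:
    model_name, weight_format, sym_label, group, framework, device = Path(path).stem.split("#")
    return {
        "model_name": model_name,
        "weight_format": weight_format,
        "sym_label": sym_label,
        "group_size": group.removeprefix("g_"),
        "framework": framework,
        "device": device,
    }

print(parse_report_name("Llama-3.2-1B-Instruct#int4#asym#g_128#ov#CPU.json"))
```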