diff --git a/docs/developer-guide/benchmarking.md b/docs/developer-guide/benchmarking.md
index e971db2b239..2e09c727dd7 100644
--- a/docs/developer-guide/benchmarking.md
+++ b/docs/developer-guide/benchmarking.md
@@ -1,14 +1,11 @@
 # Benchmarking
 
 Vortex has two categories of benchmarks: microbenchmarks for individual operations, and SQL
-benchmarks for end-to-end query performance. The `bench-orchestrator` tool coordinates running
-SQL benchmarks across different engines without compiling them all into a single binary.
+benchmarks for end-to-end query performance.
 
 ## Microbenchmarks
 
-Microbenchmarks use the Divan framework and live in `benches/` directories within individual
-crates. They cover low-level operations such as encoding, decoding, compute kernels, buffer
-operations, and scalar access.
+Microbenchmarks use the Divan framework and live in `benches/` directories within individual crates.
 
 Run microbenchmarks for a specific crate with:
 
@@ -16,6 +13,146 @@ Run microbenchmarks for a specific crate with:
 ```
 cargo bench -p 
 ```
 
+## Best Practices
+
+### Separate setup from profiled code
+
+Always use `bencher.with_inputs(|| ...)` so fixture construction is excluded from timing:
+
+```rust
+bencher
+    .with_inputs(|| bench_fixture())
+    .bench_refs(|(array, indices)| {
+        array.take(indices.to_array()).unwrap()
+    });
+```
+
+### Exclude `Drop` from measurements
+
+Divan measures only the closure body, **not** the `Drop` of its return value.
+There are two ways to keep expensive drops out of the timed region:
+
+- **Return the value** from the closure — Divan will drop it after timing stops:
+
+  ```rust
+  bencher
+      .with_inputs(|| make_big_vec())
+      .bench_values(|v| transform(v)) // drop of the result is NOT timed
+  ```
+
+- **Use `bench_refs`** — the input is dropped after the entire sample loop, not per-iteration:
+
+  ```rust
+  bencher
+      .with_inputs(|| make_big_vec())
+      .bench_refs(|v| v.sort()) // v is dropped outside the timed region
+  ```
+
+### Black-box inputs to prevent compiler optimization
+
+The compiler can constant-fold or eliminate work if it can prove that inputs are known at
+compile time.
+
+Values provided through `with_inputs` are automatically black-boxed by Divan — no action
+needed:
+
+```rust
+// ✓ `array` and `indices` are automatically black-boxed by Divan
+bencher
+    .with_inputs(|| (&prebuilt_array, &prebuilt_indices))
+    .bench_refs(|(array, indices)| array.take(indices.to_array()).unwrap());
+```
+
+### Captured variables
+
+Variables captured from the surrounding scope are _not_ black-boxed. Wrap them with
+`divan::black_box()` or pass them through `with_inputs` instead:
+
+```rust
+let array = make_array();
+
+// ✗ `array` is captured — the compiler may optimize based on its known contents
+bencher.bench(|| process(&array));
+
+// ✓ Option A: pass through with_inputs
+bencher
+    .with_inputs(|| &array)
+    .bench_refs(|array| process(array));
+
+// ✓ Option B: explicit black_box on the capture
+bencher.bench(|| process(divan::black_box(&array)));
+```
+
+### Return values and manual loops
+
+Return values are automatically black-boxed.
+You only need explicit `black_box` for side-effect-free results inside manual loops:
+
+```rust
+bencher.with_inputs(|| &array).bench_refs(|array| {
+    for idx in 0..len {
+        divan::black_box(array.scalar_at(idx).unwrap());
+    }
+});
+```
+
+### Use deterministic, seeded RNG
+
+Always use `StdRng::seed_from_u64(N)` for reproducible data generation:
+
+```rust
+let mut rng = StdRng::seed_from_u64(0);
+```
+
+### Parameterize with `args`, `consts`, and `types`
+
+Use Divan's parameterization features and define parameter arrays as named constants:
+
+```rust
+const NUM_INDICES: &[usize] = &[1_000, 10_000, 100_000];
+const VECTOR_SIZE: &[usize] = &[16, 256, 2048, 8192];
+
+#[divan::bench(args = NUM_INDICES, consts = VECTOR_SIZE)]
+fn my_bench<const SIZE: usize>(bencher: Bencher, num_indices: usize) { ... }
+```
+
+### Keep per-iteration execution time under ~1 ms
+
+Each individual iteration of the benchmarked closure should complete in
+**less than 1 ms**. This keeps benchmark runs fast, both locally and in CI.
+
+### Gate CodSpeed-incompatible benchmarks
+
+Use `#[cfg(not(codspeed))]` to exclude benchmarks that are incompatible with CodSpeed.
+
+### CodSpeed's single-run model
+
+CI benchmarks run under [CodSpeed's CPU simulation](https://codspeed.io/docs/instruments/cpu),
+which executes each benchmark **exactly once** and estimates CPU cycles from the instruction
+trace — including cache and memory access costs. This has several implications:
+
+- **`sample_count` and `sample_size` have no effect** — CodSpeed always runs one iteration.
+- **Results are deterministic** — the simulated cycle count is derived from the instruction
+  trace, not wall-clock time, so there is no noise from system load or scheduling.
+- **System calls are excluded** — CodSpeed only measures user-space code.
+  Benchmarks that rely on I/O or kernel interactions will not reflect those costs, so they
+  should use the [walltime instrument](https://codspeed.io/docs/instruments/walltime) or be
+  gated with `#[cfg(not(codspeed))]`.
+
+### Prefer `mimalloc` for throughput benchmarks
+
+Throughput benchmarks should use `mimalloc` as the global allocator to reduce system allocator
+noise:
+
+```rust
+use mimalloc::MiMalloc;
+#[global_allocator]
+static GLOBAL: MiMalloc = MiMalloc;
+```
+
 ## SQL Benchmarks
 
 SQL benchmarks measure end-to-end query performance across different engines and file formats.
 
@@ -48,51 +185,11 @@ cargo run --release --bin duckdb-bench --
 ## Orchestrator
 
-The `bench-orchestrator` is a Python CLI tool (`vx-bench`) that coordinates running benchmarks
-across multiple engines. It builds and invokes the per-engine binaries, stores results, and
-provides comparison tooling. This avoids compiling all engines into a single binary, which
-would be slow and create dependency conflicts.
-
-Install it with:
-
-```bash
-uv tool install "bench_orchestrator @ ./bench-orchestrator/"
-```
-
-### Running Benchmarks
-
-```bash
-# Run TPC-H on DataFusion and DuckDB, comparing Parquet and Vortex
-vx-bench run tpch --engine datafusion,duckdb --format parquet,vortex
-
-# Run a subset of queries with fewer iterations
-vx-bench run tpch -q 1,6,12 -i 3
-
-# Run with memory tracking
-vx-bench run tpch --track-memory
-
-# Run with CPU profiling
-vx-bench run tpch --samply
-```
-
-### Comparing Results
-
-```bash
-# Compare formats/engines within the most recent run
-vx-bench compare --run latest
-
-# Compare across two labeled runs
-vx-bench compare --runs baseline,feature
-```
-
-Comparison output is color-coded: green for improvements (>10%), yellow for neutral, red for
-regressions.
-
-### Result Storage
+The `bench-orchestrator` is a Python CLI tool (`vx-bench`) that coordinates running SQL
+benchmarks across multiple engines, stores results, and provides comparison tooling.
-Results are stored as JSON Lines files under `target/vortex-bench/runs/`, with each run
-containing metadata (git commit, timestamp, configuration) and per-query timing data. The
-`vx-bench list` command shows recent runs.
+See [`bench-orchestrator/README.md`](https://github.com/vortex-data/vortex/blob/develop/bench-orchestrator/README.md) for installation,
+commands, and example workflows.
 
 ## CI Benchmarks