# Benchmarking

Vortex has two categories of benchmarks: microbenchmarks for individual operations, and SQL
benchmarks for end-to-end query performance.

## Microbenchmarks

Microbenchmarks use the Divan framework and live in `benches/` directories within individual
crates. They cover low-level operations such as encoding, decoding, compute kernels, buffer
operations, and scalar access.

Run microbenchmarks for a specific crate with:

```bash
cargo bench -p <crate-name>
```

## Best Practices

### Separate setup from profiled code

Always use `bencher.with_inputs(|| ...)` so fixture construction is excluded from timing:

```rust
bencher
    .with_inputs(|| bench_fixture())
.bench_refs(|(array, indices)| {
array.take(indices.to_array()).unwrap()
});
```

### Exclude `Drop` from measurements

Divan measures only the closure body, **not** the `Drop` of its return value.

- **Return the value** from the closure — Divan will drop it after timing stops:

```rust
bencher
.with_inputs(|| make_big_vec())
.bench_values(|v| transform(v)) // drop of the result is NOT timed
```

- **Use `bench_refs`** — the input is dropped after the entire sample loop, not per-iteration:

```rust
bencher
.with_inputs(|| make_big_vec())
.bench_refs(|v| v.sort()) // v is dropped outside the timed region
```

Structure your benchmark so that expensive drops happen via the return value or via `bench_refs` inputs.

### Black-box inputs to prevent compiler optimization

The compiler can constant-fold or eliminate work if it can prove that inputs are known at
compile time.

Values provided through `with_inputs` are automatically black-boxed by Divan — no action
needed:

```rust
// ✓ `array` and `indices` are automatically black-boxed by Divan
bencher
.with_inputs(|| (&prebuilt_array, &prebuilt_indices))
.bench_refs(|(array, indices)| array.take(indices.to_array()).unwrap());
```

### Captured variables

Variables captured from the surrounding scope are _not_ black-boxed. Wrap them with
`divan::black_box()` or pass them through `with_inputs` instead:

```rust
let array = make_array();

// ✗ `array` is captured — the compiler may optimize based on its known contents
bencher.bench(|| process(&array));

// ✓ Option A: pass through with_inputs
bencher
.with_inputs(|| &array)
.bench_refs(|array| process(array));

// ✓ Option B: explicit black_box on the capture
bencher.bench(|| process(divan::black_box(&array)));
```

### Return values and manual loops

Return values are automatically black-boxed. You only need explicit
`black_box` for side-effect-free results inside manual loops:

```rust
bencher.with_inputs(|| &array).bench_refs(|array| {
for idx in 0..len {
divan::black_box(array.scalar_at(idx).unwrap());
}
});
```

### Use deterministic, seeded RNG

Always use `StdRng::seed_from_u64(N)` for reproducible data generation:

```rust
let mut rng = StdRng::seed_from_u64(0);
```

### Parameterize with `args`, `consts`, and `types`

Use Divan's parameterization features and define parameter arrays as named constants:

```rust
const NUM_INDICES: &[usize] = &[1_000, 10_000, 100_000];
const VECTOR_SIZE: &[usize] = &[16, 256, 2048, 8192];

#[divan::bench(args = NUM_INDICES, consts = VECTOR_SIZE)]
fn my_bench<const N: usize>(bencher: Bencher, num_indices: usize) { ... }
```

### Keep per-iteration execution time under ~1 ms

Each individual iteration of the benchmarked closure should complete in
**less than 1 ms**. This keeps benchmark runs fast, both locally and on CI.
### Gate CodSpeed-incompatible benchmarks

Use `#[cfg(not(codspeed))]` for benchmarks that are incompatible with CodSpeed.

### CodSpeed's single-run model

CI benchmarks run under [CodSpeed's CPU simulation](https://codspeed.io/docs/instruments/cpu),
which executes each benchmark **exactly once** and estimates CPU cycles from the instruction
trace — including cache and memory access costs. This has several implications:

- **`sample_count` and `sample_size` have no effect** — CodSpeed always runs one iteration.
- **Results are deterministic** — the simulated cycle count is derived from the instruction
trace, not wall-clock time, so there is no noise from system load or scheduling.
- **System calls are excluded** — CodSpeed only measures user-space code. Benchmarks that
rely on I/O or kernel interactions will not reflect those costs, so they should use the
[walltime instrument](https://codspeed.io/docs/instruments/walltime) or be gated with
`#[cfg(not(codspeed))]`.

### Prefer `mimalloc` for throughput benchmarks

Throughput benchmarks should use `mimalloc` as the global allocator to reduce system allocator
noise:

```rust
use mimalloc::MiMalloc;
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```

## SQL Benchmarks

SQL benchmarks measure end-to-end query performance across different engines and file formats.

## Orchestrator

The `bench-orchestrator` is a Python CLI tool (`vx-bench`) that coordinates running SQL
benchmarks across multiple engines, stores results, and provides comparison tooling.

Results are stored as JSON Lines files under `target/vortex-bench/runs/`, with each run
containing metadata (git commit, timestamp, configuration) and per-query timing data. The
`vx-bench list` command shows recent runs.
See [`bench-orchestrator/README.md`](https://github.com/vortex-data/vortex/blob/develop/bench-orchestrator/README.md) for installation,
commands, and example workflows.

## CI Benchmarks
