Commit fc8a55f

updating benchmarks and tests
1 parent 6b4e7a0 commit fc8a55f

12 files changed: +1616 -346 lines changed

BENCHMARKS.md

Lines changed: 77 additions & 247 deletions
@@ -1,301 +1,131 @@
11
# SQLite-Vec C++ Benchmark Results
22

33
**Version**: 0.1.0
4-
**Date**: 2026-01-05
5-
**Platform**: x86_64, 48 cores @ 3.8GHz, 32KB L1, 512KB L2, 16MB L3 (Windows 11)
6-
**Compiler**: clang 21.1.6, C++23, Release mode (`-O3`)
7-
**Library**: Google Benchmark 1.9.4
4+
**Date**: 2026-01-19
5+
**Platform**: Apple M3 Max, 16 cores, 48 GB RAM (macOS 26.2)
6+
**Compiler**: Apple clang 17.0.0, C++23, Release mode (`-O3` via Meson `buildtype=release`)
7+
**SIMD**: NEON enabled, ARM DotProd enabled
8+
**Library**: Google Benchmark 1.8.3
89

10+
> Note: Google Benchmark prints “Library was built as DEBUG” for this run. The warning refers to the build type of the Google Benchmark library itself;
11+
> the benchmarked code was built with Meson `buildtype=release` and NEON/DotProd enabled.
912
1013
---
1114

12-
## Executive Summary
15+
## Archive
1316

14-
The C++ implementation achieves **~2.8M vectors/second sustained throughput** with linear scaling across corpus sizes and embedding dimensions. int8 quantization provides 4x storage reduction at near performance parity. HNSW index recommended for >100K vector corpora.
17+
Previous benchmark runs are archived in `benchmarks/archive/`.
1518

1619
---
1720

18-
## Apple Silicon Results (M1 Pro)
19-
20-
**Date**: 2026-01-07
21-
**Platform**: Apple M1 Pro, 16 cores (8P+8E), 192KB L1, 12MB L2, 24MB SLC
22-
**Compiler**: Apple clang 16.0, C++20, Release mode (`-O3 -DNDEBUG`)
23-
**SIMD**: NEON enabled (`-DSQLITE_VEC_ENABLE_NEON`), DotProd enabled (`-march=armv8.2-a+dotprod`)
24-
25-
### HNSW Index Performance (dim=384, k=10, ef=50)
26-
27-
| Corpus | Insert Rate | Search QPS | Search Latency |
28-
|--------|-------------|------------|----------------|
29-
| 1,000 | 7,395/s | 16,565 | 60 µs |
30-
| 5,000 | 3,207/s | 8,981 | 111 µs |
31-
| 10,000 | 2,116/s | 6,400 | 156 µs |
32-
| 25,000 | 1,061/s | 3,510 | 285 µs |
33-
| 50,000 | 632/s | 2,369 | 422 µs |
34-
35-
### Prefetching Impact
36-
37-
Software prefetching (`__builtin_prefetch`) in beam search provides 9-32% improvement:
38-
39-
| Corpus | Without Prefetch | With Prefetch | Improvement |
40-
|--------|------------------|---------------|-------------|
41-
| 1,000 | 15,168 QPS | 16,540 QPS | +9% |
42-
| 5,000 | 9,249 QPS | 10,135 QPS | +10% |
43-
| 10,000 | 6,337 QPS | 7,761 QPS | **+22%** |
44-
| 25,000 | 3,140 QPS | 4,139 QPS | **+32%** |
45-
| 50,000 | 2,303 QPS | 2,744 QPS | +19% |
46-
| 100,000| 1,907 QPS | 2,179 QPS | +14% |
47-
48-
### Batch Search Scaling (10K corpus, 1000 queries)
49-
50-
Linear scaling with thread count using `search_batch()` API:
21+
## Batch Distance Benchmark
5122

52-
| Threads | QPS | Speedup |
53-
|---------|--------|---------|
54-
| 1 (seq) | 5,979 | 1.00x |
55-
| 2 | 12,713 | **2.13x** |
56-
| 4 | 24,377 | **4.08x** |
57-
| 8 | 50,554 | **8.46x** |
23+
### 1. Sequential vs Batch Comparison
5824

59-
### SIMD Distance Computation
25+
| Scenario | Time | Throughput |
26+
|----------|------|------------|
27+
| 100×384d (Sequential) | 2.383 µs | 41.96 M/s |
28+
| 100×384d (Batch) | 2.523 µs | 39.64 M/s |
29+
| 1K×384d (Sequential) | 25.50 µs | 39.21 M/s |
30+
| 1K×384d (Batch) | 26.01 µs | 38.45 M/s |
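
The Throughput column appears to be the number of vectors processed per iteration divided by the measured time. That relationship is inferred from the figures, not stated in the file. A minimal Python sketch, under that assumption, that recomputes the column for the four rows above:

```python
# Sketch: recompute "Throughput" from "Time" for the table above.
# Assumption (not stated in the source): throughput = vectors per iteration / time.

ROWS = [
    # (scenario, vectors per iteration, time in microseconds, reported M/s)
    ("100x384d sequential", 100, 2.383, 41.96),
    ("100x384d batch", 100, 2.523, 39.64),
    ("1Kx384d sequential", 1_000, 25.50, 39.21),
    ("1Kx384d batch", 1_000, 26.01, 38.45),
]


def throughput_m_per_s(vectors: int, time_us: float) -> float:
    """Return millions of vectors per second."""
    # vectors per microsecond is numerically equal to millions per second
    return vectors / time_us


for name, vectors, time_us, reported in ROWS:
    derived = throughput_m_per_s(vectors, time_us)
    print(f"{name:22s} derived {derived:6.2f} M/s, reported {reported:6.2f} M/s")
```

Small differences in the last digit are expected, because the times in the table are already rounded.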
6031

61-
#### Float32 Cosine Distance (NEON)
32+
### 2. Memory Layout Optimization
6233

63-
| Dimensions | NEON (ns) | Scalar (ns) | Speedup |
64-
|------------|-----------|-------------|---------|
65-
| 384 | 29 | 347 | **12x** |
34+
| Layout | Time | Throughput |
35+
|--------|------|------------|
36+
| Contiguous (1K×384d) | 22.75 µs | 43.96 M/s |
6637

67-
#### Int8 Dot Product (DotProd Instruction)
38+
### 3. Top‑K Performance (1K×384d, K=10)
6839

69-
Using `vdotq_s32` (ARMv8.2+) for quantized vectors:
40+
- **Latency**: 26.85 µs
41+
- **Throughput**: 37.25 M/s
7042

71-
| Method | Time (ns) | Speedup |
72-
|--------------|-----------|---------|
73-
| Scalar | 4.8 | 1.00x |
74-
| NEON DotProd | 0.3 | **17.1x** |
43+
### 4. Quantization (1K×384d)
7544

76-
### Optimization Summary (M1 Pro)
45+
| Type | Time | Throughput |
46+
|------|------|------------|
47+
| int8 | 38.22 µs | 26.16 M/s |
7748

78-
| Optimization | Improvement | Notes |
79-
|--------------|-------------|-------|
80-
| NEON SIMD (float32) | 12x | Cosine distance |
81-
| Prefetching | +9% to +32% | Depends on corpus size |
82-
| Batch search | Linear | 8.46x with 8 threads |
83-
| DotProd (int8) | 17x | ARMv8.2+ required |
49+
### 5. Large Embeddings (1K×1536d)
8450

51+
- **Latency**: 110.9 µs
52+
- **Throughput**: 9.02 M/s
8553

8654
---
8755

8856
## RAG Pipeline Benchmark
8957

9058
### 1. Corpus Size Scaling (384d, K=5)
9159

92-
| Corpus | Latency | Throughput | QPS (single-thread) |
93-
|--------|---------|------------|---------------------|
94-
| 1K | 288 μs | 3.51 M/s | ~3,510 queries/sec |
95-
| 10K | 3.63 ms | 2.77 M/s | ~277 queries/sec |
96-
| 100K | 41.0 ms | 2.43 M/s | ~24 queries/sec |
97-
98-
99-
**Scaling**: Linear (10x corpus → 10x latency)
100-
**Bottleneck**: Compute-bound (memory bandwidth utilization ~5%)
101-
102-
### 2. K-Value Scaling (10K docs, 384d)
60+
| Corpus | Latency | Throughput |
61+
|--------|---------|------------|
62+
| 1K | 28.3 µs | 35.35 M/s |
63+
| 10K | 253 µs | 39.51 M/s |
64+
| 100K | 5.67 ms | 17.64 M/s |
10365

104-
| K | Latency | Delta |
105-
|----|---------|-------|
106-
| 1 | 3.92 ms | +7.8% |
107-
| 5 | 3.63 ms | baseline |
108-
| 10 | 3.64 ms | +0.2% |
109-
| 50 | 3.60 ms | -0.9% |
66+
### 2. K‑Value Scaling (10K docs, 384d)
11067

111-
112-
**Conclusion**: Partial sort overhead negligible; K-value has no meaningful impact.
68+
| K | Latency | Throughput |
69+
|----|---------|------------|
70+
| 1 | 305 µs | 32.82 M/s |
71+
| 5 | 253 µs | 39.51 M/s |
72+
| 10 | 254 µs | 39.43 M/s |
73+
| 50 | 342 µs | 29.20 M/s |
11374

11475
### 3. Embedding Dimension Scaling (10K docs, K=5)
11576

116-
| Dimensions | Latency | Throughput | Scaling Factor |
117-
|------------|----------|------------|----------------|
118-
| 384d | 3.63 ms | 2.77 M/s | 1.0x |
119-
| 768d | 6.78 ms | 1.53 M/s | 1.87x |
120-
| 1536d | 13.2 ms | 780k/s | 3.64x |
121-
122-
123-
**Scaling**: Near-linear (2x dim → 2.06x latency, 4x dim → 4.21x latency)
124-
**Conclusion**: Compute-bound; SIMD efficiency remains high across dimensions.
77+
| Dimensions | Latency | Throughput | Scaling Factor |
78+
|------------|---------|------------|----------------|
79+
| 384d | 253 µs | 39.51 M/s | 1.00x |
80+
| 768d | 740 µs | 13.51 M/s | 2.92x |
81+
| 1536d | 1122 µs | 8.91 M/s | 4.43x |
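
The Scaling Factor column is the latency at each dimension divided by the 384d latency. The Python sketch below reproduces it and also shows how far each factor sits from a purely linear model. The comparison against linear scaling is an added illustration, not a claim made by the file.

```python
# Sketch: latency scaling relative to the 384d baseline (10K docs, K=5).
# Latencies are copied from the table above, in microseconds.

LATENCY_US = {384: 253.0, 768: 740.0, 1536: 1122.0}
BASE_DIM = 384


def scaling_factor(dim: int) -> float:
    """Latency at `dim` divided by latency at the baseline dimension."""
    return LATENCY_US[dim] / LATENCY_US[BASE_DIM]


def relative_to_linear(dim: int) -> float:
    """Scaling factor divided by the dimension ratio (1.0 means linear)."""
    return scaling_factor(dim) / (dim / BASE_DIM)


for dim in sorted(LATENCY_US):
    print(
        f"{dim:5d}d  factor {scaling_factor(dim):.2f}x  "
        f"vs linear {relative_to_linear(dim):.2f}"
    )
```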
12582

12683
### 4. Quantization (10K docs, 384d, K=5)
12784

128-
| Type | Latency | Throughput | Storage | Overhead |
129-
|-------|---------|------------|---------|----------|
130-
| float | 3.63 ms | 2.77 M/s | 4 bytes | baseline |
131-
| int8 | 3.62 ms | 2.79 M/s | 1 byte | **-0.4%** |
132-
133-
134-
**Conclusion**: int8 quantization is **faster** while reducing storage 4x (memory bandwidth savings).
135-
136-
### 5. Multi-Query Throughput (10K docs, 384d)
137-
138-
- **10 queries**: 36.2 ms total (3.62 ms/query average)
139-
- **Sustained throughput**: 2.76 M vectors/second
140-
- **QPS**: ~276 queries/second (single-threaded)
141-
- **Parallelization potential**: 48 cores → ~13.2K QPS theoretical
142-
143-
144-
### 6. Sequential vs Batch (1K docs, 384d, K=5)
145-
146-
| Method | Latency | Throughput |
147-
|------------|---------|------------|
148-
| Sequential | 287 μs | 3.51 M/s |
149-
| Batch | 288 μs | 3.51 M/s |
150-
151-
152-
**Conclusion**: Batch API provides cleaner code at performance parity (memory-bandwidth bound).
153-
154-
---
155-
156-
## Batch Distance Benchmark
157-
158-
### 1. Sequential vs Batch Comparison
159-
160-
| Scenario | Sequential | Batch | Speedup |
161-
|----------|------------|-------|---------|
162-
| 100×384d | 27.8 μs | 28.3 μs | 0.98x |
163-
| 1K×384d | 289 μs | 283 μs | 1.02x |
164-
165-
166-
**Conclusion**: Parity performance; both memory-bandwidth limited.
85+
| Type | Latency | Throughput |
86+
|-------|---------|------------|
87+
| float | 253 µs | 39.51 M/s |
88+
| int8 | 414 µs | 24.13 M/s |
16789

168-
### 2. Memory Layout Optimization
169-
170-
| Layout | Latency | Throughput | Improvement |
171-
|-------------|---------|------------|-------------|
172-
| Scattered | 283 μs | 3.54 M/s | baseline |
173-
| Contiguous | 283 μs | 3.54 M/s | +0.0% |
174-
175-
176-
**Conclusion**: Marginal improvement; modern CPUs prefetch efficiently.
177-
178-
### 3. Top-K Performance (1K×384d, K=10)
179-
180-
- **Latency**: 290 μs (vs 287 μs full distance computation)
181-
- **Overhead**: ~1% for partial sort
182-
- **Conclusion**: `std::partial_sort` highly optimized; K << N has negligible cost.
183-
184-
185-
### 4. Large Embeddings (1K×1536d)
186-
187-
- **Latency**: 1.18 ms
188-
- **Throughput**: 833k vectors/second
189-
- **Scaling**: 4.18x slower than 384d (expected 4.0x)
90+
### 5. Multi‑Query Throughput (10K docs, 384d, 10 queries)
19091

92+
- **Total time**: 3.01 ms
93+
- **Throughput**: 33.23 M/s
19194

19295
---
19396

194-
## HNSW Decision Matrix
195-
196-
| Corpus Size | Brute-Force Latency | Recommendation |
197-
|-------------|---------------------|----------------|
198-
| <10K | <4ms | ✅ Brute-force optimal |
199-
| 10K-100K | 4-40ms | ⚠️ Brute-force acceptable for batch |
200-
| >100K | >40ms | ❌ HNSW required for real-time (<10ms) |
201-
202-
**HNSW Threshold**: 100K vectors (~41ms → >10ms target requires ANN index)
97+
## Filtered Search Benchmark (HNSW, 10K corpus)
20398

99+
| Scenario | Time | Throughput |
100+
|----------|------|------------|
101+
| No filter | 9.83 ms | 10.17 k/s |
102+
| Bitset filter 10% | 50.93 ms | 1.96 k/s |
103+
| Bitset filter 50% | 19.42 ms | 5.15 k/s |
104+
| Bitset filter 90% | 11.07 ms | 9.03 k/s |
105+
| Set filter 10% | 65.46 ms | 1.53 k/s |
106+
| Set filter 50% | 23.39 ms | 4.28 k/s |
107+
| Set filter 90% | 11.78 ms | 8.49 k/s |
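
To compare the filter variants, it can help to express each row as a slowdown relative to the unfiltered search, and the set filter relative to the bitset filter at the same percentage. The Python sketch below does that with the times from the table. The percentage labels are used as they appear, without assuming what fraction of the corpus they keep or exclude.

```python
# Sketch: slowdown of each filtered scenario relative to "No filter".
# Times are copied from the table above, in milliseconds.

NO_FILTER_MS = 9.83
BITSET_MS = {10: 50.93, 50: 19.42, 90: 11.07}
SET_MS = {10: 65.46, 50: 23.39, 90: 11.78}


def slowdown(time_ms: float, baseline_ms: float = NO_FILTER_MS) -> float:
    """How many times slower `time_ms` is than the baseline."""
    return time_ms / baseline_ms


for pct in sorted(BITSET_MS):
    bitset = slowdown(BITSET_MS[pct])
    hash_set = slowdown(SET_MS[pct])
    set_vs_bitset = SET_MS[pct] / BITSET_MS[pct]
    print(
        f"{pct:2d}%  bitset {bitset:.2f}x  set {hash_set:.2f}x  "
        f"set/bitset {set_vs_bitset:.2f}x"
    )
```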
204108

205109
---
206110

207-
## Platform-Specific Results
208-
209-
### SIMD Utilization
210-
211-
- **AVX**: Active (conditional compilation, `-mavx` detected)
212-
- **NEON**: Not tested (x86_64 platform)
213-
- **Scalar fallback**: Available for non-aligned/small vectors
214-
215-
### Cache Efficiency
111+
## HNSW Index Performance
216112

217-
- **L1 hit rate**: >95% (estimated from throughput consistency)
218-
- **Memory bandwidth**: ~11 GB/s per query (vs 200 GB/s L1 capacity)
219-
- **Conclusion**: Compute-bound, not memory-bound
113+
The full HNSW benchmark run was stopped because of its long runtime. Partial results are logged in
114+
`benchmarks/logs/2026-01-19_release_neon/hnsw_benchmark.log`. This section will be updated after
115+
the long‑running benchmark is optimized and re‑run.
220116

221117
---
222118

223-
## Comparison to Targets
224-
225-
| Metric | Target | Actual | Status |
226-
|--------|--------|--------|--------|
227-
| 1K corpus (<1ms) | 1000 μs | 288 μs |**3.5x better** |
228-
| 10K corpus (<5ms) | 5000 μs | 3633 μs |**1.4x better** |
229-
| 100K corpus (<50ms) | 50000 μs | 41000 μs |**1.2x better** |
230-
| int8 overhead (<20%) | 20% | -0.4% |**Faster** |
231-
| Dimension scaling | Linear | Near-linear |**Good** |
232-
233-
### Comparison to Previous Results (2025-11-02)
234-
235-
| Scenario | Previous | Current | Delta |
236-
|----------|----------|---------|-------|
237-
| 1K×384d (K=5) latency | 273 μs | 288 μs | +5.5% |
238-
| 10K×384d (K=5) latency | 2.78 ms | 3.63 ms | +30.6% |
239-
| 100K×384d (K=5) latency | 27.9 ms | 41.0 ms | +46.9% |
240-
| 10K×384d throughput | 3.60 M/s | 2.77 M/s | -23.1% |
241-
| int8 @10K×384d latency | 2.74 ms | 3.62 ms | +32.1% |
242-
243-
Notes:
244-
- Previous run header (2025-11-02): Linux (x86_64), GCC 15.2.0, Google Benchmark 1.9.1.
245-
- Current run header (2026-01-05): Windows 11 (x86_64), clang 21.1.6, Google Benchmark 1.9.4.
246-
- Treat deltas as environment differences rather than regressions unless measured on the same OS/toolchain.
119+
## Reproducibility
247120

121+
Release build with NEON and DotProd:
248122

249-
250-
---
251-
252-
## Reproduction
253-
254-
### Windows (Conan)
255-
256-
```powershell
257-
# From third_party/sqlite-vec-cpp
258-
259-
# Install dependencies (Conan 2)
260-
conan profile detect --force
261-
conan install . -of build_bench_conan -b missing -s build_type=Release -s compiler.cppstd=23 -s compiler.runtime=static
262-
263-
# Make Conan-generated .pc files visible to pkg-config for this shell
264-
$env:PKG_CONFIG_PATH = (Resolve-Path .\build_bench_conan)
265-
266-
# Configure + build
267-
meson setup build_bench --wipe -Denable_benchmarks=true -Dbuildtype=release
268-
ninja -C build_bench benchmarks/rag_pipeline_benchmark.exe benchmarks/batch_distance_benchmark.exe
269-
270-
# Run
271-
.\build_bench\benchmarks\rag_pipeline_benchmark.exe --benchmark_min_time=0.5s
272-
.\build_bench\benchmarks\batch_distance_benchmark.exe --benchmark_min_time=0.5s
273-
274-
# JSON output for analysis
275-
.\build_bench\benchmarks\rag_pipeline_benchmark.exe `
276-
--benchmark_out=results.json `
277-
--benchmark_out_format=json
278123
```
279-
280-
### Linux/macOS (system packages)
281-
282-
```bash
283-
# Build benchmarks
284-
cd third_party/sqlite-vec-cpp
285-
meson setup build_bench -Denable_benchmarks=true -Dbuildtype=release
286-
ninja -C build_bench
287-
288-
# Run RAG pipeline benchmark
289-
./build_bench/benchmarks/rag_pipeline_benchmark --benchmark_min_time=0.5s
290-
291-
# Run batch distance benchmark
292-
./build_bench/benchmarks/batch_distance_benchmark --benchmark_min_time=0.5s
293-
294-
# JSON output for analysis
295-
./build_bench/benchmarks/rag_pipeline_benchmark \
296-
--benchmark_out=results.json \
297-
--benchmark_out_format=json
124+
meson setup builddir-release -Dbuildtype=release -Denable_benchmarks=true -Denable_simd_neon=true
125+
meson compile -C builddir-release
126+
./builddir-release/benchmarks/batch_distance_benchmark
127+
./builddir-release/benchmarks/rag_pipeline_benchmark
128+
./builddir-release/benchmarks/filtered_search_benchmark
298129
```
299130

300-
301-
---
131+
Logs are stored under `benchmarks/logs/2026-01-19_release_neon/`.
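
For later analysis, Google Benchmark can also write results as JSON via `--benchmark_out=<file>` and `--benchmark_out_format=json` (the flags used in the earlier version of this file). The Python sketch below shows one way such a report could be normalized to microseconds. The embedded sample, including the benchmark names and numbers in it, is hypothetical and stands in for a real output file. The field names (`benchmarks`, `name`, `real_time`, `time_unit`, `library_build_type`) follow Google Benchmark's JSON reporter as I understand it, so check them against an actual report.

```python
import json

# Hypothetical sample standing in for a file written with
# --benchmark_out=results.json --benchmark_out_format=json.
SAMPLE = """
{
  "context": {"library_build_type": "debug"},
  "benchmarks": [
    {"name": "BM_Example/1000", "iterations": 1000,
     "real_time": 25.5, "cpu_time": 25.4, "time_unit": "us"},
    {"name": "BM_Example/10000", "iterations": 100,
     "real_time": 0.253, "cpu_time": 0.252, "time_unit": "ms"}
  ]
}
"""

# Multipliers that convert each reported unit to microseconds.
TO_MICROSECONDS = {"ns": 1e-3, "us": 1.0, "ms": 1e3, "s": 1e6}


def real_times_us(report: dict) -> dict[str, float]:
    """Map each benchmark name to its real time in microseconds."""
    return {
        entry["name"]: entry["real_time"] * TO_MICROSECONDS[entry["time_unit"]]
        for entry in report["benchmarks"]
    }


report = json.loads(SAMPLE)
print("library build type:", report["context"].get("library_build_type"))
for name, time_us in real_times_us(report).items():
    print(f"{name}: {time_us:.2f} us")
```

To use it on a real run, replace `json.loads(SAMPLE)` with `json.load(open("results.json"))`.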
