|
1 | 1 | # SQLite-Vec C++ Benchmark Results |
2 | 2 |
|
3 | 3 | **Version**: 0.1.0 |
4 | | -**Date**: 2026-01-05 |
5 | | -**Platform**: x86_64, 48 cores @ 3.8GHz, 32KB L1, 512KB L2, 16MB L3 (Windows 11) |
6 | | -**Compiler**: clang 21.1.6, C++23, Release mode (`-O3`) |
7 | | -**Library**: Google Benchmark 1.9.4 |
| 4 | +**Date**: 2026-01-19 |
| 5 | +**Platform**: Apple M3 Max, 16 cores, 48 GB RAM (macOS 26.2) |
| 6 | +**Compiler**: Apple clang 17.0.0, C++23, Release mode (`-O3` via Meson `buildtype=release`) |
| 7 | +**SIMD**: NEON enabled, ARM DotProd enabled |
| 8 | +**Library**: Google Benchmark 1.8.3 |
8 | 9 |
|
| 10 | +> Note: Google Benchmark prints “Library was built as DEBUG” during these runs. That warning describes how the
| 11 | +> Google Benchmark library itself was built; the benchmarked code uses Meson `buildtype=release` with NEON/DotProd enabled.
9 | 12 |
|
10 | 13 | --- |
11 | 14 |
|
12 | | -## Executive Summary |
| 15 | +## Archive |
13 | 16 |
|
14 | | -The C++ implementation achieves **~2.8M vectors/second sustained throughput** with linear scaling across corpus sizes and embedding dimensions. int8 quantization provides 4x storage reduction at near performance parity. HNSW index recommended for >100K vector corpora. |
| 17 | +Previous benchmark runs are archived in `benchmarks/archive/`. |
15 | 18 |
|
16 | 19 | --- |
17 | 20 |
|
18 | | -## Apple Silicon Results (M1 Pro) |
19 | | - |
20 | | -**Date**: 2026-01-07 |
21 | | -**Platform**: Apple M1 Pro, 16 cores (8P+8E), 192KB L1, 12MB L2, 24MB SLC |
22 | | -**Compiler**: Apple clang 16.0, C++20, Release mode (`-O3 -DNDEBUG`) |
23 | | -**SIMD**: NEON enabled (`-DSQLITE_VEC_ENABLE_NEON`), DotProd enabled (`-march=armv8.2-a+dotprod`) |
24 | | - |
25 | | -### HNSW Index Performance (dim=384, k=10, ef=50) |
26 | | - |
27 | | -| Corpus | Insert Rate | Search QPS | Search Latency | |
28 | | -|--------|-------------|------------|----------------| |
29 | | -| 1,000 | 7,395/s | 16,565 | 60 µs | |
30 | | -| 5,000 | 3,207/s | 8,981 | 111 µs | |
31 | | -| 10,000 | 2,116/s | 6,400 | 156 µs | |
32 | | -| 25,000 | 1,061/s | 3,510 | 285 µs | |
33 | | -| 50,000 | 632/s | 2,369 | 422 µs | |
34 | | - |
35 | | -### Prefetching Impact |
36 | | - |
37 | | -Software prefetching (`__builtin_prefetch`) in beam search provides 9-32% improvement: |
38 | | - |
39 | | -| Corpus | Without Prefetch | With Prefetch | Improvement | |
40 | | -|--------|------------------|---------------|-------------| |
41 | | -| 1,000 | 15,168 QPS | 16,540 QPS | +9% | |
42 | | -| 5,000 | 9,249 QPS | 10,135 QPS | +10% | |
43 | | -| 10,000 | 6,337 QPS | 7,761 QPS | **+22%** | |
44 | | -| 25,000 | 3,140 QPS | 4,139 QPS | **+32%** | |
45 | | -| 50,000 | 2,303 QPS | 2,744 QPS | +19% | |
46 | | -| 100,000| 1,907 QPS | 2,179 QPS | +14% | |
47 | | - |
48 | | -### Batch Search Scaling (10K corpus, 1000 queries) |
49 | | - |
50 | | -Linear scaling with thread count using `search_batch()` API: |
| 21 | +## Batch Distance Benchmark |
51 | 22 |
|
52 | | -| Threads | QPS | Speedup | |
53 | | -|---------|--------|---------| |
54 | | -| 1 (seq) | 5,979 | 1.00x | |
55 | | -| 2 | 12,713 | **2.13x** | |
56 | | -| 4 | 24,377 | **4.08x** | |
57 | | -| 8 | 50,554 | **8.46x** | |
| 23 | +### 1. Sequential vs Batch Comparison |
58 | 24 |
|
59 | | -### SIMD Distance Computation |
| 25 | +| Scenario | Time | Throughput | |
| 26 | +|----------|------|------------| |
| 27 | +| 100×384d (Sequential) | 2.383 µs | 41.96 M/s | |
| 28 | +| 100×384d (Batch) | 2.523 µs | 39.64 M/s | |
| 29 | +| 1K×384d (Sequential) | 25.50 µs | 39.21 M/s | |
| 30 | +| 1K×384d (Batch) | 26.01 µs | 38.45 M/s | |
60 | 31 |
|
61 | | -#### Float32 Cosine Distance (NEON) |
| 32 | +### 2. Memory Layout Optimization |
62 | 33 |
|
63 | | -| Dimensions | NEON (ns) | Scalar (ns) | Speedup | |
64 | | -|------------|-----------|-------------|---------| |
65 | | -| 384 | 29 | 347 | **12x** | |
| 34 | +| Layout | Time | Throughput | |
| 35 | +|--------|------|------------| |
| 36 | +| Contiguous (1K×384d) | 22.75 µs | 43.96 M/s | |
66 | 37 |
|
67 | | -#### Int8 Dot Product (DotProd Instruction) |
| 38 | +### 3. Top‑K Performance (1K×384d, K=10) |
68 | 39 |
|
69 | | -Using `vdotq_s32` (ARMv8.2+) for quantized vectors: |
| 40 | +- **Latency**: 26.85 µs |
| 41 | +- **Throughput**: 37.25 M/s |
70 | 42 |
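For readers unfamiliar with what a brute-force Top‑K pass involves, the following is a minimal sketch, not the library's implementation. The function name `top_k_bruteforce`, the row-major corpus layout, and the squared-L2 metric are all assumptions chosen for illustration; the report does not state which metric this benchmark uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not the sqlite-vec-cpp API): score every corpus row
// against the query, then keep the k closest as (distance, row index) pairs.
std::vector<std::pair<float, std::size_t>> top_k_bruteforce(
    const std::vector<float>& corpus,  // n * dim floats, row-major (assumed layout)
    const std::vector<float>& query,   // at least dim floats
    std::size_t dim, std::size_t k) {
    assert(dim > 0 && query.size() >= dim);
    const std::size_t n = corpus.size() / dim;

    std::vector<std::pair<float, std::size_t>> scored;
    scored.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float* row = corpus.data() + i * dim;
        float dist = 0.0f;  // squared L2 distance, chosen only for illustration
        for (std::size_t d = 0; d < dim; ++d) {
            const float diff = row[d] - query[d];
            dist += diff * diff;
        }
        scored.emplace_back(dist, i);
    }

    k = std::min(k, n);
    // Order only the k smallest entries; the tail is left unsorted.
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end());
    scored.resize(k);
    return scored;
}
```

Calling `top_k_bruteforce(corpus, query, 384, 10)` on a 1K×384d buffer would correspond in shape to the scenario measured above. The selection step touches all N scores but fully orders only K of them.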
|
71 | | -| Method | Time (ns) | Speedup | |
72 | | -|--------------|-----------|---------| |
73 | | -| Scalar | 4.8 | 1.00x | |
74 | | -| NEON DotProd | 0.3 | **17.1x** | |
| 43 | +### 4. Quantization (1K×384d) |
75 | 44 |
|
76 | | -### Optimization Summary (M1 Pro) |
| 45 | +| Type | Time | Throughput | |
| 46 | +|------|------|------------| |
| 47 | +| int8 | 38.22 µs | 26.16 M/s | |
77 | 48 |
|
78 | | -| Optimization | Improvement | Notes | |
79 | | -|--------------|-------------|-------| |
80 | | -| NEON SIMD (float32) | 12x | Cosine distance | |
81 | | -| Prefetching | +9% to +32% | Depends on corpus size | |
82 | | -| Batch search | Linear | 8.46x with 8 threads | |
83 | | -| DotProd (int8) | 17x | ARMv8.2+ required | |
| 49 | +### 5. Large Embeddings (1K×1536d) |
84 | 50 |
|
| 51 | +- **Latency**: 110.9 µs |
| 52 | +- **Throughput**: 9.02 M/s |
85 | 53 |
|
86 | 54 | --- |
87 | 55 |
|
88 | 56 | ## RAG Pipeline Benchmark |
89 | 57 |
|
90 | 58 | ### 1. Corpus Size Scaling (384d, K=5) |
91 | 59 |
|
92 | | -| Corpus | Latency | Throughput | QPS (single-thread) | |
93 | | -|--------|---------|------------|---------------------| |
94 | | -| 1K | 288 μs | 3.51 M/s | ~3,510 queries/sec | |
95 | | -| 10K | 3.63 ms | 2.77 M/s | ~277 queries/sec | |
96 | | -| 100K | 41.0 ms | 2.43 M/s | ~24 queries/sec | |
97 | | - |
98 | | - |
99 | | -**Scaling**: Linear (10x corpus → 10x latency) |
100 | | -**Bottleneck**: Compute-bound (memory bandwidth utilization ~5%) |
101 | | - |
102 | | -### 2. K-Value Scaling (10K docs, 384d) |
| 60 | +| Corpus | Latency | Throughput | |
| 61 | +|--------|---------|------------| |
| 62 | +| 1K | 28.3 µs | 35.35 M/s | |
| 63 | +| 10K | 253 µs | 39.51 M/s | |
| 64 | +| 100K | 5.67 ms | 17.64 M/s | |
103 | 65 |
|
104 | | -| K | Latency | Delta | |
105 | | -|----|---------|-------| |
106 | | -| 1 | 3.92 ms | +7.8% | |
107 | | -| 5 | 3.63 ms | baseline | |
108 | | -| 10 | 3.64 ms | +0.2% | |
109 | | -| 50 | 3.60 ms | -0.9% | |
| 66 | +### 2. K‑Value Scaling (10K docs, 384d) |
110 | 67 |
|
111 | | - |
112 | | -**Conclusion**: Partial sort overhead negligible; K-value has no meaningful impact. |
| 68 | +| K | Latency | Throughput | |
| 69 | +|----|---------|------------| |
| 70 | +| 1 | 305 µs | 32.82 M/s | |
| 71 | +| 5 | 253 µs | 39.51 M/s | |
| 72 | +| 10 | 254 µs | 39.43 M/s | |
| 73 | +| 50 | 342 µs | 29.20 M/s | |
113 | 74 |
|
114 | 75 | ### 3. Embedding Dimension Scaling (10K docs, K=5) |
115 | 76 |
|
116 | | -| Dimensions | Latency | Throughput | Scaling Factor | |
117 | | -|------------|----------|------------|----------------| |
118 | | -| 384d | 3.63 ms | 2.77 M/s | 1.0x | |
119 | | -| 768d | 6.78 ms | 1.53 M/s | 1.87x | |
120 | | -| 1536d | 13.2 ms | 780k/s | 3.64x | |
121 | | - |
122 | | - |
123 | | -**Scaling**: Near-linear (2x dim → 2.06x latency, 4x dim → 4.21x latency) |
124 | | -**Conclusion**: Compute-bound; SIMD efficiency remains high across dimensions. |
| 77 | +| Dimensions | Latency | Throughput | Scaling Factor | |
| 78 | +|------------|---------|------------|----------------| |
| 79 | +| 384d | 253 µs | 39.51 M/s | 1.00x | |
| 80 | +| 768d | 740 µs | 13.51 M/s | 2.92x | |
| 81 | +| 1536d | 1122 µs | 8.91 M/s | 4.43x | |
125 | 82 |
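The Throughput and Scaling Factor columns can be recomputed from the latencies. The sketch below assumes that "Throughput" counts corpus vectors scanned per second, which is consistent with the figures in the table but is not stated explicitly; the function names are illustrative.

```cpp
#include <cassert>

// Assumption: throughput = corpus vectors scanned per second.
// Vectors per microsecond is numerically equal to millions of vectors per second.
inline double throughput_mvps(double vectors_scanned, double latency_us) {
    assert(latency_us > 0.0);
    return vectors_scanned / latency_us;
}

// Latency of a row relative to a baseline row (here, the 384d row).
inline double scaling_factor(double latency_us, double baseline_latency_us) {
    assert(baseline_latency_us > 0.0);
    return latency_us / baseline_latency_us;
}
```

For the 768d row, `throughput_mvps(10000, 740)` gives about 13.51 and `scaling_factor(740, 253)` about 2.92, matching the table. Recomputing the 384d row gives about 39.53 rather than the reported 39.51, presumably because the displayed latency is rounded.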
|
126 | 83 | ### 4. Quantization (10K docs, 384d, K=5) |
127 | 84 |
|
128 | | -| Type | Latency | Throughput | Storage | Overhead | |
129 | | -|-------|---------|------------|---------|----------| |
130 | | -| float | 3.63 ms | 2.77 M/s | 4 bytes | baseline | |
131 | | -| int8 | 3.62 ms | 2.79 M/s | 1 byte | **-0.4%** | |
132 | | - |
133 | | - |
134 | | -**Conclusion**: int8 quantization is **faster** while reducing storage 4x (memory bandwidth savings). |
135 | | - |
136 | | -### 5. Multi-Query Throughput (10K docs, 384d) |
137 | | - |
138 | | -- **10 queries**: 36.2 ms total (3.62 ms/query average) |
139 | | -- **Sustained throughput**: 2.76 M vectors/second |
140 | | -- **QPS**: ~276 queries/second (single-threaded) |
141 | | -- **Parallelization potential**: 48 cores → ~13.2K QPS theoretical |
142 | | - |
143 | | - |
144 | | -### 6. Sequential vs Batch (1K docs, 384d, K=5) |
145 | | - |
146 | | -| Method | Latency | Throughput | |
147 | | -|------------|---------|------------| |
148 | | -| Sequential | 287 μs | 3.51 M/s | |
149 | | -| Batch | 288 μs | 3.51 M/s | |
150 | | - |
151 | | - |
152 | | -**Conclusion**: Batch API provides cleaner code at performance parity (memory-bandwidth bound). |
153 | | - |
154 | | ---- |
155 | | - |
156 | | -## Batch Distance Benchmark |
157 | | - |
158 | | -### 1. Sequential vs Batch Comparison |
159 | | - |
160 | | -| Scenario | Sequential | Batch | Speedup | |
161 | | -|----------|------------|-------|---------| |
162 | | -| 100×384d | 27.8 μs | 28.3 μs | 0.98x | |
163 | | -| 1K×384d | 289 μs | 283 μs | 1.02x | |
164 | | - |
165 | | - |
166 | | -**Conclusion**: Parity performance; both memory-bandwidth limited. |
| 85 | +| Type | Latency | Throughput | |
| 86 | +|-------|---------|------------| |
| 87 | +| float | 253 µs | 39.51 M/s | |
| 88 | +| int8 | 414 µs | 24.13 M/s | |
167 | 89 |
|
168 | | -### 2. Memory Layout Optimization |
169 | | - |
170 | | -| Layout | Latency | Throughput | Improvement | |
171 | | -|-------------|---------|------------|-------------| |
172 | | -| Scattered | 283 μs | 3.54 M/s | baseline | |
173 | | -| Contiguous | 283 μs | 3.54 M/s | +0.0% | |
174 | | - |
175 | | - |
176 | | -**Conclusion**: Marginal improvement; modern CPUs prefetch efficiently. |
177 | | - |
178 | | -### 3. Top-K Performance (1K×384d, K=10) |
179 | | - |
180 | | -- **Latency**: 290 μs (vs 287 μs full distance computation) |
181 | | -- **Overhead**: ~1% for partial sort |
182 | | -- **Conclusion**: `std::partial_sort` highly optimized; K << N has negligible cost. |
183 | | - |
184 | | - |
185 | | -### 4. Large Embeddings (1K×1536d) |
186 | | - |
187 | | -- **Latency**: 1.18 ms |
188 | | -- **Throughput**: 833k vectors/second |
189 | | -- **Scaling**: 4.18x slower than 384d (expected 4.0x) |
| 90 | +### 5. Multi‑Query Throughput (10K docs, 384d, 10 queries) |
190 | 91 |
|
| 92 | +- **Total time**: 3.01 ms |
| 93 | +- **Throughput**: 33.23 M/s |
191 | 94 |
|
192 | 95 | --- |
193 | 96 |
|
194 | | -## HNSW Decision Matrix |
195 | | - |
196 | | -| Corpus Size | Brute-Force Latency | Recommendation | |
197 | | -|-------------|---------------------|----------------| |
198 | | -| <10K | <4ms | ✅ Brute-force optimal | |
199 | | -| 10K-100K | 4-40ms | ⚠️ Brute-force acceptable for batch | |
200 | | -| >100K | >40ms | ❌ HNSW required for real-time (<10ms) | |
201 | | - |
202 | | -**HNSW Threshold**: 100K vectors (~41ms → >10ms target requires ANN index) |
| 97 | +## Filtered Search Benchmark (HNSW, 10K corpus) |
203 | 98 |
|
| 99 | +| Scenario | Time | Throughput | |
| 100 | +|----------|------|------------| |
| 101 | +| No filter | 9.83 ms | 10.17 k/s | |
| 102 | +| Bitset filter 10% | 50.93 ms | 1.96 k/s | |
| 103 | +| Bitset filter 50% | 19.42 ms | 5.15 k/s | |
| 104 | +| Bitset filter 90% | 11.07 ms | 9.03 k/s | |
| 105 | +| Set filter 10% | 65.46 ms | 1.53 k/s | |
| 106 | +| Set filter 50% | 23.39 ms | 4.28 k/s | |
| 107 | +| Set filter 90% | 11.78 ms | 8.49 k/s | |
204 | 108 |
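The table compares a "Bitset filter" with a "Set filter"; the percentage is presumably the share of the corpus that passes the filter. The sketch below shows one way the two membership tests could look. `BitsetFilter`, `SetFilter`, and `apply_filter` are hypothetical names for illustration, since the benchmark's actual filter interface is not shown in this report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical bitset filter: one bit per document id, packed into 64-bit words.
struct BitsetFilter {
    std::vector<std::uint64_t> words;
    explicit BitsetFilter(std::size_t n) : words((n + 63) / 64, 0) {}
    void allow(std::size_t id) {
        assert(id / 64 < words.size());
        words[id / 64] |= (std::uint64_t{1} << (id % 64));
    }
    // Membership is an index computation plus a bit test.
    bool operator()(std::size_t id) const {
        return id / 64 < words.size() && ((words[id / 64] >> (id % 64)) & 1u) != 0;
    }
};

// Hypothetical set filter: allowed ids stored in a hash set.
struct SetFilter {
    std::unordered_set<std::size_t> allowed;
    void allow(std::size_t id) { allowed.insert(id); }
    // Membership requires hashing the id and probing a bucket.
    bool operator()(std::size_t id) const { return allowed.count(id) != 0; }
};

// Keep only the candidates accepted by the filter, preserving their order.
template <typename Filter>
std::vector<std::size_t> apply_filter(const std::vector<std::size_t>& candidates,
                                      const Filter& filter) {
    std::vector<std::size_t> kept;
    for (std::size_t id : candidates) {
        if (filter(id)) kept.push_back(id);
    }
    return kept;
}
```

Both filters accept the same ids; they differ only in the cost of each lookup. That is consistent with the bitset rows being faster than the set rows at every percentage in the table, although the measured gap also depends on how the search loop uses the filter.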
|
205 | 109 | --- |
206 | 110 |
|
207 | | -## Platform-Specific Results |
208 | | - |
209 | | -### SIMD Utilization |
210 | | - |
211 | | -- **AVX**: Active (conditional compilation, `-mavx` detected) |
212 | | -- **NEON**: Not tested (x86_64 platform) |
213 | | -- **Scalar fallback**: Available for non-aligned/small vectors |
214 | | - |
215 | | -### Cache Efficiency |
| 111 | +## HNSW Index Performance |
216 | 112 |
|
217 | | -- **L1 hit rate**: >95% (estimated from throughput consistency) |
218 | | -- **Memory bandwidth**: ~11 GB/s per query (vs 200 GB/s L1 capacity) |
219 | | -- **Conclusion**: Compute-bound, not memory-bound |
| 113 | +The full HNSW benchmark run was stopped because of its long runtime. Partial results are logged in
| 114 | +`benchmarks/logs/2026-01-19_release_neon/hnsw_benchmark.log`. This section will be updated after
| 115 | +the long‑running benchmark has been optimized and re‑run.
220 | 116 |
|
221 | 117 | --- |
222 | 118 |
|
223 | | -## Comparison to Targets |
224 | | - |
225 | | -| Metric | Target | Actual | Status | |
226 | | -|--------|--------|--------|--------| |
227 | | -| 1K corpus (<1ms) | 1000 μs | 288 μs | ✅ **3.5x better** | |
228 | | -| 10K corpus (<5ms) | 5000 μs | 3633 μs | ✅ **1.4x better** | |
229 | | -| 100K corpus (<50ms) | 50000 μs | 41000 μs | ✅ **1.2x better** | |
230 | | -| int8 overhead (<20%) | 20% | -0.4% | ✅ **Faster** | |
231 | | -| Dimension scaling | Linear | Near-linear | ✅ **Good** | |
232 | | - |
233 | | -### Comparison to Previous Results (2025-11-02) |
234 | | - |
235 | | -| Scenario | Previous | Current | Delta | |
236 | | -|----------|----------|---------|-------| |
237 | | -| 1K×384d (K=5) latency | 273 μs | 288 μs | +5.5% | |
238 | | -| 10K×384d (K=5) latency | 2.78 ms | 3.63 ms | +30.6% | |
239 | | -| 100K×384d (K=5) latency | 27.9 ms | 41.0 ms | +46.9% | |
240 | | -| 10K×384d throughput | 3.60 M/s | 2.77 M/s | -23.1% | |
241 | | -| int8 @10K×384d latency | 2.74 ms | 3.62 ms | +32.1% | |
242 | | - |
243 | | -Notes: |
244 | | -- Previous run header (2025-11-02): Linux (x86_64), GCC 15.2.0, Google Benchmark 1.9.1. |
245 | | -- Current run header (2026-01-05): Windows 11 (x86_64), clang 21.1.6, Google Benchmark 1.9.4. |
246 | | -- Treat deltas as environment differences rather than regressions unless measured on the same OS/toolchain. |
| 119 | +## Reproducibility |
247 | 120 |
|
| 121 | +Release build with NEON and DotProd: |
248 | 122 |
|
249 | | - |
250 | | ---- |
251 | | - |
252 | | -## Reproduction |
253 | | - |
254 | | -### Windows (Conan) |
255 | | - |
256 | | -```powershell |
257 | | -# From third_party/sqlite-vec-cpp |
258 | | -
|
259 | | -# Install dependencies (Conan 2) |
260 | | -conan profile detect --force |
261 | | -conan install . -of build_bench_conan -b missing -s build_type=Release -s compiler.cppstd=23 -s compiler.runtime=static |
262 | | -
|
263 | | -# Make Conan-generated .pc files visible to pkg-config for this shell |
264 | | -$env:PKG_CONFIG_PATH = (Resolve-Path .\build_bench_conan) |
265 | | -
|
266 | | -# Configure + build |
267 | | -meson setup build_bench --wipe -Denable_benchmarks=true -Dbuildtype=release |
268 | | -ninja -C build_bench benchmarks/rag_pipeline_benchmark.exe benchmarks/batch_distance_benchmark.exe |
269 | | -
|
270 | | -# Run |
271 | | -.\build_bench\benchmarks\rag_pipeline_benchmark.exe --benchmark_min_time=0.5s |
272 | | -.\build_bench\benchmarks\batch_distance_benchmark.exe --benchmark_min_time=0.5s |
273 | | -
|
274 | | -# JSON output for analysis |
275 | | -.\build_bench\benchmarks\rag_pipeline_benchmark.exe ` |
276 | | - --benchmark_out=results.json ` |
277 | | - --benchmark_out_format=json |
278 | 123 | ``` |
279 | | - |
280 | | -### Linux/macOS (system packages) |
281 | | - |
282 | | -```bash |
283 | | -# Build benchmarks |
284 | | -cd third_party/sqlite-vec-cpp |
285 | | -meson setup build_bench -Denable_benchmarks=true -Dbuildtype=release |
286 | | -ninja -C build_bench |
287 | | - |
288 | | -# Run RAG pipeline benchmark |
289 | | -./build_bench/benchmarks/rag_pipeline_benchmark --benchmark_min_time=0.5s |
290 | | - |
291 | | -# Run batch distance benchmark |
292 | | -./build_bench/benchmarks/batch_distance_benchmark --benchmark_min_time=0.5s |
293 | | - |
294 | | -# JSON output for analysis |
295 | | -./build_bench/benchmarks/rag_pipeline_benchmark \ |
296 | | - --benchmark_out=results.json \ |
297 | | - --benchmark_out_format=json |
| 124 | +meson setup builddir-release -Dbuildtype=release -Denable_benchmarks=true -Denable_simd_neon=true |
| 125 | +meson compile -C builddir-release |
| 126 | +./builddir-release/benchmarks/batch_distance_benchmark |
| 127 | +./builddir-release/benchmarks/rag_pipeline_benchmark |
| 128 | +./builddir-release/benchmarks/filtered_search_benchmark |
298 | 129 | ``` |
299 | 130 |
|
300 | | - |
301 | | ---- |
| 131 | +Logs are stored under `benchmarks/logs/2026-01-19_release_neon/`. |