perf: Batch SQLite INSERTs for indexing pipeline #230
Conversation
force-pushed from 19520f3 to e7e804e
gvanrossum left a comment
Thanks for this! It requires more review time than I have right now, so I'll keep it open until I have more time.
Add add_terms_batch / add_properties_batch to the index interfaces with executemany-based SQLite implementations. Restructure add_metadata_to_index_from_list and add_to_property_index to collect all items first, then batch-insert via extend() and the new batch methods. Eliminates ~1000 individual INSERT round-trips during indexing.
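As a rough illustration of the batch-insert pattern (the table schema and the `SqliteTermIndex` class name here are invented for the sketch, not taken from the repo):

```python
import sqlite3
from typing import Sequence

# Minimal sketch of the executemany batching idea; schema and names are
# illustrative, not the repo's actual definitions.
class SqliteTermIndex:
    def __init__(self, db: sqlite3.Connection) -> None:
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS term_index (term TEXT, semref_id INTEGER)"
        )

    def add_term(self, term: str, semref_id: int) -> None:
        # Old hot path: one INSERT round-trip per term.
        self.db.execute(
            "INSERT INTO term_index (term, semref_id) VALUES (?, ?)",
            (term, semref_id),
        )

    def add_terms_batch(self, items: Sequence[tuple[str, int]]) -> None:
        # New path: a single executemany call amortizes statement overhead
        # across the whole batch.
        self.db.executemany(
            "INSERT INTO term_index (term, semref_id) VALUES (?, ?)",
            items,
        )

index = SqliteTermIndex(sqlite3.connect(":memory:"))
index.add_terms_batch([("alice", 1), ("bob", 2)])
```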
Rename _collect_{facet,entity,action}_{terms,properties} to drop the leading underscore in propindex.py and semrefindex.py.
Change list to Sequence in add_terms_batch and add_properties_batch interfaces and implementations to satisfy covariance. Add missing add_terms_batch to FakeTermIndex in conftest.py.
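To illustrate the covariance point, a minimal sketch; `TermRow` and the function names are illustrative, not the repo's types. `list` is invariant, so a `list` of a subtype is rejected where a `list` of the base type is expected, while `Sequence` is covariant because it is read-only:

```python
from typing import Sequence

# Illustrative stand-in types for the variance rule, not real interfaces.
class TermRow: ...
class SpecialTermRow(TermRow): ...

def add_terms_batch_invariant(items: list[TermRow]) -> None: ...
def add_terms_batch_covariant(items: Sequence[TermRow]) -> None: ...

rows: list[SpecialTermRow] = [SpecialTermRow()]

add_terms_batch_invariant(rows)  # rejected by type checkers: list is invariant
add_terms_batch_covariant(rows)  # accepted: Sequence is covariant (read-only)
```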
force-pushed from 4030379 to 82ba650
bmerkle left a comment
Hi @KRRT7
I was asked by @gvanrossum to do a review of this PR.
Please find some comments below.
There are also some pre-existing issues in these files, i.e. things not introduced by this PR, but I would suggest we cover those in a future, separate PR.
I have only mentioned the code duplication, which could possibly also be fixed in this PR.
Please let me know what you think.
Thanks! I'll take a closer look at your review comments this afternoon.
Hi @KRRT7, can you please review my open comments? Thanks :-)
…move imports

- Fix inverse_actions omission in add_metadata_to_index_from_list (regression)
- Fix inverse_actions omission in add_metadata_to_index (pre-existing)
- Delete duplicate add_entity_to_index, add_action_to_index, add_topic_to_index, text_range_from_location — unified into add_entity, add_action, add_topic
- Update all callers and tests to use unified functions
- Move function-level imports to top-level in sqlite/propindex.py per AGENTS.md
@bmerkle Thanks for the thorough review — all comments addressed. Ready for re-review.
**Stack: 4/4** — depends on #230. Merge #231, #229, #230, then this PR.

---

- Five call sites used `get_item()` per scored ref — one SELECT and full deserialization per match (N+1 pattern)
- Added `get_metadata_multiple` to `ISemanticRefCollection` that fetches only `semref_id, range_json, knowledge_type` in a single batch query
- Replaced the N+1 loop with one `get_metadata_multiple` call at each site
- Further optimized scope-filtering: binary search in `contains_range`, inline tuple comparisons in `TextRange`, skip pydantic validation in `get_metadata_multiple`

### Call sites optimized

1. `lookup_term_filtered` — batch metadata, filter by knowledge_type/range
2. `lookup_property_in_property_index` — batch metadata, filter by range scope
3. `SemanticRefAccumulator.group_matches_by_type` — batch metadata, group by knowledge_type
4. `SemanticRefAccumulator.get_matches_in_scope` — batch metadata, filter by range scope
5. `get_scored_semantic_refs_from_ordinals_iter` — two-phase: metadata filter then batch fetch

### Additional optimizations

- **Binary search in `TextRangeCollection.contains_range`**: replaced O(n) linear scan with `bisect_right` keyed on `start`, reducing scope-filtering from ~25ms to ~9ms
- **Inline tuple comparisons in `TextRange`**: replaced `TextLocation` allocations in `__eq__`/`__lt__`/`__contains__` with a shared `_effective_end` returning tuples
- **Skip pydantic validation in `get_metadata_multiple`**: construct `TextLocation`/`TextRange` directly from JSON instead of going through `__pydantic_validator__`

## Benchmark

### Azure Standard_D2s_v5 — 2 vCPU, 8 GiB RAM, Python 3.13

#### Query (pytest-async-benchmark pedantic, 200 rounds)

200 matches against a 200-message indexed SQLite transcript. Only the function under test is timed.

| Function | Before (median) | After (median) | Speedup |
|:---|---:|---:|---:|
| `lookup_term_filtered` | 2.650 ms | 1.184 ms | **2.24x** |
| `group_matches_by_type` | 2.428 ms | 978 μs | **2.48x** |
| `get_scored_semantic_refs_from_ordinals_iter` | 2.541 ms | 2.946 ms | 0.86x |
| `lookup_property_in_property_index` | 25.306 ms | 9.365 ms | **2.70x** |
| `get_matches_in_scope` | 25.011 ms | 9.160 ms | **2.73x** |

<details>
<summary><b>Reproduce the benchmark locally</b></summary>

```bash
pip install 'pytest-async-benchmark @ git+https://github.com/KRRT7/pytest-async-benchmark.git@feat/pedantic-mode' pytest-asyncio
python -m pytest tests/benchmarks/test_benchmark_query.py -v -s
```

</details>

---

*Generated by codeflash optimization agent*

---------

Co-authored-by: Bernhard Merkle <bernhard.merkle@gmail.com>
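To make the binary-search change above concrete, here is a minimal self-contained sketch of a `bisect_right` scope check. It assumes the stored ranges are sorted and non-overlapping, and uses plain integer positions in place of the real `TextLocation`/`TextRange` tuples:

```python
import bisect
from typing import NamedTuple

# Simplified stand-ins: the real code compares (message_ordinal, chunk_ordinal)
# tuples; plain ints keep the sketch short.
class Range(NamedTuple):
    start: int
    end: int  # effective (exclusive) end

class RangeCollection:
    def __init__(self, ranges: list[Range]) -> None:
        # Assumed invariant: stored ranges sorted by start, non-overlapping.
        self._ranges = sorted(ranges)
        self._starts = [r.start for r in self._ranges]

    def contains_range(self, inner: Range) -> bool:
        # bisect_right finds the last stored range with start <= inner.start;
        # under the invariant, only that candidate can contain `inner`,
        # turning the former O(n) scan into O(log n).
        i = bisect.bisect_right(self._starts, inner.start)
        if i == 0:
            return False
        candidate = self._ranges[i - 1]
        return inner.end <= candidate.end

coll = RangeCollection([Range(0, 10), Range(20, 30)])
assert coll.contains_range(Range(2, 5))
assert not coll.contains_range(Range(15, 18))
```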
**Stack: 3/4** — depends on #229. Merge #231, #229, then this PR.

---
- Add `add_terms_batch` and `add_properties_batch` to the `ITermToSemanticRefIndex` and `IPropertyToSemanticRefIndex` interfaces
- SQLite implementations use `executemany` instead of individual `cursor.execute()` calls (~1000+ calls per indexing batch reduced to 2-3)
- Restructure `add_metadata_to_index_from_list` and `add_to_property_index` to collect all data first (pure functions), then batch-insert (sketched below)
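A self-contained sketch of that collect-then-batch shape, as referenced in the last bullet; `SemRef`, `collect_entity_terms`, and `TermIndexStub` are simplified stand-ins for the real types and helpers:

```python
from typing import Iterable, Sequence

# Simplified stand-in for the real semantic-ref type.
class SemRef:
    def __init__(self, semref_id: int, terms: list[str]) -> None:
        self.semref_id = semref_id
        self.terms = terms

def collect_entity_terms(semref: SemRef) -> list[tuple[str, int]]:
    # Pure collector: builds rows, touches no storage.
    return [(term, semref.semref_id) for term in semref.terms]

class TermIndexStub:
    def __init__(self) -> None:
        self.rows: list[tuple[str, int]] = []

    def add_terms_batch(self, items: Sequence[tuple[str, int]]) -> None:
        # The SQLite implementation would do one cursor.executemany() here.
        self.rows.extend(items)

def add_metadata_to_index_from_list(
    semrefs: Iterable[SemRef], index: TermIndexStub
) -> None:
    rows: list[tuple[str, int]] = []
    for semref in semrefs:
        rows.extend(collect_entity_terms(semref))  # accumulate only
    index.add_terms_batch(rows)                    # one batch insert at the end
```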
## Benchmark

### Azure Standard_D2s_v5 — 2 vCPU, 8 GiB RAM, Python 3.13

#### Indexing Pipeline (pytest-async-benchmark pedantic, 20 rounds, 3 warmup)
Only the hot path (`add_messages_with_indexing`) is timed — DB creation, storage init, and teardown are excluded.

The benchmark covers `add_messages_with_indexing` at 200 msgs and at 50 msgs, with a consistent ~14-16% improvement — `executemany` amortizes per-call overhead.

**Reproduce the benchmark locally**
Save the benchmark file below as `tests/benchmarks/test_benchmark_indexing.py`, then run it with pytest.

---

*Generated by codeflash optimization agent*