perf: Batch SQLite INSERTs for indexing pipeline #230
Conversation
force-pushed from 19520f3 to e7e804e
gvanrossum left a comment
Thanks for this! It requires more review time than I have right now, so I'll keep it open until I have more time.
Add add_terms_batch / add_properties_batch to the index interfaces with executemany-based SQLite implementations. Restructure add_metadata_to_index_from_list and add_to_property_index to collect all items first, then batch-insert via extend() and the new batch methods. Eliminates ~1000 individual INSERT round-trips during indexing.
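As a rough illustration of the batch-insert pattern (the table schema and the `SqliteTermIndex` class name here are invented for the sketch, not taken from the repo):

```python
import sqlite3
from typing import Sequence

# Minimal sketch of the executemany batching idea; schema and names are
# illustrative, not the repo's actual definitions.
class SqliteTermIndex:
    def __init__(self, db: sqlite3.Connection) -> None:
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS term_index (term TEXT, semref_id INTEGER)"
        )

    def add_term(self, term: str, semref_id: int) -> None:
        # Old hot path: one INSERT round-trip per term.
        self.db.execute(
            "INSERT INTO term_index (term, semref_id) VALUES (?, ?)",
            (term, semref_id),
        )

    def add_terms_batch(self, items: Sequence[tuple[str, int]]) -> None:
        # New path: a single executemany call amortizes statement overhead
        # across the whole batch.
        self.db.executemany(
            "INSERT INTO term_index (term, semref_id) VALUES (?, ?)",
            items,
        )

index = SqliteTermIndex(sqlite3.connect(":memory:"))
index.add_terms_batch([("alice", 1), ("bob", 2)])
```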
Rename _collect_{facet,entity,action}_{terms,properties} to drop the leading underscore in propindex.py and semrefindex.py.
Change list to Sequence in add_terms_batch and add_properties_batch interfaces and implementations to satisfy covariance. Add missing add_terms_batch to FakeTermIndex in conftest.py.
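To illustrate the covariance point, a minimal sketch; `TermRow` and the function names are illustrative, not the repo's types. `list` is invariant, so a `list` of a subtype is rejected where a `list` of the base type is expected, while `Sequence` is covariant because it is read-only:

```python
from typing import Sequence

# Illustrative stand-in types for the variance rule, not real interfaces.
class TermRow: ...
class SpecialTermRow(TermRow): ...

def add_terms_batch_invariant(items: list[TermRow]) -> None: ...
def add_terms_batch_covariant(items: Sequence[TermRow]) -> None: ...

rows: list[SpecialTermRow] = [SpecialTermRow()]

add_terms_batch_invariant(rows)  # rejected by type checkers: list is invariant
add_terms_batch_covariant(rows)  # accepted: Sequence is covariant (read-only)
```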
force-pushed from 4030379 to 82ba650
bmerkle left a comment
Hi @KRRT7
I was asked by @gvanrossum to do a review of this PR.
Please find some comments below.
There are also some pre-existing issues in these files, i.e. things not introduced by this PR, but I would suggest we cover those in a future, separate PR.
I have only mentioned the code duplication, which could possibly also be fixed in this PR.
Please let me know what you think.
Thanks! I'll take a closer look at your review comments this afternoon.
Hi @KRRT7, can you please review my open comments? Thanks :-)
…move imports

- Fix inverse_actions omission in add_metadata_to_index_from_list (regression)
- Fix inverse_actions omission in add_metadata_to_index (pre-existing)
- Delete duplicate add_entity_to_index, add_action_to_index, add_topic_to_index, text_range_from_location — unified into add_entity, add_action, add_topic
- Update all callers and tests to use unified functions
- Move function-level imports to top-level in sqlite/propindex.py per AGENTS.md
@bmerkle Thanks for the thorough review — all comments addressed. Ready for re-review.
**Stack: 4/4** — depends on #230. Merge #231, #229, #230, then this PR.

---

- Five call sites used `get_item()` per scored ref — one SELECT and full deserialization per match (N+1 pattern)
- Added `get_metadata_multiple` to `ISemanticRefCollection` that fetches only `semref_id, range_json, knowledge_type` in a single batch query
- Replaced the N+1 loop with one `get_metadata_multiple` call at each site
- Further optimized scope-filtering: binary search in `contains_range`, inline tuple comparisons in `TextRange`, skip pydantic validation in `get_metadata_multiple`

### Call sites optimized

1. `lookup_term_filtered` — batch metadata, filter by knowledge_type/range
2. `lookup_property_in_property_index` — batch metadata, filter by range scope
3. `SemanticRefAccumulator.group_matches_by_type` — batch metadata, group by knowledge_type
4. `SemanticRefAccumulator.get_matches_in_scope` — batch metadata, filter by range scope
5. `get_scored_semantic_refs_from_ordinals_iter` — two-phase: metadata filter then batch fetch

### Additional optimizations

- **Binary search in `TextRangeCollection.contains_range`**: replaced O(n) linear scan with `bisect_right` keyed on `start`, reducing scope-filtering from ~25ms to ~9ms
- **Inline tuple comparisons in `TextRange`**: replaced `TextLocation` allocations in `__eq__`/`__lt__`/`__contains__` with a shared `_effective_end` returning tuples
- **Skip pydantic validation in `get_metadata_multiple`**: construct `TextLocation`/`TextRange` directly from JSON instead of going through `__pydantic_validator__`

## Benchmark

### Azure Standard_D2s_v5 — 2 vCPU, 8 GiB RAM, Python 3.13

#### Query (pytest-async-benchmark pedantic, 200 rounds)

200 matches against a 200-message indexed SQLite transcript. Only the function under test is timed.

| Function | Before (median) | After (median) | Speedup |
|:---|---:|---:|---:|
| `lookup_term_filtered` | 2.650 ms | 1.184 ms | **2.24x** |
| `group_matches_by_type` | 2.428 ms | 978 μs | **2.48x** |
| `get_scored_semantic_refs_from_ordinals_iter` | 2.541 ms | 2.946 ms | 0.86x |
| `lookup_property_in_property_index` | 25.306 ms | 9.365 ms | **2.70x** |
| `get_matches_in_scope` | 25.011 ms | 9.160 ms | **2.73x** |

<details>
<summary><b>Reproduce the benchmark locally</b></summary>

```bash
pip install 'pytest-async-benchmark @ git+https://github.com/KRRT7/pytest-async-benchmark.git@feat/pedantic-mode' pytest-asyncio
python -m pytest tests/benchmarks/test_benchmark_query.py -v -s
```

</details>

---

*Generated by codeflash optimization agent*

---------

Co-authored-by: Bernhard Merkle <bernhard.merkle@gmail.com>
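To make the binary-search change above concrete, here is a minimal self-contained sketch of a `bisect_right` scope check. It assumes the stored ranges are sorted and non-overlapping, and uses plain integer positions in place of the real `TextLocation`/`TextRange` tuples:

```python
import bisect
from typing import NamedTuple

# Simplified stand-ins: the real code compares (message_ordinal, chunk_ordinal)
# tuples; plain ints keep the sketch short.
class Range(NamedTuple):
    start: int
    end: int  # effective (exclusive) end

class RangeCollection:
    def __init__(self, ranges: list[Range]) -> None:
        # Assumed invariant: stored ranges sorted by start, non-overlapping.
        self._ranges = sorted(ranges)
        self._starts = [r.start for r in self._ranges]

    def contains_range(self, inner: Range) -> bool:
        # bisect_right finds the last stored range with start <= inner.start;
        # under the invariant, only that candidate can contain `inner`,
        # turning the former O(n) scan into O(log n).
        i = bisect.bisect_right(self._starts, inner.start)
        if i == 0:
            return False
        candidate = self._ranges[i - 1]
        return inner.end <= candidate.end

coll = RangeCollection([Range(0, 10), Range(20, 30)])
assert coll.contains_range(Range(2, 5))
assert not coll.contains_range(Range(15, 18))
```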
**Stack: 3/4** — depends on #229. Merge #231, #229, then this PR.

---
- Add `add_terms_batch` and `add_properties_batch` to the `ITermToSemanticRefIndex` and `IPropertyToSemanticRefIndex` interfaces
- SQLite implementations use `executemany` instead of individual `cursor.execute()` calls (~1000+ calls per indexing batch reduced to 2-3)
- Restructure `add_metadata_to_index_from_list` and `add_to_property_index` to collect all data first (pure functions), then batch-insert (sketched below)
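A self-contained sketch of that collect-then-batch shape, as referenced in the last bullet; `SemRef`, `collect_entity_terms`, and `TermIndexStub` are simplified stand-ins for the real types and helpers:

```python
from typing import Iterable, Sequence

# Simplified stand-in for the real semantic-ref type.
class SemRef:
    def __init__(self, semref_id: int, terms: list[str]) -> None:
        self.semref_id = semref_id
        self.terms = terms

def collect_entity_terms(semref: SemRef) -> list[tuple[str, int]]:
    # Pure collector: builds rows, touches no storage.
    return [(term, semref.semref_id) for term in semref.terms]

class TermIndexStub:
    def __init__(self) -> None:
        self.rows: list[tuple[str, int]] = []

    def add_terms_batch(self, items: Sequence[tuple[str, int]]) -> None:
        # The SQLite implementation would do one cursor.executemany() here.
        self.rows.extend(items)

def add_metadata_to_index_from_list(
    semrefs: Iterable[SemRef], index: TermIndexStub
) -> None:
    rows: list[tuple[str, int]] = []
    for semref in semrefs:
        rows.extend(collect_entity_terms(semref))  # accumulate only
    index.add_terms_batch(rows)                    # one batch insert at the end
```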
## Benchmark

### Azure Standard_D2s_v5 — 2 vCPU, 8 GiB RAM, Python 3.13

#### Indexing Pipeline (pytest-async-benchmark pedantic, 20 rounds, 3 warmup)
Only the hot path (`add_messages_with_indexing`) is timed — DB creation, storage init, and teardown are excluded.

The benchmark covers `add_messages_with_indexing` at 200 msgs and at 50 msgs, with a consistent ~14-16% improvement — `executemany` amortizes per-call overhead.

**Reproduce the benchmark locally**
Save the benchmark file below as `tests/benchmarks/test_benchmark_indexing.py`, then run it with pytest.

---

*Generated by codeflash optimization agent*