perf: Batch writes + pipeline overlap for streaming ingestion #260
Merged
gvanrossum merged 8 commits into microsoft:main on Apr 29, 2026
Conversation
Collects all SemanticRefs and index terms in memory, then flushes via bulk extend() + add_terms_batch() instead of per-entity/per-term individual writes. Reduces hundreds of SQLite round-trips per batch to two. Benchmarked at ~31% faster end-to-end on the Adrian podcast (428s → 297s at concurrency 20, batch size 50).
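For orientation, a minimal sketch of that collect-then-flush shape is below. The extractor, index objects, and their method signatures are assumptions made for illustration; only the extend() and add_terms_batch() names come from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class CollectedWrites:
    # Everything gathered in memory before any DB write.
    semantic_refs: list = field(default_factory=list)
    terms: list = field(default_factory=list)  # (term, semantic_ref_ordinal) pairs


async def ingest_batch(messages, extractor, semref_index, term_index):
    collected = CollectedWrites()

    # Phase 1: collect all SemanticRefs and index terms in memory.
    for message in messages:
        knowledge = await extractor.extract(message)
        for entity in knowledge.entities:
            ordinal = len(collected.semantic_refs)
            collected.semantic_refs.append(entity)
            collected.terms.append((entity.name, ordinal))

    # Phase 2: flush with two bulk calls instead of hundreds of
    # per-entity append() / per-term add_term() round-trips.
    await semref_index.extend(collected.semantic_refs)
    await term_index.add_terms_batch(collected.terms)
```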
Compares per-item writes (inlined pre-optimization logic) against bulk extend + add_terms_batch. No API keys needed — uses synthetic knowledge data and the test embedding model.
Split _ingest_batch_streaming into extract/apply/commit phases so batch N+1's LLM extraction runs concurrently with batch N's DB transaction via asyncio.create_task. Extraction is 95% of wall time, so this nearly doubles throughput for multi-batch ingestions.
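As a rough illustration, the overlap reduces to the pattern below; extract_batch, apply_batch, and commit_batch are stand-ins invented for this sketch, not the actual phase functions inside _ingest_batch_streaming.

```python
import asyncio


async def extract_batch(batch):   # stand-in for LLM extraction (~95% of wall time)
    await asyncio.sleep(0.1)
    return [f"knowledge({m})" for m in batch]


def apply_batch(extracted):       # cheap in-memory staging
    return extracted


async def commit_batch(staged):   # stand-in for the DB transaction (~5% of wall time)
    await asyncio.sleep(0.005)


async def ingest_streaming(batches):
    pending_commit: asyncio.Task | None = None
    for batch in batches:
        # Batch N+1's extraction runs while batch N's commit task is in flight.
        extracted = await extract_batch(batch)
        # Serialize DB writes: wait for the previous commit before applying.
        if pending_commit is not None:
            await pending_commit
        staged = apply_batch(extracted)
        pending_commit = asyncio.create_task(commit_batch(staged))
    if pending_commit is not None:
        await pending_commit


asyncio.run(ingest_streaming([["msg1", "msg2"], ["msg3", "msg4"]]))
```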
Replace per-row execute() loops with executemany() in three hot paths (a generic sketch follows below):
- mark_sources_ingested_batch: new bulk method on IStorageProvider
- add_timestamps: single executemany() instead of N UPDATEs
- add_terms: batch embedding generation + single executemany()
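A generic sqlite3 illustration of the per-row loop vs. executemany() change; the table, columns, and data below are invented for the example and are not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (ordinal INTEGER PRIMARY KEY, start_ts TEXT)")
conn.executemany(
    "INSERT INTO Messages (ordinal, start_ts) VALUES (?, NULL)",
    [(i,) for i in range(100)],
)

rows = [(f"2026-04-{i % 28 + 1:02d}", i) for i in range(100)]

# Before: one round-trip per row.
for ts, ordinal in rows:
    conn.execute("UPDATE Messages SET start_ts = ? WHERE ordinal = ?", (ts, ordinal))

# After: the statement is prepared once and applied to all rows in one call.
conn.executemany("UPDATE Messages SET start_ts = ? WHERE ordinal = ?", rows)
conn.commit()
```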
…return
- Cancel pending_commit on exception to avoid "task destroyed" warnings (see the sketch below)
- Make _ExtractionResult frozen (never mutated after creation)
- Document single-connection assumption in _filter_ingested
- Return embeddings from VectorBase.add_keys to avoid redundant cache lookup
- Add test for messages with empty text_chunks during extraction
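The pending_commit cleanup is essentially the standard asyncio pattern sketched here (a toy example, not the actual code path in this PR):

```python
import asyncio


async def pipeline_with_cleanup():
    # Stand-in for the overlapped commit task from the previous batch.
    pending_commit: asyncio.Task | None = asyncio.create_task(asyncio.sleep(10))
    try:
        raise RuntimeError("extraction failed")  # simulated mid-pipeline failure
    except BaseException:
        # Cancel the in-flight commit instead of dropping the reference,
        # which is what can trigger "Task was destroyed but it is pending!".
        if pending_commit is not None and not pending_commit.done():
            pending_commit.cancel()
        raise


try:
    asyncio.run(pipeline_with_cleanup())
except RuntimeError:
    print("pipeline failed; pending commit was cancelled cleanly")
```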
…d plumbing
- add_messages_with_indexing now uses mark_sources_ingested_batch
- add_batch_to_semantic_ref_index and _from_list use bulk writes via add_knowledge_batch_to_semantic_ref_index instead of per-item calls
- Remove unused terms_added parameter from public batch functions
gvanrossum approved these changes on Apr 29, 2026
gvanrossum reviewed on Apr 29, 2026
Comment on lines +225 to +226:

> concurrently. LLM extraction is typically 95% of wall time, so this
> nearly doubles throughput for multi-batch ingestions.
Collaborator
This claim looks misleading. Have you verified this end-to-end reduction in time?
I'd think the old approach would do
```
[---extraction 95%---][db][---extraction 95%---][db][---extraction 95%---][db]...
```
where the new approach does (view this in a fixed-width font)
```
[---extraction 95%---][---extraction 95%---][---extraction 95%---]
                      [db]                  [db]                  [db]
```
So the overall wall time would be just ~5% faster.
Collaborator
Despite this looking misleading, I have confirmed that with batchsize=50 and concurrency=20, my overall time for ingesting Adrian went down from 88 seconds to 32 seconds. Congrats!
Summary
Batched semref index writes (commits 1-2):
- Replace per-item append() + add_term() calls with bulk extend() + add_terms_batch() during knowledge indexing
- New _collect_knowledge_refs_and_terms() that gathers all SemanticRefs and terms in memory before writing
- New add_knowledge_batch_to_semantic_ref_index() for both streaming and non-streaming paths
- Reproducible benchmark (tools/benchmark_semref_writes.py) — no API keys needed

Pipeline overlap (commit 3):
- Split _ingest_batch_streaming into extract/apply/commit phases
- Batch N+1's LLM extraction overlaps with batch N's DB commit via asyncio.create_task

DB layer batching (commit 4):
- mark_sources_ingested_batch(): new bulk method on IStorageProvider, replaces per-message INSERT loop with executemany()
- add_timestamps(): single executemany() instead of N individual UPDATEs
- add_terms(): batch embedding generation via add_keys() + single executemany() instead of per-term loop

Tests (commit 5):
- on_batch_committed callback fires per-batch; extraction across multiple batches; failure ordinals remapped correctly; exception in a later batch preserves earlier batches
- mark_sources_ingested_batch: basic/empty/idempotent cases
- Messages with empty text_chunks skip extraction gracefully

Review fixes (commit 6):
- Cancel pending_commit task on exception (avoids "task destroyed" warnings)
- _ExtractionResult is now frozen=True
- Document single-connection assumption in _filter_ingested
- VectorBase.add_keys() returns embeddings to avoid redundant cache lookup

Non-streaming path (commit 7):
- add_messages_with_indexing now uses mark_sources_ingested_batch
- add_batch_to_semantic_ref_index and _from_list use bulk writes via add_knowledge_batch_to_semantic_ref_index
- Remove unused terms_added parameter from batch functions

Reproducible benchmark (semref writes only)
The batched path wins at typical ingestion batch sizes (50-200 chunks) but regresses at 500+ chunks due to in-memory list allocation overhead.
End-to-end benchmark (Adrian podcast, 106 messages)
Azure config: gpt-4o at 450K TPM (Standard SKU), text-embedding-3-small at 120 capacity (Standard/regional SKU), both on <my-azure-resource> in East US. Results scale with TPM — at 50K TPM the baseline was ~4.0s/msg; the speedup ratio should hold at any TPM level since the optimization is about overlapping I/O, not reducing API calls.

The pipeline overlap is the big win: batch 0 takes 36.5s (cold start), batch 1 takes 12.2s (extraction overlapped with batch 0's commit), batch 2 takes 7.3s.
Test plan
- Full test suite passes (uv run pytest tests/ -q)
- Semref write benchmark runs without API keys (tools/benchmark_semref_writes.py)