perf: Batch writes + pipeline overlap for streaming ingestion#260

Merged
gvanrossum merged 8 commits into microsoft:main from KRRT7:perf/ingestion-throughput
Apr 29, 2026

Conversation

Contributor

@KRRT7 KRRT7 commented Apr 29, 2026

Summary

Batched semref index writes (commits 1-2):

  • Replaces per-entity/per-term individual append() + add_term() calls with bulk extend() + add_terms_batch() during knowledge indexing (see the sketch after this list)
  • Adds _collect_knowledge_refs_and_terms() that gathers all SemanticRefs and terms in memory before writing
  • Adds add_knowledge_batch_to_semantic_ref_index() for both streaming and non-streaming paths
  • Includes a reproducible microbenchmark (tools/benchmark_semref_writes.py) — no API keys needed
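
A minimal, self-contained sketch of the collect-then-flush pattern. The classes below are stand-ins rather than the real typeagent types; only the extend()/add_terms_batch() call shape mirrors the PR description:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticRef:                       # stand-in for the real SemanticRef
    ordinal: int
    message_ordinal: int
    term: str

@dataclass
class RefCollection:                     # stand-in collection with a bulk extend()
    items: list[SemanticRef] = field(default_factory=list)
    def extend(self, refs: list[SemanticRef]) -> None:
        self.items.extend(refs)          # one bulk write instead of N append() calls

@dataclass
class TermIndex:                         # stand-in index with a bulk add_terms_batch()
    entries: list[tuple[str, int]] = field(default_factory=list)
    def add_terms_batch(self, pairs: list[tuple[str, int]]) -> None:
        self.entries.extend(pairs)       # one bulk write instead of N add_term() calls

def collect_refs_and_terms(batch, start_ordinal):
    """Gather everything in memory first (the role _collect_knowledge_refs_and_terms plays)."""
    refs, term_pairs = [], []
    for message_ordinal, terms in batch:
        for term in terms:
            ordinal = start_ordinal + len(refs)      # ordinal the ref will receive
            refs.append(SemanticRef(ordinal, message_ordinal, term))
            term_pairs.append((term, ordinal))
    return refs, term_pairs

refs, term_pairs = collect_refs_and_terms([(0, ["alice", "bob"]), (1, ["bob"])], start_ordinal=0)
collection, index = RefCollection(), TermIndex()
collection.extend(refs)                  # flush 1: all SemanticRefs at once
index.add_terms_batch(term_pairs)        # flush 2: all term postings at once
```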

Pipeline overlap (commit 3):

  • Splits _ingest_batch_streaming into extract/apply/commit phases
  • Batch N+1's LLM extraction runs concurrently with batch N's DB commit via asyncio.create_task (sketched after this list)
  • LLM extraction is ~95% of wall time, so overlapping it with the commit phase nearly eliminates commit latency from the critical path
  • Proper cleanup of orphaned tasks on exception (try/finally with cancel)
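
A rough, hedged sketch of the overlap shape; extract_batch and commit_batch below are placeholder coroutines standing in for the extract and commit phases, not the actual _ingest_batch_streaming code:

```python
import asyncio

async def extract_batch(batch):          # placeholder for the LLM extraction phase
    await asyncio.sleep(0.1)
    return batch

async def commit_batch(extracted):       # placeholder for the DB commit phase
    await asyncio.sleep(0.01)

async def ingest(batches):
    pending_commit: asyncio.Task | None = None
    try:
        for batch in batches:
            extracted = await extract_batch(batch)   # batch N+1 extraction runs while
            if pending_commit is not None:           # batch N's commit task is in flight
                await pending_commit
            pending_commit = asyncio.create_task(commit_batch(extracted))
        if pending_commit is not None:
            await pending_commit                     # drain the final commit
    finally:
        if pending_commit is not None and not pending_commit.done():
            pending_commit.cancel()                  # avoid "task destroyed" warnings

asyncio.run(ingest([["msg-1", "msg-2"], ["msg-3"]]))
```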

DB layer batching (commit 4):

  • mark_sources_ingested_batch(): new bulk method on IStorageProvider, replaces the per-message INSERT loop with executemany() (see the sketch after this list)
  • add_timestamps(): single executemany() instead of N individual UPDATEs
  • add_terms(): batch embedding generation via add_keys() + single executemany() instead of per-term loop
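
A minimal sketch of the executemany() shape, using a hypothetical table and column name rather than the actual storage schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ingested_sources (source_id TEXT PRIMARY KEY)")

def mark_sources_ingested_batch(conn: sqlite3.Connection, source_ids: list[str]) -> None:
    # One executemany() round-trip for the whole batch instead of a per-message INSERT loop.
    conn.executemany(
        "INSERT OR IGNORE INTO ingested_sources (source_id) VALUES (?)",
        [(sid,) for sid in source_ids],
    )
    conn.commit()

mark_sources_ingested_batch(conn, ["msg-1", "msg-2", "msg-3"])
print(conn.execute("SELECT COUNT(*) FROM ingested_sources").fetchone()[0])  # 3
```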

Tests (commit 5):

  • Pipeline overlap: the on_batch_committed callback fires per batch, extraction spans multiple batches, failure ordinals are remapped correctly, and an exception in a later batch preserves earlier batches' results
  • DB batching: mark_sources_ingested_batch covers the basic, empty-input, and idempotency cases
  • Edge case: messages with empty text_chunks skip extraction gracefully

Review fixes (commit 6):

  • Cancel orphaned pending_commit task on exception (avoids "task destroyed" warnings)
  • _ExtractionResult is now frozen=True
  • Document single-connection assumption in _filter_ingested
  • VectorBase.add_keys() returns the freshly computed embeddings to avoid a redundant cache lookup (see the sketch after this list)
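
A sketch of the add_keys() idea; the VectorBase internals here are assumed for illustration, and the point is just that returning the freshly computed embeddings saves callers a second cache lookup:

```python
import numpy as np

class VectorBase:                                    # illustrative, not the real class
    def __init__(self, embed_fn):
        self._embed_fn = embed_fn                    # batched embedding function
        self._cache: dict[str, np.ndarray] = {}

    def add_keys(self, keys: list[str]) -> list[np.ndarray]:
        missing = [k for k in keys if k not in self._cache]
        if missing:
            for key, vec in zip(missing, self._embed_fn(missing)):
                self._cache[key] = vec               # one batched embedding call
        # Returning the vectors in input order lets callers use them directly
        # instead of calling back into the cache for each key.
        return [self._cache[k] for k in keys]

vb = VectorBase(lambda keys: [np.ones(3) * len(k) for k in keys])
vectors = vb.add_keys(["alpha", "beta"])             # embeddings come back immediately
```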

Non-streaming path (commit 7):

  • add_messages_with_indexing now uses mark_sources_ingested_batch
  • add_batch_to_semantic_ref_index and _from_list use bulk writes via add_knowledge_batch_to_semantic_ref_index
  • Removed dead terms_added parameter from batch functions

Reproducible benchmark (semref writes only)

uv run python tools/benchmark_semref_writes.py --chunks 200 --rounds 10

| Chunks | Semrefs | Individual (mean) | Batched (mean) | Speedup |
|-------:|--------:|------------------:|---------------:|--------:|
| 50 | 300 | 88ms | 83ms | 1.06x |
| 200 | 1,200 | 450ms | 368ms | 1.22x |
| 500 | 3,000 | 1,167ms | 1,552ms | 0.75x |

The batched path wins at typical ingestion batch sizes (50-200 chunks) but regresses at 500+ chunks due to in-memory list allocation overhead.

End-to-end benchmark (Adrian podcast, 106 messages)

Azure config: gpt-4o at 450K TPM (Standard SKU), text-embedding-3-small at 120 capacity (Standard/regional SKU), both on <my-azure-resource> East US. Results scale with TPM — at 50K TPM the baseline was ~4.0s/msg; the speedup ratio should hold at any TPM level since the optimization is about overlapping I/O, not reducing API calls.

| Config | Total | Per message | vs baseline |
|---|---|---|---|
| Baseline (batch 10, conc 4) | ~428s | 4.0s | |
| Batch 50, conc 20 | 310s | 2.9s | 1.4x |
| + batched writes | 297s | 2.8s | 1.4x |
| + pipeline overlap + DB batching | 56s | 0.53s | 7.6x |

The pipeline overlap is the big win: batch 0 takes 36.5s (cold start), batch 1 takes 12.2s (extraction overlapped with batch 0's commit), batch 2 takes 7.3s.

Test plan

  • All 687 tests passing (uv run pytest tests/ -q)
  • Reproducible microbenchmark (tools/benchmark_semref_writes.py)
  • End-to-end ingestion benchmark on Adrian podcast

KRRT7 added 4 commits April 29, 2026 03:29
Collects all SemanticRefs and index terms in memory, then flushes via
bulk extend() + add_terms_batch() instead of per-entity/per-term
individual writes. Reduces hundreds of SQLite round-trips per batch
to two. Benchmarked at ~31% faster end-to-end on the Adrian podcast
(428s → 297s at concurrency 20, batch size 50).

Compares per-item writes (inlined pre-optimization logic) against bulk
extend + add_terms_batch. No API keys needed — uses synthetic knowledge
data and the test embedding model.

Split _ingest_batch_streaming into extract/apply/commit phases so
batch N+1's LLM extraction runs concurrently with batch N's DB
transaction via asyncio.create_task. Extraction is 95% of wall time,
so this nearly doubles throughput for multi-batch ingestions.

Replace per-row execute() loops with executemany() in three hot paths:
- mark_sources_ingested_batch: new bulk method on IStorageProvider
- add_timestamps: single executemany instead of N UPDATEs
- add_terms: batch embedding generation + single executemany
@KRRT7 KRRT7 changed the title from "perf: Batch semref index writes during ingestion" to "perf: Batch writes + pipeline overlap for streaming ingestion" on Apr 29, 2026
KRRT7 added 4 commits April 29, 2026 04:41
…return

- Cancel pending_commit on exception to avoid "task destroyed" warnings
- Make _ExtractionResult frozen (never mutated after creation)
- Document single-connection assumption in _filter_ingested
- Return embeddings from VectorBase.add_keys to avoid redundant cache lookup
- Add test for messages with empty text_chunks during extraction
…d plumbing

- add_messages_with_indexing now uses mark_sources_ingested_batch
- add_batch_to_semantic_ref_index and _from_list use bulk writes via
  add_knowledge_batch_to_semantic_ref_index instead of per-item calls
- Remove unused terms_added parameter from public batch functions
@gvanrossum gvanrossum merged commit 115cb0d into microsoft:main Apr 29, 2026
16 checks passed
Comment on lines +225 to +226
concurrently. LLM extraction is typically 95% of wall time, so this
nearly doubles throughput for multi-batch ingestions.
Collaborator

This claim looks misleading. Have you verified this end-to-end reduction in time?

I'd think the old approach would do

[---extraction 95%---][db][---extraction 95%---][db][---extraction 95%---][db]...

where the new approach does (view this in a fixed-width font)

[---extraction 95%---][---extraction 95%---][---extraction 95%---]
                      [db]                  [db]                  [db]

So the overall wall time would be just ~5% faster.

Collaborator

Despite this looking misleading, I have confirmed that with batchsize=50 and concurrency=20, my overall time for ingesting Adrian went down from 88 seconds to 32 seconds. Congrats!

@KRRT7 KRRT7 deleted the perf/ingestion-throughput branch May 1, 2026 00:38