perf: Batch writes + pipeline overlap for streaming ingestion#260

Merged
gvanrossum merged 8 commits into microsoft:main from KRRT7:perf/ingestion-throughput
Apr 29, 2026

Conversation

Contributor

@KRRT7 KRRT7 commented Apr 29, 2026

Summary

Batched semref index writes (commits 1-2):

  • Replaces per-entity/per-term individual append() + add_term() calls with bulk extend() + add_terms_batch() during knowledge indexing (see the sketch after this list)
  • Adds _collect_knowledge_refs_and_terms() that gathers all SemanticRefs and terms in memory before writing
  • Adds add_knowledge_batch_to_semantic_ref_index() for both streaming and non-streaming paths
  • Includes a reproducible microbenchmark (tools/benchmark_semref_writes.py) — no API keys needed
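
A minimal, self-contained sketch of the collect-then-flush pattern. The classes below are stand-ins rather than the real typeagent types; only the extend()/add_terms_batch() call shape mirrors the PR description:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticRef:                       # stand-in for the real SemanticRef
    ordinal: int
    message_ordinal: int
    term: str

@dataclass
class RefCollection:                     # stand-in collection with a bulk extend()
    items: list[SemanticRef] = field(default_factory=list)
    def extend(self, refs: list[SemanticRef]) -> None:
        self.items.extend(refs)          # one bulk write instead of N append() calls

@dataclass
class TermIndex:                         # stand-in index with a bulk add_terms_batch()
    entries: list[tuple[str, int]] = field(default_factory=list)
    def add_terms_batch(self, pairs: list[tuple[str, int]]) -> None:
        self.entries.extend(pairs)       # one bulk write instead of N add_term() calls

def collect_refs_and_terms(batch, start_ordinal):
    """Gather everything in memory first (the role _collect_knowledge_refs_and_terms plays)."""
    refs, term_pairs = [], []
    for message_ordinal, terms in batch:
        for term in terms:
            ordinal = start_ordinal + len(refs)      # ordinal the ref will receive
            refs.append(SemanticRef(ordinal, message_ordinal, term))
            term_pairs.append((term, ordinal))
    return refs, term_pairs

refs, term_pairs = collect_refs_and_terms([(0, ["alice", "bob"]), (1, ["bob"])], start_ordinal=0)
collection, index = RefCollection(), TermIndex()
collection.extend(refs)                  # flush 1: all SemanticRefs at once
index.add_terms_batch(term_pairs)        # flush 2: all term postings at once
```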

Pipeline overlap (commit 3):

  • Splits _ingest_batch_streaming into extract/apply/commit phases
  • Batch N+1's LLM extraction runs concurrently with batch N's DB commit via asyncio.create_task (sketched after this list)
  • LLM extraction is ~95% of wall time, so overlapping it with the commit phase nearly eliminates commit latency from the critical path
  • Proper cleanup of orphaned tasks on exception (try/finally with cancel)
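
A rough, hedged sketch of the overlap shape; extract_batch and commit_batch below are placeholder coroutines standing in for the extract and commit phases, not the actual _ingest_batch_streaming code:

```python
import asyncio

async def extract_batch(batch):          # placeholder for the LLM extraction phase
    await asyncio.sleep(0.1)
    return batch

async def commit_batch(extracted):       # placeholder for the DB commit phase
    await asyncio.sleep(0.01)

async def ingest(batches):
    pending_commit: asyncio.Task | None = None
    try:
        for batch in batches:
            extracted = await extract_batch(batch)   # batch N+1 extraction runs while
            if pending_commit is not None:           # batch N's commit task is in flight
                await pending_commit
            pending_commit = asyncio.create_task(commit_batch(extracted))
        if pending_commit is not None:
            await pending_commit                     # drain the final commit
    finally:
        if pending_commit is not None and not pending_commit.done():
            pending_commit.cancel()                  # avoid "task destroyed" warnings

asyncio.run(ingest([["msg-1", "msg-2"], ["msg-3"]]))
```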

DB layer batching (commit 4):

  • mark_sources_ingested_batch(): new bulk method on IStorageProvider, replaces the per-message INSERT loop with executemany() (see the sketch after this list)
  • add_timestamps(): single executemany() instead of N individual UPDATEs
  • add_terms(): batch embedding generation via add_keys() + single executemany() instead of per-term loop
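
A minimal sketch of the executemany() shape, using a hypothetical table and column name rather than the actual storage schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ingested_sources (source_id TEXT PRIMARY KEY)")

def mark_sources_ingested_batch(conn: sqlite3.Connection, source_ids: list[str]) -> None:
    # One executemany() round-trip for the whole batch instead of a per-message INSERT loop.
    conn.executemany(
        "INSERT OR IGNORE INTO ingested_sources (source_id) VALUES (?)",
        [(sid,) for sid in source_ids],
    )
    conn.commit()

mark_sources_ingested_batch(conn, ["msg-1", "msg-2", "msg-3"])
print(conn.execute("SELECT COUNT(*) FROM ingested_sources").fetchone()[0])  # 3
```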

Tests (commit 5):

  • Pipeline overlap: the on_batch_committed callback fires per batch, extraction spans multiple batches, failure ordinals are remapped correctly, and an exception in a later batch preserves earlier batches' results
  • DB batching: mark_sources_ingested_batch covers the basic, empty-input, and idempotency cases
  • Edge case: messages with empty text_chunks skip extraction gracefully

Review fixes (commit 6):

  • Cancel orphaned pending_commit task on exception (avoids "task destroyed" warnings)
  • _ExtractionResult is now frozen=True
  • Document single-connection assumption in _filter_ingested
  • VectorBase.add_keys() returns the freshly computed embeddings to avoid a redundant cache lookup (see the sketch after this list)
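
A sketch of the add_keys() idea; the VectorBase internals here are assumed for illustration, and the point is just that returning the freshly computed embeddings saves callers a second cache lookup:

```python
import numpy as np

class VectorBase:                                    # illustrative, not the real class
    def __init__(self, embed_fn):
        self._embed_fn = embed_fn                    # batched embedding function
        self._cache: dict[str, np.ndarray] = {}

    def add_keys(self, keys: list[str]) -> list[np.ndarray]:
        missing = [k for k in keys if k not in self._cache]
        if missing:
            for key, vec in zip(missing, self._embed_fn(missing)):
                self._cache[key] = vec               # one batched embedding call
        # Returning the vectors in input order lets callers use them directly
        # instead of calling back into the cache for each key.
        return [self._cache[k] for k in keys]

vb = VectorBase(lambda keys: [np.ones(3) * len(k) for k in keys])
vectors = vb.add_keys(["alpha", "beta"])             # embeddings come back immediately
```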

Non-streaming path (commit 7):

  • add_messages_with_indexing now uses mark_sources_ingested_batch
  • add_batch_to_semantic_ref_index and _from_list use bulk writes via add_knowledge_batch_to_semantic_ref_index
  • Removed dead terms_added parameter from batch functions

Reproducible benchmark (semref writes only)

uv run python tools/benchmark_semref_writes.py --chunks 200 --rounds 10

| Chunks | Semrefs | Individual (mean) | Batched (mean) | Speedup |
|-------:|--------:|------------------:|---------------:|--------:|
| 50 | 300 | 88ms | 83ms | 1.06x |
| 200 | 1,200 | 450ms | 368ms | 1.22x |
| 500 | 3,000 | 1,167ms | 1,552ms | 0.75x |

The batched path wins at typical ingestion batch sizes (50-200 chunks) but regresses at 500+ chunks due to in-memory list allocation overhead.

End-to-end benchmark (Adrian podcast, 106 messages)

Azure config: gpt-4o at 450K TPM (Standard SKU), text-embedding-3-small at 120 capacity (Standard/regional SKU), both on <my-azure-resource> East US. Results scale with TPM — at 50K TPM the baseline was ~4.0s/msg; the speedup ratio should hold at any TPM level since the optimization is about overlapping I/O, not reducing API calls.

| Config | Total | Per message | vs baseline |
|---|---|---|---|
| Baseline (batch 10, conc 4) | ~428s | 4.0s | |
| Batch 50, conc 20 | 310s | 2.9s | 1.4x |
| + batched writes | 297s | 2.8s | 1.4x |
| + pipeline overlap + DB batching | 56s | 0.53s | 7.6x |

The pipeline overlap is the big win: batch 0 takes 36.5s (cold start), batch 1 takes 12.2s (extraction overlapped with batch 0's commit), batch 2 takes 7.3s.

Test plan

  • All 687 tests passing (uv run pytest tests/ -q)
  • Reproducible microbenchmark (tools/benchmark_semref_writes.py)
  • End-to-end ingestion benchmark on Adrian podcast

KRRT7 added 4 commits April 29, 2026 03:29
Collects all SemanticRefs and index terms in memory, then flushes via
bulk extend() + add_terms_batch() instead of per-entity/per-term
individual writes. Reduces hundreds of SQLite round-trips per batch
to two. Benchmarked at ~31% faster end-to-end on the Adrian podcast
(428s → 297s at concurrency 20, batch size 50).

Compares per-item writes (inlined pre-optimization logic) against bulk
extend + add_terms_batch. No API keys needed — uses synthetic knowledge
data and the test embedding model.

Split _ingest_batch_streaming into extract/apply/commit phases so
batch N+1's LLM extraction runs concurrently with batch N's DB
transaction via asyncio.create_task. Extraction is 95% of wall time,
so this nearly doubles throughput for multi-batch ingestions.

Replace per-row execute() loops with executemany() in three hot paths:
- mark_sources_ingested_batch: new bulk method on IStorageProvider
- add_timestamps: single executemany instead of N UPDATEs
- add_terms: batch embedding generation + single executemany
@KRRT7 KRRT7 changed the title from "perf: Batch semref index writes during ingestion" to "perf: Batch writes + pipeline overlap for streaming ingestion" on Apr 29, 2026
KRRT7 added 4 commits April 29, 2026 04:41
…return

- Cancel pending_commit on exception to avoid "task destroyed" warnings
- Make _ExtractionResult frozen (never mutated after creation)
- Document single-connection assumption in _filter_ingested
- Return embeddings from VectorBase.add_keys to avoid redundant cache lookup
- Add test for messages with empty text_chunks during extraction
…d plumbing

- add_messages_with_indexing now uses mark_sources_ingested_batch
- add_batch_to_semantic_ref_index and _from_list use bulk writes via
  add_knowledge_batch_to_semantic_ref_index instead of per-item calls
- Remove unused terms_added parameter from public batch functions
@gvanrossum gvanrossum merged commit 115cb0d into microsoft:main Apr 29, 2026
16 checks passed
Comment on lines +225 to +226
concurrently. LLM extraction is typically 95% of wall time, so this
nearly doubles throughput for multi-batch ingestions.
Collaborator

This claim looks misleading. Have you verified this end-to-end reduction in time?

I'd think the old approach would do

[---extraction 95%---][db][---extraction 95%---][db][---extraction 95%---][db]...

where the new approach does (view this in a fixed-width font)

[---extraction 95%---][---extraction 95%---][---extraction 95%---]
                      [db]                  [db]                  [db]

So the overall wall time would be just ~5% faster.

Collaborator

Despite this looking misleading, I have confirmed that with batchsize=50 and concurrency=20, my overall time for ingesting Adrian went down from 88 seconds to 32 seconds. Congrats!

@KRRT7 KRRT7 deleted the perf/ingestion-throughput branch May 1, 2026 00:38