
Separate batching from concurrency during ingestion#252

Merged
gvanrossum merged 8 commits into microsoft:main from gvanrossum:batch-conc
Apr 26, 2026

Conversation

@gvanrossum-ms
Contributor

@gvanrossum-ms gvanrossum-ms commented Apr 25, 2026

Batch size is now just something the ingestion tools can use to control transaction size/frequency(*). Concurrency controls how many knowledge extraction tasks run at the same time.

To maximize throughput when you have paid your provider for a high tokens/minute quota, use a huge batch size and set concurrency just high enough that you don't overload your provider. However, since a single failure throws away the entire batch, you are advised not to set your batch size too large. Tune based on experiments.
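The tuning described above can be sketched roughly like this: `batch_size` controls how many messages go into each transaction, while a semaphore caps how many extraction calls run at once. All names here (`extract`, `ingest`, the constants) are illustrative, not the PR's actual code.

```python
import asyncio

CONCURRENCY = 8   # just high enough not to overload the provider
BATCH_SIZE = 100  # keep modest: one failure discards the whole batch

async def extract(sem: asyncio.Semaphore, chunk: str) -> str:
    async with sem:
        # placeholder for the real knowledge-extraction call
        await asyncio.sleep(0)
        return f"knowledge({chunk})"

async def ingest(chunks: list[str]) -> list[str]:
    sem = asyncio.Semaphore(CONCURRENCY)
    results: list[str] = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        # extractions within a batch run concurrently, capped by the semaphore
        results.extend(await asyncio.gather(*(extract(sem, c) for c in batch)))
        # commit the transaction for this batch here
    return results

print(asyncio.run(ingest([f"c{n}" for n in range(5)])))
```

Note the two knobs are independent: the semaphore bounds in-flight LLM calls regardless of how large the committed batches are.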

(*) My next plan is to add a new top-level API that you feed with an iterator, where you can separately control concurrency and transaction batch size. Not sure yet whether I'll make that part of this PR; probably not. Watch for those commits soon.

@gvanrossum-ms
Contributor Author

gvanrossum-ms commented Apr 25, 2026

My laptop battery is about to die. Feel free to review, I'll continue later.

…ders

- IMessage gains an optional `source_id: str | None` field; mirrored on
  ConversationMessage and EmailMessage. Lets ingestion pipelines carry the
  external source identifier (email id, file path, URL) on the message itself
  instead of via parallel arrays.

- New ChunkFailures table in the SQLite schema, keyed by (msg_id,
  chunk_ordinal), recording error_class/error_message/failed_at for chunks
  whose knowledge extraction failed.

- New ChunkFailure dataclass and three IStorageProvider methods:
  record_chunk_failure, clear_chunk_failure, get_chunk_failures. Implemented
  for both SQLite and in-memory providers.

Groundwork for a streaming, restartable add_messages API that records
per-chunk extraction failures without aborting the run.
@gvanrossum gvanrossum marked this pull request as draft April 26, 2026 05:44
New method on ConversationBase that accepts an AsyncIterable of messages
and processes them in commit-per-batch transactions. Designed for
million-message ingestion where a single all-or-nothing transaction is
impractical.

Key behaviors:
- Buffers messages into configurable batch_size (default 100)
- Skips messages whose source_id is already ingested
- Records chunk-level extraction Failures via record_chunk_failure
  instead of raising (processing continues with remaining chunks)
- Raised exceptions (HTTP/timeout/auth) stop the run; the current
  batch rolls back but previously committed batches survive

Also adds _ingest_batch_streaming (single-batch transaction logic) and
_add_llm_knowledge_streaming (extraction with failure recording).

Tests: 8 new tests in test_add_messages_streaming.py using a
ControlledExtractor that can return Failure or raise on specific calls.
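The commit-per-batch behavior described above can be sketched as follows. This is an assumption-laden toy model (names, signature, and the `ingested` set are all invented): messages stream in via an AsyncIterable, already-ingested source ids are skipped, and each full buffer is committed as its own transaction.

```python
import asyncio
from typing import AsyncIterable

async def add_messages_streaming(
    messages: AsyncIterable[str],
    ingested: set[str],
    batch_size: int = 100,
) -> list[list[str]]:
    committed: list[list[str]] = []
    batch: list[str] = []

    async def flush() -> None:
        if batch:
            # one transaction per batch; commit, then start a fresh buffer
            committed.append(list(batch))
            batch.clear()

    async for msg in messages:
        if msg in ingested:  # skip messages whose source_id is already ingested
            continue
        batch.append(msg)
        if len(batch) >= batch_size:
            await flush()
    await flush()  # final partial batch
    return committed

async def gen():
    for m in ["a", "b", "a", "c", "d"]:
        yield m

print(asyncio.run(add_messages_streaming(gen(), ingested={"a"}, batch_size=2)))
# → [['b', 'c'], ['d']]
```

If an extraction call raised mid-batch, only the uncommitted buffer would be lost; everything in `committed` survives, which is the restartability property the commit message describes.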
- Add chunks_added field to AddMessagesResult (making all its fields default to 0)
- Add on_batch_committed callback parameter to add_messages_streaming
- Use callback in podcast_ingest.py for per-batch progress reporting
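The `on_batch_committed` callback could be wired up for progress reporting roughly as below; the callback signature is assumed, not taken from the PR or from podcast_ingest.py.

```python
progress: list[str] = []

def on_batch_committed(batch_ordinal: int, messages_in_batch: int,
                       total_messages: int) -> None:
    # invoked after each batch's transaction commits
    progress.append(
        f"batch {batch_ordinal}: +{messages_in_batch} ({total_messages} total)"
    )

total = 0
for ordinal, batch in enumerate([["m1", "m2"], ["m3"]], start=1):
    total += len(batch)
    on_batch_committed(ordinal, len(batch), total)

print(progress)
# → ['batch 1: +2 (2 total)', 'batch 2: +1 (3 total)']
```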
@gvanrossum gvanrossum marked this pull request as ready for review April 26, 2026 06:53
@gvanrossum
Collaborator

@KRRT7, @bmerkle: This is now ready for review. It should keep the pipeline busy continuously. See podcast_ingest.py for example usage.

Comment thread tools/ingest_email.py Outdated
Comment thread tests/test_source_id_ingestion.py Outdated
Comment thread src/typeagent/knowpro/convsettings.py
cursor = self.db.cursor()
cursor.execute(
"""
INSERT OR REPLACE INTO ChunkFailures
Collaborator


Existing databases may fail with a schema error because they don't have this table.
Either add a migration step (similar to how INGESTED_SOURCES_SCHEMA may have been added) or document that a fresh DB is required.
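A minimal migration step in the spirit of this suggestion: create the table only if it is missing, so both fresh and existing databases work. The column list is assumed from the commit message, not copied from the PR.

```python
import sqlite3

CHUNK_FAILURES_SCHEMA = """
CREATE TABLE IF NOT EXISTS ChunkFailures (
    msg_id INTEGER NOT NULL,
    chunk_ordinal INTEGER NOT NULL,
    error_class TEXT NOT NULL,
    error_message TEXT NOT NULL,
    failed_at TEXT NOT NULL,
    PRIMARY KEY (msg_id, chunk_ordinal)
)
"""

def migrate(db: sqlite3.Connection) -> None:
    # IF NOT EXISTS makes this idempotent for old and new databases alike
    db.execute(CHUNK_FAILURES_SCHEMA)

db = sqlite3.connect(":memory:")
migrate(db)
migrate(db)  # safe to run again
print(db.execute(
    "SELECT name FROM sqlite_master WHERE name = 'ChunkFailures'"
).fetchone())
# → ('ChunkFailures',)
```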

# Filter out already-ingested sources
filtered: list[TMessage] = []
for msg in batch:
if msg.source_id is not None and await storage.is_source_ingested(
Collaborator


is_source_ingested is called outside the transaction.
In a multi-process scenario, two workers could both pass the check and both ingest the same source.

Collaborator


I don't believe such a scenario is valid anyway, so let's not move the check into the transaction (it will just slow down other tasks that want to write to the db).

Comment thread src/typeagent/knowpro/universal_message.py Outdated
@gvanrossum gvanrossum merged commit 038be09 into microsoft:main Apr 26, 2026
16 checks passed
@gvanrossum gvanrossum deleted the batch-conc branch April 26, 2026 16:58