Use streaming in ingest_email.py #265
Conversation
Here's what I caught. Posted a formal review with `suggestion` blocks for each item.

KRRT7 left a comment
Ran coverage.py against the streaming tests (32/32 pass). Coverage on the key files: conversation_base.py 78%, sqlite/provider.py 57%, model_adapters.py 46%. Below are findings with suggested fixes — most can be applied directly.
Co-authored-by: Kevin Turcios <106575910+KRRT7@users.noreply.github.com>
@KRRT7 -- I've addressed all your feedback except for the backward compatibility (it's too early for that).

PS. Please don't use "Request changes". Leaving comments has the same effect, and I don't get the nagging message in the PR.
Since stamina installs a default hook that just prints 'stamina.retry_scheduled', the condition `if not get_on_retry_hooks():` was always false.
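The failure mode is easy to reproduce in isolation. A minimal sketch of the pattern, using stand-in `get_on_retry_hooks`/`set_on_retry_hooks` functions that mimic a library pre-installing a default hook (hypothetical names, not stamina's actual implementation):

```python
# Stand-ins mimicking a library that pre-installs a default retry hook.
# Hypothetical; not stamina's actual implementation.
_hooks = [lambda details: print("stamina.retry_scheduled")]  # default hook

def get_on_retry_hooks():
    return list(_hooks)

def set_on_retry_hooks(hooks):
    _hooks[:] = list(hooks)

def install_hook_buggy(hook):
    # Bug: the guard assumes "no hooks registered yet", but the default
    # hook is always present, so this branch never runs.
    if not get_on_retry_hooks():
        set_on_retry_hooks([hook])

def install_hook_fixed(hook):
    # Fix: append unconditionally instead of guarding on emptiness.
    set_on_retry_hooks(get_on_retry_hooks() + [hook])

def custom(details):
    print("custom retry log")

install_hook_buggy(custom)
assert custom not in get_on_retry_hooks()  # guard never fired

install_hook_fixed(custom)
assert custom in get_on_retry_hooks()
```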
sounds good, do you prefer the flow where I make the exact code suggestions and you can click "Apply suggestions" instead?

sounds good to me, just wasn't sure of your policy around that.
KRRT7 left a comment
Re-reviewed the updated commits (551d407..ec4085d). All previous suggestions addressed — chunked IN clause, += for messages_skipped, dedup comment, all-skipped test all look good.
Ran full suite locally: make format (no changes), make check (0 errors/warnings), make test (697 passed). Four small follow-ups below, all click-to-apply.
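For reference, the chunked IN clause pattern mentioned above can be sketched roughly like this (table and column names are invented; the real `are_sources_ingested` in `sqlite/provider.py` will differ):

```python
import sqlite3

CHUNK = 500  # stay under SQLite's default limit on bound parameters

def are_sources_ingested(conn: sqlite3.Connection, source_ids: list[str]) -> dict[str, bool]:
    """Batch dedup lookup: which source ids already exist in the DB?"""
    found: set[str] = set()
    # Query in chunks so very large batches don't exceed the parameter limit.
    for i in range(0, len(source_ids), CHUNK):
        chunk = source_ids[i : i + CHUNK]
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            f"SELECT source_id FROM messages WHERE source_id IN ({placeholders})",
            chunk,
        )
        found.update(row[0] for row in rows)
    return {sid: sid in found for sid in source_ids}
```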
```python
async def _submit_batch(filtered: list[TMessage], skipped: int) -> None:
    nonlocal pending_commit, pending_skipped
    if not filtered and not skipped:
        return
```
Dead guard. All three call sites guarantee batch is non-empty before calling _filter_ingested, so len(filtered) + skipped == len(batch) > 0. This branch can never fire. The if not filtered: check at line 290 already handles the all-skipped case.
```suggestion
        return
    if filtered and should_extract:
```
@KRRT7 I am confused by your suggestions. This one, when applied, generates invalid syntax. Maybe you are using a tool to generate them and it's got an off-by-one error or is working from an outdated file version? That's the case for all suggestions in this batch.
You're right. If we ever revisit this code we can delete it. Bernhard merged it before I got a chance to look at your feedback, in part confused by the misaligned suggestions. It's harmless so I see no need to fix it in a separate PR.
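For the record, the invariant behind the dead-guard claim can be checked in isolation. A toy sketch (hypothetical `filter_ingested`, not the PR's `_filter_ingested`):

```python
def filter_ingested(batch, already_ingested):
    # Split a batch into new messages plus a count of skipped duplicates.
    filtered = [m for m in batch if m not in already_ingested]
    return filtered, len(batch) - len(filtered)

# Every call site guarantees a non-empty batch, so filtered and skipped can
# never both be falsy: the guard `if not filtered and not skipped` is dead.
for batch in (["a"], ["a", "b"], ["x", "y"]):
    filtered, skipped = filter_ingested(batch, already_ingested={"a"})
    assert len(filtered) + skipped == len(batch) > 0
    assert filtered or skipped
```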
```python
def on_batch_committed(result: AddMessagesResult) -> None:
    nonlocal last_batch_time
    counters["ingested"] += result.messages_added
    counters["batches"] += 1
```
Noisy output for all-skipped batches. When _submit_batch handles an all-skipped batch, this prints +0 messages, +0 chunks, +0 semrefs | 0.0s (0.00s/chunk). Early return keeps the batch counter and timing accurate for real batches.
```suggestion
    counters["ingested"] += result.messages_added
    if not result.messages_added:
        return
    counters["batches"] += 1
```
Honestly, that behavior sounds reasonable to me, given that this should be extremely rare. I think to trigger it you would have to run two instances of import_email in parallel with the same input files, or pass the same file multiple times but in different batches. Neither of which you should do. And if you do the latter, the batch lookup of source IDs might still bite you if the file occurs twice in the same batch.
```python
print(f"Failed to import {failed_count} email(s)")
print(f"Successfully ingested {result.messages_added} email(s)")
if counters["skipped"]:
    print(f"Skipped {counters['skipped']} already-ingested email(s)")
```
Skip accounting gap. counters["skipped"] counts generator-level pre-parse dedup, but result.messages_skipped (batch-level transactional dedup from _filter_ingested) is never displayed. The two populations are disjoint — a source caught by the generator is never yielded to the batch layer — so summing is safe.
```suggestion
if counters["skipped"] + result.messages_skipped:
    print(f"Skipped {counters['skipped'] + result.messages_skipped} already-ingested email(s)")
```
Weirdly the line numbers in the file listing don't match those in the suggestion (506 --> 528).
You may be right about the accounting, it's hard to follow for me TBH.
```python
print(f"Skipped: {skipped_count} (already ingested)")
if failed_count:
    print(f"Failed: {failed_count}")
if counters["skipped"]:
```
Same skip accounting fix for non-verbose output, plus capitalize "Ingested" to match the verbose branch.
```suggestion
    f"Ingested {result.messages_added} emails to {database} "
    f"({semref_count} refs, {elapsed:.1f}s)"
)
if counters["skipped"] + result.messages_skipped:
    print(f"Skipped: {counters['skipped'] + result.messages_skipped} (already ingested)")
```
re: the capitalization, I know it's a minor detail, but I care about details like that.
I do too and it was already fixed by some last-minute improvement to the output.
demoing it here.

@KRRT7 -- have another look (mostly spit and polish)

Reviewed the 3 new commits. A few things still open from the last round:
- Dead guard in `_submit_batch`.
- All-skipped batch output.
- Skip accounting gap.

Yes. Though in this case I found some issues with your suggestions -- it's still better, and you get credit.

Just ping me when you are finished... so I can merge.
Pull request overview
This PR updates the email ingestion pipeline to use the streaming ingestion API and to size commit batches by text chunks (rather than message count), with improved progress reporting and additional streaming/batching test coverage.
Changes:
- Update `tools/ingest_email.py` to stream messages into the DB with chunk-based batching, optional concurrency override, and richer batch progress reporting.
- Extend the storage provider API with batch source-id dedup (`are_sources_ingested`) and use it from streaming ingestion.
- Expand streaming ingestion tests to cover multi-chunk messages, chunk-based batching behavior, and skipped/duplicate reporting.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/ingest_email.py | Switches email ingest to streaming, adds CLI controls for concurrency/batching, adds pre-parse dedup and chunk truncation, improves progress + summary output. |
| tests/test_add_messages_streaming.py | Adds extensive tests for multi-chunk extraction and chunk-based batching + skipped reporting. |
| src/typeagent/storage/sqlite/provider.py | Implements are_sources_ingested() for efficient batch dedup lookups in SQLite. |
| src/typeagent/storage/memory/provider.py | Implements are_sources_ingested() for in-memory provider. |
| src/typeagent/knowpro/interfaces_storage.py | Adds are_sources_ingested() to the storage provider interface. |
| src/typeagent/knowpro/interfaces_core.py | Extends AddMessagesResult with messages_skipped. |
| src/typeagent/knowpro/conversation_base.py | Changes streaming batching semantics to chunk-based; adds skipped accounting and uses batch dedup API. |
| src/typeagent/aitools/model_adapters.py | Adds global stamina retry logging hook and adjusts embed retrier defaults. |
Hm, I had wanted to respond to some of Copilot's and @KRRT7's review comments. But I'm happy that at least this much is now on main; we can iterate further in follow-up PRs.
I already answered the first two issues. (Sadly your suggestions demo misfired.)

Can you talk me through this in more detail? I'm lost about what `messages_skipped` and `skipped` stand for at this point.

This one's new. I think it just requires the same treatment in the except block?
Copilot didn't find anything new.
Hey Guido, quick note on why those click-to-apply suggestions generated invalid syntax: the suggestions in my second review were generated against an older commit, so their line anchors no longer matched the file under review. The suggestion content itself was correct for that earlier version of the file.
There are two independent dedup layers, each with its own skip counter:

- Layer 1, generator-level (`counters["skipped"]`): pre-parse dedup in `ingest_email.py`; a source caught here is never yielded to the batch layer.
- Layer 2, batch-level (`result.messages_skipped`): transactional dedup in `_filter_ingested`.

The gap: the final summary in `ingest_email.py` only reports the layer-1 counter, so batch-level skips are never displayed.
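A toy model of the two counters (hypothetical names mirroring the discussion, not the actual `ingest_email.py` code):

```python
# Hypothetical sketch of the two disjoint skip populations; not the
# actual ingest_email.py / conversation_base.py code.
seen_sources = set()
counters = {"skipped": 0, "batch_skipped": 0}

def generate(sources):
    # Layer 1: generator-level pre-parse dedup within this run.
    for sid in sources:
        if sid in seen_sources:
            counters["skipped"] += 1
            continue
        seen_sources.add(sid)
        yield sid

def filter_ingested(batch, already_in_db):
    # Layer 2: batch-level dedup against what is already in the database.
    kept = [sid for sid in batch if sid not in already_in_db]
    counters["batch_skipped"] += len(batch) - len(kept)
    return kept

batch = list(generate(["a", "b", "a", "c"]))        # duplicate "a" in input
kept = filter_ingested(batch, already_in_db={"b"})  # "b" already ingested
assert kept == ["a", "c"]
# Disjoint populations, so the grand total is a plain sum:
assert counters["skipped"] + counters["batch_skipped"] == 2
```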
On ^C it's worse: only the counters updated so far survive, and batch-level skips were never accumulated anywhere. The fix adds a `batch_skipped` counter that `on_batch_committed` updates per committed batch, so the combined total survives an interrupt.
Yes. The fix promotes `next_extraction` to a `nonlocal` (`pending_extraction`) so the `except BaseException` block can cancel it alongside `pending_commit`.
Both fixes are in #266.
## Summary

Follow-up to #265. Fixes two issues identified during review:

- **Skip accounting gap.** `ingest_email.py` only reported generator-level skips (`counters["skipped"]`). Batch-level skips from `_filter_ingested` (`result.messages_skipped`) were never surfaced in the final summary. The two populations are disjoint — a source caught by the generator never reaches the batch layer. The summary now reports `total_skipped = counters["skipped"] + counters["batch_skipped"]`. Survives ^C since `on_batch_committed` fires per committed batch.
- **`next_extraction` task leak.** In `_submit_batch`, if `_drain_commit()` raises after `next_extraction = asyncio.create_task(...)` but before it's awaited, the task leaked. Promoted `next_extraction` to a `nonlocal` (`pending_extraction`) so the `except BaseException` block can cancel it alongside `pending_commit`.

## Changes

- `tools/ingest_email.py`: Add `batch_skipped` counter, track `result.messages_skipped` in `on_batch_committed`, report combined total in summary
- `src/typeagent/knowpro/conversation_base.py`: Track `pending_extraction`, cancel in except block
- `tests/test_add_messages_streaming.py`: 4 new tests covering both cancellation paths and edge cases

## Test plan

- [x] `make format check test` passes (701 tests)
- [x] Coverage for `conversation_base.py`: 94% → 96%
- [x] New test: `pending_extraction` cancelled when prior commit raises during `_drain_commit`
- [x] New test: `pending_commit` cancelled when message iterator raises
- [x] New test: empty iterator returns zeros
- [x] New test: messages with empty `text_chunks` skip extraction entirely
Thanks! And merged. I never encountered that GitHub issue before but it makes sense. |
We now count chunks to determine batch size. Reporting is much improved.
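The chunk-counting idea, in a toy sketch (the batching policy here is assumed from the description above; it is not the actual `conversation_base.py` logic):

```python
def batches_by_chunks(messages, max_chunks=8):
    # Flush a batch once the accumulated chunk count reaches max_chunks,
    # regardless of how many messages that took.
    batch, chunks = [], 0
    for msg in messages:
        batch.append(msg)
        chunks += len(msg["text_chunks"])
        if chunks >= max_chunks:
            yield batch
            batch, chunks = [], 0
    if batch:  # flush the final partial batch
        yield batch

# A 3-chunk message counts three times toward the threshold:
msgs = [{"text_chunks": ["c"] * n} for n in (3, 3, 3, 1)]
out = list(batches_by_chunks(msgs, max_chunks=6))
assert [len(b) for b in out] == [2, 2]
```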