Remove redundant _filter_ingested from streaming pipeline #271
Merged
gvanrossum merged 1 commit into microsoft:main on May 5, 2026
The framework no longer does per-batch dedup queries inside add_messages_streaming. Callers are responsible for filtering duplicates before yielding messages into the stream.

- ingest_email.py already pre-filters via is_source_ingested
- ingest_vtt.py enforces a fresh DB (refuses existing)
- podcast_ingest.py uses unique source_ids by construction

This eliminates ~N unnecessary are_sources_ingested DB round-trips (one per batch) that always returned empty sets.

Closes microsoft#269
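The caller-side contract described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Message`, `FakeStore`, and `new_messages` are hypothetical names, and the real `is_source_ingested` in this repository may have a different signature (for example, it may be async). The point is the shape of the pattern: duplicates are dropped before anything is yielded into the stream, so the framework has nothing left to filter.

```python
from collections.abc import Iterable, Iterator
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical message type; only source_id matters for dedup here."""
    source_id: str
    text: str


class FakeStore:
    """Hypothetical in-memory stand-in for the storage layer (not the real API)."""

    def __init__(self, ingested: set[str]) -> None:
        self._ingested = set(ingested)

    def is_source_ingested(self, source_id: str) -> bool:
        # Assumed semantics: True if this source was ingested previously.
        return source_id in self._ingested


def new_messages(store: FakeStore, candidates: Iterable[Message]) -> Iterator[Message]:
    """Yield only messages whose source_id has not been ingested yet."""
    for msg in candidates:
        if not store.is_source_ingested(msg.source_id):
            yield msg


store = FakeStore({"a"})
candidates = [Message("a", "old"), Message("b", "new"), Message("c", "new")]
fresh = [m.source_id for m in new_messages(store, candidates)]
```

A generator like `new_messages` would be what the caller hands to the streaming ingest call, in place of the unfiltered candidate list.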
Force-pushed from 1c71230 to 3875789
KRRT7 added a commit to KRRT7/typeagent-py that referenced this pull request on May 5, 2026
The framework no longer populates messages_skipped (removed in microsoft#271), so the skipped counter and conditional output are dead code.
This was referenced May 5, 2026
gvanrossum approved these changes on May 5, 2026
Summary
- Remove the _filter_ingested method and all per-batch are_sources_ingested DB queries from add_messages_streaming
- ingest_email.py already pre-filters via is_source_ingested before yielding
- ingest_vtt.py enforces a fresh DB (refuses to run if it exists)
- ingest_podcast.py uses unique source_ids by construction
- Remove the batch_skipped counter from ingest_email.py

Why: on a 10k-email re-ingest with batch_size=100, the framework was issuing ~100 are_sources_ingested queries that always returned empty sets (because the caller already filtered). Pure waste.

Stack
This is the base of a 3-PR stack. Merge order: #271 → #267 → #268
- fix/remove-filter-ingested
- feat/vtt-streaming-ingestion
- refactor/email-dedup-consolidation

Test plan
- make check passes (pyright 0 errors)
- pytest tests/ passes (696 tests; 5 removed that tested framework-level dedup)

Closes #269
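The ~100 figure quoted under Why follows from one dedup query per batch. A minimal worked check, assuming the query count is simply the number of batches (the helper name `dedup_queries_per_run` is hypothetical and not part of the codebase):

```python
import math


def dedup_queries_per_run(num_messages: int, batch_size: int) -> int:
    """Number of per-batch dedup queries, assuming one query per batch."""
    return math.ceil(num_messages / batch_size)


# 10k emails at batch_size=100 gives 100 batches, hence 100 queries.
queries = dedup_queries_per_run(10_000, 100)
```

With the per-batch filter removed, that count drops to zero on the framework side; only the caller's own pre-filter lookups remain.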