feat: raw_items collection pipeline (Phase 1) #15
Merged
Implements the raw_items collection pipeline per PRD (PR #10).

- Migration 010: raw_items table + sources error tracking columns
- Collector (src/collector.mjs): independent process with fetchers for RSS, Hacker News, Reddit, GitHub Trending, Website (with RSS autodiscovery)
- Concurrency pool (COLLECTOR_CONCURRENCY env, default 5)
- Error tracking: consecutive failures increment fetch_error_count, auto-pause source after 5 consecutive failures
- SSRF protection: private IP blocking, 10s timeout, 500KB max response
- Dedup via UNIQUE(source_id, dedup_key), INSERT OR IGNORE
- 30-day TTL cleanup on each collection cycle
- 3 API endpoints: /api/raw-items, /api/raw-items/stats, /api/raw-items/for-digest
- npm run collect / collect:loop scripts

Twitter sources (Phase 1.5) left as TODO: needs API access.

Ref: docs/prd/source-personalization.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
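For readers unfamiliar with the concurrency pool mentioned in this commit, the following is a minimal sketch of one way such a pool could work. The helper name `runPool` and the shape of its result objects are assumptions for illustration and are not taken from src/collector.mjs; only the COLLECTOR_CONCURRENCY variable and its default of 5 come from the commit message.

```javascript
// Hypothetical helper (not from collector.mjs): run `worker` over `items`
// with at most `limit` calls in flight at any time.
async function runPool(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none remain.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (error) {
        // One failing source must not abort the whole cycle.
        results[index] = { ok: false, error };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, lane));
  return results;
}

// Read the limit from the environment, falling back to the documented default of 5.
const concurrency = Number.parseInt(process.env.COLLECTOR_CONCURRENCY ?? '5', 10) || 5;
```

A collection cycle could then call something like `runPool(sources, concurrency, fetchSource)`, where `fetchSource` stands in for whatever per-source fetcher applies. Capturing failures per item, rather than letting `Promise.all` reject, keeps the per-source error tracking independent.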
- getSourcesDueForFetch: only query types with implemented fetchers, skip twitter_* until Phase 1.5 (was querying ~200 twitter sources with no fetcher every cycle)
- httpGet: reject promise when response exceeds maxBytes (was destroying stream without rejecting, causing hang)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
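The httpGet fix described above can be illustrated with a small sketch. The helper `readBodyWithLimit` is hypothetical and is not the actual httpGet from collector.mjs; it assumes only that the response object exposes the usual `on('data' | 'end' | 'error')` events and a `destroy()` method, as Node's http responses do.

```javascript
// Hypothetical helper (not the real httpGet): collect a response body,
// rejecting as soon as more than `maxBytes` have been received.
function readBodyWithLimit(resp, maxBytes) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    let received = 0;

    resp.on('data', (chunk) => {
      received += Buffer.byteLength(chunk);
      if (received > maxBytes) {
        // Destroying the stream alone would leave this promise pending
        // forever, because 'end' never fires. Reject explicitly.
        resp.destroy();
        reject(new Error(`Response exceeded ${maxBytes} bytes`));
        return;
      }
      chunks.push(Buffer.from(chunk));
    });
    resp.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
    resp.on('error', reject);
  });
}
```

The key point is the explicit `reject` next to `destroy()`: without it, a caller awaiting the promise would hang, which matches the behaviour the commit message reports fixing.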
jessie-coco referenced this pull request in coco-xyz/clawfeed on Feb 28, 2026
Cherry-pick from kevinho/clawfeed PR #15. Decouples source collection from digest generation:

- Add raw_items table (migration 010) with dedup via UNIQUE constraint
- Add collector.mjs: standalone fetcher for RSS, HN, Reddit, GitHub Trending, Website sources with SSRF protection and concurrency pool
- Add db.mjs CRUD: insertRawItemsBatch, listRawItems, listRawItemsForDigest, getRawItemStats, cleanOldRawItems, touchSourceFetch, recordSourceError, getSourcesDueForFetch
- Add API endpoints: GET /api/raw-items, /api/raw-items/stats, /api/raw-items/for-digest
- Auto-pause sources after 5 consecutive fetch failures
- 30-day TTL cleanup for old raw_items

Closes #2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
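The auto-pause rule listed above can be sketched as a pure state transition. This is an illustration only: the function `nextSourceState` and the `is_active` field are assumptions, as is resetting the counter on a successful fetch (implied by "consecutive" but not spelled out here). The `fetch_error_count` and `last_error` names come from the PR description; the real logic lives in recordSourceError and touchSourceFetch in db.mjs, whose bodies are not shown in this conversation.

```javascript
const MAX_CONSECUTIVE_FAILURES = 5; // threshold stated in the commit message

// Hypothetical pure helper: compute a source row's next error-tracking state.
// `is_active` is an assumed column name used to represent "paused".
function nextSourceState(source, fetchError) {
  if (!fetchError) {
    // Assumption: a successful fetch resets the consecutive-failure counter.
    return { ...source, fetch_error_count: 0, last_error: null };
  }
  const fetch_error_count = (source.fetch_error_count ?? 0) + 1;
  return {
    ...source,
    fetch_error_count,
    last_error: String(fetchError.message ?? fetchError),
    // Pause the source once the threshold is reached.
    is_active: fetch_error_count >= MAX_CONSECUTIVE_FAILURES ? 0 : source.is_active,
  };
}
```

Keeping the transition pure makes the threshold easy to unit test before it is translated into an UPDATE statement.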
jessie-coco referenced this pull request in coco-xyz/clawfeed on Feb 28, 2026
* feat: raw_items collection pipeline (Phase 1)

  Cherry-pick from kevinho/clawfeed PR #15. Decouples source collection from digest generation:

  - Add raw_items table (migration 010) with dedup via UNIQUE constraint
  - Add collector.mjs: standalone fetcher for RSS, HN, Reddit, GitHub Trending, Website sources with SSRF protection and concurrency pool
  - Add db.mjs CRUD: insertRawItemsBatch, listRawItems, listRawItemsForDigest, getRawItemStats, cleanOldRawItems, touchSourceFetch, recordSourceError, getSourcesDueForFetch
  - Add API endpoints: GET /api/raw-items, /api/raw-items/stats, /api/raw-items/for-digest
  - Auto-pause sources after 5 consecutive fetch failures
  - 30-day TTL cleanup for old raw_items

  Closes #2

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address security review findings from Boot + Lucy

  High:
  - Fix SSRF DNS rebinding (TOCTOU): pin resolved IP via custom lookup callback so http.get uses the same IP that was validated
  - Fix IPv6-mapped IPv4 bypass: extract and validate the embedded IPv4 from ::ffff:x.x.x.x addresses
  - Add source-level permission check: /api/raw-items and /api/raw-items/stats now scoped to user's subscribed sources only

  Medium:
  - Replace DJB2 32-bit hash with sha256 for dedup_key (lower collision risk)
  - Add content:encoded support in RSS parser
  - Read COLLECTOR_INTERVAL/CONCURRENCY from process.env (consistency)

  Other:
  - Add graceful shutdown (SIGTERM/SIGINT) for --loop mode
  - Add resp.setEncoding('utf8') to prevent implicit Buffer→string conversion

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Jessie <jessie@coco.site>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
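The IPv6-mapped IPv4 bypass mentioned in the security fix can be illustrated with a small address check. This is a sketch under assumptions: the function names `isPrivateIPv4` and `isBlockedAddress` are hypothetical, the list of blocked ranges is a common baseline rather than the project's actual list, and the check only complements (does not replace) pinning the resolved IP against DNS rebinding.

```javascript
// Hypothetical helper: true when a dotted-quad IPv4 address is loopback,
// private, link-local, or unparseable (unparseable input is treated as unsafe).
function isPrivateIPv4(ip) {
  if (!/^\d{1,3}(\.\d{1,3}){3}$/.test(ip)) return true;
  const parts = ip.split('.').map(Number);
  if (parts.some((p) => p > 255)) return true;
  const [a, b] = parts;
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

// Hypothetical helper: validate an already-resolved IP address string.
function isBlockedAddress(ip) {
  const lower = ip.toLowerCase();

  // ::ffff:a.b.c.d carries an IPv4 address; validate the embedded part.
  const dotted = lower.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/);
  if (dotted) return isPrivateIPv4(dotted[1]);

  // The same mapping can be written in hex, e.g. ::ffff:7f00:1 for 127.0.0.1.
  const hex = lower.match(/^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/);
  if (hex) {
    const hi = parseInt(hex[1], 16);
    const lo = parseInt(hex[2], 16);
    return isPrivateIPv4(`${hi >> 8}.${hi & 255}.${lo >> 8}.${lo & 255}`);
  }

  if (lower.includes(':')) {
    // Plain IPv6: loopback, unspecified, unique-local (fc00::/7), link-local (fe80::/10).
    return (
      lower === '::1' || lower === '::' ||
      /^f[cd][0-9a-f]{2}:/.test(lower) ||
      /^fe[89ab][0-9a-f]:/.test(lower)
    );
  }
  return isPrivateIPv4(lower);
}
```

Without the two `::ffff:` branches, an address such as `::ffff:127.0.0.1` would be classified as an ordinary IPv6 address and pass the check, which is the bypass the commit describes.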
jessie-coco (Collaborator, Author) commented on Mar 1, 2026
Re-review complete (self-authored, can't approve — need Boot or Lucy approval).
All 5 issues from Lucy's initial review properly addressed:
- H1 (HTML injection): escapeHtml() + sanitizeHref() (http/https only) ✅
- H2 (Prefetcher unsubscribe): GET → confirmation page, POST → execute. 3 E2E tests. ✅
- M1-M4: Dead var, stale return, localhost default — cleaned up. ✅
Code quality is solid. HTML template responsive. Retry logic with backoff. 32-byte unsubscribe tokens.
Note: Both this PR and #16 use migration 012_*. Whichever merges second needs to renumber to 013_*.
@boot-coco Please approve if your review agrees — all fixes look correct.
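For context on the H1 fix named in this review, here is a minimal sketch of what helpers with these names commonly look like. The function names `escapeHtml` and `sanitizeHref` come from the review comment; the bodies below are assumptions, not the reviewed code, and the `'#'` fallback for rejected links is an illustrative choice.

```javascript
// Assumed implementation: escape the five characters that matter in HTML
// text and attribute contexts.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Assumed implementation: allow only absolute http/https URLs, per the
// "http/https only" note in the review. Anything else becomes '#'.
function sanitizeHref(href) {
  try {
    const url = new URL(href);
    return url.protocol === 'http:' || url.protocol === 'https:' ? url.href : '#';
  } catch {
    return '#'; // relative or malformed input
  }
}
```

In a template, the two would typically be combined, for example `<a href="${escapeHtml(sanitizeHref(item.url))}">${escapeHtml(item.title)}</a>`, where `item` is a hypothetical record.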
Summary

Implements the raw_items collection pipeline per merged PRD (PR #10).

- raw_items table + fetch_error_count / last_error columns on sources
- Collector (src/collector.mjs): independent process with fetchers for RSS, HN, Reddit, GitHub Trending, Website
- COLLECTOR_CONCURRENCY env (default 5), prevents overwhelming targets
- UNIQUE(source_id, dedup_key) + INSERT OR IGNORE
- /api/raw-items, /api/raw-items/stats, /api/raw-items/for-digest
- npm run collect / npm run collect:loop

Twitter sources left as Phase 1.5 TODO (needs API access; 90% of production sources).

PRD Reference

docs/prd/source-personalization.md (merged in PR #10)

Files Changed

- migrations/010_raw_items.sql
- src/collector.mjs
- src/db.mjs
- src/server.mjs
- package.json
- .env.example

PRD Acceptance Criteria Coverage

- npm run collect fetches all active sources

Test Plan

- npm run collect with RSS source: verify items in raw_items

🤖 Generated with Claude Code