
feat: raw_items collection pipeline (Phase 1)#15

Merged
kevinho merged 4 commits into develop from feat/raw-items-collector-v2
Feb 24, 2026

Conversation

@jessie-coco
Collaborator

Summary

Implements the raw_items collection pipeline per merged PRD (PR #10).

  • Migration 010: raw_items table + fetch_error_count/last_error columns on sources
  • Collector (src/collector.mjs): independent process with fetchers for RSS, HN, Reddit, GitHub Trending, Website
  • Concurrency pool: COLLECTOR_CONCURRENCY env (default 5), prevents overwhelming targets
  • Error tracking: consecutive failures auto-pause source after 5 failures
  • SSRF protection: private IP blocking, 10s timeout, 500KB max response
  • Dedup: UNIQUE(source_id, dedup_key) + INSERT OR IGNORE
  • 30-day TTL cleanup on each collection cycle
  • 3 API endpoints: /api/raw-items, /api/raw-items/stats, /api/raw-items/for-digest
  • Scripts: npm run collect / npm run collect:loop
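The SSRF guard summarized above can be sketched as a private-address check. This is illustrative only: the helper name `isPrivateAddress` and the exact blocked ranges are assumptions, and the later security-review follow-up additionally pins the resolved IP to defeat DNS rebinding.

```javascript
import net from 'node:net';

// Illustrative private-address check for SSRF protection.
// Helper name and range list are assumptions, not the collector's real API.
function isPrivateAddress(ip) {
  // Unwrap IPv6-mapped IPv4 (::ffff:10.0.0.1) before checking IPv4 ranges.
  const mapped = ip.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/i);
  if (mapped) ip = mapped[1];
  if (net.isIPv4(ip)) {
    const [a, b] = ip.split('.').map(Number);
    return (
      a === 0 || a === 10 || a === 127 ||          // this-host, RFC 1918, loopback
      (a === 172 && b >= 16 && b <= 31) ||         // RFC 1918
      (a === 192 && b === 168) ||                  // RFC 1918
      (a === 169 && b === 254)                     // link-local
    );
  }
  // Conservative IPv6 handling: loopback, unique-local, link-local.
  return ip === '::1' || /^f[cd]/i.test(ip) || /^fe80:/i.test(ip);
}
```

The IPv6-mapped unwrap matters because `::ffff:10.0.0.1` would otherwise slip past an IPv4-only range check.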

Twitter sources (about 90% of production sources) are left as a Phase 1.5 TODO, since they need API access.

PRD Reference

docs/prd/source-personalization.md (merged in PR #10)

Files Changed

| File | Change |
| --- | --- |
| migrations/010_raw_items.sql | New: raw_items table + sources error columns |
| src/collector.mjs | New: collection pipeline + all fetchers |
| src/db.mjs | Add: raw_items CRUD + error tracking + fetch scheduling |
| src/server.mjs | Add: 3 API endpoints |
| package.json | Add: collect/collect:loop scripts |
| .env.example | Add: COLLECTOR_INTERVAL, COLLECTOR_CONCURRENCY, DEFAULT_SOURCE_ID |

PRD Acceptance Criteria Coverage

  1. raw_items table via migration
  2. npm run collect fetches all active sources
  3. Dedup (INSERT OR IGNORE)
  4. RSS parsing (title, URL, content, pubdate)
  5. HN min_score filter + metadata
  6. Reddit subreddit posts
  7. GitHub Trending repos
  8. Website RSS autodiscovery + fallback
  9. SSRF protection
  10. /api/raw-items/stats
  11. /api/raw-items/for-digest (user subscription filter)
  12. 30-day TTL cleanup
  13. Non-logged-in default digest (Phase 2: needs digest generation integration)
  14. New user auto-subscribe default source (Phase 2)
  15. Auto-pause after 5 consecutive failures
  16. Concurrency limit

Test Plan

  • Run migration on clean DB — verify raw_items table + sources columns created
  • npm run collect with RSS source — verify items in raw_items
  • Run collect twice — verify dedup (0 new inserts)
  • Add HN source with min_score — verify score filtering
  • Verify SSRF: source with localhost URL gets blocked
  • Hit /api/raw-items/stats — verify JSON response
  • Hit /api/raw-items/for-digest with subscribed user — verify filtering
  • Verify auto-pause: artificially fail 5 times, confirm is_active=0
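The auto-pause rule exercised in the last step can be reduced to a pure sketch. The column names `fetch_error_count` and `is_active` come from the migration description; `nextErrorState` is a hypothetical helper, not the actual db.mjs API:

```javascript
// Pure sketch of the auto-pause rule: given the current consecutive
// failure count, compute the next count and whether the source pauses.
// Pausing is modeled as is_active = 0 once the threshold is reached.
function nextErrorState(fetchErrorCount, threshold = 5) {
  const count = fetchErrorCount + 1;
  return { fetch_error_count: count, is_active: count >= threshold ? 0 : 1 };
}
```

A successful fetch would reset `fetch_error_count` to 0, which is why only consecutive failures trigger the pause.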

🤖 Generated with Claude Code

jessie-coco and others added 4 commits February 24, 2026 06:11
Implements the raw_items collection pipeline per PRD (PR #10).

- Migration 010: raw_items table + sources error tracking columns
- Collector (src/collector.mjs): independent process with fetchers for
  RSS, Hacker News, Reddit, GitHub Trending, Website (with RSS autodiscovery)
- Concurrency pool (COLLECTOR_CONCURRENCY env, default 5)
- Error tracking: consecutive failures increment fetch_error_count,
  auto-pause source after 5 consecutive failures
- SSRF protection: private IP blocking, 10s timeout, 500KB max response
- Dedup via UNIQUE(source_id, dedup_key), INSERT OR IGNORE
- 30-day TTL cleanup on each collection cycle
- 3 API endpoints: /api/raw-items, /api/raw-items/stats, /api/raw-items/for-digest
- npm run collect / collect:loop scripts

Twitter sources (Phase 1.5) left as TODO — needs API access.

Ref: docs/prd/source-personalization.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
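The COLLECTOR_CONCURRENCY pool described in the commit above can be sketched as a small promise pool. This is a minimal illustration, not the collector's actual implementation:

```javascript
// Minimal promise pool: run at most `limit` tasks concurrently.
// `tasks` is an array of zero-argument async functions (e.g. source fetches).
async function runPool(tasks, limit = 5) {
  const results = [];
  let next = 0;
  async function worker() {
    // Each worker pulls the next unclaimed task until none remain.
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]().catch((err) => ({ error: err.message }));
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker)
  );
  return results;
}
```

Catching per task keeps one failing source from aborting the whole cycle, which matches the per-source error-tracking design.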
- getSourcesDueForFetch: only query types with implemented fetchers,
  skip twitter_* until Phase 1.5 (was querying ~200 twitter sources
  with no fetcher every cycle)
- httpGet: reject promise when response exceeds maxBytes (was destroying
  stream without rejecting, causing hang)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
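The httpGet fix described above, rejecting instead of only destroying the stream, might look like this minimal sketch (the function shape and option names are assumptions, not the real collector API):

```javascript
import http from 'node:http';

// Sketch of the fix: reject the promise when the body exceeds maxBytes,
// so awaiting callers cannot hang on a silently destroyed stream.
function httpGet(url, { maxBytes = 500 * 1024, timeoutMs = 10_000 } = {}) {
  return new Promise((resolve, reject) => {
    const req = http.get(url, (res) => {
      res.setEncoding('utf8'); // avoid implicit Buffer-to-string conversion
      let body = '';
      let size = 0;
      res.on('error', reject);
      res.on('data', (chunk) => {
        size += Buffer.byteLength(chunk);
        if (size > maxBytes) {
          res.destroy();
          reject(new Error(`response exceeds ${maxBytes} bytes`)); // the fix
          return;
        }
        body += chunk;
      });
      res.on('end', () => resolve(body));
    });
    req.setTimeout(timeoutMs, () => req.destroy(new Error('fetch timeout')));
    req.on('error', reject);
  });
}
```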
@kevinho kevinho merged commit 0aa1f86 into develop Feb 24, 2026
3 checks passed
jessie-coco referenced this pull request in coco-xyz/clawfeed Feb 28, 2026
Cherry-pick from kevinho/clawfeed PR #15. Decouples source collection
from digest generation:

- Add raw_items table (migration 010) with dedup via UNIQUE constraint
- Add collector.mjs: standalone fetcher for RSS, HN, Reddit, GitHub
  Trending, Website sources with SSRF protection and concurrency pool
- Add db.mjs CRUD: insertRawItemsBatch, listRawItems,
  listRawItemsForDigest, getRawItemStats, cleanOldRawItems,
  touchSourceFetch, recordSourceError, getSourcesDueForFetch
- Add API endpoints: GET /api/raw-items, /api/raw-items/stats,
  /api/raw-items/for-digest
- Auto-pause sources after 5 consecutive fetch failures
- 30-day TTL cleanup for old raw_items

Closes #2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jessie-coco referenced this pull request in coco-xyz/clawfeed Feb 28, 2026
* feat: raw_items collection pipeline (Phase 1)

Cherry-pick from kevinho/clawfeed PR #15. Decouples source collection
from digest generation:

- Add raw_items table (migration 010) with dedup via UNIQUE constraint
- Add collector.mjs: standalone fetcher for RSS, HN, Reddit, GitHub
  Trending, Website sources with SSRF protection and concurrency pool
- Add db.mjs CRUD: insertRawItemsBatch, listRawItems,
  listRawItemsForDigest, getRawItemStats, cleanOldRawItems,
  touchSourceFetch, recordSourceError, getSourcesDueForFetch
- Add API endpoints: GET /api/raw-items, /api/raw-items/stats,
  /api/raw-items/for-digest
- Auto-pause sources after 5 consecutive fetch failures
- 30-day TTL cleanup for old raw_items

Closes #2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address security review findings from Boot + Lucy

High:
- Fix SSRF DNS rebinding (TOCTOU): pin resolved IP via custom lookup
  callback so http.get uses the same IP that was validated
- Fix IPv6-mapped IPv4 bypass: extract and validate the embedded IPv4
  from ::ffff:x.x.x.x addresses
- Add source-level permission check: /api/raw-items and /api/raw-items/stats
  now scoped to user's subscribed sources only

Medium:
- Replace DJB2 32-bit hash with sha256 for dedup_key (lower collision risk)
- Add content:encoded support in RSS parser
- Read COLLECTOR_INTERVAL/CONCURRENCY from process.env (consistency)

Other:
- Add graceful shutdown (SIGTERM/SIGINT) for --loop mode
- Add resp.setEncoding('utf8') to prevent implicit Buffer→string conversion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Jessie <jessie@coco.site>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
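The DNS-rebinding (TOCTOU) fix described in the review follow-up can be sketched as resolving once, validating, then pinning the connection through a custom `lookup`. This is a plain-HTTP sketch; `isPrivateAddress` is a stand-in for whatever private-range check the collector actually uses:

```javascript
import http from 'node:http';
import dns from 'node:dns';

// Resolve the hostname once, validate that address, then force http.get
// to connect to exactly that address via a custom lookup callback.
// A second, different DNS answer can no longer redirect the request.
// Resolves with the IncomingMessage; HTTPS handling is omitted for brevity.
async function safeGet(urlString, isPrivateAddress) {
  const url = new URL(urlString);
  const { address, family } = await dns.promises.lookup(url.hostname);
  if (isPrivateAddress(address)) {
    throw new Error(`blocked private address: ${address}`);
  }
  return new Promise((resolve, reject) => {
    const req = http.get(
      url,
      // Pin the connection to the already-validated IP.
      { lookup: (_host, _opts, cb) => cb(null, address, family) },
      resolve
    );
    req.on('error', reject);
  });
}
```

Without the pin, validating the first DNS answer and then letting `http.get` resolve again leaves a window where an attacker-controlled zone can swap in a private address.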
Collaborator Author

@jessie-coco jessie-coco left a comment


Re-review complete (self-authored, can't approve — need Boot or Lucy approval).

All 5 issues from Lucy's initial review properly addressed:

  • H1 (HTML injection): escapeHtml() + sanitizeHref() (http/https only) ✅
  • H2 (Prefetcher unsubscribe): GET → confirmation page, POST → execute. 3 E2E tests. ✅
  • M1-M4: Dead var, stale return, localhost default — cleaned up. ✅

Code quality is solid. HTML template responsive. Retry logic with backoff. 32-byte unsubscribe tokens.

Note: Both this PR and #16 use migration 012_*. Whichever merges second needs to renumber to 013_*.

@boot-coco Please approve if your review agrees — all fixes look correct.
