Intelligent web scraping system that combines deterministic XPath extraction with LLM-powered field-level fallback. The LLM acts as a "compiler" that generates extraction rules, not as a runtime extractor -- dramatically reducing cost while maintaining high accuracy.
```bash
# Install dependencies
pip install -e ".[dev]"

# Run the demo (no API key needed)
make demo
```

The demo runs three scenarios against bundled HTML fixtures:
- Quotes page -- multi-item XPath extraction (quotes, authors, tags)
- Book detail -- single-item extraction (title, price, UPC, description)
- Broken selector -- shows field-level LLM fallback when one XPath breaks
To run against live websites instead of fixtures:

```bash
make demo-live
```

```
XPath extraction (fast, free)
        |
        v
All fields OK? ──yes──> Done
        |
        no
        v
Send ONLY missing fields to LLM (cheap)
        |
        v
Merge results, update coverage stats
```
The key insight: when a selector breaks, only that field goes to the LLM, not the entire page. At scale this saves ~85% vs page-level LLM extraction.
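The fallback loop described above can be sketched as follows. This is an illustrative sketch, not the project's actual API: `ExtractionResult`, `extract_with_fallback`, and the callable parameters are assumed names standing in for the real use case.

```python
# Sketch of field-level LLM fallback: XPath first, LLM only for
# the fields whose selectors came back empty. Names are illustrative.
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    values: dict = field(default_factory=dict)   # field name -> extracted value
    missing: list = field(default_factory=list)  # fields whose XPath failed


def extract_with_fallback(html, selectors, xpath_extract, llm_extract):
    """Try XPath for every field; send only the failed fields to the LLM."""
    result = ExtractionResult()
    for name, xpath in selectors.items():
        value = xpath_extract(html, xpath)
        if value is not None:
            result.values[name] = value
        else:
            result.missing.append(name)
    if result.missing:
        # One LLM call covering just the broken fields, not the whole page
        llm_values = llm_extract(html, result.missing)
        result.values.update(llm_values)
    return result
```

Because the LLM call is skipped entirely when every selector works, the steady-state cost of a healthy config is zero LLM tokens.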
| Component | Status | Details |
|---|---|---|
| Domain models | Done | Pydantic models, custom exceptions |
| Ports (interfaces) | Done | 5 ABCs: Fetcher, Extractor, LLM, Storage, Alerting |
| httpx fetcher | Done | Async HTTP GET |
| lxml extractor | Done | XPath + CSS selector extraction |
| Direct LLM provider | Done | Anthropic API (extract_fields only) |
| Local storage | Done | Filesystem with coverage tracking |
| Console alerting | Done | Stdout alerts |
| ExtractWithFallback | Done | Field-level LLM fallback use case |
| ScrapePage | Done | End-to-end orchestration |
| CLI | Done | `scraper scrape URL`, `scraper info` |
| Tests | Done | 42 tests passing (33 unit + 9 integration) |
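The five ports listed above are small abstract base classes that adapters implement. A hypothetical sketch of one port and a toy adapter (the real interfaces in `scraper/ports/` may use different names and signatures; the toy adapter uses regexes purely for illustration, while the real one uses lxml XPath):

```python
# Hypothetical sketch of the ports/adapters split; not the project's
# actual Extractor interface.
import re
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Port: turn raw HTML plus a selector map into field values."""

    @abstractmethod
    def extract(self, html: str, selectors: dict) -> dict:
        """Return one value per field; None marks a broken selector."""


class RegexExtractor(Extractor):
    """Toy adapter treating 'selectors' as regex patterns (illustration only)."""

    def extract(self, html, selectors):
        return {
            name: (m.group(1) if (m := re.search(pattern, html)) else None)
            for name, pattern in selectors.items()
        }
```

Any adapter satisfying the port can be swapped in without touching the use cases, which is what keeps the Phase 3 Playwright/S3/Knock adapters drop-in.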
- Multi-sample config generation (3-5 pages for robust selectors)
- `propose_selectors()` / `repair_selectors()` in LLM provider
- Validation + repair loops
- GenerateConfig use case
- Playwright fetcher (JavaScript-heavy sites)
- S3 storage, Knock alerting
- PocketFlow LLM orchestration
Clean Architecture with dependency injection. Swap any provider via .env with zero code changes.
```
scraper/
├── domain/          # Models, exceptions (no external deps)
├── ports/           # Abstract interfaces (5 ABCs)
├── adapters/        # Concrete implementations
│   ├── fetchers/    # httpx (done), playwright (Phase 3)
│   ├── extractors/  # lxml (done)
│   ├── llm/         # direct Anthropic/OpenAI (done)
│   ├── storage/     # local filesystem (done), S3 (Phase 3)
│   └── alerting/    # console (done), Knock (Phase 3)
├── use_cases/       # Business workflows
│   ├── scrape_page.py             # End-to-end orchestration
│   └── extract_with_fallback.py   # Field-level LLM fallback
├── config.py        # DI container
└── cli.py           # Click CLI
demo/                # Demo fixtures and script
tests/
├── unit/            # 33 tests, mocked deps
└── integration/     # 9 tests, real adapters
```
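The env-driven adapter swap promised above boils down to a registry lookup in the DI container. A minimal sketch of how such wiring can work, with hypothetical names — the real `scraper/config.py` may differ:

```python
# Hypothetical env-var-driven adapter selection; stand-in classes,
# not the project's real fetchers.
import os


class HttpxFetcher:
    """Stand-in for the real httpx adapter (illustration only)."""


class PlaywrightFetcher:
    """Stand-in for the Playwright adapter planned for Phase 3."""


FETCHERS = {"httpx": HttpxFetcher, "playwright": PlaywrightFetcher}


def build_fetcher():
    """Instantiate the fetcher named by the FETCHER env var (default: httpx)."""
    name = os.environ.get("FETCHER", "httpx")
    if name not in FETCHERS:
        raise ValueError(f"unknown FETCHER {name!r}; options: {sorted(FETCHERS)}")
    return FETCHERS[name]()
```

Because use cases only ever see the port interface, setting `FETCHER=playwright` in `.env` is the entire migration.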
```bash
make test              # Unit tests
make test-integration  # Integration tests
make test-all          # Everything
make lint              # Ruff + mypy
make format            # Black
make demo              # Run demo offline
```

```bash
cp .env.example .env
```

| Variable | Default | Options |
|---|---|---|
| `FETCHER` | `httpx` | `httpx`, `playwright` |
| `STORAGE` | `local` | `local`, `s3` |
| `LLM_PROVIDER` | `direct` | `direct`, `pocketflow` |
| `ALERTING` | `console` | `console`, `knock` |
| `ANTHROPIC_API_KEY` | -- | Required for LLM fallback |
At 100 domains, 10k pages/month each:
| Approach | Annual Cost |
|---|---|
| Page-level LLM (v1.0) | $18,000 |
| Field-level fallback (v2.0) | $2,820 |
| Savings | $15,180/yr |
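The table's figures are mutually consistent. A back-of-envelope check (the per-page cost and fallback fraction below are implied by the table's numbers, not stated in the source):

```python
# Back-of-envelope cost model derived from the table above.
pages_per_year = 100 * 10_000 * 12             # 100 domains x 10k pages/month
cost_per_llm_page = 18_000 / pages_per_year    # implied: $0.0015 per page
fallback_fraction = 2_820 / 18_000             # implied: ~15.7% of cost hits the LLM

page_level = pages_per_year * cost_per_llm_page
field_level = page_level * fallback_fraction
savings = page_level - field_level
```

In other words, sending only broken fields to the LLM cuts spend by roughly 84%, matching the ~85% figure quoted earlier.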
- System Design v2.0 -- Complete technical spec
- ParserGPT Comparison -- Design rationale
- Update Summary -- What changed from v1.0
- ADR-006: ParserGPT Enhancements
Version: 2.0 | Author: Diego | License: Apache 2.0