dccakes/scrape-gpt

LLM Web Scraper v2.0

Intelligent web scraping system that combines deterministic XPath extraction with LLM-powered field-level fallback. The LLM acts as a "compiler" that generates extraction rules, not as a runtime extractor -- dramatically reducing cost while maintaining high accuracy.
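To make the "compiler" idea concrete, here is a sketch of what a compiled extraction config might look like: the LLM emits field-to-XPath rules once per domain, and the scraper replays them deterministically on every page. The key names and selectors below are illustrative (loosely modeled on the book-detail demo), not the project's actual schema.

```python
# Hypothetical compiled config: the LLM generates these rules once,
# then XPath extraction runs for free on every subsequent page.
book_config = {
    "url_pattern": "books.toscrape.com/catalogue/*",   # illustrative
    "selectors": {
        "title": "//div[@class='product_main']/h1/text()",
        "price": "//p[@class='price_color']/text()",
        "upc": "//table//tr[th='UPC']/td/text()",
    },
}
```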

Quick Start

# Install dependencies
pip install -e ".[dev]"

# Run the demo (no API key needed)
make demo

The demo runs three scenarios against bundled HTML fixtures:

  1. Quotes page -- multi-item XPath extraction (quotes, authors, tags)
  2. Book detail -- single-item extraction (title, price, UPC, description)
  3. Broken selector -- shows field-level LLM fallback when one XPath breaks

To run against live websites instead of fixtures:

make demo-live

How It Works

XPath extraction (fast, free)
        |
        v
  All fields OK? ──yes──> Done
        |
       no
        v
Send ONLY missing fields to LLM (cheap)
        |
        v
  Merge results, update coverage stats

The key insight: when a selector breaks, only that field goes to the LLM, not the entire page. At scale this saves ~85% vs page-level LLM extraction.
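The flow above can be sketched in a few lines. This is an assumed shape for the `ExtractWithFallback` use case, with the extractor and LLM passed in as callables; the real implementation lives behind the project's ports, so names and signatures here are illustrative.

```python
def extract_with_fallback(page, selectors, run_xpath, llm_extract):
    """Try every XPath first; send only the failed fields to the LLM."""
    results, missing = {}, []
    for field, xpath in selectors.items():
        value = run_xpath(page, xpath)      # deterministic pass (free)
        if value is not None:
            results[field] = value
        else:
            missing.append(field)           # this selector broke
    if missing:
        # The LLM sees only the missing fields, never the whole schema
        results.update(llm_extract(page, missing))
    return results
```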

Project Status

Phase 1: Core System -- Complete

| Component | Status | Details |
| --- | --- | --- |
| Domain models | Done | Pydantic models, custom exceptions |
| Ports (interfaces) | Done | 5 ABCs: Fetcher, Extractor, LLM, Storage, Alerting |
| httpx fetcher | Done | Async HTTP GET |
| lxml extractor | Done | XPath + CSS selector extraction |
| Direct LLM provider | Done | Anthropic API (extract_fields only) |
| Local storage | Done | Filesystem with coverage tracking |
| Console alerting | Done | Stdout alerts |
| ExtractWithFallback | Done | Field-level LLM fallback use case |
| ScrapePage | Done | End-to-end orchestration |
| CLI | Done | scraper scrape URL, scraper info |
| Tests | Done | 42 tests passing (33 unit + 9 integration) |
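For a sense of what one of the five ports looks like, here is an illustrative sketch of the Fetcher ABC. The method name and signature are assumptions, not copied from the repository.

```python
from abc import ABC, abstractmethod

class Fetcher(ABC):
    """Port for page retrieval; httpx and (later) playwright adapters
    implement it. Illustrative shape, not the repo's exact interface."""

    @abstractmethod
    async def fetch(self, url: str) -> str:
        """Return the raw HTML body for url."""
```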

Phase 2: ParserGPT Enhancements -- Not Started

  • Multi-sample config generation (3-5 pages for robust selectors)
  • propose_selectors() / repair_selectors() in LLM provider
  • Validation + repair loops
  • GenerateConfig use case
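The bullets above could fit together roughly like this. The `propose_selectors` / `repair_selectors` names come from the list; everything else (the `validate` callable, signatures, loop structure) is an assumption about how the pieces might compose, not the planned implementation.

```python
def generate_config(samples, llm, validate, max_repairs=2):
    """Compile selectors from sample pages, repairing fields that fail."""
    selectors = llm.propose_selectors(samples)      # from 3-5 sample pages
    for _ in range(max_repairs):
        broken = validate(selectors, samples)       # fields failing on any sample
        if not broken:
            break
        selectors.update(llm.repair_selectors(samples, broken))
    return selectors
```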

Phase 3: Scale -- Not Started

  • Playwright fetcher (JavaScript-heavy sites)
  • S3 storage, Knock alerting
  • PocketFlow LLM orchestration

Architecture

Clean Architecture with dependency injection. Swap any provider via .env with zero code changes.
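Env-driven provider swapping might be wired up like this minimal sketch. The repo's config.py DI container presumably does something along these lines, but the class names and registry here are stand-ins, not the actual implementation.

```python
import os

class HttpxFetcher: ...          # stand-ins for the real adapters
class PlaywrightFetcher: ...

_FETCHERS = {"httpx": HttpxFetcher, "playwright": PlaywrightFetcher}

def build_fetcher() -> object:
    """Instantiate the fetcher named by the FETCHER env var (default: httpx)."""
    name = os.getenv("FETCHER", "httpx")
    try:
        return _FETCHERS[name]()
    except KeyError:
        raise ValueError(f"unknown FETCHER: {name!r}") from None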

scraper/
├── domain/             # Models, exceptions (no external deps)
├── ports/              # Abstract interfaces (5 ABCs)
├── adapters/           # Concrete implementations
│   ├── fetchers/       #   httpx (done), playwright (Phase 3)
│   ├── extractors/     #   lxml (done)
│   ├── llm/            #   direct Anthropic/OpenAI (done)
│   ├── storage/        #   local filesystem (done), S3 (Phase 3)
│   └── alerting/       #   console (done), Knock (Phase 3)
├── use_cases/          # Business workflows
│   ├── scrape_page.py          # End-to-end orchestration
│   └── extract_with_fallback.py # Field-level LLM fallback
├── config.py           # DI container
└── cli.py              # Click CLI

demo/                   # Demo fixtures and script
tests/
├── unit/               # 33 tests, mocked deps
└── integration/        # 9 tests, real adapters

Development

make test               # Unit tests
make test-integration   # Integration tests
make test-all           # Everything
make lint               # Ruff + mypy
make format             # Black
make demo               # Run demo offline

Configuration

cp .env.example .env
| Variable | Default | Options |
| --- | --- | --- |
| FETCHER | httpx | httpx, playwright |
| STORAGE | local | local, s3 |
| LLM_PROVIDER | direct | direct, pocketflow |
| ALERTING | console | console, knock |
| ANTHROPIC_API_KEY | -- | Required for LLM fallback |
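A .env using the defaults would look like the fragment below (values taken from the Default column; the API-key placeholder is illustrative):

```ini
FETCHER=httpx
STORAGE=local
LLM_PROVIDER=direct
ALERTING=console
# Only needed when the LLM fallback actually fires
ANTHROPIC_API_KEY=your-key-here
```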

Cost at Scale

At 100 domains, 10k pages/month each:

| Approach | Annual Cost |
| --- | --- |
| Page-level LLM (v1.0) | $18,000 |
| Field-level fallback (v2.0) | $2,820 |
| Savings | $15,180/yr |
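The arithmetic behind the table checks out against the "~85%" claim earlier (the per-page rates are implied by the table, not stated independently):

```python
# 100 domains x 10k pages/month, 12 months
pages_per_year = 100 * 10_000 * 12             # 12,000,000 pages
page_level = 18_000                            # v1.0: LLM reads every page
field_level = 2_820                            # v2.0: LLM only on broken fields
savings = page_level - field_level
assert savings == 15_180
print(f"{savings / page_level:.0%} cheaper")   # prints "84% cheaper"
```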

Documentation


Version: 2.0 | Author: Diego | License: Apache 2.0

About

XML Based LLM Scraping
