WebReaper is a workspace-first web crawling and content filing platform for teams that need to collect, organize, and review website data quickly.
The current product direction is centered on four things:
- broad crawling with browser-render fallback
- deep page extraction and structured inventory
- workspace libraries for filing, labeling, and exporting collected pages
- an analyst-facing UI backed by FastAPI and SQLite/Postgres-compatible models
Proxy, repeater, and intruder tooling remain in the product and are useful for analyst workflows, but the primary story is now crawling plus structured content operations.
Feature highlights:

- Async crawl execution with resumable job tracking
- Browser-render fallback for pages that need client-side rendering
- Page metadata, headings, contacts, technology, and content extraction
- Endpoint and parameter inventory derived from links, forms, and observed browser requests
- Duplicate-content, link-health, and content-analysis views
- Workspace-scoped crawl boundaries and scope rules
- Auto-filing suggestions for category, folder, and labels
- Manual filing controls for starring, notes, labels, and folder/category overrides
- Workspace-level summaries, recent pages, and filtered library views
- JSON and CSV export of library datasets (see the curl sketch after this list)
- Proxy session management with HTTP history and intercept queue
- Repeater for replaying saved requests and comparing responses
- Intruder for queued payload fuzzing with result triage
- On-demand security findings, triage metadata, and report export
- Next.js dashboard with static export support
- FastAPI backend with SSE streams and WebSocket/chat plumbing
- Alembic migrations and async SQLAlchemy data layer
- SQLite by default, with a Postgres-compatible schema layout
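Library exports are plain HTTP downloads, so they script well. A minimal sketch of the CSV export; the route, workspace id, and format parameter below are hypothetical, so check http://localhost:8000/docs for the real endpoint:

```bash
# Hypothetical route and parameters; substitute the actual export endpoint
# from the API docs before running.
curl -o library.csv \
  "http://localhost:8000/api/workspaces/1/library/export?format=csv"
```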
Tech stack:

- Backend: FastAPI, SQLAlchemy async ORM, Alembic
- Frontend: Next.js App Router, TypeScript
- Storage: SQLite local default, Postgres-friendly schema
- Streaming: SSE for metrics/logs/progress, WebSocket support for chat/gateway features (exercised in the sketch after this list)
- Background execution: in-process async job queue
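The SSE streams can be smoke-tested straight from the terminal. A sketch with a placeholder metrics path; the actual SSE routes are listed in the API docs:

```bash
# -N disables buffering so server-sent events print as they arrive.
# /api/metrics/stream is a placeholder path; look up the real route at /docs.
curl -N http://localhost:8000/api/metrics/stream
```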
Requirements:

- Python 3.10+
- Node.js 20+ with npm/npx
- `pnpm` is preferred, but `start.sh` will fall back to `npx pnpm` if `pnpm` is not installed globally
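To confirm the toolchain before running the setup script:

```bash
python3 --version                     # expect 3.10 or newer
node --version                        # expect v20 or newer
pnpm --version || npx pnpm --version  # either works; start.sh falls back to npx pnpm
```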
Quick start:

```bash
./start.sh
```

This script:
- creates a local virtual environment if needed
- installs Python and frontend dependencies
- initializes the SQLite database
- runs migrations
- starts the FastAPI backend on http://localhost:8000
- starts the dashboard on http://localhost:3000
Useful endpoints:
- Dashboard: http://localhost:3000
- API docs: http://localhost:8000/docs
- Health check: http://localhost:8000/health
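The health endpoint makes a convenient scripted liveness check:

```bash
# Expect an HTTP 200 once the backend is up.
curl -i http://localhost:8000/health
```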
To seed demo data:

```bash
export PYTHONPATH=.
export DATABASE_URL='sqlite+aiosqlite:////tmp/webreaper_demo.db'
./.venv/bin/python scripts/seed_demo_data.py
```

For a fuller walkthrough, see docs/demo-flow.md.
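Since the schema is Postgres-compatible, the same seed flow should work against Postgres by swapping the DATABASE_URL; a sketch assuming the asyncpg driver is available in the virtual environment (the credentials and database name are placeholders):

```bash
# Placeholder credentials and database name; requires the asyncpg driver.
export DATABASE_URL='postgresql+asyncpg://webreaper:secret@localhost:5432/webreaper'
./.venv/bin/python scripts/seed_demo_data.py
```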
Backend tests:

```bash
./.venv/bin/pytest tests
```
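Standard pytest filters apply when iterating on a single area; the keyword expression here is illustrative:

```bash
# -k filters tests by name, -x stops at the first failure.
./.venv/bin/pytest tests -k "crawl" -x
```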
Frontend build and tests:

```bash
cd web
npx pnpm build
npx pnpm test
```
The dashboard is configured with `output: 'export'`. For a production-style static build:

```bash
cd web
NEXT_PUBLIC_API_URL='http://127.0.0.1:8000' \
NEXT_PUBLIC_WS_URL='ws://127.0.0.1:8000' \
NEXT_PUBLIC_SSE_URL='http://127.0.0.1:8000' \
npx pnpm build
npx pnpm start
```

`pnpm start` serves the generated web/out bundle.
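Because the build is a fully static export, any static file server can serve web/out in place of `pnpm start`; for example, with Python's built-in server:

```bash
# Any static file server works; this serves the exported bundle on port 3000.
python3 -m http.server 3000 --directory web/out
```

Note that the NEXT_PUBLIC_* values are baked into the bundle at build time, so pointing at a different backend requires a rebuild.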
A quick tour of the dashboard:

- Open the dashboard and verify the live metrics stream.
- Start or inspect a crawl from Jobs.
- Review extracted content in Data.
- Open a workspace and review or edit library filings.
- Inspect captured traffic in Proxy.
- Replay a request in Repeater.
- Review a fuzzing job in Intruder.
- Review findings and exports in Security.
Licensing and optional services:

- Local/self-hosted usage does not require license enforcement by default.
- To enable the legacy gated behavior explicitly, set `WEBREAPER_REQUIRE_LICENSE=1` (see the example after this list).
- Missing Supabase or Stripe configuration degrades the related features, but the local crawler/library workflow still runs in development mode.
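For example, to opt back into the gated behavior for a single run (assuming start.sh passes its environment through to the backend):

```bash
# Enable the legacy license gate for this invocation only.
WEBREAPER_REQUIRE_LICENSE=1 ./start.sh
```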
Use interception, replay, fuzzing, or active security testing features only against systems you own or are explicitly authorized to assess.