BioScanCast

BioScanCast is an open source pipeline for biosecurity forecasting using LLMs and automated web retrieval.

The system retrieves internet sources, filters relevant documents, extracts structured information, produces probabilistic forecasts, and supports evaluation against human forecasters.

Current repository contents include:

modular pipeline stages
shared schemas and LLM abstractions
retrieval and extraction tooling
benchmarking and evaluation infrastructure
smoke-test and operational scripts

All five pipeline stages (search, filtering, extraction, insight, forecasting) are implemented and wired into the end-to-end orchestrator.

Project Goals

Build an open source forecasting system for biosecurity questions.
Benchmark model forecasts against human forecasters.
Provide a reproducible research pipeline suitable for publication.
Produce accessible technical and public-facing outputs.

Pipeline Status

Implemented:

Search -> Filtering -> Extraction -> Insight -> Forecasting

Current capabilities include:

LLM query decomposition
web/news retrieval via Tavily
heuristic + optional LLM filtering
HTML/PDF extraction and chunking
hybrid BM25 + embedding retrieval
structured fact extraction with provenance tracking

Pipeline Overview

1. Search Stage

Collect candidate internet sources.

Features:

LLM query decomposition
Tavily retrieval backend
source tier scoring
dashboard injection
URL normalization + deduplication

Output:

List[SearchResult]
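The URL normalization and deduplication step can be illustrated with a minimal standard-library sketch. The function names, the `www.` stripping, and the tracking-parameter list are assumptions made for this example; the actual rules in `url_normalization.py` may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters treated as tracking noise in this sketch.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "fbclid", "gclid",
}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` for duplicate detection (hypothetical rules)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest so ordering does not matter.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(sorted((k, v) for k, v in pairs if k.lower() not in TRACKING_PARAMS))
    return urlunsplit((scheme, host, path, query, ""))  # fragment is dropped


def deduplicate(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each canonical form, preserving input order."""
    seen: set[str] = set()
    unique: list[str] = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Deduplicating on the canonical form while returning the original URL keeps the fetched address unchanged, which matters for provenance.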

2. Filtering Stage

Identify credible and relevant sources.

Features:

heuristic relevance scoring
source credibility scoring
duplicate removal
optional LLM review
extraction-priority assignment

Output:

List[FilteredDocument]

3. Extraction Stage

Fetch and normalize source content.

Features:

HTML/PDF fetching
HTML/PDF parsing
table extraction
chunk normalization
metadata extraction

Output:

List[Document]

4. Insight Stage

Convert extracted text into structured facts.

Features:

BM25 retrieval
embedding retrieval
hybrid reranking
structured fact extraction
provenance tracking
hallucination filtering
cross-document deduplication

Output:

List[InsightRecord]

Design principle:

one chunk -> one extraction call

Each fact must include:

supporting quote
source chunk
source URL

Facts failing substring verification are discarded.
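The substring check can be sketched as follows. This is a minimal illustration, not the code in `chunk_extractor.py`: the `CandidateFact` fields and the whitespace collapsing before comparison are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class CandidateFact:
    # Hypothetical shape for this sketch; the real InsightRecord schema may differ.
    claim: str
    supporting_quote: str
    chunk_id: str
    source_url: str


def _collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text).strip()


def verify_facts(facts, chunk_text):
    """Keep only facts whose quote appears verbatim in the chunk.

    Assumption: whitespace is collapsed on both sides so that line breaks
    introduced by parsing do not cause false rejections.
    """
    haystack = _collapse_whitespace(chunk_text)
    kept = []
    for fact in facts:
        quote = _collapse_whitespace(fact.supporting_quote)
        if quote and quote in haystack:
            kept.append(fact)
    return kept
```

A fact with an empty or paraphrased quote is dropped, so every surviving record can be traced to literal text in its source chunk.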

Current limitations:

the insight smoke test is not connected to extraction outputs (it runs on synthetic documents)
no temporal reasoning layer

5. Forecasting Stage

Turn insight records into a probabilistic forecast over a question's answer options.

Features:

ensemble of superforecaster-style reasoning samples (per-call temperature + varied seed for diversity)
evidence digest built from insight records, tagged by count basis ([cumulative], [incident/<window>], [active]) and any data-quality caveat
log opinion pool (geometric mean of odds) aggregation
confidence-gated extremizing (sharpen only already-decisive forecasts; default extremize=1.0, gate 0.5, tuned on the n=11 BFG benchmark)
retrieval-free baseline forecast for comparison
token/cost accounting via the shared budget tracker

Output:

ForecastResult  # one distribution per forecast_source, over the options

Backtest/calibration tooling: scripts/run_historical_trajectory.py (as-of-date replay), scripts/eval_forecast_calibration.py (offline extremize/calibration analysis), scripts/demo_forecast.py (live demo).
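The aggregation and extremizing steps listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the forecasting stage's code: the function names are hypothetical, the epsilon floor is added for numerical safety, and the gate is interpreted here as a threshold on the top option's probability.

```python
import math


def log_opinion_pool(samples):
    """Aggregate distributions by a normalized geometric mean (log opinion pool).

    `samples` is a list of probability lists over the same options.
    For two options this equals taking the geometric mean of the odds.
    """
    eps = 1e-9  # Assumption: floor that keeps log() finite for zero probabilities.
    n_options = len(samples[0])
    mean_logs = [
        sum(math.log(max(s[i], eps)) for s in samples) / len(samples)
        for i in range(n_options)
    ]
    peak = max(mean_logs)  # subtract the peak before exp() for stability
    weights = [math.exp(v - peak) for v in mean_logs]
    total = sum(weights)
    return [w / total for w in weights]


def gated_extremize(probs, extremize=1.0, gate=0.5):
    """Raise probabilities to the power `extremize` and renormalize, but only
    when the top option already has probability >= `gate` (assumed gate rule)."""
    if extremize == 1.0 or max(probs) < gate:
        return list(probs)
    powered = [p ** extremize for p in probs]
    total = sum(powered)
    return [p / total for p in powered]
```

With the default `extremize=1.0` the second function is a no-op, which matches the documented default; values above 1.0 sharpen only forecasts that pass the gate.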

Repository Structure

bioscancast/
├── bioscancast/
│   ├── datasets/        Source registries and tier definitions
│   ├── llm/             LLM client abstractions
│   ├── orchestration/   End-to-end run orchestration and persistence
│   ├── schemas/         Shared data models
│   ├── stages/
│   │   ├── searching/   Search stage
│   │   ├── filtering/   Filtering stage
│   │   ├── extraction/  Extraction stage
│   │   ├── insight/     Insight stage
│   │   ├── forecasting/ Forecasting stage
│   │   └── evaluation/  Evaluation tooling
│   └── tests/           Unit and integration tests
├── data/
│   ├── docling_eval/
│   └── investigations/
├── scripts/
├── pyproject.toml
├── requirements.txt
└── README.md

Core Modules

Module	Purpose
`datasets/`	curated source registries and source tiers
`llm/`	model abstractions
`orchestration/`	end-to-end run orchestration + persistence
`schemas/`	shared structured contracts
`stages/searching/`	retrieval stage
`stages/filtering/`	source filtering and ranking
`stages/extraction/`	fetching, parsing, chunking
`stages/insight/`	retrieval and fact extraction
`stages/forecasting/`	probabilistic forecasting from insights
`stages/evaluation/`	evaluation tooling

Stage Details

Search Stage

bioscancast/stages/searching/

Implemented modules:

File	Purpose
`pipeline.py`	orchestration
`query_decomposition.py`	LLM sub-query generation
`tier_resolution.py`	source credibility scoring
`dashboard_lookup.py`	dashboard injection
`url_normalization.py`	canonicalization + dedup
`backends/tavily_backend.py`	Tavily backend

Current features:

5–8 LLM-generated subqueries
backend abstraction via SearchBackend
source tier + freshness scoring
aggregator-domain flagging
non-content URL filtering

Known limitations:

English-only retrieval (no multilingual support)
hardcoded dashboard mappings

Filtering Stage

bioscancast/stages/filtering/

Implemented modules:

File	Purpose
`pipeline.py`	orchestration
`heuristics.py`	heuristic scoring
`llm_filter.py`	LLM adjudication
`reranker.py`	borderline reranking
`deduplication.py`	duplicate handling
`postprocess.py`	extraction-priority assignment

Current features:

heuristic relevance scoring
source credibility scoring
optional LLM review
domain caps
extraction-mode assignment
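As an illustration of the domain-cap feature, the sketch below keeps at most a fixed number of results per host, preferring higher scores. The function name, the `(url, score)` input shape, and the default cap of 2 are assumptions for this example, not values taken from the filtering stage.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def apply_domain_cap(scored_urls, max_per_domain=2):
    """Keep at most `max_per_domain` results per host, best scores first.

    `scored_urls` is a list of (url, score) pairs.
    """
    kept = []
    per_domain = defaultdict(int)
    ranked = sorted(scored_urls, key=lambda pair: pair[1], reverse=True)
    for url, score in ranked:
        host = (urlsplit(url).hostname or "").lower()
        if per_domain[host] < max_per_domain:
            per_domain[host] += 1
            kept.append((url, score))
    return kept
```

A cap like this prevents a single prolific domain from crowding out other sources before extraction.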

Extraction Stage

bioscancast/stages/extraction/

Implemented modules:

File	Purpose
`pipeline.py`	orchestration
`fetcher.py`	network retrieval
`chunking.py`	chunk normalization
`parsers/html_parser.py`	HTML extraction
`parsers/pdf_parser.py`	PDF extraction
`docling_refiner.py`	optional table refinement

Current features:

browser-fingerprinted fetching via curl_cffi
BeautifulSoup + trafilatura HTML parsing
PyMuPDF PDF parsing
pdfplumber table fallback
chunk normalization
metadata extraction
document-level provenance tracking

PDF Table Extraction (Docling Refiner)

The default PDF pipeline uses PyMuPDF + pdfplumber with an optional Docling TableFormer refinement pass.

The first refinement run downloads Docling models (~40 MB) into:

~/.cache/huggingface/

Models remain resident in memory (~1.5 GB) for the process lifetime.

Controlled via:

ExtractionConfig.enable_docling_refiner

When disabled, no Docling imports occur.
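One common way to get this behavior is a deferred import guarded by the config flag. The sketch below is a hypothetical illustration: `ExtractionConfigSketch` and `load_refiner_module` are invented names, and only the `enable_docling_refiner` flag comes from this README.

```python
import importlib
from dataclasses import dataclass


@dataclass
class ExtractionConfigSketch:
    # Mirrors the flag named above; everything else here is hypothetical.
    enable_docling_refiner: bool


def load_refiner_module(config):
    """Import Docling only when the refiner is enabled; return None otherwise."""
    if not config.enable_docling_refiner:
        return None
    # Deferred import: the heavy dependency is touched only on this path,
    # so disabled runs pay no import or memory cost.
    return importlib.import_module("docling")
```

Keeping the import inside the function, rather than at module top level, is what makes the disabled path free of Docling entirely.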

Current limitations:

OCR not implemented
scanned PDFs return requires_ocr
no persistent document store
extraction is currently in-memory only

Insight Stage

bioscancast/stages/insight/

Implemented modules:

File	Purpose
`pipeline.py`	orchestration
`retrieval/bm25.py`	lexical retrieval
`retrieval/embeddings.py`	embedding retrieval
`retrieval/hybrid.py`	hybrid reranking
`text_extraction/chunk_extractor.py`	fact extraction

Current features:

BM25 retrieval
embedding similarity retrieval
hybrid scoring
keyword reranking
chunk-level extraction
quote-based hallucination guards
provenance linking
cross-document deduplication
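Hybrid scoring can be sketched as a weighted sum of normalized lexical and embedding scores. This is one common fusion scheme, shown under assumptions: the min-max normalization, the `alpha` weight, and the function names are illustrative and may not match `retrieval/hybrid.py`.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(bm25_scores, embedding_scores, alpha=0.5, top_k=3):
    """Return chunk indices ranked by alpha * lexical + (1 - alpha) * semantic."""
    lexical = min_max(bm25_scores)
    semantic = min_max(embedding_scores)
    fused = [alpha * lex + (1 - alpha) * sem for lex, sem in zip(lexical, semantic)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order[:top_k]
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalized sum would be dominated by the lexical term.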

Evaluation

bioscancast/stages/evaluation/

Implemented modules:

File	Purpose
`evaluator.py`	orchestration
`scoring.py`	forecast scoring
`calibration.py`	calibration metrics
`compare.py`	model vs human comparison
`visualisation.py`	plots and reporting

Repository datasets:

bioscancast_forecasts.csv
bioscancast_questions.csv
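For readers unfamiliar with the Brier and log scores that this README cites as headline metrics, a minimal sketch follows. The function names are hypothetical and the conventions are assumptions: the Brier score here sums squared errors over all options, and the log score is the negative log probability of the realized option. `scoring.py` may use different conventions.

```python
import math


def brier_score(probs, outcome_index):
    """Multi-option Brier score: sum of squared errors against the
    one-hot outcome vector (lower is better)."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def log_score(probs, outcome_index, eps=1e-9):
    """Negative log probability assigned to the realized option (lower is better).

    Assumption: probabilities are floored at `eps` to avoid an infinite penalty.
    """
    return -math.log(max(probs[outcome_index], eps))
```

Both scores are computed per question and then averaged, which makes model and human forecasts directly comparable on the same question set.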

Schemas

bioscancast/schemas/

Shared stage contracts.

Key schemas:

File	Purpose
`document.py`	extracted documents + chunks
`insight_record.py`	extracted facts

Additional filtering models live in:

bioscancast/stages/filtering/models.py

including:

ForecastQuestion
SearchResult
FilteredDocument

Stages should communicate through schemas rather than raw dictionaries.

LLM Integration

bioscancast/llm/

Current files:

File	Purpose
`base.py`	shared protocol + token accounting
`client.py`	legacy/simple OpenAI wrapper
`openai_client.py`	structured extraction client
`fake_client.py`	testing client

The repository currently contains two partially overlapping interfaces:

bioscancast/llm/base.py
bioscancast/llm/client.py

These should eventually be unified.

Historical-replay mode (benchmarking against human forecasters)

When benchmarking the pipeline against human forecasters on past questions, the model must not be allowed to see sources that didn't exist (or contained different content) at the time the human forecasted. Historical-replay mode enforces this by reading a single per-question field, ForecastQuestion.as_of_date:

When as_of_date is None (default), the pipeline behaves exactly as in live mode. No code paths change.
When as_of_date is set, the search backend receives end_date=as_of_date, the cache key incorporates the cutoff, post-retrieval filtering drops any result dated after the cutoff (and any undated result whose date cannot be cheaply recovered), dashboard URLs are rewritten to the closest Wayback snapshot at or before the cutoff (or suppressed if none exists), and the extraction stage fetches from Wayback. Wayback fallback to live is logged at INFO and recorded in Document.fetch_strategy, never silent.
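The post-retrieval cutoff filter described above can be sketched as follows. `ResultSketch` is a hypothetical stand-in for `SearchResult`, and this simplified version treats every undated result as unrecoverable, whereas the pipeline first tries to recover a date cheaply.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ResultSketch:
    # Hypothetical stand-in for SearchResult; only the fields this sketch needs.
    url: str
    published_date: Optional[date]


def apply_cutoff(results, as_of_date):
    """Drop results dated after the cutoff, plus undated ones, when a cutoff is set."""
    if as_of_date is None:
        return list(results)  # live mode: nothing is filtered
    return [
        r for r in results
        if r.published_date is not None and r.published_date <= as_of_date
    ]
```

Dropping undated results is the conservative choice: a result that cannot be dated cannot be shown to predate the cutoff.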

The LLM "historical roleplay" prompt is not automatically enabled by as_of_date; it lives behind a separate historical_roleplay=True flag on SearchStagePipeline because its effect on query quality is harder to predict. Turn it on for the benchmark and off for production.

What this mode does NOT fix: the LLMs themselves were trained on data that postdates many of our benchmark questions. Retrieval fairness ≠ model fairness. The retrieval_free_baseline_forecast metric in bioscancast/stages/evaluation/contamination.py reports how well the LLM forecasts with no evidence at all; a small gap between that and the full pipeline is itself evidence of training-data leakage and must be reported alongside the headline Brier/log scores.

filter_caught_contamination_rate is also exposed by the same module. It is a lower bound on contamination — it only counts post-cutoff results whose published_date is known. Undated results and results whose content changed post-cutoff are invisible to it. Reports MUST surface this caveat; the metric's docstring repeats it for the same reason.

Datasets

bioscancast/datasets/

Curated source definitions and credibility tiers.

File	Purpose
`sources.yaml`	curated source registry
`source_tiers.py`	source credibility tiers

Scripts

scripts/

Operational and smoke-test utilities.

Script	Purpose
`run_searching.py`	run search stage
`run_filtering.py`	run filtering stage
`run_extraction.py`	run extraction stage
`run_insight.py`	run insight smoke test
`eval_docling.py`	Docling evaluation
`eval_hybrid_pdf.py`	PDF extraction benchmarking

Scripts are intended for operational workflows rather than reusable library APIs.

Running the Pipeline

Environment Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Additional packages:

pip install openai tavily-python python-dotenv

Environment Variables

Create .env:

OPENAI_API_KEY=sk-...
TAVILY_API_KEY=tvly-...

Search Stage

python scripts/run_searching.py \
  "Will H5N1 cause more than 100 human cases in the US by December 2026?" \
  --pathogen h5n1 \
  --region "United States"

Optional JSON output:

python scripts/run_searching.py \
  "How many mpox cases will be reported globally by June 2026?" \
  --pathogen mpox \
  --output data/search_results.json

Filtering Stage

python scripts/run_filtering.py

Current limitation:

uses hardcoded sample inputs rather than automatic search-stage ingestion

Output:

data/filtered_results.json

Extraction Stage

Smoke-test mode:

python scripts/run_extraction.py

Using filtered-document JSON:

python scripts/run_extraction.py \
  --input data/filtered_results.json

Output:

data/extraction_results.json

Insight Stage

python scripts/run_insight.py

Current limitation:

uses synthetic documents rather than extraction outputs

End-to-End Orchestrator

bioscancast/main.py runs the search → filter → extract → insight stages against a single forecast question. Each stage's output and a running manifest.json are persisted under data/runs/{question_id}/{run_id}/, so a crashed run still leaves partial artifacts for debugging.
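The persist-as-you-go pattern can be sketched as below. This is a hypothetical illustration: `persist_stage` and the manifest layout (a `stages` list with timestamps) are assumptions, not the orchestrator's actual format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def persist_stage(run_dir, stage, payload):
    """Write one stage's output and update manifest.json immediately."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / f"{stage}.json").write_text(json.dumps(payload, indent=2))

    # Re-read and rewrite the manifest after every stage so that a crash
    # in a later stage still leaves an accurate record on disk.
    manifest_path = run_dir / "manifest.json"
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    else:
        manifest = {"stages": []}
    manifest["stages"].append({
        "stage": stage,
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

Calling this once per completed stage means the manifest always lists exactly the stages whose artifacts exist in the run directory.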

Run by question ID (looked up in the question CSV):

python -m bioscancast.main q7

Or via the scripts/run_* wrapper, matching the per-stage runner convention:

python scripts/run_pipeline.py q7

Historical-replay mode pins the information cutoff so no post-cutoff evidence leaks in:

python scripts/run_pipeline.py q7 --as-of-date 2025-02-28 -v

Common flags (--help for the full list):

Flag	Purpose
`--as-of-date Y-M-D`	Historical-replay cutoff. Omit for live mode.
`--csv PATH`	Question CSV. Default: `bioscancast/stages/evaluation/bioscancast_questions.csv`
`--out-root PATH`	Run-artifact root directory. Default: `data/runs`
`--run-id NAME`	Override the UTC-timestamp run directory name.
`--target-date Y-M-D`	Override the CSV-derived target date.
`--region` / `--pathogen` / `--event-type`	Override the corresponding question fields.
`--no-cache`	Disable the search-stage cache.
`--max-input-tokens N`	Override `InsightConfig.max_input_tokens_per_run`.
`-v`, `--verbose`	Set log level to INFO.

Each run prints per-stage timings and an estimated token cost, and writes question.json, search.json, filtered.json, documents.json, insight.json, and manifest.json to the run directory.

Tests

bioscancast/tests/

Includes:

extraction tests
retrieval tests
pipeline tests
schema validation
search-stage integration tests

Run all tests:

pytest

Run selected tests:

pytest bioscancast/tests/test_extraction_pipeline.py
pytest bioscancast/tests/test_insight_pipeline.py

Live fetch tests are marked:

@pytest.mark.live

and skipped by default.

Run with:

pytest --live

Dependencies

Important dependencies:

Dependency	Usage
`curl_cffi`	browser-fingerprinted HTTP fetching
`rank_bm25`	lexical retrieval
`PyMuPDF`	primary PDF parsing
`pdfplumber`	fallback PDF table extraction

curl_cffi is used in:

bioscancast/stages/extraction/fetcher.py

The impersonation profile is configurable via:

ExtractionConfig.impersonate

Development Principles

Keep pipeline stages modular.
Use schemas between stages.
Prefer structured interfaces over raw dictionaries.
Keep experimental workflows in scripts or notebooks.
Prioritize reproducibility.
Treat provenance and auditability as first-class concerns.

Known Architectural Gaps

Major missing components:

OCR support
persistent storage/vector DB layer
unified LLM abstraction
multilingual retrieval
