BioScanCast is an open source pipeline for biosecurity forecasting using LLMs and automated web retrieval.
The system retrieves internet sources, filters relevant documents, extracts structured information, produces probabilistic forecasts, and supports evaluation against human forecasters.
Current repository contents include:
- modular pipeline stages
- shared schemas and LLM abstractions
- retrieval and extraction tooling
- benchmarking and evaluation infrastructure
- smoke-test and operational scripts
All five pipeline stages (search, filtering, extraction, insight, forecasting) are implemented and wired into the end-to-end orchestrator.
Project goals:
- Build an open source forecasting system for biosecurity questions.
- Benchmark model forecasts against human forecasters.
- Provide a reproducible research pipeline suitable for publication.
- Produce accessible technical and public-facing outputs.
Implemented:
Search -> Filtering -> Extraction -> Insight -> Forecasting
Current capabilities include:
- LLM query decomposition
- web/news retrieval via Tavily
- heuristic + optional LLM filtering
- HTML/PDF extraction and chunking
- hybrid BM25 + embedding retrieval
- structured fact extraction with provenance tracking
Collect candidate internet sources.
Features:
- LLM query decomposition
- Tavily retrieval backend
- source tier scoring
- dashboard injection
- URL normalization + deduplication
Output:
List[SearchResult]

Identify credible and relevant sources.
Features:
- heuristic relevance scoring
- source credibility scoring
- duplicate removal
- optional LLM review
- extraction-priority assignment
Output:
List[FilteredDocument]

Fetch and normalize source content.
Features:
- HTML/PDF fetching
- HTML/PDF parsing
- table extraction
- chunk normalization
- metadata extraction
Output:
List[Document]

Convert extracted text into structured facts.
Features:
- BM25 retrieval
- embedding retrieval
- hybrid reranking
- structured fact extraction
- provenance tracking
- hallucination filtering
- cross-document deduplication
Output:
List[InsightRecord]

Design principle:
one chunk -> one extraction call
Each fact must include:
- supporting quote
- source chunk
- source URL
Facts failing substring verification are discarded.
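The substring check described above can be sketched as follows. This is a minimal illustration rather than the repository's implementation: the `ExtractedFact` dataclass, its field names, and the whitespace/case normalization are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ExtractedFact:
    """Hypothetical fact record; the field names are illustrative only."""
    claim: str
    supporting_quote: str
    chunk_id: str
    source_url: str


def _normalize(text: str) -> str:
    # Collapse whitespace and lowercase so that line breaks or casing
    # differences do not reject a genuine quote (an assumed policy).
    return " ".join(text.split()).lower()


def verify_facts(facts, chunk_text):
    """Keep only facts whose supporting quote occurs in the source chunk."""
    haystack = _normalize(chunk_text)
    kept = []
    for fact in facts:
        quote = _normalize(fact.supporting_quote)
        # An empty quote can never count as evidence.
        if quote and quote in haystack:
            kept.append(fact)
    return kept
```

A stricter variant could skip the normalization and require a byte-exact match, trading recall for a tighter hallucination guard.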
Current limitations:
- disconnected from extraction outputs in smoke tests
- no temporal reasoning layer
Turn insight records into a probabilistic forecast over a question's answer options.
Features:
- ensemble of superforecaster-style reasoning samples (per-call temperature + varied seed for diversity)
- evidence digest built from insight records, tagged by count basis ([cumulative], [incident/<window>], [active]) and any data-quality caveat
- log opinion pool (geometric mean of odds) aggregation
- confidence-gated extremizing (sharpen only already-decisive forecasts; default extremize=1.0, gate 0.5, tuned on the n=11 BFG benchmark)
- retrieval-free baseline forecast for comparison
- token/cost accounting via the shared budget tracker
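To make the aggregation step concrete, here is a hedged sketch of a log opinion pool followed by gated extremizing. The function names, the probability floor, and the reading of the gate as a threshold on the top probability are assumptions for illustration; the forecasting stage may define them differently.

```python
import math


def log_opinion_pool(samples):
    """Pool distributions via the normalized geometric mean of probabilities.

    With two options this is equivalent to taking the geometric mean of
    the odds across samples.
    """
    eps = 1e-9  # floor to avoid log(0); the value is an assumption
    n_options = len(samples[0])
    mean_logs = [
        sum(math.log(max(s[i], eps)) for s in samples) / len(samples)
        for i in range(n_options)
    ]
    weights = [math.exp(v) for v in mean_logs]
    total = sum(weights)
    return [w / total for w in weights]


def apply_extremizing(dist, extremize=1.0, gate=0.5):
    """Sharpen a distribution only when its top probability reaches the gate."""
    if extremize == 1.0 or max(dist) < gate:
        return list(dist)  # leave undecided forecasts untouched
    powered = [p ** extremize for p in dist]
    total = sum(powered)
    return [p / total for p in powered]
```

With the default `extremize=1.0` the second step is a no-op, which matches the default listed above.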
Output:
ForecastResult  # one distribution per forecast_source, over the options

Backtest/calibration tooling: scripts/run_historical_trajectory.py
(as-of-date replay), scripts/eval_forecast_calibration.py (offline
extremize/calibration analysis), scripts/demo_forecast.py (live demo).
bioscancast/
├── bioscancast/
│ ├── datasets/ Source registries and tier definitions
│ ├── llm/ LLM client abstractions
│ ├── orchestration/ End-to-end run orchestration and persistence
│ ├── schemas/ Shared data models
│ ├── stages/
│ │ ├── searching/ Search stage
│ │ ├── filtering/ Filtering stage
│ │ ├── extraction/ Extraction stage
│ │ ├── insight/ Insight stage
│ │ ├── forecasting/ Forecasting stage
│ │ └── evaluation/ Evaluation tooling
│ └── tests/ Unit and integration tests
├── data/
│ ├── docling_eval/
│ └── investigations/
├── scripts/
├── pyproject.toml
├── requirements.txt
└── README.md
| Module | Purpose |
|---|---|
| datasets/ | curated source registries and source tiers |
| llm/ | model abstractions |
| orchestration/ | end-to-end run orchestration + persistence |
| schemas/ | shared structured contracts |
| stages/searching/ | retrieval stage |
| stages/filtering/ | source filtering and ranking |
| stages/extraction/ | fetching, parsing, chunking |
| stages/insight/ | retrieval and fact extraction |
| stages/forecasting/ | probabilistic forecasting from insights |
| stages/evaluation/ | evaluation tooling |
bioscancast/stages/searching/
Implemented modules:
| File | Purpose |
|---|---|
| pipeline.py | orchestration |
| query_decomposition.py | LLM sub-query generation |
| tier_resolution.py | source credibility scoring |
| dashboard_lookup.py | dashboard injection |
| url_normalization.py | canonicalization + dedup |
| backends/tavily_backend.py | Tavily backend |
Current features:
- 5–8 LLM-generated subqueries
- backend abstraction via SearchBackend
- source tier + freshness scoring
- aggregator-domain flagging
- non-content URL filtering
Known limitations:
- English-only retrieval (no multilingual support)
- hardcoded dashboard mappings
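As an illustration of the canonicalization + dedup step handled by url_normalization.py, the sketch below uses only the standard library. The specific rules (dropping `www.`, stripping a fixed set of tracking parameters, removing fragments and trailing slashes) are assumptions and may not match the repository's actual rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters treated as tracking noise (an assumed, illustrative list).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Return a canonical form of the URL for deduplication."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest for a stable ordering.
    params = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    # The fragment is discarded because it never changes the fetched content.
    return urlunsplit((scheme, host, path, urlencode(params), ""))


def dedupe_urls(urls):
    """Keep the first URL seen for each canonical form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Keeping the first occurrence preserves the backend's original ranking among duplicates.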
bioscancast/stages/filtering/
Implemented modules:
| File | Purpose |
|---|---|
| pipeline.py | orchestration |
| heuristics.py | heuristic scoring |
| llm_filter.py | LLM adjudication |
| reranker.py | borderline reranking |
| deduplication.py | duplicate handling |
| postprocess.py | extraction-priority assignment |
Current features:
- heuristic relevance scoring
- source credibility scoring
- optional LLM review
- domain caps
- extraction-mode assignment
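One plausible reading of the domain caps listed above is a per-host limit applied after scoring, so that a single outlet cannot dominate the extraction queue. The sketch below assumes `(url, score)` pairs and a hypothetical `max_per_domain` parameter; neither is taken from the repository.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def apply_domain_cap(scored_docs, max_per_domain=3):
    """Keep at most `max_per_domain` documents per host, best scores first.

    `scored_docs` is a list of (url, score) pairs; this input shape is an
    assumption made for the example.
    """
    counts = defaultdict(int)
    kept = []
    for url, score in sorted(scored_docs, key=lambda d: d[1], reverse=True):
        domain = (urlsplit(url).hostname or "").lower()
        if counts[domain] < max_per_domain:
            counts[domain] += 1
            kept.append((url, score))
    return kept
```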
bioscancast/stages/extraction/
Implemented modules:
| File | Purpose |
|---|---|
| pipeline.py | orchestration |
| fetcher.py | network retrieval |
| chunking.py | chunk normalization |
| parsers/html_parser.py | HTML extraction |
| parsers/pdf_parser.py | PDF extraction |
| docling_refiner.py | optional table refinement |
Current features:
- browser-fingerprinted fetching via curl_cffi
- BeautifulSoup + trafilatura HTML parsing
- PyMuPDF PDF parsing
- pdfplumber table fallback
- chunk normalization
- metadata extraction
- document-level provenance tracking
The default PDF pipeline uses PyMuPDF + pdfplumber with an optional Docling TableFormer refinement pass.
The first refinement run downloads Docling models (~40 MB) into:
~/.cache/huggingface/
Models remain resident in memory (~1.5 GB) for the process lifetime.
The refinement pass is controlled via:
ExtractionConfig.enable_docling_refiner

When disabled, no Docling imports occur.
Current limitations:
- OCR not implemented
- scanned PDFs return requires_ocr
- no persistent document store
- extraction is currently in-memory only
bioscancast/stages/insight/
Implemented modules:
| File | Purpose |
|---|---|
| pipeline.py | orchestration |
| retrieval/bm25.py | lexical retrieval |
| retrieval/embeddings.py | embedding retrieval |
| retrieval/hybrid.py | hybrid reranking |
| text_extraction/chunk_extractor.py | fact extraction |
Current features:
- BM25 retrieval
- embedding similarity retrieval
- hybrid scoring
- keyword reranking
- chunk-level extraction
- quote-based hallucination guards
- provenance linking
- cross-document deduplication
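Hybrid scoring can be illustrated with a weighted sum of normalized lexical and embedding scores. This is a sketch under assumptions: the min-max normalization, the `alpha` weight, and the function names are illustrative, and retrieval/hybrid.py may combine scores differently (for example by rank fusion).

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(bm25_scores, embedding_scores, alpha=0.5, top_k=5):
    """Rank chunk indices by a weighted sum of normalized scores.

    `alpha` weights the lexical (BM25) side; 1 - alpha weights the
    embedding-similarity side.
    """
    if len(bm25_scores) != len(embedding_scores):
        raise ValueError("score lists must align one-to-one with chunks")
    lexical = min_max(bm25_scores)
    semantic = min_max(embedding_scores)
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return [(i, combined[i]) for i in order[:top_k]]
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalized sum would let one side dominate.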
bioscancast/stages/evaluation/
Implemented modules:
| File | Purpose |
|---|---|
| evaluator.py | orchestration |
| scoring.py | forecast scoring |
| calibration.py | calibration metrics |
| compare.py | model vs human comparison |
| visualisation.py | plots and reporting |
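For readers unfamiliar with forecast scoring, the sketch below shows a multi-option Brier score, the kind of metric used to compare model and human forecasts. It is illustrative only: the convention used here (sum of squared errors over all options, so scores range from 0 to 2) and the function names are assumptions, and scoring.py may use a different convention.

```python
def brier_score(probabilities, outcome_index):
    """Sum of squared errors between a forecast and the one-hot outcome.

    Lower is better; 0.0 means full probability on the realized option.
    """
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probabilities)
    )


def mean_brier(forecasts):
    """Average Brier score over (probabilities, outcome_index) pairs."""
    scores = [brier_score(probs, outcome) for probs, outcome in forecasts]
    return sum(scores) / len(scores)
```

Computing `mean_brier` separately for model and human forecasts on the same questions gives a like-for-like comparison.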
Repository datasets:
bioscancast_forecasts.csv
bioscancast_questions.csv
bioscancast/schemas/
Shared stage contracts.
Key schemas:
| File | Purpose |
|---|---|
| document.py | extracted documents + chunks |
| insight_record.py | extracted facts |
Additional filtering models live in:
bioscancast/stages/filtering/models.py
including:
- ForecastQuestion
- SearchResult
- FilteredDocument
Stages should communicate through schemas rather than raw dictionaries.
bioscancast/llm/
Current files:
| File | Purpose |
|---|---|
| base.py | shared protocol + token accounting |
| client.py | legacy/simple OpenAI wrapper |
| openai_client.py | structured extraction client |
| fake_client.py | testing client |
The repository currently contains two partially overlapping interfaces:
bioscancast/llm/base.py
bioscancast/llm/client.py
These should eventually be unified.
When benchmarking the pipeline against human forecasters on past questions,
the model must not be allowed to see sources that didn't exist (or contained
different content) at the time the human forecasted. Historical-replay mode
enforces this by reading a single per-question field, ForecastQuestion.as_of_date:
- When as_of_date is None (default), the pipeline behaves exactly as in live mode. No code paths change.
- When as_of_date is set, the search backend receives end_date=as_of_date, the cache key incorporates the cutoff, post-retrieval filtering drops any result dated after the cutoff (and any undated result whose date cannot be cheaply recovered), dashboard URLs are rewritten to the closest Wayback snapshot at or before the cutoff (or suppressed if none exists), and the extraction stage fetches from Wayback. Wayback fallback to live is logged at INFO and recorded in Document.fetch_strategy, never silent.
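The post-retrieval cutoff filter can be sketched as below. This is a simplified illustration: `RetrievedResult` is a hypothetical stand-in for the real search-result schema, and the sketch drops every undated result, whereas the pipeline first tries to recover a date cheaply.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RetrievedResult:
    """Hypothetical stand-in for a search result; not the real schema."""
    url: str
    published_date: Optional[date] = None


def filter_by_cutoff(results, as_of_date: Optional[date]):
    """Drop results that cannot be shown to predate the cutoff."""
    if as_of_date is None:
        # Live mode: nothing is filtered.
        return list(results)
    kept = []
    for result in results:
        if result.published_date is None:
            continue  # undated: cannot prove it existed at the cutoff
        if result.published_date <= as_of_date:
            kept.append(result)
    return kept
```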
The LLM "historical roleplay" prompt is not automatically enabled by
as_of_date; it lives behind a separate historical_roleplay=True flag on
SearchStagePipeline because its effect on query quality is harder to
predict. Turn it on for the benchmark and off for production.
What this mode does NOT fix: the LLMs themselves were trained on data that
postdates many of our benchmark questions. Retrieval fairness ≠ model
fairness. The retrieval_free_baseline_forecast metric in
bioscancast/stages/evaluation/contamination.py reports how well the LLM
forecasts with no evidence at all; a small gap between that and the full
pipeline is itself evidence of training-data leakage and must be reported
alongside the headline Brier/log scores.
filter_caught_contamination_rate is also exposed by the same module. It
is a lower bound on contamination — it only counts post-cutoff results
whose published_date is known. Undated results and results whose content
changed post-cutoff are invisible to it. Reports MUST surface this caveat;
the metric's docstring repeats it for the same reason.
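The lower-bound caveat is easier to surface if the undated count travels with the rate. The sketch below is not the code in contamination.py: the function name, the choice of all retrieved results as the denominator, and the returned dictionary keys other than the metric name are assumptions.

```python
from datetime import date
from typing import Optional, Sequence


def contamination_report(
    published_dates: Sequence[Optional[date]], cutoff: date
) -> dict:
    """Summarize how many retrieved results were caught as post-cutoff.

    Only results with a known date can be caught, so the rate is a lower
    bound on true contamination.
    """
    total = len(published_dates)
    caught = sum(1 for d in published_dates if d is not None and d > cutoff)
    undated = sum(1 for d in published_dates if d is None)
    return {
        "filter_caught_contamination_rate": caught / total if total else 0.0,
        "undated_results": undated,  # invisible to the rate above
        "is_lower_bound": True,
    }
```

Reporting `undated_results` next to the rate lets a reader judge how loose the lower bound might be.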
bioscancast/datasets/
Curated source definitions and credibility tiers.
| File | Purpose |
|---|---|
| sources.yaml | curated source registry |
| source_tiers.py | source credibility tiers |
scripts/
Operational and smoke-test utilities.
| Script | Purpose |
|---|---|
| run_searching.py | run search stage |
| run_filtering.py | run filtering stage |
| run_extraction.py | run extraction stage |
| run_insight.py | run insight smoke test |
| eval_docling.py | Docling evaluation |
| eval_hybrid_pdf.py | PDF extraction benchmarking |
Scripts are intended for operational workflows rather than reusable library APIs.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Additional packages:
pip install openai tavily-python python-dotenv

Create .env:
OPENAI_API_KEY=sk-...
TAVILY_API_KEY=tvly-...

python scripts/run_searching.py \
"Will H5N1 cause more than 100 human cases in the US by December 2026?" \
--pathogen h5n1 \
--region "United States"

Optional JSON output:
python scripts/run_searching.py \
"How many mpox cases will be reported globally by June 2026?" \
--pathogen mpox \
--output data/search_results.json

python scripts/run_filtering.py

Current limitation:
- uses hardcoded sample inputs rather than automatic search-stage ingestion
Output:
data/filtered_results.json
Smoke-test mode:
python scripts/run_extraction.py

Using filtered-document JSON:
python scripts/run_extraction.py \
--input data/filtered_results.json

Output:
data/extraction_results.json
python scripts/run_insight.py

Current limitation:
- uses synthetic documents rather than extraction outputs
bioscancast/main.py runs four stages (search → filter → extract → insight)
against a single forecast question. Each stage's output and a running
manifest.json are persisted under data/runs/{question_id}/{run_id}/, so a
crashed run still leaves partial artifacts for debugging.
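The persist-as-you-go behaviour described above can be sketched as follows. This is an assumed design, not the orchestrator's code: `persist_stage`, `run_stages`, and the manifest keys are hypothetical names chosen for the example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def persist_stage(run_dir: Path, stage: str, payload, manifest: dict) -> None:
    """Write one stage's output, then rewrite manifest.json to record it."""
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / f"{stage}.json").write_text(json.dumps(payload, indent=2))
    manifest.setdefault("completed_stages", []).append(stage)
    manifest["updated_at"] = datetime.now(timezone.utc).isoformat()
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))


def run_stages(run_dir: Path, stages, initial):
    """Run (name, callable) stages in order, persisting after each one."""
    manifest = {"status": "running"}
    data = initial
    for name, stage_fn in stages:
        # If this call raises, artifacts from earlier stages and the last
        # manifest written are already on disk for debugging.
        data = stage_fn(data)
        persist_stage(run_dir, name, data, manifest)
    manifest["status"] = "completed"
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return data
```

Rewriting the manifest after every stage, rather than once at the end, is what makes a crashed run inspectable.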
Run by question ID (looked up in the question CSV):
python -m bioscancast.main q7

Or via the scripts/run_* wrapper, matching the per-stage runner convention:
python scripts/run_pipeline.py q7

Historical-replay mode pins the information cutoff so no post-cutoff evidence leaks in:
python scripts/run_pipeline.py q7 --as-of-date 2025-02-28 -v

Common flags (--help for the full list):
| Flag | Purpose |
|---|---|
| --as-of-date Y-M-D | Historical-replay cutoff. Omit for live mode. |
| --csv PATH | Question CSV. Default: bioscancast/stages/evaluation/bioscancast_questions.csv |
| --out-root PATH | Run-artifact root directory. Default: data/runs |
| --run-id NAME | Override the UTC-timestamp run directory name. |
| --target-date Y-M-D | Override the CSV-derived target date. |
| --region / --pathogen / --event-type | Override the corresponding question fields. |
| --no-cache | Disable the search-stage cache. |
| --max-input-tokens N | Override InsightConfig.max_input_tokens_per_run. |
| -v, --verbose | Set log level to INFO. |
Each run prints per-stage timings and an estimated token cost, and writes
question.json, search.json, filtered.json, documents.json,
insight.json, and manifest.json to the run directory.
bioscancast/tests/
Includes:
- extraction tests
- retrieval tests
- pipeline tests
- schema validation
- search-stage integration tests
Run all tests:
pytest

Run selected tests:
pytest bioscancast/tests/test_extraction_pipeline.py
pytest bioscancast/tests/test_insight_pipeline.py

Live fetch tests are marked @pytest.mark.live and skipped by default.
Run with:
pytest --live

Important dependencies:
| Dependency | Usage |
|---|---|
| curl_cffi | browser-fingerprinted HTTP fetching |
| rank_bm25 | lexical retrieval |
| PyMuPDF | primary PDF parsing |
| pdfplumber | fallback PDF table extraction |
curl_cffi is used in:
bioscancast/stages/extraction/fetcher.py
The impersonation profile is configurable via:
ExtractionConfig.impersonate

Development guidelines:
- Keep pipeline stages modular.
- Use schemas between stages.
- Prefer structured interfaces over raw dictionaries.
- Keep experimental workflows in scripts or notebooks.
- Prioritize reproducibility.
- Treat provenance and auditability as first-class concerns.
Major missing components:
- OCR support
- persistent storage/vector DB layer
- unified LLM abstraction
- multilingual retrieval