
PDF → Markdown Evaluation

Website: https://jiyouhai.github.io/pdf-to-markdown/
Repo: https://github.com/jiyouhai/pdf-to-markdown

Reproducible, parser-agnostic benchmarks for turning PDFs into Markdown—and measuring downstream usefulness with retrieval-QA, not just visual fidelity.


Table of Contents

  • Goals
  • Methods at a Glance
  • Repository Layout
  • Quick Start
  • Metrics & Thresholds
  • Evaluation Pipeline (Automated QA)
  • Results Snapshot (Aug 16, 2025)
  • Add a New PDF Dataset
  • Question Set (JSONL) Format
  • Parsers & Output Conventions
  • Publishing with GitHub Pages
  • Optional: Build a PDF Report via LaTeX + CI
  • Roadmap
  • FAQ
  • Contributing
  • License
  • Citation

Goals

  • Apples-to-apples comparison of multiple PDF→Markdown parsers.
  • Emphasize downstream quality (retrieval/QA over Markdown) alongside structural/visual fidelity.
  • Keep everything simple, auditable, reproducible in a public repo.

Methods at a Glance

We use two human-in-the-loop methods plus one automated benchmark:

  1. LLM Desk Review (“Deep Research”). Zip each parser’s outputs, upload them to ChatGPT (GPT-5 Thinking / Pro), and have it draft a report against a rubric (structure, formatting, tables/math, cleanliness, post-processing, automation readiness).

  2. Side-by-Side PDF ↔ Markdown Review. Open the original PDF next to each parser’s Markdown (9 windows for 9 docs). The LLM scores against the rubric while you spot-check.

  3. Automated Retrieval-QA Benchmark (this repo’s code). Treat each parser’s Markdown set as a knowledge base, then run BM25 retrieval + grounded QA + a rubric judge to get accuracy and overconfidence numbers that reflect real RAG usage. This is the main quantitative signal we publish.


Repository Layout

pdf-to-markdown/
├─ docs/                      # Served via GitHub Pages
│  ├─ index.html              # Landing page
│  ├─ pdfs/                   # Original PDFs & rendered reports
│  ├─ md/                     # Parser outputs (Markdown)
│  │  └─ <parser>/<doc_id>/   # Many per-page .md OR a single .md
│  └─ questions/              # One JSONL per dataset (up to 100 Qs)
├─ report/                    # (Optional) LaTeX sources for a printable report
├─ runs_rubric/               # Outputs from automated QA runs
└─ .github/workflows/         # (Optional) CI for LaTeX & site checks

Naming
  • doc_id: short, URL-safe (e.g., TP-MVD8MV2-rotated).
  • Parser names are free (e.g., textin, Reducto, md_mathpix, md_marker, own tool); be consistent.


Quick Start

  1. Prepare data
     • Put PDFs in docs/pdfs/<doc_id>.pdf.
     • Put each parser’s Markdown in docs/md/<parser>/<doc_id>/ (either per-page 001.md, 002.md, … or a single full.md).
     • Author up to 100 questions in docs/questions/<doc_id>.jsonl (format below).

  2. Install

    python -m venv .venv && source .venv/bin/activate
    pip install -U pandas jsonlines rapidfuzz rank_bm25 openai tiktoken
    export OPENAI_API_KEY=sk-...   # required for answering/judging

  3. Run the automated QA benchmark

    python eval_pdfmd_rubric.py \
      --parsers md_marker md_mathpix Reducto textin "own tool" \
      --chunk 1000 --overlap 200 --topk 5 \
      --answer-model gpt-4.1 --judge-model o3-mini \
      --ok-threshold 0.75

  4. Inspect outputs
     • Per-question details: runs_rubric/per_question.csv
     • Aggregates: runs_rubric/summary.csv
     • Static site: enable GitHub Pages for /docs and browse datasets/outputs online.

Reference run (“what we did”)
  • 9 PDFs; 100 questions per PDF (JSONL).
  • Read each parser’s Markdown from docs/md/<parser>/<doc_id>/ (directory or single file).
  • Chunk: 1000 tokens with 200-token overlap; BM25 per document; retrieve top-k=5 (see the retrieval sketch below).
  • Answer with gpt-4.1 using only the retrieved text (may return NOT_FOUND).
  • Fast rules: exact; numeric≈ (±2% or 1e-6 abs); fuzzy≥90 (RapidFuzz).
  • Else judge with o3-mini (evidence-only rubric) → score∈[0,1].
  • Derive per-question flags and aggregate per (parser, doc): acc, score_mean, overconf_rate.
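
For concreteness, here is a minimal sketch of the chunk-and-retrieve step with the default settings (1000-token chunks, 200-token overlap, top-k=5). It is not the code in eval_pdfmd_rubric.py; the helper names and the cl100k_base tokenizer choice are illustrative.

# Sketch only: chunk a document's Markdown and retrieve top-k chunks per question.
import tiktoken
from rank_bm25 import BM25Okapi

enc = tiktoken.get_encoding("cl100k_base")

def chunk_tokens(text, size=1000, overlap=200):
    # Split into ~size-token chunks with a fixed token overlap.
    ids = enc.encode(text)
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(enc.decode(ids[start:start + size]))
        if start + size >= len(ids):
            break
        start += size - overlap
    return chunks

def build_bm25(chunks):
    # One BM25 index per document; whitespace tokenization keeps the sketch simple.
    return BM25Okapi([c.lower().split() for c in chunks])

def retrieve(bm25, chunks, question, k=5):
    # Return the top-k chunks for a question.
    return bm25.get_top_n(question.lower().split(), chunks, n=k)

Typical use: chunks = chunk_tokens(markdown); bm25 = build_bm25(chunks); top = retrieve(bm25, chunks, question).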


Metrics & Thresholds

For parser p, document d, question i, the judge returns score[p,d,i]∈[0,1] from an evidence-only rubric.

Accuracy (acc): pass rate at threshold 0.75

  acc[p,d] = (1/n_d) ∑_i 1[ score[p,d,i] ≥ 0.75 ]

Why 0.75? Most questions decompose into 4–6 atomic facts, so 0.75 ≈ “clear majority supported” (≥3/4 or ≥4/5). In pilot sweeps over 0.60–0.90 it best balanced (i) low false accepts on manuals, (ii) not punishing concise correct answers, and (iii) stable rankings across 0.70–0.80.

Mean score (score_mean): continuous quality

  score_mean[p,d] = (1/n_d) ∑_i score[p,d,i]

Useful to see near-misses beyond the binary cut.

Overconfidence (overconf_rate): wrong assertions vs. abstaining

  overconf_rate[p,d] = (1/n_d) ∑_i 1[ ok[p,d,i] = 0 and prediction[p,d,i] ≠ NOT_FOUND ]

A wrong assertion is worse than abstaining; this metric counts how often that happens.

Penalized score (optional, calibration-aware)

  penalized_λ = acc_mean − λ · overconf_mean

Default λ=0.5; λ∈[0.25,1.0] yields the same ordering on our 9-doc set.
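
As a reading aid, here is a minimal pandas sketch of these aggregations. The column names (parser, doc, score, prediction) are assumptions and may not match the exact headers in runs_rubric/per_question.csv.

# Sketch: aggregate per-question rows into acc, score_mean, overconf_rate, penalized.
# Column names are illustrative; check runs_rubric/per_question.csv for the real headers.
import pandas as pd

OK_THRESHOLD = 0.75
LAMBDA = 0.5

per_q = pd.read_csv("runs_rubric/per_question.csv")
per_q["ok"] = (per_q["score"] >= OK_THRESHOLD).astype(int)
per_q["overconf"] = ((per_q["ok"] == 0) & (per_q["prediction"] != "NOT_FOUND")).astype(int)

summary = (
    per_q.groupby(["parser", "doc"])
         .agg(n=("score", "size"),
              acc=("ok", "mean"),
              score_mean=("score", "mean"),
              overconf_rate=("overconf", "mean"))
         .reset_index()
)
summary["penalized"] = summary["acc"] - LAMBDA * summary["overconf_rate"]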

Fast rule matches (auto full credit before judging)
  • Exact normalized match
  • Numeric≈: all gold numbers within 2% (or 1e-6 abs)
  • Fuzzy: RapidFuzz partial ratio ≥90
  • Gold empty + prediction = NOT_FOUND
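
A rough sketch of these fast rules, assuming a simple lowercase/whitespace normalizer and regex number extraction; the real script’s normalization and number pairing may differ.

# Sketch of the fast rules (exact / numeric≈ / fuzzy / empty-gold).
import re
from rapidfuzz import fuzz

def normalize(s):
    return re.sub(r"\s+", " ", s.strip().lower())

def numbers(s):
    return [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", s)]

def fast_match(gold, pred):
    """Return True if the prediction earns automatic full credit."""
    if not gold.strip():
        return pred.strip() == "NOT_FOUND"            # empty gold: only NOT_FOUND passes
    if normalize(gold) == normalize(pred):
        return True                                   # exact normalized match
    g, p = numbers(gold), numbers(pred)
    if g and len(g) == len(p) and all(
        abs(a - b) <= max(0.02 * abs(a), 1e-6) for a, b in zip(g, p)
    ):
        return True                                   # numeric≈: within 2% or 1e-6 abs (simplified pairing)
    return fuzz.partial_ratio(normalize(gold), normalize(pred)) >= 90   # fuzzy ≥90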


Evaluation Pipeline (Automated QA)

  1. Inputs → read Markdown (dir or single-file), concatenate & normalize
  2. Chunk 1000 tokens + 200 overlap
  3. BM25 index (per document)
  4. For each question: retrieve top-k=5 chunks
  5. gpt-4.1 answers using only retrieved text (may return NOT_FOUND)
  6. Fast rules (exact / numeric≈ / fuzzy≥90); else o3-mini rubric judge (score∈[0,1])
  7. Compute per-Q flags: ok=(score≥0.75); overconf=(ok=0 and pred≠NOT_FOUND)
  8. Aggregate per (parser, doc): n, acc, score_mean, overconf_rate
  9. Write CSVs and publish tables on the site

Defaults: --chunk 1000, --overlap 200, --topk 5, --ok-threshold 0.75, --answer-model gpt-4.1, --judge-model o3-mini.
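
To make step 5 concrete, here is a hedged sketch of context-only answering with the OpenAI Python SDK; the actual prompt wording and request options in eval_pdfmd_rubric.py may differ.

# Sketch of step 5: answer strictly from the retrieved chunks, or return NOT_FOUND.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question, chunks, model="gpt-4.1"):
    context = "\n\n---\n\n".join(chunks)
    system = (
        "Answer using ONLY the provided context. "
        "If the answer is not in the context, reply exactly NOT_FOUND."
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content.strip()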


Results Snapshot (Aug 16, 2025)

Ranking by mean accuracy (9 docs):

  1. textin (0.578) → 2) md_mathpix (0.568) → 3) md_marker (0.564) → 4) Reducto (0.543) → 5) own tool (0.454)

Overall by parser (macro≈micro; n constant per doc)

Parser       Docs  Total n  acc_mean  acc_median  acc_min  acc_max  overconf_mean
textin          9      900     0.578       0.600    0.100    0.930          0.379
md_mathpix      9      900     0.568       0.680    0.120    0.940          0.373
md_marker       9      900     0.564       0.610    0.120    0.940          0.372
Reducto         9      900     0.543       0.570    0.110    0.940          0.389
own tool        9      900     0.454       0.400    0.090    0.950          0.392

Penalized (λ=0.5) — ordering unchanged

Parser       penalized_0.5
textin               0.388
md_mathpix           0.381
md_marker            0.378
Reducto              0.349
own tool             0.258

Per-doc winners are split (textin/mathpix/marker/own-tool: 2 each; Reducto: 1). CSVs are under runs_rubric/.


Add a New PDF Dataset

  1. Put the PDF: docs/pdfs/<doc_id>.pdf

  2. Add each parser’s Markdown:
     • docs/md/<parser>/<doc_id>/001.md, 002.md, …
     • or docs/md/<parser>/<doc_id>/full.md

  3. Author up to 100 questions: docs/questions/<doc_id>.jsonl

  4. Commit:

    git add docs/pdfs/<doc_id>.pdf docs/md/** docs/questions/<doc_id>.jsonl
    git commit -m "docs: add <doc_id> (PDF, parser MD, questions)"
    git push


Question Set (JSONL) Format

One JSON object per line.

Minimal

{"id":"Q001","q":"What is the warranty term?"}
{"id":"Q002","q":"Report the rated voltage from the spec table."}

With reference answers (preferred for rubric)

{"id":"Q001","q":"What is the warranty term?","a":"5 years"}
{"id":"Q002","q":"Report the rated voltage from the spec table.","a":"56 V"}

Loaders also accept legacy keys (qid, question, answer, gold). The script normalizes them.
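
A small sketch of how such normalization can look; the alias map below mirrors the documented keys, and the real loader may handle more cases.

# Sketch: load a question JSONL and normalize legacy keys to id / q / a.
import json

ALIASES = {"qid": "id", "question": "q", "answer": "a", "gold": "a"}

def load_questions(path):
    questions = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            raw = json.loads(line)
            item = {ALIASES.get(k, k): v for k, v in raw.items()}
            item.setdefault("a", "")   # empty gold means "expect NOT_FOUND"
            questions.append(item)
    return questions

# e.g., qs = load_questions("docs/questions/TP-MVD8MV2-rotated.jsonl")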


Parsers & Output Conventions

Place each parser’s output under docs/md/<parser>/<doc_id>/. Both multi-file (per page) and single-file are supported. The reader also recognizes internal-tool names like livex_<doc_id>_markdown.md.
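
A minimal sketch of reading either layout, assuming per-page files sort lexically (001.md, 002.md, …); the real reader is more permissive (e.g., it also matches livex_<doc_id>_markdown.md).

# Sketch: concatenate one parser's output for one doc (per-page files or a single full.md).
from pathlib import Path

def read_markdown(parser, doc_id, root="docs/md"):
    doc_dir = Path(root) / parser / doc_id
    pages = sorted(doc_dir.glob("*.md"))        # per-page files or a single full.md
    if not pages:
        raise FileNotFoundError(f"No Markdown found under {doc_dir}")
    return "\n\n".join(p.read_text(encoding="utf-8") for p in pages)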

Tips
  • If a parser emits tables as images, retrieval degrades (no cells to index). Prefer Markdown/HTML tables.
  • Keep figure captions and references as text near images for better grounding.
  • Normalize headings (#, ##, ###) and math ($...$, $$...$$) to ease downstream parsing.


Publishing with GitHub Pages

  • Settings → Pages → Source: “Deploy from a branch” → Branch: main → folder: /docs
  • After pushing to docs/, your site updates automatically:
    – docs/pdfs/: PDFs and generated reports
    – docs/questions/: JSONL question sets
    – docs/md/: parser outputs
    – docs/index.html: homepage (lists datasets/parsers)


Optional: Build a PDF Report via LaTeX + CI

Place LaTeX here:

report/main.tex
report/assets/...

Example CI (save as .github/workflows/latex.yml); it compiles on push and publishes the PDF to docs/pdfs/evaluation-report.pdf:

name: build-report
on:
  push:
    paths:
      - "report/**"
      - ".github/workflows/latex.yml"
permissions:
  contents: write   # lets the workflow commit and push the generated PDF
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: xu-cheng/latex-action@v3
        with:
          root_file: main.tex
          working_directory: report
      - name: Publish PDF into docs/pdfs
        run: |
          mkdir -p docs/pdfs
          cp report/main.pdf docs/pdfs/evaluation-report.pdf
      - name: Commit report
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add docs/pdfs/evaluation-report.pdf
          git commit -m "ci: publish evaluation report" || echo "No changes"
          git push

Roadmap

  • Side-by-side PDF vs. Markdown diff/preview in the site
  • Built-in retrieval test harness (simple configs per parser)
  • Question harvesting helpers (tables → value Qs; figure → caption Qs)
  • Weighted scoring profiles (table-heavy vs. math-heavy docs)
  • Dataset metadata cards to drive the homepage automatically


FAQ

Do I need one Markdown file per page? No. Per-page is helpful for debugging; single-file also works.

Why do QA results differ across parsers for the same PDF? Structural losses (tables as images, headings flattened, math dropped) harm retrieval and precision.

Which models are used? Default: gpt-4.1 to answer (context-only, can return NOT_FOUND), o3-mini to judge. Both are pluggable via CLI flags.

Can I change the 0.75 threshold? Yes (--ok-threshold). We chose 0.75 after sweeps for a “clear majority of facts” bar that keeps false accepts low.


Contributing

  • Add datasets via “Add a New PDF Dataset”.
  • Add parser outputs under docs/md/<parser>/<doc_id>/.
  • PRs/issues welcome for homepage, metrics, pipeline, or CI improvements.


License

  • Code & configs: MIT (LICENSE)
  • Datasets & generated content: use an appropriate content license (e.g., CC BY 4.0) depending on the source material.


Citation

@misc{pdf_to_markdown_eval_2025,
  title        = {PDF→Markdown Evaluation},
  author       = {Repository Contributors},
  year         = {2025},
  howpublished = {https://github.com/jiyouhai/pdf-to-markdown}
}

Tip: Keep docs/ as the public snapshot; keep experimental runs in runs_rubric/. Small, auditable, reproducible.

