bench/ - PDF processing pipeline evaluation set

This directory is the canonical test set for evaluating the end-to-end PDF processing pipeline (layout → OCR → markdown / structured text). It bundles two complementary, pre-sampled subsets so that runs are reproducible and cheap to iterate on.

Subset             PDFs  Source benchmark  Focus
olmocr_bench_50/   50    olmOCR-bench      Fine-grained unit tests on text presence / absence, reading order, tables, math
omnidocbench_100/  100   OmniDocBench      Holistic document-level eval with layout / language / special-issue coverage

Total footprint: ~108 MB, 150 PDFs.

Subset details

olmocr_bench_50/

Stratified sample drawn from the 1,403-PDF olmOCR-bench with the script scripts/sample_olmocr_subset.py (seed 20260411). Covers all 7 document sources with a minimum floor of 3 PDFs per category plus largest-remainder proportional allocation, and diversifies by source document inside each category (at most one page per arXiv paper / scan ID before any repeat).

olmocr_bench_50/
├── pdfs/
│   ├── arxiv_math/         (14)
│   ├── headers_footers/    (8)
│   ├── long_tiny_text/     (4)
│   ├── multi_column/       (8)
│   ├── old_scans/          (5)
│   ├── old_scans_math/     (4)
│   └── tables/             (7)
├── subset_tests.jsonl      # 283 olmOCR-bench unit tests for these 50 PDFs
└── subset_manifest.json    # seed, quotas, selected file list, source bench_dir
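
The quota rule described above (a floor of 3 PDFs per category, then largest-remainder proportional allocation) can be sketched as follows. This is an illustration of the general technique, not the code in scripts/sample_olmocr_subset.py; the function name `allocate_quotas`, its parameters, and the tie-break rule are assumptions, and the category sizes in the usage note are made up.

```python
def allocate_quotas(sizes, target, min_floor=3):
    """Split `target` slots across categories.

    Assumed scheme: every category first gets `min_floor` slots (or its full
    size if smaller); the remaining slots are shared in proportion to
    category size using the largest-remainder method.
    Note: this sketch does not cap a quota at the category size.
    """
    quotas = {cat: min(min_floor, n) for cat, n in sizes.items()}
    remaining = target - sum(quotas.values())
    if remaining < 0:
        raise ValueError("target is too small for the per-category floor")

    total = sum(sizes.values())
    # Integer arithmetic: (whole slots, remainder) of remaining * n / total.
    shares = {cat: divmod(remaining * n, total) for cat, n in sizes.items()}
    for cat, (whole, _) in shares.items():
        quotas[cat] += whole

    # Hand out the slots still unassigned to the largest remainders;
    # ties go to the larger category (an arbitrary choice for this sketch).
    leftover = target - sum(quotas.values())
    ranked = sorted(sizes, key=lambda cat: (shares[cat][1], sizes[cat]), reverse=True)
    for cat in ranked[:leftover]:
        quotas[cat] += 1
    return quotas
```

For example, with hypothetical category sizes `{"a": 70, "b": 20, "c": 10}` and `target=15`, each category starts at 3, the 6 remaining slots split 4 / 1 / 0 by whole shares, and the last slot goes to `c`, which has the largest remainder.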

The subset_tests.jsonl file is a filtered copy of the original per-category *.jsonl test files merged into one; each row keeps the exact schema used by the upstream olmOCR-bench evaluator (pdf, type, max_diffs, checked, and type-specific fields like math, cell, before/after, …).

Regenerate or resize:

python3 scripts/sample_olmocr_subset.py --target 50             # default → bench/olmocr_bench_50
python3 scripts/sample_olmocr_subset.py --target 100 --seed 42  # alt subset
python3 scripts/sample_olmocr_subset.py --dry-run               # plan only

omnidocbench_100/

Pre-built 100-PDF subset of OmniDocBench v2 with full stratified coverage across every categorical axis in the upstream dataset.

omnidocbench_100/
├── pdfs/                   # 100 single-page PDFs
├── img/                    # matching rendered JPGs (1 per PDF)
├── subset_100.json         # full OmniDocBench annotations for the 100 samples
├── subset_100_stats.json   # coverage & distribution stats vs. full 981-doc set
├── subset_100_pdfs.txt     # flat list of selected PDF filenames
└── subset_100_images.txt   # flat list of selected image filenames

Coverage (from subset_100_stats.json), with every bucket of every axis hit:

  • data_source 9/9 · language 3/3 · layout 5/5
  • special_issue 13/13 · stratum 67/67
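
Ratios such as 9/9 are "buckets hit / buckets present in the full set" for one axis. A minimal sketch of that computation is shown below. It assumes each document is a flat dict keyed by axis name, which is a simplification and not the actual schema of subset_100.json or subset_100_stats.json; the function name `axis_coverage` is hypothetical.

```python
def axis_coverage(full, subset, axes):
    """Return {axis: (buckets_hit, buckets_total)} for each axis.

    `full` and `subset` are lists of flat dicts such as
    {"language": "en", "layout": "single"} (assumed record shape).
    """
    coverage = {}
    for axis in axes:
        all_buckets = {record[axis] for record in full}
        hit = {record[axis] for record in subset} & all_buckets
        coverage[axis] = (len(hit), len(all_buckets))
    return coverage
```

Full coverage on an axis means the two numbers in its tuple are equal.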

Using the bench

These two subsets are intended to be run as a pair: olmOCR-bench gives you sharp per-feature pass/fail signals and OmniDocBench gives you an aggregate quality score across real-world document types. For each new pipeline version, run both subsets, record per-subset metrics, and diff against the previous run.
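
The "diff against the previous run" step can be as small as a per-metric subtraction. The sketch below assumes each run is stored as a dict of metric name to number; the metric names in the usage note are placeholders, since the evaluator's output format is not defined here.

```python
def diff_metrics(previous, current):
    """Return current - previous for every metric present in both runs."""
    shared = sorted(previous.keys() & current.keys())
    return {name: current[name] - previous[name] for name in shared}


def missing_metrics(previous, current):
    """Return metric names that appear in only one of the two runs."""
    return sorted(previous.keys() ^ current.keys())
```

For instance, `diff_metrics({"pass_rate": 0.5}, {"pass_rate": 0.75})` gives `{"pass_rate": 0.25}`, where `pass_rate` is a made-up metric name. Checking `missing_metrics` first helps avoid silently comparing runs that recorded different metrics.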

Common entry points (to be wired up by the pipeline evaluator):

bench/olmocr_bench_50/pdfs/**/*.pdf      # inputs
bench/olmocr_bench_50/subset_tests.jsonl # ground truth unit tests

bench/omnidocbench_100/pdfs/*.pdf        # inputs
bench/omnidocbench_100/subset_100.json   # ground truth annotations
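
Until the evaluator is wired up, the input globs above can be resolved with `pathlib`. This is a sketch under the directory layout shown in this README; the function name `collect_inputs` is hypothetical.

```python
from pathlib import Path


def collect_inputs(bench_root):
    """Return sorted PDF paths for both subsets under `bench_root`."""
    root = Path(bench_root)
    return {
        # olmOCR-bench PDFs sit in per-category subfolders, so search recursively.
        "olmocr_bench_50": sorted((root / "olmocr_bench_50" / "pdfs").rglob("*.pdf")),
        # OmniDocBench PDFs sit directly under pdfs/.
        "omnidocbench_100": sorted((root / "omnidocbench_100" / "pdfs").glob("*.pdf")),
    }
```

Calling `collect_inputs("bench")` from the repository root should yield 50 and 100 paths respectively if the directory matches the counts listed above.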

Do not manually edit files under bench/. Regenerate with the sampling script (for olmocr) or re-export from the upstream builder (for omnidoc) so results stay reproducible.