bench/ - PDF processing pipeline evaluation set
This directory is the canonical test set for evaluating the end-to-end PDF processing pipeline (layout → OCR → markdown / structured text). It bundles two complementary, pre-sampled subsets so that runs are reproducible and cheap to iterate on.
| Subset | PDFs | Source benchmark | Focus |
|---|---|---|---|
| olmocr_bench_50/ | 50 | olmOCR-bench | Fine-grained unit tests on text presence / absence, reading order, tables, math |
| omnidocbench_100/ | 100 | OmniDocBench | Holistic document-level eval with layout / language / special-issue coverage |
Total footprint: ~108 MB, 150 PDFs.
Subset details
olmocr_bench_50/
Stratified sample drawn from the 1,403-PDF olmOCR-bench with the script
scripts/sample_olmocr_subset.py (seed 20260411). Covers all 7 document
sources with a minimum floor of 3 PDFs per category plus largest-remainder
proportional allocation, and diversifies by source document inside each
category (at most one page per arXiv paper / scan ID before any repeat).
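The quota step described above (a per-category floor, then largest-remainder proportional allocation) can be sketched as follows. This is an illustrative sketch, not the code in scripts/sample_olmocr_subset.py: the function name `allocate_quotas`, its parameters, and the choice to make shares proportional to the pages left in each category after the floor are all assumptions.

```python
from math import floor


def allocate_quotas(category_sizes, target, min_floor=3):
    """Hypothetical helper: floor + largest-remainder allocation.

    category_sizes maps category name -> number of available PDFs.
    Returns a dict mapping category name -> number of PDFs to sample.
    """
    # Each category first receives the floor, capped by what it contains.
    quotas = {c: min(min_floor, n) for c, n in category_sizes.items()}
    remaining = target - sum(quotas.values())
    if remaining < 0:
        raise ValueError("target is too small for the requested floor")

    # Assumption: the rest is shared in proportion to remaining capacity.
    capacity = {c: category_sizes[c] - quotas[c] for c in category_sizes}
    total_capacity = sum(capacity.values())
    if remaining > total_capacity:
        raise ValueError("target exceeds the number of available PDFs")
    if remaining == 0:
        return quotas

    shares = {c: remaining * capacity[c] / total_capacity for c in capacity}
    for c, share in shares.items():
        quotas[c] += floor(share)

    # Seats still unassigned go to the largest fractional remainders;
    # ties are broken by category name so the result is deterministic.
    leftover = target - sum(quotas.values())
    by_remainder = sorted(shares, key=lambda c: (-(shares[c] - floor(shares[c])), c))
    for c in by_remainder[:leftover]:
        quotas[c] += 1
    return quotas
```

With toy sizes such as `{"a": 60, "b": 30, "c": 10}` and a target of 20, every category gets at least 3 and the quotas sum to exactly 20. The per-category document diversification (one page per arXiv paper or scan ID before any repeat) would happen afterwards, when picking the actual files for each quota.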
olmocr_bench_50/
├── pdfs/
│   ├── arxiv_math/        (14)
│   ├── headers_footers/   (8)
│   ├── long_tiny_text/    (4)
│   ├── multi_column/      (8)
│   ├── old_scans/         (5)
│   ├── old_scans_math/    (4)
│   └── tables/            (7)
├── subset_tests.jsonl     # 283 olmOCR-bench unit tests for these 50 PDFs
└── subset_manifest.json   # seed, quotas, selected file list, source bench_dir
The subset_tests.jsonl file is a filtered copy of the original per-category
*.jsonl test files merged into one; each row keeps the exact schema used by
the upstream olmOCR-bench evaluator (pdf, type, max_diffs, checked,
and type-specific fields like math, cell, before/after, …).
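Since the merged file is plain JSON Lines, it can be inspected without the upstream evaluator. The sketch below relies only on the `pdf` and `type` fields named above; the helper names `load_tests`, `group_by_pdf`, and `count_by_type` are hypothetical and not part of olmOCR-bench.

```python
import json
from collections import Counter, defaultdict
from pathlib import Path


def load_tests(jsonl_path):
    """Read one JSON object per non-empty line."""
    tests = []
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                tests.append(json.loads(line))
    return tests


def group_by_pdf(tests):
    """Map each value of the `pdf` field to the list of its unit tests."""
    grouped = defaultdict(list)
    for test in tests:
        grouped[test["pdf"]].append(test)
    return dict(grouped)


def count_by_type(tests):
    """Count how many unit tests exist for each value of the `type` field."""
    return Counter(test["type"] for test in tests)
```

A quick sanity check after regenerating the subset might be `len(load_tests("bench/olmocr_bench_50/subset_tests.jsonl"))`, which should match the test count noted in the directory listing.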
Regenerate or resize:
python3 scripts/sample_olmocr_subset.py --target 50 # default → bench/olmocr_bench_50
python3 scripts/sample_olmocr_subset.py --target 100 --seed 42 # alt subset
python3 scripts/sample_olmocr_subset.py --dry-run # plan only
omnidocbench_100/
Pre-built 100-PDF subset of OmniDocBench v2 with full stratified coverage across every categorical axis in the upstream dataset.
omnidocbench_100/
├── pdfs/                   # 100 single-page PDFs
├── img/                    # matching rendered JPGs (1 per PDF)
├── subset_100.json         # full OmniDocBench annotations for the 100 samples
├── subset_100_stats.json   # coverage & distribution stats vs. full 981-doc set
├── subset_100_pdfs.txt     # flat list of selected PDF filenames
└── subset_100_images.txt   # flat list of selected image filenames
Coverage (from subset_100_stats.json): every bucket of every axis is hit.
- data_source 9/9 · language 3/3 · layout 5/5
- special_issue 13/13 · stratum 67/67
Using the bench
These two subsets are intended to be run as a pair: olmOCR-bench gives sharp per-feature pass/fail signals, and OmniDocBench gives an aggregate quality score across real-world document types. For each new pipeline version, run both subsets, record per-subset metrics, and diff against the previous run.
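One way to do the run-to-run comparison is sketched below. It assumes each run is stored as a nested dict of the form `{subset: {metric_name: number}}`; that layout, the function name `diff_metrics`, and any metric names you feed it are assumptions, since this directory does not define a results format.

```python
def diff_metrics(previous, current):
    """Compare two runs shaped as {subset: {metric: number}}.

    Returns {(subset, metric): (old, new, delta)}. When a metric is
    missing from one of the runs, delta is None.
    """
    report = {}
    for subset in sorted(set(previous) | set(current)):
        old_metrics = previous.get(subset, {})
        new_metrics = current.get(subset, {})
        for name in sorted(set(old_metrics) | set(new_metrics)):
            old = old_metrics.get(name)
            new = new_metrics.get(name)
            delta = new - old if old is not None and new is not None else None
            report[(subset, name)] = (old, new, delta)
    return report
```

Keeping the two subsets as separate keys avoids averaging a unit-test pass rate with a document-level score, which measure different things.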
Common entry points (to be wired up by the pipeline evaluator):
bench/olmocr_bench_50/pdfs/**/*.pdf # inputs
bench/olmocr_bench_50/subset_tests.jsonl # ground truth unit tests
bench/omnidocbench_100/pdfs/*.pdf # inputs
bench/omnidocbench_100/subset_100.json # ground truth annotations
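Until an evaluator is wired up, the entry points above can be resolved with a few lines of `pathlib`. This is a minimal sketch: the function name `collect_inputs` and the shape of the returned dict are assumptions, while the paths and glob patterns are the ones listed above.

```python
from pathlib import Path


def collect_inputs(bench_root="bench"):
    """Resolve input PDFs and ground-truth files for both subsets."""
    root = Path(bench_root)
    olmocr = root / "olmocr_bench_50"
    omnidoc = root / "omnidocbench_100"
    return {
        "olmocr_bench_50": {
            # PDFs sit one level down, in per-category folders.
            "pdfs": sorted((olmocr / "pdfs").glob("**/*.pdf")),
            "ground_truth": olmocr / "subset_tests.jsonl",
        },
        "omnidocbench_100": {
            # PDFs sit directly under pdfs/.
            "pdfs": sorted((omnidoc / "pdfs").glob("*.pdf")),
            "ground_truth": omnidoc / "subset_100.json",
        },
    }
```

Sorting the matches keeps the processing order stable across machines, which helps when diffing logs between runs.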
Do not manually edit files under bench/. Regenerate with the sampling
script (for olmocr) or re-export from the upstream builder (for omnidoc) so
results stay reproducible.