# bench/ - PDF processing pipeline evaluation set

This directory is the **canonical test set** for evaluating the end-to-end PDF
processing pipeline (layout → OCR → markdown / structured text). It bundles
two complementary, pre-sampled subsets so that runs are reproducible and
cheap to iterate on.

| Subset | PDFs | Source benchmark | Focus |
|---|---:|---|---|
| [`olmocr_bench_50/`](./olmocr_bench_50) | 50 | [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) | Fine-grained unit tests on text presence / absence, reading order, tables, math |
| [`omnidocbench_100/`](./omnidocbench_100) | 100 | [OmniDocBench](https://github.com/opendatalab/OmniDocBench) | Holistic document-level eval with layout / language / special-issue coverage |

Total footprint: ~108 MB, 150 PDFs.

## Subset details

### `olmocr_bench_50/`
Stratified sample drawn from the 1,403-PDF olmOCR-bench with the script
`scripts/sample_olmocr_subset.py` (seed `20260411`). Covers all 7 document
sources with a minimum floor of 3 PDFs per category plus largest-remainder
proportional allocation, and diversifies by source document inside each
category (at most one page per arXiv paper / scan ID before any repeat).

```
olmocr_bench_50/
├── pdfs/
│   ├── arxiv_math/         (14)
│   ├── headers_footers/    (8)
│   ├── long_tiny_text/     (4)
│   ├── multi_column/       (8)
│   ├── old_scans/          (5)
│   ├── old_scans_math/     (4)
│   └── tables/             (7)
├── subset_tests.jsonl      # 283 olmOCR-bench unit tests for these 50 PDFs
└── subset_manifest.json    # seed, quotas, selected file list, source bench_dir
```
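
The quota rule described above (a floor of 3 PDFs per category, then largest-remainder proportional allocation of the remaining slots) can be sketched as follows. This is a minimal illustration, not the code in `scripts/sample_olmocr_subset.py`, which may differ in detail. The function name `allocate_quotas` and the per-category sizes in `ASSUMED_SIZES` are assumptions made for the example; the real sizes should be taken from `subset_manifest.json` or the upstream bench directory.

```python
def allocate_quotas(sizes, target, floor=3):
    """Give every category `floor` slots, then split the remaining slots
    proportionally to category size using the largest-remainder method."""
    if floor * len(sizes) > target:
        raise ValueError("target is too small for the per-category floor")
    remaining = target - floor * len(sizes)
    total = sum(sizes.values())

    quotas, remainders = {}, {}
    for name, size in sizes.items():
        # Integer arithmetic keeps the shares exact (no float rounding).
        whole, rem = divmod(size * remaining, total)
        quotas[name] = floor + whole
        remainders[name] = rem

    # Hand out the slots lost to rounding down, largest remainder first.
    leftover = target - sum(quotas.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[name] += 1
    return quotas


# Assumed per-category sizes (illustrative only, not read from the manifest).
ASSUMED_SIZES = {
    "arxiv_math": 522,
    "headers_footers": 266,
    "long_tiny_text": 62,
    "multi_column": 231,
    "old_scans": 98,
    "old_scans_math": 36,
    "tables": 188,
}

quotas = allocate_quotas(ASSUMED_SIZES, target=50)
for name, count in quotas.items():
    print(f"{name:<18} {count}")
```

With these assumed sizes the sketch yields the same per-category counts as the tree above. It does not cap a quota at the category size, which a real sampler would need to handle for very small categories.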

The `subset_tests.jsonl` file is a filtered copy of the original per-category
`*.jsonl` test files merged into one; each row keeps the exact schema used by
the upstream olmOCR-bench evaluator (`pdf`, `type`, `max_diffs`, `checked`,
and type-specific fields like `math`, `cell`, `before`/`after`, …).
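
As a quick sanity check on the merged file, the rows can be loaded and counted by `type` and by `pdf`. The sketch below relies only on those two fields from the schema; the helper names `load_tests` and `summarize` are hypothetical, and the sample rows are made up for illustration rather than copied from `subset_tests.jsonl`.

```python
import json
from collections import Counter


def load_tests(lines):
    """Parse JSONL rows (one JSON object per line), skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize(tests):
    """Count tests per test type and per PDF."""
    by_type = Counter(test["type"] for test in tests)
    by_pdf = Counter(test["pdf"] for test in tests)
    return by_type, by_pdf


# Made-up rows for illustration; real rows carry more fields.
sample_rows = [
    '{"pdf": "arxiv_math/example_a.pdf", "type": "math", "math": "x^2"}',
    '{"pdf": "tables/example_b.pdf", "type": "table", "cell": "42"}',
    '{"pdf": "tables/example_b.pdf", "type": "order", "before": "A", "after": "B"}',
]

by_type, by_pdf = summarize(load_tests(sample_rows))
print(dict(by_type))
print(dict(by_pdf))

# Against the real file, pass an open file handle instead of the sample list:
#   with open("bench/olmocr_bench_50/subset_tests.jsonl", encoding="utf-8") as fh:
#       by_type, by_pdf = summarize(load_tests(fh))
```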

Regenerate or resize:
```bash
python3 scripts/sample_olmocr_subset.py --target 50             # default → bench/olmocr_bench_50
python3 scripts/sample_olmocr_subset.py --target 100 --seed 42  # alt subset
python3 scripts/sample_olmocr_subset.py --dry-run               # plan only
```

### `omnidocbench_100/`
Pre-built 100-PDF subset of OmniDocBench v2 with full stratified coverage
across every categorical axis in the upstream dataset.

```
omnidocbench_100/
├── pdfs/                   # 100 single-page PDFs
├── img/                    # matching rendered JPGs (1 per PDF)
├── subset_100.json         # full OmniDocBench annotations for the 100 samples
├── subset_100_stats.json   # coverage & distribution stats vs. full 981-doc set
├── subset_100_pdfs.txt     # flat list of selected PDF filenames
└── subset_100_images.txt   # flat list of selected image filenames
```

Coverage from `subset_100_stats.json` (every bucket of every axis is hit):
- **data_source** 9/9 · **language** 3/3 · **layout** 5/5
- **special_issue** 13/13 · **stratum** 67/67

## Using the bench

These two subsets are intended to be run as a pair: olmOCR-bench gives you
sharp per-feature pass/fail signals and OmniDocBench gives you an aggregate
quality score across real-world document types. For each new pipeline
version, run both subsets, record per-subset metrics, and diff against the
previous run.
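
One possible shape for the "record and diff" step is shown below. It is only a sketch: the metric names and values are placeholders, not results from any real run, and `diff_metrics` is a hypothetical helper rather than part of the pipeline evaluator.

```python
def diff_metrics(previous, current, ndigits=4):
    """Return {metric: (previous, current, delta)} for metrics present in both runs."""
    shared = sorted(previous.keys() & current.keys())
    return {
        name: (previous[name], current[name], round(current[name] - previous[name], ndigits))
        for name in shared
    }


# Placeholder numbers for two hypothetical runs.
previous_run = {"olmocr_pass_rate": 0.71, "omnidoc_overall": 0.80}
current_run = {"olmocr_pass_rate": 0.74, "omnidoc_overall": 0.79}

report = diff_metrics(previous_run, current_run)
for name, (old, new, delta) in report.items():
    print(f"{name:<18} {old:.2f} -> {new:.2f} ({delta:+.2f})")
```

Reporting only metrics present in both runs keeps the diff meaningful when a new pipeline version adds or drops a metric.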

Common entry points (to be wired up by the pipeline evaluator):

```text
bench/olmocr_bench_50/pdfs/**/*.pdf      # inputs
bench/olmocr_bench_50/subset_tests.jsonl # ground truth unit tests

bench/omnidocbench_100/pdfs/*.pdf        # inputs
bench/omnidocbench_100/subset_100.json   # ground truth annotations
```
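
Until the evaluator is wired up, the input PDFs can be enumerated with the same glob patterns. This is a minimal sketch using only `pathlib`; `collect_inputs` is a hypothetical helper name, and the `"bench"` argument assumes the script runs from the repository root.

```python
from pathlib import Path


def collect_inputs(bench_root):
    """Return the sorted input PDFs of each subset, keyed by subset directory name."""
    root = Path(bench_root)
    return {
        # olmOCR PDFs sit in per-category subdirectories, hence the recursive glob.
        "olmocr_bench_50": sorted((root / "olmocr_bench_50" / "pdfs").glob("**/*.pdf")),
        # OmniDocBench PDFs sit directly under pdfs/.
        "omnidocbench_100": sorted((root / "omnidocbench_100" / "pdfs").glob("*.pdf")),
    }


for subset, pdfs in collect_inputs("bench").items():
    print(f"{subset}: {len(pdfs)} PDFs")
```

Sorting the paths makes the iteration order deterministic across platforms, which helps when comparing runs.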

Do **not** manually edit files under `bench/`. Regenerate with the sampling
script (for olmocr) or re-export from the upstream builder (for omnidoc) so
results stay reproducible.