---
title: Apparatus Ocr
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: gpl-3.0
short_description: Benchmarking OCR of critical text editions
sdk_version: 5.43.1
tags:
- leaderboard
---
# OCR leaderboard
This Space hosts a two-level OCR benchmark on a single critical-edition page: an easy task on pre-cropped regions and a hard task on the full page.
Inputs and gold outputs live under `data/lloyd-jones-soph-170/`:
- `png/lloyd-jones-fullpage.png`: hard task input
- `png/lloyd-jones-text.png`: easy task text crop
- `png/lloyd-jones-apparatus.png`: easy task apparatus crop
- `ocr/lloyd-jones-text.json`: gold main-text output
- `ocr/lloyd-jones-apparatus.json`: gold apparatus output
The leaderboard expects result files in the following format:
```json
{
  "config": {
    "model_dtype": "torch.float16",
    "model_name": "org/model",
    "model_sha": "main"
  },
  "results": {
    "easy_levenshtein": {
      "score": 91.23
    },
    "easy_bleu": {
      "score": 84.56
    },
    "hard_levenshtein": {
      "score": 79.10
    },
    "hard_bleu": {
      "score": 70.42
    }
  }
}
```
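A quick way to sanity-check a result file before submitting is to verify it against this schema. The following is a minimal sketch; `validate_result` is a hypothetical helper, not part of the Space's source tree:

```python
# Hypothetical validator for leaderboard result files (not part of src/).
REQUIRED_METRICS = {"easy_levenshtein", "easy_bleu", "hard_levenshtein", "hard_bleu"}


def validate_result(payload: dict) -> bool:
    """Return True if the payload matches the expected result-file schema."""
    config = payload.get("config", {})
    results = payload.get("results", {})
    has_config = {"model_dtype", "model_name", "model_sha"} <= config.keys()
    has_scores = REQUIRED_METRICS <= results.keys() and all(
        isinstance(results[m].get("score"), (int, float)) for m in REQUIRED_METRICS
    )
    return has_config and has_scores


example = {
    "config": {"model_dtype": "torch.float16", "model_name": "org/model", "model_sha": "main"},
    "results": {m: {"score": 0.0} for m in REQUIRED_METRICS},
}
print(validate_result(example))  # True
```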
The Space is local-first:
- If HF backend datasets are configured via env vars, it will sync from them.
- Otherwise it reads seeded queue/results data from `data/leaderboard/`.
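The selection logic amounts to an environment-variable check; the sketch below illustrates the idea, with the env var names chosen for illustration (the actual names may differ):

```python
# Illustrative local-first source selection; env var names are assumptions.
import os


def pick_data_source() -> str:
    """Prefer configured HF backend datasets, else the seeded local copy."""
    if os.environ.get("RESULTS_DATASET") and os.environ.get("QUEUE_DATASET"):
        return "hub"  # sync queue/results from the configured HF datasets
    return "data/leaderboard"  # fall back to the seeded local data
```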
Useful files:
- `src/about.py`: task definitions and benchmark copy
- `src/evaluation/metrics.py`: local OCR metric helpers
- `src/evaluation/build_result.py`: CLI to turn predicted OCR JSON files into a leaderboard result JSON
- `src/evaluation/run_granite_pipeline.py`: end-to-end Granite Vision runner for the benchmark images
- `src/leaderboard/read_evals.py`: result ingestion
- `src/populate.py`: leaderboard and queue dataframe assembly
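For intuition on the Levenshtein metric, here is a sketch of a normalized edit-distance score on the leaderboard's 0–100 scale. The formula `100 * (1 - distance / max_length)` is an assumption about how `metrics.py` normalizes, not a copy of its code:

```python
# Sketch of a normalized Levenshtein score (normalization formula assumed).
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_score(pred: str, gold: str) -> float:
    """Similarity on a 0-100 scale, higher is better."""
    if not pred and not gold:
        return 100.0
    return 100.0 * (1 - levenshtein(pred, gold) / max(len(pred), len(gold)))
```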
Example:
```bash
python -m src.evaluation.build_result \
--model-name ibm-granite/granite-vision-3.3-2b \
--easy-text path/to/easy-text.json \
--easy-apparatus path/to/easy-apparatus.json \
--hard-text path/to/hard-text.json \
--hard-apparatus path/to/hard-apparatus.json \
--output data/leaderboard/results/ibm-granite/results_2026-03-28T00-00-00Z.json
```
To run the first baseline model directly:
```bash
python -m src.evaluation.run_granite_pipeline \
--model-name ibm-granite/granite-vision-3.3-2b \
--output-dir data/leaderboard/runs/granite-vision-3.3-2b
```
That command writes:
- predicted OCR JSON files for easy and hard tasks
- raw model responses for debugging
- `result.json` in leaderboard format
- `summary.json` with the four benchmark scores
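A run's scores can then be inspected programmatically. This sketch assumes `summary.json` maps metric names to scores, which is not confirmed by the pipeline's docs:

```python
# Read the benchmark scores from a finished run directory
# (summary.json layout -- metric name -> score -- is an assumption).
import json
from pathlib import Path


def load_summary(run_dir: str) -> dict:
    """Load the scores written by run_granite_pipeline."""
    return json.loads((Path(run_dir) / "summary.json").read_text())
```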