---
title: Apparatus Ocr
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: gpl-3.0
short_description: 'Benchmarking OCR of critical text editions'
sdk_version: 5.43.1
tags:
  - leaderboard
---

# OCR leaderboard

This Space is customized for a two-level OCR benchmark on a single critical-edition page. Inputs and gold outputs live under `data/lloyd-jones-soph-170/`:

- `png/lloyd-jones-fullpage.png`: hard-task input
- `png/lloyd-jones-text.png`: easy-task text crop
- `png/lloyd-jones-apparatus.png`: easy-task apparatus crop
- `ocr/lloyd-jones-text.json`: gold main-text output
- `ocr/lloyd-jones-apparatus.json`: gold apparatus output

The leaderboard expects result files in the following format:

```json
{
  "config": {
    "model_dtype": "torch.float16",
    "model_name": "org/model",
    "model_sha": "main"
  },
  "results": {
    "easy_levenshtein": { "score": 91.23 },
    "easy_bleu": { "score": 84.56 },
    "hard_levenshtein": { "score": 79.10 },
    "hard_bleu": { "score": 70.42 }
  }
}
```

The Space is local-first:

- If HF backend datasets are configured via env vars, it will sync from them.
- Otherwise it reads seeded queue/results data from `data/leaderboard/`.
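The actual ingestion logic lives in `src/leaderboard/read_evals.py`; as a rough sketch of what a well-formed result file must contain (field and metric names taken from the example above, everything else assumed), a standalone checker might look like:

```python
"""Sketch of a result-file validator; not the Space's real ingestion code."""

EXPECTED_METRICS = {"easy_levenshtein", "easy_bleu", "hard_levenshtein", "hard_bleu"}
EXPECTED_CONFIG_KEYS = ("model_dtype", "model_name", "model_sha")


def validate_result(data: dict) -> dict:
    """Raise ValueError if `data` does not match the expected result schema."""
    config = data["config"]
    for key in EXPECTED_CONFIG_KEYS:
        if key not in config:
            raise ValueError(f"config is missing {key!r}")

    results = data["results"]
    missing = EXPECTED_METRICS - results.keys()
    if missing:
        raise ValueError(f"results is missing metrics: {sorted(missing)}")

    for name, entry in results.items():
        if not isinstance(entry.get("score"), (int, float)):
            raise ValueError(f"{name}: 'score' must be numeric")
    return data
```

Running such a check before dropping a file into `data/leaderboard/results/` catches schema mistakes before they surface as missing leaderboard rows.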
Useful files:

- `src/about.py`: task definitions and benchmark copy
- `src/evaluation/metrics.py`: local OCR metric helpers
- `src/evaluation/build_result.py`: CLI that turns predicted OCR JSON files into a leaderboard result JSON
- `src/evaluation/run_granite_pipeline.py`: end-to-end Granite Vision runner for the benchmark images
- `src/leaderboard/read_evals.py`: result ingestion
- `src/populate.py`: leaderboard and queue dataframe assembly

Example:

```bash
python -m src.evaluation.build_result \
  --model-name ibm-granite/granite-vision-3.3-2b \
  --easy-text path/to/easy-text.json \
  --easy-apparatus path/to/easy-apparatus.json \
  --hard-text path/to/hard-text.json \
  --hard-apparatus path/to/hard-apparatus.json \
  --output data/leaderboard/results/ibm-granite/results_2026-03-28T00-00-00Z.json
```

To run the first baseline model directly:

```bash
python -m src.evaluation.run_granite_pipeline \
  --model-name ibm-granite/granite-vision-3.3-2b \
  --output-dir data/leaderboard/runs/granite-vision-3.3-2b
```

That command writes:

- predicted OCR JSON files for the easy and hard tasks
- raw model responses for debugging
- `result.json` in leaderboard format
- `summary.json` with the four benchmark scores
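The canonical metric implementations are in `src/evaluation/metrics.py`; the exact normalization used there is not shown here, so the following is only a plausible sketch of a Levenshtein-based score on a 0–100 scale (similarity = 100 · (1 − edit distance / longer length)):

```python
"""Sketch of a normalized Levenshtein score; the repo's metrics.py may differ."""


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def levenshtein_score(pred: str, gold: str) -> float:
    """Similarity in [0, 100]; 100 means an exact match."""
    if not pred and not gold:
        return 100.0
    dist = levenshtein(pred, gold)
    return 100.0 * (1.0 - dist / max(len(pred), len(gold)))
```

Comparing a predicted OCR transcription against the gold JSON text with a function like this reproduces the shape of the `*_levenshtein` entries in the result file.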