---
title: Apparatus Ocr
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: gpl-3.0
short_description: Benchmarking OCR of critical text editions
sdk_version: 5.43.1
tags:
- leaderboard
---

# OCR leaderboard

This Space is customized for a two-level OCR benchmark on a single critical-edition page.

Inputs and gold outputs live under `data/lloyd-jones-soph-170/`:
- `png/lloyd-jones-fullpage.png`: hard-task input (full page)
- `png/lloyd-jones-text.png`: easy-task main-text crop
- `png/lloyd-jones-apparatus.png`: easy-task apparatus crop
- `ocr/lloyd-jones-text.json`: gold main-text output
- `ocr/lloyd-jones-apparatus.json`: gold apparatus output

The leaderboard expects result files in the following format:
```json
{
    "config": {
        "model_dtype": "torch.float16",
        "model_name": "org/model",
        "model_sha": "main"
    },
    "results": {
        "easy_levenshtein": {
            "score": 91.23
        },
        "easy_bleu": {
            "score": 84.56
        },
        "hard_levenshtein": {
            "score": 79.10
        },
        "hard_bleu": {
            "score": 70.42
        }
    }
}
```
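A quick way to sanity-check a result file before submitting is to validate it against the shape above. The sketch below is illustrative only (the Space's real ingestion lives in `src/leaderboard/read_evals.py`); the key names come from the example JSON:

```python
# Minimal validator for the leaderboard result format shown above.
REQUIRED_CONFIG_KEYS = {"model_dtype", "model_name", "model_sha"}
REQUIRED_RESULT_KEYS = {"easy_levenshtein", "easy_bleu", "hard_levenshtein", "hard_bleu"}

def validate_result(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    config = payload.get("config", {})
    for key in sorted(REQUIRED_CONFIG_KEYS - config.keys()):
        problems.append(f"config missing key: {key}")
    results = payload.get("results", {})
    for key in sorted(REQUIRED_RESULT_KEYS - results.keys()):
        problems.append(f"results missing metric: {key}")
    for key in sorted(REQUIRED_RESULT_KEYS & results.keys()):
        if not isinstance(results[key].get("score"), (int, float)):
            problems.append(f"{key}.score is not numeric")
    return problems
```

Feeding it the example payload above returns an empty list; a payload with missing metrics returns one message per problem.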

The Space is local-first:
- If HF backend datasets are configured via env vars, it will sync from them.
- Otherwise it reads seeded queue/results data from `data/leaderboard/`.
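The fallback logic amounts to a simple check. The env-var name below is hypothetical, purely to illustrate the pattern; the real names are defined in the Space's configuration:

```python
import os

def results_source() -> str:
    """Pick the data source: HF Hub if a backend dataset is configured, else local seed data.
    RESULTS_DATASET is a placeholder name for illustration only."""
    if os.environ.get("RESULTS_DATASET"):
        return "hub"
    return "data/leaderboard/"
```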

Useful files:
- `src/about.py`: task definitions and benchmark copy
- `src/evaluation/metrics.py`: local OCR metric helpers
- `src/evaluation/build_result.py`: CLI to turn predicted OCR JSON files into a leaderboard result JSON
- `src/evaluation/run_granite_pipeline.py`: end-to-end Granite Vision runner for the benchmark images
- `src/leaderboard/read_evals.py`: result ingestion
- `src/populate.py`: leaderboard and queue dataframe assembly
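As a rough idea of what the metric helpers compute, here is a self-contained sketch of a normalized edit-distance score scaled to 0–100, which is how the `*_levenshtein` numbers above read. This is an assumption about the scoring convention, not a copy of `src/evaluation/metrics.py`:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_score(pred: str, gold: str) -> float:
    """Edit distance normalized by the longer string, as a 0-100 similarity score."""
    if not pred and not gold:
        return 100.0
    return 100.0 * (1 - levenshtein(pred, gold) / max(len(pred), len(gold)))
```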

Example:
```bash
python -m src.evaluation.build_result \
  --model-name ibm-granite/granite-vision-3.3-2b \
  --easy-text path/to/easy-text.json \
  --easy-apparatus path/to/easy-apparatus.json \
  --hard-text path/to/hard-text.json \
  --hard-apparatus path/to/hard-apparatus.json \
  --output data/leaderboard/results/ibm-granite/results_2026-03-28T00-00-00Z.json
```

To run the first baseline model directly:
```bash
python -m src.evaluation.run_granite_pipeline \
  --model-name ibm-granite/granite-vision-3.3-2b \
  --output-dir data/leaderboard/runs/granite-vision-3.3-2b
```

That command writes:
- predicted OCR JSON files for easy and hard tasks
- raw model responses for debugging
- `result.json` in leaderboard format
- `summary.json` with the four benchmark scores