# Eval pipeline (OpenRouter judge)

A self-contained evaluation pipeline for LFM2.5-VL structured-extraction
models. Extraction runs on your local GPU (vLLM/HF); the VLM judge runs
remotely via the [OpenRouter](https://openrouter.ai/) API, so there is no need to
host a 30+ GB vision judge yourself.

## Pipeline

```
WDS tars  ─▶  Extraction (local GPU)  ─▶  predictions
                                              │
              structural metrics  ◀───────────┤
              (json validity, key P/R/F1)     │
                                              │
              VLM judge (OpenRouter)  ◀───────┘
                            │
                            ▼
                       eval_result.json
```

Three primary metrics per run: `json_validity_rate`, `key_f1_macro`,
`vlm_judge_score_avg` (per-key precision / recall also reported as
diagnostic byproducts of F1).
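
As a rough illustration of the structural metrics, the sketch below computes JSON validity and macro-averaged key precision, recall, and F1 from raw prediction strings. Precision follows the definition used in the Output section (shared keys divided by predicted keys). The function names, the choice to score unparseable predictions as zero, and the handling of empty key sets are assumptions made for this example and may differ from what `run_eval.py` actually does.

```python
import json


def key_prf(pred_keys, gt_keys):
    """Precision, recall and F1 over the key sets of one sample."""
    pred_keys, gt_keys = set(pred_keys), set(gt_keys)
    shared = len(pred_keys & gt_keys)
    precision = shared / len(pred_keys) if pred_keys else 0.0
    recall = shared / len(gt_keys) if gt_keys else 0.0
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return precision, recall, f1


def structural_metrics(raw_predictions, ground_truths):
    """Macro-average per-sample scores (assumption: invalid JSON scores 0)."""
    valid = 0
    rows = []
    for raw, gt in zip(raw_predictions, ground_truths):
        try:
            pred = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            pred = None
        if isinstance(pred, dict):
            valid += 1
            rows.append(key_prf(pred.keys(), gt.keys()))
        else:
            rows.append((0.0, 0.0, 0.0))
    n = len(rows)
    if n == 0:
        return {"json_validity_rate": 0.0, "key_precision_macro": 0.0,
                "key_recall_macro": 0.0, "key_f1_macro": 0.0}
    return {
        "json_validity_rate": valid / n,
        "key_precision_macro": sum(r[0] for r in rows) / n,
        "key_recall_macro": sum(r[1] for r in rows) / n,
        "key_f1_macro": sum(r[2] for r in rows) / n,
    }
```

Macro averaging here means each sample contributes equally, regardless of how many keys its schema has.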

## Files

```
.
├── README.md
├── requirements.txt
├── run_eval.sh       ← entry script (env vars + python call)
├── run_eval.py       ← CLI + orchestrator + metrics aggregation
├── extract.py        ← WDS loader + vLLM/HF extraction + JSON parsing
├── judge.py          ← OpenRouter async VLM judging
├── prompts/          ← 2 prompt templates (.txt)
└── eval_data/        ← shipped 2000-sample eval set (single WDS tar)
```

Three Python files total. No nested packages, no `pyproject.toml`,
no `pip install -e .`, just `pip install -r requirements.txt`.

---

## Setup

### 1. Python environment

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

`pip install` will pull `vllm`, `torch`, `transformers`, `peft`,
`webdataset`, `pillow`, `openai`, `tqdm`, `numpy` (~5 GB total), which takes
5–15 min depending on the network.

> **Mac / no NVIDIA GPU?** vLLM won't install. Either drop the `vllm`
> line from `requirements.txt`, or install everything else manually and
> run with `--extraction-backend hf` (forces the HF transformers path).

### 2. OpenRouter API key

Get a key from https://openrouter.ai/keys, then add it to your `~/.bashrc`:

```bash
export OPENROUTER_API_KEY=sk-or-v1-...
```

Then `source ~/.bashrc` (or open a new shell).
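
To confirm the key is visible to Python before starting a long run, a small check like the following can help. It is only a sketch: the helper name `get_openrouter_key` is hypothetical and not part of this repository, and the pipeline may validate the key differently.

```python
import os


def get_openrouter_key(env=os.environ):
    """Return the OpenRouter key, or raise with a hint on how to set it."""
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; export it in your shell "
            "before running the judge."
        )
    return key
```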

---

## Run

### Quick start

```bash
bash run_eval.sh
```

Defaults:
- Evaluates `LiquidAI/LFM2.5-VL-450M-Extract` on `./eval_data/`
- Runs the full **2000 samples** (~30 min)
- VLM judge: `qwen/qwen3.5-35b-a3b`
- Writes results to `./eval_result.json` and log to `./eval_run.log`

### Tweaking knobs

Open `run_eval.sh`: every knob is a top-level variable with an inline
comment. Common changes:

```bash
NUM_SAMPLES=50                # set 50 for a quick smoke test (~5 min)
EXTRACTION_BACKEND="hf"       # if vLLM init fails on your machine
EXTRACTION_BATCH=32           # bump for faster extraction (default 8)
VLM_JUDGE_MODEL="google/gemini-2.5-flash"   # any image-capable OpenRouter model id
JUDGE_CONCURRENCY=8           # lower if you hit OpenRouter rate limits
```

### CLI alternative

If you'd rather skip the .sh wrapper, drive `run_eval.py` directly:

```bash
python run_eval.py \
  --checkpoint-path LiquidAI/LFM2.5-VL-450M-Extract \
  --data-path ./eval_data \
  --output-path ./eval_result.json \
  --num-samples 50 \
  --extraction-backend auto \
  --vlm-judge --vlm-judge-model qwen/qwen3.5-35b-a3b
```

All flags: `python run_eval.py --help`

---

## Eval data

### What ships in `./eval_data/`

2000 `(image, schema, JSON)` samples in a single WebDataset tar
(`eval_set_n2000.tar`). Reference labels were generated by an ensemble
of frontier multimodal models and lightly post-processed for consistency.

### Bring your own

Drop a `.tar` (or directory of tars) anywhere and pass
`--data-path /path/to/your/data`.

### Format spec

Each sample is a WebDataset group sharing a common `<sample_id>` prefix:

```
<sample_id>.jpg                image bytes
<sample_id>.key_explanations   JSON {key_name: description}   (the schema)
<sample_id>.structured_text    JSON {key_name: value}         (ground truth)
```
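
To sanity-check a custom tar against this layout, the following sketch groups members by `<sample_id>` using only the standard library. It is not the loader in `extract.py` (which uses `webdataset`); the function name, the returned dictionary keys, and the choice to skip incomplete groups are assumptions made for illustration.

```python
import io
import json
import posixpath
import tarfile
from collections import defaultdict

REQUIRED = {"jpg", "key_explanations", "structured_text"}


def load_samples(path=None, fileobj=None):
    """Group tar members by <sample_id> and decode the two JSON fields."""
    groups = defaultdict(dict)
    with tarfile.open(name=path, fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split at the first dot of the file name: "<sample_id>.<field>".
            head, base = posixpath.split(member.name)
            stem, _, field = base.partition(".")
            sample_id = posixpath.join(head, stem)
            groups[sample_id][field] = tar.extractfile(member).read()

    samples = []
    for sample_id, fields in sorted(groups.items()):
        if not REQUIRED.issubset(fields):
            continue  # incomplete group; skipping it is an assumption
        samples.append({
            "id": sample_id,
            "image": fields["jpg"],  # raw JPEG bytes
            "schema": json.loads(fields["key_explanations"]),
            "ground_truth": json.loads(fields["structured_text"]),
        })
    return samples
```

If this returns an empty list for your tar, the pipeline will most likely report that no usable samples were loaded as well.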

---

## Output

`./eval_result.json` has three top-level keys:

```jsonc
{
  "metadata": {
    "checkpoint_path": "LiquidAI/LFM2.5-VL-450M-Extract",
    "num_samples_evaluated": 50,
    "extraction_backend": "auto",
    "vlm_judge_model": "qwen/qwen3.5-35b-a3b",
    "elapsed_s": 215.2,
    "timestamp_utc": "2026-05-29T..."
  },
  "metrics": {
    "json_validity_rate":   0.996,    // share of samples with parseable JSON
    "key_precision_macro":  0.996,    // |pred-keys ∩ gt-keys| / |pred-keys|
    "key_recall_macro":     0.997,    // |pred-keys ∩ gt-keys| / |gt-keys|
    "key_f1_macro":         0.997,    // primary schema-consistency metric
    "vlm_judge_score_avg":  0.922,    // 0-1, VLM scoring of all keys vs image
    "samples_evaluated":    50
  },
  "samples": [
    /* per-sample {schema, gt, prediction, per_key scores, raw judge text} */
  ]
}
```

The `samples[].vlm_judge_raw` field preserves the judge's verbatim text
response, which is useful for debugging unexpected scores.
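
A short script can pull the headline numbers out of the result file. The sketch below assumes the file on disk is plain JSON (the comments in the example above are annotations only) and relies only on the `metrics` keys and the `vlm_judge_raw` field named in this section; the helper names are hypothetical.

```python
import json

PRIMARY_METRICS = ("json_validity_rate", "key_f1_macro", "vlm_judge_score_avg")


def load_result(path="./eval_result.json"):
    """Read the result file written by the pipeline."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(result):
    """Return the primary metrics plus a count of samples lacking judge text."""
    metrics = result["metrics"]
    summary = {name: metrics[name] for name in PRIMARY_METRICS}
    summary["samples_without_judge_text"] = sum(
        1 for sample in result.get("samples", [])
        if not sample.get("vlm_judge_raw")
    )
    return summary


if __name__ == "__main__":
    for name, value in summarize(load_result()).items():
        print(f"{name}: {value}")
```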

---

## Costs

Default judge on a full 2000-sample run, calculated against per-token
pricing at the time of writing (check https://openrouter.ai/models for
current rates):

| Stage | Model | Input rate | Output rate | Est. cost |
|---|---|---|---|---|
| VLM judge | `qwen/qwen3.5-35b-a3b` | $0.139 / 1M | $1.00 / 1M | ~$1.53 |

**Full 2000-sample run: ~$1.50.** Smoke 50-sample: ~$0.04.
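
The arithmetic behind these figures can be sketched as follows. The per-sample token counts are not stated in this README, so `token_cost` takes them as inputs rather than guessing them; `scale_estimate` assumes token usage per sample is roughly constant, so cost scales linearly with the number of samples.

```python
def token_cost(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    """Cost in USD given token counts and per-million-token rates."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000


def scale_estimate(full_run_cost, full_run_samples, num_samples):
    """Scale a known full-run estimate linearly to a different run size."""
    return full_run_cost * num_samples / full_run_samples


if __name__ == "__main__":
    smoke = scale_estimate(1.53, 2000, 50)
    print(f"50-sample estimate: ${smoke:.2f}")  # 50-sample estimate: $0.04
```

To re-estimate for a different judge model, plug that model's current rates and your observed token totals into `token_cost`.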

---

## Troubleshooting

- **vLLM init fails** (e.g. `Ninja build failed` / `__cudaLaunch not declared`)
  → set `EXTRACTION_BACKEND="hf"` in `run_eval.sh` for a slower-but-stable
  fallback.
- **OpenRouter 429 (rate limit)** → lower `JUDGE_CONCURRENCY` to 4 or 8.
- **`No usable samples loaded`** → your tars don't have the expected
  `<key>.jpg` / `.key_explanations` / `.structured_text` fields, or the
  `.tar` path is wrong.
- **A new judge model rejects with `Reasoning is mandatory`** or returns all
  zero scores with `finish_reason=length` → edit the `_VLM_JUDGE_REASONING`
  constant in `judge.py` (the OpenRouter `reasoning` param works differently
  per model).
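
For context on the rate-limit item above: a concurrency cap limits how many judge requests are in flight at once, which is why lowering it reduces 429 responses. The sketch below shows the general pattern with `asyncio.Semaphore`. It is not the code in `judge.py`; `judge_all` and `fake_judge` are hypothetical names, and `fake_judge` stands in for a real OpenRouter request.

```python
import asyncio


async def judge_all(items, judge_one, concurrency=8):
    """Run judge_one over items with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(item):
        async with semaphore:  # wait here while the cap is reached
            return await judge_one(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(bounded(item) for item in items))


async def fake_judge(item):
    """Stand-in for a real OpenRouter request (hypothetical)."""
    await asyncio.sleep(0.01)
    return {"id": item, "score": 1.0}


if __name__ == "__main__":
    results = asyncio.run(judge_all(range(20), fake_judge, concurrency=4))
    print(len(results))
```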