# Eval pipeline (OpenRouter judge)

A self-contained evaluation pipeline for LFM2.5-VL structured-extraction
models. Extraction runs on your local GPU (vLLM/HF); the VLM judge runs
remotely via the [OpenRouter](https://openrouter.ai/) API, so there is no
need to host a 30+ GB vision judge yourself.

## Pipeline

```
WDS tars ─▶ Extraction (local GPU) ─▶ predictions
                                           │
            structural metrics ◀───────────┤
            (json validity, key P/R/F1)    │
                                           │
            VLM judge (OpenRouter) ◀───────┘
                    │
                    ▼
            eval_result.json
```

Three primary metrics per run: `json_validity_rate`, `key_f1_macro`,
`vlm_judge_score_avg` (per-key precision / recall are also reported as
diagnostic byproducts of F1).

## Files

```
.
├── README.md
├── requirements.txt
├── run_eval.sh          ← entry script (env vars + python call)
├── run_eval.py          ← CLI + orchestrator + metrics aggregation
├── extract.py           ← WDS loader + vLLM/HF extraction + JSON parsing
├── judge.py             ← OpenRouter async VLM judging
├── prompts/             ← 2 prompt templates (.txt)
└── eval_data/           ← shipped 2000-sample eval set (single WDS tar)
```

Three Python files total. No nested packages, no `pyproject.toml`, no
`pip install -e .`; just `pip install -r requirements.txt`.

---

## Setup

### 1. Python environment

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

`pip install` will pull `vllm`, `torch`, `transformers`, `peft`,
`webdataset`, `pillow`, `openai`, `tqdm`, `numpy`: about 5 GB in total, which
takes 5–15 min depending on the network.

> **Mac / no NVIDIA GPU?** vLLM won't install. Either drop the `vllm`
> line from `requirements.txt`, or install everything else manually and
> run with `--extraction-backend hf` (forces the HF transformers path).

### 2. OpenRouter API key

Get a key from https://openrouter.ai/keys, then add it to your `~/.bashrc`:

```bash
export OPENROUTER_API_KEY=sk-or-v1-...
```

Then `source ~/.bashrc` (or open a new shell).

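
Before launching a long run, it can help to confirm that the key is actually visible to Python. The snippet below is a minimal pre-flight sketch and not part of the shipped scripts: the helper name `check_openrouter_key` is hypothetical, and the `sk-or-` prefix check is an assumption based only on the example key format shown above.

```python
import os


def check_openrouter_key(env=None):
    """Return (ok, message) describing whether OPENROUTER_API_KEY looks usable.

    `env` defaults to the process environment; pass a dict to test the logic.
    """
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        return False, "OPENROUTER_API_KEY is not set; export it and re-source your shell"
    # Assumption: keys start with "sk-or-", as in the example export line.
    if not key.startswith("sk-or-"):
        return False, "OPENROUTER_API_KEY is set but does not start with 'sk-or-'"
    return True, "OPENROUTER_API_KEY found"


if __name__ == "__main__":
    ok, message = check_openrouter_key()
    print(message)
```

This only checks that the variable is present and plausibly shaped; it does not contact OpenRouter, so a revoked or mistyped key would still pass.
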

---

## Run

### Quick start

```bash
bash run_eval.sh
```

Defaults:

- Evaluates `LiquidAI/LFM2.5-VL-450M-Extract` on `./eval_data/`
- Runs the full **2000 samples** (~30 min)
- VLM judge: `qwen/qwen3.5-35b-a3b`
- Writes results to `./eval_result.json` and the log to `./eval_run.log`

### Tweaking knobs

Open `run_eval.sh`: every knob is a top-level variable with an inline
comment. Common changes:

```bash
NUM_SAMPLES=50                             # set 50 for a quick smoke test (~5 min)
EXTRACTION_BACKEND="hf"                    # if vLLM init fails on your machine
EXTRACTION_BATCH=32                        # bump for faster extraction (default 8)
VLM_JUDGE_MODEL="google/gemini-2.5-flash"  # any image-capable OpenRouter model id
JUDGE_CONCURRENCY=8                        # lower if you hit OpenRouter rate limits
```

### CLI alternative

If you'd rather skip the `.sh` wrapper, drive `run_eval.py` directly:

```bash
python run_eval.py \
    --checkpoint-path LiquidAI/LFM2.5-VL-450M-Extract \
    --data-path ./eval_data \
    --output-path ./eval_result.json \
    --num-samples 50 \
    --extraction-backend auto \
    --vlm-judge --vlm-judge-model qwen/qwen3.5-35b-a3b
```

All flags: `python run_eval.py --help`

---

## Eval data

### What ships in `./eval_data/`

2000 `(image, schema, JSON)` samples in a single WebDataset tar
(`eval_set_n2000.tar`). Reference labels were generated by an ensemble of
frontier multimodal models and lightly post-processed for consistency.

### Bring your own

Drop a `.tar` (or a directory of tars) anywhere and pass
`--data-path /path/to/your/data`.

### Format spec

Each sample is a WebDataset group of files sharing a common filename prefix
(the sample key), with these extensions:

```
.jpg               image bytes
.key_explanations  JSON {key_name: description}   (the schema)
.structured_text   JSON {key_name: value}         (ground truth)
```

---

## Output

`./eval_result.json` has three top-level keys:

```jsonc
{
  "metadata": {
    "checkpoint_path": "LiquidAI/LFM2.5-VL-450M-Extract",
    "num_samples_evaluated": 50,
    "extraction_backend": "auto",
    "vlm_judge_model": "qwen/qwen3.5-35b-a3b",
    "elapsed_s": 215.2,
    "timestamp_utc": "2026-05-29T..."
  },
  "metrics": {
    "json_validity_rate": 0.996,     // share of samples with parseable JSON
    "key_precision_macro": 0.996,    // pred-keys ∩ gt-keys / pred-keys
    "key_recall_macro": 0.997,
    "key_f1_macro": 0.997,           // primary schema-consistency metric
    "vlm_judge_score_avg": 0.922,    // 0-1, VLM scoring of all keys vs image
    "samples_evaluated": 50
  },
  "samples": [ /* per-sample {schema, gt, prediction, per_key scores, raw judge text} */ ]
}
```

The `samples[].vlm_judge_raw` field preserves the judge's verbatim text
response, which is useful for debugging unexpected scores.

---

## Costs

Estimated cost of the default judge on a full 2000-sample run, calculated
against per-token pricing at the time of writing (check
https://openrouter.ai/models for current rates):

| Stage | Model | Input rate | Output rate | Est. cost |
|---|---|---|---|---|
| VLM judge | `qwen/qwen3.5-35b-a3b` | $0.139 / 1M tokens | $1.00 / 1M tokens | ~$1.53 |

**Full 2000-sample run: ~$1.50.** 50-sample smoke test: ~$0.04.

---

## Troubleshooting

- **vLLM init fails** (e.g. `Ninja build failed` / `__cudaLaunch not declared`)
  → set `EXTRACTION_BACKEND="hf"` in `run_eval.sh` for a slower-but-stable
  fallback.
- **OpenRouter 429 (rate limit)** → lower `JUDGE_CONCURRENCY` to 4 or 8.
- **`No usable samples loaded`** → your tars don't have the expected
  `.jpg` / `.key_explanations` / `.structured_text` fields, or the `.tar`
  path is wrong.
- **A new judge model rejects with `Reasoning is mandatory`** or returns all
  zero scores with `finish_reason=length` → edit the `_VLM_JUDGE_REASONING`
  constant in `judge.py` (the OpenRouter `reasoning` param works differently
  per model).
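
When sanity-checking the `key_*_macro` numbers in `eval_result.json`, it can help to recompute them by hand. The sketch below is not the code in `run_eval.py`; it follows the `pred-keys ∩ gt-keys / pred-keys` definition from the Output section and makes several assumptions: only top-level keys are compared, "macro" means an unweighted mean over samples, and a prediction that failed to parse (passed here as `None`) counts as having no keys. The function names `key_prf1` and `macro_key_metrics` are hypothetical.

```python
def key_prf1(pred_keys, gt_keys):
    """Precision, recall and F1 of predicted key names against ground truth."""
    pred, gt = set(pred_keys), set(gt_keys)
    overlap = len(pred & gt)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gt) if gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_key_metrics(samples):
    """Average per-sample scores over (prediction, ground_truth) dict pairs.

    Assumption: a prediction of None (unparseable JSON) is scored as empty.
    """
    scores = [key_prf1(pred or {}, gt) for pred, gt in samples]
    n = len(scores)
    if n == 0:
        return {"key_precision_macro": 0.0, "key_recall_macro": 0.0, "key_f1_macro": 0.0}
    return {
        "key_precision_macro": sum(s[0] for s in scores) / n,
        "key_recall_macro": sum(s[1] for s in scores) / n,
        "key_f1_macro": sum(s[2] for s in scores) / n,
    }
```

For example, a prediction with keys `{"a", "b"}` against ground truth `{"a", "c"}` shares one key, giving precision, recall, and F1 of 0.5 each. If the numbers computed this way diverge from the reported metrics, the assumptions above are the first thing to check against `run_eval.py`.
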