collect_eval_data.py — Phase 4 paper-collection driver
Orchestrates per-cell audit-format eval runs across the Together AI
model roster, producing one JSONL (+ minimap PNG dir) per
(model, pack, level, seed, fog_mode) cell. Designed so a multi-hour,
multi-model collection can crash-resume losslessly and so every byte
needed for downstream paper analysis is captured at source — full
observations (including the _raw engine dict and the spatial
tensor), the literal HTTP request/response, engine warnings, per-turn
signals, and a terminal: block with outcome / wall-clock / token
totals.
Data layout
<output-dir>/
  _invocation.json                    # the CLI args + cost estimate
  _summary.json                       # written on completion
  .logs/                              # per-cell stdout+stderr
    <model_safe>__<pack>__<level>__seed<N>__<fog>.log
    <model_safe>__<pack>__<level>__seed<N>__<fog>.stats.json
  <YYYYMMDD-HHMMSS>__<model_safe>/    # one dir per (timestamp, model)
    <pack>__<level>__seed<N>__<fog>.jsonl
    <pack>__<level>__seed<N>__<fog>/
      turn_001.png
      turn_002.png
      ...
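The per-cell file naming in the layout above can be inverted to recover the cell coordinates from a path. The following is a minimal sketch, not part of the collector: `CellKey` and `parse_cell_stem` are hypothetical names, the pack id `rush-hour` is made up for illustration, and the code assumes that levels, seeds, and fog modes never contain a double underscore.

```python
from pathlib import Path
from typing import NamedTuple


class CellKey(NamedTuple):
    """Coordinates of one cell inside a single model directory (hypothetical helper)."""
    pack: str
    level: str
    seed: int
    fog: str


def parse_cell_stem(stem: str) -> CellKey:
    """Parse '<pack>__<level>__seed<N>__<fog>' into its four parts."""
    # Split from the right so a pack id that itself contains "__" stays intact.
    pack, level, seed_part, fog = stem.rsplit("__", 3)
    if not seed_part.startswith("seed"):
        raise ValueError(f"unexpected seed segment in {stem!r}")
    return CellKey(pack, level, int(seed_part[len("seed"):]), fog)


# "rush-hour" is an invented pack id, used only to show the call.
key = parse_cell_stem(Path("rush-hour__easy__seed3__vision-clear.jsonl").stem)
print(key.pack, key.level, key.seed, key.fog)
```

Splitting from the right is a deliberate choice here: the fog mode and seed segments have a fixed shape, while pack ids are the part most likely to vary.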
The JSONL is one line per turn; each line has these fields (full
schema in openra_bench/full_playback.py):
| field | description |
|---|---|
| turn | int, 1-based turn index |
| tick | int, engine game tick at end of turn |
| interrupt | str or null, engine signal name that fired this turn |
| obs | dict, full RustObsAdapter.render_state() (includes _raw and spatial) |
| briefing | str, exactly the text the model received |
| system_prompt | str, first turn only; subsequent turns = null |
| model_request | dict, literal {url, body} posted to the provider |
| model_response | dict, literal {raw, text, tool_calls, reasoning, usage, finish_reason} |
| commands_issued | list[str], Command.repr() per command parsed |
| engine_warnings | list[str], from the Rust env's info["warnings"] |
| signals | dict, primitive signal snapshot (cash, kills, explored%, …) |
| minimap_png | str or null, relative path to the per-turn PNG |
| done | bool, engine done flag |
| terminal | present ONLY on the final line; see below |
The terminal block has:
{
  "outcome": "win|loss|draw",
  "final_obs": { ... },
  "wall_clock_seconds": 12.345,
  "total_tokens_in": 78901,
  "total_tokens_out": 1234,
  "manifest": { /* scoring/score metadata */ }
}
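A quick structural check of a loaded cell can catch truncated or malformed files before analysis. The sketch below is illustrative only: `check_cell` is a hypothetical helper, and it assumes every per-turn key from the field table is present on every line (with null where not applicable), which the authoritative schema in openra_bench/full_playback.py may or may not guarantee.

```python
# Per-turn keys taken from the field table; "terminal" is handled separately.
REQUIRED_FIELDS = {
    "turn", "tick", "interrupt", "obs", "briefing", "system_prompt",
    "model_request", "model_response", "commands_issued",
    "engine_warnings", "signals", "minimap_png", "done",
}


def check_cell(lines):
    """Return a list of problems found in one cell; an empty list means none."""
    problems = []
    for i, rec in enumerate(lines):
        missing = REQUIRED_FIELDS.difference(rec)
        if missing:
            problems.append(f"line {i + 1}: missing {sorted(missing)}")
        # The terminal block is documented as appearing only on the final line.
        if "terminal" in rec and i != len(lines) - 1:
            problems.append(f"line {i + 1}: terminal block before the final line")
    if not lines or "terminal" not in lines[-1]:
        problems.append("no terminal block on the final line")
    return problems
```

Calling `check_cell(lines)` on the list of dicts parsed from one JSONL returns human-readable messages rather than raising, so a whole run directory can be scanned in one pass.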
Invocation
python3 scripts/collect_eval_data.py \
--models Qwen/Qwen3.5-9B,Qwen/Qwen3.6-Plus,google/gemma-4-31B-it,moonshotai/Kimi-K2.6 \
--packs all \
--levels easy,medium,hard \
--seeds 1,2,3,4 \
--fog-modes vision \
--run-label paper-collection-v1 \
--output-dir data/runs/paper-collection-v1 \
--parallel-cells 4 \
--resume
- --packs accepts all, a comma list, @file.txt (one pack id per line), or a directory of *.yaml.
- --fog-modes accepts any subset of structured, structured-clear, vision, vision-clear, image, image-clear.
- --parallel-cells controls subprocess concurrency. Each cell is an isolated python -m openra_bench.run_eval invocation so a crash in one cell never aborts the rest of the run.
Required env var (set in your shell / .env): TOGETHER_API_KEY.
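The cell grid implied by these flags is the cartesian product of models, packs, levels, seeds, and fog modes. The sketch below shows that expansion under stated assumptions: `enumerate_cells` is a hypothetical function, the pack ids `pack-a` and `pack-b` are invented, and the real script's iteration order and model-name sanitisation are not reproduced here.

```python
from itertools import product


def enumerate_cells(models, packs, levels, seeds, fog_modes):
    """Yield one dict per (model, pack, level, seed, fog_mode) cell."""
    for model, pack, level, seed, fog in product(models, packs, levels, seeds, fog_modes):
        yield {
            "model": model,
            # File stem inside the per-model directory, as in the data layout.
            "stem": f"{pack}__{level}__seed{seed}__{fog}",
        }


cells = list(enumerate_cells(
    models=["Qwen/Qwen3.5-9B", "Qwen/Qwen3.6-Plus",
            "google/gemma-4-31B-it", "moonshotai/Kimi-K2.6"],
    packs=["pack-a", "pack-b"],          # invented pack ids
    levels=["easy", "medium", "hard"],
    seeds=[1, 2, 3, 4],
    fog_modes=["vision"],
))
print(len(cells), cells[0])
```

Counting the cells this way before launching is a cheap cross-check on the total the collector reports.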
Cost estimates
--cost-estimate prints per-model and total token / USD estimates
WITHOUT spawning any subprocesses. The estimator uses average turns
and tokens-per-turn from the May 2026 pilot runs
(playback/pilot_perception/, playback/pilot_handoff/):
- avg_turns_per_cell = 18 (max_turns is typically 36; mean is ~half)
- avg_prompt_tokens_per_turn = 4500 (briefing + image + codex, with the 16-turn sliding window)
- avg_completion_tokens_per_turn = 250 (a tool call + brief reasoning)
Pricing snapshot (USD per 1M tokens, May 2026):
| model | in / M | out / M |
|---|---|---|
| Qwen/Qwen3.5-9B | $0.20 | $0.20 |
| Qwen/Qwen3.6-Plus | $0.50 | $1.50 |
| qwen/qwen3.6-flash | $0.18 | $0.18 |
| google/gemma-4-31B-it | $0.25 | $0.25 |
| moonshotai/Kimi-K2.6 | $0.60 | $2.50 |
Update _TOGETHER_PRICES at the top of scripts/collect_eval_data.py
when Together's published rates change.
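The arithmetic behind a per-cell estimate can be sketched from the averages and the pricing snapshot above. This is a simplified illustration, not the script's estimator: `PRICES`, `ROSTER`, `estimate_cell_usd`, and `estimate_run_usd` are hypothetical names, and the real --cost-estimate may include terms omitted here, so treat its output as authoritative.

```python
# Pilot averages quoted above.
AVG_TURNS_PER_CELL = 18
AVG_PROMPT_TOKENS_PER_TURN = 4500
AVG_COMPLETION_TOKENS_PER_TURN = 250

# USD per 1M tokens as (input, output), copied from the pricing snapshot.
PRICES = {
    "Qwen/Qwen3.5-9B": (0.20, 0.20),
    "Qwen/Qwen3.6-Plus": (0.50, 1.50),
    "qwen/qwen3.6-flash": (0.18, 0.18),
    "google/gemma-4-31B-it": (0.25, 0.25),
    "moonshotai/Kimi-K2.6": (0.60, 2.50),
}

# The four models from the example invocation.
ROSTER = ["Qwen/Qwen3.5-9B", "Qwen/Qwen3.6-Plus",
          "google/gemma-4-31B-it", "moonshotai/Kimi-K2.6"]


def estimate_cell_usd(model):
    """Estimated USD for one average cell of the given model."""
    price_in, price_out = PRICES[model]
    tokens_in = AVG_TURNS_PER_CELL * AVG_PROMPT_TOKENS_PER_TURN
    tokens_out = AVG_TURNS_PER_CELL * AVG_COMPLETION_TOKENS_PER_TURN
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out


def estimate_run_usd(models, cells_per_model):
    """Estimated USD when every model runs the same number of cells."""
    return cells_per_model * sum(estimate_cell_usd(m) for m in models)


for name in ROSTER:
    print(f"{name:<24} ${estimate_cell_usd(name):.4f} per cell")
```

Prompt tokens dominate the estimate because each turn sends roughly 18 times more tokens than it receives.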
Common run profiles (cost estimates)
| profile | cells / model | total cells | est USD (4 models) |
|---|---|---|---|
| smoke: 10 packs × 1 level × 1 seed × 1 fog | 10 | 40 | ~$1 |
| mini: 50 packs × 3 levels × 1 seed × 1 fog | 150 | 600 | ~$30 |
| medium: 50 packs × 3 levels × 4 seeds × 1 fog | 600 | 2400 | ~$120 |
| full: 200 packs × 3 levels × 4 seeds × 1 fog | 2400 | 9600 | ~$480 |
| perception sweep: full × 6 fog modes | 14400 | 57600 | ~$2900 |
Always run --cost-estimate before the real thing.
Resume / re-run a failed cell
--resume scans the output dir and skips any cell whose JSONL is
complete (its last line has a terminal: field). Cells that crashed
mid-run leave a <stem>.jsonl.partial behind for forensics; the
final <stem>.jsonl is NOT created, so resume correctly retries them.
To force a re-run of one cell, delete its <stem>.jsonl (and the
sibling PNG dir if you want a fresh image series). The next
--resume invocation will re-spawn it.
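The completeness rule described above (a cell counts as done only if its final JSONL line carries a terminal field) can be sketched as follows. `is_cell_complete` is a hypothetical helper written for illustration; the collector's own check may differ in details such as how it treats unparsable lines.

```python
import json
from pathlib import Path


def is_cell_complete(jsonl_path):
    """True if the file exists and its last non-blank line has a 'terminal' key."""
    path = Path(jsonl_path)
    if not path.is_file():
        return False
    last = None
    with open(path) as fh:
        for line in fh:
            if line.strip():
                last = line
    if last is None:            # empty file
        return False
    try:
        record = json.loads(last)
    except json.JSONDecodeError:
        return False            # truncated final line: treat as incomplete
    return isinstance(record, dict) and "terminal" in record
```

Treating a truncated final line as incomplete errs on the side of re-running a cell rather than keeping a damaged file.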
For diagnostics, every cell's stdout+stderr is captured to
<output-dir>/.logs/<cell-id>.log. Tail one to see exactly what
python -m openra_bench.run_eval printed.
Loading the data for paper analysis
import json
from pathlib import Path


def iter_cells(run_dir):
    """Yield (path, lines) per cell; lines is a list[dict], one per turn."""
    for p in sorted(Path(run_dir).glob("**/*.jsonl")):
        if p.name.startswith("_") or p.name.endswith(".partial"):
            continue
        with open(p) as fh:
            lines = [json.loads(line) for line in fh if line.strip()]
        yield p, lines


for path, lines in iter_cells("data/runs/paper-collection-v1"):
    if not lines:  # an empty file has no final line to inspect
        continue
    term = lines[-1].get("terminal", {})
    outcome = term.get("outcome", "?")
    n_turns = len(lines)
    # Rates are USD per 1M tokens for Qwen/Qwen3.6-Plus ($0.50 in, $1.50 out);
    # substitute the matching row of the pricing table for other models.
    cost = (
        term.get("total_tokens_in", 0) / 1e6 * 0.5
        + term.get("total_tokens_out", 0) / 1e6 * 1.5
    )
    print(f"{path.stem:<60} {outcome:<5} turns={n_turns:>3} ~${cost:.3f}")
The viewer at scripts/view_playback.py transparently understands
the audit JSONL format alongside the legacy seed<N>/ dirs — point
it at data/runs/<run-label> and it picks up both shapes.
Caveats / known limits
- --repeats > 1 currently shares the JSONL stem; for paper-grade collection keep it at 1 (one cell == one deterministic seed). The cell-level reliability metric (pass^k) belongs in a separate sweep with distinct --seeds.
- --full-playback runs ALONGSIDE the legacy Playback; pass both --playback and --full-playback to run_eval if you want the human-readable viewer files AND the audit JSONL. The collector script only emits the audit format (the playback dir is what the viewer reads natively).
- Engine warnings reflect the info["warnings"] list emitted by the Rust env at each step; the bench does NOT attach a model-level warning channel (the model is judged only by its commands).
- model_response.raw carries the entire provider response JSON, which on Together's side includes a usage block; downstream paper analysis should pull token counts from there rather than trusting any aggregate, because the per-turn breakdown is the authoritative source.
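Summing token counts from the per-turn usage blocks, as the last caveat recommends, might look like the sketch below. `sum_usage` is a hypothetical helper, and the key names `prompt_tokens` and `completion_tokens` are an assumption based on the OpenAI-compatible response shape; verify them against a real `model_response.raw` before relying on the totals.

```python
def sum_usage(lines):
    """Return (tokens_in, tokens_out) summed over the per-turn usage blocks."""
    tokens_in = tokens_out = 0
    for rec in lines:
        response = rec.get("model_response") or {}
        raw = response.get("raw")
        # Prefer the provider's own usage block; fall back to the parsed copy.
        usage = raw.get("usage") if isinstance(raw, dict) else None
        usage = usage or response.get("usage") or {}
        # Key names are assumed, not confirmed by this document.
        tokens_in += usage.get("prompt_tokens") or 0
        tokens_out += usage.get("completion_tokens") or 0
    return tokens_in, tokens_out
```

Comparing the result with `total_tokens_in` and `total_tokens_out` from the terminal block is a simple way to spot turns whose usage was not recorded.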