# `collect_eval_data.py`: Phase 4 paper-collection driver

Orchestrates per-cell audit-format eval runs across the Together AI model roster, producing one JSONL (+ minimap PNG dir) per `(model, pack, level, seed, fog_mode)` cell. Designed so a multi-hour, multi-model collection can crash-resume losslessly and so every byte needed for downstream paper analysis is captured at source: full observations (including the `_raw` engine dict and the spatial tensor), the literal HTTP request/response, engine warnings, per-turn signals, and a `terminal` block with outcome / wall-clock / token totals.

## Data layout

```
/
  _invocation.json            # the CLI args + cost estimate
  _summary.json               # written on completion
  .logs/                      # per-cell stdout+stderr
    ______seed__.log
    ______seed__.stats.json
  __/                         # one dir per (timestamp, model)
    ____seed__.jsonl
    ____seed__/
      turn_001.png
      turn_002.png
      ...
```

The JSONL is one line per turn; each line has these fields (full schema in `openra_bench/full_playback.py`):

| field             | description                                                                |
| ----------------- | -------------------------------------------------------------------------- |
| `turn`            | int, 1-based turn index                                                    |
| `tick`            | int, engine game tick at end of turn                                       |
| `interrupt`       | str or null, engine signal name that fired this turn                       |
| `obs`             | dict, full `RustObsAdapter.render_state()` (includes `_raw` and `spatial`) |
| `briefing`        | str, exactly the text the model received                                   |
| `system_prompt`   | str, first turn only; subsequent turns = null                              |
| `model_request`   | dict, literal `{url, body}` posted to the provider                         |
| `model_response`  | dict, literal `{raw, text, tool_calls, reasoning, usage, finish_reason}`   |
| `commands_issued` | list[str], `Command.repr()` per command parsed                             |
| `engine_warnings` | list[str], from the Rust env's `info["warnings"]`                          |
| `signals`         | dict, primitive signal snapshot (cash, kills, explored%, …)                |
| `minimap_png`     | str or null, relative path to the per-turn PNG                             |
| `done`            | bool, engine `done` flag                                                   |
| `terminal`        | present ONLY on the final line; see below                                  |

The `terminal` block has:

```json
{
  "outcome": "win|loss|draw",
  "final_obs": { ... },
  "wall_clock_seconds": 12.345,
  "total_tokens_in": 78901,
  "total_tokens_out": 1234,
  "manifest": { /* scoring/score metadata */ }
}
```

## Invocation

```bash
python3 scripts/collect_eval_data.py \
  --models Qwen/Qwen3.5-9B,Qwen/Qwen3.6-Plus,google/gemma-4-31B-it,moonshotai/Kimi-K2.6 \
  --packs all \
  --levels easy,medium,hard \
  --seeds 1,2,3,4 \
  --fog-modes vision \
  --run-label paper-collection-v1 \
  --output-dir data/runs/paper-collection-v1 \
  --parallel-cells 4 \
  --resume
```

* `--packs` accepts `all`, a comma list, `@file.txt` (one pack id per line), or a directory of `*.yaml`.
* `--fog-modes` accepts any subset of `structured,structured-clear,vision,vision-clear,image,image-clear`.
* `--parallel-cells` controls subprocess concurrency. Each cell is an isolated `python -m openra_bench.run_eval` invocation so a crash in one cell never aborts the rest of the run.

Required env var (set in your shell / `.env`): `TOGETHER_API_KEY`.

## Cost estimates

`--cost-estimate` prints per-model and total token / USD estimates WITHOUT spawning any subprocesses. The estimator uses average turns and tokens-per-turn from the May 2026 pilot runs (`playback/pilot_perception/`, `playback/pilot_handoff/`):

* `avg_turns_per_cell = 18` (max_turns is typically 36; mean is ~half)
* `avg_prompt_tokens_per_turn = 4500` (briefing + image + codex, with the 16-turn sliding window)
* `avg_completion_tokens_per_turn = 250` (a tool call + brief reasoning)

Pricing snapshot (USD per 1M tokens, May 2026):

| model                  | in / M | out / M |
| ---------------------- | ------ | ------- |
| Qwen/Qwen3.5-9B        | $0.20  | $0.20   |
| Qwen/Qwen3.6-Plus      | $0.50  | $1.50   |
| qwen/qwen3.6-flash     | $0.18  | $0.18   |
| google/gemma-4-31B-it  | $0.25  | $0.25   |
| moonshotai/Kimi-K2.6   | $0.60  | $2.50   |

Update `_TOGETHER_PRICES` at the top of `scripts/collect_eval_data.py` when Together's published rates change.

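
The arithmetic behind a per-cell estimate can be sketched from the averages and the pricing snapshot above. This is a minimal illustration, not the script's actual estimator: the constant names, the `PRICES` dict, and both helper functions are hypothetical, and the real `--cost-estimate` output may differ.

```python
# Pilot averages quoted in this section (names here are illustrative).
AVG_TURNS_PER_CELL = 18
AVG_PROMPT_TOKENS_PER_TURN = 4500
AVG_COMPLETION_TOKENS_PER_TURN = 250

# USD per 1M tokens as (input, output), copied from the May 2026 snapshot.
# Hypothetical stand-in for the script's `_TOGETHER_PRICES`.
PRICES = {
    "Qwen/Qwen3.5-9B": (0.20, 0.20),
    "Qwen/Qwen3.6-Plus": (0.50, 1.50),
    "google/gemma-4-31B-it": (0.25, 0.25),
    "moonshotai/Kimi-K2.6": (0.60, 2.50),
}


def estimate_cell_usd(model: str) -> float:
    """Estimated USD for one cell of `model` at the pilot averages."""
    price_in, price_out = PRICES[model]
    tokens_in = AVG_TURNS_PER_CELL * AVG_PROMPT_TOKENS_PER_TURN
    tokens_out = AVG_TURNS_PER_CELL * AVG_COMPLETION_TOKENS_PER_TURN
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out


def estimate_run_usd(models, cells_per_model: int) -> float:
    """Sum the per-cell estimate over every model in the roster."""
    return sum(estimate_cell_usd(m) * cells_per_model for m in models)


if __name__ == "__main__":
    for name in PRICES:
        print(f"{name:<24} ${estimate_cell_usd(name):.4f} per cell")
```

Because input tokens dominate (4500 per turn against 250 output tokens), the input price drives most of the estimate for every model in the table.
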
### Common run profiles (cost estimates)

| profile                                            | cells / model | total cells | est. USD (4 models) |
| -------------------------------------------------- | ------------- | ----------- | ------------------- |
| **smoke**: 10 packs × 1 level × 1 seed × 1 fog     | 10            | 40          | ~$1                 |
| **mini**: 50 packs × 3 levels × 1 seed × 1 fog     | 150           | 600         | ~$30                |
| **medium**: 50 packs × 3 levels × 4 seeds × 1 fog  | 600           | 2400        | ~$120               |
| **full**: 200 packs × 3 levels × 4 seeds × 1 fog   | 2400          | 9600        | ~$480               |
| **perception sweep**: full × 6 fog modes           | 14400         | 57600       | ~$2900              |

Always run `--cost-estimate` before the real thing.

## Resume / re-run a failed cell

`--resume` scans the output dir and skips any cell whose JSONL is complete (its last line has a `terminal` field). Cells that crashed mid-run leave a `.jsonl.partial` behind for forensics; the final `.jsonl` is NOT created, so resume correctly retries them.

To force a re-run of one cell, delete its `.jsonl` (and the sibling PNG dir if you want a fresh image series). The next `--resume` invocation will re-spawn it.

For diagnostics, every cell's stdout+stderr is captured to `/.logs/.log`. Tail one to see exactly what `python -m openra_bench.run_eval` printed.

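
The completeness rule that `--resume` applies (a cell is done when the last line of its JSONL carries a `terminal` field) can be approximated when auditing a run directory by hand. The helper below is a sketch under that stated rule only; `is_cell_complete` is a hypothetical name and is not claimed to match the collector's internal check.

```python
import json
from pathlib import Path


def is_cell_complete(jsonl_path) -> bool:
    """True if the file exists and its last non-empty line has a `terminal` key."""
    path = Path(jsonl_path)
    if not path.is_file():
        return False  # crashed cells leave only a .jsonl.partial behind

    last = None
    with path.open() as fh:
        for line in fh:
            if line.strip():
                last = line  # remember the final non-empty line
    if last is None:
        return False  # empty file

    try:
        record = json.loads(last)
    except json.JSONDecodeError:
        return False  # truncated or corrupt final line
    return isinstance(record, dict) and "terminal" in record
```

Running this over every `*.jsonl` in a run directory gives a quick list of cells that a subsequent `--resume` would be expected to retry.
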
## Loading the data for paper analysis

```python
import json
from pathlib import Path


def iter_cells(run_dir):
    """Yield (path, lines) per cell; lines is a list[dict], one per turn."""
    for p in sorted(Path(run_dir).glob("**/*.jsonl")):
        # Skip run-level files and any leftover partial outputs.
        if p.name.startswith("_") or p.name.endswith(".partial"):
            continue
        with open(p) as fh:
            lines = [json.loads(l) for l in fh if l.strip()]
        if not lines:
            continue  # an empty file would break lines[-1] below
        yield p, lines


for path, lines in iter_cells("data/runs/paper-collection-v1"):
    term = lines[-1].get("terminal", {})
    outcome = term.get("outcome", "?")
    n_turns = len(lines)
    # Rough cost at the Qwen/Qwen3.6-Plus rates ($0.50 in / $1.50 out per 1M
    # tokens); substitute the per-model prices for other models.
    cost = (
        term.get("total_tokens_in", 0) / 1e6 * 0.5
        + term.get("total_tokens_out", 0) / 1e6 * 1.5
    )
    print(f"{path.stem:<60} {outcome:<5} turns={n_turns:>3} ~${cost:.3f}")
```

The viewer at `scripts/view_playback.py` transparently understands the audit JSONL format alongside the legacy `seed/` dirs. Point it at `data/runs/` and it picks up both shapes.

## Caveats / known limits

* `--repeats > 1` currently shares the JSONL stem; for paper-grade collection keep it at 1 (one cell == one deterministic seed). The cell-level reliability metric (pass^k) belongs in a separate sweep with distinct `--seeds`.
* `--full-playback` runs ALONGSIDE the legacy `Playback`. Pass both `--playback` and `--full-playback` to `run_eval` if you want the human-readable viewer files AND the audit JSONL. The collector script only emits the audit format (the playback dir is what the viewer reads natively).
* Engine warnings reflect the `info["warnings"]` list emitted by the Rust env at each step; the bench does NOT attach a model-level warning channel (the model is judged only by its commands).
* `model_response.raw` carries the entire provider response JSON, which on Together's side includes a `usage` block; downstream paper analysis should pull token counts from there rather than trusting any aggregate, because the per-turn breakdown is the authoritative source.

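
Following the last caveat, per-turn token counts can be summed directly from each line's `model_response`. The sketch below assumes the `usage` block uses the OpenAI-style keys `prompt_tokens` and `completion_tokens`; that key naming is an assumption not stated in this document, and `sum_usage` is a hypothetical helper, so check a real `model_response.raw` payload before relying on it.

```python
def sum_usage(lines):
    """Sum per-turn prompt/completion tokens across one cell's turn records."""
    tokens_in = tokens_out = 0
    for rec in lines:
        response = rec.get("model_response") or {}
        raw = response.get("raw")
        raw_usage = raw.get("usage") if isinstance(raw, dict) else None
        # Prefer the raw provider payload; fall back to the parsed `usage` field.
        usage = raw_usage or response.get("usage") or {}
        # ASSUMPTION: OpenAI-style key names inside the usage block.
        tokens_in += usage.get("prompt_tokens") or 0
        tokens_out += usage.get("completion_tokens") or 0
    return tokens_in, tokens_out
```

Comparing the result against `total_tokens_in` and `total_tokens_out` in the `terminal` block is a cheap consistency check for each cell.
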