# `collect_eval_data.py`: Phase 4 paper-collection driver
Orchestrates per-cell audit-format eval runs across the Together AI
model roster, producing one JSONL file (plus a directory of minimap
PNGs) per `(model, pack, level, seed, fog_mode)` cell. The design has
two goals: a multi-hour, multi-model collection can resume after a
crash without losing data, and everything needed for downstream paper
analysis is captured at the source. That includes full observations
(with the `_raw` engine dict and the spatial tensor), the literal HTTP
request/response, engine warnings, per-turn signals, and a `terminal`
block with outcome, wall-clock, and token totals.
## Data layout
```
<output-dir>/
  _invocation.json                     # the CLI args + cost estimate
  _summary.json                        # written on completion
  .logs/                               # per-cell stdout+stderr
    <model_safe>__<pack>__<level>__seed<N>__<fog>.log
    <model_safe>__<pack>__<level>__seed<N>__<fog>.stats.json
  <YYYYMMDD-HHMMSS>__<model_safe>/     # one dir per (timestamp, model)
    <pack>__<level>__seed<N>__<fog>.jsonl
    <pack>__<level>__seed<N>__<fog>/
      turn_001.png
      turn_002.png
      ...
```
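As an illustration of how the cell grid maps onto these file names, here is a minimal sketch. The helper names (`model_safe`, `cell_stem`, `enumerate_cells`) are hypothetical, and the rule for turning a model id into `<model_safe>` (replacing `/` with `_`) is an assumption; the collector may sanitize names differently.

```python
from itertools import product


def model_safe(model: str) -> str:
    # Assumption: "/" is replaced so the model id can be used in a file name.
    return model.replace("/", "_")


def cell_stem(pack: str, level: str, seed: int, fog: str) -> str:
    # Mirrors the documented pattern <pack>__<level>__seed<N>__<fog>.
    return f"{pack}__{level}__seed{seed}__{fog}"


def enumerate_cells(models, packs, levels, seeds, fog_modes):
    """Yield (model, stem) for every (model, pack, level, seed, fog_mode) cell."""
    for model, pack, level, seed, fog in product(models, packs, levels, seeds, fog_modes):
        yield model, cell_stem(pack, level, seed, fog)
```

The log files under `.logs/` would then be named `model_safe(model) + "__" + stem + ".log"`, following the layout above.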
The JSONL is one line per turn; each line has these fields (full
schema in `openra_bench/full_playback.py`):
| field | description |
| ----------------- | ----------------------------------------------------------- |
| `turn` | int, 1-based turn index |
| `tick` | int, engine game tick at end of turn |
| `interrupt` | str or null — engine signal name that fired this turn |
| `obs` | dict, full `RustObsAdapter.render_state()` (includes `_raw` and `spatial`) |
| `briefing` | str, exactly the text the model received |
| `system_prompt` | str, first turn only; subsequent turns = null |
| `model_request` | dict, literal `{url, body}` posted to the provider |
| `model_response` | dict, literal `{raw, text, tool_calls, reasoning, usage, finish_reason}` |
| `commands_issued` | list[str], `Command.repr()` per command parsed |
| `engine_warnings` | list[str], from the Rust env's `info["warnings"]` |
| `signals` | dict, primitive signal snapshot (cash, kills, explored%, …) |
| `minimap_png` | str or null, relative path to the per-turn PNG |
| `done` | bool, engine `done` flag |
| `terminal` | present ONLY on the final line; see below |
The `terminal` block has:
```json
{
  "outcome": "win|loss|draw",
  "final_obs": { ... },
  "wall_clock_seconds": 12.345,
  "total_tokens_in": 78901,
  "total_tokens_out": 1234,
  "manifest": { /* scoring/score metadata */ }
}
```
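A quick structural check of a loaded cell can catch truncated or malformed files before analysis. The sketch below is not part of the bench; `check_cell` is a hypothetical helper, and it assumes every field in the table is present on every line (with `null` where not applicable) and that `terminal` appears only on the last line.

```python
TURN_FIELDS = {
    "turn", "tick", "interrupt", "obs", "briefing", "system_prompt",
    "model_request", "model_response", "commands_issued",
    "engine_warnings", "signals", "minimap_png", "done",
}
TERMINAL_FIELDS = {
    "outcome", "final_obs", "wall_clock_seconds",
    "total_tokens_in", "total_tokens_out", "manifest",
}


def check_cell(lines):
    """Return a list of human-readable problems; an empty list means the cell looks sane."""
    problems = []
    if not lines:
        return ["cell has no lines"]
    for i, rec in enumerate(lines):
        missing = TURN_FIELDS - set(rec)
        if missing:
            problems.append(f"line {i + 1}: missing fields {sorted(missing)}")
        is_last = i == len(lines) - 1
        if "terminal" in rec and not is_last:
            problems.append(f"line {i + 1}: unexpected terminal block")
    terminal = lines[-1].get("terminal")
    if terminal is None:
        problems.append("last line has no terminal block")
    else:
        missing = TERMINAL_FIELDS - set(terminal)
        if missing:
            problems.append(f"terminal: missing fields {sorted(missing)}")
    return problems
```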
## Invocation
```bash
python3 scripts/collect_eval_data.py \
    --models Qwen/Qwen3.5-9B,Qwen/Qwen3.6-Plus,google/gemma-4-31B-it,moonshotai/Kimi-K2.6 \
    --packs all \
    --levels easy,medium,hard \
    --seeds 1,2,3,4 \
    --fog-modes vision \
    --run-label paper-collection-v1 \
    --output-dir data/runs/paper-collection-v1 \
    --parallel-cells 4 \
    --resume
```
* `--packs` accepts `all`, a comma list, `@file.txt` (one pack id per
line), or a directory of `*.yaml`.
* `--fog-modes` accepts any subset of
`structured,structured-clear,vision,vision-clear,image,image-clear`.
* `--parallel-cells` controls subprocess concurrency. Each cell is an
isolated `python -m openra_bench.run_eval` invocation so a crash in
one cell never aborts the rest of the run.
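The isolation model described for `--parallel-cells` can be sketched with a thread pool that launches one subprocess per cell and records its exit code. This is an illustrative pattern only, not the collector's actual code: `run_cell` and `run_cells` are hypothetical names, and the `argv` for each cell is left to the caller because the `run_eval` command-line flags are not documented here.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_cell(cell_id, argv, log_dir):
    """Run one cell as a subprocess, capturing stdout+stderr to <log_dir>/<cell_id>.log."""
    log_path = Path(log_dir) / f"{cell_id}.log"
    with open(log_path, "wb") as log:
        proc = subprocess.run(argv, stdout=log, stderr=subprocess.STDOUT)
    return cell_id, proc.returncode


def run_cells(cells, log_dir, parallel_cells=4):
    """cells is an iterable of (cell_id, argv). Returns {cell_id: returncode}."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=parallel_cells) as pool:
        futures = [pool.submit(run_cell, cid, argv, log_dir) for cid, argv in cells]
        # A non-zero exit code in one cell is recorded, not raised,
        # so the remaining cells still run to completion.
        return dict(f.result() for f in futures)
```

Because each cell is its own process, a crash shows up only as a non-zero return code and a log file for that cell.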
Required env var (set in your shell / `.env`): `TOGETHER_API_KEY`.
## Cost estimates
`--cost-estimate` prints per-model and total token / USD estimates
WITHOUT spawning any subprocesses. The estimator uses average turns
and tokens-per-turn from the May 2026 pilot runs
(`playback/pilot_perception/`, `playback/pilot_handoff/`):
* `avg_turns_per_cell = 18` (max_turns is typically 36; mean is ~half)
* `avg_prompt_tokens_per_turn = 4500` (briefing + image + codex,
with the 16-turn sliding window)
* `avg_completion_tokens_per_turn = 250` (a tool call + brief
reasoning)
Pricing snapshot (USD per 1M tokens, May 2026):
| model | in / M | out / M |
| ---------------------- | ------ | ------- |
| Qwen/Qwen3.5-9B | $0.20 | $0.20 |
| Qwen/Qwen3.6-Plus | $0.50 | $1.50 |
| qwen/qwen3.6-flash | $0.18 | $0.18 |
| google/gemma-4-31B-it | $0.25 | $0.25 |
| moonshotai/Kimi-K2.6 | $0.60 | $2.50 |
Update `_TOGETHER_PRICES` at the top of `scripts/collect_eval_data.py`
when Together's published rates change.
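The arithmetic implied by these averages and prices can be sketched as follows. The function names are hypothetical and the constants are copied from this section; the real estimator in `scripts/collect_eval_data.py` may round or weight things differently, so its output and the rounded figures in the profile table need not match this sketch exactly.

```python
AVG_TURNS_PER_CELL = 18
AVG_PROMPT_TOKENS_PER_TURN = 4500
AVG_COMPLETION_TOKENS_PER_TURN = 250

# USD per 1M tokens as (input, output), from the pricing snapshot above.
PRICES = {
    "Qwen/Qwen3.5-9B": (0.20, 0.20),
    "Qwen/Qwen3.6-Plus": (0.50, 1.50),
    "qwen/qwen3.6-flash": (0.18, 0.18),
    "google/gemma-4-31B-it": (0.25, 0.25),
    "moonshotai/Kimi-K2.6": (0.60, 2.50),
}


def estimate_cell_usd(model: str) -> float:
    """Estimated cost of one cell for one model, using the pilot averages."""
    price_in, price_out = PRICES[model]
    tokens_in = AVG_TURNS_PER_CELL * AVG_PROMPT_TOKENS_PER_TURN
    tokens_out = AVG_TURNS_PER_CELL * AVG_COMPLETION_TOKENS_PER_TURN
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out


def estimate_run_usd(models, cells_per_model: int) -> float:
    """Estimated cost of a run where every model plays the same number of cells."""
    return sum(estimate_cell_usd(m) for m in models) * cells_per_model
```

For example, one `Qwen/Qwen3.6-Plus` cell comes to 81,000 input tokens and 4,500 output tokens, or about $0.047 under these assumptions.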
### Common run profiles (cost estimates)
| profile                                              | cells / model | total cells (4 models) | est USD (4 models) |
| ---------------------------------------------------- | ------------- | ---------------------- | ------------------ |
| **smoke**: 10 packs × 1 level × 1 seed × 1 fog       | 10            | 40                     | ~$1                |
| **mini**: 50 packs × 3 levels × 1 seed × 1 fog       | 150           | 600                    | ~$30               |
| **medium**: 50 packs × 3 levels × 4 seeds × 1 fog    | 600           | 2400                   | ~$120              |
| **full**: 200 packs × 3 levels × 4 seeds × 1 fog     | 2400          | 9600                   | ~$480              |
| **perception sweep**: full × 6 fog modes             | 14400         | 57600                  | ~$2900             |
Always run `--cost-estimate` before the real thing.
## Resume / re-run a failed cell
`--resume` scans the output dir and skips any cell whose JSONL is
complete (its last line has a `terminal:` field). Cells that crashed
mid-run leave a `<stem>.jsonl.partial` behind for forensics; the
final `<stem>.jsonl` is NOT created, so resume correctly retries them.
To force a re-run of one cell, delete its `<stem>.jsonl` (and the
sibling PNG dir if you want a fresh image series). The next
`--resume` invocation will re-spawn it.
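The completeness rule used by `--resume` (the last line carries a `terminal` field) is easy to reproduce when auditing a run by hand. A minimal sketch, with `is_complete` as a hypothetical helper rather than the collector's own function:

```python
import json
from pathlib import Path


def is_complete(jsonl_path) -> bool:
    """True if the file exists and its last non-empty line has a `terminal` field."""
    path = Path(jsonl_path)
    if not path.is_file():
        return False
    last = None
    with open(path) as fh:
        for line in fh:
            if line.strip():
                last = line
    if last is None:
        return False
    try:
        record = json.loads(last)
    except json.JSONDecodeError:
        # A truncated final line counts as incomplete.
        return False
    return isinstance(record, dict) and "terminal" in record
```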
For diagnostics, every cell's stdout+stderr is captured to
`<output-dir>/.logs/<cell-id>.log`. Tail one to see exactly what
`python -m openra_bench.run_eval` printed.
## Loading the data for paper analysis
```python
import json
from pathlib import Path


def iter_cells(run_dir):
    """Yield (path, lines) per cell; lines is a list[dict]."""
    for p in sorted(Path(run_dir).glob("**/*.jsonl")):
        if p.name.startswith("_") or p.name.endswith(".partial"):
            continue
        with open(p) as fh:
            lines = [json.loads(line) for line in fh if line.strip()]
        if lines:  # skip empty files so lines[-1] below is always valid
            yield p, lines


for path, lines in iter_cells("data/runs/paper-collection-v1"):
    term = lines[-1].get("terminal", {})
    outcome = term.get("outcome", "?")
    n_turns = len(lines)
    # Rough cost at the Qwen/Qwen3.6-Plus rates ($0.50 in / $1.50 out per 1M
    # tokens); substitute the rates of the model that produced the cell.
    cost = (
        term.get("total_tokens_in", 0) / 1e6 * 0.5
        + term.get("total_tokens_out", 0) / 1e6 * 1.5
    )
    print(f"{path.stem:<60} {outcome:<5} turns={n_turns:>3} ~${cost:.3f}")
```
The viewer at `scripts/view_playback.py` transparently understands
the audit JSONL format alongside the legacy `seed<N>/` dirs — point
it at `data/runs/<run-label>` and it picks up both shapes.
## Caveats / known limits
* With `--repeats > 1`, every repeat of a cell currently shares the
same JSONL stem; for paper-grade collection keep it at 1 (one cell ==
one deterministic seed). The cell-level reliability metric (pass^k)
belongs in a separate sweep with distinct `--seeds`.
* `--full-playback` runs ALONGSIDE the legacy `Playback` — pass both
`--playback` and `--full-playback` to `run_eval` if you want the
human-readable viewer files AND the audit JSONL. The collector
script only emits the audit format (the playback dir is what the
viewer reads natively).
* Engine warnings reflect the `info["warnings"]` list emitted by the
Rust env at each step; the bench does NOT attach a model-level
warning channel (the model is judged only by its commands).
* `model_response.raw` carries the entire provider response JSON,
which on Together's side includes a `usage` block; downstream
paper analysis should pull token counts from there rather than
trusting any aggregate, because the per-turn breakdown is the
authoritative source.
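
Following the last caveat, per-turn token counts can be summed directly from the raw provider responses and compared against the `terminal` totals. This sketch assumes the `usage` block uses the OpenAI-compatible key names `prompt_tokens` and `completion_tokens`; check a real `model_response.raw` before relying on it. `sum_usage` is a hypothetical helper.

```python
def sum_usage(lines):
    """Sum per-turn token usage from model_response.raw.usage across a cell's lines."""
    tokens_in = tokens_out = 0
    for rec in lines:
        raw = (rec.get("model_response") or {}).get("raw") or {}
        usage = raw.get("usage") or {}
        # Assumed key names (OpenAI-compatible usage block).
        tokens_in += usage.get("prompt_tokens", 0)
        tokens_out += usage.get("completion_tokens", 0)
    return tokens_in, tokens_out
```

Comparing the result with `total_tokens_in` and `total_tokens_out` from the final line's `terminal` block is a cheap consistency check on the aggregate.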