# `collect_eval_data.py` — Phase 4 paper-collection driver

Orchestrates per-cell audit-format eval runs across the Together AI
model roster, producing one JSONL (plus a directory of minimap PNGs) per
`(model, pack, level, seed, fog_mode)` cell. The driver is designed so
that a multi-hour, multi-model collection can resume after a crash
without losing data, and so that every byte needed for downstream paper
analysis is captured at source: full observations (including the `_raw`
engine dict and the spatial tensor), the literal HTTP request/response,
engine warnings, per-turn signals, and a `terminal` block with outcome,
wall-clock time, and token totals.

## Data layout

```
<output-dir>/
  _invocation.json                    # the CLI args + cost estimate
  _summary.json                       # written on completion
  .logs/                              # per-cell stdout+stderr
    <model_safe>__<pack>__<level>__seed<N>__<fog>.log
    <model_safe>__<pack>__<level>__seed<N>__<fog>.stats.json
  <YYYYMMDD-HHMMSS>__<model_safe>/    # one dir per (timestamp, model)
    <pack>__<level>__seed<N>__<fog>.jsonl
    <pack>__<level>__seed<N>__<fog>/
      turn_001.png
      turn_002.png
      ...
```
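
The layout above encodes the cell identity in the path, so it can be recovered without opening the file. A minimal sketch, assuming the stem is always `<pack>__<level>__seed<N>__<fog>`; `CellId` and `parse_cell_path` are hypothetical helpers, not part of the collector:

```python
from pathlib import Path
from typing import NamedTuple

class CellId(NamedTuple):
    model_dir: str  # "<YYYYMMDD-HHMMSS>__<model_safe>", kept unsplit
    pack: str
    level: str
    seed: int
    fog: str

def parse_cell_path(path):
    """Recover the cell identity from a '<...>/<stem>.jsonl' path."""
    p = Path(path)
    # Split from the right so a pack id that itself contains "__" stays intact.
    pack, level, seed_part, fog = p.stem.rsplit("__", 3)
    if not seed_part.startswith("seed"):
        raise ValueError(f"unexpected cell stem: {p.stem!r}")
    return CellId(p.parent.name, pack, level, int(seed_part[len("seed"):]), fog)
```

The directory and pack names used to exercise this helper are made up; how the script derives `<model_safe>` from a model id is not specified here, so the directory name is returned as-is.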

The JSONL is one line per turn; each line has these fields (full
schema in `openra_bench/full_playback.py`):

| field             | description                                                 |
| ----------------- | ----------------------------------------------------------- |
| `turn`            | int, 1-based turn index                                     |
| `tick`            | int, engine game tick at end of turn                        |
| `interrupt`       | str or null — engine signal name that fired this turn       |
| `obs`             | dict, full `RustObsAdapter.render_state()` (includes `_raw` and `spatial`) |
| `briefing`        | str, exactly the text the model received                    |
| `system_prompt`   | str, first turn only; subsequent turns = null               |
| `model_request`   | dict, literal `{url, body}` posted to the provider          |
| `model_response`  | dict, literal `{raw, text, tool_calls, reasoning, usage, finish_reason}` |
| `commands_issued` | list[str], `Command.repr()` per command parsed              |
| `engine_warnings` | list[str], from the Rust env's `info["warnings"]`           |
| `signals`         | dict, primitive signal snapshot (cash, kills, explored%, …) |
| `minimap_png`     | str or null, relative path to the per-turn PNG              |
| `done`            | bool, engine `done` flag                                    |
| `terminal`        | present ONLY on the final line; see below                   |

The `terminal` block has:

```json
{
  "outcome": "win|loss|draw",
  "final_obs": { ... },
  "wall_clock_seconds": 12.345,
  "total_tokens_in": 78901,
  "total_tokens_out": 1234,
  "manifest": { /* scoring/score metadata */ }
}
```

## Invocation

```bash
python3 scripts/collect_eval_data.py \
  --models Qwen/Qwen3.5-9B,Qwen/Qwen3.6-Plus,google/gemma-4-31B-it,moonshotai/Kimi-K2.6 \
  --packs all \
  --levels easy,medium,hard \
  --seeds 1,2,3,4 \
  --fog-modes vision \
  --run-label paper-collection-v1 \
  --output-dir data/runs/paper-collection-v1 \
  --parallel-cells 4 \
  --resume
```

* `--packs` accepts `all`, a comma list, `@file.txt` (one pack id per
  line), or a directory of `*.yaml`.
* `--fog-modes` accepts any subset of
  `structured,structured-clear,vision,vision-clear,image,image-clear`.
* `--parallel-cells` controls subprocess concurrency. Each cell is an
  isolated `python -m openra_bench.run_eval` invocation so a crash in
  one cell never aborts the rest of the run.

Required env var (set in your shell / `.env`): `TOGETHER_API_KEY`.
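
The set of cells for an invocation is the Cartesian product of the five CLI lists. The sketch below illustrates that counting only; `expand_cells` and the `pack_000`-style ids are hypothetical, and the script's own ordering and scheduling may differ:

```python
from itertools import product

def expand_cells(models, packs, levels, seeds, fog_modes):
    """Return every (model, pack, level, seed, fog_mode) combination."""
    return list(product(models, packs, levels, seeds, fog_modes))

cells = expand_cells(
    models=[
        "Qwen/Qwen3.5-9B",
        "Qwen/Qwen3.6-Plus",
        "google/gemma-4-31B-it",
        "moonshotai/Kimi-K2.6",
    ],
    packs=[f"pack_{i:03d}" for i in range(10)],  # hypothetical pack ids
    levels=["easy"],
    seeds=[1],
    fog_modes=["vision"],
)
print(len(cells))  # 4 models x 10 packs x 1 level x 1 seed x 1 fog = 40
```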

## Cost estimates

`--cost-estimate` prints per-model and total token / USD estimates
WITHOUT spawning any subprocesses. The estimator uses average turns
and tokens-per-turn from the May 2026 pilot runs
(`playback/pilot_perception/`, `playback/pilot_handoff/`):

* `avg_turns_per_cell = 18` (max_turns is typically 36; mean is ~half)
* `avg_prompt_tokens_per_turn = 4500` (briefing + image + codex,
  with the 16-turn sliding window)
* `avg_completion_tokens_per_turn = 250` (a tool call + brief
  reasoning)

Pricing snapshot (USD per 1M tokens, May 2026):

| model                  | in / M | out / M |
| ---------------------- | ------ | ------- |
| Qwen/Qwen3.5-9B        | $0.20  | $0.20   |
| Qwen/Qwen3.6-Plus      | $0.50  | $1.50   |
| qwen/qwen3.6-flash     | $0.18  | $0.18   |
| google/gemma-4-31B-it  | $0.25  | $0.25   |
| moonshotai/Kimi-K2.6   | $0.60  | $2.50   |

Update `_TOGETHER_PRICES` at the top of `scripts/collect_eval_data.py`
when Together's published rates change.
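
As a rough illustration of how the averages and prices above combine, the sketch below multiplies them out per cell. It is an assumption about the arithmetic, not the script's estimator: `estimate_cell_cost` is a hypothetical name, and the figures the script prints may differ from this plain product.

```python
AVG_TURNS_PER_CELL = 18
AVG_PROMPT_TOKENS_PER_TURN = 4500
AVG_COMPLETION_TOKENS_PER_TURN = 250

# (input, output) USD per 1M tokens, copied from the pricing snapshot above.
PRICES = {
    "Qwen/Qwen3.5-9B": (0.20, 0.20),
    "Qwen/Qwen3.6-Plus": (0.50, 1.50),
    "qwen/qwen3.6-flash": (0.18, 0.18),
    "google/gemma-4-31B-it": (0.25, 0.25),
    "moonshotai/Kimi-K2.6": (0.60, 2.50),
}

def estimate_cell_cost(model):
    """Estimated USD for one cell of `model` using the pilot averages."""
    price_in, price_out = PRICES[model]
    tokens_in = AVG_TURNS_PER_CELL * AVG_PROMPT_TOKENS_PER_TURN
    tokens_out = AVG_TURNS_PER_CELL * AVG_COMPLETION_TOKENS_PER_TURN
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

def estimate_run_cost(models, cells_per_model):
    """Estimated USD for a run with the same cell count for every model."""
    return sum(estimate_cell_cost(m) * cells_per_model for m in models)
```

For example, `estimate_cell_cost("Qwen/Qwen3.6-Plus")` works out to 81,000 prompt tokens at $0.50/M plus 4,500 completion tokens at $1.50/M.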

### Common run profiles (cost estimates)

| profile                                               | cells / model | total cells | est USD (4 models) |
| ----------------------------------------------------- | ------------- | ----------- | ------------------ |
| **smoke**: 10 packs × 1 level × 1 seed × 1 fog        | 10            | 40          | ~$1                |
| **mini**: 50 packs × 3 levels × 1 seed × 1 fog        | 150           | 600         | ~$30               |
| **medium**: 50 packs × 3 levels × 4 seeds × 1 fog     | 600           | 2400        | ~$120              |
| **full**: 200 packs × 3 levels × 4 seeds × 1 fog      | 2400          | 9600        | ~$480              |
| **perception sweep**: full × 6 fog modes              | 14400         | 57600       | ~$2900             |

Always run `--cost-estimate` before the real thing.

## Resume / re-run a failed cell

`--resume` scans the output dir and skips any cell whose JSONL is
complete (its last line has a `terminal:` field). Cells that crashed
mid-run leave a `<stem>.jsonl.partial` behind for forensics; the
final `<stem>.jsonl` is NOT created, so resume correctly retries them.
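
The completeness rule described above can be sketched as follows. This is an illustration of the stated rule (last non-empty line carries a `terminal` field), not the collector's actual code; `is_cell_complete` is a hypothetical name:

```python
import json
from pathlib import Path

def is_cell_complete(jsonl_path):
    """True if the file exists and its last non-empty line has a 'terminal' key."""
    p = Path(jsonl_path)
    if not p.is_file():
        return False
    last = None
    with open(p) as fh:
        for line in fh:
            if line.strip():
                last = line
    if last is None:  # empty file
        return False
    try:
        record = json.loads(last)
    except json.JSONDecodeError:  # truncated final line
        return False
    return isinstance(record, dict) and "terminal" in record
```

A helper like this is also handy in analysis code to confirm that a cell finished before its totals are trusted.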

To force a re-run of one cell, delete its `<stem>.jsonl` (and the
sibling PNG dir if you want a fresh image series). The next
`--resume` invocation will re-spawn it.

For diagnostics, every cell's stdout+stderr is captured to
`<output-dir>/.logs/<cell-id>.log`. Tail one to see exactly what
`python -m openra_bench.run_eval` printed.

## Loading the data for paper analysis

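```python
import json
from pathlib import Path

# Illustrative prices (USD per 1M tokens) for Qwen/Qwen3.6-Plus, taken from
# the pricing snapshot; substitute the rates for the model being analysed.
PRICE_IN, PRICE_OUT = 0.50, 1.50

def iter_cells(run_dir):
    """Yield (path, lines) per cell; lines is a list[dict], one per turn."""
    # "*.jsonl.partial" files from crashed cells do not match this glob.
    for p in sorted(Path(run_dir).glob("**/*.jsonl")):
        if p.name.startswith("_"):
            continue
        with open(p) as fh:
            lines = [json.loads(line) for line in fh if line.strip()]
        if lines:  # skip empty files so lines[-1] below is always valid
            yield p, lines

for path, lines in iter_cells("data/runs/paper-collection-v1"):
    # The terminal block lives only on the final line of a completed cell.
    term = lines[-1].get("terminal", {})
    outcome = term.get("outcome", "?")
    n_turns = len(lines)
    cost = (
        term.get("total_tokens_in", 0) / 1e6 * PRICE_IN
        + term.get("total_tokens_out", 0) / 1e6 * PRICE_OUT
    )
    print(f"{path.stem:<60} {outcome:<5} turns={n_turns:>3} ~${cost:.3f}")
```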

The viewer at `scripts/view_playback.py` transparently understands
the audit JSONL format alongside the legacy `seed<N>/` dirs — point
it at `data/runs/<run-label>` and it picks up both shapes.

## Caveats / known limits

* `--repeats > 1` currently shares the JSONL stem; for paper-grade
  collection keep it at 1 (one cell == one deterministic seed). The
  cell-level reliability metric (pass^k) belongs in a separate sweep
  with distinct `--seeds`.
* `--full-playback` runs ALONGSIDE the legacy `Playback` — pass both
  `--playback` and `--full-playback` to `run_eval` if you want the
  human-readable viewer files AND the audit JSONL. The collector
  script only emits the audit format (the playback dir is what the
  viewer reads natively).
* Engine warnings reflect the `info["warnings"]` list emitted by the
  Rust env at each step; the bench does NOT attach a model-level
  warning channel (the model is judged only by its commands).
* `model_response.raw` carries the entire provider response JSON,
  which on Together's side includes a `usage` block; downstream
  paper analysis should pull token counts from there rather than
  trusting any aggregate, because the per-turn breakdown is the
  authoritative source.
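
A minimal sketch of summing token counts per turn, as the last caveat recommends. It assumes the `usage` block uses the OpenAI-style keys `prompt_tokens` and `completion_tokens`, and that `raw` is either a dict or a JSON string; both are assumptions to verify against real records, and `sum_usage` is a hypothetical helper:

```python
import json

def _usage_of(record):
    """Return the usage dict for one turn, or {} if none can be found."""
    resp = record.get("model_response") or {}
    usage = resp.get("usage")
    if not usage:
        raw = resp.get("raw")
        if isinstance(raw, str):  # assumption: raw may be stored as JSON text
            try:
                raw = json.loads(raw)
            except json.JSONDecodeError:
                raw = None
        if isinstance(raw, dict):
            usage = raw.get("usage")
    return usage or {}

def sum_usage(lines):
    """Sum (tokens_in, tokens_out) over the per-turn records of one cell."""
    tokens_in = tokens_out = 0
    for record in lines:
        usage = _usage_of(record)
        tokens_in += usage.get("prompt_tokens") or 0       # assumed key name
        tokens_out += usage.get("completion_tokens") or 0  # assumed key name
    return tokens_in, tokens_out
```

Comparing the result with `terminal.total_tokens_in` and `terminal.total_tokens_out` for the same cell is a cheap consistency check.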