# `collect_eval_data.py` — Phase 4 paper-collection driver
Orchestrates per-cell audit-format eval runs across the Together AI
model roster, producing one JSONL (+ minimap PNG dir) per
`(model, pack, level, seed, fog_mode)` cell. Designed so a multi-hour,
multi-model collection can crash-resume losslessly and so every byte
needed for downstream paper analysis is captured at source — full
observations (including the `_raw` engine dict and the spatial
tensor), the literal HTTP request/response, engine warnings, per-turn
signals, and a `terminal` block with outcome / wall-clock / token
totals.
## Data layout
```
<output-dir>/
  _invocation.json                  # the CLI args + cost estimate
  _summary.json                     # written on completion
  .logs/                            # per-cell stdout+stderr
    <model_safe>__<pack>__<level>__seed<N>__<fog>.log
    <model_safe>__<pack>__<level>__seed<N>__<fog>.stats.json
  <YYYYMMDD-HHMMSS>__<model_safe>/  # one dir per (timestamp, model)
    <pack>__<level>__seed<N>__<fog>.jsonl
    <pack>__<level>__seed<N>__<fog>/
      turn_001.png
      turn_002.png
      ...
```
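
The per-cell file stem in the layout above can be split back into its components when indexing a run directory. The following is a minimal sketch, not part of the collector: `CellId` and `parse_cell_stem` are hypothetical names, and it assumes that level, seed, and fog mode never contain `__`.

```python
from typing import NamedTuple


class CellId(NamedTuple):
    """Hypothetical container for the parts of a cell's file stem."""
    pack: str
    level: str
    seed: int
    fog_mode: str


def parse_cell_stem(stem: str) -> CellId:
    """Split '<pack>__<level>__seed<N>__<fog>' into its four parts."""
    # Split from the right so a pack id that itself contains '__' stays intact.
    pack, level, seed_part, fog_mode = stem.rsplit("__", 3)
    if not seed_part.startswith("seed"):
        raise ValueError(f"unexpected seed component in {stem!r}")
    return CellId(pack, level, int(seed_part[len("seed"):]), fog_mode)
```

For a JSONL path `p`, `parse_cell_stem(p.stem)` gives the cell coordinates; the model comes from the parent directory name.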
The JSONL is one line per turn; each line has these fields (full
schema in `openra_bench/full_playback.py`):
| field | description |
| ----------------- | ----------------------------------------------------------- |
| `turn` | int, 1-based turn index |
| `tick` | int, engine game tick at end of turn |
| `interrupt` | str or null — engine signal name that fired this turn |
| `obs` | dict, full `RustObsAdapter.render_state()` (includes `_raw` and `spatial`) |
| `briefing` | str, exactly the text the model received |
| `system_prompt` | str, first turn only; subsequent turns = null |
| `model_request` | dict, literal `{url, body}` posted to the provider |
| `model_response` | dict, literal `{raw, text, tool_calls, reasoning, usage, finish_reason}` |
| `commands_issued` | list[str], `Command.repr()` per command parsed |
| `engine_warnings` | list[str], from the Rust env's `info["warnings"]` |
| `signals` | dict, primitive signal snapshot (cash, kills, explored%, …) |
| `minimap_png` | str or null, relative path to the per-turn PNG |
| `done` | bool, engine `done` flag |
| `terminal` | dict, present ONLY on the final line; see below |
The `terminal` block has:
```json
{
  "outcome": "win|loss|draw",
  "final_obs": { ... },
  "wall_clock_seconds": 12.345,
  "total_tokens_in": 78901,
  "total_tokens_out": 1234,
  "manifest": { /* scoring/score metadata */ }
}
```
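
Before analysis it can help to sanity-check a cell against the field list above. This is a rough sketch rather than the schema in `openra_bench/full_playback.py`: `check_cell` is a hypothetical helper, and it assumes every listed key is present on every line (with `null` where a value does not apply).

```python
REQUIRED_TURN_FIELDS = {
    "turn", "tick", "interrupt", "obs", "briefing", "system_prompt",
    "model_request", "model_response", "commands_issued",
    "engine_warnings", "signals", "minimap_png", "done",
}
REQUIRED_TERMINAL_FIELDS = {
    "outcome", "final_obs", "wall_clock_seconds",
    "total_tokens_in", "total_tokens_out", "manifest",
}


def check_cell(lines):
    """Return a list of problems found in one cell's records (empty if none)."""
    problems = []
    for i, rec in enumerate(lines, start=1):
        missing = REQUIRED_TURN_FIELDS - set(rec)
        if missing:
            problems.append(f"line {i}: missing fields {sorted(missing)}")
        # `terminal` should be present on the final line and nowhere else.
        if ("terminal" in rec) != (i == len(lines)):
            problems.append(f"line {i}: unexpected presence/absence of 'terminal'")
    if lines and "terminal" in lines[-1]:
        missing = REQUIRED_TERMINAL_FIELDS - set(lines[-1]["terminal"])
        if missing:
            problems.append(f"terminal: missing fields {sorted(missing)}")
    return problems
```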
## Invocation
```bash
python3 scripts/collect_eval_data.py \
  --models Qwen/Qwen3.5-9B,Qwen/Qwen3.6-Plus,google/gemma-4-31B-it,moonshotai/Kimi-K2.6 \
  --packs all \
  --levels easy,medium,hard \
  --seeds 1,2,3,4 \
  --fog-modes vision \
  --run-label paper-collection-v1 \
  --output-dir data/runs/paper-collection-v1 \
  --parallel-cells 4 \
  --resume
```
* `--packs` accepts `all`, a comma list, `@file.txt` (one pack id per
line), or a directory of `*.yaml`.
* `--fog-modes` accepts any subset of
`structured,structured-clear,vision,vision-clear,image,image-clear`.
* `--parallel-cells` controls subprocess concurrency. Each cell is an
isolated `python -m openra_bench.run_eval` invocation so a crash in
one cell never aborts the rest of the run.
Required env var (set in your shell / `.env`): `TOGETHER_API_KEY`.
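
The set of cells a given invocation covers is the cross product of the five axes. The sketch below illustrates that; `enumerate_cells` and the `pack_000`-style ids are made up for the example, and the ordering the real script uses is not specified here.

```python
from itertools import product


def enumerate_cells(models, packs, levels, seeds, fog_modes):
    """Yield one (model, pack, level, seed, fog_mode) tuple per cell."""
    yield from product(models, packs, levels, seeds, fog_modes)


models = [
    "Qwen/Qwen3.5-9B",
    "Qwen/Qwen3.6-Plus",
    "google/gemma-4-31B-it",
    "moonshotai/Kimi-K2.6",
]
packs = [f"pack_{i:03d}" for i in range(10)]  # hypothetical pack ids
cells = list(
    enumerate_cells(models, packs, ["easy", "medium", "hard"], [1, 2, 3, 4], ["vision"])
)
print(len(cells))  # 4 models * 10 packs * 3 levels * 4 seeds * 1 fog mode = 480
```
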
## Cost estimates
`--cost-estimate` prints per-model and total token / USD estimates
WITHOUT spawning any subprocesses. The estimator uses average turns
and tokens-per-turn from the May 2026 pilot runs
(`playback/pilot_perception/`, `playback/pilot_handoff/`):
* `avg_turns_per_cell = 18` (max_turns is typically 36; mean is ~half)
* `avg_prompt_tokens_per_turn = 4500` (briefing + image + codex,
with the 16-turn sliding window)
* `avg_completion_tokens_per_turn = 250` (a tool call + brief
reasoning)
Pricing snapshot (USD per 1M tokens, May 2026):
| model | in / M | out / M |
| ---------------------- | ------ | ------- |
| Qwen/Qwen3.5-9B | $0.20 | $0.20 |
| Qwen/Qwen3.6-Plus | $0.50 | $1.50 |
| qwen/qwen3.6-flash | $0.18 | $0.18 |
| google/gemma-4-31B-it | $0.25 | $0.25 |
| moonshotai/Kimi-K2.6 | $0.60 | $2.50 |
Update `_TOGETHER_PRICES` at the top of `scripts/collect_eval_data.py`
when Together's published rates change.
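
The arithmetic behind a per-cell estimate can be reproduced from the averages and the pricing snapshot above. The sketch below is an approximation for cross-checking, not the script's estimator: the constant and function names are hypothetical, the layout of `_TOGETHER_PRICES` is not shown in this document, and the script may round or add terms.

```python
AVG_TURNS_PER_CELL = 18
AVG_PROMPT_TOKENS_PER_TURN = 4500
AVG_COMPLETION_TOKENS_PER_TURN = 250

# USD per 1M tokens as (input, output), copied from the pricing table above.
PRICES = {
    "Qwen/Qwen3.5-9B": (0.20, 0.20),
    "Qwen/Qwen3.6-Plus": (0.50, 1.50),
    "qwen/qwen3.6-flash": (0.18, 0.18),
    "google/gemma-4-31B-it": (0.25, 0.25),
    "moonshotai/Kimi-K2.6": (0.60, 2.50),
}


def estimate_cell_cost(model):
    """Estimated USD for one cell of `model` under the pilot averages."""
    in_price, out_price = PRICES[model]
    tokens_in = AVG_TURNS_PER_CELL * AVG_PROMPT_TOKENS_PER_TURN
    tokens_out = AVG_TURNS_PER_CELL * AVG_COMPLETION_TOKENS_PER_TURN
    return tokens_in / 1e6 * in_price + tokens_out / 1e6 * out_price


def estimate_run_cost(models, cells_per_model):
    """Map each model to its estimated USD for `cells_per_model` cells."""
    return {m: estimate_cell_cost(m) * cells_per_model for m in models}
```

For example, `sum(estimate_run_cost(models, 10).values())` gives a rough total for a 10-cell-per-model smoke run over whichever `models` list is passed in.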
### Common run profiles (cost estimates)
| profile | cells / model | total cells | est. USD (4 models) |
| ----------------- | ------------- | ----------- | ------------------- |
| **smoke**: 10 packs × 1 level × 1 seed × 1 fog | 10 | 40 | ~$1 |
| **mini**: 50 packs × 3 levels × 1 seed × 1 fog | 150 | 600 | ~$30 |
| **medium**: 50 packs × 3 levels × 4 seeds × 1 fog | 600 | 2400 | ~$120 |
| **full**: 200 packs × 3 levels × 4 seeds × 1 fog | 2400 | 9600 | ~$480 |
| **perception sweep**: full × 6 fog modes | 14400 | 57600 | ~$2900 |
Always run `--cost-estimate` before the real thing.
## Resume / re-run a failed cell
`--resume` scans the output dir and skips any cell whose JSONL is
complete (its last line has a `terminal` field). Cells that crashed
mid-run leave a `<stem>.jsonl.partial` behind for forensics; the
final `<stem>.jsonl` is NOT created, so resume correctly retries them.
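
The completeness rule described above (final file exists and its last line carries `terminal`) can be checked by hand when auditing a run. This is a minimal sketch of that rule; `is_cell_complete` is a hypothetical helper, not a function exported by the script.

```python
import json
from pathlib import Path


def is_cell_complete(jsonl_path):
    """True if the file exists and its last non-empty line has a 'terminal' key."""
    path = Path(jsonl_path)
    if not path.is_file():
        return False
    last = None
    with path.open() as fh:
        for line in fh:
            if line.strip():
                last = line  # remember the most recent non-empty line
    if last is None:
        return False  # empty file
    try:
        record = json.loads(last)
    except json.JSONDecodeError:
        return False  # truncated or corrupt final line
    return isinstance(record, dict) and "terminal" in record
```
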
To force a re-run of one cell, delete its `<stem>.jsonl` (and the
sibling PNG dir if you want a fresh image series). The next
`--resume` invocation will re-spawn it.
For diagnostics, every cell's stdout+stderr is captured to
`<output-dir>/.logs/<cell-id>.log`. Tail one to see exactly what
`python -m openra_bench.run_eval` printed.
## Loading the data for paper analysis
```python
import json
from pathlib import Path


def iter_cells(run_dir):
    """Yield (path, lines) per cell; lines is a list[dict], one per turn."""
    for p in sorted(Path(run_dir).glob("**/*.jsonl")):
        # Skip run-level metadata and any leftover partial files.
        if p.name.startswith("_") or p.name.endswith(".partial"):
            continue
        with open(p) as fh:
            lines = [json.loads(l) for l in fh if l.strip()]
        if lines:  # skip empty files so lines[-1] below is always valid
            yield p, lines


# Example rates in USD per 1M tokens (the Qwen/Qwen3.6-Plus row of the
# pricing snapshot); substitute the rates for the model being analysed.
PRICE_IN, PRICE_OUT = 0.5, 1.5

for path, lines in iter_cells("data/runs/paper-collection-v1"):
    term = lines[-1].get("terminal", {})
    outcome = term.get("outcome", "?")
    n_turns = len(lines)
    cost = (
        term.get("total_tokens_in", 0) / 1e6 * PRICE_IN
        + term.get("total_tokens_out", 0) / 1e6 * PRICE_OUT
    )
    print(f"{path.stem:<60} {outcome:<5} turns={n_turns:>3} ~${cost:.3f}")
```
The viewer at `scripts/view_playback.py` transparently understands
the audit JSONL format alongside the legacy `seed<N>/` dirs — point
it at `data/runs/<run-label>` and it picks up both shapes.
## Caveats / known limits
* `--repeats > 1` currently shares the JSONL stem, so every repeat of a
cell maps to the same output file; for paper-grade collection keep it
at 1 (one cell == one deterministic seed). The cell-level reliability
metric (pass^k) belongs in a separate sweep with distinct `--seeds`.
* `--full-playback` runs ALONGSIDE the legacy `Playback` — pass both
`--playback` and `--full-playback` to `run_eval` if you want the
human-readable viewer files AND the audit JSONL. The collector
script only emits the audit format (the playback dir is what the
viewer reads natively).
* Engine warnings reflect the `info["warnings"]` list emitted by the
Rust env at each step; the bench does NOT attach a model-level
warning channel (the model is judged only by its commands).
* `model_response.raw` carries the entire provider response JSON,
which on Together's side includes a `usage` block; downstream
paper analysis should pull token counts from there rather than
trusting any aggregate, because the per-turn breakdown is the
authoritative source.
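
Following the last caveat, per-turn token counts can be summed directly from the stored responses. The sketch below assumes the provider's `usage` block uses the OpenAI-compatible keys `prompt_tokens` and `completion_tokens` and that `raw` is stored as a dict; both are assumptions to verify against real records, and `turn_usage` / `sum_usage` are hypothetical helpers.

```python
def turn_usage(rec):
    """Return (prompt_tokens, completion_tokens) for one turn record."""
    resp = rec.get("model_response") or {}
    raw = resp.get("raw")
    # Prefer the usage block inside the raw provider response; fall back to
    # the top-level `usage` field of model_response if raw has none.
    usage = (raw.get("usage") if isinstance(raw, dict) else None) or resp.get("usage") or {}
    return usage.get("prompt_tokens") or 0, usage.get("completion_tokens") or 0


def sum_usage(lines):
    """Sum per-turn usage over all records of one cell."""
    tokens_in = tokens_out = 0
    for rec in lines:
        t_in, t_out = turn_usage(rec)
        tokens_in += t_in
        tokens_out += t_out
    return tokens_in, tokens_out
```

Comparing `sum_usage(lines)` with `total_tokens_in` / `total_tokens_out` from the `terminal` block is a quick way to spot turns whose usage was not recorded.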