
yxc20098

Quality drive: schema fix, 5 new/revised packs, 4 engine tests, scenario audit

6d71d3b about 1 month ago


`collect_eval_data.py` — Phase 4 paper-collection driver

Orchestrates per-cell audit-format eval runs across the Together AI model roster, producing one JSONL file (plus a directory of minimap PNGs) per (model, pack, level, seed, fog_mode) cell. It is designed so that a multi-hour, multi-model collection can resume after a crash without losing data, and so that everything needed for downstream paper analysis is captured at the source: full observations (including the `_raw` engine dict and the `spatial` tensor), the literal HTTP request and response, engine warnings, per-turn signals, and a `terminal` block with outcome, wall-clock time, and token totals.

Data layout

<output-dir>/
  _invocation.json                    # the CLI args + cost estimate
  _summary.json                       # written on completion
  .logs/                              # per-cell stdout+stderr
    <model_safe>__<pack>__<level>__seed<N>__<fog>.log
    <model_safe>__<pack>__<level>__seed<N>__<fog>.stats.json
  <YYYYMMDD-HHMMSS>__<model_safe>/    # one dir per (timestamp, model)
    <pack>__<level>__seed<N>__<fog>.jsonl
    <pack>__<level>__seed<N>__<fog>/
      turn_001.png
      turn_002.png
      ...
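
The naming scheme above makes each cell's coordinates recoverable from its path alone. A minimal sketch of that parsing follows; `CellId`, `parse_cell_path`, and the example timestamp, model directory, and pack id are all hypothetical names chosen for illustration, and the exact form of `<model_safe>` is not specified here.

```python
from pathlib import Path
from typing import NamedTuple

class CellId(NamedTuple):
    model_dir: str  # "<YYYYMMDD-HHMMSS>__<model_safe>", kept unsplit
    pack: str
    level: str
    seed: int
    fog: str

def parse_cell_path(jsonl_path):
    """Split '<model dir>/<pack>__<level>__seed<N>__<fog>.jsonl' into its parts."""
    p = Path(jsonl_path)
    # Split from the right so a pack id that itself contains "__" stays intact.
    pack, level, seed_part, fog = p.stem.rsplit("__", 3)
    if not seed_part.startswith("seed"):
        raise ValueError(f"unexpected seed component in {p.name!r}")
    return CellId(p.parent.name, pack, level, int(seed_part[len("seed"):]), fog)

# Hypothetical path; real directory and pack names will differ.
cell = parse_cell_path(
    "20260501-120000__Qwen_Qwen3.5-9B/rush_defense__easy__seed3__vision.jsonl"
)
print(cell)
```

The per-turn PNG directory shares the same stem, so the same function applies to it once the `.jsonl` suffix is dropped.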

The JSONL is one line per turn; each line has these fields (full schema in openra_bench/full_playback.py):

| field | description |
| --- | --- |
| `turn` | int, 1-based turn index |
| `tick` | int, engine game tick at end of turn |
| `interrupt` | str or null; engine signal name that fired this turn |
| `obs` | dict, full `RustObsAdapter.render_state()` (includes `_raw` and `spatial`) |
| `briefing` | str, exactly the text the model received |
| `system_prompt` | str, first turn only; subsequent turns = null |
| `model_request` | dict, literal `{url, body}` posted to the provider |
| `model_response` | dict, literal `{raw, text, tool_calls, reasoning, usage, finish_reason}` |
| `commands_issued` | list[str], `Command.repr()` per command parsed |
| `engine_warnings` | list[str], from the Rust env's `info["warnings"]` |
| `signals` | dict, primitive signal snapshot (cash, kills, explored%, …) |
| `minimap_png` | str or null, relative path to the per-turn PNG |
| `done` | bool, engine `done` flag |
| `terminal` | present ONLY on the final line; see below |

The terminal block has:

{
  "outcome": "win|loss|draw",
  "final_obs": { ... },
  "wall_clock_seconds": 12.345,
  "total_tokens_in": 78901,
  "total_tokens_out": 1234,
  "manifest": { /* scoring/score metadata */ }
}
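
A quick structural check can catch malformed cells before analysis. The sketch below assumes every per-turn field in the table is present as a key on every line (with null where not applicable), which the table suggests but does not guarantee; `REQUIRED_FIELDS` and `check_cell` are illustrative names, not part of the bench.

```python
# Per-turn fields listed in the table above; "terminal" is handled separately.
REQUIRED_FIELDS = {
    "turn", "tick", "interrupt", "obs", "briefing", "system_prompt",
    "model_request", "model_response", "commands_issued",
    "engine_warnings", "signals", "minimap_png", "done",
}

def check_cell(lines):
    """Return a list of problems found in one cell's parsed JSONL lines."""
    problems = []
    last_index = len(lines) - 1
    for i, line in enumerate(lines):
        missing = REQUIRED_FIELDS.difference(line)
        if missing:
            problems.append(f"line {i + 1}: missing {sorted(missing)}")
        if "terminal" in line and i != last_index:
            problems.append(f"line {i + 1}: terminal block before the final line")
    if not lines or "terminal" not in lines[-1]:
        problems.append("no terminal block on the final line")
    return problems
```

An empty return value means the cell has the expected shape; anything else is worth inspecting before the cell enters an aggregate.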

Invocation

python3 scripts/collect_eval_data.py \
  --models Qwen/Qwen3.5-9B,Qwen/Qwen3.6-Plus,google/gemma-4-31B-it,moonshotai/Kimi-K2.6 \
  --packs all \
  --levels easy,medium,hard \
  --seeds 1,2,3,4 \
  --fog-modes vision \
  --run-label paper-collection-v1 \
  --output-dir data/runs/paper-collection-v1 \
  --parallel-cells 4 \
  --resume

- `--packs` accepts `all`, a comma-separated list, `@file.txt` (one pack id per line), or a directory of `*.yaml` files.
- `--fog-modes` accepts any subset of `structured`, `structured-clear`, `vision`, `vision-clear`, `image`, `image-clear`.
- `--parallel-cells` controls subprocess concurrency. Each cell is an isolated `python -m openra_bench.run_eval` invocation, so a crash in one cell never aborts the rest of the run.
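
The isolation property described for `--parallel-cells` can be illustrated with a small sketch. This is not the collector's code: `run_cell` and `run_all` are hypothetical helpers, and the commands are stand-ins, since the real arguments passed to `python -m openra_bench.run_eval` are not shown in this document.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_cell(cell_id, argv):
    """Run one cell in its own process; a non-zero exit is reported, not raised."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return cell_id, proc.returncode

def run_all(cells, parallel_cells=4):
    """cells maps cell_id -> argv. Returns a dict of cell_id -> exit code."""
    with ThreadPoolExecutor(max_workers=parallel_cells) as pool:
        futures = [pool.submit(run_cell, cid, argv) for cid, argv in cells.items()]
        return dict(f.result() for f in futures)

# Stand-in commands only; one of them deliberately exits with an error.
cells = {
    "ok_cell": [sys.executable, "-c", "print('done')"],
    "crashing_cell": [sys.executable, "-c", "raise SystemExit(3)"],
    "another_ok_cell": [sys.executable, "-c", "print('done')"],
}
results = run_all(cells, parallel_cells=2)
print(results)
```

Because each cell is a separate process, the failing command only affects its own exit code, and the remaining cells still run to completion.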

Required env var (set in your shell / .env): TOGETHER_API_KEY.

Cost estimates

--cost-estimate prints per-model and total token / USD estimates WITHOUT spawning any subprocesses. The estimator uses average turns and tokens-per-turn from the May 2026 pilot runs (playback/pilot_perception/, playback/pilot_handoff/):

- `avg_turns_per_cell = 18` (`max_turns` is typically 36; the mean is about half)
- `avg_prompt_tokens_per_turn = 4500` (briefing + image + codex, with the 16-turn sliding window)
- `avg_completion_tokens_per_turn = 250` (a tool call + brief reasoning)

Pricing snapshot (USD per 1M tokens, May 2026):

| model | in / M | out / M |
| --- | --- | --- |
| Qwen/Qwen3.5-9B | $0.20 | $0.20 |
| Qwen/Qwen3.6-Plus | $0.50 | $1.50 |
| qwen/qwen3.6-flash | $0.18 | $0.18 |
| google/gemma-4-31B-it | $0.25 | $0.25 |
| moonshotai/Kimi-K2.6 | $0.60 | $2.50 |

Update _TOGETHER_PRICES at the top of scripts/collect_eval_data.py when Together's published rates change.

Common run profiles (cost estimates)

| profile | cells / model | total cells | est. USD (4 models) |
| --- | --- | --- | --- |
| smoke: 10 packs × 1 level × 1 seed × 1 fog | 10 | 40 | ~$1 |
| mini: 50 packs × 3 levels × 1 seed × 1 fog | 150 | 600 | ~$30 |
| medium: 50 packs × 3 levels × 4 seeds × 1 fog | 600 | 2400 | ~$120 |
| full: 200 packs × 3 levels × 4 seeds × 1 fog | 2400 | 9600 | ~$480 |
| perception sweep: full × 6 fog modes | 14400 | 57600 | ~$2900 |

Always run --cost-estimate before the real thing.
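
For a rough sanity check, the per-cell arithmetic can be worked by hand from the averages and the pricing snapshot listed in this section. The sketch below is a simplification under those stated averages only; `cost_per_cell` and `estimate_run` are illustrative names, and the result is not guaranteed to match the script's own estimator or the rounded figures in the profile table.

```python
AVG_TURNS_PER_CELL = 18
AVG_PROMPT_TOKENS_PER_TURN = 4500
AVG_COMPLETION_TOKENS_PER_TURN = 250

# (input, output) USD per 1M tokens, copied from the pricing snapshot.
PRICES = {
    "Qwen/Qwen3.5-9B": (0.20, 0.20),
    "Qwen/Qwen3.6-Plus": (0.50, 1.50),
    "google/gemma-4-31B-it": (0.25, 0.25),
    "moonshotai/Kimi-K2.6": (0.60, 2.50),
}

def cost_per_cell(model):
    """Estimated USD for one cell of the given model."""
    price_in, price_out = PRICES[model]
    tokens_in = AVG_TURNS_PER_CELL * AVG_PROMPT_TOKENS_PER_TURN
    tokens_out = AVG_TURNS_PER_CELL * AVG_COMPLETION_TOKENS_PER_TURN
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

def estimate_run(models, cells_per_model):
    """Estimated USD for a run where every model covers the same cells."""
    return sum(cost_per_cell(m) * cells_per_model for m in models)

# Smoke profile: 10 cells per model across the four models above.
total = estimate_run(list(PRICES), cells_per_model=10)
print(f"smoke profile: ~${total:.2f}")
```

Treat the output as an order-of-magnitude figure; `--cost-estimate` remains the authoritative number for a given invocation.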

Resume / re-run a failed cell

`--resume` scans the output dir and skips any cell whose JSONL is complete, meaning its last line has a `terminal` field. Cells that crashed mid-run leave a `<stem>.jsonl.partial` behind for forensics; the final `<stem>.jsonl` is NOT created, so resume correctly retries them.
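
The completeness rule can be sketched as a small predicate. This is an illustration of the rule as stated, not the collector's actual implementation; `is_complete` is a hypothetical name, and treating an unparseable final line as incomplete is an assumption.

```python
import json
from pathlib import Path

def is_complete(jsonl_path):
    """True if the file exists and its last non-empty line has a 'terminal' field."""
    p = Path(jsonl_path)
    if not p.is_file():
        return False  # never created, e.g. only a .jsonl.partial exists
    last = None
    with open(p, encoding="utf-8") as fh:
        for raw in fh:
            if raw.strip():
                last = raw
    if last is None:
        return False  # empty file
    try:
        record = json.loads(last)
    except json.JSONDecodeError:
        return False  # assumption: a truncated final line counts as incomplete
    return isinstance(record, dict) and "terminal" in record
```

The same predicate is handy after a run for listing which cells still need a retry.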

To force a re-run of one cell, delete its <stem>.jsonl (and the sibling PNG dir if you want a fresh image series). The next --resume invocation will re-spawn it.

For diagnostics, every cell's stdout+stderr is captured to <output-dir>/.logs/<cell-id>.log. Tail one to see exactly what python -m openra_bench.run_eval printed.

Loading the data for paper analysis

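import json
from pathlib import Path

# USD per 1M tokens for Qwen/Qwen3.6-Plus (see the pricing snapshot);
# substitute the rates of the model that actually produced each cell.
PRICE_IN, PRICE_OUT = 0.50, 1.50

def iter_cells(run_dir):
    """Yield (path, lines) per cell; lines is a list[dict], one per turn."""
    # "*.jsonl" never matches the "<stem>.jsonl.partial" files left by crashed cells.
    for p in sorted(Path(run_dir).glob("**/*.jsonl")):
        if p.name.startswith("_"):
            continue
        with open(p, encoding="utf-8") as fh:
            lines = [json.loads(l) for l in fh if l.strip()]
        if lines:  # skip empty files so lines[-1] below is always valid
            yield p, lines

for path, lines in iter_cells("data/runs/paper-collection-v1"):
    term = lines[-1].get("terminal", {})  # present only on the final line
    outcome = term.get("outcome", "?")
    n_turns = len(lines)
    cost = (
        term.get("total_tokens_in", 0) / 1e6 * PRICE_IN
        + term.get("total_tokens_out", 0) / 1e6 * PRICE_OUT
    )
    print(f"{path.stem:<60} {outcome:<5} turns={n_turns:>3} ~${cost:.3f}")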

The viewer at scripts/view_playback.py transparently understands the audit JSONL format alongside the legacy seed<N>/ dirs — point it at data/runs/<run-label> and it picks up both shapes.

Caveats / known limits

- `--repeats` > 1 currently shares the JSONL stem; for paper-grade collection keep it at 1 (one cell == one deterministic seed). The cell-level reliability metric (pass^k) belongs in a separate sweep with distinct `--seeds`.
- `--full-playback` runs ALONGSIDE the legacy Playback: pass both `--playback` and `--full-playback` to `run_eval` if you want the human-readable viewer files AND the audit JSONL. The collector script only emits the audit format (the playback dir is what the viewer reads natively).
- Engine warnings reflect the `info["warnings"]` list emitted by the Rust env at each step; the bench does NOT attach a model-level warning channel (the model is judged only by its commands).
- `model_response.raw` carries the entire provider response JSON, which on Together's side includes a `usage` block; downstream paper analysis should pull token counts from there rather than trusting any aggregate, because the per-turn breakdown is the authoritative source.
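
Summing the per-turn usage might look like the sketch below. It assumes `model_response.raw` is stored as a parsed JSON object and that its `usage` block uses the OpenAI-style keys `prompt_tokens` and `completion_tokens`; both are assumptions to verify against a real line, and `sum_usage` is an illustrative name.

```python
def sum_usage(lines):
    """Sum per-turn token counts from model_response.raw.usage across a cell."""
    tokens_in = tokens_out = 0
    for line in lines:
        # Each level may be missing or null on a failed turn, so fall back to {}.
        raw = (line.get("model_response") or {}).get("raw") or {}
        usage = raw.get("usage") or {}
        # Assumed key names (OpenAI-compatible usage block).
        tokens_in += usage.get("prompt_tokens") or 0
        tokens_out += usage.get("completion_tokens") or 0
    return tokens_in, tokens_out
```

Comparing the result against `total_tokens_in` and `total_tokens_out` in the `terminal` block is a cheap consistency check for each cell.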
