collect_eval_data.py — Phase 4 paper-collection driver

Orchestrates per-cell audit-format eval runs across the Together AI model roster, producing one JSONL file (plus a directory of minimap PNGs) per (model, pack, level, seed, fog_mode) cell. It is designed so that a multi-hour, multi-model collection can crash and resume without losing data, and so that everything needed for downstream paper analysis is captured at the source: full observations (including the _raw engine dict and the spatial tensor), the literal HTTP request and response, engine warnings, per-turn signals, and a terminal block with outcome, wall-clock time, and token totals.

Data layout

<output-dir>/
  _invocation.json                    # the CLI args + cost estimate
  _summary.json                       # written on completion
  .logs/                              # per-cell stdout+stderr
    <model_safe>__<pack>__<level>__seed<N>__<fog>.log
    <model_safe>__<pack>__<level>__seed<N>__<fog>.stats.json
  <YYYYMMDD-HHMMSS>__<model_safe>/    # one dir per (timestamp, model)
    <pack>__<level>__seed<N>__<fog>.jsonl
    <pack>__<level>__seed<N>__<fog>/
      turn_001.png
      turn_002.png
      ...
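
The per-cell file names above encode the cell coordinates, so they can be recovered from a path alone. A minimal sketch, assuming the stem is exactly `<pack>__<level>__seed<N>__<fog>`; `CellId` and `parse_cell_stem` are hypothetical helpers, not part of the script:

```python
from pathlib import Path
from typing import NamedTuple


class CellId(NamedTuple):
    pack: str
    level: str
    seed: int
    fog: str


def parse_cell_stem(path):
    """Split '<pack>__<level>__seed<N>__<fog>' into its four parts."""
    stem = Path(path).stem  # drops the directory and the '.jsonl' suffix
    # Split from the right so a pack id that itself contains '__' stays intact.
    parts = stem.rsplit("__", 3)
    if len(parts) != 4 or not parts[2].startswith("seed"):
        raise ValueError(f"not a cell stem: {stem!r}")
    pack, level, seed_token, fog = parts
    return CellId(pack, level, int(seed_token[len("seed"):]), fog)
```

The same function works on the per-turn PNG directory, since it shares the stem with its JSONL file.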

The JSONL is one line per turn; each line has these fields (full schema in openra_bench/full_playback.py):

| field | description |
| --- | --- |
| turn | int, 1-based turn index |
| tick | int, engine game tick at end of turn |
| interrupt | str or null; engine signal name that fired this turn |
| obs | dict, full RustObsAdapter.render_state() (includes _raw and spatial) |
| briefing | str, exactly the text the model received |
| system_prompt | str, first turn only; subsequent turns = null |
| model_request | dict, literal {url, body} posted to the provider |
| model_response | dict, literal {raw, text, tool_calls, reasoning, usage, finish_reason} |
| commands_issued | list[str], Command.repr() per command parsed |
| engine_warnings | list[str], from the Rust env's info["warnings"] |
| signals | dict, primitive signal snapshot (cash, kills, explored%, …) |
| minimap_png | str or null, relative path to the per-turn PNG |
| done | bool, engine done flag |
| terminal | present ONLY on the final line; see below |
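
A quick structural check of one cell's records can catch truncated or malformed files before analysis. This is a sketch under two assumptions not confirmed by the schema: every field in the table is present as a key on every line (null when empty), and `turn` equals the 1-based line number. `check_turn_records` is a hypothetical helper.

```python
REQUIRED_TURN_FIELDS = {
    "turn", "tick", "interrupt", "obs", "briefing", "system_prompt",
    "model_request", "model_response", "commands_issued",
    "engine_warnings", "signals", "minimap_png", "done",
}


def check_turn_records(records):
    """Return a list of problem strings; an empty list means the cell looks sane."""
    problems = []
    for i, rec in enumerate(records, start=1):
        missing = REQUIRED_TURN_FIELDS - rec.keys()
        if missing:
            problems.append(f"line {i}: missing fields {sorted(missing)}")
        # Assumption: one line per turn, numbered from 1 without gaps.
        if rec.get("turn") != i:
            problems.append(f"line {i}: turn={rec.get('turn')!r}, expected {i}")
        # 'terminal' must appear on the final line and nowhere else.
        is_last = i == len(records)
        if ("terminal" in rec) != is_last:
            where = "missing on final line" if is_last else "present before final line"
            problems.append(f"line {i}: terminal {where}")
    return problems
```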

The terminal block has:

{
  "outcome": "win|loss|draw",
  "final_obs": { ... },
  "wall_clock_seconds": 12.345,
  "total_tokens_in": 78901,
  "total_tokens_out": 1234,
  "manifest": { /* scoring/score metadata */ }
}

Invocation

python3 scripts/collect_eval_data.py \
  --models Qwen/Qwen3.5-9B,Qwen/Qwen3.6-Plus,google/gemma-4-31B-it,moonshotai/Kimi-K2.6 \
  --packs all \
  --levels easy,medium,hard \
  --seeds 1,2,3,4 \
  --fog-modes vision \
  --run-label paper-collection-v1 \
  --output-dir data/runs/paper-collection-v1 \
  --parallel-cells 4 \
  --resume
  • --packs accepts all, a comma list, @file.txt (one pack id per line), or a directory of *.yaml.
  • --fog-modes accepts any subset of structured,structured-clear,vision,vision-clear,image,image-clear.
  • --parallel-cells controls subprocess concurrency. Each cell is an isolated python -m openra_bench.run_eval invocation so a crash in one cell never aborts the rest of the run.
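
The isolation property described for --parallel-cells can be illustrated with a generic pattern: each cell is a separate child process, and a non-zero exit is recorded rather than raised. This is a sketch of the pattern only, not the script's actual code; the placeholder commands stand in for the real run_eval invocations, whose arguments are not shown here.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_cell(cmd):
    """Run one cell as its own process; a failure is returned, never raised."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def run_cells(commands, parallel_cells=4):
    """Run all commands with at most `parallel_cells` children alive at once."""
    with ThreadPoolExecutor(max_workers=parallel_cells) as pool:
        # pool.map preserves input order, so results line up with commands.
        return list(pool.map(run_cell, commands))


if __name__ == "__main__":
    # Placeholder commands (hypothetical): the middle one simulates a crashed cell.
    commands = [
        [sys.executable, "-c", "print('cell ok')"],
        [sys.executable, "-c", "raise SystemExit(3)"],
        [sys.executable, "-c", "print('cell ok too')"],
    ]
    for cmd, (code, _output) in zip(commands, run_cells(commands, parallel_cells=2)):
        print(code, cmd[-1])
```

Threads are sufficient here because each worker only waits on a child process; the heavy work happens in the children.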

Required env var (set in your shell / .env): TOGETHER_API_KEY.

Cost estimates

--cost-estimate prints per-model and total token / USD estimates WITHOUT spawning any subprocesses. The estimator uses average turns and tokens-per-turn from the May 2026 pilot runs (playback/pilot_perception/, playback/pilot_handoff/):

  • avg_turns_per_cell = 18 (max_turns is typically 36; mean is ~half)
  • avg_prompt_tokens_per_turn = 4500 (briefing + image + codex, with the 16-turn sliding window)
  • avg_completion_tokens_per_turn = 250 (a tool call + brief reasoning)

Pricing snapshot (USD per 1M tokens, May 2026):

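| model | input (USD / 1M tokens) | output (USD / 1M tokens) |
| --- | --- | --- |
| Qwen/Qwen3.5-9B | $0.20 | $0.20 |
| Qwen/Qwen3.6-Plus | $0.50 | $1.50 |
| qwen/qwen3.6-flash | $0.18 | $0.18 |
| google/gemma-4-31B-it | $0.25 | $0.25 |
| moonshotai/Kimi-K2.6 | $0.60 | $2.50 |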

Update _TOGETHER_PRICES at the top of scripts/collect_eval_data.py when Together's published rates change.
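
The per-cell arithmetic implied by the averages and the pricing snapshot above can be sketched as follows. The dict shape and function names are assumptions for illustration; the real estimator in scripts/collect_eval_data.py may apply rounding or margins not documented here, so its printed figures can differ from this back-of-envelope calculation.

```python
# Averages quoted from the pilot runs.
AVG_TURNS_PER_CELL = 18
AVG_PROMPT_TOKENS_PER_TURN = 4500
AVG_COMPLETION_TOKENS_PER_TURN = 250

# (input, output) USD per 1M tokens, copied from the pricing snapshot.
# Hypothetical layout; the script's _TOGETHER_PRICES may be structured differently.
PRICES_USD_PER_M = {
    "Qwen/Qwen3.5-9B": (0.20, 0.20),
    "Qwen/Qwen3.6-Plus": (0.50, 1.50),
    "qwen/qwen3.6-flash": (0.18, 0.18),
    "google/gemma-4-31B-it": (0.25, 0.25),
    "moonshotai/Kimi-K2.6": (0.60, 2.50),
}


def estimate_cell_usd(model):
    """Expected USD for one cell of `model` under the pilot averages."""
    price_in, price_out = PRICES_USD_PER_M[model]
    tokens_in = AVG_TURNS_PER_CELL * AVG_PROMPT_TOKENS_PER_TURN
    tokens_out = AVG_TURNS_PER_CELL * AVG_COMPLETION_TOKENS_PER_TURN
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out


def estimate_run_usd(models, cells_per_model):
    """Expected USD for a run with the same cell grid for every model."""
    return sum(estimate_cell_usd(m) * cells_per_model for m in models)
```

Under these averages a cell consumes 81,000 prompt tokens and 4,500 completion tokens, so prompt pricing dominates the estimate for every model in the table.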

Common run profiles (cost estimates)

| profile | grid | cells / model | total cells (4 models) | est. USD (4 models) |
| --- | --- | --- | --- | --- |
| smoke | 10 packs × 1 level × 1 seed × 1 fog | 10 | 40 | ~$1 |
| mini | 50 packs × 3 levels × 1 seed × 1 fog | 150 | 600 | ~$30 |
| medium | 50 packs × 3 levels × 4 seeds × 1 fog | 600 | 2400 | ~$120 |
| full | 200 packs × 3 levels × 4 seeds × 1 fog | 2400 | 9600 | ~$480 |
| perception sweep | full × 6 fog modes | 14400 | 57600 | ~$2900 |
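
The cell counts in these profiles are the size of the Cartesian product over the grid axes. A minimal sketch, with `enumerate_cells` as a hypothetical helper and placeholder pack ids:

```python
from itertools import product


def enumerate_cells(models, packs, levels, seeds, fog_modes):
    """Return every (model, pack, level, seed, fog_mode) cell in the grid."""
    return list(product(models, packs, levels, seeds, fog_modes))


if __name__ == "__main__":
    models = [
        "Qwen/Qwen3.5-9B",
        "Qwen/Qwen3.6-Plus",
        "google/gemma-4-31B-it",
        "moonshotai/Kimi-K2.6",
    ]
    packs = [f"pack_{i:03d}" for i in range(200)]  # placeholder ids, not real packs
    cells = enumerate_cells(models, packs, ["easy", "medium", "hard"], [1, 2, 3, 4], ["vision"])
    print(len(cells) // len(models), len(cells))  # cells per model, total cells
```

For the "full" profile this gives 200 × 3 × 4 × 1 = 2400 cells per model and 9600 across four models.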

Always run --cost-estimate before the real thing.

Resume / re-run a failed cell

--resume scans the output dir and skips any cell whose JSONL is complete (its last line has a terminal: field). Cells that crashed mid-run leave a <stem>.jsonl.partial behind for forensics; the final <stem>.jsonl is NOT created, so resume correctly retries them.
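
The completeness rule can be checked by hand with a few lines. This is a sketch of the rule as stated above, not the script's own implementation; `is_cell_complete` is a hypothetical helper.

```python
import json
from pathlib import Path


def is_cell_complete(jsonl_path):
    """True if the file exists and its last non-empty line carries 'terminal'."""
    path = Path(jsonl_path)
    if not path.is_file():
        return False
    last = None
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                last = line
    if last is None:  # empty file
        return False
    try:
        record = json.loads(last)
    except json.JSONDecodeError:  # truncated final line
        return False
    return isinstance(record, dict) and "terminal" in record
```

Running it over `Path(output_dir).glob("**/*.jsonl")` lists which cells a --resume invocation would be expected to skip.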

To force a re-run of one cell, delete its <stem>.jsonl (and the sibling PNG dir if you want a fresh image series). The next --resume invocation will re-spawn it.

For diagnostics, every cell's stdout+stderr is captured to <output-dir>/.logs/<cell-id>.log. Tail one to see exactly what python -m openra_bench.run_eval printed.

Loading the data for paper analysis

import json
from pathlib import Path

def iter_cells(run_dir):
    """Yield (path, lines) per cell; lines is a list[dict], one per turn."""
    for p in sorted(Path(run_dir).glob("**/*.jsonl")):
        if p.name.startswith("_") or p.name.endswith(".partial"):
            continue
        with open(p, encoding="utf-8") as fh:
            lines = [json.loads(l) for l in fh if l.strip()]
        if lines:  # skip empty files so lines[-1] below is always valid
            yield p, lines

for path, lines in iter_cells("data/runs/paper-collection-v1"):
    term = lines[-1].get("terminal") or {}
    outcome = term.get("outcome", "?")
    n_turns = len(lines)
    # 0.5 / 1.5 are the Qwen/Qwen3.6-Plus rates (USD per 1M tokens, in / out);
    # substitute the rates of the model that produced this cell.
    cost = (
        term.get("total_tokens_in", 0) / 1e6 * 0.5
        + term.get("total_tokens_out", 0) / 1e6 * 1.5
    )
    print(f"{path.stem:<60} {outcome:<5} turns={n_turns:>3} ~${cost:.3f}")

The viewer at scripts/view_playback.py transparently understands the audit JSONL format alongside the legacy seed<N>/ dirs — point it at data/runs/<run-label> and it picks up both shapes.

Caveats / known limits

  • --repeats > 1 currently shares the JSONL stem; for paper-grade collection keep it at 1 (one cell == one deterministic seed). The cell-level reliability metric (pass^k) belongs in a separate sweep with distinct --seeds.
  • --full-playback runs ALONGSIDE the legacy Playback — pass both --playback and --full-playback to run_eval if you want the human-readable viewer files AND the audit JSONL. The collector script only emits the audit format (the playback dir is what the viewer reads natively).
  • Engine warnings reflect the info["warnings"] list emitted by the Rust env at each step; the bench does NOT attach a model-level warning channel (the model is judged only by its commands).
  • model_response.raw carries the entire provider response JSON, which on Together's side includes a usage block; downstream paper analysis should pull token counts from there rather than trusting any aggregate, because the per-turn breakdown is the authoritative source.
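
Summing token counts from the per-turn records, as the last caveat recommends, might look like the sketch below. It assumes the usage block follows the OpenAI-compatible shape with `prompt_tokens` and `completion_tokens` keys, and that `raw` is either a dict or a JSON string; verify both against a real record before relying on it. `sum_turn_usage` is a hypothetical helper.

```python
import json


def _usage_of(record):
    """Return the usage dict for one turn, preferring model_response.raw."""
    response = record.get("model_response") or {}
    raw = response.get("raw")
    if isinstance(raw, str):  # assumption: raw may be stored as a JSON string
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            raw = None
    if isinstance(raw, dict) and isinstance(raw.get("usage"), dict):
        return raw["usage"]
    return response.get("usage") or {}


def sum_turn_usage(records):
    """Return (tokens_in, tokens_out) summed over all turns of one cell."""
    tokens_in = tokens_out = 0
    for rec in records:
        usage = _usage_of(rec)
        tokens_in += usage.get("prompt_tokens") or 0
        tokens_out += usage.get("completion_tokens") or 0
    return tokens_in, tokens_out
```

Comparing the result against `total_tokens_in` and `total_tokens_out` in the terminal block is a cheap consistency check for each cell.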