
Per-Scenario Closer-Look Checklist

The reproducible methodology applied to #1 (action-multiunit-coordination) and #2 (action-sequenced-execution). Every remaining active pack gets the same treatment: closer look → fix defects → re-verify → run each difficulty (easy → medium → hard) on the model → inspect the playback → commit/push. Read this fully before touching a pack.

Guiding principle: the benchmark must validly test its stated capability; do NOT compensate for model weakness, and do NOT over-engineer. A model failing a correctly-designed scenario is a good result (real discrimination). Only fix genuine scenario defects.


A. SOLVENCY: can the intended strategy actually win, in budget?

  1. The win predicate must enforce the advertised capability, and only that. The classic defect: the prose claims X but the win condition is satisfiable without doing X.
    • #1: "split the force" but reach_region (≥1 unit) let one touring unit win → switched to units_in_region_gte (≥2 in EACH region).
    • #2: "ordered route" but only the final region was a predicate → a beeline that skipped every waypoint won → added the stateful waypoint_sequence latch (W(k+1) only counts after W(k); skip / out-of-order / idle ⇒ never satisfied).
    • Ask: "what is the laziest play that satisfies this win condition?" If it isn't the intended capability, the predicate is wrong.
  2. Is the optimal/intended play winnable within the tick budget? Estimate the path length; the engine advances ~90 ticks per decision turn (tick ≈ 93 + 90·(turn-1)). The intended strategy must finish comfortably under within_ticks; the inefficient strategy must NOT (that gap is the discrimination).
  3. Coordinate-blind objectives must be solvable. If objective_coords: relative, the model can't cell-count โ€” it needs a feedback loop (an enemy_building_spotted interrupt revealing the marker, a landmark building at each region) and a forgiving enough radius. A "search band" beats a bare compass word.
  4. Map fits the actors. Every actor/region coordinate must be inside the map's playable bounds. Actors at scout-arena coords on a rush-hour map → engine panic. Confirm compiled.map_supported and that base_map resolves to the intended map.
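
To make the "laziest play" question in item 1 concrete, here is a minimal sketch of an ordered-waypoint latch in the spirit of waypoint_sequence. It is not the bench's implementation: the `Region` shape (circular, cell coordinates), the `WaypointLatch` class, and the `run` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]


@dataclass(frozen=True)
class Region:
    """Hypothetical circular region: centre cell plus radius in cells."""
    x: int
    y: int
    radius: int

    def contains(self, cell: Cell) -> bool:
        dx, dy = cell[0] - self.x, cell[1] - self.y
        return dx * dx + dy * dy <= self.radius * self.radius


class WaypointLatch:
    """Latches waypoints strictly in order: W(k+1) only counts after W(k)."""

    def __init__(self, waypoints: Sequence[Region]) -> None:
        self.waypoints: List[Region] = list(waypoints)
        self.next_index = 0  # index of the next waypoint still to be latched

    def observe(self, unit_cells: Sequence[Cell]) -> None:
        """Feed one turn of unit positions; latch at most the next waypoint."""
        if self.next_index >= len(self.waypoints):
            return
        target = self.waypoints[self.next_index]
        if any(target.contains(cell) for cell in unit_cells):
            self.next_index += 1

    @property
    def satisfied(self) -> bool:
        return self.next_index == len(self.waypoints)


def run(route: Sequence[Cell], waypoints: Sequence[Region]) -> bool:
    """Replay a single unit's route (one cell per turn) against the latch."""
    latch = WaypointLatch(waypoints)
    for cell in route:
        latch.observe([cell])
    return latch.satisfied


WAYPOINTS = [Region(5, 5, 1), Region(10, 5, 1), Region(10, 10, 1)]
ORDERED = [(5, 5), (10, 5), (10, 10)]       # intended play
BEELINE = [(10, 10)]                        # laziest play: straight to the end
OUT_OF_ORDER = [(10, 5), (5, 5), (10, 10)]  # visits W2 before W1
```

Under these assumptions only `ORDERED` satisfies the latch; the beeline, the out-of-order route, and an idle (empty) route never do, which is the property a final-region-only predicate fails to provide.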

B. STABILITY: deterministic, no crashes, fail is reachable

  1. Non-win must be a real LOSS, not a draw. Every level needs a fail_condition. Idiom: any_of[ {after_ticks: BUDGET+1}, {not:{units_lost_lte: N}}, {not:{own_units_gte: 1}} ].
  2. Tick/turn alignment (critical, easy to get wrong). A fail after_ticks: K only bites if K is reachable within max_turns: require K ≤ 93 + 90·(max_turns-1). Likewise the episode must be able to reach within_ticks before max_turns ends, or a staller draws instead of losing. Re-derive this for every level after any budget/turn change.
  3. base_map override goes INSIDE overrides: โ€” a Level-level base_map key is silently ignored (it's not a Level field).
  4. Smoke the engine path before a full run: compile the level, _scenario_to_tmp_yaml, RustEnvPool reset(seed) + a few env.step; if the pack enables interrupts, also drive raw_env.step_until_event([...],None,5,[sig]) ~30 steps. This catches panics/oob without burning a model run.
  5. Hard-tier spawn contract: if the pack is in tests/test_hard_tier.py::UPGRADED, hard must keep ≥2 spawn_point groups (seed-varied start). A deliberate exception is allowed only with a stated reason added to NOT_APPLICABLE.
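
The tick/turn alignment rule in item 2 is simple arithmetic, so it can be re-derived mechanically after every budget or turn change. The sketch below uses the approximate relation tick ≈ 93 + 90·(turn-1) quoted above; the function names and the flat argument list are hypothetical and do not reflect how the bench stores a level.

```python
from typing import List

FIRST_DECISION_TICK = 93  # tick of the first decision turn (approximate, from the formula above)
TICKS_PER_TURN = 90       # engine ticks advanced per decision turn (approximate)


def tick_at_turn(turn: int) -> int:
    """Approximate engine tick at a 1-based decision turn."""
    if turn < 1:
        raise ValueError("turn is 1-based")
    return FIRST_DECISION_TICK + TICKS_PER_TURN * (turn - 1)


def check_alignment(within_ticks: int, fail_after_ticks: int, max_turns: int) -> List[str]:
    """Return a list of alignment problems; an empty list means the level is aligned."""
    last_tick = tick_at_turn(max_turns)
    problems: List[str] = []
    if fail_after_ticks > last_tick:
        problems.append(
            f"fail after_ticks={fail_after_ticks} is unreachable: "
            f"last tick at max_turns={max_turns} is ~{last_tick}"
        )
    if within_ticks > last_tick:
        problems.append(
            f"within_ticks={within_ticks} is never reached before "
            f"max_turns={max_turns} ends (~{last_tick}); a staller would draw"
        )
    return problems
```

For example, with max_turns=10 the last decision tick is about 903, so within_ticks=900 with a fail at after_ticks=901 is aligned; cutting max_turns to 5 (last tick about 453) makes both checks report a problem.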

C. CAPABILITY: clean difficulty axis, faithful framing

  1. One new controlled variable per tier so a tier failure attributes to a single capability. easy = the bare skill; medium = +1 axis (a third region / parallel split / contest / attrition); hard = +1 more (relative coords + scouting, larger map, strict budget). Avoid stacking uncontrolled variables.
  2. Keep established idioms (e.g. the single-final-region + [after_ticks, within_ticks] band for "execute, don't stall"; fact+proc key-building destruction for adversarial). Don't invent a new mechanism when one exists, but DO add a reusable predicate / engine feature when the capability genuinely can't be expressed (units_in_region_gte, waypoint_sequence, enemy_building_spotted interrupt, scripted enemy.bot).
  3. Title/description are plain and self-explanatory; the objective brief the model sees (game_knowledge.objective_brief) must state the exact machine win/fail in plain language with no degenerate lines.
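
One way to keep item 1 honest is to write down the controlled variables of each tier and check that every tier keeps the previous tier's axes and adds exactly one. This is only a sketch: the axis labels and the `stacked_tiers` helper are hypothetical, and the bench does not necessarily record axes this way.

```python
from typing import Dict, List, Sequence, Set

TIER_ORDER = ("easy", "medium", "hard")


def stacked_tiers(
    tiers: Dict[str, Set[str]], order: Sequence[str] = TIER_ORDER
) -> List[str]:
    """Return tiers that drop an earlier axis or add anything other than one new axis."""
    offenders: List[str] = []
    previous: Set[str] = set()
    for name in order:
        axes = tiers[name]
        keeps_previous = previous <= axes
        added = axes - previous
        if not keeps_previous or len(added) != 1:
            offenders.append(name)
        previous = axes
    return offenders


# Hypothetical axis labels for illustration only.
CLEAN = {
    "easy": {"ordered_route"},
    "medium": {"ordered_route", "third_region"},
    "hard": {"ordered_route", "third_region", "relative_coords"},
}
STACKED = {
    "easy": {"ordered_route"},
    "medium": {"ordered_route", "third_region"},
    "hard": {"ordered_route", "third_region", "relative_coords", "strict_budget"},
}
```

Here `stacked_tiers(CLEAN)` returns an empty list, while `stacked_tiers(STACKED)` flags "hard" because a failure there could not be attributed to a single new variable.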

D. RUN & INSPECT (one difficulty at a time)

  1. Run on qwen/qwen3.6-flash via OpenRouter (key in ./.env, git-ignored): python3 -m openra_bench.run_eval --packs openra_bench/scenarios/packs/<pack>.yaml --levels <lvl> --seeds 1 --provider openrouter --model qwen/qwen3.6-flash --playback playback/run1. easy, then medium, then hard.
  2. Inspect the playback (playback/run1/*/<pack>:<lvl>:public/seed1/): manifest.json (outcome/turns), score.json (composite/weakest_link/speed), turns.jsonl (per-turn units, enemies, signals, interrupt, goal), messages.json (model text is in the reasoning field, not content). Reconstruct: did the intended mechanism fire? final positions vs win regions? units_lost? terminal "episode end" frame present?
  3. Classify the outcome: scenario defect (→ fix, re-verify, re-run) vs legitimate model failure (→ record as valid discrimination, no change). Cite evidence from the playback.
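
A first pass over a seed directory can be scripted before reading the turns by hand. The sketch below assumes the files named in item 2 and assumes that manifest.json and score.json are JSON objects whose keys match the fields listed there (outcome, turns, composite, weakest_link, speed); the exact schema is not confirmed here, so missing keys simply come back as None. The helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_json(path: Path) -> Any:
    """Read one JSON document."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read a JSON Lines file, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize_playback(seed_dir: Path) -> Dict[str, Any]:
    """Collect the headline fields of one playback seed directory."""
    manifest = load_json(seed_dir / "manifest.json")
    score = load_json(seed_dir / "score.json")
    turns = load_jsonl(seed_dir / "turns.jsonl")
    return {
        # Key names below are assumptions taken from the checklist wording.
        "outcome": manifest.get("outcome"),
        "turns_reported": manifest.get("turns"),
        "turn_records": len(turns),
        "composite": score.get("composite"),
        "weakest_link": score.get("weakest_link"),
        "speed": score.get("speed"),
        "last_turn": turns[-1] if turns else None,
    }
```

Pass the seed directory of the run being inspected; the last turn record is the quickest place to compare final positions against the win regions and to check for the terminal "episode end" frame.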

E. RE-VERIFY & SHIP

  1. python3 -m pytest tests/ -q fully green (add/extend focused tests for any new predicate/knob/scenario behaviour).
  2. python3 scripts/gen_scenario_docs.py (regenerate the HTML catalog) when prose/objectives change.
  3. Commit per fixed scenario, no Claude co-author line, using git -c commit.gpgsign=false commit. Then git fetch -q origin && git rebase -q origin/main && git push -q origin HEAD. Engine changes (OpenRA-Rust) commit+push separately; rebuild the wheel (maturin develop --release) if the engine changed and re-run the affected scenario.
  4. Record a "Per-scenario closer look - #N" note in SCENARIO_QUALITY.md (the defect found, the fix, the easy/medium/hard outcomes + verdict).

Reference: defect patterns already seen

  • Win condition doesn't enforce the stated capability (laziest play wins). Seen in #1, #2.
  • reach_region/single-region where a split or ordered visit is intended. Seen in #1, #2.
  • Missing fail_condition ⇒ non-win == draw. Seen in #1, #2.
  • after_ticks fail unreachable within max_turns (tick/turn misalignment). Seen in #2.
  • Relative objective too literal ("NE corner" → map extreme), region inset & unreachable blind, landmark fogged ⇒ unfair. Seen in #1 hard.
  • base_map at Level level silently ignored. Seen in #2 hard.
  • Actors placed off the resolved map ⇒ engine panic in the interrupt path. Seen in #2 hard.
  • Bench advertises a tool the engine can't execute (1:1 parity). Seen in capture_actor / S8.
  • Inert deadline (within_ticks ≫ optimal) ⇒ no anti-stall teeth. Seen in #1 easy, #2 easy (acceptable for easy only).