Per-Scenario Closer-Look Checklist
The reproducible methodology applied to #1 (action-multiunit-coordination) and #2 (action-sequenced-execution). Every remaining active pack gets the same treatment: closer look → fix defects → re-verify → run each difficulty (easy → medium → hard) on the model → inspect the playback → commit/push. Read this fully before touching a pack.
Guiding principle: the benchmark must validly test its stated capability; do NOT compensate for model weakness, and do NOT over-engineer. A model failing a correctly-designed scenario is a good result (real discrimination). Only fix genuine scenario defects.
A. SOLVENCY: can the intended strategy actually win, in budget?
- The win predicate must enforce the advertised capability, and only that. The classic defect: the prose claims X but the win condition is satisfiable without doing X.
  - #1: "split the force" but `reach_region` (≥1 unit) let one touring unit win → switched to `units_in_region_gte` (≥2 in EACH region).
  - #2: "ordered route" but only the final region was a predicate → a beeline that skipped every waypoint won → added the stateful `waypoint_sequence` latch (Wk+1 only counts after Wk; skip / out-of-order / idle → never satisfied).
  - Ask: "what is the laziest play that satisfies this win condition?" If it isn't the intended capability, the predicate is wrong.
- Is the optimal/intended play winnable within the tick budget? Estimate path length; the engine advances ~90 ticks per decision turn (tick ≈ 93 + 90·(turn-1)). The intended strategy must finish comfortably under `within_ticks`; the inefficient strategy must NOT (that gap is the discrimination).
- Coordinate-blind objectives must be solvable. If `objective_coords: relative`, the model can't cell-count, so it needs a feedback loop (an `enemy_building_spotted` interrupt revealing the marker, a landmark building at each region) and a forgiving enough radius. A "search band" beats a bare compass word.
- Map fits the actors. Every actor/region coordinate must be inside the map's playable bounds. Actors at scout-arena coords on a rush-hour map → engine panic. Confirm `compiled.map_supported` and that `base_map` resolves to the intended map.
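
The `waypoint_sequence` latch described for #2 can be illustrated with a small sketch. This is not the engine's implementation: the class name `WaypointLatch`, the `observe` method, and the one-step-per-observation rule are assumptions made for illustration. It only shows why a beeline, an out-of-order route, or idling never satisfies an ordered predicate.

```python
from dataclasses import dataclass


@dataclass
class WaypointLatch:
    """Hypothetical latch: waypoint k+1 only counts after waypoint k was visited."""

    waypoints: list[str]
    next_index: int = 0  # index of the next waypoint that is allowed to count

    def observe(self, regions_occupied: set[str]) -> None:
        # Assumption: at most one waypoint latches per observation.
        if (
            self.next_index < len(self.waypoints)
            and self.waypoints[self.next_index] in regions_occupied
        ):
            self.next_index += 1

    @property
    def satisfied(self) -> bool:
        return self.next_index == len(self.waypoints)


def route_wins(waypoints: list[str], visits: list[list[str]]) -> bool:
    """Replay a sequence of per-turn occupied regions against the latch."""
    latch = WaypointLatch(list(waypoints))
    for regions in visits:
        latch.observe(set(regions))
    return latch.satisfied


route = ["W1", "W2", "W3"]
print(route_wins(route, [["W1"], ["W2"], ["W3"]]))  # intended ordered play
print(route_wins(route, [["W3"]]))                  # beeline to the final region
print(route_wins(route, [["W2"], ["W1"], ["W3"]]))  # out-of-order visit
print(route_wins(route, [[], [], []]))              # idle
```

Only the first call reports a win. Replaying the laziest play through a sketch like this is a cheap way to ask whether the predicate enforces the advertised capability.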
B. STABILITY: deterministic, no crashes, fail is reachable
- Non-win must be a real LOSS, not a draw. Every level needs a `fail_condition`. Idiom: `any_of[ {after_ticks: BUDGET+1}, {not:{units_lost_lte: N}}, {not:{own_units_gte: 1}} ]`.
- Tick/turn alignment (critical, easy to get wrong). A fail `after_ticks: K` only bites if K is reachable within `max_turns`: require K ≤ 93 + 90·(max_turns-1). Likewise the episode must be able to reach `within_ticks` before `max_turns` ends, or a staller draws instead of losing. Re-derive this for every level after any budget/turn change.
- `base_map` override goes INSIDE `overrides:`; a Level-level `base_map` key is silently ignored (it's not a `Level` field).
- Smoke the engine path before a full run: compile the level, `_scenario_to_tmp_yaml`, `RustEnvPool` `reset(seed)` + a few `env.step`; if the pack enables interrupts, also drive `raw_env.step_until_event([...],None,5,[sig])` for ~30 steps. This catches panics/out-of-bounds errors without burning a model run.
- Hard-tier spawn contract: if the pack is in `tests/test_hard_tier.py::UPGRADED`, `hard` must keep ≥2 `spawn_point` groups (seed-varied start). A deliberate exception is allowed only with a stated reason added to `NOT_APPLICABLE`.
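
The tick/turn alignment rule above is simple arithmetic, so it can be re-derived mechanically after every budget or turn change. The sketch below uses the checklist's approximation tick ≈ 93 + 90·(turn-1). The function names and the example numbers are hypothetical and not part of the benchmark code.

```python
FIRST_DECISION_TICK = 93  # from the checklist: tick ≈ 93 + 90·(turn-1)
TICKS_PER_TURN = 90


def tick_at_turn(turn: int) -> int:
    """Approximate engine tick at a 1-indexed decision turn."""
    if turn < 1:
        raise ValueError("turn is 1-indexed")
    return FIRST_DECISION_TICK + TICKS_PER_TURN * (turn - 1)


def check_alignment(within_ticks: int, fail_after_ticks: int, max_turns: int) -> list[str]:
    """Return a list of alignment problems; an empty list means the level is aligned."""
    last_tick = tick_at_turn(max_turns)
    problems = []
    if fail_after_ticks > last_tick:
        problems.append(
            f"fail after_ticks={fail_after_ticks} is unreachable: "
            f"max_turns={max_turns} ends near tick {last_tick}"
        )
    if within_ticks > last_tick:
        problems.append(
            f"within_ticks={within_ticks} is never reached: "
            f"max_turns={max_turns} ends near tick {last_tick}, so a staller draws"
        )
    return problems


# Hypothetical level: budget of 900 ticks, fail at BUDGET+1.
print(check_alignment(within_ticks=900, fail_after_ticks=901, max_turns=10))  # aligned
print(check_alignment(within_ticks=900, fail_after_ticks=901, max_turns=8))   # two problems
```

With `max_turns=10` the episode reaches tick 903, so both thresholds are reachable; with `max_turns=8` it stops near tick 723 and neither the deadline nor the fail condition can fire.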
C. CAPABILITY: clean difficulty axis, faithful framing
- One new controlled variable per tier so a tier failure attributes to a single capability. easy = the bare skill; medium = +1 axis (a third region / parallel split / contest / attrition); hard = +1 more (relative coords + scouting, larger map, strict budget). Avoid stacking uncontrolled variables.
- Keep established idioms (e.g. the single-final-region + `[after_ticks, within_ticks]` band for "execute, don't stall"; fact+proc key-building destruction for adversarial). Don't invent a new mechanism when one exists, but DO add a reusable predicate / engine feature when the capability genuinely can't be expressed (`units_in_region_gte`, `waypoint_sequence`, `enemy_building_spotted` interrupt, scripted `enemy.bot`).
- Title/description are plain and self-explanatory; the objective brief the model sees (`game_knowledge.objective_brief`) must state the exact machine win/fail in plain language with no degenerate lines.
D. RUN & INSPECT (one difficulty at a time)
- Run on `qwen/qwen3.6-flash` via OpenRouter (key in `./.env`, git-ignored): `python3 -m openra_bench.run_eval --packs openra_bench/scenarios/packs/<pack>.yaml --levels <lvl> --seeds 1 --provider openrouter --model qwen/qwen3.6-flash --playback playback/run1`. Run easy, then medium, then hard.
- Inspect the playback (`playback/run1/*/<pack>:<lvl>:public/seed1/`): `manifest.json` (outcome/turns), `score.json` (composite/weakest_link/speed), `turns.jsonl` (per-turn `units`, `enemies`, `signals`, `interrupt`, `goal`), `messages.json` (model text is in the `reasoning` field, not `content`). Reconstruct: did the intended mechanism fire? final positions vs win regions? `units_lost`? terminal "episode end" frame present?
- Classify the outcome: scenario defect (→ fix, re-verify, re-run) vs legitimate model failure (→ record as valid discrimination, no change). Cite evidence from the playback.
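
The playback inspection can be partly scripted. The sketch below loads the files named above from one seed directory and prints a short summary. It assumes the JSON keys match the names in this checklist (`outcome`, `composite`, `units`, `interrupt`); the real schema may differ, and `load_playback` and `summarize` are hypothetical helpers, not part of `openra_bench`. `messages.json` is left out because its layout isn't specified here.

```python
import json
from pathlib import Path


def load_playback(seed_dir: Path) -> dict:
    """Read manifest.json, score.json and turns.jsonl from one seed directory."""
    manifest = json.loads((seed_dir / "manifest.json").read_text())
    score = json.loads((seed_dir / "score.json").read_text())
    turns = [
        json.loads(line)
        for line in (seed_dir / "turns.jsonl").read_text().splitlines()
        if line.strip()  # tolerate blank lines
    ]
    return {"manifest": manifest, "score": score, "turns": turns}


def summarize(playback: dict) -> dict:
    """Collect the facts needed to classify the outcome (key names are assumed)."""
    turns = playback["turns"]
    last_turn = turns[-1] if turns else {}
    return {
        "outcome": playback["manifest"].get("outcome"),
        "composite": playback["score"].get("composite"),
        "turns_recorded": len(turns),
        "final_units": last_turn.get("units"),
        # 1-indexed turns on which an interrupt was recorded
        "turns_with_interrupt": [
            i for i, turn in enumerate(turns, start=1) if turn.get("interrupt")
        ],
    }


def print_summary(seed_dir: str) -> None:
    for key, value in summarize(load_playback(Path(seed_dir))).items():
        print(f"{key}: {value}")
```

Calling `print_summary` on a seed directory gives a first pass at whether the intended mechanism fired; final positions versus win regions still need to be checked against the scenario definition.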
E. RE-VERIFY & SHIP
- `python3 -m pytest tests/ -q` fully green (add/extend focused tests for any new predicate/knob/scenario behaviour).
- `python3 scripts/gen_scenario_docs.py` (regenerate the HTML catalog) when prose/objectives change.
- Commit per fixed scenario, no Claude co-author line, using `git -c commit.gpgsign=false commit`. Then `git fetch -q origin && git rebase -q origin/main && git push -q origin HEAD`. Engine changes (OpenRA-Rust) commit+push separately; rebuild the wheel (`maturin develop --release`) if the engine changed and re-run the affected scenario.
- Record a "Per-scenario closer look - #N" note in `SCENARIO_QUALITY.md` (the defect found, the fix, the easy/medium/hard outcomes + verdict).
Reference: defect patterns already seen
- Win condition doesn't enforce the stated capability (laziest play wins). Seen in: #1, #2
- `reach_region`/single-region where a split or ordered visit is intended. Seen in: #1, #2
- Missing `fail_condition` → non-win == draw. Seen in: #1, #2
- `after_ticks` fail unreachable within `max_turns` (tick/turn misalignment). Seen in: #2
- Relative objective too literal ("NE corner" → map extreme), region inset & unreachable blind, landmark fogged → unfair. Seen in: #1 hard
- `base_map` at Level level silently ignored. Seen in: #2 hard
- Actors placed off the resolved map → engine panic in the interrupt path. Seen in: #2 hard
- Bench advertises a tool the engine can't execute (1:1 parity). Seen in: capture_actor / S8
- Inert deadline (`within_ticks` ≫ optimal) → no anti-stall teeth. Seen in: #1 easy, #2 easy (acceptable for easy only).