# Per-Scenario Closer-Look Checklist

The reproducible methodology applied to #1 (action-multiunit-coordination)
and #2 (action-sequenced-execution). Every remaining active pack gets
the same treatment: **closer look → fix defects → re-verify → run each
difficulty easy → medium → hard on the model → inspect the playback →
commit/push**. Read this fully before touching a pack.

Guiding principle: **the benchmark must validly test its stated
capability; do NOT compensate for model weakness, and do NOT
over-engineer.** A model failing a correctly-designed scenario is a
*good* result (real discrimination). Only fix genuine scenario defects.

---
## A. SOLVENCY: can the intended strategy actually win, in budget?

1. **The win predicate must enforce the advertised capability, and
only that.** The classic defect: the prose claims X but the win
condition is satisfiable without doing X.
   - #1: "split the force" but `reach_region` (≥1 unit) let one
     touring unit win → switched to `units_in_region_gte` (≥2 in EACH
     region).
   - #2: "ordered route" but only the *final* region was a predicate,
     so a beeline that skipped every waypoint won → added the stateful
     `waypoint_sequence` latch (Wk+1 only counts after Wk; skip /
     out-of-order / idle → never satisfied).
   - Ask: "what is the laziest play that satisfies this win condition?"
     If it isn't the intended capability, the predicate is wrong.
2. **Is the optimal/intended play winnable within the tick budget?**
Estimate path length; the engine advances **~90 ticks per decision
turn** (`tick ≈ 93 + 90·(turn-1)`). The intended strategy must
finish comfortably under `within_ticks`; the *inefficient* strategy
must NOT (that gap is the discrimination).
3. **Coordinate-blind objectives must be solvable.** If
`objective_coords: relative`, the model can't cell-count; it needs
a feedback loop (an `enemy_building_spotted` interrupt revealing the
marker, a landmark building at each region) and a forgiving enough
radius. A "search band" beats a bare compass word.
4. **Map fits the actors.** Every actor/region coordinate must be
inside the map's playable bounds. Actors at scout-arena coords on a
rush-hour map → engine panic. Confirm `compiled.map_supported` and
that `base_map` resolves to the *intended* map.
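
The "laziest play" question from item 1 can be made concrete with a small, engine-independent sketch. Everything here is hypothetical and for illustration only: `WaypointLatch`, `in_region`, the rectangle format, and the sample paths are not the real `waypoint_sequence` implementation. The sketch only shows why a final-region-only predicate accepts a beeline while a stateful latch rejects it.

```python
from dataclasses import dataclass

# Hypothetical region format for this sketch: (x0, y0, x1, y1), inclusive.
Region = tuple[int, int, int, int]
Pos = tuple[int, int]


def in_region(pos: Pos, region: Region) -> bool:
    """True if pos lies inside the inclusive rectangle."""
    x, y = pos
    x0, y0, x1, y1 = region
    return x0 <= x <= x1 and y0 <= y <= y1


@dataclass
class WaypointLatch:
    """Stateful latch: waypoint k+1 only counts after waypoint k was visited."""
    waypoints: list[Region]
    next_index: int = 0

    def observe(self, pos: Pos) -> None:
        # Only the next expected waypoint can advance the latch.
        if self.next_index < len(self.waypoints) and in_region(
            pos, self.waypoints[self.next_index]
        ):
            self.next_index += 1

    @property
    def satisfied(self) -> bool:
        return self.next_index == len(self.waypoints)


def final_region_only(path: list[Pos], waypoints: list[Region]) -> bool:
    """The defective predicate: only the last region is ever checked."""
    return any(in_region(p, waypoints[-1]) for p in path)


def latched(path: list[Pos], waypoints: list[Region]) -> bool:
    """The repaired predicate: every waypoint must be visited in order."""
    latch = WaypointLatch(list(waypoints))
    for p in path:
        latch.observe(p)
    return latch.satisfied


# Illustrative data (not taken from any real pack).
waypoints = [(2, 0, 3, 1), (8, 0, 9, 1), (8, 8, 9, 9)]
intended = [(0, 0), (2, 0), (5, 0), (8, 0), (8, 4), (8, 8)]
beeline = [(0, 0), (4, 4), (8, 8)]               # skips W1 and W2
out_of_order = [(0, 0), (8, 0), (2, 0), (8, 8)]  # visits W2 before W1

print("final-only accepts beeline:", final_region_only(beeline, waypoints))
print("latch accepts beeline:", latched(beeline, waypoints))
print("latch accepts intended:", latched(intended, waypoints))
print("latch accepts out-of-order:", latched(out_of_order, waypoints))
```

Running the laziest path through both predicates is a cheap way to confirm that a fix closes the loophole without rejecting the intended route.
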

## B. STABILITY: deterministic, no crashes, fail is reachable
5. **Non-win must be a real LOSS, not a draw.** Every level needs a
`fail_condition`. Idiom: `any_of[ {after_ticks: BUDGET+1},
{not:{units_lost_lte: N}}, {not:{own_units_gte: 1}} ]`.
6. **Tick/turn alignment (critical, easy to get wrong).** A
`fail after_ticks: K` only bites if K is reachable within
`max_turns`: require `K ≤ 93 + 90·(max_turns-1)`. Likewise the
episode must be able to reach `within_ticks` before `max_turns`
ends, or a staller draws instead of losing. Re-derive this for
every level after any budget/turn change.
7. **`base_map` override goes INSIDE `overrides:`**. A Level-level
`base_map` key is silently ignored (it's not a `Level` field).
8. **Smoke the engine path before a full run**: compile the level,
`_scenario_to_tmp_yaml`, `RustEnvPool` reset(seed) + a few
`env.step`; if the pack enables interrupts, also drive
`raw_env.step_until_event([...],None,5,[sig])` for ~30 steps. This catches
panics and out-of-bounds (oob) errors without burning a model run.
9. **Hard-tier spawn contract**: if the pack is in
`tests/test_hard_tier.py::UPGRADED`, `hard` must keep ≥2
`spawn_point` groups (seed-varied start). A deliberate exception is
allowed only with a stated reason added to `NOT_APPLICABLE`.
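
The tick/turn alignment rule in item 6 is plain arithmetic, so it can be re-derived mechanically after every budget or turn change. The sketch below assumes the approximate formula quoted in this checklist (`tick ≈ 93 + 90·(turn-1)`); the function names and the flat argument list are hypothetical and do not mirror the real level schema.

```python
# Constants taken from the checklist's approximate formula; treat them as
# estimates, not exact engine guarantees.
FIRST_DECISION_TICK = 93
TICKS_PER_TURN = 90


def tick_at_turn(turn: int) -> int:
    """Approximate engine tick at the given 1-based decision turn."""
    if turn < 1:
        raise ValueError("turn must be >= 1")
    return FIRST_DECISION_TICK + TICKS_PER_TURN * (turn - 1)


def check_alignment(within_ticks: int, fail_after_ticks: int, max_turns: int) -> list[str]:
    """Return human-readable problems; an empty list means the level is aligned."""
    problems = []
    last_tick = tick_at_turn(max_turns)
    if fail_after_ticks > last_tick:
        problems.append(
            f"fail after_ticks={fail_after_ticks} is unreachable: "
            f"max_turns={max_turns} ends near tick {last_tick}"
        )
    if within_ticks > last_tick:
        problems.append(
            f"within_ticks={within_ticks} is never reached before "
            f"max_turns={max_turns} ends (tick {last_tick}); a staller draws"
        )
    return problems


# Illustrative numbers only: budget 900, fail at BUDGET+1.
print(check_alignment(within_ticks=900, fail_after_ticks=901, max_turns=10))
print(check_alignment(within_ticks=900, fail_after_ticks=901, max_turns=9))
```

With `max_turns=10` the last decision lands near tick 903, so both thresholds are reachable; dropping to `max_turns=9` ends near tick 813 and both checks report a problem.
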

## C. CAPABILITY: clean difficulty axis, faithful framing
10. **One new controlled variable per tier** so a tier failure
attributes to a single capability. easy = the bare skill; medium =
+1 axis (a third region / parallel split / contest / attrition);
hard = +1 more (relative coords + scouting, larger map, strict
budget). Avoid stacking uncontrolled variables.
11. **Keep established idioms** (e.g. the single-final-region +
`[after_ticks, within_ticks]` band for "execute, don't stall";
fact+proc key-building destruction for adversarial). Don't invent a
new mechanism when one exists, but DO add a reusable predicate /
engine feature when the capability genuinely can't be expressed
(`units_in_region_gte`, `waypoint_sequence`,
`enemy_building_spotted` interrupt, scripted `enemy.bot`).
12. **Title/description are plain and self-explanatory**; the
objective brief the model sees (`game_knowledge.objective_brief`)
must state the exact machine win/fail in plain language with no
degenerate lines.
## D. RUN & INSPECT (one difficulty at a time)
13. Run on **`qwen/qwen3.6-flash`** via OpenRouter (key in `./.env`,
git-ignored): `python3 -m openra_bench.run_eval --packs
openra_bench/scenarios/packs/<pack>.yaml --levels <lvl> --seeds 1
--provider openrouter --model qwen/qwen3.6-flash --playback
playback/run1`. easy, then medium, then hard.
14. **Inspect the playback** (`playback/run1/*/<pack>:<lvl>:public/
seed1/`): `manifest.json` (outcome/turns), `score.json`
(composite/weakest_link/speed), `turns.jsonl` (per-turn `units`,
`enemies`, `signals`, `interrupt`, `goal`), `messages.json` (model
text is in the **`reasoning`** field, not `content`). Reconstruct:
did the intended mechanism fire? final positions vs win regions?
`units_lost`? terminal "episode end" frame present?
15. **Classify the outcome**: scenario defect (→ fix, re-verify, re-run)
vs legitimate model failure (→ record as valid discrimination, no
change). Cite evidence from the playback.
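
A first pass over a playback directory can be scripted before reading the files by hand. This is a minimal sketch under stated assumptions: the file names come from item 14, but the exact JSON layout (top-level `outcome`, `composite`, `weakest_link` keys, one JSON object per line in `turns.jsonl`) is assumed rather than confirmed, and `summarize_playback` is a hypothetical helper, not part of `openra_bench`.

```python
import json
from pathlib import Path


def load_json(path: Path):
    """Load a JSON file, or return None if it is missing."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else None


def load_jsonl(path: Path) -> list:
    """Load one JSON object per non-empty line; missing file gives []."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize_playback(seed_dir) -> dict:
    """Collect the fields the checklist asks about into one dict.

    Assumes manifest.json and score.json are JSON objects; all keys are
    read with .get() so an unexpected schema yields None, not a crash.
    """
    seed_dir = Path(seed_dir)
    manifest = load_json(seed_dir / "manifest.json") or {}
    score = load_json(seed_dir / "score.json") or {}
    turns = load_jsonl(seed_dir / "turns.jsonl")
    last = turns[-1] if turns else {}
    return {
        "outcome": manifest.get("outcome"),
        "composite": score.get("composite"),
        "weakest_link": score.get("weakest_link"),
        "turns_recorded": len(turns),
        "interrupt_turns": sum(1 for t in turns if t.get("interrupt")),
        "final_units": last.get("units"),
        "final_goal": last.get("goal"),
    }
```

Calling `summarize_playback(...)` on a `seed1/` directory gives a compact starting point for classification; the per-turn positions and the `reasoning` text in `messages.json` still need a manual read.
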
## E. RE-VERIFY & SHIP
16. `python3 -m pytest tests/ -q` fully green (add/extend focused
tests for any new predicate/knob/scenario behaviour).
17. `python3 scripts/gen_scenario_docs.py` (regenerate the HTML
catalog) when prose/objectives change.
18. Commit per fixed scenario, **no Claude co-author line**, using
`git -c commit.gpgsign=false commit`. Then
`git fetch -q origin && git rebase -q origin/main && git push -q
origin HEAD`. Engine changes (OpenRA-Rust) commit+push separately;
rebuild the wheel (`maturin develop --release`) if the engine
changed and re-run the affected scenario.
19. Record a "Per-scenario closer look - #N" note in
`SCENARIO_QUALITY.md` (the defect found, the fix, the
easy/medium/hard outcomes + verdict).
## Reference: defect patterns already seen
- Win condition doesn't enforce the stated capability (laziest play
  wins). (#1, #2)
- `reach_region`/single-region where a *split* or *ordered* visit is
  intended. (#1, #2)
- Missing `fail_condition` → non-win == draw. (#1, #2)
- `after_ticks` fail unreachable within `max_turns` (tick/turn
  misalignment). (#2)
- Relative objective too literal ("NE corner" read as the map extreme),
  region inset & unreachable blind, landmark fogged → unfair. (#1 hard)
- `base_map` at Level level silently ignored. (#2 hard)
- Actors placed off the resolved map → engine panic in the interrupt
  path. (#2 hard)
- Bench advertises a tool the engine can't execute (1:1 parity).
  (capture_actor / S8)
- Inert deadline (`within_ticks` ≫ optimal) → no anti-stall teeth.
  (#1 easy, #2 easy; acceptable for *easy* only.)