# Per-Scenario Closer-Look Checklist

This is the reproducible methodology applied to #1 (action-multiunit-coordination)
and #2 (action-sequenced-execution). Every remaining active pack gets
the same treatment: **closer look → fix defects → re-verify → run each
difficulty easy→medium→hard on the model → inspect the playback →
commit/push**. Read this fully before touching a pack.

Guiding principle: **the benchmark must validly test its stated
capability; do NOT compensate for model weakness, and do NOT
over-engineer.** A model failing a correctly-designed scenario is a
*good* result (real discrimination). Only fix genuine scenario defects.

---

## A. SOLVENCY: can the intended strategy actually win, in budget?

1. **The win predicate must enforce the advertised capability, and
   only that.** The classic defect: the prose claims X but the win
   condition is satisfiable without doing X.
   - #1: "split the force" but `reach_region` (≥1 unit) let one
     touring unit win → switched to `units_in_region_gte` (≥2 in EACH
     region).
   - #2: "ordered route" but only the *final* region was a predicate →
     a beeline that skipped every waypoint won → added the stateful
     `waypoint_sequence` latch (Wk+1 only counts after Wk; skip /
     out-of-order / idle ⇒ never satisfied).
   - Ask: "what is the laziest play that satisfies this win condition?"
     If it isn't the intended capability, the predicate is wrong.
2. **Is the optimal/intended play winnable within the tick budget?**
   Estimate path length; the engine advances **~90 ticks per decision
   turn** (`tick ≈ 93 + 90·(turn-1)`). The intended strategy must
   finish comfortably under `within_ticks`; the *inefficient* strategy
   must NOT (that gap is the discrimination).
3. **Coordinate-blind objectives must be solvable.** If
   `objective_coords: relative`, the model can't cell-count; it needs
   a feedback loop (an `enemy_building_spotted` interrupt revealing the
   marker, a landmark building at each region) and a forgiving enough
   radius. A "search band" beats a bare compass word.
4. **Map fits the actors.** Every actor/region coordinate must be
   inside the map's playable bounds. Actors at scout-arena coords on a
   rush-hour map → engine panic. Confirm `compiled.map_supported` and
   that `base_map` resolves to the *intended* map.
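
The "laziest play" question in item 1 can be made concrete with a toy model of an ordered-visit latch. This is a minimal sketch of the idea behind `waypoint_sequence` as described above, not the benchmark's actual implementation: the class name, the `observe` method, and the representation of a tick as a set of occupied region names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WaypointLatch:
    """Toy ordered-visit latch: waypoint k+1 only counts after waypoint k."""

    waypoints: list      # ordered region names (hypothetical representation)
    next_index: int = 0  # index of the next waypoint that may be latched

    def observe(self, regions_occupied):
        """Feed the regions occupied this tick; latch at most one waypoint."""
        if (self.next_index < len(self.waypoints)
                and self.waypoints[self.next_index] in regions_occupied):
            self.next_index += 1

    @property
    def satisfied(self):
        return self.next_index == len(self.waypoints)


def route_completed(route, ticks):
    """Replay a sequence of per-tick occupied-region sets against the latch."""
    latch = WaypointLatch(list(route))
    for occupied in ticks:
        latch.observe(occupied)
    return latch.satisfied


route = ["W1", "W2", "W3"]
print(route_completed(route, [{"W1"}, {"W2"}, {"W3"}]))  # intended play
print(route_completed(route, [{"W3"}]))                  # beeline to the end
print(route_completed(route, [{"W2"}, {"W1"}, {"W3"}]))  # out of order
```

Only the first replay satisfies the latch; the beeline and the out-of-order visit never do, which is the property the predicate needs so that the laziest play no longer wins.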

## B. STABILITY: deterministic, no crashes, fail is reachable

5. **Non-win must be a real LOSS, not a draw.** Every level needs a
   `fail_condition`. Idiom: `any_of[ {after_ticks: BUDGET+1},
   {not:{units_lost_lte: N}}, {not:{own_units_gte: 1}} ]`.
6. **Tick/turn alignment (critical, easy to get wrong).** A
   `fail after_ticks: K` only bites if K is reachable within
   `max_turns`: require `K ≤ 93 + 90·(max_turns-1)`. Likewise the
   episode must be able to reach `within_ticks` before `max_turns`
   ends, or a staller draws instead of losing. Re-derive this for
   every level after any budget/turn change.
7. **`base_map` override goes INSIDE `overrides:`.** A Level-level
   `base_map` key is silently ignored (it's not a `Level` field).
8. **Smoke the engine path before a full run**: compile the level,
   `_scenario_to_tmp_yaml`, `RustEnvPool` reset(seed) + a few
   `env.step`; if the pack enables interrupts, also drive
   `raw_env.step_until_event([...],None,5,[sig])` ~30 steps. This catches
   panics/oob without burning a model run.
9. **Hard-tier spawn contract**: if the pack is in
   `tests/test_hard_tier.py::UPGRADED`, `hard` must keep ≥2
   `spawn_point` groups (seed-varied start). A deliberate exception is
   allowed only with a stated reason added to `NOT_APPLICABLE`.
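
The tick/turn arithmetic in item 6 is easy to get wrong by hand, so it may help to script it. The sketch below only encodes the approximate formula quoted in this checklist (`tick ≈ 93 + 90·(turn-1)`); the function names are hypothetical helpers, not part of the benchmark code, and the constants should be re-checked if the engine's step size changes.

```python
FIRST_DECISION_TICK = 93  # tick at turn 1, per the formula in this checklist
TICKS_PER_TURN = 90       # approximate ticks advanced per decision turn


def tick_at_turn(turn):
    """Approximate engine tick at a 1-indexed decision turn."""
    return FIRST_DECISION_TICK + TICKS_PER_TURN * (turn - 1)


def fail_is_reachable(after_ticks, max_turns):
    """True if a `fail after_ticks: K` can fire before the episode ends."""
    return after_ticks <= tick_at_turn(max_turns)


def min_turns_to_reach(ticks):
    """Smallest max_turns for which `ticks` is reached (integer ceiling)."""
    extra = -(-(ticks - FIRST_DECISION_TICK) // TICKS_PER_TURN)
    return max(1, extra + 1)


print(tick_at_turn(10))             # 903
print(fail_is_reachable(901, 10))   # True: 901 <= 903
print(fail_is_reachable(1000, 10))  # False: a staller would draw
print(min_turns_to_reach(1000))     # 12
```

Running `min_turns_to_reach` on both `within_ticks` and the fail `after_ticks` after every budget change gives the lower bound that `max_turns` has to meet.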

## C. CAPABILITY: clean difficulty axis, faithful framing

10. **One new controlled variable per tier** so a tier failure
    attributes to a single capability. easy = the bare skill; medium =
    +1 axis (a third region / parallel split / contest / attrition);
    hard = +1 more (relative coords + scouting, larger map, strict
    budget). Avoid stacking uncontrolled variables.
11. **Keep established idioms** (e.g. the single-final-region +
    `[after_ticks, within_ticks]` band for "execute, don't stall";
    fact+proc key-building destruction for adversarial). Don't invent a
    new mechanism when one exists, but DO add a reusable predicate /
    engine feature when the capability genuinely can't be expressed
    (`units_in_region_gte`, `waypoint_sequence`, the
    `enemy_building_spotted` interrupt, scripted `enemy.bot`).
12. **Title/description are plain and self-explanatory**; the
    objective brief the model sees (`game_knowledge.objective_brief`)
    must state the exact machine win/fail in plain language with no
    degenerate lines.

## D. RUN & INSPECT (one difficulty at a time)

13. Run on **`qwen/qwen3.6-flash`** via OpenRouter (key in `./.env`,
    git-ignored): `python3 -m openra_bench.run_eval --packs
    openra_bench/scenarios/packs/<pack>.yaml --levels <lvl> --seeds 1
    --provider openrouter --model qwen/qwen3.6-flash --playback
    playback/run1`. easy, then medium, then hard.
14. **Inspect the playback** (`playback/run1/*/<pack>:<lvl>:public/
    seed1/`): `manifest.json` (outcome/turns), `score.json`
    (composite/weakest_link/speed), `turns.jsonl` (per-turn `units`,
    `enemies`, `signals`, `interrupt`, `goal`), `messages.json` (model
    text is in the **`reasoning`** field, not `content`). Reconstruct:
    did the intended mechanism fire? final positions vs win regions?
    `units_lost`? terminal "episode end" frame present?
15. **Classify the outcome**: scenario defect (→ fix, re-verify, re-run)
    vs legitimate model failure (→ record as valid discrimination, no
    change). Cite evidence from the playback.
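
A first pass over the playback in item 14 can be scripted before reading the files by hand. The following is a rough sketch under stated assumptions: the file names (`manifest.json`, `turns.jsonl`) and the per-turn `units` and `interrupt` fields come from this checklist, but the exact key names inside `manifest.json` (here `outcome`) and the helper `summarise_playback` itself are assumptions, so missing keys are tolerated rather than treated as errors.

```python
import json
from pathlib import Path


def summarise_playback(seed_dir):
    """Summarise one seed directory of a playback run (assumed layout)."""
    seed_dir = Path(seed_dir)
    manifest = json.loads((seed_dir / "manifest.json").read_text())

    # turns.jsonl: one JSON object per decision turn, blank lines skipped.
    turns = [
        json.loads(line)
        for line in (seed_dir / "turns.jsonl").read_text().splitlines()
        if line.strip()
    ]

    # 1-indexed turns on which an interrupt was recorded.
    interrupt_turns = [
        i for i, turn in enumerate(turns, start=1) if turn.get("interrupt")
    ]
    last_turn = turns[-1] if turns else {}

    return {
        "outcome": manifest.get("outcome"),    # assumed key name
        "turns_recorded": len(turns),
        "interrupt_turns": interrupt_turns,
        "final_units": last_turn.get("units"),
    }
```

The summary answers two of the reconstruction questions directly (did any interrupt fire, and what did the final turn record for `units`); comparing final positions against the win regions still needs the level definition.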

## E. RE-VERIFY & SHIP

16. `python3 -m pytest tests/ -q` fully green (add/extend focused
    tests for any new predicate/knob/scenario behaviour).
17. `python3 scripts/gen_scenario_docs.py` (regenerate the HTML
    catalog) when prose/objectives change.
18. Commit per fixed scenario, **no Claude co-author line**, using
    `git -c commit.gpgsign=false commit`. Then
    `git fetch -q origin && git rebase -q origin/main && git push -q
    origin HEAD`. Engine changes (OpenRA-Rust) commit+push separately;
    rebuild the wheel (`maturin develop --release`) if the engine
    changed and re-run the affected scenario.
19. Record a "Per-scenario closer look — #N" note in
    `SCENARIO_QUALITY.md` (the defect found, the fix, the
    easy/medium/hard outcomes + verdict).

## Reference: defect patterns already seen

- Win condition doesn't enforce the stated capability (laziest play
  wins). Seen in #1, #2.
- `reach_region`/single-region where a *split* or *ordered* visit is
  intended. Seen in #1, #2.
- Missing `fail_condition` ⇒ non-win == draw. Seen in #1, #2.
- `after_ticks` fail unreachable within `max_turns` (tick/turn
  misalignment). Seen in #2.
- Relative objective too literal ("NE corner" → map extreme), region
  inset & unreachable blind, landmark fogged ⇒ unfair. Seen in #1 hard.
- `base_map` at Level level silently ignored. Seen in #2 hard.
- Actors placed off the resolved map ⇒ engine panic in the interrupt
  path. Seen in #2 hard.
- Bench advertises a tool the engine can't execute (1:1 parity). Seen
  in capture_actor / S8.
- Inert deadline (`within_ticks` ≫ optimal) ⇒ no anti-stall teeth.
  Seen in #1 easy, #2 easy (acceptable for *easy* only).
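
Some of these patterns are mechanical enough to lint before any model run. The sketch below checks three of them on a level loaded as a plain dictionary. Everything about the data shape is an assumption for illustration: the `actors` list with `x`/`y`/`type` keys, the zero-based rectangular bounds, and the `lint_level` helper do not come from the benchmark code, and the real schema may differ.

```python
def lint_level(level, map_width, map_height):
    """Return human-readable problems for one level dict (assumed shape)."""
    problems = []

    # A level without a fail condition turns every non-win into a draw.
    if not level.get("fail_condition"):
        problems.append("missing fail_condition: a non-win would be a draw")

    # base_map is only honoured under `overrides:`, not at level scope.
    if "base_map" in level:
        problems.append("level-scope base_map is ignored: move it under overrides")

    # Every actor must sit inside the map (bounds assumed to start at 0).
    for actor in level.get("actors", []):
        x, y = actor["x"], actor["y"]
        if not (0 <= x < map_width and 0 <= y < map_height):
            name = actor.get("type", "?")
            problems.append(f"actor {name} at ({x}, {y}) is off the map")

    return problems


# Placeholder level: "e1" and the map name are illustrative values only.
bad_level = {
    "base_map": "some-map",
    "actors": [{"type": "e1", "x": 70, "y": 5}],
}
for problem in lint_level(bad_level, map_width=64, map_height=64):
    print(problem)
```

A check like this does not replace the engine smoke test in item 8, since it cannot know the playable bounds the engine actually enforces; it only catches the cheap cases earlier.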