Commit 34333cc by yxc20098 · 1 Parent(s): ee008d0

feat(scenario): build-sequence-tech-fastest — fastest weap-tech BO (PlanBench cost-optimal anchor)

openra_bench/scenarios/packs/build-sequence-tech-fastest.yaml ADDED
@@ -0,0 +1,235 @@
+ # build-sequence-tech-fastest.yaml
+ #
+ # REASONING capability — Wave-7 build-order optimization (cost-OPTIMAL
+ # planning). The agent must reach `weap` (war factory) within the
+ # tightest possible tick budget by choosing the correct prerequisite
+ # path: powr → proc → weap. Any extra structure (a barracks/tent
+ # detour, an unneeded second power plant, idle stalling) overruns the
+ # deadline.
+ #
+ # Engine-verified tech tree (vendor/OpenRA/mods/ra/rules/structures.yaml):
+ #   - POWR cost 300, Prerequisites: <none> (provides anypower)
+ #   - PROC cost 1400, Prerequisites: anypower (needs powr)
+ #   - WEAP cost 2000, Prerequisites: proc (needs proc)
+ # Total credits on the optimal path = 3700; starting_cash 5000 leaves
+ # small slack but does NOT cover a wasted tent (500) AND meet the
+ # tightest deadline. The Wave-2 `then:` happened-before composite is
+ # the load-bearing teeth — clauses [powr, proc, weap] latch in order;
+ # a policy that places weap before proc cannot satisfy the chain
+ # (engine refuses too — weap's prereq is proc).
+ #
+ # Measured optimal timing (rush-hour-arena, fact pre-placed at
+ # (10,18), seed 1, scripted intended policy):
+ #   - powr completes ≈ tick 273 (turn 3)
+ #   - proc completes ≈ tick 1263 (turn 14)
+ #   - weap completes ≈ tick 2613 (turn 29)
+ # Measured WRONG-PATH timing (powr → tent → proc → weap):
+ #   - weap completes ≈ tick 3063 (turn 34) — 5 turns / 450 ticks
+ #     slower than optimal. The deadline must fall INSIDE this gap.
+ #
+ # Bar (CLAUDE.md):
+ #   - stall (observe-only) ⇒ LOSS on every level + seed
+ #   - build-tent-first wrong path ⇒ LOSS on every level + seed
+ #   - intended powr→proc→weap path ⇒ WIN on every level + seed
+ # Real LOSS not DRAW: fail_condition `after_ticks: T+1` reachable
+ # inside max_turns (engine ~90 ticks/turn ⇒ tick ≤ 93+90·(N-1)). The
+ # pre-placed enemy `fact` at the far east is a MustBeDestroyed
+ # landmark that keeps the episode alive (no premature engine
+ # auto-done from eliminating a stray sentry).
+ #
+ # Real-world anchor:
+ #   - PlanBench cost-optimal planning (find the minimum-cost plan
+ #     that achieves the goal, not just A plan)
+ #   - Manufacturing BOM-optimal ramp / critical-path scheduling
+ #     (build only what the next stage requires; do not bloat the
+ #     bill of materials)
+ #
+ # Validate:
+ #   cd /Users/berta/Projects/OpenRA-Bench && \
+ #   python3 -m pytest tests/test_build_sequence_tech_fastest.py -q
+
+ meta:
+   id: build-sequence-tech-fastest
+   title: 'Fastest War Factory — Cost-Optimal powr → proc → weap Build Order'
+   capability: reasoning
+   real_world_meaning: >
+     Cost-optimal build-order planning under a tight deadline: the agent
+     must reach the war factory (`weap`) on the shortest prerequisite
+     path (powr → proc → weap). Any detour through unneeded structures
+     (a barracks, a second power plant, an early infantry training
+     queue) bloats the bill-of-materials and overruns the budget. Tests
+     that the model can plan the minimum-cost prerequisite chain — not
+     merely SOME plan that eventually arrives — under a deadline that
+     only the optimal plan satisfies.
+   robotics_analogue: >
+     Critical-path planning in autonomous manufacturing: a cell must
+     bring a target machine online by a fixed cycle-time, choosing the
+     minimum set of upstream stations to commission first (power →
+     feedstock → assembly). Adding non-load-bearing stations to the
+     ramp-up plan (a non-required quality station before assembly)
+     blows the deadline; only the cost-optimal precedence chain meets
+     spec.
+   benchmark_anchor:
+     - "PlanBench cost-optimal"
+     - "BOM manufacturing"
+   author: openra-bench
+
+ base_map: rush-hour-arena
+
+ base:
+   agent:
+     faction: allies
+   enemy:
+     faction: soviet
+     bot_type: ''
+   tools:
+     - observe
+     - build
+     - place_building
+   planning: true
+   termination:
+     max_ticks: 40000
+
+ levels:
+   # ── EASY ─────────────────────────────────────────────────────────
+   # Bare cost-optimal skill. Generous T = 3000 ticks (max_turns 40 →
+   # reachable 3603). Optimal path lands at ~tick 2613 (387-tick / 4-
+   # turn buffer). The wrong-path detour through tent (+500 cost, +5
+   # turns) finishes at ~tick 3063, beyond T ⇒ LOSS. Stall finishes
+   # never ⇒ LOSS on the after_ticks fail clause.
+   easy:
+     description: >
+       Build a war factory (weap) as fast as possible by following the
+       ONLY cost-optimal prerequisite chain: powr → proc → weap. Any
+       detour (a barracks/tent, a redundant power plant, an early
+       infantry training queue) wastes the budget and you LOSE on the
+       clock. The `then:` chain enforces the exact order — placing
+       weap before proc cannot satisfy it (and the engine refuses too:
+       weap's prerequisite is proc). Optimal play finishes by tick
+       ~2613; the deadline is 3000.
+     starting_cash: 5000
+     overrides:
+       actors:
+         # Agent base seed — ONE construction yard. Nothing else
+         # pre-placed (no power, no refinery). The optimal chain MUST
+         # be executed by the agent.
+         - {type: fact, owner: agent, position: [10, 18]}
+         # Two ore patches in the near-base build radius — a built
+         # proc auto-spawns a harvester that needs ore to fund the
+         # weap purchase inside the tick budget.
+         - {type: mine, owner: neutral, position: [22, 18]}
+         - {type: mine, owner: neutral, position: [22, 22]}
+         # Far-east enemy `fact` landmark — MustBeDestroyed, unarmed
+         # neutral company. Keeps the episode alive so a stall really
+         # times out (not engine auto-done from a stray sentry kill).
+         - {type: fact, owner: enemy, position: [115, 30]}
+     win_condition:
+       all_of:
+         - then:
+             id: bo-easy
+             clauses:
+               - {has_building: powr}
+               - {has_building: proc}
+               - {has_building: weap}
+         - {within_ticks: 3000}
+     fail_condition:
+       any_of:
+         - {after_ticks: 3001}
+         - {not: {building_count_gte: {type: fact, n: 1}}}
+     max_turns: 40
+
+   # ── MEDIUM ───────────────────────────────────────────────────────
+   # +1 controlled variable: TIGHTER deadline. T = 2800 ticks
+   # (max_turns 35 → reachable 3153). Optimal play lands at ~tick
+   # 2613 (187-tick / ~2-turn buffer — feasible). The wrong-path
+   # detour through tent overruns hard (3063 > 2800 by 263 ticks,
+   # ~3 turns). No additional pieces — the SAME cost-optimal chain,
+   # executed with less slack.
+   medium:
+     description: >
+       Build a war factory (weap) on the cost-optimal prerequisite
+       chain: powr → proc → weap. Tighter deadline (2800 ticks) — any
+       detour (tent / second powr / infantry queue) makes you miss.
+       The `then:` chain enforces the exact order; weap before proc
+       cannot satisfy it. Optimal play finishes by tick ~2613.
+     starting_cash: 5000
+     overrides:
+       actors:
+         - {type: fact, owner: agent, position: [10, 18]}
+         - {type: mine, owner: neutral, position: [22, 18]}
+         - {type: mine, owner: neutral, position: [22, 22]}
+         - {type: fact, owner: enemy, position: [115, 30]}
+     win_condition:
+       all_of:
+         - then:
+             id: bo-medium
+             clauses:
+               - {has_building: powr}
+               - {has_building: proc}
+               - {has_building: weap}
+         - {within_ticks: 2800}
+     fail_condition:
+       any_of:
+         - {after_ticks: 2801}
+         - {not: {building_count_gte: {type: fact, n: 1}}}
+     max_turns: 35
+
+   # ── HARD ─────────────────────────────────────────────────────────
+   # +1 controlled variable: ≥2 spawn_point groups (NORTH y=14 vs
+   # SOUTH y=26 base). Same cost-optimal chain, same tight T = 2800.
+   # The seed-varied spawn means a memorised "place powr at (14,22)"
+   # opening cannot generalise — the agent must compute placement
+   # relative to its actual fact each seed. Ore patches duplicated
+   # at both latitudes so harv income is symmetric per spawn. Enemy
+   # actors do NOT honour spawn_point (CLAUDE.md), so the lone
+   # enemy `fact` always places.
+   hard:
+     description: >
+       Build a war factory (weap) on the cost-optimal prerequisite
+       chain: powr → proc → weap, from a seed-chosen base (NORTH or
+       SOUTH). Tight 2800-tick deadline — detours (tent / extra
+       powr / infantry queue) lose on the clock. Placement that
+       memorises one spawn's geometry cannot generalise; compute
+       placement relative to your actual fact each run.
+     starting_cash: 5000
+     overrides:
+       actors:
+         # NORTH spawn (spawn_point 0): fact at y=14, with adjacent
+         # ore patches at y=14/y=18.
+         - {type: fact, owner: agent, position: [10, 14], spawn_point: 0}
+         # An inert rifleman per spawn group (passive: stance 2 = Defend,
+         # no `move_units` / `attack_unit` exposed so the unit cannot act
+         # — the agent's tool surface is build-only). Establishes a
+         # seed-varying AGENT UNIT in `units_summary` so the hard-tier
+         # spawn-variation contract (tests/test_hard_tier.py::
+         # test_curated_hard_still_compiles_and_runs, which inspects
+         # units not buildings) is satisfied with real per-spawn data.
+         - {type: e1, owner: agent, position: [12, 14], spawn_point: 0, stance: 2}
+         # SOUTH spawn (spawn_point 1): fact at y=26, with adjacent
+         # ore patches at y=22/y=26.
+         - {type: fact, owner: agent, position: [10, 26], spawn_point: 1}
+         - {type: e1, owner: agent, position: [12, 26], spawn_point: 1, stance: 2}
+         # Ore patches duplicated at BOTH latitudes so harv income is
+         # symmetric whichever spawn is chosen. (Neutral actors have
+         # no spawn_point and always place — that's fine: the unused
+         # patches are simply ignored.)
+         - {type: mine, owner: neutral, position: [22, 14]}
+         - {type: mine, owner: neutral, position: [22, 18]}
+         - {type: mine, owner: neutral, position: [22, 22]}
+         - {type: mine, owner: neutral, position: [22, 26]}
+         # Far-east enemy fact landmark — keeps the episode alive.
+         - {type: fact, owner: enemy, position: [115, 30]}
+     win_condition:
+       all_of:
+         - then:
+             id: bo-hard
+             clauses:
+               - {has_building: powr}
+               - {has_building: proc}
+               - {has_building: weap}
+         - {within_ticks: 2800}
+     fail_condition:
+       any_of:
+         - {after_ticks: 2801}
+         - {not: {building_count_gte: {type: fact, n: 1}}}
+     max_turns: 35
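
The deadline arithmetic quoted in the pack's comments can be checked in a few lines. The sketch below is not part of the commit: it assumes the "~90 ticks/turn" rule and the measured completion ticks stated in the YAML header, and the helper names `reachable_tick` and `deadline_separates` are hypothetical.

```python
# Sketch only: reproduces the tick-budget arithmetic quoted in the pack comments.
# The helper names are hypothetical and do not exist in openra_bench.

OPTIMAL_WEAP_TICK = 2613      # measured: powr -> proc -> weap (seed 1)
WRONG_PATH_WEAP_TICK = 3063   # measured: powr -> tent -> proc -> weap

# (deadline T in ticks, max_turns) per level, copied from the YAML above.
LEVELS = {"easy": (3000, 40), "medium": (2800, 35), "hard": (2800, 35)}


def reachable_tick(max_turns: int) -> int:
    """Last tick reachable inside max_turns, per the 93 + 90*(N-1) rule."""
    return 93 + 90 * (max_turns - 1)


def deadline_separates(deadline: int) -> bool:
    """True when the optimal path meets the deadline and the tent detour misses it."""
    return OPTIMAL_WEAP_TICK <= deadline < WRONG_PATH_WEAP_TICK


for name, (deadline, max_turns) in LEVELS.items():
    # The fail clause fires at deadline + 1, so that tick must be reachable.
    assert deadline + 1 <= reachable_tick(max_turns)
    assert deadline_separates(deadline)
    print(name, "buffer:", deadline - OPTIMAL_WEAP_TICK,
          "overrun:", WRONG_PATH_WEAP_TICK - deadline)
```

On these numbers the optimal path keeps a 387-tick buffer on easy and 187 ticks on medium and hard, while the tent detour overruns by 63 and 263 ticks respectively.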
tests/test_build_sequence_tech_fastest.py ADDED
@@ -0,0 +1,342 @@
+ """build-sequence-tech-fastest pack — full no-cheat validation on Rust.
+
+ Wave-7 REASONING — cost-optimal build-order planning. The agent must
+ reach the war factory (`weap`) on the SHORTEST prerequisite chain:
+
+     powr → proc → weap
+
+ Any detour (build a barracks/tent first, or a redundant power plant,
+ or an early infantry queue) overruns the tight tick budget and loses.
+ The chain is enforced by the Wave-2 `then:` happened-before composite;
+ the deadline (`within_ticks`) is the cost-optimality teeth — slack is
+ tuned so the OPTIMAL plan fits and the tent-detour plan does NOT.
+
+ Bar (CLAUDE.md): the intended cost-optimal policy WINS on every
+ (level, seed); stall and the tent-first wrong-path policy LOSE on
+ every (level, seed). Real LOSS not DRAW — `fail after_ticks:T+1`
+ inside max_turns is the bite.
+
+ Scenario shape:
+   - rush-hour-arena, allies vs soviet (bot disabled).
+   - easy: T=3000, max_turns=40 — generous (4-turn buffer).
+   - medium: T=2800, max_turns=35 — tight (≈2-turn buffer).
+   - hard: T=2800, max_turns=35 — same tight T + ≥2 spawn_point
+     groups (NORTH y=14 / SOUTH y=26 base, round-robined).
+
+ Measured optimal timing (seed 1, scripted intended policy):
+   powr completes ≈ tick 273 (turn 3)
+   proc completes ≈ tick 1263 (turn 14)
+   weap completes ≈ tick 2613 (turn 29)
+ Measured tent-first wrong-path timing:
+   weap completes ≈ tick 3063 (turn 34) — beyond every level's T.
+ """
+
+ from __future__ import annotations
+
+ import pytest
+
+ pytest.importorskip("openra_train", reason="Rust env wheel not installed")
+ pytest.importorskip("openra_rl_training", reason="Rust env wheel not installed")
+
+ from openra_bench.eval_core import run_level
+ from openra_bench.scenarios import load_pack
+ from openra_bench.scenarios.loader import PACKS_DIR, compile_level
+
+ PACK = PACKS_DIR / "build-sequence-tech-fastest.yaml"
+ LEVELS = ("easy", "medium", "hard")
+ SEEDS = (1, 2, 3, 4)
+
+
+ # ── Policies ──────────────────────────────────────────────────────
+
+
+ def _stall_policy():
+     """Do nothing — must LOSE on the clock on every level/seed."""
+     def pol(obs, Cmd):
+         return [Cmd.observe()]
+     return pol
+
+
+ def _intended_policy():
+     """Cost-optimal play: build powr → proc → weap, each one placed
+     relative to the agent's actual fact (so the policy generalises
+     across the hard-tier spawn variation). This is the policy the
+     pack is solvable by — must WIN on every (level, seed)."""
+     milestone = {"powr": False, "proc": False, "weap": False}
+
+     def pol(obs, Cmd):
+         ob = obs.get("own_buildings", []) or []
+         own_b = {b["type"] for b in ob}
+         prod = obs.get("production", []) or []
+         for b in ("powr", "proc", "weap"):
+             if b in own_b:
+                 milestone[b] = True
+         cmds = []
+         base = [b for b in ob if b["type"] == "fact"]
+         if not milestone["powr"]:
+             if "powr" not in prod:
+                 cmds.append(Cmd.build("powr"))
+             if base:
+                 cmds.append(Cmd.place_building(
+                     "powr", base[0]["cell_x"] + 4, base[0]["cell_y"]
+                 ))
+         elif not milestone["proc"]:
+             if "proc" not in prod:
+                 cmds.append(Cmd.build("proc"))
+             if base:
+                 cmds.append(Cmd.place_building(
+                     "proc", base[0]["cell_x"] + 6, base[0]["cell_y"] + 3
+                 ))
+         elif not milestone["weap"]:
+             if "weap" not in prod:
+                 cmds.append(Cmd.build("weap"))
+             if base:
+                 cmds.append(Cmd.place_building(
+                     "weap", base[0]["cell_x"] + 8, base[0]["cell_y"]
+                 ))
+         if not cmds:
+             cmds.append(Cmd.observe())
+         return cmds
+     return pol
+
+
+ def _tent_first_policy():
+     """Wrong cost-non-optimal play: powr → tent → proc → weap. The
+     tent is not on the prerequisite chain for weap (only proc is); it
+     bloats the BOM by 500 credits and ~5 turns. Must LOSE on the
+     clock on every level/seed."""
+     milestone = {"powr": False, "tent": False, "proc": False, "weap": False}
+
+     def pol(obs, Cmd):
+         ob = obs.get("own_buildings", []) or []
+         own_b = {b["type"] for b in ob}
+         prod = obs.get("production", []) or []
+         for b in ("powr", "tent", "proc", "weap"):
+             if b in own_b:
+                 milestone[b] = True
+         cmds = []
+         base = [b for b in ob if b["type"] == "fact"]
+         if not milestone["powr"]:
+             if "powr" not in prod:
+                 cmds.append(Cmd.build("powr"))
+             if base:
+                 cmds.append(Cmd.place_building(
+                     "powr", base[0]["cell_x"] + 4, base[0]["cell_y"]
+                 ))
+         elif not milestone["tent"]:
+             if "tent" not in prod:
+                 cmds.append(Cmd.build("tent"))
+             if base:
+                 cmds.append(Cmd.place_building(
+                     "tent", base[0]["cell_x"] + 4, base[0]["cell_y"] + 3
+                 ))
+         elif not milestone["proc"]:
+             if "proc" not in prod:
+                 cmds.append(Cmd.build("proc"))
+             if base:
+                 cmds.append(Cmd.place_building(
+                     "proc", base[0]["cell_x"] + 6, base[0]["cell_y"] + 3
+                 ))
+         elif not milestone["weap"]:
+             if "weap" not in prod:
+                 cmds.append(Cmd.build("weap"))
+             if base:
+                 cmds.append(Cmd.place_building(
+                     "weap", base[0]["cell_x"] + 8, base[0]["cell_y"]
+                 ))
+         if not cmds:
+             cmds.append(Cmd.observe())
+         return cmds
+     return pol
+
+
+ # ── Pack-shape tests (cheap; do not run the engine) ──────────────
+
+
+ def test_pack_compiles_with_three_levels():
+     pack = load_pack(PACK)
+     assert pack.meta.id == "build-sequence-tech-fastest"
+     assert pack.meta.capability == "reasoning"
+     assert set(pack.levels) == {"easy", "medium", "hard"}
+
+
+ def test_meta_benchmark_anchor_set():
+     """Required by the seed taxonomy: PlanBench cost-optimal +
+     BOM manufacturing critical-path planning."""
+     pack = load_pack(PACK)
+     anchors = pack.meta.benchmark_anchor or []
+     assert any("PlanBench" in a for a in anchors), anchors
+     assert any("BOM" in a for a in anchors), anchors
+
+
+ def test_hard_tier_has_seed_driven_spawn_groups():
+     """Hard must define ≥2 agent spawn_point groups so seed varies
+     the start base (tests/test_hard_tier.py::UPGRADED contract)."""
+     c = compile_level(load_pack(PACK), "hard")
+     sp = {a.spawn_point for a in c.scenario.actors if a.owner == "agent"}
+     assert len(sp) >= 2, f"hard needs ≥2 spawn groups, got {sp}"
+
+
+ def test_every_level_has_fail_condition():
+     """No silent draws — every level must be able to emit a LOSS."""
+     pack = load_pack(PACK)
+     for lvl in LEVELS:
+         c = compile_level(pack, lvl)
+         assert c.fail_condition is not None, f"{lvl} missing fail_condition"
+
+
+ def test_then_composite_used_in_win():
+     """Confirms the 3-step build-order chain is wired through to the
+     compiled win condition — the load-bearing teeth of this pack."""
+     for lvl in LEVELS:
+         c = compile_level(load_pack(PACK), lvl)
+         win = c.win_condition.model_dump(exclude_none=True)
+         inner = win.get("all_of") or []
+         assert any("then" in cl for cl in inner), (
+             f"{lvl} win missing then-chain: {win}"
+         )
+         for cl in inner:
+             if "then" in cl:
+                 clauses = (cl["then"] or {}).get("clauses") or []
+                 assert len(clauses) == 3, (
+                     f"{lvl} then-chain must be powr→proc→weap (3 clauses); "
+                     f"got {clauses}"
+                 )
+                 # And in the exact engine-enforced prereq order.
+                 assert clauses[0].get("has_building") == "powr"
+                 assert clauses[1].get("has_building") == "proc"
+                 assert clauses[2].get("has_building") == "weap"
+
+
+ def test_tick_budget_aligned_with_max_turns():
+     """within_ticks must be reachable inside max_turns. Engine
+     advances ~90 ticks/turn → reachable max = 93 + 90·(N-1)."""
+     pack = load_pack(PACK)
+     for lvl in LEVELS:
+         level_def = pack.levels[lvl]
+         max_turns = level_def.max_turns
+         reachable = 93 + 90 * (max_turns - 1)
+         win = compile_level(pack, lvl).win_condition.model_dump(exclude_none=True)
+
+         def _collect(node, key, out):
+             if isinstance(node, dict):
+                 if key in node:
+                     out.append(node[key])
+                 for v in node.values():
+                     _collect(v, key, out)
+             elif isinstance(node, list):
+                 for v in node:
+                     _collect(v, key, out)
+         wts = []
+         _collect(win, "within_ticks", wts)
+         assert wts, f"{lvl} has no within_ticks leaf (no clock teeth)"
+         for wt in wts:
+             assert wt <= reachable, (
+                 f"{lvl} within_ticks={wt} > reachable={reachable} "
+                 f"(max_turns={max_turns}) — deadline never bites ⇒ draw"
+             )
+
+
+ # ── Engine-bound tests (parameterised over seeds 1..4) ────────────
+
+
+ @pytest.mark.parametrize("seed", SEEDS)
+ @pytest.mark.parametrize("level", LEVELS)
+ def test_intended_cost_optimal_policy_wins(level, seed):
+     """The intended cost-optimal play (powr → proc → weap) must WIN
+     on every (level, seed). This is the load-bearing test that the
+     pack is solvable inside the budget by the advertised capability."""
+     c = compile_level(load_pack(PACK), level)
+     res = run_level(c, _intended_policy(), seed=seed)
+     tp = getattr(res.signals, "then_progress", {}) or {}
+     assert res.outcome == "win", (
+         f"intended cost-optimal must WIN on {level} s={seed}; "
+         f"got {res.outcome} (tick={res.signals.game_tick}, "
+         f"then_progress={tp}, "
+         f"own_buildings={res.signals.own_building_types})"
+     )
+
+
+ @pytest.mark.parametrize("seed", SEEDS)
+ @pytest.mark.parametrize("level", LEVELS)
+ def test_stall_loses(level, seed):
+     """A do-nothing policy must LOSE on every (level, seed). The
+     fail_condition's after_ticks clause bites at the budget; never
+     a draw."""
+     c = compile_level(load_pack(PACK), level)
+     res = run_level(c, _stall_policy(), seed=seed)
+     assert res.outcome == "loss", (
+         f"stall must LOSE on {level} s={seed}; got {res.outcome} "
+         f"(tick={res.signals.game_tick})"
+     )
+
+
+ @pytest.mark.parametrize("seed", SEEDS)
+ @pytest.mark.parametrize("level", LEVELS)
+ def test_tent_first_wrong_path_loses(level, seed):
+     """The cost-non-optimal tent-first play must LOSE on every
+     (level, seed). The tent detour adds ~500 credits + ~5 turns,
+     pushing weap completion to ~tick 3063 — beyond every level's
+     deadline. The capability being measured is COST-OPTIMAL
+     planning; a 'some plan that arrives' policy must not win."""
+     c = compile_level(load_pack(PACK), level)
+     res = run_level(c, _tent_first_policy(), seed=seed)
+     tp = getattr(res.signals, "then_progress", {}) or {}
+     assert res.outcome == "loss", (
+         f"tent-first wrong-path must LOSE on {level} s={seed}; got "
+         f"{res.outcome} (tick={res.signals.game_tick}, "
+         f"then_progress={tp}, own_buildings={res.signals.own_building_types})"
+     )
+
+
+ @pytest.mark.parametrize("seed", SEEDS)
+ def test_hard_seeds_produce_distinct_starts(seed):
+     """Smoke test for hard's spawn groups: on every seed a stall must
+     lose and the first observation must contain an agent fact. That
+     seeds land on DIFFERENT cells is asserted separately in
+     test_hard_spawns_round_robin_across_seeds."""
+     c = compile_level(load_pack(PACK), "hard")
+     captured = {"first_obs": None}
+
+     def probe(obs, Cmd):
+         if captured["first_obs"] is None:
+             captured["first_obs"] = list(obs.get("own_buildings", []) or [])
+         return [Cmd.observe()]
+
+     res = run_level(c, probe, seed=seed)
+     assert res.outcome == "loss"  # stall must lose
+     facts = [
+         (b["cell_x"], b["cell_y"])
+         for b in (captured["first_obs"] or [])
+         if b["type"] == "fact"
+     ]
+     assert facts, f"no fact observed at turn 0 for seed={seed}"
+
+
+ def test_hard_spawns_round_robin_across_seeds():
+     """Two seeds (1 and 2) must place the agent's fact at DIFFERENT
+     cells — proves the spawn_point round-robin is active, not
+     degenerate."""
+     c = compile_level(load_pack(PACK), "hard")
+
+     def probe():
+         captured = {}
+         def pol(obs, Cmd):
+             if "fact_pos" not in captured:
+                 bs = obs.get("own_buildings", []) or []
+                 facts = [(b["cell_x"], b["cell_y"]) for b in bs if b["type"] == "fact"]
+                 if facts:
+                     captured["fact_pos"] = facts[0]
+             return [Cmd.observe()]
+         pol.captured = captured
+         return pol
+
+     p1 = probe(); run_level(c, p1, seed=1)
+     p2 = probe(); run_level(c, p2, seed=2)
+     pos1 = p1.captured.get("fact_pos")
+     pos2 = p2.captured.get("fact_pos")
+     assert pos1 and pos2, f"missing fact obs: s1={pos1} s2={pos2}"
+     assert pos1 != pos2, (
+         f"hard spawn round-robin is degenerate: seed 1 and 2 both "
+         f"started at {pos1}"
+     )
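
The tests above lean on the `then:` composite latching its clauses in order. The commit does not show the evaluator itself, so the following is only a sketch of one plausible reading: `ThenChain` and `has_building` are hypothetical names, `own_buildings` is simplified to a list of type strings (the real observation carries dicts with a `type` key), and the rule that at most one clause latches per update is an assumption.

```python
# Sketch only: one plausible reading of a `then:` happened-before composite.
# ThenChain and has_building are hypothetical; the real evaluator is not shown
# in this commit and may differ.
from typing import Callable, Dict, List

Clause = Callable[[Dict], bool]


class ThenChain:
    """Latches clauses strictly in order (assumed: one latch per update)."""

    def __init__(self, clauses: List[Clause]) -> None:
        self.clauses = clauses
        self.progress = 0  # number of clauses latched so far

    def update(self, obs: Dict) -> bool:
        # Only the next unlatched clause is examined, so a later clause
        # becoming true early never advances the chain.
        if self.progress < len(self.clauses) and self.clauses[self.progress](obs):
            self.progress += 1
        return self.progress == len(self.clauses)


def has_building(kind: str) -> Clause:
    """Clause that holds when `kind` appears in the simplified observation."""
    return lambda obs: kind in obs.get("own_buildings", ())


chain = ThenChain([has_building(k) for k in ("powr", "proc", "weap")])
for snapshot in (["fact"], ["fact", "powr"], ["fact", "powr", "proc"],
                 ["fact", "powr", "proc", "weap"]):
    done = chain.update({"own_buildings": snapshot})
print(done)  # True: the clauses latched in powr, proc, weap order
```

Under this reading, an observation that shows `weap` before `powr` has latched leaves `progress` at 0, which is the ordering property `test_then_composite_used_in_win` relies on.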
tests/test_hard_tier.py CHANGED
@@ -171,6 +171,16 @@ UPGRADED = [
      # flips per seed (an off-axis diagonal busts the tick budget
      # and brushes the wrong-corner patrol).
      "mfb-base-1-defend-base-2-build",
+     # Wave-7 Group B reasoning pack — greedy 3-base macro against a
+     # deadline (SC2 3-base macro / MicroRTS expansion / industrial
+     # site expansion anchor). Hard tier defines two agent spawn_point
+     # groups (NORTH base layout y≈20 / SOUTH base layout y≈50)
+     # round-robined by seed; the win clause accepts EITHER candidate
+     # far-east region ((90,20) or (90,50)) so the agent must place
+     # the 3rd proc in line with their actual base latitude. A
+     # memorised "place at (90,20)" generalises to NORTH but mis-places
+     # on SOUTH.
+     "mfb-third-base-against-clock",
      # Wave-4 TURTLE node of the tech triple (SC2 turtle macro /
      # military fortify-before-research doctrine anchor). Hard defines
      # two agent spawn_point groups (NORTH base / SOUTH base) so the
@@ -409,6 +419,20 @@ UPGRADED = [
      # y=20 so either spawn faces the same flank-vs-frontal decision
      # from a flipped bearing, and no memorised opening generalises.
      "combat-flanking-attack",
+     # Wave-7 combat-formation pack: military tank-wedge doctrine /
+     # SC2 formation micro / combined-arms anchor. The agent commands
+     # 5× 2tnk and must arrange them in a WEDGE (apex + 2 flankers
+     # per side spread across y=18..22) before contacting an eastern
+     # cluster (4-5× e3 + 1-2× 1tnk at x=84..86). A COLUMN (single-
+     # file east on y=20) concentrates incoming Dragon fire on the
+     # lead tank and bleeds the survival bar (own_units_gte:4 fails
+     # when 2+ tanks lost); the WEDGE spreads return fire across the
+     # formation and clears the cluster intact. Hard defines two agent
+     # spawn_point groups (NORTH staging y=12..16 / SOUTH staging
+     # y=24..28) round-robined by seed; the central cluster is
+     # symmetric across y=20 so either spawn faces an equivalent
+     # column-vs-wedge decision and no memorised opening generalises.
+     "combat-formation-tank-wedge",
      # Wave-6 perception pack — early-warning intrusion detection
      # paired with targeted intercept (SC2 early-warn scout /
      # NORAD early-warning / IDS / military reconnaissance-in-force
@@ -420,6 +444,105 @@ UPGRADED = [
      # generalises. A memorised "send scout to (40,10) + tanks to
      # (45,10)" opening cannot generalise across seeds.
      "scout-detect-incoming-army",
+     # Wave-7 ACTION econ-defense pack — convoy / supply-line protection
+     # (SC2 harass defense / military convoy protection / supply-line
+     # doctrine anchor). A single harv commutes proc↔mine on a long
+     # exposed route; raider 2tnks specifically target the harv.
+     # Defenders at base never engage (raider intercepts harv beyond
+     # base sight); intended play is to move escorts east to intercept
+     # on the route. Hard tier defines two agent spawn_point groups
+     # (NORTH route y=14 / SOUTH route y=26) round-robined by seed;
+     # symmetric north + south raider waves always place (enemy actors
+     # don't honour spawn_point — CLAUDE.md), so each spawn defends
+     # its OWN supply lane and a memorised opening cannot generalise.
+     "econ-protect-harvester-route",
+     # Wave-7 Group D reasoning pack — rock-paper-scissors hard-counter
+     # selection (SC2 hard-counter doctrine / military RPS counter /
+     # capability-based defense procurement anchor). Cash $2550 funds
+     # EITHER 3× 2tnk (the right counter to pure-infantry enemy) OR
+     # 8× e3 (wrong counter — anti-tank rockets vs soft targets) OR
+     # 25× e1 (1:1 attrition match). Hard tier defines two agent
+     # spawn_point groups (NORTH base y=12 / SOUTH base y=28) round-
+     # robined by seed; the centre infantry cluster always places at
+     # (70,20) (enemy actors don't honour spawn_point — CLAUDE.md),
+     # so the composition decision is the same per seed but the lane
+     # the agent commits to flips per seed and a memorised opening
+     # cannot generalise.
+     "combat-vehicle-vs-infantry-counter",
+     # Wave-7 REASONING temporal-sequencing pack — SC2 timing-push
+     # window / PlanBench temporally-extended goal / cyber attack
+     # timing-window anchor. The `then:` happened-before composite
+     # enforces a SURVIVAL gate (own_units_gte:4 at T1) latching
+     # BEFORE the STRIKE gate (units_killed_gte:K within T2), so
+     # premature engagement and stalling both lose. Hard tier defines
+     # two agent spawn_point groups (NORTH staging y=12 / SOUTH
+     # staging y=28) round-robined by seed; the central enemy turtle
+     # cluster + tsla place every seed (enemy actors don't honour
+     # spawn_point — CLAUDE.md) and is symmetric across y=20, so
+     # both staging latitudes face the same survive-then-strike
+     # decision from a flipped approach axis.
+     "tp-survive-and-strike-at-window",
+     # Wave-7 REASONING pack: concentrated-defense topology — build a
+     # TIGHT CLUSTER of pillboxes around the high-value building (the
+     # agent fact). Hard tier defines 2 agent spawn_point groups
+     # (NORTH fact at y=14 / SOUTH fact at y=26) round-robined by seed;
+     # the cluster centre flips with the fact, so a memorised "cluster
+     # at (10,20)" plan cannot generalise. Enemies don't honour
+     # spawn_point (CLAUDE.md), so the rush band is staged at BOTH
+     # candidate latitudes — only the on-latitude band converges on
+     # the active fact, but it is heavy enough to overwhelm any
+     # defence that isn't a CLUSTER around the correct fact.
+     "build-defensive-tower-cluster",
+     # Wave-7 REASONING / RPS hard-counter pack (INVERSE of combat-
+     # vehicle-vs-infantry-counter) — SC2 hard-counter / anti-armor
+     # procurement / military RPS anchor. Starting cash ($1800) funds
+     # exactly ONE composition vs a pre-placed band of HEAVY tanks
+     # (3tnk on easy/medium, 4tnk Mammoths on hard); the agent must
+     # build e3 (rocket soldiers, anti-vehicle Dragon launcher) — not
+     # 1tnk (light tanks lose attrition to heavy armour, budget buys
+     # only ~2) and not e1 (no anti-armour weapon, kill bar fails).
+     # Hard tier defines two agent spawn_point groups (NORTH base
+     # y=12 / SOUTH base y=28) round-robined by seed; the heavy band
+     # is centred mid-latitude (y=20) so both spawns face symmetric
+     # pursuit geometry (enemy actors don't honour spawn_point —
+     # CLAUDE.md) and a memorised "build e3 at y=20" opening cannot
+     # generalise across seeds.
+     "combat-rocket-soldier-anti-vehicle",
+     # Wave-7 perimeter/firewall reasoning pack — ERQA spatial commit /
+     # MicroRTS defense placement / military perimeter (firewall rule
+     # placement) anchor. Sibling/inverse of def-tower-line-vs-cluster:
+     # that pack enforces CLUSTER at a single bottleneck cell (graph
+     # min-cut doctrine); this pack enforces a LINE across the corridor
+     # (one pbox per row spanning y=18..22 at x=60, radius 0.5 so only
+     # the exact rung cell counts). Hard tier defines two agent
+     # spawn_point groups (NORTH base y=12 / SOUTH base y=28) round-
+     # robined by seed; the rusher band is centred at y=20 and ALWAYS
+     # places (enemy actors don't honour spawn_point — CLAUDE.md), so
+     # the corridor LINE is identical across seeds but the agent's base
+     # bearing flips per seed and a memorised relative-to-base placement
+     # cannot generalise.
+     "build-defensive-tower-line",
+     # Wave-7 Group I REASONING — opening-phase build-order / power-grid
+     # bring-up sequencing (PlanBench task-ordering / SOP compliance /
+     # electrical-grid bring-up anchor). Hard tier defines two agent
+     # spawn_point groups (NORTH y=12 / SOUTH y=28) round-robined by
+     # seed; the pre-placed `fact` (and therefore the build radius and
+     # the placement coords for powr/proc) flips per seed, so a
+     # memorised "(20,20) opening" cannot generalise. An inert HoldFire
+     # `e1` per group surfaces the variation via units_summary (the
+     # pack would otherwise be building-only); no `move_units`/
+     # `attack_unit` tool is exposed so the e1 is functionally inert
+     # and does not interact with the SOP test.
+     "build-power-online-first",
+     # Wave-7 REASONING pack — cost-optimal build-order (powr → proc →
+     # weap) under a tight deadline (PlanBench cost-optimal / BOM-
+     # manufacturing critical-path anchor). Hard tier defines two agent
+     # spawn_point groups (NORTH base y=14 / SOUTH base y=26) round-
+     # robined by seed; ore patches are duplicated at both latitudes so
+     # harv income is symmetric per spawn. A memorised "place powr at
+     # (14,22)" opening cannot generalise — placement must be computed
+     # relative to the actual fact each seed.
+     "build-sequence-tech-fastest",
  ]
 
  # Consciously NOT spawn-varied, with the reason (keeps the curation