OpenRA-Bench / SCENARIO_CATALOG.md

yxc20098

Curate: remove the auto-generated cat-* corpus

e9d5097 about 2 months ago


	> RETIRED (curation decision). The auto-generated 200-level `cat-*`
	> corpus described below has been removed from the repo. It scored low
	> on discrimination/rigor (see `SCENARIO_QUALITY.md`) and added
	> memorization surface without distinct capability. The benchmark is now
	> a small curated, hand-authored set (Art-of-War, economy/building
	> planning, strategy, rush-hour, adversarial, strict-spec, perception,
	> reasoning), each designed deliberately one at a time with a
	> non-gameable win+fail and a named external-benchmark analogue. This
	> file is kept only as historical design notes.

	# OpenRA-Bench Scenario Catalog (research-grounded, 200 levels)

	Goal: a benchmark whose scenarios generalize to reasoning. We observed
	empirically that finetuning on the rush-hour scenario also lifted ERQA
	scores; the literature explains why, and how to make that transfer
	systematic.

	## Why this transfers (literature)

	- Game/spatial RL → planning transfer is real but axis-specific.
	lmgame-Bench (Hu et al., 2025, arXiv:2505.15146): RL on simplified
	Sokoban lifted Blocksworld-1D 17.9→32.7%, Blocksworld-2D 9.0→29.5%,
	WebShop 7.0→19.1% — but not GSM8K/SQL. Transfer targets are
	spatial / planning / embodied, not math.
	- RL not SFT. "Understanding Transferability of LLM Reasoning"
	(arXiv:2507.00432): RL generalizes (minimal representation drift);
	trajectory SFT catastrophically forgets. Use these scenarios with
	verifiable-reward RL, not trajectory SFT.
	- ERQA is reasoning-sensitive. ERQA (Gemini Robotics, DeepMind,
	arXiv:2503.20020): 400 MCQ over spatial / trajectory / action /
	state-estimation / multi-view / task reasoning; CoT alone moves it
	+4–6.5pts → a reasoning-shaping finetune (rush-hour) plausibly moves
	its spatial/trajectory/task subscores. Our suite is built to drive
	exactly those axes.
	- Anti-memorization is mandatory. SMAC→SMACv2 (arXiv:2212.07489):
	a timestep/ID open-loop policy beat SMAC → memorization, not
	reasoning. Procgen (arXiv:1912.01588), Crafter (arXiv:2109.06780),
	ARC-AGI (arXiv:1911.01547): procedurally varied, held-out instances;
	report the generalization gap. Distractor/decoy decorrelation of
	"obvious cue" vs "correct action" (shortcut-learning, arXiv:2412.05152).

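The lmgame-Bench figures quoted above can be restated as per-axis deltas, which is the form the transfer claim relies on. A minimal sketch: only the before/after percentages come from the text, while the dictionary layout and the name `per_axis_deltas` are illustrative assumptions.

```python
# Before/after scores (percent) quoted above for RL on simplified Sokoban
# (lmgame-Bench, arXiv:2505.15146). The data layout is an assumption.
QUOTED_SCORES = {
    "Blocksworld-1D": (17.9, 32.7),
    "Blocksworld-2D": (9.0, 29.5),
    "WebShop": (7.0, 19.1),
}


def per_axis_deltas(scores):
    """Return {axis: after - before} in percentage points, rounded to 0.1."""
    return {
        axis: round(after - before, 1)
        for axis, (before, after) in scores.items()
    }


if __name__ == "__main__":
    for axis, delta in per_axis_deltas(QUOTED_SCORES).items():
        print(f"{axis}: {delta:+.1f} pts")
```

Keeping one delta per axis, rather than a single average, makes it visible when a finetune moves planning tasks but leaves others flat.
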
	## Capability taxonomy → measurement

	Consolidated from the general-reasoning, embodied/spatial, and
	games-as-eval surveys. Metrics borrow canonical instruments: SPL
	(Anderson et al., 2018, arXiv:1807.06757) for "reach/find under fog";
	coverage-under-budget / time-to-find / region-commitment regret
	(Active Neural SLAM, Chaplot et al., ICLR 2020, arXiv:2004.05155;
	frontier exploration, Yamauchi 1997); Goal-Condition Success +
	path-length weighting (ALFRED, arXiv:1912.01734); predicate
	satisfaction (PlanBench, arXiv:2206.10498; BEHAVIOR-1K,
	arXiv:2403.09227); win/attrition under a difficulty ladder
	(TextStarCraft II, arXiv:2312.11865; SmartPlay, arXiv:2310.01557).
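
As a reference point for the SPL instrument named above, here is a minimal sketch of Success weighted by Path Length as defined by Anderson et al. (2018): the mean over episodes of `S_i * l_i / max(p_i, l_i)`. The `Episode` container and the choice of path-length unit (for example, cells travelled under fog) are assumptions for illustration, not part of the benchmark code.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool          # S_i: did the agent reach/find the target?
    shortest_path: float   # l_i: optimal path length (assumed > 0)
    agent_path: float      # p_i: path length the agent actually travelled


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if ep.shortest_path <= 0:
            raise ValueError("shortest_path must be positive")
        if ep.success:
            # A failed episode contributes 0; a success is discounted by detour.
            total += ep.shortest_path / max(ep.agent_path, ep.shortest_path)
    return total / len(episodes)


if __name__ == "__main__":
    runs = [
        Episode(True, 10, 10),   # optimal path
        Episode(True, 10, 20),   # success with a 2x detour
        Episode(False, 10, 5),   # failure
    ]
    print(spl(runs))
```
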

	| Cap | Dimension | Primary literature |
	|---|---|---|
	| PERC | spatial/state-estimation under partial obs | ERQA; SpatialVLM (2401.12168); VSR (TACL'23) |
	| FRONT | which-unexplored-region-to-commit | Active Neural SLAM; frontier exploration; exploration-aware EQA (2503.11117) |
	| PLAN | sequential planning under constraints | PlanBench; ALFWorld (2010.03768) |
	| TECH | precondition / tech-graph ordering | PlanBench (Mystery-Blocksworld); BEHAVIOR-1K BDDL |
	| ECON | resource allocation / multi-objective | PlanBench cost-optimal; SmartPlay |
	| COORD | multi-unit / multi-agent coordination | Watch-And-Help (2010.09890); SMAC |
	| RISK | risk assessment / replanning under partial info | PlanBench replanning; StarCraft II Arena (ICLR'25) |
	| TEMPO | timing / tempo windows | TextStarCraft II; SmartPlay |

	## The 200-level catalog (12 categories, ~67 three-level packs)

	Derived from the engine-grounded, end-to-end RA dissection (28 decision
	points; phases opening→scouting→economy→tech→army→engagement→defense→
	assault→endgame). Every win condition uses ONLY predicates defined in
	`win_conditions.py`, and every level runs on `rush-hour-arena` (the
	only Rust-loadable map until S10). The difficulty ladder makes the
	decision itself harder (less info, more decoys, tighter clock,
	committed-not-loose budget, attrition cap) and never just uses bigger
	numbers (the Procgen/SMACv2 rule).

	| # | Category | Cap | Levels | Win pattern (real grammar) | Status |
	|---|---|---|---|---|---|
	| C1 | Frontier Scouting | FRONT | 18 | `all_of[explored_pct_gte:P,(units_lost_lte:0),within_ticks:T]` | now |
	| C2 | Threat Enumeration | PERC | 18 | `all_of[enemies_discovered_gte:N / buildings_discovered_gte:K, units_lost_lte:0, within_ticks:T]` | now |
	| C3 | Tech Critical Path | TECH | 18 | `all_of[has_building:X, building_total_gte:N, power_surplus_gte:0, within_ticks:T]` | now |
	| C4 | Power-Budget Online | PLAN | 15 | `all_of[power_surplus_gte:0, building_total_gte:N, within_ticks:T]` | now |
	| C5 | Budget Allocation (spend) | ECON | 18 | `all_of[building_count_gte{A,n}, building_count_gte{B,n}, within_ticks:T]` | now |
	| C6 | Time-Boxed Capital Deploy | ECON | 18 | `all_of[own_units_gte:K, building_total_gte:M,(units_lost_lte:0),within_ticks:T]` | now |
	| C7 | Defensive-Direction Commit | PERC | 18 | `all_of[building_in_region{type:pbox,...,count:k}, within_ticks:T]` | now |
	| C8 | Base-Placement & Staging | PERC | 15 | `all_of[building_in_region{type:fact,...},(reach_region),within_ticks:T]` | now |
	| C9 | Commit-vs-Retreat | RISK | 18 | `all_of[units_killed_gte:N, units_lost_lte:L, within_ticks:T]` | now |
	| C10 | Force Coordination | COORD | 18 | `all_of[all_units_in_region{...}, units_lost_lte:0, within_ticks:T]` | now |
	| C11 | Tempo / Timing Window | TEMPO | 15 | `all_of[after_ticks:t0, units_killed_gte:N, within_ticks:T1]` | now |
	| C12 | Error Recovery / Replan | RISK | 15 | `all_of[has_building:X, building_total_gte:N, after_ticks:t0, within_ticks:T]` | now |
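
To make the `all_of[...]` win patterns concrete, the sketch below evaluates a conjunction of predicates against a snapshot of game state. It is not the contents of `win_conditions.py`: the nested-dict encoding, the state keys, and the inclusive tick comparisons are all assumptions chosen for illustration, and only the predicate names are taken from the table.

```python
# Hypothetical predicate table; only the predicate names come from the catalog.
PREDICATES = {
    "explored_pct_gte": lambda s, v: s["explored_pct"] >= v,
    "units_lost_lte": lambda s, v: s["units_lost"] <= v,
    "units_killed_gte": lambda s, v: s["units_killed"] >= v,
    "within_ticks": lambda s, v: s["tick"] <= v,   # inclusive bound assumed
    "after_ticks": lambda s, v: s["tick"] >= v,    # inclusive bound assumed
}


def evaluate(condition, state):
    """Evaluate {"all_of": [...]} or a single {predicate_name: value} leaf."""
    if "all_of" in condition:
        return all(evaluate(child, state) for child in condition["all_of"])
    (name, value), = condition.items()
    if name not in PREDICATES:
        raise KeyError(f"unknown predicate: {name}")
    return PREDICATES[name](state, value)


if __name__ == "__main__":
    # C1-style pattern: explore enough, lose nothing, finish in time.
    c1 = {"all_of": [
        {"explored_pct_gte": 60},
        {"units_lost_lte": 0},
        {"within_ticks": 3000},
    ]}
    state = {"explored_pct": 72, "units_lost": 0, "units_killed": 0, "tick": 2500}
    print(evaluate(c1, state))
```

Rejecting unknown predicate names mirrors the rule that a win condition may use only predicates the engine actually defines.
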

	≈ 201 levels / ~67 packs by the pack estimate (the Levels column above
	sums to 204, i.e. 68 three-level packs), all runnable on the current engine.
	Economy categories are scoped as spend/conservation decisions
	(`starting_cash`), which is the discriminating choice today. A 13th
	category, Economy-Harvest, is now runnable: engine task #14 (S0
	ore-seeding + idempotent harvest; S1 capped resource pool from
	refineries/silos + `economy_value`) is done. Shipped packs:
	`economy-harvest-timebox` (earn maximum economic value within a time
	budget while keeping harvesters productive) and
	`economy-harvest-investment` (split a budget between widening
	collection and silo storage so that burst yield is not lost; this is
	the storage decision). Win = `economy_value_gte` (= cash + stored
	resources) / `harvesters_gte` / `has_building:silo` within
	`within_ticks`; integration-tested end-to-end (`test_economy_harvest`).
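
The storage decision can be illustrated with a small sketch of a capped resource pool. Only the definition `economy_value = cash + stored resources` comes from the text; the `deposit` helper, the capacities, and the silo cost are invented for illustration and do not reflect engine values.

```python
def deposit(stored, harvested, capacity):
    """Add `harvested` to the pool, clamped at `capacity`.

    Returns (new_stored, lost), where `lost` is yield that overflowed storage.
    """
    accepted = min(harvested, max(capacity - stored, 0))
    return stored + accepted, harvested - accepted


def economy_value(cash, stored):
    """cash + stored resources, as the text defines `economy_value`."""
    return cash + stored


if __name__ == "__main__":
    # All numbers below are invented for illustration.
    cash, stored, burst = 500, 1500, 1000
    base_capacity, silo_capacity, silo_cost = 2000, 1500, 150

    no_silo, lost_a = deposit(stored, burst, base_capacity)
    with_silo, lost_b = deposit(stored, burst, base_capacity + silo_capacity)
    print("no silo:  ", economy_value(cash, no_silo), "lost", lost_a)
    print("with silo:", economy_value(cash - silo_cost, with_silo), "lost", lost_b)
```

Under these made-up numbers, paying for storage raises the final value because the burst no longer overflows; with a smaller burst the same purchase would lower it, which is what makes the choice discriminating.
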

	## Generation & integ-test strategy ("each run + validated")

	- Packs are generated by `tools/gen_catalog.py` (procedurally
	parameterized per the anti-memorization rule), written flat into
	`openra_bench/scenarios/packs/` with a `cat-` prefix.
	- Validation: `python -m openra_bench.scenarios.validate` compiles
	all 3 levels of every pack against the engine `ScenarioDefinition`
	+ win-grammar (schema gate).
	- Engine run: the existing parametrized
	`tests/test_robustness.py::test_bundled_pack_runs_on_engine`
	auto-covers every `packs/*.yaml` — each catalog pack's easy level
	must actually execute on the Rust engine (terrain resolves, actors
	load, scores), not merely validate.
	- Catalog coverage test: `tests/test_catalog.py` asserts all 12
	categories are present, ≈200 levels exist, every pack carries a
	capability tag, and difficulty is monotone (tighter clock / less
	info per level) — the literature's controlled-ladder requirement.
	- Transfer panel (follow-up): pre-register ERQA + Blocksworld +
	WebShop external eval to quantify generalization gap (lmgame-Bench
	protocol); report per-axis deltas, not aggregate.
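
The catalog coverage assertions can be sketched as two small checks. This is not the contents of `tests/test_catalog.py`: the level counts are copied from the catalog table, but the tolerance used for "≈200", the per-level keys `within_ticks` and `revealed_pct`, and the function names are assumptions for illustration.

```python
# Level counts per category, copied from the catalog table.
CATEGORY_LEVELS = {
    "C1": 18, "C2": 18, "C3": 18, "C4": 15, "C5": 18, "C6": 18,
    "C7": 18, "C8": 15, "C9": 18, "C10": 18, "C11": 15, "C12": 15,
}


def counts_ok(levels_by_category, n_categories=12, target=200, tolerance=10):
    """True if every category is present and the level total is near `target`.

    `tolerance` is an assumed reading of "≈200"; the text does not define it.
    """
    total = sum(levels_by_category.values())
    return (len(levels_by_category) == n_categories
            and abs(total - target) <= tolerance)


def is_monotone_ladder(levels):
    """True if each level is no easier than the previous one on both axes
    (clock, revealed info) and strictly harder on at least one of them."""
    for easier, harder in zip(levels, levels[1:]):
        clock_ok = harder["within_ticks"] <= easier["within_ticks"]
        info_ok = harder["revealed_pct"] <= easier["revealed_pct"]
        strictly = (harder["within_ticks"] < easier["within_ticks"]
                    or harder["revealed_pct"] < easier["revealed_pct"])
        if not (clock_ok and info_ok and strictly):
            return False
    return True


if __name__ == "__main__":
    pack = [  # hypothetical easy / medium / hard levels
        {"within_ticks": 3000, "revealed_pct": 50},
        {"within_ticks": 2400, "revealed_pct": 50},
        {"within_ticks": 2400, "revealed_pct": 20},
    ]
    print(counts_ok(CATEGORY_LEVELS), is_monotone_ladder(pack))
```
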

	## Cross-cutting rules (enforced by the generator)

	1. No enemy-destruction predicate exists → offensive success =
	`units_killed_gte` + positional `reach_region`/`building_in_region`.
	2. Difficulty = decision hardness (info↓, decoys↑, clock↓, budget
	committed, attrition cap), never raw scaling.
	3. Per-pack `real_world_meaning` + `robotics_analogue` are mandatory
	and tie to the mapped capability dimension + its literature.
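
Rule 3 lends itself to a mechanical check. The sketch below treats a pack as a plain dict and reports which mandatory fields are missing; the field names `real_world_meaning` and `robotics_analogue` and the capability codes come from the text, while the `capability` key and the function name are hypothetical, since the actual pack schema is not shown here.

```python
# Capability codes from the taxonomy table.
CAPABILITIES = {"PERC", "FRONT", "PLAN", "TECH", "ECON", "COORD", "RISK", "TEMPO"}
REQUIRED_TEXT_FIELDS = ("real_world_meaning", "robotics_analogue")


def pack_rule_violations(pack):
    """Return a list of human-readable problems; an empty list means the pack passes."""
    problems = []
    for field in REQUIRED_TEXT_FIELDS:
        value = pack.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # `capability` is a hypothetical key name for the pack's capability tag.
    if pack.get("capability") not in CAPABILITIES:
        problems.append("missing or unknown capability tag")
    return problems


if __name__ == "__main__":
    pack = {
        "capability": "FRONT",
        "real_world_meaning": "choose which unexplored region to commit to",
        "robotics_analogue": "",  # empty on purpose to show a violation
    }
    for problem in pack_rule_violations(pack):
        print(problem)
```
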