| # OpenRA-Bench: Paper Plan & Critique | |
| Working document. Captures the central thesis, the paper framing, the | |
| three findings, the experiment program, and the open critiques / | |
| reviewer-defenses. Living doc: update as experiments land. | |
| --- | |
| ## 1. Central thesis (as stated) | |
| We build a benchmark on the Real-Time Strategy game *Red Alert* and | |
| measure LLM performance on **adversarial, multi-modal, long-horizon | |
| strategic planning and execution**. Three findings: | |
| 1. **The major gap is image-based perception.** In pure-text | |
| battlefields models execute complex strategies; in mixed modality, | |
| where own and enemy positions must be read off a minimap, they | |
| struggle. We show this by fine-tuning (SFT) on a small set of | |
| text-modality trajectories with the minimap image attached, and we | |
| observe improvement on this benchmark *and* on external vision | |
| benchmarks (ERQA). | |
| 2. **Models panic in unfavorable / uncertain situations.** Scores | |
| drop sharply under fog of war while humans hold a baseline. Under | |
| heavy losses, models choose `observe`/`stop` instead of actively | |
| redirecting units. Hypothesis: a transfer of behavior from | |
| code-heavy training, where the rewarded move in an unfavorable | |
| state is to stop and wait for human intervention. | |
| 3. **Bridging both gaps via simple SFT** lets a *smaller* model beat | |
| large reasoning models on 1v1 ELO. | |
| Contributions: 200 research-grounded scenarios; an active 1v1 | |
| battleground; evidence that models form complex, diverse strategies | |
| despite the perception and adversity gaps (and that strategy varies | |
| with temperature and is model-characteristic: turtle / balanced / | |
| aggressive). | |
| --- | |
| ## 2. Framing & positioning | |
| **The hook is the inversion:** LLMs can do the *hard* part | |
| (long-horizon adversarial strategy) but fail the "easy" parts | |
| (reading a map, holding their nerve). Lead with that: it inverts the | |
| usual "LLMs can't plan" narrative, and the failures are specific and | |
| *fixable*. | |
| Position OpenRA-Bench as a **diagnostic instrument**, not a | |
| leaderboard. The methodological novelty vs. the crowded | |
| SC2 / TextStarCraft LLM-agent space is the **ablation methodology that | |
| decomposes failure** into perception / reasoning / action / adversity. | |
| Arc of the paper: **diagnosis → localization → treatment → transfer.** | |
| Title candidates: | |
| - *Map-Blind and Panic-Prone: Diagnosing LLM Strategic Agents in | |
| Real-Time Strategy* | |
| - *The Bottleneck Is Perception, Not Planning: A Diagnostic RTS | |
| Benchmark* | |
| - *LLMs Can Strategize But Cannot See: Decomposing Agent Failure in | |
| Red Alert* | |
| Venue: strong enough for a main track (ICLR/NeurIPS/ICML) on the | |
| diagnosis+treatment+transfer arc, not only Datasets & Benchmarks. | |
| --- | |
| ## 3. Contributions (sharpened) | |
| 1. **OpenRA-Bench**: 200 controlled RTS scenarios on a deterministic | |
| engine, each anchored to a named real-world capability / external | |
| benchmark, built to a **no-cheat / no-defect** bar (every lazy / | |
| brute / stall policy provably loses on every level and seed). The | |
| validation rigor is itself a contribution: it is why the benchmark | |
| measures the *intended* capability. | |
| 2. **An ablation methodology that decomposes failure**: the | |
| perception grid (channel × fog) and the handoff sweep, turning | |
| "the model lost" into "the model lost *because of perception / | |
| adversity / planning*." | |
| 3. **A 1v1 self-play battleground** with ELO. | |
| 4. **Two localized, fixable gaps** (perception-binding and | |
| inaction-under-adversity), each with an SFT remedy, cross-benchmark | |
| transfer (ERQA), and a small-beats-large result. | |
| --- | |
| ## 4. Paper structure | |
| §1 Intro (the inversion) · §2 Related work · §3 OpenRA-Bench (engine, | |
| 200 scenarios, no-cheat design, ablation axes, 1v1/ELO, | |
| human-labeling) · §4 Setup (models, sweeps, metrics) · **§5 Models | |
| *can* strategize** (the diversity control) · §6 Finding 1: perception | |
| gap · §7 Finding 2: panic · §8 Finding 3: SFT remedy + transfer · | |
| §9 Limitations · §10 Conclusion. | |
| **Key structural move:** promote the "diverse strategies" result from | |
| a bonus to **§5, before the failure findings.** Showing models produce | |
| coherent, diverse, model-characteristic strategies is what licenses | |
| the claim that the later failures are *not* failures of strategic | |
| reasoning. Without it a reviewer says "maybe they just can't plan." | |
| --- | |
| ## 5. §5 control: models *can* strategize | |
| Promote from "bonus" to a load-bearing control. | |
| - Classify each game's strategy (turtle / balanced / aggressive) via a | |
| trajectory classifier (first-attack tick, army-vs-econ ratio, | |
| expansion count, aggression index, build order). Show per-model | |
| distributions differ. | |
| - Temperature sweep: strategy entropy vs. temperature for one model. | |
| - Show strategies are *coherent* (executed consistently once chosen), | |
| not random. | |
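
As a minimal sketch of the temperature-sweep statistic, assuming each game has already been labelled with one of the three strategy classes (the classifier itself is still an open decision), strategy entropy can be taken as the Shannon entropy of the label distribution. The function name and the sample labels below are illustrative assumptions, not part of the codebase.

```python
import math
from collections import Counter

def strategy_entropy(labels):
    """Shannon entropy (in bits) of a list of per-game strategy labels."""
    if not labels:
        raise ValueError("need at least one labelled game")
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return sum(p * math.log2(1 / p) for p in probs)

# Hypothetical labels for one model at two temperatures.
low_temp = ["turtle"] * 10
high_temp = ["turtle"] * 4 + ["balanced"] * 4 + ["aggressive"] * 2

print(strategy_entropy(low_temp))   # a single class gives zero entropy
print(strategy_entropy(high_temp))  # mixed play gives higher entropy
```

With three classes the entropy is bounded by log2(3) bits, so the sweep can be plotted on a fixed axis across models.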
| **The headline figure: the strategy-embedding scatter, in 1v1.** Each | |
| game's trajectory is featurized and embedded to 2D (PCA / UMAP); | |
| points colored by model. The plot shows, at a glance: | |
| - **inter-model clusters**: model A lives in the aggressive region, | |
| model B turtles, so models have characteristic strategy *priors*; | |
| - **intra-model spread**: the same model across games (especially at | |
| high temperature) scatters, so one model generates *diverse* play. | |
| 1v1 is the right venue: full-macro adversarial games make strategy | |
| legible in a way single-scenario tasks do not. A bigger model roster | |
| (8–12, Together + Bedrock) makes the clustering visually striking. | |
| This is the evidence that planning is not the bottleneck, and a | |
| genuinely compelling standalone result. | |
| --- | |
| ## 6. Finding 1: Perception gap | |
| **Claim, sharpened.** Decompose perception into **extraction** (read | |
| state from the image) and **binding** (act on image-derived state). | |
| The perception sweep already separates these: `image`-channel on | |
| perception packs (count / locate) isolates extraction; `image` on | |
| action packs isolates binding. Report *which* breaks. The result is | |
| "the perception→action **binding** fails while pure planning and pure | |
| extraction are intact", not the weak "mixed modality is hard." | |
| **Experiments.** | |
| - Perception sweep (6 cells) × model roster × scenario set × seeds → | |
| the modality-gap number per model (`structured - image`). | |
| - A "perceive-then-act" scaffold (force the model to transcribe | |
| positions first): if it rescues performance, the bottleneck is | |
| binding, not extraction. | |
| - **Cross-modal action distillation SFT**: text-mode *winning* | |
| trajectories + the attached minimap → train image→action. Eval | |
| `image` channel on **held-out** scenarios. | |
| - **Transfer**: ERQA + ≥2 other spatial / vision benchmarks | |
| (e.g. BLINK, a spatial-VQA set), before / after SFT. | |
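
The 6-cell perception sweep in the first bullet can be read as the cross product of the three channels (`structured` / `vision` / `image`) and the two fog settings. The sketch below is a hypothetical illustration of that expansion, not the `run_eval --perception-sweep` implementation: the function name, the dictionary layout, the `-clear` suffix convention for cell names, and the `easy` level are all assumptions.

```python
from itertools import product

# Channel names as listed for the modality axis.
CHANNELS = ("structured", "vision", "image")
# True = fog of war on, False = map revealed (the "-clear" cells).
FOG_SETTINGS = (True, False)

def expand_perception_cells(target):
    """Expand a 'pack:level' string into the 6 channel x fog cells.

    The output layout and cell naming are assumptions for illustration.
    """
    pack, level = target.split(":", 1)
    cells = []
    for channel, fog in product(CHANNELS, FOG_SETTINGS):
        cells.append({
            "pack": pack,
            "level": level,
            "channel": channel,
            "fog": fog,
            "cell": channel if fog else f"{channel}-clear",
        })
    return cells

for cell in expand_perception_cells("def-with-ambush:easy"):
    print(cell["cell"])
```

Keeping the two axes explicit in each cell record makes it straightforward to group results by channel (modality gap) or by fog setting (fog penalty) later.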
| **Why transfer is the crown jewel.** Training image→action where the | |
| action is good could teach good *action priors* without the model | |
| ever using the image. Cross-benchmark transfer is the evidence that | |
| the SFT taught the model to *see*. Without it, finding 1 is "task | |
| finetuning helps task." With it, it is "agentic RTS data improves | |
| general visual spatial reasoning." Use ≥2 external benchmarks so it | |
| cannot be called cherry-picked. | |
| **Reviewer attack → defense.** "VLMs are known to be bad at dense | |
| small images; this is just OCR." → We measure it in a *consequential | |
| agentic loop*, we *localize* extraction vs. binding, and we show a fix | |
| that *transfers*. Plus a render-style mini-ablation (the | |
| `constant_colors` / scale knobs) so the gap is not an artifact of one | |
| minimap style. | |
| --- | |
| ## 7. Finding 2: Panic / inaction under adversity | |
| **Claim.** Models degrade far more than humans under fog; under heavy | |
| losses they default to `observe` / `stop` instead of active redirect. | |
| Operational definition = the **`passivity` metric** (fraction of the | |
| model's turns spent on `observe` / `stop` only). Keep "panic" only as | |
| an informal label; formal text says "inaction bias under adversity." | |
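
A minimal sketch of the `passivity` metric as defined above, assuming an episode is represented as a list of turns and each turn as a list of action names. That representation, the treatment of empty turns, and the non-passive action names (`move`, `attack`) are assumptions for illustration; only `observe` and `stop` come from this document.

```python
PASSIVE_ACTIONS = frozenset({"observe", "stop"})

def passivity(turns):
    """Fraction of turns whose actions are all `observe` / `stop`.

    `turns` is a list of turns; each turn is a list of action names.
    A turn with no actions is counted as passive (an assumption).
    """
    if not turns:
        return 0.0
    passive = sum(
        1 for actions in turns
        if all(action in PASSIVE_ACTIONS for action in actions)
    )
    return passive / len(turns)

episode = [
    ["observe"],          # passive
    ["move", "observe"],  # active: contains a non-passive action
    ["stop"],             # passive
    ["attack"],           # active
]
print(passivity(episode))  # 0.5
```

The "only" in the definition matters: a turn that mixes `observe` with a real order is not counted as passive.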
| **Experiments.** | |
| - Fog ablation × models × **humans**: report the *fog-penalty gap* | |
| (human degrades little, model degrades a lot), not raw fog scores. | |
| - Handoff bad-prefix → passivity, models vs. human. Models freeze | |
| (high passivity), humans redirect (low). | |
| - Show passivity is *causally costly*: within-model, passive turns | |
| predict worse outcomes controlling for position quality. | |
| **The coding-bias hypothesis is the spiciest claim; handle with | |
| care.** It cannot be asserted as fact. Keep it as a hypothesis AND | |
| support it with ≥1 of: | |
| - **base vs. RLHF / instruct** versions of the same model: if base | |
| models panic less, post-training is implicated; | |
| - a **loss-aware prompt intervention** ("inaction is costly; never | |
| just observe when losing"): if a prompt largely fixes it, the | |
| behavior is a shallow prior, consistent with the hypothesis; | |
| - passivity vs. known code / RLHF intensity across the roster. | |
| **Critical-path risk.** The human baseline needs *real human data at | |
| scale*. The Play tab is built; data is not collected. Scope: a | |
| representative scenario subset, several humans, disclosed RTS skill. A | |
| strong scripted policy can serve as a secondary "competent | |
| non-panicking" reference if human N is thin. | |
| --- | |
| ## 8. Finding 3: SFT remedy, small beats large | |
| **Reviewer attack (serious).** "A task-finetuned small model beating a | |
| zero-shot large model is trivial." **Defenses to build in:** | |
| - The SFT is **small and targeted** (perception grounding + active | |
| recovery), explicitly *not* trained on eval scenarios; state the | |
| train / eval split loudly. | |
| - **Also SFT the large model** to show the gap is real and fixable at | |
| every scale. Headline becomes: "the perception+panic gap is large | |
| enough that closing it in a small model *outweighs a 10× reasoning | |
| advantage*, and it was never a reasoning gap." | |
| - Ablate the SFT: perception-only / recovery-only / both, to show each | |
| component contributes. | |
| - ELO with enough games + confidence intervals; held-out 1v1 maps. | |
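
One way the "ELO with enough games + confidence intervals" item could be realized is sketched below: a standard Elo update replayed over the game list, plus a percentile bootstrap over resampled games. This is an assumption about the method, not a description of `one_v_one.py`; the K-factor, base rating, bootstrap size, and player names are hypothetical.

```python
import random

def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def run_elo(games, k=16.0, base=1000.0):
    """Replay (player_a, player_b, score_a) games in order.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    """
    ratings = {}
    for a, b, score_a in games:
        r_a, r_b = ratings.get(a, base), ratings.get(b, base)
        delta = k * (score_a - expected_score(r_a, r_b))
        ratings[a] = r_a + delta
        ratings[b] = r_b - delta  # zero-sum update
    return ratings

def bootstrap_ci(games, player, n_boot=200, alpha=0.05, seed=0, base=1000.0):
    """Percentile bootstrap interval for one player's final rating."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        resampled = [rng.choice(games) for _ in games]
        samples.append(run_elo(resampled, base=base).get(player, base))
    samples.sort()
    low = samples[int((alpha / 2) * n_boot)]
    high = samples[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# Hypothetical results: (player_a, player_b, score for player_a).
games = [("model_a", "model_b", 1.0)] * 6 + [("model_a", "model_b", 0.0)] * 4
ratings = run_elo(games)
low, high = bootstrap_ci(games, "model_a")
print(round(ratings["model_a"], 1), (round(low, 1), round(high, 1)))
```

Sequential Elo depends on game order, so an interval whose width is comparable to the rating gap between two models is a signal that more games are needed before claiming a ranking.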
| **Infrastructure synergy (state explicitly).** The handoff | |
| `TrajectoryController` + the human-Playback format *are* the SFT data | |
| pipeline: recovered bad-prefix episodes are active-recovery | |
| exemplars; text-mode wins are perception-distillation data. The | |
| ablation infrastructure doubles as the training-data factory. | |
| --- | |
| ## 9. Experiment program | |
| | Experiment | Design | Infra | Status | |
| |---|---|---|---| |
| | §5 strategy diversity | classify 1v1 games; temp sweep; per-model dists | 1v1 harness ✓; needs strategy classifier | TODO | |
| | Perception sweep | 6-cell × roster × scenarios × seeds | ✓ `--perception-sweep` | **run** | |
| | Handoff / passivity | base/bad/good × roster | ✓ `--handoff-sweep` | **run** | |
| | Human baseline | fog × scenario subset × humans | Play tab ✓, **no data** | **highest-risk** | |
| | Cross-modal SFT + transfer | distill text→image; ERQA + 2 more | data pipeline ✓, needs finetuning | **biggest compute** | |
| | 1v1 ELO tournament | round-robin + CIs | harness ✓ | run | |
| | Recovery SFT | active-recovery exemplars → finetune | handoff bank ✓ | run | |
| Critical path: human baseline (logistics) and the SFT (compute). | |
| --- | |
| ## 10. Metrics | |
| Win-rate / outcome · composite P/R/A score · objective-progress | |
| (continuous) · ELO (1v1, with CIs) · **passivity** (freeze metric) · | |
| generalization gap (public vs. held-out seeds) · strategy class / | |
| entropy · human-normalized score (model / human) · derived gaps: | |
| modality gap = `score(structured) - score(image)`, fog penalty = | |
| `score(clear) - score(fog)`, per model and per human. | |
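
The derived gaps above reduce to simple differences and ratios. The sketch below writes them out under the assumption that per-cell scores are available as plain dictionaries. The dictionary layout, the function names, the definition of the fog-penalty gap as a difference of penalties, and the sample numbers are all illustrative assumptions.

```python
def modality_gap(scores):
    """score(structured) - score(image) for one player."""
    return scores["structured"] - scores["image"]

def fog_penalty(scores):
    """score(clear) - score(fog) for one player (model or human)."""
    return scores["clear"] - scores["fog"]

def fog_penalty_gap(model_scores, human_scores):
    """Extra fog degradation of the model relative to the human (assumed definition)."""
    return fog_penalty(model_scores) - fog_penalty(human_scores)

def human_normalized(model_score, human_score):
    """Model score as a fraction of the human score on the same cell."""
    if human_score == 0:
        raise ValueError("human score is zero; ratio undefined")
    return model_score / human_score

# Hypothetical scores in [0, 1], chosen only to make the arithmetic easy to follow.
model = {"structured": 0.75, "image": 0.25, "clear": 0.75, "fog": 0.25}
human = {"clear": 0.75, "fog": 0.625}

print(modality_gap(model))            # 0.5
print(fog_penalty(model))             # 0.5
print(fog_penalty(human))             # 0.125
print(fog_penalty_gap(model, human))  # 0.375
print(human_normalized(model["fog"], human["fog"]))  # 0.4
```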
| --- | |
| ## 11. Threats to validity / limitations to preempt | |
| ### 11.1 Out-of-scope engine features (paper must scope around them) | |
| The Rust engine is an *RA-Lite*: ground-only, no resource layer. | |
| The following features are **not implemented** and the bench has | |
| zero packs for them (APC transport, listed last, is the one | |
| exception). They are documented future work, NOT silently | |
| missing: | |
| - **Engineer capture** (`capture_actor`): task #11 (S8). | |
| - **Superweapons** (nuke, iron-curtain, chronosphere): S8. | |
| - **Spies / thief** (infiltration, steal): S8. | |
| - **Tanya** (Allied commando hero unit): new unit type, not in plan. | |
| - **Air units** (yak / mig / heli): needs `Aircraft` trait + flight. | |
| - **Naval**: dd / ca / pt / lst + water mapgen. | |
| - **Resource layer / ore patches**: `Resource` trait + harvester | |
| contention. The 1v1 map `rush-hour-arena` has **no ore patches**; | |
| economy is driven by `starting_cash` only. No mining contestation. | |
| - **APC ground transport**: the engine HAS `enter_transport` / | |
| `unload` + cargo storage, but the bench has only ~1 pack. More could | |
| be authored; the mechanism is sound. | |
| **Paper scope:** "macro economy + combat micro + multi-base + | |
| perception, in a ground-only RA-Lite engine." The features above are | |
| documented as out-of-scope; reviewers will see the explicit list. | |
| ### 11.2 Methodological caveats (the standard list) | |
| - **One game (RA).** Lean on the capability taxonomy | |
| (`meta.benchmark_anchor`) + the ERQA transfer for generality. | |
| - **Engine is a reimplementation.** Deterministic + validated is the | |
| answer. | |
| - **One minimap render style.** Render-robustness ablation. | |
| - **Human skill / N.** Disclose; representative subset. | |
| - **"Panic = code training."** Hypothesis, not claim; support with | |
| the probes in §7. | |
| - **SFT leakage.** Loud train/eval scenario split. | |
| - **ELO methodology.** Game count, pairing, confidence intervals. | |
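
For the SFT-leakage caveat in the list above, one option is a deterministic scenario-level split plus an explicit overlap check. The sketch below is an assumption about how that could be done, not the project's actual split: the salt, the 20% eval fraction, the function names, and the scenario ids are hypothetical.

```python
import hashlib

def split_of(scenario_id, eval_fraction=0.2, salt="openra-bench-split"):
    """Assign a scenario to 'train' or 'eval' from a stable hash of its id."""
    digest = hashlib.sha256(f"{salt}:{scenario_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return "eval" if bucket < eval_fraction else "train"

def assert_no_leakage(train_ids, eval_ids):
    """Raise if any scenario id appears on both sides of the split."""
    overlap = set(train_ids) & set(eval_ids)
    if overlap:
        raise ValueError(f"train/eval leakage: {sorted(overlap)}")

# Hypothetical scenario ids.
ids = [f"pack-{i:03d}" for i in range(20)]
train = [s for s in ids if split_of(s) == "train"]
held_out = [s for s in ids if split_of(s) == "eval"]
assert_no_leakage(train, held_out)
print(len(train), len(held_out))
```

Hashing the scenario id (not the trajectory) keeps every trajectory from a held-out scenario out of the SFT set, including those generated later.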
| ### 11.3 Triage coverage (`scripts/triage.py`) | |
| Per-pack `INTENDED` policy attestation comes from each pack's | |
| dedicated `tests/test_<pack>.py` file (when present). Every such | |
| test is in the suite and the suite is green, so a passing test | |
| shows the intended policy still wins against the current engine. | |
| Post defect-fix wave: | |
| - **167 / 196 packs** (85%) have a dedicated test: "VERIFIED." | |
| - **29 / 196 packs** (15%) are stall-bar-only verified (no test). | |
| Either add a test or rely on full-run empirical attestation. | |
| - **1 pack** (`def-with-ambush`) is exempt by design (positional- | |
| discipline scenario where do-nothing IS the intended policy). | |
| - **0 packs** fail the stall-must-lose bar. | |
| --- | |
| ## 12. Pre-full-run audits (must land before the 200-pack sweep) | |
| After the pilot finishes and *before* committing compute to the full | |
| 200-pack run, three audits gate the rigor of the headline numbers: | |
| ### 12.1 Scenario quality audit | |
| Two layers: | |
| - **Static**: re-run the scripted-policy bar (`stall` / `brute` / | |
| `intended`) across all 200 packs. Engine fixes may have drifted a | |
| pack since authoring (a lazy policy now wins, or `intended` now | |
| loses). Catches benchmark rot. | |
| - **Empirical**: from pilot/full-run data, flag packs where *every* | |
| model wins (too easy / a trivial idiom dominates; task #43) or | |
| *every* model loses (unsolvable or a predicate is mis-tuned; | |
| task #44). Discriminative packs are the only useful ones. | |
| Paper payoff: a post-hoc audit table converts the "no-defect bar" | |
| claim into something you can *show*, a strong methodology subsection. | |
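
The empirical layer of the audit can be sketched as a small labelling pass over per-pack results. The input layout (pack → model → list of per-run wins), the label strings, and the pack and model names below are assumptions for illustration.

```python
def audit_packs(results):
    """Label each pack from per-model win lists.

    `results` maps pack -> model -> list of booleans (one per run).
    """
    labels = {}
    for pack, by_model in results.items():
        rates = [sum(runs) / len(runs) for runs in by_model.values() if runs]
        if not rates:
            labels[pack] = "no-data"
        elif all(rate == 1.0 for rate in rates):
            labels[pack] = "too-easy"        # every model always wins
        elif all(rate == 0.0 for rate in rates):
            labels[pack] = "all-lose"        # unsolvable or mis-tuned predicate
        else:
            labels[pack] = "discriminative"
    return labels

# Hypothetical pilot data.
pilot = {
    "pack-a": {"m1": [True, True], "m2": [True, True]},
    "pack-b": {"m1": [False, False], "m2": [False, False]},
    "pack-c": {"m1": [True, False], "m2": [False, False]},
}
print(audit_packs(pilot))
```

The strict 0% / 100% thresholds follow the "every model wins / every model loses" wording; a looser threshold would be a separate design decision.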
| ### 12.2 Coverage map: RTS phase × decision-divergence | |
| Map all 200 packs (plus the 1v1 battleground) onto the **RTS phase × | |
| decision-divergence matrix** from the original plan | |
| (opening / early-mid / mid / mid-late / late × the canonical decisions | |
| in each). Produce a coverage heatmap; flag empty / thin cells. Surface | |
| the `meta.capability`-tag imbalance (`adversarial`=1 pack; full | |
| end-to-end macro lives in the 1v1 battleground, and both belong on the | |
| map). Paper payoff: a figure showing the bench spans the *real* game, | |
| not just easy probes. | |
| ### 12.3 Multi-run reliability: `pass^k` | |
| Each (cell, seed) is run N times varying only model nondeterminism | |
| (requires temperature > 0). Report mean ± CI **and** `pass^k` | |
| (all-k-wins). A model that wins 5/10 identical runs is a fundamentally | |
| different finding from one that wins 10/10. `--repeats N` in | |
| `run_eval`; default `k=5` (Codex / SWE-bench convention). Paper | |
| payoff: mean-only is fragile; reliability is itself a possible | |
| headline result. | |
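
To make the 5/10 versus 10/10 contrast concrete, the sketch below uses one common estimator of `pass^k`: the probability that k runs drawn without replacement from the N observed runs are all wins, C(wins, k) / C(N, k). The choice of estimator and the function name are assumptions; this document only fixes the "all-k-wins" meaning.

```python
from math import comb

def pass_hat_k(wins, runs, k):
    """Estimate pass^k: chance that k of the `runs` repeats, drawn without
    replacement, are all wins."""
    if not 0 <= wins <= runs:
        raise ValueError("wins must be between 0 and runs")
    if not 1 <= k <= runs:
        raise ValueError("k must be between 1 and runs")
    return comb(wins, k) / comb(runs, k)

print(pass_hat_k(10, 10, 5))  # 1.0
print(pass_hat_k(5, 10, 5))   # 1/252, about 0.004
```

With a 50% mean win rate the k=5 estimate is about 0.4%, which is the sense in which a mean-only report hides the reliability difference.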
| --- | |
| ## 13. Stretch ideas | |
| - **Pivotal-turn analysis**: single-turn counterfactual swaps to show | |
| RTS losses are 1–2 catastrophic decisions, not uniform decay. | |
| --- | |
| ## 14. Ablation infrastructure already built (this is real, today) | |
| - **Fog axis**: engine `reveal_map` no-fog flag (`OpenRA-Rust`), | |
| the `-clear` perception cells. | |
| - **Modality axis**: `structured` / `vision` / `image` | |
| (image-primary, text redacted, labelled minimap) channels; | |
| `run_eval --perception-sweep` expands `pack:level` into the 6 | |
| modality cells. | |
| - **Handoff axis**: `openra_bench/handoff.py` | |
| (`HandoffController`, `TrajectoryController`), | |
| `run_eval --handoff-sweep`; the `passivity` metric on every result. | |
| - **1v1 battleground + ELO**: `one_v_one.py`, scripted ladder. | |
| - **Human-labeling**: the Play tab persists human runs in the | |
| standard `Playback` format (apples-to-apples with model runs). | |
| - **200 scenario packs**: no-cheat-validated, capability-anchored. | |
| --- | |
| ## 15. Open decisions | |
| - Model roster for the sweeps (which models; vision-capable required | |
| for `vision` / `image` channels). | |
| - Compute / API budget for the full sweeps. | |
| - Human-study scope (how many humans, which scenario subset). | |
| - SFT base model(s) and the small/large pairing for finding 3. | |
| - Strategy-classifier definition for §5. | |