# OpenRA-Bench: Paper Plan & Critique
Working document. Captures the central thesis, the paper framing, the
three findings, the experiment program, and the open critiques /
reviewer-defenses. Living doc: update as experiments land.
---
## 1. Central thesis (as stated)
We build a benchmark on the Real-Time Strategy game *Red Alert* and
measure LLM performance on **adversarial, multi-modal, long-horizon
strategic planning and execution**. Three findings:
1. **The major gap is image-based perception.** In pure-text
battlefields models execute complex strategies; in mixed modality,
where own and enemy positions must be read off a minimap, they
struggle. We show this by SFT on a small set of text-modality
results with the image attached, and observe improvement on this
benchmark *and* on external vision benchmarks (ERQA).
2. **Models panic in unfavorable / uncertain situations.** Scores
drop sharply under fog of war while humans hold a baseline. Under
heavy losses, models choose `observe`/`stop` instead of actively
redirecting units. Hypothesis: a transfer of behavior from
code-heavy training, where the rewarded move in an unfavorable
state is to stop and wait for human intervention.
3. **Bridging both gaps via simple SFT** lets a *smaller* model beat
large reasoning models on 1v1 ELO.
Contributions: 200 research-grounded scenarios; an active 1v1
battleground; evidence that models form complex, diverse strategies
despite the perception and adversity gaps (and that strategy varies
with temperature and is model-characteristic: turtle / balanced /
aggressive).
---
## 2. Framing & positioning
**The hook is the inversion:** LLMs can do the *hard* part
(long-horizon adversarial strategy) but fail the "easy" parts
(reading a map, holding their nerve). Lead with that: it inverts the
usual "LLMs can't plan" narrative, and the failures are specific and
*fixable*.
Position OpenRA-Bench as a **diagnostic instrument**, not a
leaderboard. The methodological novelty vs. the crowded
SC2 / TextStarCraft LLM-agent space is the **ablation methodology that
decomposes failure** into perception / reasoning / action / adversity.
Arc of the paper: **diagnosis → localization → treatment → transfer.**
Title candidates:
- *Map-Blind and Panic-Prone: Diagnosing LLM Strategic Agents in
Real-Time Strategy*
- *The Bottleneck Is Perception, Not Planning: A Diagnostic RTS
Benchmark*
- *LLMs Can Strategize But Cannot See: Decomposing Agent Failure in
Red Alert*
Venue: strong enough for a main track (ICLR/NeurIPS/ICML) on the
diagnosis+treatment+transfer arc, not only Datasets & Benchmarks.
---
## 3. Contributions (sharpened)
1. **OpenRA-Bench**: 200 controlled RTS scenarios on a deterministic
engine, each anchored to a named real-world capability / external
benchmark, built to a **no-cheat / no-defect** bar (every lazy /
brute / stall policy provably loses on every level and seed). The
validation rigor is itself a contribution: it is why the benchmark
measures the *intended* capability.
2. **An ablation methodology that decomposes failure**: the
perception grid (channel × fog) and the handoff sweep, turning
"the model lost" into "the model lost *because of perception /
adversity / planning*."
3. **A 1v1 self-play battleground** with ELO.
4. **Two localized, fixable gaps** (perception-binding and
inaction-under-adversity), each with an SFT remedy, cross-benchmark
transfer (ERQA), and a small-beats-large result.
---
## 4. Paper structure
§1 Intro (the inversion) · §2 Related work · §3 OpenRA-Bench (engine,
200 scenarios, no-cheat design, ablation axes, 1v1/ELO,
human-labeling) · §4 Setup (models, sweeps, metrics) · **§5 Models
*can* strategize** (the diversity control) · §6 Finding 1: perception
gap · §7 Finding 2: panic · §8 Finding 3: SFT remedy + transfer ·
§9 Limitations · §10 Conclusion.
**Key structural move:** promote the "diverse strategies" result from
a bonus to **§5, before the failure findings.** Showing models produce
coherent, diverse, model-characteristic strategies is what licenses
the claim that the later failures are *not* failures of strategic
reasoning. Without it a reviewer says "maybe they just can't plan."
---
## 5. §5 control: models *can* strategize
Promote from "bonus" to a load-bearing control.
- Classify each game's strategy (turtle / balanced / aggressive) via a
trajectory classifier (first-attack tick, army-vs-econ ratio,
expansion count, aggression index, build order). Show per-model
distributions differ.
- Temperature sweep: strategy entropy vs. temperature for one model.
- Show strategies are *coherent* (executed consistently once chosen),
not random.
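A minimal sketch of the trajectory classifier and the strategy-entropy measure for the temperature sweep. The feature names follow the list above, but the `GameFeatures` type, the thresholds, and the decision rule are illustrative assumptions; the actual classifier definition is still an open decision.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class GameFeatures:
    """Hypothetical per-game trajectory features (field names are assumptions)."""
    first_attack_tick: int   # tick at which the first attack order was issued
    army_econ_ratio: float   # army spend divided by economy spend
    expansion_count: int     # number of additional bases built


def classify_strategy(f: GameFeatures) -> str:
    """Rule-based label; the thresholds are placeholders to be tuned on data."""
    if f.first_attack_tick < 3000 and f.army_econ_ratio > 1.5:
        return "aggressive"
    if f.first_attack_tick > 9000 and f.army_econ_ratio < 0.8:
        return "turtle"
    return "balanced"


def strategy_entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical strategy distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, a model whose four games are labelled aggressive, aggressive, turtle, balanced has an entropy of 1.5 bits, while a model that always turtles has 0. Plotting this value against temperature gives the sweep curve.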
**The headline figure โ€” the strategy-embedding scatter, in 1v1.** Each
game's trajectory is featurized and embedded to 2D (PCA / UMAP);
points colored by model. The plot shows, at a glance:
- **inter-model clusters**: model A lives in the aggressive region,
model B turtles; models have characteristic strategy *priors*;
- **intra-model spread**: the same model across games (especially at
high temperature) scatters; one model generates *diverse* play.

1v1 is the right venue: full-macro adversarial games make strategy
legible in a way single-scenario tasks do not. A bigger model roster
(8–12, Together + Bedrock) makes the clustering visually striking.
This is the evidence that planning is not the bottleneck, and a
genuinely compelling standalone result.
---
## 6. Finding 1 โ€” Perception gap
**Claim, sharpened.** Decompose perception into **extraction** (read
state from the image) and **binding** (act on image-derived state).
The perception sweep already separates these: `image`-channel on
perception packs (count / locate) isolates extraction; `image` on
action packs isolates binding. Report *which* breaks. The result is
"the perception→action **binding** fails while pure planning and pure
extraction are intact", not the weak "mixed modality is hard."
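One way to turn the extraction / binding split into a reported label, sketched under assumptions: the function name, the four score arguments, and the `tol` threshold are hypothetical, and scores are taken to be per-model means on a common scale.

```python
def localize_perception_failure(
    structured_perception: float,
    image_perception: float,
    structured_action: float,
    image_action: float,
    tol: float = 0.05,
) -> str:
    """Label where the image channel breaks, given mean scores per cell.

    Perception packs (count / locate) probe extraction; action packs probe
    binding. `tol` is an assumed noise threshold, not a value from the bench.
    """
    extraction_gap = structured_perception - image_perception
    binding_gap = structured_action - image_action
    if extraction_gap > tol:
        # The model cannot read state off the image at all.
        return "extraction"
    if binding_gap > tol:
        # The model reads the image but fails to act on what it read.
        return "binding"
    return "intact"
```

Extraction is checked first because a model that cannot read the image will also fail the action packs; only when extraction holds does an action-pack gap point at binding.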
**Experiments.**
- Perception sweep (6 cells) × model roster × scenario set × seeds →
the modality-gap number per model (`structured − image`).
- A "perceive-then-act" scaffold (force the model to transcribe
positions first): if it rescues performance, the bottleneck is
binding, not extraction.
- **Cross-modal action distillation SFT**: text-mode *winning*
trajectories + the attached minimap → train image→action. Eval
`image` channel on **held-out** scenarios.
- **Transfer**: ERQA + ≥2 other spatial / vision benchmarks
(e.g. BLINK, a spatial-VQA set), before / after SFT.
**Why transfer is the crown jewel.** Training image→action where the
action is good could teach good *action priors* without the model
ever using the image. The cross-benchmark transfer is what proves the
SFT taught the model to *see*. Without it, finding 1 is "task
finetuning helps task." With it, it is "agentic RTS data improves
general visual spatial reasoning." Use ≥2 external benchmarks so it
cannot be called cherry-picked.
**Reviewer attack → defense.** "VLMs are known to be bad at dense
small images; this is just OCR." → We measure it in a *consequential
agentic loop*, we *localize* extraction vs binding, and we show a fix
that *transfers*. Plus a render-style mini-ablation (the
`constant_colors` / scale knobs) so the gap is not an artifact of one
minimap style.
---
## 7. Finding 2 โ€” Panic / inaction under adversity
**Claim.** Models degrade far more than humans under fog; under heavy
losses they default to `observe` / `stop` instead of active redirect.
Operational definition = the **`passivity` metric** (fraction of the
model's turns spent on `observe` / `stop` only). Keep "panic" only as
an informal label; formal text says "inaction bias under adversity."
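The operational definition can be sketched as follows. This is not the project's implementation: the input shape (one list of action names per turn) and the treatment of empty turns are assumptions, and only `observe` and `stop` are names taken from the text.

```python
PASSIVE_ACTIONS = {"observe", "stop"}


def passivity(turns: list[list[str]]) -> float:
    """Fraction of turns whose actions are all passive (observe / stop only).

    `turns` holds one list of action names per model turn. A turn with no
    actions counts as passive here (an assumption; `all` is vacuously true).
    """
    if not turns:
        return 0.0
    passive_turns = sum(
        1 for actions in turns if all(a in PASSIVE_ACTIONS for a in actions)
    )
    return passive_turns / len(turns)
```

A turn that mixes `observe` with any active order is not passive, so the metric only counts turns where the model did nothing but watch or halt.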
**Experiments.**
- Fog ablation × models × **humans**: report the *fog-penalty gap*
(human degrades little, model degrades a lot), not raw fog scores.
- Handoff bad-prefix → passivity, models vs. human. Models freeze
(high passivity), humans redirect (low).
- Show passivity is *causally costly*: within-model, passive turns
predict worse outcomes controlling for position quality.
**The coding-bias hypothesis is the spiciest claim; handle with
care.** It cannot be asserted as fact. Keep it as a hypothesis AND
support it with ≥1 of:
- **base vs. RLHF / instruct** versions of the same model: if base
models panic less, post-training is implicated;
- a **loss-aware prompt intervention** ("inaction is costly; never
just observe when losing"): if a prompt largely fixes it, the
behavior is a shallow prior, consistent with the hypothesis;
- passivity vs. known code / RLHF intensity across the roster.
**Critical-path risk.** The human baseline needs *real human data at
scale*. The Play tab is built; data is not collected. Scope: a
representative scenario subset, several humans, disclosed RTS skill. A
strong scripted policy can serve as a secondary "competent
non-panicking" reference if human N is thin.
---
## 8. Finding 3 โ€” SFT remedy, small beats large
**Reviewer attack (serious).** "A task-finetuned small model beating a
zero-shot large model is trivial." **Defenses to build in:**
- The SFT is **small and targeted** (perception grounding + active
recovery), explicitly *not* trained on eval scenarios; state the
train / eval split loudly.
- **Also SFT the large model** to show the gap is real and fixable at
every scale. Headline becomes: "the perception+panic gap is large
enough that closing it in a small model *outweighs a 10× reasoning
advantage*, and it was never a reasoning gap."
- Ablate the SFT: perception-only / recovery-only / both, to show each
component contributes.
- ELO with enough games + confidence intervals; held-out 1v1 maps.
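A minimal sketch of ELO with bootstrap confidence intervals, using the standard logistic expected-score formula. The function names, the K-factor, the base rating, and the game-tuple format are assumptions and not the `one_v_one.py` interface.

```python
import random


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def fit_elo(games, k: float = 16.0, base: float = 1500.0) -> dict:
    """Sequential Elo over games given as (player_a, player_b, score_a).

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    """
    ratings = {}
    for a, b, score_a in games:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        delta = k * (score_a - expected_score(ra, rb))
        ratings[a] = ra + delta   # zero-sum update: A gains what B loses
        ratings[b] = rb - delta
    return ratings


def bootstrap_elo_ci(games, player, n_boot=200, alpha=0.05, seed=0, base=1500.0):
    """Percentile bootstrap CI for one player's rating (resample games)."""
    rng = random.Random(seed)
    samples = sorted(
        fit_elo([rng.choice(games) for _ in games], base=base).get(player, base)
        for _ in range(n_boot)
    )
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return samples[lo_idx], samples[hi_idx]
```

Sequential Elo depends on game order, so resampling also shuffles the order; an order-free fit (for example a Bradley-Terry maximum-likelihood fit) may be preferable for the reported numbers. Overlapping intervals between the small SFT model and a large model would mean the small-beats-large headline is not yet supported.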
**Infrastructure synergy (state explicitly).** The handoff
`TrajectoryController` + the human-Playback format *are* the SFT data
pipeline: recovered bad-prefix episodes are active-recovery
exemplars; text-mode wins are perception-distillation data. The
ablation infrastructure doubles as the training-data factory.
---
## 9. Experiment program
| Experiment | Design | Infra | Status |
|---|---|---|---|
| §5 strategy diversity | classify 1v1 games; temp sweep; per-model dists | 1v1 harness ✓; needs strategy classifier | TODO |
| Perception sweep | 6-cell × roster × scenarios × seeds | ✓ `--perception-sweep` | **run** |
| Handoff / passivity | base/bad/good × roster | ✓ `--handoff-sweep` | **run** |
| Human baseline | fog × scenario subset × humans | Play tab ✓, **no data** | **highest-risk** |
| Cross-modal SFT + transfer | distill text→image; ERQA + 2 more | data pipeline ✓, needs finetuning | **biggest compute** |
| 1v1 ELO tournament | round-robin + CIs | harness ✓ | run |
| Recovery SFT | active-recovery exemplars → finetune | handoff bank ✓ | run |
Critical path: human baseline (logistics) and the SFT (compute).
---
## 10. Metrics
Win-rate / outcome · composite P/R/A score · objective-progress
(continuous) · ELO (1v1, with CIs) · **passivity** (freeze metric) ·
generalization gap (public vs. held-out seeds) · strategy class /
entropy · human-normalized score (model / human) · derived gaps:
modality gap = `score(structured) − score(image)`, fog penalty =
`score(clear) − score(fog)`, per model and per human.
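The derived gaps are simple differences and ratios; a small sketch makes the sign conventions explicit. The function names and the dict-of-scores input are assumptions; the cell keys (`structured`, `image`, `clear`, `fog`) follow the formulas above.

```python
def modality_gap(scores: dict) -> float:
    """score(structured) - score(image); positive means the image channel hurts."""
    return scores["structured"] - scores["image"]


def fog_penalty(scores: dict) -> float:
    """score(clear) - score(fog); positive means fog of war hurts."""
    return scores["clear"] - scores["fog"]


def fog_penalty_gap(model_scores: dict, human_scores: dict) -> float:
    """How much more the model degrades under fog than the human does."""
    return fog_penalty(model_scores) - fog_penalty(human_scores)


def human_normalized(model_score: float, human_score: float) -> float:
    """Model score as a fraction of the human score on the same cell."""
    if human_score == 0:
        raise ValueError("human score is zero; normalization is undefined")
    return model_score / human_score
```

With a model at 0.75 clear and 0.25 under fog, and a human at 1.0 and 0.75, the fog penalties are 0.5 and 0.25, so the fog-penalty gap is 0.25.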
---
## 11. Threats to validity / limitations to preempt
### 11.1 Out-of-scope engine features (paper must scope around them)
The Rust engine is an *RA-Lite*: ground-only, no resource layer.
The following features are **not implemented** and the bench has
zero packs for them. They are documented future work, NOT silently
missing:
- **Engineer capture** (`capture_actor`): task #11 (S8).
- **Superweapons** (nuke, iron-curtain, chronosphere): S8.
- **Spies / thief** (infiltration, steal): S8.
- **Tanya** (Allied commando hero unit): new unit type, not in plan.
- **Air units** (yak / mig / heli): needs `Aircraft` trait + flight.
- **Naval**: dd / ca / pt / lst + water mapgen.
- **Resource layer / ore patches**: `Resource` trait + harvester
contention. The 1v1 map `rush-hour-arena` has **no ore patches**;
economy is driven by `starting_cash` only. No mining contestation.
- **APC ground transport** (the exception: implemented but thinly
covered): engine HAS `enter_transport` / `unload` + cargo storage;
the bench has ~1 pack. More could be authored, but the mechanism is
sound.
**Paper scope:** "macro economy + combat micro + multi-base +
perception, in a ground-only RA-Lite engine." The features above are
documented as out-of-scope; reviewers will see the explicit list.
### 11.2 Methodological caveats (the standard list)
- **One game (RA).** Lean on the capability taxonomy
(`meta.benchmark_anchor`) + the ERQA transfer for generality.
- **Engine is a reimplementation.** Deterministic + validated is the
answer.
- **One minimap render style.** Render-robustness ablation.
- **Human skill / N.** Disclose; representative subset.
- **"Panic = code training."** Hypothesis, not claim; support with
the probes in §7.
- **SFT leakage.** Loud train/eval scenario split.
- **ELO methodology.** Game count, pairing, confidence intervals.
### 11.3 Triage coverage (`scripts/triage.py`)
Per-pack `INTENDED` policy attestation comes from each pack's
dedicated `tests/test_<pack>.py` file (when present). Every such
test is in the suite and the suite is green, so the test passing
proves the intended policy still wins against the current engine.
Post defect-fix wave:
- **167 / 196 packs** (85%) have a dedicated test → "VERIFIED."
- **29 / 196 packs** (15%) are stall-bar-only verified (no test).
Either add a test or rely on full-run empirical attestation.
- **1 pack** (`def-with-ambush`) is exempt by design (a
positional-discipline scenario where do-nothing IS the intended policy).
- **0 packs** fail the stall-must-lose bar.
---
## 12. Pre-full-run audits (must land before the 200-pack sweep)
After the pilot finishes and *before* committing compute to the full
200-pack run, three audits gate the rigor of the headline numbers:
### 12.1 Scenario quality audit
Two layers:
- **Static**: re-run the scripted-policy bar (`stall` / `brute` /
`intended`) across all 200 packs. Engine fixes may have drifted a
pack since authoring (a lazy policy now wins, or `intended` now
loses). Catches benchmark rot.
- **Empirical**: from pilot/full-run data, flag packs where *every*
model wins (too easy / a trivial idiom dominates; task #43) or
*every* model loses (unsolvable or a predicate is mis-tuned;
task #44). Discriminative packs are the only useful ones.
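The empirical layer can be sketched as a pass over per-pack, per-model win lists. The `audit_packs` name, the result layout, and the flag strings are assumptions for illustration and not the output format of `run_eval`.

```python
def audit_packs(results: dict) -> dict:
    """Flag non-discriminative packs.

    `results` maps pack name -> model name -> list of win booleans
    (one per run). Returns pack name -> flag string.
    """
    flags = {}
    for pack, by_model in results.items():
        # Per-model win rate; models with no runs on this pack are skipped.
        rates = [sum(wins) / len(wins) for wins in by_model.values() if wins]
        if not rates:
            flags[pack] = "no-data"
        elif all(rate == 1.0 for rate in rates):
            flags[pack] = "too-easy"        # every model always wins
        elif all(rate == 0.0 for rate in rates):
            flags[pack] = "unsolved"        # every model always loses
        else:
            flags[pack] = "discriminative"
    return flags
```

A pack flagged `too-easy` or `unsolved` is a candidate for review rather than automatic removal: an all-lose pack may be a mis-tuned predicate or simply a hard but valid scenario.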
Paper payoff: a post-hoc audit table converts the "no-defect bar"
claim into something you can *show*, making a strong methodology subsection.
### 12.2 Coverage map: RTS phase × decision-divergence
Map all 200 packs (plus the 1v1 battleground) onto the **RTS phase ×
decision-divergence matrix** from the original plan
(opening / early-mid / mid / mid-late / late × the canonical decisions
in each). Produce a coverage heatmap; flag empty / thin cells. Surface
the `meta.capability`-tag imbalance (`adversarial`=1 pack; full
end-to-end macro lives in the 1v1 battleground, and both belong on the
map). Paper payoff: a figure showing the bench spans the *real* game,
not just easy probes.
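A sketch of the counting step behind the heatmap. The phase names come from the text; the `phase` and `decision` keys on each pack, the decision labels, and the `min_packs` threshold are hypothetical, since the pack metadata schema for this mapping is not specified here.

```python
from collections import Counter

PHASES = ["opening", "early-mid", "mid", "mid-late", "late"]


def coverage_matrix(packs: list[dict], decisions: list[str]) -> dict:
    """Count packs per (phase, decision) cell; absent cells are reported as 0."""
    counts = Counter((p["phase"], p["decision"]) for p in packs)
    return {
        phase: {d: counts.get((phase, d), 0) for d in decisions}
        for phase in PHASES
    }


def thin_cells(matrix: dict, min_packs: int = 2) -> list[tuple[str, str]]:
    """Cells with fewer than `min_packs` packs (empty cells included)."""
    return [
        (phase, decision)
        for phase, row in matrix.items()
        for decision, n in row.items()
        if n < min_packs
    ]
```

The nested dict can be passed to any plotting library as a phase-by-decision grid, and `thin_cells` gives the list of cells to flag in the figure caption.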
### 12.3 Multi-run reliability โ€” `pass^k`
Each (cell, seed) is run N times varying only model nondeterminism
(requires temperature > 0). Report mean ± CI **and** `pass^k`
(all k runs win). A model that wins 5/10 identical runs is a
fundamentally different finding than one that wins 10/10. `--repeats N`
in `run_eval`; default `k=5` (Codex / SWE-bench convention). Paper
payoff: mean-only is fragile; reliability is itself a possible
headline result.
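A sketch of how `pass^k` could be estimated from N repeats with c wins, using the combinatorial estimator C(c, k) / C(N, k), i.e. the probability that k runs drawn without replacement from the N observed runs are all wins. The function name and the choice of estimator are assumptions; the plan only fixes the definition "all k runs win".

```python
from math import comb


def pass_hat_k(n_runs: int, n_wins: int, k: int) -> float:
    """Estimate pass^k: chance that k runs sampled from the n observed all win."""
    if not 0 < k <= n_runs:
        raise ValueError("k must satisfy 0 < k <= n_runs")
    if not 0 <= n_wins <= n_runs:
        raise ValueError("n_wins must satisfy 0 <= n_wins <= n_runs")
    # comb(n_wins, k) is 0 when n_wins < k, so the estimate is 0 in that case.
    return comb(n_wins, k) / comb(n_runs, k)
```

With k=5, a model that wins 10/10 runs scores 1.0, while a model that wins 5/10 scores 1/252 (about 0.004) despite a mean win rate of 0.5. That contrast is exactly what a mean-only report hides.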
---
## 13. Stretch ideas
- **Pivotal-turn analysis**: single-turn counterfactual swaps to show
RTS losses are 1–2 catastrophic decisions, not uniform decay.
---
## 14. Ablation infrastructure already built (this is real, today)
- **Fog axis**: engine `reveal_map` no-fog flag (`OpenRA-Rust`),
the `-clear` perception cells.
- **Modality axis**: `structured` / `vision` / `image` (image-primary,
text redacted, labelled minimap) channels;
`run_eval --perception-sweep` expands `pack:level` into the 6
modality cells.
- **Handoff axis**: `openra_bench/handoff.py`
(`HandoffController`, `TrajectoryController`),
`run_eval --handoff-sweep`; the `passivity` metric on every result.
- **1v1 battleground + ELO**: `one_v_one.py`, scripted ladder.
- **Human-labeling**: the Play tab persists human runs in the
standard `Playback` format (apples-to-apples with model runs).
- **200 scenario packs**: no-cheat-validated, capability-anchored.
---
## 15. Open decisions
- Model roster for the sweeps (which models; vision-capable required
for `vision` / `image` channels).
- Compute / API budget for the full sweeps.
- Human-study scope (how many humans, which scenario subset).
- SFT base model(s) and the small/large pairing for finding 3.
- Strategy-classifier definition for §5.