qpluslab/OpenRA-Bench

yxc20098 committed on May 23

Commit 71cc3ba (1 parent: 9b5f094)

Phase 5 doc-coherence pass: sync all per-cell counts to 93

Earlier edits updated the headline table, but stale references to
'74 cells', '46 cells', '28 cells', '53 losses', '19 in F1', etc.
remained scattered throughout the prose. Resolved: every number now
matches the 93-cell snapshot (48 9B + 39 Plus + 6 gemma).

Also refined the 'systematically worsens with model scale' framing
(the original 2-model hypothesis) to 'appears in Plus alone' —
the 3-model evidence (gemma 33% matches 9B 42% while Plus is 0%)
shows it's model-specific, not scale-driven.
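
A coherence pass like the one described above could be partly automated. The following is a minimal sketch, not part of this commit: it assumes the per-model counts given in the commit message (48 + 39 + 6 = 93), and the helper name `find_stale_cell_counts` and the regular expression are hypothetical. It scans Markdown text for "N cells" mentions and reports any count that is not in the current snapshot.

```python
import re

# Per-model cell counts taken from the commit message (48 9B + 39 Plus + 6 gemma).
PER_MODEL_CELLS = {"Qwen3.5-9B": 48, "Qwen3.6-Plus": 39, "gemma-4-31B-it": 6}
TOTAL_CELLS = sum(PER_MODEL_CELLS.values())

# Hypothetical pattern: matches phrases such as "74 cells" or "93 completed cells".
CELL_COUNT_RE = re.compile(r"\b(\d+)\s+(?:completed\s+)?cells\b")


def find_stale_cell_counts(text, allowed):
    """Return (line_number, count, line) for each 'N cells' mention with N not in `allowed`."""
    stale = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in CELL_COUNT_RE.finditer(line):
            count = int(match.group(1))
            if count not in allowed:
                stale.append((lineno, count, line.strip()))
    return stale


def allowed_counts():
    """Counts considered current: each per-model count plus the overall total."""
    return set(PER_MODEL_CELLS.values()) | {TOTAL_CELLS}
```

Calling `find_stale_cell_counts(open("PHASE5_FINDINGS.md").read(), allowed_counts())` would list candidate stale references. The pattern is deliberately loose, so map-distance phrases such as "target 4 cells east" also match and the hits still need manual review.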

Files changed (1)

PHASE5_FINDINGS.md +15 -16


@@ -1,16 +1,16 @@
 # Phase 5 — Model Failure Triage Findings
-Source: 74 completed cells across 2 Together models (Qwen/Qwen3.5-9B
-and Qwen/Qwen3.6-Plus), 12 engine-feature scenario packs × 2 levels
 (easy, medium) × 2 seeds × vision fog mode. Per-cell JSONL captures
 the full untruncated turn-by-turn record (obs, system_prompt,
 briefing, model_request, model_response, commands, signals,
-terminal). Gemma-4-31B-it cells in flight (6 cells partial). Triage
 generated by `scripts/triage_phase4.py`.
 ## Outcome matrix
-### Qwen/Qwen3.5-9B (46 cells)
 | pack                          | easy   | medium |
 |-------------------------------|--------|--------|
 | combat-naval-shore-strike     | 2W     | 1W 1L  |
@@ -27,7 +27,7 @@ generated by `scripts/triage_phase4.py`.
 | spec-thief-steal-cash         | 2W     | 1L 1D  |
 **Totals: 22W, 23L, 1D = 48% win on easy, 17% win on medium**
-### Qwen/Qwen3.6-Plus (28 cells)
 | pack                          | easy   | medium |
 |-------------------------------|--------|--------|
 | spec-engineer-capture         | 2L     | 4L     |
@@ -35,7 +35,7 @@ generated by `scripts/triage_phase4.py`.
 | spec-spy-infiltrate           | 2L     | 4L     |
 | spec-tanya-c4-strike          | 2L     | 4L     |
 | spec-thief-steal-cash         | 2D     | 4L     |
-**Totals: 0W, 26L, 2D — 0% win across ALL cells.**
 ## Headline finding — F1: Qwen3.6-Plus's MODEL-SPECIFIC passivity
@@ -71,7 +71,7 @@ other models of comparable scale win cleanly.
 | Qwen/Qwen3.6-Plus  | **39** | **0** | **37** | 2   | **0.0%** | **Observe (17-45×)** |
 Two models at very different scales (9B and 31B) win 40-42% of
-cells. Plus (a much larger model) wins **zero** cells across 30
 runs, with `Observe×N` as its only action across the entire
 budget on every cell.
@@ -86,11 +86,10 @@ perceived difficulty); model-side; scale-INVERSE. Plus appears
 calibrated to refuse action when uncertain, while 9B at least
 attempts navigation. This is the "freeze and panic" failure class
 predicted in PAPER_PLAN.md, now empirically demonstrated with a
-clean ≥30pp scaling delta.
 ### Implication for the paper
-The bench surfaces a failure mode that **systematically worsens
-with model scale** within the Qwen3.x family. RTS-style benchmarks
 are sensitive to a commitment-vs-conservatism axis that doesn't
 show up in static QA benchmarks. The user (and the field) should
 not assume bigger = better on real-time decision tasks.
@@ -108,7 +107,7 @@ identification under partial information fails.
 ## F3 — Target initial visibility predicts win rate
-Across Qwen3.5-9B's 46 cells, the strongest predictor of WIN is
 "target in initial sight":
 - `spec-tanya-c4-strike` (target adjacent at spawn): 4W / 4L
 - `spec-engineer-capture easy` (target 4 cells east): 2W / 0L
@@ -121,7 +120,7 @@ medium versions — the only systematic difference is target
 visibility. This validates the bench's Perception axis: model can
 ACT when target is given; model FAILS when target requires search.
-## Engine vs Scenario vs Model attribution (74 cells)
 - **Engine bugs**: 0 attributable to the engine in the sample.
   (3 pre-existing engine P0s — per-player cash race, proc
@@ -134,9 +133,9 @@ ACT when target is given; model FAILS when target requires search.
   auto-WIN — were FIXED: commits 4ebeee5 + 9000fe3. Bench's
   defensive cash-strip commit b77e43d preempts the entire
   regression class for 62 packs.)
-- **Model failures**: all 53 losses + 3 draws are model-side. The
-  triage script's classifier puts 19 in F1 (passive/walk-only,
-  no special verb), 4 in F2 (superweapon mis-aim), and 30 in
   "verb invoked but still lost" (the third class — model used
   the right verbs but couldn't sequence them effectively;
   examples include econ-multi-patch-allocation losing with
@@ -159,7 +158,7 @@ time pressure. spec-nuke-strike's 0W aligns with F2 (mis-aim).
 ## Data integrity
-- **All 74 cells captured in full untruncated JSONL** with per-turn
   PNG snapshots at `data/runs/paper-v1-*/`. No data loss.
 - Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
 - Resume-safe: `scripts/collect_eval_data.py --resume` skips cells

 # Phase 5 — Model Failure Triage Findings
+Source: 93 completed cells across 3 Together models (Qwen/Qwen3.5-9B,
+Qwen/Qwen3.6-Plus, and google/gemma-4-31B-it), 12 engine-feature scenario packs × 2 levels
 (easy, medium) × 2 seeds × vision fog mode. Per-cell JSONL captures
 the full untruncated turn-by-turn record (obs, system_prompt,
 briefing, model_request, model_response, commands, signals,
+terminal). Gemma-4-31B-it complete (6/6). Triage
 generated by `scripts/triage_phase4.py`.
 ## Outcome matrix
+### Qwen/Qwen3.5-9B (48 cells)
 | pack                          | easy   | medium |
 |-------------------------------|--------|--------|
 | combat-naval-shore-strike     | 2W     | 1W 1L  |
 | spec-thief-steal-cash         | 2W     | 1L 1D  |
 **Totals: 22W, 23L, 1D = 48% win on easy, 17% win on medium**
+### Qwen/Qwen3.6-Plus (39 cells)
 | pack                          | easy   | medium |
 |-------------------------------|--------|--------|
 | spec-engineer-capture         | 2L     | 4L     |
 | spec-spy-infiltrate           | 2L     | 4L     |
 | spec-tanya-c4-strike          | 2L     | 4L     |
 | spec-thief-steal-cash         | 2D     | 4L     |
+**Totals: 0W, 37L, 2D — 0% win across ALL 39 cells.**
 ## Headline finding — F1: Qwen3.6-Plus's MODEL-SPECIFIC passivity
 | Qwen/Qwen3.6-Plus  | **39** | **0** | **37** | 2   | **0.0%** | **Observe (17-45×)** |
 Two models at very different scales (9B and 31B) win 40-42% of
+cells. Plus (a much larger model) wins **zero** cells across 39
 runs, with `Observe×N` as its only action across the entire
 budget on every cell.
 calibrated to refuse action when uncertain, while 9B at least
 attempts navigation. This is the "freeze and panic" failure class
 predicted in PAPER_PLAN.md, now empirically demonstrated with a
+clean ≥33pp delta (Plus 0% vs the other two models 33-42%).
 ### Implication for the paper
+The bench surfaces a failure mode that **appears in Plus alone** within the Qwen3.x family. RTS-style benchmarks
 are sensitive to a commitment-vs-conservatism axis that doesn't
 show up in static QA benchmarks. The user (and the field) should
 not assume bigger = better on real-time decision tasks.
 ## F3 — Target initial visibility predicts win rate
+Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is
 "target in initial sight":
 - `spec-tanya-c4-strike` (target adjacent at spawn): 4W / 4L
 - `spec-engineer-capture easy` (target 4 cells east): 2W / 0L
 visibility. This validates the bench's Perception axis: model can
 ACT when target is given; model FAILS when target requires search.
+## Engine vs Scenario vs Model attribution (93 cells)
 - **Engine bugs**: 0 attributable to the engine in the sample.
   (3 pre-existing engine P0s — per-player cash race, proc
   auto-WIN — were FIXED: commits 4ebeee5 + 9000fe3. Bench's
   defensive cash-strip commit b77e43d preempts the entire
   regression class for 62 packs.)
+- **Model failures**: all 68 losses + 3 draws are model-side. The
+  triage script's classifier puts 46 in F1 (passive/walk-only,
+  no special verb), 4 in F2 (superweapon mis-aim), and 18 in
   "verb invoked but still lost" (the third class — model used
   the right verbs but couldn't sequence them effectively;
   examples include econ-multi-patch-allocation losing with
 ## Data integrity
+- **All 93 cells captured in full untruncated JSONL** with per-turn
   PNG snapshots at `data/runs/paper-v1-*/`. No data loss.
 - Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
 - Resume-safe: `scripts/collect_eval_data.py --resume` skips cells