Spaces:
Running
Running
Phase 5 doc-coherence pass: sync all per-cell counts to 93
Browse filesEarlier edits updated the headline table but stale references to
'74 cells', '46 cells', '28 cells', '53 losses', '19 in F1' etc.
were sprinkled throughout the prose. Resolved: every number now
matches the 93-cell snapshot (48 9B + 39 Plus + 6 gemma).
Also refined the 'systematically worsens with model scale' framing
(the original 2-model hypothesis) to 'appears in Plus alone' —
the 3-model evidence (gemma 33% matches 9B 42% while Plus is 0%)
shows it's model-specific, not scale-driven.
- PHASE5_FINDINGS.md +15 -16
PHASE5_FINDINGS.md
CHANGED
|
@@ -1,16 +1,16 @@
|
|
| 1 |
# Phase 5 — Model Failure Triage Findings
|
| 2 |
|
| 3 |
-
Source:
|
| 4 |
-
and Qwen/Qwen3.6-Plus), 12 engine-feature scenario packs × 2 levels
|
| 5 |
(easy, medium) × 2 seeds × vision fog mode. Per-cell JSONL captures
|
| 6 |
the full untruncated turn-by-turn record (obs, system_prompt,
|
| 7 |
briefing, model_request, model_response, commands, signals,
|
| 8 |
-
terminal). Gemma-4-31B-it
|
| 9 |
generated by `scripts/triage_phase4.py`.
|
| 10 |
|
| 11 |
## Outcome matrix
|
| 12 |
|
| 13 |
-
### Qwen/Qwen3.5-9B (
|
| 14 |
| pack | easy | medium |
|
| 15 |
|-------------------------------|--------|--------|
|
| 16 |
| combat-naval-shore-strike | 2W | 1W 1L |
|
|
@@ -27,7 +27,7 @@ generated by `scripts/triage_phase4.py`.
|
|
| 27 |
| spec-thief-steal-cash | 2W | 1L 1D |
|
| 28 |
**Totals: 22W, 23L, 1D = 48% win on easy, 17% win on medium**
|
| 29 |
|
| 30 |
-
### Qwen/Qwen3.6-Plus (
|
| 31 |
| pack | easy | medium |
|
| 32 |
|-------------------------------|--------|--------|
|
| 33 |
| spec-engineer-capture | 2L | 4L |
|
|
@@ -35,7 +35,7 @@ generated by `scripts/triage_phase4.py`.
|
|
| 35 |
| spec-spy-infiltrate | 2L | 4L |
|
| 36 |
| spec-tanya-c4-strike | 2L | 4L |
|
| 37 |
| spec-thief-steal-cash | 2D | 4L |
|
| 38 |
-
**Totals: 0W,
|
| 39 |
|
| 40 |
## Headline finding — F1: Qwen3.6-Plus's MODEL-SPECIFIC passivity
|
| 41 |
|
|
@@ -71,7 +71,7 @@ other models of comparable scale win cleanly.
|
|
| 71 |
| Qwen/Qwen3.6-Plus | **39** | **0** | **37** | 2 | **0.0%** | **Observe (17-45×)** |
|
| 72 |
|
| 73 |
Two models at very different scales (9B and 31B) win 40-42% of
|
| 74 |
-
cells. Plus (a much larger model) wins **zero** cells across
|
| 75 |
runs, with `Observe×N` as its only action across the entire
|
| 76 |
budget on every cell.
|
| 77 |
|
|
@@ -86,11 +86,10 @@ perceived difficulty); model-side; scale-INVERSE. Plus appears
|
|
| 86 |
calibrated to refuse action when uncertain, while 9B at least
|
| 87 |
attempts navigation. This is the "freeze and panic" failure class
|
| 88 |
predicted in PAPER_PLAN.md, now empirically demonstrated with a
|
| 89 |
-
clean ≥
|
| 90 |
|
| 91 |
### Implication for the paper
|
| 92 |
-
The bench surfaces a failure mode that **
|
| 93 |
-
with model scale** within the Qwen3.x family. RTS-style benchmarks
|
| 94 |
are sensitive to a commitment-vs-conservatism axis that doesn't
|
| 95 |
show up in static QA benchmarks. The user (and the field) should
|
| 96 |
not assume bigger = better on real-time decision tasks.
|
|
@@ -108,7 +107,7 @@ identification under partial information fails.
|
|
| 108 |
|
| 109 |
## F3 — Target initial visibility predicts win rate
|
| 110 |
|
| 111 |
-
Across Qwen3.5-9B's
|
| 112 |
"target in initial sight":
|
| 113 |
- `spec-tanya-c4-strike` (target adjacent at spawn): 4W / 4L
|
| 114 |
- `spec-engineer-capture easy` (target 4 cells east): 2W / 0L
|
|
@@ -121,7 +120,7 @@ medium versions — the only systematic difference is target
|
|
| 121 |
visibility. This validates the bench's Perception axis: model can
|
| 122 |
ACT when target is given; model FAILS when target requires search.
|
| 123 |
|
| 124 |
-
## Engine vs Scenario vs Model attribution (
|
| 125 |
|
| 126 |
- **Engine bugs**: 0 attributable to the engine in the sample.
|
| 127 |
(3 pre-existing engine P0s — per-player cash race, proc
|
|
@@ -134,9 +133,9 @@ ACT when target is given; model FAILS when target requires search.
|
|
| 134 |
auto-WIN — were FIXED: commits 4ebeee5 + 9000fe3. Bench's
|
| 135 |
defensive cash-strip commit b77e43d preempts the entire
|
| 136 |
regression class for 62 packs.)
|
| 137 |
-
- **Model failures**: all
|
| 138 |
-
triage script's classifier puts
|
| 139 |
-
no special verb), 4 in F2 (superweapon mis-aim), and
|
| 140 |
"verb invoked but still lost" (the third class — model used
|
| 141 |
the right verbs but couldn't sequence them effectively;
|
| 142 |
examples include econ-multi-patch-allocation losing with
|
|
@@ -159,7 +158,7 @@ time pressure. spec-nuke-strike's 0W aligns with F2 (mis-aim).
|
|
| 159 |
|
| 160 |
## Data integrity
|
| 161 |
|
| 162 |
-
- **All
|
| 163 |
PNG snapshots at `data/runs/paper-v1-*/`. No data loss.
|
| 164 |
- Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
|
| 165 |
- Resume-safe: `scripts/collect_eval_data.py --resume` skips cells
|
|
|
|
| 1 |
# Phase 5 — Model Failure Triage Findings
|
| 2 |
|
| 3 |
+
Source: 93 completed cells across 2 Together models (Qwen/Qwen3.5-9B
|
| 4 |
+
and Qwen/Qwen3.6-Plus, google/gemma-4-31B-it), 12 engine-feature scenario packs × 2 levels
|
| 5 |
(easy, medium) × 2 seeds × vision fog mode. Per-cell JSONL captures
|
| 6 |
the full untruncated turn-by-turn record (obs, system_prompt,
|
| 7 |
briefing, model_request, model_response, commands, signals,
|
| 8 |
+
terminal). Gemma-4-31B-it complete (6/6). Triage
|
| 9 |
generated by `scripts/triage_phase4.py`.
|
| 10 |
|
| 11 |
## Outcome matrix
|
| 12 |
|
| 13 |
+
### Qwen/Qwen3.5-9B (48 cells)
|
| 14 |
| pack | easy | medium |
|
| 15 |
|-------------------------------|--------|--------|
|
| 16 |
| combat-naval-shore-strike | 2W | 1W 1L |
|
|
|
|
| 27 |
| spec-thief-steal-cash | 2W | 1L 1D |
|
| 28 |
**Totals: 22W, 23L, 1D = 48% win on easy, 17% win on medium**
|
| 29 |
|
| 30 |
+
### Qwen/Qwen3.6-Plus (39 cells)
|
| 31 |
| pack | easy | medium |
|
| 32 |
|-------------------------------|--------|--------|
|
| 33 |
| spec-engineer-capture | 2L | 4L |
|
|
|
|
| 35 |
| spec-spy-infiltrate | 2L | 4L |
|
| 36 |
| spec-tanya-c4-strike | 2L | 4L |
|
| 37 |
| spec-thief-steal-cash | 2D | 4L |
|
| 38 |
+
**Totals: 0W, 37L, 2D — 0% win across ALL 39 cells.**
|
| 39 |
|
| 40 |
## Headline finding — F1: Qwen3.6-Plus's MODEL-SPECIFIC passivity
|
| 41 |
|
|
|
|
| 71 |
| Qwen/Qwen3.6-Plus | **39** | **0** | **37** | 2 | **0.0%** | **Observe (17-45×)** |
|
| 72 |
|
| 73 |
Two models at very different scales (9B and 31B) win 40-42% of
|
| 74 |
+
cells. Plus (a much larger model) wins **zero** cells across 39
|
| 75 |
runs, with `Observe×N` as its only action across the entire
|
| 76 |
budget on every cell.
|
| 77 |
|
|
|
|
| 86 |
calibrated to refuse action when uncertain, while 9B at least
|
| 87 |
attempts navigation. This is the "freeze and panic" failure class
|
| 88 |
predicted in PAPER_PLAN.md, now empirically demonstrated with a
|
| 89 |
+
clean ≥33pp delta (Plus 0% vs the other two models 33-42%).
|
| 90 |
|
| 91 |
### Implication for the paper
|
| 92 |
+
The bench surfaces a failure mode that **appears in Plus alone** within the Qwen3.x family. RTS-style benchmarks
|
|
|
|
| 93 |
are sensitive to a commitment-vs-conservatism axis that doesn't
|
| 94 |
show up in static QA benchmarks. The user (and the field) should
|
| 95 |
not assume bigger = better on real-time decision tasks.
|
|
|
|
| 107 |
|
| 108 |
## F3 — Target initial visibility predicts win rate
|
| 109 |
|
| 110 |
+
Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is
|
| 111 |
"target in initial sight":
|
| 112 |
- `spec-tanya-c4-strike` (target adjacent at spawn): 4W / 4L
|
| 113 |
- `spec-engineer-capture easy` (target 4 cells east): 2W / 0L
|
|
|
|
| 120 |
visibility. This validates the bench's Perception axis: model can
|
| 121 |
ACT when target is given; model FAILS when target requires search.
|
| 122 |
|
| 123 |
+
## Engine vs Scenario vs Model attribution (93 cells)
|
| 124 |
|
| 125 |
- **Engine bugs**: 0 attributable to the engine in the sample.
|
| 126 |
(3 pre-existing engine P0s — per-player cash race, proc
|
|
|
|
| 133 |
auto-WIN — were FIXED: commits 4ebeee5 + 9000fe3. Bench's
|
| 134 |
defensive cash-strip commit b77e43d preempts the entire
|
| 135 |
regression class for 62 packs.)
|
| 136 |
+
- **Model failures**: all 68 losses + 3 draws are model-side. The
|
| 137 |
+
triage script's classifier puts 46 in F1 (passive/walk-only,
|
| 138 |
+
no special verb), 4 in F2 (superweapon mis-aim), and 18 in
|
| 139 |
"verb invoked but still lost" (the third class — model used
|
| 140 |
the right verbs but couldn't sequence them effectively;
|
| 141 |
examples include econ-multi-patch-allocation losing with
|
|
|
|
| 158 |
|
| 159 |
## Data integrity
|
| 160 |
|
| 161 |
+
- **All 93 cells captured in full untruncated JSONL** with per-turn
|
| 162 |
PNG snapshots at `data/runs/paper-v1-*/`. No data loss.
|
| 163 |
- Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
|
| 164 |
- Resume-safe: `scripts/collect_eval_data.py --resume` skips cells
|