yxc20098 committed on
Commit 71cc3ba · 1 Parent(s): 9b5f094

Phase 5 doc-coherence pass: sync all per-cell counts to 93

Earlier edits updated the headline table but stale references to
'74 cells', '46 cells', '28 cells', '53 losses', '19 in F1' etc.
were sprinkled throughout the prose. Resolved: every number now
matches the 93-cell snapshot (48 9B + 39 Plus + 6 gemma).

Also refined the 'systematically worsens with model scale' framing
(the original 2-model hypothesis) to 'appears in Plus alone' —
the 3-model evidence (gemma 33% matches 9B 42% while Plus is 0%)
shows it's model-specific, not scale-driven.
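
The coherence pass described above can be approximated with a small script. This is a minimal sketch, not the tooling actually used for this commit: `find_stale`, `STALE`, and `CELLS` are hypothetical names, while the phrases and per-model counts are copied from the commit message.

```python
import re

# Per-model cell counts quoted in the commit message (9B, Plus, gemma).
CELLS = {"9B": 48, "Plus": 39, "gemma": 6}
TOTAL = sum(CELLS.values())  # should equal the 93-cell snapshot

# Stale phrases the commit message says were left behind in the prose.
STALE = ["74 cells", "46 cells", "28 cells", "53 losses", "19 in F1"]


def find_stale(text, phrases=STALE):
    """Return (line_number, phrase) for each stale phrase found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for phrase in phrases:
            # The lookbehind keeps "146 cells" from matching "46 cells".
            if re.search(r"(?<!\d)" + re.escape(phrase), line):
                hits.append((lineno, phrase))
    return hits
```

Calling `find_stale` on the contents of `PHASE5_FINDINGS.md` would list any leftover references by line. The scan is line-based, so a phrase wrapped across two lines would be missed.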

Files changed (1)
  1. PHASE5_FINDINGS.md +15 -16
PHASE5_FINDINGS.md CHANGED
@@ -1,16 +1,16 @@
 # Phase 5 — Model Failure Triage Findings
 
-Source: 74 completed cells across 2 Together models (Qwen/Qwen3.5-9B
-and Qwen/Qwen3.6-Plus), 12 engine-feature scenario packs × 2 levels
+Source: 93 completed cells across 2 Together models (Qwen/Qwen3.5-9B
+and Qwen/Qwen3.6-Plus, google/gemma-4-31B-it), 12 engine-feature scenario packs × 2 levels
 (easy, medium) × 2 seeds × vision fog mode. Per-cell JSONL captures
 the full untruncated turn-by-turn record (obs, system_prompt,
 briefing, model_request, model_response, commands, signals,
-terminal). Gemma-4-31B-it cells in flight (6 cells partial). Triage
+terminal). Gemma-4-31B-it complete (6/6). Triage
 generated by `scripts/triage_phase4.py`.
 
 ## Outcome matrix
 
-### Qwen/Qwen3.5-9B (46 cells)
+### Qwen/Qwen3.5-9B (48 cells)
 | pack | easy | medium |
 |-------------------------------|--------|--------|
 | combat-naval-shore-strike | 2W | 1W 1L |
@@ -27,7 +27,7 @@ generated by `scripts/triage_phase4.py`.
 | spec-thief-steal-cash | 2W | 1L 1D |
 **Totals: 22W, 23L, 1D = 48% win on easy, 17% win on medium**
 
-### Qwen/Qwen3.6-Plus (28 cells)
+### Qwen/Qwen3.6-Plus (39 cells)
 | pack | easy | medium |
 |-------------------------------|--------|--------|
 | spec-engineer-capture | 2L | 4L |
@@ -35,7 +35,7 @@ generated by `scripts/triage_phase4.py`.
 | spec-spy-infiltrate | 2L | 4L |
 | spec-tanya-c4-strike | 2L | 4L |
 | spec-thief-steal-cash | 2D | 4L |
-**Totals: 0W, 26L, 2D — 0% win across ALL cells.**
+**Totals: 0W, 37L, 2D — 0% win across ALL 39 cells.**
 
 ## Headline finding — F1: Qwen3.6-Plus's MODEL-SPECIFIC passivity
 
@@ -71,7 +71,7 @@ other models of comparable scale win cleanly.
 | Qwen/Qwen3.6-Plus | **39** | **0** | **37** | 2 | **0.0%** | **Observe (17-45×)** |
 
 Two models at very different scales (9B and 31B) win 40-42% of
-cells. Plus (a much larger model) wins **zero** cells across 30
+cells. Plus (a much larger model) wins **zero** cells across 39
 runs, with `Observe×N` as its only action across the entire
 budget on every cell.
 
@@ -86,11 +86,10 @@ perceived difficulty); model-side; scale-INVERSE. Plus appears
 calibrated to refuse action when uncertain, while 9B at least
 attempts navigation. This is the "freeze and panic" failure class
 predicted in PAPER_PLAN.md, now empirically demonstrated with a
-clean ≥30pp scaling delta.
+clean ≥33pp delta (Plus 0% vs the other two models 33-42%).
 
 ### Implication for the paper
-The bench surfaces a failure mode that **systematically worsens
-with model scale** within the Qwen3.x family. RTS-style benchmarks
+The bench surfaces a failure mode that **appears in Plus alone** within the Qwen3.x family. RTS-style benchmarks
 are sensitive to a commitment-vs-conservatism axis that doesn't
 show up in static QA benchmarks. The user (and the field) should
 not assume bigger = better on real-time decision tasks.
@@ -108,7 +107,7 @@ identification under partial information fails.
 
 ## F3 — Target initial visibility predicts win rate
 
-Across Qwen3.5-9B's 46 cells, the strongest predictor of WIN is
+Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is
 "target in initial sight":
 - `spec-tanya-c4-strike` (target adjacent at spawn): 4W / 4L
 - `spec-engineer-capture easy` (target 4 cells east): 2W / 0L
@@ -121,7 +120,7 @@ medium versions — the only systematic difference is target
 visibility. This validates the bench's Perception axis: model can
 ACT when target is given; model FAILS when target requires search.
 
-## Engine vs Scenario vs Model attribution (74 cells)
+## Engine vs Scenario vs Model attribution (93 cells)
 
 - **Engine bugs**: 0 attributable to the engine in the sample.
 (3 pre-existing engine P0s — per-player cash race, proc
@@ -134,9 +133,9 @@ ACT when target is given; model FAILS when target requires search.
 auto-WIN — were FIXED: commits 4ebeee5 + 9000fe3. Bench's
 defensive cash-strip commit b77e43d preempts the entire
 regression class for 62 packs.)
-- **Model failures**: all 53 losses + 3 draws are model-side. The
-triage script's classifier puts 19 in F1 (passive/walk-only,
-no special verb), 4 in F2 (superweapon mis-aim), and 30 in
+- **Model failures**: all 68 losses + 3 draws are model-side. The
+triage script's classifier puts 46 in F1 (passive/walk-only,
+no special verb), 4 in F2 (superweapon mis-aim), and 18 in
 "verb invoked but still lost" (the third class — model used
 the right verbs but couldn't sequence them effectively;
 examples include econ-multi-patch-allocation losing with
@@ -159,7 +158,7 @@ time pressure. spec-nuke-strike's 0W aligns with F2 (mis-aim).
 
 ## Data integrity
 
-- **All 74 cells captured in full untruncated JSONL** with per-turn
+- **All 93 cells captured in full untruncated JSONL** with per-turn
 PNG snapshots at `data/runs/paper-v1-*/`. No data loss.
 - Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
 - Resume-safe: `scripts/collect_eval_data.py --resume` skips cells