yxc20098 committed on
Commit
385aa0a
·
1 Parent(s): 01806f8

Phase 5: retract F1 (Plus passivity) — Together adapter bug

Together AI's Qwen3.6-Plus endpoint returns
finish_reason='tool_calls' and usage.text_tokens > 0 but emits
empty tool_calls and empty content in the message body — only
the reasoning channel is populated. Plus IS generating tool
calls server-side; Together's adapter drops them on the wire.

Verified by direct httpx tests (3 reasoning-disable variants
tried, none recover the calls) and confirmed against Together's
own model page, which does NOT list function calling for
Qwen3.6-Plus.

Implications for Phase 5:
- F1 'Plus is passive across all 55 cells' is retracted —
Plus's 0/55 win rate is a measurement artefact (bench falls
back to default Observe when tool_calls is empty), not a
model property.
- F3 (target initial visibility predicts win rate, validated
on 9B's 48 cells) is promoted to the new Phase 5 headline.
- Plus cells remain on disk (data/runs/paper-v1-plus-medium/
+ paper-v1-engine-feature-packs/) for future re-analysis if
Together fixes the adapter, or rerun via OpenRouter.

Working serverless Together roster for the paper now confirmed
as 3 models: Qwen3.5-9B, gemma-4-31B-it, Kimi-K2.6 — plus
Qwen3.5-397B-A17B (smoke-verified) as a fourth.

Files changed (1)
  1. PHASE5_FINDINGS.md +152 -112
PHASE5_FINDINGS.md CHANGED
@@ -1,12 +1,20 @@
1
  # Phase 5 — Model Failure Triage Findings
2
 
3
- Source: 112 completed cells across 2 Together models (Qwen/Qwen3.5-9B
4
- and Qwen/Qwen3.6-Plus, google/gemma-4-31B-it), 12 engine-feature scenario packs × 2 levels
5
- (easy, medium) × 2 seeds × vision fog mode. Per-cell JSONL captures
6
- the full untruncated turn-by-turn record (obs, system_prompt,
7
- briefing, model_request, model_response, commands, signals,
8
- terminal). Gemma-4-31B-it complete (6/6). Triage
9
- generated by `scripts/triage_phase4.py`.
10
 
11
  ## Outcome matrix
12
 
@@ -27,85 +35,106 @@ generated by `scripts/triage_phase4.py`.
27
  | spec-thief-steal-cash | 2W | 1L 1D |
28
  **Totals: 22W, 23L, 1D = 48% win on easy, 17% win on medium**
29
 
30
- ### Qwen/Qwen3.6-Plus (55 cells)
31
  | pack | easy | medium |
32
  |-------------------------------|--------|--------|
33
- | spec-engineer-capture | 2L | 4L |
34
- | spec-nuke-strike | 2L | - |
35
- | spec-spy-infiltrate | 2L | 4L |
36
- | spec-tanya-c4-strike | 2L | 4L |
37
- | spec-thief-steal-cash | 2D | 4L |
38
- **Totals: 0W, 53L, 2D 0% win across ALL 55 cells.**
39
-
40
- ## Headline finding F1: Qwen3.6-Plus's MODEL-SPECIFIC passivity
41
-
42
- **Initial framing:** scale-INVERSE within Qwen family. Updated
43
- framing after gemma-4-31B-it cells landed: **F1 is Plus-SPECIFIC,
44
- not a general scale phenomenon.**
45
-
46
- Cross-model evidence on the discriminating pack `spec-tanya-c4-strike`
47
- medium (proc in initial sight, requires walk + C4):
48
-
49
- | model | win rate | dominant loss-cell commands |
50
- |--------------------|--------------|------------------------------|
51
- | Qwen/Qwen3.5-9B | **2W / 2L** (50%) | (wins both cells) |
52
- | google/gemma-4-31B-it | **2W / 2L** (50%) | (wins both cells) |
53
- | Qwen/Qwen3.6-Plus | **0W / 4L** (0%) | `Obs×23` (pure passivity) |
54
-
55
- Same pack, three Together models: 9B and 31B WIN, Plus REFUSES. The
56
- failure isn't tied to model SIZE — it's tied to a Plus-family
57
- **over-conservative calibration** that doesn't show up in gemma-31B
58
- (comparable scale) or 9B.
59
-
60
- **Refined headline:** Qwen3.6-Plus on this benchmark exhibits a
61
- model-specific refusal mode where it issues only `Observe` for the
62
- entire decision budget, even on trivially-winnable cells where
63
- other models of comparable scale win cleanly.
64
-
65
- **Final 83-cell numbers** (after orphan-dir cleanup; tasks/triage_phase4.py):
66
-
67
- | model | cells | W | L | D | win rate | dominant loss verb |
68
- |--------------------|-------|-----|-----|-----|----------|--------------------|
69
- | Qwen/Qwen3.5-9B | 48 | 20 | 27 | 1 | **41.7%**| MoveUnits (33×) |
70
- | google/gemma-4-31B-it | **9** | **5** | **4** | 0 | **55.6%**| MoveUnits (33-34×) |
71
- | Qwen/Qwen3.6-Plus | **55** | **0** | **53** | 2 | **0.0%** | **Observe (17-45×)** |
72
-
73
- Two models at very different scales (9B and 31B) win 40-42% of
74
- cells. Plus (a much larger model) wins **zero** cells across 55
75
- runs, with `Observe×N` as its only action across the entire
76
- budget on every cell.
77
-
78
- The most striking comparison is `spec-tanya-c4-strike` — which
79
- Qwen3.5-9B wins 4/4 (proc in initial sight, walk + C4 = trivial).
80
- Qwen3.6-Plus loses 6/6 cells on the same pack with `Obs×23-34`
81
- across the entire budget. The pack is fully winnable for 9B; Plus
82
- issues only `Observe`.
83
-
84
- **Classification:** Reasoning axis (commitment failure under
85
- perceived difficulty); model-side; scale-INVERSE. Plus appears
86
- calibrated to refuse action when uncertain, while 9B at least
87
- attempts navigation. This is the "freeze and panic" failure class
88
- predicted in PAPER_PLAN.md, now empirically demonstrated with a
89
- clean ≥33pp delta (Plus 0% vs the other two models 33-42%).
90
-
91
- ### Implication for the paper
92
- The bench surfaces a failure mode that **appears in Plus alone** within the Qwen3.x family. RTS-style benchmarks
93
- are sensitive to a commitment-vs-conservatism axis that doesn't
94
- show up in static QA benchmarks. The user (and the field) should
95
- not assume bigger = better on real-time decision tasks.
96
 
97
  ## F2 — Superweapon mis-aim (Reasoning/Action axis)
98
 
99
- Both Qwen3.5-9B and Qwen3.6-Plus lose all `spec-nuke-strike` easy
100
- cells. Qwen3.5-9B INVOKES the verb (Observe×12-20 + FireSuperweapon
101
- ×5-8) but targets the wrong cell. Qwen3.6-Plus (just observed)
102
- falls under F1.
103
 
104
- **Classification:** Reasoning-axis spatial-commit failure. The
105
- verb is available, the charge timer is met, but cluster-centre
106
  identification under partial information fails.
107
 
108
- ## F3 — Target initial visibility predicts win rate
109
 
110
  Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is
111
  "target in initial sight":
@@ -120,7 +149,9 @@ medium versions — the only systematic difference is target
120
  visibility. This validates the bench's Perception axis: model can
121
  ACT when target is given; model FAILS when target requires search.
122
 
123
- ## Engine vs Scenario vs Model attribution (112 cells)
 
 
124
 
125
  - **Engine bugs**: 0 attributable to the engine in the sample.
126
  (3 pre-existing engine P0s — per-player cash race, proc
@@ -128,19 +159,14 @@ ACT when target is given; model FAILS when target requires search.
128
  session: commits a5014a5 + a84a3d7 + b77e43d. All Rust integration
129
  tests + bench engine-feature tests now green.)
130
  - **Scenario defects**: 0 attributable to scenarios in the sample.
131
- (3 audit-flagged defects: spec-spy-infiltrate DRAW,
132
- combat-naval-shore-strike auto-WIN, mid-economy-under-fire
133
- auto-WIN — were FIXED: commits 4ebeee5 + 9000fe3. Bench's
134
- defensive cash-strip commit b77e43d preempts the entire
135
  regression class for 62 packs.)
136
- - **Model failures**: all 84 losses + 3 draws are model-side. The
137
- triage script's classifier puts 62 in F1 (passive/walk-only,
138
- no special verb), 4 in F2 (superweapon mis-aim), and 18 in
139
- "verb invoked but still lost" (the third class model used
140
- the right verbs but couldn't sequence them effectively;
141
- examples include econ-multi-patch-allocation losing with
142
- `PlaceBuilding×44 Harvest×40 Build×40`, model spamming
143
- actions without strategic effect).
144
 
145
  ## Per-pack difficulty ranking (Qwen3.5-9B, easy tier)
146
 
@@ -156,32 +182,46 @@ Economy packs (build-or-die throughput) dominate the 0W list — a
156
  signal that the model struggles with multi-step build chains under
157
  time pressure. spec-nuke-strike's 0W aligns with F2 (mis-aim).
158
 
159
  ## Data integrity
160
 
161
  - **All 112 cells captured in full untruncated JSONL** with per-turn
162
- PNG snapshots at `data/runs/paper-v1-*/`. No data loss.
 
 
163
  - Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
164
  - Resume-safe: `scripts/collect_eval_data.py --resume` skips cells
165
  with a `terminal:` line; partial cells re-run cleanly.
166
 
167
- ## Phase 5 status: COMPLETE
168
-
169
- The headline F1 finding is now established with ≥30pp scaling
170
- delta evidence. F2 + F3 documented. The collection continues
171
- accumulating in background (the script's resume logic makes
172
- this safe to extend without re-doing completed cells), but
173
- NOTHING in the additional data is likely to overturn F1's
174
- scale-INVERSE direction. Paper-ready.
175
-
176
- ## Next paper-prep steps (out of scope for this Phase 5)
177
-
178
- 1. Cross-link F1 into PAPER_PLAN.md §3 Findings as the headline
179
- result.
180
- 2. Add gemma-4-31B-it cells (in flight) as a third model column;
181
- if gemma also matches Plus's passivity, the scale-INVERSE
182
- becomes a general finding (not Qwen-specific).
183
- 3. Add Kimi-K2.6 as a fourth model; Kimi's reasoning chain may
184
- distinguish it from Qwen's passivity pattern.
185
- 4. Run perception-sweep cells (structured/vision/image ×
186
- fog/no-fog) on the same packs to test F3 with controlled
 
187
  visibility variation.
 
1
  # Phase 5 — Model Failure Triage Findings
2
 
3
+ Source: 112 completed cells across 3 Together models (Qwen/Qwen3.5-9B,
4
+ Qwen/Qwen3.6-Plus, google/gemma-4-31B-it), 12 engine-feature scenario
5
+ packs × 2 levels (easy, medium) × 2 seeds × vision fog mode. Per-cell
6
+ JSONL captures the full untruncated turn-by-turn record (obs,
7
+ system_prompt, briefing, model_request, model_response, commands,
8
+ signals, terminal). Triage generated by `scripts/triage_phase4.py`.
9
+
10
+ > **⚠ CORRECTION (2026-05-23): the original F1 headline — "Plus is
11
+ > passive" — is RETRACTED. Root cause is a Together-API adapter bug
12
+ > dropping Plus's tool_calls from the wire response. Plus IS reasoning
13
+ > and emitting tool calls server-side; the bench parser receives an
14
+ > empty `tool_calls` list and falls back to the default Observe. See
15
+ > §F1-RETRACTED below for the full diagnosis. Phase-5 headline is now
16
+ > the F3 perception-axis result (target visibility predicts win rate),
17
+ > documented below.**
18
 
19
  ## Outcome matrix
20
 
 
35
  | spec-thief-steal-cash | 2W | 1L 1D |
36
  **Totals: 22W, 23L, 1D = 48% win on easy, 17% win on medium**
37
 
38
+ ### google/gemma-4-31B-it (9 cells, partial)
39
  | pack | easy | medium |
40
  |-------------------------------|--------|--------|
41
+ | spec-tanya-c4-strike | 2W | 1W 1L |
42
+ | spec-engineer-capture | 2W | - |
43
+ | (others in flight) | | |
44
+ **Partial: 5W / 4L / 0D = 55.6% win**
45
+
46
+ ### Qwen/Qwen3.6-Plus (55 cells, **EXCLUDED from headline**)
47
+ All cells issued `Observe` only (default fallback) due to the adapter
48
+ bug described in §F1-RETRACTED. The 0/55 win rate is a **measurement
49
+ artefact**, not a model property. Cells remain on disk for future
50
+ re-analysis once the adapter is fixed.
51
+
52
+ ## F1-RETRACTED: Together adapter drops Plus's tool_calls
53
+
54
+ **What we originally claimed (now retracted):** Plus exhibited a
55
+ model-specific "freeze and panic" passivity where it issued only
56
+ `Observe` across the entire decision budget on every cell, despite
57
+ 9B and 31B winning the same packs.
58
+
59
+ **What's actually happening:**
60
+
61
+ Every Plus turn's raw Together response has this exact shape:
62
+
63
+ ```json
64
+ {
65
+ "choices": [{
66
+ "message": {"role": "assistant", "reasoning": "I need to move Tanya east to scout..."},
67
+ "finish_reason": "tool_calls"
68
+ }],
69
+ "usage": {
70
+ "completion_tokens": 345,
71
+ "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69}
72
+ }
73
+ }
74
+ ```
75
+
76
+ Three pieces of evidence prove Plus DID emit tool calls:
77
+
78
+ 1. `finish_reason: "tool_calls"`: the API itself reports the
79
+ completion ended on tool-call emission.
80
+ 2. `completion_tokens_details.text_tokens: 69` — Plus produced 69
81
+ non-reasoning tokens (the tool-call JSON), but they're absent
82
+ from `message.content` and `message.tool_calls`.
83
+ 3. The `reasoning` channel consistently ends with concrete intent
84
+ ("I'll move to (50, 20) to scout east") — Plus is reasoning
85
+ correctly and arriving at a specific action.
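The three signals above can be combined into a single check. The following is a minimal sketch, not the bench's actual code: the helper name `tool_calls_dropped` is hypothetical, and the sample dictionary simply mirrors the response shape quoted in this section.

```python
from typing import Any


def tool_calls_dropped(response: dict[str, Any]) -> bool:
    """Return True when a response claims a tool call but carries none.

    Hypothetical helper: it flags the pattern described in this section,
    i.e. finish_reason == "tool_calls", text_tokens > 0, and a message
    body with neither tool_calls nor content.
    """
    choices = response.get("choices") or []
    if not choices:
        return False
    choice = choices[0]
    message = choice.get("message") or {}
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}

    claims_tool_call = choice.get("finish_reason") == "tool_calls"
    has_text_tokens = (details.get("text_tokens") or 0) > 0
    body_is_empty = not message.get("tool_calls") and not message.get("content")
    return claims_tool_call and has_text_tokens and body_is_empty


# Sample mirroring the response shape quoted in this section.
sample = {
    "choices": [{
        "message": {"role": "assistant",
                    "reasoning": "I need to move Tanya east to scout..."},
        "finish_reason": "tool_calls",
    }],
    "usage": {
        "completion_tokens": 345,
        "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69},
    },
}

print(tool_calls_dropped(sample))
```

A check like this could run at collection time to mark a turn as a provider-side failure instead of silently falling back to the default command.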
86
+
87
+ **Diagnosis:** Together's response adapter for Plus serialises the
88
+ reasoning channel but DROPS the actual tool-call structure from the
89
+ returned message. Bench's `_reply_from_data` parser
90
+ (`openra_bench/providers.py:413-423`) reads `msg.get("tool_calls") or []`
91
+ → empty → bench issues default `Command::Observe`.
92
+
93
+ **This is a Together backend bug, not a Plus model bug, and not a
94
+ bench parser bug.** Verified by:
95
+ - Direct httpx test outside bench: `tool_choice=auto` (streamed) →
96
+ reasoning text only, `tool_calls=[]`, finish_reason=tool_calls.
97
+ - `tool_choice=required` (streamed) → no completion at all.
98
+ - Bench's existing Plus tool-call scrub (task #84) covered the
99
+ history-shape side (empty `tool_calls: []` rejection); it does NOT
100
+ recover the dropped server-side tool calls.
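A direct probe of this kind could look roughly like the sketch below. It is not the test that was actually run: the endpoint URL, the `move_units` tool schema, the prompt, and both function names are assumptions for illustration. It uses a non-streamed request to keep response parsing simple, whereas the tests listed above were streamed.

```python
import json
from typing import Any

# Assumed OpenAI-compatible chat-completions endpoint for Together.
TOGETHER_URL = "https://api.together.xyz/v1/chat/completions"


def build_probe_payload(tool_choice: str = "auto") -> dict[str, Any]:
    """Build a minimal tool-calling request (hypothetical tool schema)."""
    return {
        "model": "Qwen/Qwen3.6-Plus",
        "messages": [
            {"role": "user", "content": "Move one unit to cell (50, 20)."}
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "move_units",  # illustrative, not the bench schema
                "description": "Move units to a map cell.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "x": {"type": "integer"},
                        "y": {"type": "integer"},
                    },
                    "required": ["x", "y"],
                },
            },
        }],
        "tool_choice": tool_choice,
        "stream": False,  # non-streamed so the reply is a single JSON body
    }


def send_probe(payload: dict[str, Any], api_key: str) -> dict[str, Any]:
    """POST the payload and return the decoded JSON response."""
    import httpx  # third-party; imported lazily so the builder runs without it

    resp = httpx.post(
        TOGETHER_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=60.0,
    )
    resp.raise_for_status()
    return resp.json()


print(json.dumps(build_probe_payload("auto"), indent=2))
```

Calling `send_probe(build_probe_payload("auto"), api_key)` and inspecting `choices[0].message` and `usage` in the result would show whether the tool call survives the round trip.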
101
+
102
+ **Implications:**
103
+
104
+ - The "Plus is passive" headline is **invalid**. The bench cannot
105
+ measure Plus's RTS reasoning at all through the Together endpoint
106
+ until the adapter is fixed.
107
+ - Per-pack outcomes for Plus on this dataset reflect "what happens
108
+ when the agent issues Observe every turn for 25 turns" (always a
109
+ loss/draw for packs that require any action).
110
+ - Paper-side: **omit Plus from headline model comparisons.** Either
111
+ add a clearly-labelled "Together adapter excludes Plus" footnote,
112
+ or rerun Plus through a different endpoint (OpenRouter, direct
113
+ Anthropic-style, or Together once they fix the adapter).
114
+
115
+ **Next steps:**
116
+ 1. (Done) Document the adapter bug here and in
117
+ `openra_bench/providers.py` (already notes Plus quirks).
118
+ 2. File upstream issue with Together support, including the
119
+ minimal reproduction (see snippet above + `usage.completion_tokens_details.text_tokens > 0`
120
+ while `message` lacks both `content` and `tool_calls`).
121
+ 3. Optional workaround: write a "reasoning-channel fallback parser"
122
+ that extracts intent like `move_units` / `attack_unit` / numeric
123
+ coordinates from the reasoning text. Fragile and would conflate
124
+ model output with NLP-extraction error; better to wait for the
125
+ adapter fix or use a different endpoint.
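To make the fragility of the optional workaround concrete, here is a deliberately naive sketch of such a fallback parser. Everything in it is an assumption for illustration: the function name `extract_intent`, the keyword table, and the coordinate regex are not part of the bench.

```python
import re
from typing import Optional

# Hypothetical keyword-to-verb table; the verb names follow the examples
# mentioned in this section.
VERB_HINTS = {"move": "move_units", "attack": "attack_unit"}
COORD_RE = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def extract_intent(reasoning: str) -> Optional[tuple[str, tuple[int, int]]]:
    """Guess (verb, (x, y)) from free-form reasoning text, or None."""
    coords = COORD_RE.findall(reasoning)
    if not coords:
        return None
    x, y = coords[-1]  # take the last coordinate pair mentioned

    lowered = reasoning.lower()
    best_verb, best_pos = None, -1
    for word, verb in VERB_HINTS.items():
        pos = lowered.rfind(word)  # -1 when the keyword is absent
        if pos > best_pos:
            best_verb, best_pos = verb, pos
    if best_verb is None:
        return None
    return best_verb, (int(x), int(y))


print(extract_intent("I'll move to (50, 20) to scout east"))
# Negation is invisible to keyword matching, so this is misread as an attack.
print(extract_intent("I should not attack (10, 5) yet"))
```

The second call returns an `attack_unit` intent even though the text says not to attack, which is the kind of extraction error that would be conflated with model behaviour.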
126
 
127
  ## F2 — Superweapon mis-aim (Reasoning/Action axis)
128
 
129
+ Qwen3.5-9B loses all `spec-nuke-strike` easy cells. The model
130
+ INVOKES the verb (Observe×12-20 + FireSuperweapon×5-8) but targets
131
+ the wrong cell. Plus's cells on this pack are unusable per §F1-RETRACTED.
 
132
 
133
+ **Classification:** Reasoning-axis spatial-commit failure. The verb
134
+ is available, the charge timer is met, but cluster-centre
135
  identification under partial information fails.
136
 
137
+ ## F3 — Target initial visibility predicts win rate (headline)
138
 
139
  Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is
140
  "target in initial sight":
 
149
  visibility. This validates the bench's Perception axis: model can
150
  ACT when target is given; model FAILS when target requires search.
151
 
152
+ This is now the **headline Phase-5 finding**, since F1 is retracted.
153
+
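The F3 comparison amounts to grouping cells by initial target visibility and computing a win rate per group. A minimal sketch follows, assuming each cell record carries hypothetical `target_in_initial_sight` and `outcome` fields; the sample rows are toy inputs, not bench results.

```python
from collections import defaultdict


def win_rate_by_visibility(cells: list[dict]) -> dict[bool, float]:
    """Return {target_visible: win_rate} over the given cell records."""
    tally = defaultdict(lambda: [0, 0])  # visible -> [wins, total]
    for cell in cells:
        bucket = tally[bool(cell["target_in_initial_sight"])]
        bucket[0] += cell["outcome"] == "W"  # True counts as 1
        bucket[1] += 1
    return {visible: wins / total for visible, (wins, total) in tally.items()}


# Toy rows for illustration only.
toy_cells = [
    {"target_in_initial_sight": True, "outcome": "W"},
    {"target_in_initial_sight": True, "outcome": "W"},
    {"target_in_initial_sight": False, "outcome": "L"},
    {"target_in_initial_sight": False, "outcome": "L"},
    {"target_in_initial_sight": False, "outcome": "W"},
]

rates = win_rate_by_visibility(toy_cells)
print(rates)
```

Draws count toward the total but not toward wins in this sketch; a different convention would change the denominators.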
154
+ ## Engine vs Scenario vs Model attribution
155
 
156
  - **Engine bugs**: 0 attributable to the engine in the sample.
157
  (3 pre-existing engine P0s — per-player cash race, proc
 
159
  session: commits a5014a5 + a84a3d7 + b77e43d. All Rust integration
160
  tests + bench engine-feature tests now green.)
161
  - **Scenario defects**: 0 attributable to scenarios in the sample.
162
+ (3 audit-flagged defects fixed: commits 4ebeee5 + 9000fe3.
163
+ Bench's defensive cash-strip commit b77e43d preempts the entire
 
 
164
  regression class for 62 packs.)
165
+ - **Provider/adapter bugs**: 1 confirmed (Together drops Plus
166
+ tool_calls). Class: PROVIDER, not MODEL, not BENCH. See
167
+ §F1-RETRACTED.
168
+ - **Model failures (9B + gemma only)**: losses cluster on packs
169
+ where target requires search (F3). Plus excluded.
 
 
 
170
 
171
  ## Per-pack difficulty ranking (Qwen3.5-9B, easy tier)
172
 
 
182
  signal that the model struggles with multi-step build chains under
183
  time pressure. spec-nuke-strike's 0W aligns with F2 (mis-aim).
184
 
185
+ ## Cell-count asymmetry note
186
+
187
+ The three models have different completed-cell counts (9B=48,
188
+ Plus=55, gemma=9) because the collection ran models sequentially
189
+ through the main 240-cell plan, then added side runs for Plus
190
+ (`paper-v1-plus-medium/`, 8 medium cells) and gemma
191
+ (`paper-v1-gemma-medium/`, 6 medium cells) to fill in coverage on
192
+ the discriminating `spec-tanya-c4-strike medium` cell. Collection
193
+ remains in flight; cells accumulate via `scripts/collect_eval_data.py
194
+ --resume`.
195
+
196
  ## Data integrity
197
 
198
  - **All 112 cells captured in full untruncated JSONL** with per-turn
199
+ PNG snapshots at `data/runs/paper-v1-*/`. No data loss. Plus's
200
+ cells remain available for re-analysis once the Together adapter
201
+ is fixed.
202
  - Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
203
  - Resume-safe: `scripts/collect_eval_data.py --resume` skips cells
204
  with a `terminal:` line; partial cells re-run cleanly.
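The resume rule described above can be sketched as a per-file completeness check. This is an assumption-laden illustration, not the logic in `scripts/collect_eval_data.py`: it assumes each JSONL line is a JSON object and that a finished cell contains a record with a top-level `terminal` key; the function name `cell_is_complete` is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def cell_is_complete(path: Path) -> bool:
    """Return True if the cell's JSONL file contains a terminal record."""
    if not path.exists():
        return False
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a torn trailing write from an interrupted run
            if isinstance(record, dict) and "terminal" in record:
                return True
    return False


# Small demonstration with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    done = Path(tmp) / "done.jsonl"
    done.write_text('{"obs": 1}\n{"terminal": "loss"}\n', encoding="utf-8")
    partial = Path(tmp) / "partial.jsonl"
    partial.write_text('{"obs": 1}\n{"obs": 2', encoding="utf-8")
    done_ok = cell_is_complete(done)
    partial_ok = cell_is_complete(partial)
    missing_ok = cell_is_complete(Path(tmp) / "missing.jsonl")

print(done_ok, partial_ok, missing_ok)
```

Under these assumptions a resumed run would skip `done.jsonl` and re-run the other two cells.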
205
 
206
+ ## Phase 5 status: COMPLETE (F1 retracted, F3 promoted)
207
+
208
+ The collection continues accumulating in background. The
209
+ provider-bug finding is the most actionable next step: file with
210
+ Together, optionally implement a reasoning-channel fallback, and
211
+ rerun Plus through a different endpoint to get a real Plus signal.
212
+
213
+ ## Next paper-prep steps
214
+
215
+ 1. Cross-link F3 (perception-axis target visibility) into
216
+ PAPER_PLAN.md §3 Findings as the headline result.
217
+ 2. Add a "Provider failures we found" section to the paper covering
218
+ the Together-Plus adapter bug as an empirical observation about
219
+ the maturity of OSS-model tool-calling adapters, which is itself
220
+ a finding of interest for the agent-benchmark community.
221
+ 3. Rerun Plus through an alternative endpoint (OpenRouter or fixed
222
+ Together) for the real Plus comparison once available.
223
+ 4. Add Kimi-K2.6 as a fourth model; verify Kimi's tool-calls are
224
+ not adapter-dropped before drawing conclusions.
225
+ 5. Run perception-sweep cells (structured/vision/image ×
226
+ fog/no-fog) on the same packs to strengthen F3 with controlled
227
  visibility variation.