# Phase 5 — Model Failure Triage Findings

Source: 112 completed cells across 3 Together models (Qwen/Qwen3.5-9B,
Qwen/Qwen3.6-Plus, google/gemma-4-31B-it), 12 engine-feature scenario
packs × 2 levels (easy, medium) × 2 seeds × vision fog mode. Per-cell
JSONL captures the full untruncated turn-by-turn record (obs,
system_prompt, briefing, model_request, model_response, commands,
signals, terminal). Triage generated by `scripts/triage_phase4.py`.

> **⚠ CORRECTION (2026-05-23): the original F1 headline — "Plus is
> passive" — is RETRACTED. Root cause is a Together-API adapter bug
> dropping Plus's tool_calls from the wire response. Plus IS reasoning
> and emitting tool calls server-side; the bench parser receives an
> empty `tool_calls` list and falls back to the default Observe. See
> §F1-RETRACTED below for the full diagnosis. Phase-5 headline is now
> the F3 perception-axis result (target visibility predicts win rate),
> documented below.**

## Outcome matrix

### Qwen/Qwen3.5-9B (48 cells)
| pack                          | easy   | medium |
|-------------------------------|--------|--------|
| combat-naval-shore-strike     | 2W     | 1W 1L  |
| def-bridge-chokepoint         | 1W 1L  | 2W     |
| econ-contested-expansion      | 2L     | 2L     |
| econ-harvester-defense-raid   | 2W     | 2L     |
| econ-mine-and-grow            | 2L     | 2L     |
| econ-multi-patch-allocation   | 2L     | 2L     |
| econ-second-base-race         | 2W     | 2L     |
| spec-engineer-capture         | 2W     | 2L     |
| spec-nuke-strike              | 2L     | 2L     |
| spec-spy-infiltrate           | 2W     | 2L     |
| spec-tanya-c4-strike          | 2W     | 2W (perfect 4/4) |
| spec-thief-steal-cash         | 2W     | 1L 1D  |

**Totals (tallied from the table above): 20W, 27L, 1D over 48 cells.
Easy: 15W / 9L = 62.5% win. Medium: 5W / 18L / 1D = 20.8% win.**

### google/gemma-4-31B-it (9 cells, partial)
| pack                          | easy   | medium |
|-------------------------------|--------|--------|
| spec-tanya-c4-strike          | 2W     | 1W 1L  |
| spec-engineer-capture         | 2W     | -      |
| (others in flight)            |        |        |

**Partial: 5W / 4L / 0D = 55.6% win**

### Qwen/Qwen3.6-Plus (55 cells, **EXCLUDED from headline**)
All cells issued `Observe` only (default fallback) due to the adapter
bug described in §F1-RETRACTED. The 0/55 win rate is a **measurement
artefact**, not a model property. Cells remain on disk for future
re-analysis once the adapter is fixed.

## F1-RETRACTED — Together adapter drops Plus's tool_calls

**What we originally claimed (now retracted):** Plus exhibited a
model-specific "freeze and panic" passivity where it issued only
`Observe` across the entire decision budget on every cell, despite
9B and 31B winning the same packs.

**What's actually happening:**

Every Plus turn's raw Together response has this exact shape:

```json
{
  "choices": [{
    "message": {"role": "assistant", "reasoning": "I need to move Tanya east to scout..."},
    "finish_reason": "tool_calls"
  }],
  "usage": {
    "completion_tokens": 345,
    "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69}
  }
}
```

Three pieces of evidence prove Plus DID emit tool calls:

1. `finish_reason: "tool_calls"` — the API itself reports the
   completion ended on tool-call emission.
2. `completion_tokens_details.text_tokens: 69` — Plus produced 69
   non-reasoning tokens (the tool-call JSON), but they're absent
   from `message.content` and `message.tool_calls`.
3. The `reasoning` channel consistently ends with concrete intent
   ("I'll move to (50, 20) to scout east") — Plus is reasoning
   correctly and arriving at a specific action.

**Diagnosis:** Together's response adapter for Plus serialises the
reasoning channel but DROPS the actual tool-call structure from the
returned message. Bench's `_reply_from_data` parser
(`openra_bench/providers.py:413-423`) reads `msg.get("tool_calls") or []`
→ empty → bench issues default `Command::Observe`.
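
The symptom can be flagged mechanically. The following is a minimal sketch, not the bench's actual parser: `dropped_tool_calls` is a hypothetical helper, and it assumes the response has already been decoded into a plain dict with the field names shown in the JSON above.

```python
from typing import Any


def dropped_tool_calls(data: dict[str, Any]) -> bool:
    """Return True when a response claims tool calls but carries none.

    Hypothetical helper: it checks the same fields quoted in the raw
    response above and does not touch the bench's own parser.
    """
    choice = (data.get("choices") or [{}])[0]
    msg = choice.get("message") or {}
    details = (data.get("usage") or {}).get("completion_tokens_details") or {}

    claims_tool_calls = choice.get("finish_reason") == "tool_calls"
    has_tool_calls = bool(msg.get("tool_calls"))
    has_content = bool(msg.get("content"))
    has_text_tokens = (details.get("text_tokens") or 0) > 0
    # Non-reasoning tokens were billed, yet nothing non-reasoning arrived.
    return (claims_tool_calls and not has_tool_calls
            and not has_content and has_text_tokens)


# The Plus response shape from above, reduced to the fields the check reads.
plus_response = {
    "choices": [{
        "message": {"role": "assistant",
                    "reasoning": "I need to move Tanya east to scout..."},
        "finish_reason": "tool_calls",
    }],
    "usage": {
        "completion_tokens": 345,
        "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69},
    },
}

# Illustrative healthy response; the tool-call payload here is made up.
healthy_response = {
    "choices": [{
        "message": {"role": "assistant", "tool_calls": [
            {"id": "call_0", "type": "function",
             "function": {"name": "move_units", "arguments": "{}"}},
        ]},
        "finish_reason": "tool_calls",
    }],
    "usage": {"completion_tokens_details": {"text_tokens": 12}},
}

print(dropped_tool_calls(plus_response))     # True
print(dropped_tool_calls(healthy_response))  # False
```

A check along these lines could run per turn so that affected cells are marked as provider failures rather than being scored as `Observe` turns.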

**This is a Together backend bug, not a Plus model bug, and not a
bench parser bug.** Verified by:
- Direct httpx test outside bench: `tool_choice=auto` (streamed)
  → reasoning text only, `tool_calls=[]`, finish_reason=tool_calls.
- `tool_choice=required` (streamed) → no completion at all.
- Bench's existing Plus tool-call scrub (task #84) covered the
  history-shape side (empty `tool_calls: []` rejection); it does NOT
  recover the dropped server-side tool calls.

**Implications:**

- The "Plus is passive" headline is **invalid**. The bench cannot
  measure Plus's RTS reasoning at all through the Together endpoint
  until the adapter is fixed.
- Per-pack outcomes for Plus on this dataset reflect "what happens
  when the agent issues Observe every turn for 25 turns" (always a
  loss/draw for packs that require any action).
- Paper-side: **omit Plus from headline model comparisons.** Either
  add a clearly-labelled "Together adapter excludes Plus" footnote,
  or rerun Plus through a different endpoint (OpenRouter, direct
  Anthropic-style, or Together once they fix the adapter).

**Next steps:**
1. (Done) Document the adapter bug here and in
   `openra_bench/providers.py` (already notes Plus quirks).
2. File upstream issue with Together support, including the
   minimal reproduction (see snippet above + `usage.text_tokens > 0`
   while `message` lacks both `content` and `tool_calls`).
3. Optional workaround: write a "reasoning-channel fallback parser"
   that extracts intent like `move_units` / `attack_unit` / numeric
   coordinates from the reasoning text. Fragile and would conflate
   model output with NLP-extraction error; better to wait for the
   adapter fix or use a different endpoint.
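
To make the fragility of step 3 concrete, here is a rough sketch of what such a fallback might look like. Everything in it is an assumption for illustration: `guess_intent`, the keyword table, and the coordinate regex are hypothetical, and only the tool names `move_units` and `attack_unit` come from the text above.

```python
import re
from typing import Optional

# Hypothetical keyword-to-tool mapping; the real tool schema is not shown here.
VERB_HINTS = {"move": "move_units", "attack": "attack_unit"}
# Matches coordinate pairs written like "(50, 20)".
COORD_RE = re.compile(r"\((\d+),\s*(\d+)\)")


def guess_intent(reasoning: str) -> Optional[dict]:
    """Guess a tool and target cell from reasoning text, or None if unsure."""
    coords = COORD_RE.findall(reasoning)
    lowered = reasoning.lower()
    verbs = [tool for hint, tool in VERB_HINTS.items()
             if re.search(rf"\b{hint}\b", lowered)]
    # Refuse to guess unless exactly one verb and one coordinate pair appear.
    if len(verbs) != 1 or len(coords) != 1:
        return None
    x, y = map(int, coords[0])
    return {"tool": verbs[0], "x": x, "y": y}


print(guess_intent("I'll move to (50, 20) to scout east"))
# {'tool': 'move_units', 'x': 50, 'y': 20}
print(guess_intent("I need to move Tanya east to scout..."))
# None (no coordinates to commit to)
```

Even this narrow version returns `None` for reasoning that names a direction but no coordinates, which illustrates why scores obtained through a fallback like this would mix model behaviour with extraction error.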

## F2 — Superweapon mis-aim (Reasoning/Action axis)

Qwen3.5-9B loses all `spec-nuke-strike` cells at both levels (2L easy,
2L medium). In the easy cells the model INVOKES the verb
(Observe×12-20 + FireSuperweapon×5-8) but targets the wrong cell.
Plus's cells on this pack are unusable per §F1-RETRACTED.

**Classification:** Reasoning-axis spatial-commit failure. The verb
is available, the charge timer is met, but cluster-centre
identification under partial information fails.

## F3 — Target initial visibility predicts win rate (headline)

Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is
"target in initial sight":
- `spec-tanya-c4-strike` (target adjacent at spawn): 4W / 0L
- `spec-engineer-capture easy` (target 4 cells east): 2W / 0L
- `spec-spy-infiltrate easy` (proc adjacent): 2W / 0L
- `spec-engineer-capture medium` (target 12 cells off-latitude): 0W / 2L
- `spec-spy-infiltrate medium` (target fogged): 0W / 2L

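For `spec-engineer-capture` and `spec-spy-infiltrate`, the same model
wins the easy versions and loses the medium versions, and the main
systematic difference between the two levels is target visibility.
`spec-tanya-c4-strike`, where the target is adjacent at spawn, is won
at both levels. This supports the bench's Perception axis: model can
ACT when target is given; model FAILS when target requires search.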
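
The split can be tallied in a few lines. This is an illustrative sketch only: the win/loss counts are read from the Qwen3.5-9B outcome matrix, and the `visible` flag is an assumption derived from the parenthetical descriptions in the list above (adjacent or 4 cells east counted as visible, fogged or 12 cells off-latitude counted as requiring search).

```python
from collections import defaultdict

# (pack, level, target visible at spawn?, wins, losses) for Qwen3.5-9B.
rows = [
    ("spec-tanya-c4-strike", "easy+medium", True, 4, 0),
    ("spec-engineer-capture", "easy", True, 2, 0),
    ("spec-spy-infiltrate", "easy", True, 2, 0),
    ("spec-engineer-capture", "medium", False, 0, 2),
    ("spec-spy-infiltrate", "medium", False, 0, 2),
]

# Accumulate [wins, losses] per visibility class.
tally = defaultdict(lambda: [0, 0])
for _pack, _level, visible, wins, losses in rows:
    tally[visible][0] += wins
    tally[visible][1] += losses

win_rate = {visible: w / (w + l) for visible, (w, l) in tally.items()}
print(win_rate)  # {True: 1.0, False: 0.0}
```

On these 12 cells the separation is complete (8/8 wins with the target visible, 0/4 with the target requiring search), though the sample is small and covers only three packs.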

This is now the **headline Phase-5 finding**, now that F1 is retracted.

## Engine vs Scenario vs Model attribution

- **Engine bugs**: 0 attributable to the engine in the sample.
  (3 pre-existing engine P0s — per-player cash race, proc
  auto-spawn, production completion — were FIXED earlier this
  session: commits a5014a5 + a84a3d7 + b77e43d. All Rust integration
  tests + bench engine-feature tests now green.)
- **Scenario defects**: 0 attributable to scenarios in the sample.
  (3 audit-flagged defects fixed: commits 4ebeee5 + 9000fe3.
  Bench's defensive cash-strip commit b77e43d preempts the entire
  regression class for 62 packs.)
- **Provider/adapter bugs**: 1 confirmed (Together drops Plus
  tool_calls). Class: PROVIDER, not MODEL, not BENCH. See
  §F1-RETRACTED.
- **Model failures (9B + gemma only)**: losses cluster on packs
  where target requires search (F3). Plus excluded.

## Per-pack difficulty ranking (Qwen3.5-9B, easy tier)

Wins out of 2 seeds per pack:
- 2W: combat-naval-shore-strike, econ-harvester-defense-raid,
  econ-second-base-race, spec-engineer-capture, spec-spy-infiltrate,
  spec-tanya-c4-strike, spec-thief-steal-cash
- 1W: def-bridge-chokepoint
- 0W: econ-contested-expansion, econ-mine-and-grow,
  econ-multi-patch-allocation, spec-nuke-strike

Economy packs (build-or-die throughput) dominate the 0W list — a
signal that the model struggles with multi-step build chains under
time pressure. spec-nuke-strike's 0W aligns with F2 (mis-aim).
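
The grouping above can be reproduced from the easy column of the outcome matrix. The snippet is a small sketch; the `wins` helper and variable names are illustrative and not part of the bench.

```python
# Easy-tier cells for Qwen3.5-9B, copied from the outcome matrix.
easy = {
    "combat-naval-shore-strike": "2W",
    "def-bridge-chokepoint": "1W 1L",
    "econ-contested-expansion": "2L",
    "econ-harvester-defense-raid": "2W",
    "econ-mine-and-grow": "2L",
    "econ-multi-patch-allocation": "2L",
    "econ-second-base-race": "2W",
    "spec-engineer-capture": "2W",
    "spec-nuke-strike": "2L",
    "spec-spy-infiltrate": "2W",
    "spec-tanya-c4-strike": "2W",
    "spec-thief-steal-cash": "2W",
}


def wins(cell: str) -> int:
    """Count wins in a cell string such as '1W 1L'."""
    return sum(int(tok[:-1]) for tok in cell.split() if tok.endswith("W"))


# Group packs by number of wins out of 2 seeds.
by_wins: dict[int, list[str]] = {}
for pack, cell in easy.items():
    by_wins.setdefault(wins(cell), []).append(pack)

econ_zero = [p for p in by_wins[0] if p.startswith("econ-")]
print(len(by_wins[2]), len(by_wins[1]), len(by_wins[0]))  # 7 1 4
print(len(econ_zero), "of", len(by_wins[0]), "zero-win packs are economy packs")
```

Three of the four zero-win packs are economy packs, which is the basis for the "dominate" claim; the remaining two economy packs (`econ-harvester-defense-raid`, `econ-second-base-race`) are won on both easy seeds.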

## Cell-count asymmetry note

The three models have different completed-cell counts (9B=48,
Plus=55, gemma=9) because the collection ran models sequentially
through the main 240-cell plan, then added side runs for Plus
(`paper-v1-plus-medium/`, 8 medium cells) and gemma
(`paper-v1-gemma-medium/`, 6 medium cells) to fill in coverage on
the discriminating `spec-tanya-c4-strike medium` cell. Collection
remains in flight; cells accumulate via `scripts/collect_eval_data.py
--resume`.

## Data integrity

- **All 112 cells captured in full untruncated JSONL** with per-turn
  PNG snapshots at `data/runs/paper-v1-*/`. No data loss. Plus's
  cells remain available for re-analysis once the Together adapter
  is fixed.
- Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
- Resume-safe: `scripts/collect_eval_data.py --resume` skips cells
  with a `terminal:` line; partial cells re-run cleanly.
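
The resume rule can be pictured as a completeness check on each cell's JSONL. The sketch below is an assumption about how such a check might look, not the code in `scripts/collect_eval_data.py`: `cell_is_complete` is a hypothetical name, and it assumes the terminal marker is a `terminal` key on one of the JSON records.

```python
import json
from pathlib import Path


def cell_is_complete(jsonl_path: Path) -> bool:
    """True if any record in the cell's JSONL carries a `terminal` field."""
    if not jsonl_path.exists():
        return False
    with jsonl_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A torn line from an interrupted run cannot mark completion.
                continue
            if isinstance(record, dict) and record.get("terminal") is not None:
                return True
    return False
```

Under this rule an interrupted cell (no terminal record, possibly a half-written last line) is reported as incomplete and would be re-run from scratch.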

## Phase 5 status: COMPLETE (F1 retracted, F3 promoted)

The collection continues accumulating in background. The
provider-bug finding is the most actionable next step: file with
Together, optionally implement a reasoning-channel fallback, and
rerun Plus through a different endpoint to get a real Plus signal.

## Next paper-prep steps

1. Cross-link F3 (perception-axis target visibility) into
   PAPER_PLAN.md §3 Findings as the headline result.
2. Add a "Provider failures we found" section to the paper covering
   the Together-Plus adapter bug as an empirical observation about
   the maturity of OSS-model tool-calling adapters — that itself is
   a finding of interest for the agent-benchmark community.
3. Rerun Plus through an alternative endpoint (OpenRouter or fixed
   Together) for the real Plus comparison once available.
4. Add Kimi-K2.6 as a fourth model; verify Kimi's tool-calls are
   not adapter-dropped before drawing conclusions.
5. Run perception-sweep cells (structured/vision/image ×
   fog/no-fog) on the same packs to strengthen F3 with controlled
   visibility variation.
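
As a planning aid for step 5, the sweep conditions can be enumerated as a grid. This is a sketch under assumptions: the dictionary keys are hypothetical, and the pack list is taken to be the three packs cited in F3, which may not match the final sweep.

```python
from itertools import product

# Observation modes and fog settings named in step 5.
OBS_MODES = ("structured", "vision", "image")
FOG = ("fog", "no-fog")
# Assumed pack set: the three packs discussed in F3.
PACKS = ("spec-tanya-c4-strike", "spec-engineer-capture", "spec-spy-infiltrate")

sweep = [
    {"pack": pack, "obs_mode": mode, "fog": fog}
    for pack, mode, fog in product(PACKS, OBS_MODES, FOG)
]
print(len(sweep))  # 18 conditions before multiplying by levels and seeds
```

Each condition would still need to be crossed with the level and seed dimensions used in the main collection.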