# Phase 5 — Model Failure Triage Findings
Source: 112 completed cells across 3 Together models (Qwen/Qwen3.5-9B,
Qwen/Qwen3.6-Plus, google/gemma-4-31B-it), 12 engine-feature scenario
packs × 2 levels (easy, medium) × 2 seeds × vision fog mode. Per-cell
JSONL captures the full untruncated turn-by-turn record (obs,
system_prompt, briefing, model_request, model_response, commands,
signals, terminal). Triage generated by `scripts/triage_phase4.py`.
> **⚠ CORRECTION (2026-05-23): the original F1 headline — "Plus is
> passive" — is RETRACTED. Root cause is a Together-API adapter bug
> dropping Plus's tool_calls from the wire response. Plus IS reasoning
> and emitting tool calls server-side; the bench parser receives an
> empty `tool_calls` list and falls back to the default Observe. See
> §F1-RETRACTED below for the full diagnosis. Phase-5 headline is now
> the F3 perception-axis result (target visibility predicts win rate),
> documented below.**
## Outcome matrix
### Qwen/Qwen3.5-9B (48 cells)
| pack | easy | medium |
|-------------------------------|--------|--------|
| combat-naval-shore-strike | 2W | 1W 1L |
| def-bridge-chokepoint | 1W 1L | 2W |
| econ-contested-expansion | 2L | 2L |
| econ-harvester-defense-raid | 2W | 2L |
| econ-mine-and-grow | 2L | 2L |
| econ-multi-patch-allocation | 2L | 2L |
| econ-second-base-race | 2W | 2L |
| spec-engineer-capture | 2W | 2L |
| spec-nuke-strike | 2L | 2L |
| spec-spy-infiltrate | 2W | 2L |
| spec-tanya-c4-strike          | 2W     | 2W (perfect 4/4) |
| spec-thief-steal-cash | 2W | 1L 1D |
**Totals: 20W, 27L, 1D over 48 cells; 62.5% win on easy (15/24), 20.8% win on medium (5/24)**
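
A tally over the Qwen3.5-9B matrix can be sketched as follows. This is a minimal illustration that re-derives totals and win rates from the table cells; `MATRIX`, `parse_cell`, `tally` and `win_rate` are hypothetical names for this sketch, not part of `scripts/triage_phase4.py`.

```python
import re
from collections import Counter

# Outcome matrix for Qwen3.5-9B, copied from the table: pack -> (easy, medium).
MATRIX = {
    "combat-naval-shore-strike": ("2W", "1W 1L"),
    "def-bridge-chokepoint": ("1W 1L", "2W"),
    "econ-contested-expansion": ("2L", "2L"),
    "econ-harvester-defense-raid": ("2W", "2L"),
    "econ-mine-and-grow": ("2L", "2L"),
    "econ-multi-patch-allocation": ("2L", "2L"),
    "econ-second-base-race": ("2W", "2L"),
    "spec-engineer-capture": ("2W", "2L"),
    "spec-nuke-strike": ("2L", "2L"),
    "spec-spy-infiltrate": ("2W", "2L"),
    "spec-tanya-c4-strike": ("2W", "2W"),
    "spec-thief-steal-cash": ("2W", "1L 1D"),
}


def parse_cell(cell: str) -> Counter:
    """Turn a cell such as '1W 1L' into Counter({'W': 1, 'L': 1})."""
    counts = Counter()
    for n, outcome in re.findall(r"(\d+)([WLD])", cell):
        counts[outcome] += int(n)
    return counts


def tally(level_index: int) -> Counter:
    """Sum outcomes over all packs for one level (0 = easy, 1 = medium)."""
    total = Counter()
    for cells in MATRIX.values():
        total += parse_cell(cells[level_index])
    return total


def win_rate(counts: Counter) -> float:
    """Wins divided by all games played (draws count as non-wins)."""
    games = sum(counts.values())
    return counts["W"] / games if games else 0.0


easy, medium = tally(0), tally(1)
for name, counts in (("easy", easy), ("medium", medium), ("overall", easy + medium)):
    print(f"{name}: {dict(counts)} win rate {win_rate(counts):.1%}")
```

Deriving the summary line from the per-cell entries keeps the totals and the table from drifting apart as more cells land.
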
### google/gemma-4-31B-it (9 cells, partial)
| pack | easy | medium |
|-------------------------------|--------|--------|
| spec-tanya-c4-strike | 2W | 1W 1L |
| spec-engineer-capture | 2W | - |
| (others in flight) | | |
**Partial: 5W / 4L / 0D = 55.6% win**
### Qwen/Qwen3.6-Plus (55 cells, **EXCLUDED from headline**)
All cells issued `Observe` only (default fallback) due to the adapter
bug described in §F1-RETRACTED. The 0/55 win rate is a **measurement
artefact**, not a model property. Cells remain on disk for future
re-analysis once the adapter is fixed.
## F1-RETRACTED — Together adapter drops Plus's tool_calls
**What we originally claimed (now retracted):** Plus exhibited a
model-specific "freeze and panic" passivity where it issued only
`Observe` across the entire decision budget on every cell, despite
9B and 31B winning the same packs.
**What's actually happening:**
Every Plus turn's raw Together response has this exact shape:
```json
{
  "choices": [{
    "message": {"role": "assistant", "reasoning": "I need to move Tanya east to scout..."},
    "finish_reason": "tool_calls"
  }],
  "usage": {
    "completion_tokens": 345,
    "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69}
  }
}
```
Three pieces of evidence prove Plus DID emit tool calls:
1. `finish_reason: "tool_calls"` — the API itself reports the
completion ended on tool-call emission.
2. `completion_tokens_details.text_tokens: 69` — Plus produced 69
non-reasoning tokens (the tool-call JSON), but they're absent
from `message.content` and `message.tool_calls`.
3. The `reasoning` channel consistently ends with concrete intent
("I'll move to (50, 20) to scout east") — Plus is reasoning
correctly and arriving at a specific action.
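
The first two signals, plus the missing payload, can be combined into an automated check. The sketch below assumes the response has already been decoded into a plain `dict` with the shape quoted earlier; `looks_like_dropped_tool_calls` is a hypothetical helper name and does not exist in the bench.

```python
def looks_like_dropped_tool_calls(response: dict) -> bool:
    """Return True when a completion claims tool calls but carries none."""
    choices = response.get("choices") or []
    if not choices:
        return False
    choice = choices[0]
    message = choice.get("message") or {}
    details = (response.get("usage") or {}).get("completion_tokens_details") or {}

    # Signal 1: the API says the completion ended on tool-call emission.
    ended_on_tool_calls = choice.get("finish_reason") == "tool_calls"
    # Signal 2: non-reasoning tokens were produced and billed.
    produced_text_tokens = (details.get("text_tokens") or 0) > 0
    # The anomaly: those tokens appear in neither tool_calls nor content.
    payload_missing = not message.get("tool_calls") and not message.get("content")
    return ended_on_tool_calls and produced_text_tokens and payload_missing


# The response shape quoted earlier, reduced to the fields the check reads.
sample = {
    "choices": [{
        "message": {"role": "assistant", "reasoning": "I need to move Tanya east to scout..."},
        "finish_reason": "tool_calls",
    }],
    "usage": {
        "completion_tokens": 345,
        "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69},
    },
}
print(looks_like_dropped_tool_calls(sample))
```

Running a check like this per turn would let the bench flag a cell as "provider anomaly" instead of silently scoring it as a model loss.
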
**Diagnosis:** Together's response adapter for Plus serialises the
reasoning channel but DROPS the actual tool-call structure from the
returned message. Bench's `_reply_from_data` parser
(`openra_bench/providers.py:413-423`) reads `msg.get("tool_calls") or []`
→ empty → bench issues default `Command::Observe`.
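
The fallback path described here can be illustrated with a simplified stand-in. This is not the code in `openra_bench/providers.py`; `commands_from_message` and `DEFAULT_COMMAND` are hypothetical names, the tool-call layout assumes the OpenAI-compatible format, and the `x`/`y` argument names are invented for the example.

```python
DEFAULT_COMMAND = {"verb": "Observe"}  # stand-in for the bench's default command


def commands_from_message(msg: dict) -> list[dict]:
    """Map an assistant message to commands; no tool calls means Observe."""
    tool_calls = msg.get("tool_calls") or []
    commands = [
        {
            "verb": call["function"]["name"],
            "arguments": call["function"].get("arguments", "{}"),
        }
        for call in tool_calls
    ]
    # An absent, null, or empty tool_calls field all end up here.
    return commands or [DEFAULT_COMMAND]


# A Plus-shaped message: reasoning present, tool_calls absent.
plus_msg = {"role": "assistant", "reasoning": "I'll move to (50, 20) to scout east"}
# A well-formed message from a model whose tool calls survive the adapter.
ok_msg = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "move_units", "arguments": '{"x": 50, "y": 20}'},
    }],
}
print(commands_from_message(plus_msg))
print(commands_from_message(ok_msg))
```

The parser cannot tell "the model chose nothing" apart from "the provider dropped the call", which is why the failure surfaced as apparent passivity.
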
**This is a Together backend bug, not a Plus model bug, and not a
bench parser bug.** Verified by:
- Direct httpx test outside bench: `tool_choice=auto` (streamed)
→ reasoning text only, `tool_calls=[]`, finish_reason=tool_calls.
- `tool_choice=required` (streamed) → no completion at all.
- Bench's existing Plus tool-call scrub (task #84) covered the
history-shape side (empty `tool_calls: []` rejection); it does NOT
recover the dropped server-side tool calls.
**Implications:**
- The "Plus is passive" headline is **invalid**. The bench cannot
measure Plus's RTS reasoning at all through the Together endpoint
until the adapter is fixed.
- Per-pack outcomes for Plus on this dataset reflect "what happens
when the agent issues Observe every turn for 25 turns" (always a
loss/draw for packs that require any action).
- Paper-side: **omit Plus from headline model comparisons.** Either
add a clearly-labelled "Together adapter excludes Plus" footnote,
or rerun Plus through a different endpoint (OpenRouter, direct
Anthropic-style, or Together once they fix the adapter).
**Next steps:**
1. (Done) Document the adapter bug here and in
`openra_bench/providers.py` (already notes Plus quirks).
2. File an upstream issue with Together support, including the
minimal reproduction (see snippet above, plus
`usage.completion_tokens_details.text_tokens > 0` while `message`
lacks both `content` and `tool_calls`).
3. Optional workaround: write a "reasoning-channel fallback parser"
that extracts intent like `move_units` / `attack_unit` / numeric
coordinates from the reasoning text. Fragile and would conflate
model output with NLP-extraction error; better to wait for the
adapter fix or use a different endpoint.
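
To make the fragility of the optional workaround in step 3 concrete, here is a rough sketch of what such a fallback parser might look like. Everything in it is an assumption for illustration: the keyword map, the "last keyword wins" heuristic, and the `guess_intent` name are hypothetical, and only the verb names `move_units` and `attack_unit` come from the text.

```python
import re
from typing import Optional

# Hypothetical keyword map; the verbs are the ones named in step 3.
VERB_KEYWORDS = {"move": "move_units", "attack": "attack_unit"}
COORD_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def guess_intent(reasoning: str) -> Optional[dict]:
    """Best-effort guess of verb and target cell from free-form reasoning."""
    coords = COORD_RE.findall(reasoning)
    if not coords:
        return None
    lowered = reasoning.lower()
    # Take the keyword that appears last, assuming the reasoning ends
    # with the committed action. This assumption is easy to violate.
    hits = [
        (lowered.rfind(word), verb)
        for word, verb in VERB_KEYWORDS.items()
        if word in lowered
    ]
    if not hits:
        return None
    _, verb = max(hits)
    x, y = coords[-1]
    return {"verb": verb, "target": (int(x), int(y))}


print(guess_intent("I'll move to (50, 20) to scout east"))
print(guess_intent("I should not attack yet; moving to (50, 20) is risky, so I will wait"))
```

The first call recovers a plausible `move_units` command. The second returns an `attack_unit` command at (50, 20) even though the text decides to wait, which is the kind of extraction error that would be indistinguishable from a model error in the results.
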
## F2 — Superweapon mis-aim (Reasoning/Action axis)
Qwen3.5-9B loses all `spec-nuke-strike` easy cells. The model
INVOKES the verb (Observe×12-20 + FireSuperweapon×5-8) but targets
the wrong cell. Plus's cells on this pack are unusable per §F1-RETRACTED.
**Classification:** Reasoning-axis spatial-commit failure. The verb
is available, the charge timer is met, but cluster-centre
identification under partial information fails.
## F3 — Target initial visibility predicts win rate (headline)
Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is
"target in initial sight":
- `spec-tanya-c4-strike` (target adjacent at spawn): 4W / 0L
- `spec-engineer-capture easy` (target 4 cells east): 2W / 0L
- `spec-spy-infiltrate easy` (proc adjacent): 2W / 0L
- `spec-engineer-capture medium` (target 12 cells off-latitude): 0W / 2L
- `spec-spy-infiltrate medium` (target fogged): 0W / 2L
The same model wins the easy versions of `spec-engineer-capture` and
`spec-spy-infiltrate` and loses the medium versions; the only
systematic difference between the tiers is target visibility. This
supports the bench's Perception axis: the model can ACT when the
target is given and FAILS when the target requires search.
With F1 retracted, this is now the **headline Phase-5 finding**.
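
The visibility split can be expressed as a small grouping computation. The sketch below hard-codes the five records from the F3 list, using this section's own visible/not-visible grouping; the Tanya record follows the outcome matrix (2W easy + 2W medium). `CELLS` and `win_rate_by_visibility` are hypothetical names.

```python
# (pack, level, target visible at spawn?, wins, losses)
CELLS = [
    ("spec-tanya-c4-strike", "easy+medium", True, 4, 0),
    ("spec-engineer-capture", "easy", True, 2, 0),
    ("spec-spy-infiltrate", "easy", True, 2, 0),
    ("spec-engineer-capture", "medium", False, 0, 2),
    ("spec-spy-infiltrate", "medium", False, 0, 2),
]


def win_rate_by_visibility(cells):
    """Aggregate wins and games per visibility group and return win rates."""
    totals = {True: [0, 0], False: [0, 0]}  # visible -> [wins, games]
    for _pack, _level, visible, wins, losses in cells:
        totals[visible][0] += wins
        totals[visible][1] += wins + losses
    return {
        visible: wins / games
        for visible, (wins, games) in totals.items()
        if games
    }


rates = win_rate_by_visibility(CELLS)
print(f"visible at spawn: {rates[True]:.0%}, requires search: {rates[False]:.0%}")
```

The groups cover only 12 games in total, so the split is a strong signal rather than a statistically established effect.
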
## Engine vs Scenario vs Model attribution
- **Engine bugs**: 0 attributable to the engine in the sample.
(3 pre-existing engine P0s — per-player cash race, proc
auto-spawn, production completion — were FIXED earlier this
session: commits a5014a5 + a84a3d7 + b77e43d. All Rust integration
tests + bench engine-feature tests now green.)
- **Scenario defects**: 0 attributable to scenarios in the sample.
(3 audit-flagged defects fixed: commits 4ebeee5 + 9000fe3.
Bench's defensive cash-strip commit b77e43d preempts the entire
regression class for 62 packs.)
- **Provider/adapter bugs**: 1 confirmed (Together drops Plus
tool_calls). Class: PROVIDER, not MODEL, not BENCH. See
§F1-RETRACTED.
- **Model failures (9B + gemma only)**: losses cluster on packs
where target requires search (F3). Plus excluded.
## Per-pack difficulty ranking (Qwen3.5-9B, easy tier)
Wins out of 2 seeds per pack:
- 2W: combat-naval-shore-strike, econ-harvester-defense-raid,
econ-second-base-race, spec-engineer-capture, spec-spy-infiltrate,
spec-tanya-c4-strike, spec-thief-steal-cash
- 1W: def-bridge-chokepoint
- 0W: econ-contested-expansion, econ-mine-and-grow,
econ-multi-patch-allocation, spec-nuke-strike
Economy packs (build-or-die throughput) dominate the 0W list — a
signal that the model struggles with multi-step build chains under
time pressure. spec-nuke-strike's 0W aligns with F2 (mis-aim).
## Cell-count asymmetry note
The three models have different completed-cell counts (9B=48,
Plus=55, gemma=9) because the collection ran models sequentially
through the main 240-cell plan, then added side runs for Plus
(`paper-v1-plus-medium/`, 8 medium cells) and gemma
(`paper-v1-gemma-medium/`, 6 medium cells) to fill in coverage on
the discriminating `spec-tanya-c4-strike medium` cell. Collection
remains in flight; cells accumulate via `scripts/collect_eval_data.py
--resume`.
## Data integrity
- **All 112 cells captured in full untruncated JSONL** with per-turn
PNG snapshots at `data/runs/paper-v1-*/`. No data loss. Plus's
cells remain available for re-analysis once the Together adapter
is fixed.
- Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
- Resume-safe: `scripts/collect_eval_data.py --resume` skips cells
with a `terminal:` line; partial cells re-run cleanly.
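
The resume rule can be sketched as a completeness check over a cell's JSONL file. This is an illustration under stated assumptions, not the logic in `scripts/collect_eval_data.py`: it assumes one JSON object per line and that a finished cell contains a record with a `terminal` key; `is_cell_complete` and `cells_to_run` are hypothetical names.

```python
import json
from pathlib import Path


def is_cell_complete(jsonl_path: Path) -> bool:
    """True if any record in the cell's JSONL file carries a 'terminal' key."""
    if not jsonl_path.exists():
        return False
    with jsonl_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A half-written line suggests the run was interrupted.
                return False
            if isinstance(record, dict) and "terminal" in record:
                return True
    return False


def cells_to_run(cell_paths):
    """Keep only the cells that still need a run or a re-run."""
    return [p for p in cell_paths if not is_cell_complete(p)]
```

Treating a truncated line as "incomplete" means an interrupted cell is re-run from scratch rather than trusted in part.
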
## Phase 5 status: COMPLETE (F1 retracted, F3 promoted)
The collection continues accumulating in background. The
provider-bug finding is the most actionable next step: file with
Together, optionally implement a reasoning-channel fallback, and
rerun Plus through a different endpoint to get a real Plus signal.
## Next paper-prep steps
1. Cross-link F3 (perception-axis target visibility) into
PAPER_PLAN.md §3 Findings as the headline result.
2. Add a "Provider failures we found" section to the paper covering
the Together-Plus adapter bug as an empirical observation about
the maturity of OSS-model tool-calling adapters — that itself is
a finding of interest for the agent-benchmark community.
3. Rerun Plus through an alternative endpoint (OpenRouter or fixed
Together) for the real Plus comparison once available.
4. Add Kimi-K2.6 as a fourth model; verify Kimi's tool-calls are
not adapter-dropped before drawing conclusions.
5. Run perception-sweep cells (structured/vision/image ×
fog/no-fog) on the same packs to strengthen F3 with controlled
visibility variation.