| # Phase 5 — Model Failure Triage Findings | |
| Source: 112 completed cells across 3 Together models (Qwen/Qwen3.5-9B, | |
| Qwen/Qwen3.6-Plus, google/gemma-4-31B-it), 12 engine-feature scenario | |
| packs × 2 levels (easy, medium) × 2 seeds × vision fog mode. Per-cell | |
| JSONL captures the full untruncated turn-by-turn record (obs, | |
| system_prompt, briefing, model_request, model_response, commands, | |
| signals, terminal). Triage generated by `scripts/triage_phase4.py`. | |
| > **⚠ CORRECTION (2026-05-23): the original F1 headline — "Plus is | |
| > passive" — is RETRACTED. Root cause is a Together-API adapter bug | |
| > dropping Plus's tool_calls from the wire response. Plus IS reasoning | |
| > and emitting tool calls server-side; the bench parser receives an | |
| > empty `tool_calls` list and falls back to the default Observe. See | |
| > §F1-RETRACTED below for the full diagnosis. Phase-5 headline is now | |
| > the F3 perception-axis result (target visibility predicts win rate), | |
| > documented below.** | |
| ## Outcome matrix | |
| ### Qwen/Qwen3.5-9B (48 cells) | |
| | pack | easy | medium | | |
| |-------------------------------|--------|--------| | |
| | combat-naval-shore-strike | 2W | 1W 1L | | |
| | def-bridge-chokepoint | 1W 1L | 2W | | |
| | econ-contested-expansion | 2L | 2L | | |
| | econ-harvester-defense-raid | 2W | 2L | | |
| | econ-mine-and-grow | 2L | 2L | | |
| | econ-multi-patch-allocation | 2L | 2L | | |
| | econ-second-base-race | 2W | 2L | | |
| | spec-engineer-capture | 2W | 2L | | |
| | spec-nuke-strike | 2L | 2L | | |
| | spec-spy-infiltrate | 2W | 2L | | |
| | spec-tanya-c4-strike | 2W | 2W ← perfect 4/4 | | |
| | spec-thief-steal-cash | 2W | 1L 1D | | |
| **Totals (tallied from the table above): 20W, 27L, 1D across 48 cells. Easy: 15W / 24 = 62.5% win; medium: 5W / 24 = 20.8% win.** | |
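The per-level totals can be re-derived from the matrix with a few lines of Python. This is a minimal sketch, not part of `scripts/triage_phase4.py`: the `MATRIX` dict is a hand transcription of the table above, and `tally` / `win_rate` are hypothetical helper names.

```python
# Outcome matrix for Qwen/Qwen3.5-9B, transcribed by hand from the table above.
# Each entry is (wins, losses, draws) over the 2 seeds of that level.
MATRIX = {
    "combat-naval-shore-strike":   {"easy": (2, 0, 0), "medium": (1, 1, 0)},
    "def-bridge-chokepoint":       {"easy": (1, 1, 0), "medium": (2, 0, 0)},
    "econ-contested-expansion":    {"easy": (0, 2, 0), "medium": (0, 2, 0)},
    "econ-harvester-defense-raid": {"easy": (2, 0, 0), "medium": (0, 2, 0)},
    "econ-mine-and-grow":          {"easy": (0, 2, 0), "medium": (0, 2, 0)},
    "econ-multi-patch-allocation": {"easy": (0, 2, 0), "medium": (0, 2, 0)},
    "econ-second-base-race":       {"easy": (2, 0, 0), "medium": (0, 2, 0)},
    "spec-engineer-capture":       {"easy": (2, 0, 0), "medium": (0, 2, 0)},
    "spec-nuke-strike":            {"easy": (0, 2, 0), "medium": (0, 2, 0)},
    "spec-spy-infiltrate":         {"easy": (2, 0, 0), "medium": (0, 2, 0)},
    "spec-tanya-c4-strike":        {"easy": (2, 0, 0), "medium": (2, 0, 0)},
    "spec-thief-steal-cash":       {"easy": (2, 0, 0), "medium": (0, 1, 1)},
}


def tally(level):
    """Sum (wins, losses, draws) over all packs for one difficulty level."""
    wins = sum(cells[level][0] for cells in MATRIX.values())
    losses = sum(cells[level][1] for cells in MATRIX.values())
    draws = sum(cells[level][2] for cells in MATRIX.values())
    return wins, losses, draws


def win_rate(level):
    """Wins divided by all completed cells (draws count as non-wins)."""
    wins, losses, draws = tally(level)
    return wins / (wins + losses + draws)


if __name__ == "__main__":
    for level in ("easy", "medium"):
        print(level, tally(level), f"{win_rate(level):.1%}")
```

Counting draws as non-wins is a choice made for this sketch; a different convention would change the medium-tier rate.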
| ### google/gemma-4-31B-it (9 cells, partial) | |
| | pack | easy | medium | | |
| |-------------------------------|--------|--------| | |
| | spec-tanya-c4-strike | 2W | 1W 1L | | |
| | spec-engineer-capture | 2W | - | | |
| | (others in flight) | | | | |
| **Partial: 5W / 4L / 0D = 55.6% win** | |
| ### Qwen/Qwen3.6-Plus (55 cells, **EXCLUDED from headline**) | |
| All cells issued `Observe` only (default fallback) due to the adapter | |
| bug described in §F1-RETRACTED. The 0/55 win rate is a **measurement | |
| artefact**, not a model property. Cells remain on disk for future | |
| re-analysis once the adapter is fixed. | |
| ## F1-RETRACTED — Together adapter drops Plus's tool_calls | |
| **What we originally claimed (now retracted):** Plus exhibited a | |
| model-specific "freeze and panic" passivity where it issued only | |
| `Observe` across the entire decision budget on every cell, despite | |
| 9B and 31B winning the same packs. | |
| **What's actually happening:** | |
| Every Plus turn's raw Together response has this exact shape: | |
| ```json | |
| { | |
| "choices": [{ | |
| "message": {"role": "assistant", "reasoning": "I need to move Tanya east to scout..."}, | |
| "finish_reason": "tool_calls" | |
| }], | |
| "usage": { | |
| "completion_tokens": 345, | |
| "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69} | |
| } | |
| } | |
| ``` | |
| Three pieces of evidence indicate that Plus DID emit tool calls: | |
| 1. `finish_reason: "tool_calls"`: the API itself reports that the | |
| completion ended on tool-call emission. | |
| 2. `completion_tokens_details.text_tokens: 69`: Plus produced 69 | |
| non-reasoning tokens (presumably the tool-call JSON), but they are | |
| absent from `message.content` and `message.tool_calls`. | |
| 3. The `reasoning` channel consistently ends with concrete intent | |
| ("I'll move to (50, 20) to scout east") — Plus is reasoning | |
| correctly and arriving at a specific action. | |
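The first two signals are mechanical enough to check in code. The sketch below is an illustration only, not the bench's parser: `looks_like_dropped_tool_calls` is a hypothetical helper, and it assumes the response has already been decoded into a plain `dict` with the shape shown in the JSON snippet above.

```python
def looks_like_dropped_tool_calls(response: dict) -> bool:
    """Return True when a response claims tool calls but carries none.

    Flags the pattern described above: finish_reason is "tool_calls",
    non-reasoning tokens were billed, yet the message has neither
    `tool_calls` nor `content`.
    """
    choices = response.get("choices") or []
    if not choices:
        return False
    choice = choices[0]
    message = choice.get("message") or {}
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}

    ended_on_tool_calls = choice.get("finish_reason") == "tool_calls"
    has_payload = bool(message.get("tool_calls")) or bool(message.get("content"))
    text_tokens = details.get("text_tokens") or 0
    return ended_on_tool_calls and text_tokens > 0 and not has_payload


# Response shape copied from the snippet above.
SAMPLE = {
    "choices": [{
        "message": {
            "role": "assistant",
            "reasoning": "I need to move Tanya east to scout...",
        },
        "finish_reason": "tool_calls",
    }],
    "usage": {
        "completion_tokens": 345,
        "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69},
    },
}

if __name__ == "__main__":
    print(looks_like_dropped_tool_calls(SAMPLE))
```

A check like this could tag affected turns in the per-cell JSONL so they are excluded from win-rate tallies instead of being counted as deliberate `Observe` actions.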
| **Diagnosis:** Together's response adapter for Plus serialises the | |
| reasoning channel but DROPS the actual tool-call structure from the | |
| returned message. Bench's `_reply_from_data` parser | |
| (`openra_bench/providers.py:413-423`) reads `msg.get("tool_calls") or []` | |
| → empty → bench issues default `Command::Observe`. | |
| **This is a Together backend bug, not a Plus model bug, and not a | |
| bench parser bug.** Verified by: | |
| - Direct httpx test outside bench: `tool_choice=auto` (streamed) | |
| → reasoning text only, `tool_calls=[]`, finish_reason=tool_calls. | |
| - `tool_choice=required` (streamed) → no completion at all. | |
| - Bench's existing Plus tool-call scrub (task #84) covered the | |
| history-shape side (empty `tool_calls: []` rejection); it does NOT | |
| recover the dropped server-side tool calls. | |
| **Implications:** | |
| - The "Plus is passive" headline is **invalid**. The bench cannot | |
| measure Plus's RTS reasoning at all through the Together endpoint | |
| until the adapter is fixed. | |
| - Per-pack outcomes for Plus on this dataset reflect "what happens | |
| when the agent issues Observe every turn for 25 turns" (always a | |
| loss/draw for packs that require any action). | |
| - Paper-side: **omit Plus from headline model comparisons.** Either | |
| add a clearly-labelled "Together adapter excludes Plus" footnote, | |
| or rerun Plus through a different endpoint (OpenRouter, direct | |
| Anthropic-style, or Together once they fix the adapter). | |
| **Next steps:** | |
| 1. (Done) Document the adapter bug here and in | |
| `openra_bench/providers.py` (already notes Plus quirks). | |
| 2. File upstream issue with Together support, including the | |
| minimal reproduction (see snippet above, plus | |
| `usage.completion_tokens_details.text_tokens > 0` while `message` | |
| lacks both `content` and `tool_calls`). | |
| 3. Optional workaround: write a "reasoning-channel fallback parser" | |
| that extracts intent like `move_units` / `attack_unit` / numeric | |
| coordinates from the reasoning text. Fragile and would conflate | |
| model output with NLP-extraction error; better to wait for the | |
| adapter fix or use a different endpoint. | |
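To make the fragility of the optional workaround in step 3 concrete, here is a deliberately naive sketch of a coordinate extractor. It is not an existing bench component: `extract_last_coordinate` and the `(x, y)` regex are assumptions chosen for illustration, based only on the two reasoning excerpts quoted in this document.

```python
import re
from typing import Optional, Tuple

# Matches "(50, 20)"-style integer pairs. Assumed format, taken from the
# reasoning excerpt quoted in this document; real traces may differ.
_COORD = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def extract_last_coordinate(reasoning: str) -> Optional[Tuple[int, int]]:
    """Return the last (x, y) pair mentioned in the reasoning text, if any.

    The last pair is used on the guess that the model states its final
    intent at the end of the reasoning channel.
    """
    matches = _COORD.findall(reasoning)
    if not matches:
        return None
    x, y = matches[-1]
    return int(x), int(y)


if __name__ == "__main__":
    print(extract_last_coordinate("I'll move to (50, 20) to scout east"))
    print(extract_last_coordinate("I need to move Tanya east to scout..."))
```

The second call returns `None` even though the intent ("move Tanya east") is clear to a human reader. Any win or loss measured through such a parser would mix model behaviour with extraction error, which is the concern raised in step 3.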
| ## F2 — Superweapon mis-aim (Reasoning/Action axis) | |
| Qwen3.5-9B loses all `spec-nuke-strike` easy cells. The model | |
| INVOKES the verb (Observe×12-20 + FireSuperweapon×5-8) but targets | |
| the wrong cell. Plus's cells on this pack are unusable per §F1-RETRACTED. | |
| **Classification:** Reasoning-axis spatial-commit failure. The verb | |
| is available, the charge timer is met, but cluster-centre | |
| identification under partial information fails. | |
| ## F3 — Target initial visibility predicts win rate (headline) | |
| Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is | |
| "target in initial sight": | |
| - `spec-tanya-c4-strike` (target adjacent at spawn): 4W / 0L | |
| - `spec-engineer-capture easy` (target 4 cells east): 2W / 0L | |
| - `spec-spy-infiltrate easy` (proc adjacent): 2W / 0L | |
| - `spec-engineer-capture medium` (target 12 cells off-latitude): 0W / 2L | |
| - `spec-spy-infiltrate medium` (target fogged): 0W / 2L | |
| The same model wins the easy versions of `spec-engineer-capture` and | |
| `spec-spy-infiltrate` and loses their medium versions; the only | |
| systematic difference is target visibility. This supports the | |
| bench's Perception axis: the model can ACT when the target is | |
| given and FAILS when the target requires search. | |
| This is now the **headline Phase-5 finding**, since F1 is retracted. | |
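The grouping behind this claim can be sketched as a small aggregation over the cells listed above. This is an illustrative calculation, not output from `scripts/triage_phase4.py`: the `CELLS` list is transcribed by hand, the Tanya counts follow the outcome matrix (2W easy + 2W medium), and `win_rate_by_visibility` is a hypothetical helper.

```python
# (pack, level, target_visible_at_spawn, wins, losses), transcribed from the
# F3 list; Tanya counts follow the outcome matrix (2W easy + 2W medium).
CELLS = [
    ("spec-tanya-c4-strike",  "easy+medium", True,  4, 0),
    ("spec-engineer-capture", "easy",        True,  2, 0),
    ("spec-spy-infiltrate",   "easy",        True,  2, 0),
    ("spec-engineer-capture", "medium",      False, 0, 2),
    ("spec-spy-infiltrate",   "medium",      False, 0, 2),
]


def win_rate_by_visibility(cells):
    """Group cells by initial target visibility and return win rate per group."""
    totals = {True: [0, 0], False: [0, 0]}  # visible -> [wins, games]
    for _pack, _level, visible, wins, losses in cells:
        totals[visible][0] += wins
        totals[visible][1] += wins + losses
    return {visible: wins / games for visible, (wins, games) in totals.items()}


if __name__ == "__main__":
    for visible, rate in win_rate_by_visibility(CELLS).items():
        label = "visible at spawn" if visible else "requires search"
        print(f"{label}: {rate:.0%}")
```

With only 12 cells in these two groups, the split is suggestive, not conclusive; the perception-sweep cells planned for later phases would give a controlled comparison.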
| ## Engine vs Scenario vs Model attribution | |
| - **Engine bugs**: 0 attributable to the engine in the sample. | |
| (3 pre-existing engine P0s — per-player cash race, proc | |
| auto-spawn, production completion — were FIXED earlier this | |
| session: commits a5014a5 + a84a3d7 + b77e43d. All Rust integration | |
| tests + bench engine-feature tests now green.) | |
| - **Scenario defects**: 0 attributable to scenarios in the sample. | |
| (3 audit-flagged defects fixed: commits 4ebeee5 + 9000fe3. | |
| Bench's defensive cash-strip commit b77e43d preempts the entire | |
| regression class for 62 packs.) | |
| - **Provider/adapter bugs**: 1 confirmed (Together drops Plus | |
| tool_calls). Class: PROVIDER, not MODEL, not BENCH. See | |
| §F1-RETRACTED. | |
| - **Model failures (9B + gemma only)**: losses cluster on packs | |
| where target requires search (F3). Plus excluded. | |
| ## Per-pack difficulty ranking (Qwen3.5-9B, easy tier) | |
| Wins out of 2 seeds per pack: | |
| - 2W: combat-naval-shore-strike, econ-harvester-defense-raid, | |
| econ-second-base-race, spec-engineer-capture, spec-spy-infiltrate, | |
| spec-tanya-c4-strike, spec-thief-steal-cash | |
| - 1W: def-bridge-chokepoint | |
| - 0W: econ-contested-expansion, econ-mine-and-grow, | |
| econ-multi-patch-allocation, spec-nuke-strike | |
| Economy packs (build-or-die throughput) account for three of the | |
| four 0W entries, a signal that the model struggles with multi-step | |
| build chains under time pressure. The 0W on spec-nuke-strike | |
| aligns with F2 (mis-aim). | |
| ## Cell-count asymmetry note | |
| The three models have different completed-cell counts (9B=48, | |
| Plus=55, gemma=9) because the collection ran models sequentially | |
| through the main 240-cell plan, then added side runs for Plus | |
| (`paper-v1-plus-medium/`, 8 medium cells) and gemma | |
| (`paper-v1-gemma-medium/`, 6 medium cells) to fill in coverage on | |
| the discriminating `spec-tanya-c4-strike medium` cell. Collection | |
| remains in flight; cells accumulate via `scripts/collect_eval_data.py | |
| --resume`. | |
| ## Data integrity | |
| - **All 112 cells captured in full untruncated JSONL** with per-turn | |
| PNG snapshots at `data/runs/paper-v1-*/`. No data loss. Plus's | |
| cells remain available for re-analysis once the Together adapter | |
| is fixed. | |
| - Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests). | |
| - Resume-safe: `scripts/collect_eval_data.py --resume` skips cells | |
| with a `terminal:` line; partial cells re-run cleanly. | |
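The resume rule can be pictured as a per-cell completeness check. The sketch below is not the code in `scripts/collect_eval_data.py`: `is_cell_complete` is a hypothetical helper, and it assumes that the terminal record is a JSON object with a top-level `"terminal"` key, which this document does not confirm.

```python
import json
from pathlib import Path


def is_cell_complete(jsonl_path) -> bool:
    """Return True if the cell's JSONL file contains a terminal record.

    Assumption: a terminal record is a JSON object with a top-level
    "terminal" key. Lines that fail to parse (for example, a line cut
    off by an interrupted run) are skipped.
    """
    path = Path(jsonl_path)
    if not path.exists():
        return False
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # truncated or corrupt line: treat as absent
            if isinstance(record, dict) and "terminal" in record:
                return True
    return False
```

Under this rule, a cell whose run was interrupted mid-turn has no terminal record, so it is reported as incomplete and is re-run on the next `--resume` pass.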
| ## Phase 5 status: COMPLETE (F1 retracted, F3 promoted) | |
| The collection continues accumulating in the background. Acting on | |
| the provider-bug finding is the most actionable next step: file | |
| with Together, optionally implement a reasoning-channel fallback, | |
| and rerun Plus through a different endpoint to get a real Plus | |
| signal. | |
| ## Next paper-prep steps | |
| 1. Cross-link F3 (perception-axis target visibility) into | |
| PAPER_PLAN.md §3 Findings as the headline result. | |
| 2. Add a "Provider failures we found" section to the paper covering | |
| the Together-Plus adapter bug as an empirical observation about | |
| the maturity of OSS-model tool-calling adapters — that itself is | |
| a finding of interest for the agent-benchmark community. | |
| 3. Rerun Plus through an alternative endpoint (OpenRouter or fixed | |
| Together) for the real Plus comparison once available. | |
| 4. Add Kimi-K2.6 as a fourth model; verify Kimi's tool-calls are | |
| not adapter-dropped before drawing conclusions. | |
| 5. Run perception-sweep cells (structured/vision/image × | |
| fog/no-fog) on the same packs to strengthen F3 with controlled | |
| visibility variation. | |