Spaces: qpluslab/OpenRA-Bench

yxc20098 committed on May 23

Commit 385aa0a (1 parent: 01806f8)

Phase 5: retract F1 (Plus passivity) — Together adapter bug

Together AI's Qwen3.6-Plus endpoint returns
finish_reason='tool_calls' and usage.text_tokens > 0 but emits
empty tool_calls and empty content in the message body — only
the reasoning channel is populated. Plus IS generating tool
calls server-side; Together's adapter drops them on the wire.

Verified by direct httpx tests (3 reasoning-disable variants
tried, none recover the calls) and confirmed against Together's
own model page, which does NOT list function calling for
Qwen3.6-Plus.

Implications for Phase 5:
- F1 'Plus is passive across all 55 cells' is retracted —
Plus's 0/55 win rate is a measurement artefact (bench falls
back to default Observe when tool_calls is empty), not a
model property.
- F3 (target initial visibility predicts win rate, validated
on 9B's 48 cells) is promoted to the new Phase 5 headline.
- Plus cells remain on disk (data/runs/paper-v1-plus-medium/
+ paper-v1-engine-feature-packs/) for future re-analysis if
Together fixes the adapter, or rerun via OpenRouter.

Working serverless Together roster for the paper now confirmed
as 3 models: Qwen3.5-9B, gemma-4-31B-it, Kimi-K2.6 — plus
Qwen3.5-397B-A17B (smoke-verified) as a fourth.
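
The response signature described in this commit message (a `tool_calls` finish reason, non-zero text tokens, and an empty message body) can be checked mechanically before a turn is scored. The following is a minimal sketch, not the bench's actual parser: the function name `tool_calls_dropped` is hypothetical, and the field layout is assumed to match the raw response quoted in PHASE5_FINDINGS.md.

```python
from typing import Any


def tool_calls_dropped(response: dict[str, Any]) -> bool:
    """Return True when a response claims tool calls but carries none.

    Hypothetical helper: flags the signature described in the commit
    message so such turns are not silently scored as a default Observe.
    """
    choices = response.get("choices") or []
    if not choices:
        return False
    choice = choices[0]
    message = choice.get("message") or {}
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}

    claims_tool_calls = choice.get("finish_reason") == "tool_calls"
    body_is_empty = not message.get("tool_calls") and not message.get("content")
    produced_text = (details.get("text_tokens") or 0) > 0
    return claims_tool_calls and body_is_empty and produced_text


# Sample shaped like the raw response quoted in PHASE5_FINDINGS.md.
sample = {
    "choices": [{
        "message": {"role": "assistant", "reasoning": "I need to move Tanya east to scout..."},
        "finish_reason": "tool_calls",
    }],
    "usage": {
        "completion_tokens": 345,
        "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69},
    },
}

print(tool_calls_dropped(sample))
```

A check like this would let a collection run mark affected cells as provider failures instead of recording them as losses, which is the distinction this commit draws after the fact.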

Files changed (1)

PHASE5_FINDINGS.md +152 -112

@@ -1,12 +1,20 @@
 # Phase 5 — Model Failure Triage Findings
-Source: 112 completed cells across 2 Together models (Qwen/Qwen3.5-9B
-and Qwen/Qwen3.6-Plus, google/gemma-4-31B-it), 12 engine-feature scenario packs × 2 levels
-(easy, medium) × 2 seeds × vision fog mode. Per-cell JSONL captures
-the full untruncated turn-by-turn record (obs, system_prompt,
-briefing, model_request, model_response, commands, signals,
-terminal). Gemma-4-31B-it complete (6/6). Triage
-generated by `scripts/triage_phase4.py`.
 ## Outcome matrix
@@ -27,85 +35,106 @@ generated by `scripts/triage_phase4.py`.
 | spec-thief-steal-cash         | 2W     | 1L 1D  |
 **Totals: 22W, 23L, 1D = 48% win on easy, 17% win on medium**
-### Qwen/Qwen3.6-Plus (55 cells)
 | pack                          | easy   | medium |
 |-------------------------------|--------|--------|
-| spec-engineer-capture         | 2L     | 4L     |
-| spec-nuke-strike              | 2L     | -      |
-| spec-spy-infiltrate           | 2L     | 4L     |
-| spec-tanya-c4-strike          | 2L     | 4L     |
-| spec-thief-steal-cash         | 2D     | 4L     |
-**Totals: 0W, 53L, 2D — 0% win across ALL 55 cells.**
-## Headline finding — F1: Qwen3.6-Plus's MODEL-SPECIFIC passivity
-**Initial framing:** scale-INVERSE within Qwen family. Updated
-framing after gemma-4-31B-it cells landed: **F1 is Plus-SPECIFIC,
-not a general scale phenomenon.**
-Cross-model evidence on the discriminating pack `spec-tanya-c4-strike`
-medium (proc in initial sight, requires walk + C4):
-| model              | win rate     | dominant loss-cell commands |
-|--------------------|--------------|------------------------------|
-| Qwen/Qwen3.5-9B    | **2W / 2L** (50%) | (wins both cells) |
-| google/gemma-4-31B-it | **2W / 2L** (50%) | (wins both cells)         |
-| Qwen/Qwen3.6-Plus  | **0W / 4L** (0%)  | `Obs×23` (pure passivity) |
-Same pack, three Together models: 9B and 31B WIN, Plus REFUSES. The
-failure isn't tied to model SIZE — it's tied to a Plus-family
-**over-conservative calibration** that doesn't show up in gemma-31B
-(comparable scale) or 9B.
-**Refined headline:** Qwen3.6-Plus on this benchmark exhibits a
-model-specific refusal mode where it issues only `Observe` for the
-entire decision budget, even on trivially-winnable cells where
-other models of comparable scale win cleanly.
-**Final 83-cell numbers** (after orphan-dir cleanup; tasks/triage_phase4.py):
-| model              | cells | W   | L   | D   | win rate | dominant loss verb |
-|--------------------|-------|-----|-----|-----|----------|--------------------|
-| Qwen/Qwen3.5-9B    | 48    | 20  | 27  | 1   | **41.7%**| MoveUnits (33×)    |
-| google/gemma-4-31B-it | **9**  | **5**  | **4**  | 0   | **55.6%**| MoveUnits (33-34×) |
-| Qwen/Qwen3.6-Plus  | **55** | **0** | **53** | 2   | **0.0%** | **Observe (17-45×)** |
-Two models at very different scales (9B and 31B) win 40-42% of
-cells. Plus (a much larger model) wins **zero** cells across 55
-runs, with `Observe×N` as its only action across the entire
-budget on every cell.
-The most striking comparison is `spec-tanya-c4-strike` — which
-Qwen3.5-9B wins 4/4 (proc in initial sight, walk + C4 = trivial).
-Qwen3.6-Plus loses 6/6 cells on the same pack with `Obs×23-34`
-across the entire budget. The pack is fully winnable for 9B; Plus
-issues only `Observe`.
-**Classification:** Reasoning axis (commitment failure under
-perceived difficulty); model-side; scale-INVERSE. Plus appears
-calibrated to refuse action when uncertain, while 9B at least
-attempts navigation. This is the "freeze and panic" failure class
-predicted in PAPER_PLAN.md, now empirically demonstrated with a
-clean ≥33pp delta (Plus 0% vs the other two models 33-42%).
-### Implication for the paper
-The bench surfaces a failure mode that **appears in Plus alone** within the Qwen3.x family. RTS-style benchmarks
-are sensitive to a commitment-vs-conservatism axis that doesn't
-show up in static QA benchmarks. The user (and the field) should
-not assume bigger = better on real-time decision tasks.
 ## F2 — Superweapon mis-aim (Reasoning/Action axis)
-Both Qwen3.5-9B and Qwen3.6-Plus lose all `spec-nuke-strike` easy
-cells. Qwen3.5-9B INVOKES the verb (Observe×12-20 + FireSuperweapon
-×5-8) but targets the wrong cell. Qwen3.6-Plus (just observed)
-falls under F1.
-**Classification:** Reasoning-axis spatial-commit failure. The
-verb is available, the charge timer is met, but cluster-centre
 identification under partial information fails.
-## F3 — Target initial visibility predicts win rate
 Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is
 "target in initial sight":
@@ -120,7 +149,9 @@ medium versions — the only systematic difference is target
 visibility. This validates the bench's Perception axis: model can
 ACT when target is given; model FAILS when target requires search.
-## Engine vs Scenario vs Model attribution (112 cells)
 - **Engine bugs**: 0 attributable to the engine in the sample.
   (3 pre-existing engine P0s — per-player cash race, proc
@@ -128,19 +159,14 @@ ACT when target is given; model FAILS when target requires search.
   session: commits a5014a5 + a84a3d7 + b77e43d. All Rust integration
   tests + bench engine-feature tests now green.)
 - **Scenario defects**: 0 attributable to scenarios in the sample.
-  (3 audit-flagged defects — spec-spy-infiltrate DRAW,
-  combat-naval-shore-strike auto-WIN, mid-economy-under-fire
-  auto-WIN — were FIXED: commits 4ebeee5 + 9000fe3. Bench's
-  defensive cash-strip commit b77e43d preempts the entire
   regression class for 62 packs.)
-- **Model failures**: all 84 losses + 3 draws are model-side. The
-  triage script's classifier puts 62 in F1 (passive/walk-only,
-  no special verb), 4 in F2 (superweapon mis-aim), and 18 in
-  "verb invoked but still lost" (the third class — model used
-  the right verbs but couldn't sequence them effectively;
-  examples include econ-multi-patch-allocation losing with
-  `PlaceBuilding×44 Harvest×40 Build×40`, model spamming
-  actions without strategic effect).
 ## Per-pack difficulty ranking (Qwen3.5-9B, easy tier)
@@ -156,32 +182,46 @@ Economy packs (build-or-die throughput) dominate the 0W list — a
 signal that the model struggles with multi-step build chains under
 time pressure. spec-nuke-strike's 0W aligns with F2 (mis-aim).
 ## Data integrity
 - **All 112 cells captured in full untruncated JSONL** with per-turn
-  PNG snapshots at `data/runs/paper-v1-*/`. No data loss.
 - Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
 - Resume-safe: `scripts/collect_eval_data.py --resume` skips cells
   with a `terminal:` line; partial cells re-run cleanly.
-## Phase 5 status: COMPLETE
-The headline F1 finding is now established with ≥30pp scaling
-delta evidence. F2 + F3 documented. The collection continues
-accumulating in background (the script's resume logic makes
-this safe to extend without re-doing completed cells), but
-NOTHING in the additional data is likely to overturn F1's
-scale-INVERSE direction. Paper-ready.
-## Next paper-prep steps (out of scope for this Phase 5)
-1. Cross-link F1 into PAPER_PLAN.md §3 Findings as the headline
-   result.
-2. Add gemma-4-31B-it cells (in flight) as a third model column;
-   if gemma also matches Plus's passivity, the scale-INVERSE
-   becomes a general finding (not Qwen-specific).
-3. Add Kimi-K2.6 as a fourth model; Kimi's reasoning chain may
-   distinguish it from Qwen's passivity pattern.
-4. Run perception-sweep cells (structured/vision/image ×
-   fog/no-fog) on the same packs to test F3 with controlled
    visibility variation.

 # Phase 5 — Model Failure Triage Findings
+Source: 112 completed cells across 3 Together models (Qwen/Qwen3.5-9B,
+Qwen/Qwen3.6-Plus, google/gemma-4-31B-it), 12 engine-feature scenario
+packs × 2 levels (easy, medium) × 2 seeds × vision fog mode. Per-cell
+JSONL captures the full untruncated turn-by-turn record (obs,
+system_prompt, briefing, model_request, model_response, commands,
+signals, terminal). Triage generated by `scripts/triage_phase4.py`.
+> **⚠ CORRECTION (2026-05-23): the original F1 headline — "Plus is
+> passive" — is RETRACTED. Root cause is a Together-API adapter bug
+> dropping Plus's tool_calls from the wire response. Plus IS reasoning
+> and emitting tool calls server-side; the bench parser receives an
+> empty `tool_calls` list and falls back to the default Observe. See
+> §F1-RETRACTED below for the full diagnosis. Phase-5 headline is now
+> the F3 perception-axis result (target visibility predicts win rate),
+> documented below.**
 ## Outcome matrix
 | spec-thief-steal-cash         | 2W     | 1L 1D  |
 **Totals: 22W, 23L, 1D = 48% win on easy, 17% win on medium**
+### google/gemma-4-31B-it (9 cells, partial)
 | pack                          | easy   | medium |
 |-------------------------------|--------|--------|
+| spec-tanya-c4-strike          | 2W     | 1W 1L  |
+| spec-engineer-capture         | 2W     | -      |
+| (others in flight)            |        |        |
+**Partial: 5W / 4L / 0D = 55.6% win**
+### Qwen/Qwen3.6-Plus (55 cells, **EXCLUDED from headline**)
+All cells issued `Observe` only (default fallback) due to the adapter
+bug described in §F1-RETRACTED. The 0/55 win rate is a **measurement
+artefact**, not a model property. Cells remain on disk for future
+re-analysis once the adapter is fixed.
+## F1-RETRACTED — Together adapter drops Plus's tool_calls
+**What we originally claimed (now retracted):** Plus exhibited a
+model-specific "freeze and panic" passivity where it issued only
+`Observe` across the entire decision budget on every cell, despite
+9B and 31B winning the same packs.
+**What's actually happening:**
+Every Plus turn's raw Together response has this exact shape:
+```json
+{
+  "choices": [{
+    "message": {"role": "assistant", "reasoning": "I need to move Tanya east to scout..."},
+    "finish_reason": "tool_calls"
+  }],
+  "usage": {
+    "completion_tokens": 345,
+    "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69}
+  }
+}
+```
+Three pieces of evidence prove Plus DID emit tool calls:
+1. `finish_reason: "tool_calls"` — the API itself reports the
+   completion ended on tool-call emission.
+2. `completion_tokens_details.text_tokens: 69` — Plus produced 69
+   non-reasoning tokens (the tool-call JSON), but they're absent
+   from `message.content` and `message.tool_calls`.
+3. The `reasoning` channel consistently ends with concrete intent
+   ("I'll move to (50, 20) to scout east") — Plus is reasoning
+   correctly and arriving at a specific action.
+**Diagnosis:** Together's response adapter for Plus serialises the
+reasoning channel but DROPS the actual tool-call structure from the
+returned message. Bench's `_reply_from_data` parser
+(`openra_bench/providers.py:413-423`) reads `msg.get("tool_calls") or []`
+→ empty → bench issues default `Command::Observe`.
+**This is a Together backend bug, not a Plus model bug, and not a
+bench parser bug.** Verified by:
+- Direct httpx test outside bench: `tool_choice=auto` (streamed)
+  → reasoning text only, `tool_calls=[]`, finish_reason=tool_calls.
+- `tool_choice=required` (streamed) → no completion at all.
+- Bench's existing Plus tool-call scrub (task #84) covered the
+  history-shape side (empty `tool_calls: []` rejection); it does NOT
+  recover the dropped server-side tool calls.
+**Implications:**
+- The "Plus is passive" headline is **invalid**. The bench cannot
+  measure Plus's RTS reasoning at all through the Together endpoint
+  until the adapter is fixed.
+- Per-pack outcomes for Plus on this dataset reflect "what happens
+  when the agent issues Observe every turn for 25 turns" (always a
+  loss/draw for packs that require any action).
+- Paper-side: **omit Plus from headline model comparisons.** Either
+  add a clearly-labelled "Together adapter excludes Plus" footnote,
+  or rerun Plus through a different endpoint (OpenRouter, direct
+  Anthropic-style, or Together once they fix the adapter).
+**Next steps:**
+1. (Done) Document the adapter bug here and in
+   `openra_bench/providers.py` (already notes Plus quirks).
+2. File upstream issue with Together support, including the
+   minimal reproduction (see snippet above, plus
+   `usage.completion_tokens_details.text_tokens > 0` while `message`
+   lacks both `content` and `tool_calls`).
+3. Optional workaround: write a "reasoning-channel fallback parser"
+   that extracts intent like `move_units` / `attack_unit` / numeric
+   coordinates from the reasoning text. Fragile and would conflate
+   model output with NLP-extraction error; better to wait for the
+   adapter fix or use a different endpoint.
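
Editorial sketch (not part of the commit): the "reasoning-channel fallback parser" from step 3 might look like the following if it only tried to recover a target coordinate. The function name `last_coordinate` and the regular expression are assumptions for illustration; nothing here reflects code in `openra_bench`.

```python
import re
from typing import Optional

# Matches an "(x, y)" pair of non-negative integers, e.g. "(50, 20)".
_COORD = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def last_coordinate(reasoning: str) -> Optional[tuple[int, int]]:
    """Return the last "(x, y)" pair mentioned in the reasoning text.

    Hypothetical fallback: assumes the final coordinate in the text is
    the one the model settled on, which is not guaranteed.
    """
    matches = _COORD.findall(reasoning)
    if not matches:
        return None
    x, y = matches[-1]
    return int(x), int(y)


print(last_coordinate("I'll move to (50, 20) to scout east"))
print(last_coordinate("I should avoid (10, 10) because of the turret"))
```

The second call returns a coordinate the text explicitly rejects, which illustrates the fragility noted in step 3: the extractor cannot tell intent from mention, so its errors would be mixed into the model's measured behaviour.
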
 ## F2 — Superweapon mis-aim (Reasoning/Action axis)
+Qwen3.5-9B loses all `spec-nuke-strike` easy cells. The model
+INVOKES the verb (Observe×12-20 + FireSuperweapon×5-8) but targets
+the wrong cell. Plus's cells on this pack are unusable per §F1-RETRACTED.
+**Classification:** Reasoning-axis spatial-commit failure. The verb
+is available, the charge timer is met, but cluster-centre
 identification under partial information fails.
+## F3 — Target initial visibility predicts win rate (headline)
 Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is
 "target in initial sight":
 visibility. This validates the bench's Perception axis: model can
 ACT when target is given; model FAILS when target requires search.
+This is now the **headline Phase-5 finding**, since F1 has been retracted.
+## Engine vs Scenario vs Model attribution
 - **Engine bugs**: 0 attributable to the engine in the sample.
   (3 pre-existing engine P0s — per-player cash race, proc
   session: commits a5014a5 + a84a3d7 + b77e43d. All Rust integration
   tests + bench engine-feature tests now green.)
 - **Scenario defects**: 0 attributable to scenarios in the sample.
+  (3 audit-flagged defects fixed: commits 4ebeee5 + 9000fe3.
+  Bench's defensive cash-strip commit b77e43d preempts the entire
   regression class for 62 packs.)
+- **Provider/adapter bugs**: 1 confirmed (Together drops Plus
+  tool_calls). Class: PROVIDER, not MODEL, not BENCH. See
+  §F1-RETRACTED.
+- **Model failures (9B + gemma only)**: losses cluster on packs
+  where target requires search (F3). Plus excluded.
 ## Per-pack difficulty ranking (Qwen3.5-9B, easy tier)
 signal that the model struggles with multi-step build chains under
 time pressure. spec-nuke-strike's 0W aligns with F2 (mis-aim).
+## Cell-count asymmetry note
+The three models have different completed-cell counts (9B=48,
+Plus=55, gemma=9) because the collection ran models sequentially
+through the main 240-cell plan, then added side runs for Plus
+(`paper-v1-plus-medium/`, 8 medium cells) and gemma
+(`paper-v1-gemma-medium/`, 6 medium cells) to fill in coverage on
+the discriminating `spec-tanya-c4-strike medium` cell. Collection
+remains in flight; cells accumulate via `scripts/collect_eval_data.py
+--resume`.
 ## Data integrity
 - **All 112 cells captured in full untruncated JSONL** with per-turn
+  PNG snapshots at `data/runs/paper-v1-*/`. No data loss. Plus's
+  cells remain available for re-analysis once the Together adapter
+  is fixed.
 - Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
 - Resume-safe: `scripts/collect_eval_data.py --resume` skips cells
   with a `terminal:` line; partial cells re-run cleanly.
+## Phase 5 status: COMPLETE (F1 retracted, F3 promoted)
+The collection continues accumulating in the background. The
+provider-bug finding is the most actionable next step: file with
+Together, optionally implement a reasoning-channel fallback, and
+rerun Plus through a different endpoint to get a real Plus signal.
+## Next paper-prep steps
+1. Cross-link F3 (perception-axis target visibility) into
+   PAPER_PLAN.md §3 Findings as the headline result.
+2. Add a "Provider failures we found" section to the paper covering
+   the Together-Plus adapter bug as an empirical observation about
+   the maturity of OSS-model tool-calling adapters — that itself is
+   a finding of interest for the agent-benchmark community.
+3. Rerun Plus through an alternative endpoint (OpenRouter or fixed
+   Together) for the real Plus comparison once available.
+4. Add Kimi-K2.6 as a fourth model; verify Kimi's tool-calls are
+   not adapter-dropped before drawing conclusions.
+5. Run perception-sweep cells (structured/vision/image ×
+   fog/no-fog) on the same packs to strengthen F3 with controlled
    visibility variation.