OpenRA-Bench / PHASE5_FINDINGS.md
Phase 5: retract F1 (Plus passivity) — Together adapter bug


Phase 5 — Model Failure Triage Findings

Source: 112 completed cells across 3 Together models (Qwen/Qwen3.5-9B, Qwen/Qwen3.6-Plus, google/gemma-4-31B-it), 12 engine-feature scenario packs × 2 levels (easy, medium) × 2 seeds × vision fog mode. Per-cell JSONL captures the full untruncated turn-by-turn record (obs, system_prompt, briefing, model_request, model_response, commands, signals, terminal). Triage generated by scripts/triage_phase4.py.

⚠ CORRECTION (2026-05-23): the original F1 headline — "Plus is passive" — is RETRACTED. Root cause is a Together-API adapter bug dropping Plus's tool_calls from the wire response. Plus IS reasoning and emitting tool calls server-side; the bench parser receives an empty tool_calls list and falls back to the default Observe. See §F1-RETRACTED below for the full diagnosis. Phase-5 headline is now the F3 perception-axis result (target visibility predicts win rate), documented below.

Outcome matrix

Qwen/Qwen3.5-9B (48 cells)

| pack | easy | medium |
| --- | --- | --- |
| combat-naval-shore-strike | 2W | 1W 1L |
| def-bridge-chokepoint | 1W 1L | 2W |
| econ-contested-expansion | 2L | 2L |
| econ-harvester-defense-raid | 2W | 2L |
| econ-mine-and-grow | 2L | 2L |
| econ-multi-patch-allocation | 2L | 2L |
| econ-second-base-race | 2W | 2L |
| spec-engineer-capture | 2W | 2L |
| spec-nuke-strike | 2L | 2L |
| spec-spy-infiltrate | 2W | 2L |
| spec-tanya-c4-strike | 2W | 2W ← perfect 4/4 |
| spec-thief-steal-cash | 2W | 1L 1D |
Totals (summed from the rows above): 20W, 27L, 1D over 48 cells (41.7% win overall); 15/24 = 62.5% win on easy, 5/24 = 20.8% win on medium
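The per-level totals can be re-derived from the matrix rows. A minimal sketch, with the rows transcribed by hand as (wins, losses, draws); the names `RESULTS`, `tally`, and `win_rate` are illustrative and not part of the bench:

```python
# Per-pack outcomes for Qwen3.5-9B, transcribed from the matrix above
# as (wins, losses, draws) per level.
RESULTS = {
    "combat-naval-shore-strike":   {"easy": (2, 0, 0), "medium": (1, 1, 0)},
    "def-bridge-chokepoint":       {"easy": (1, 1, 0), "medium": (2, 0, 0)},
    "econ-contested-expansion":    {"easy": (0, 2, 0), "medium": (0, 2, 0)},
    "econ-harvester-defense-raid": {"easy": (2, 0, 0), "medium": (0, 2, 0)},
    "econ-mine-and-grow":          {"easy": (0, 2, 0), "medium": (0, 2, 0)},
    "econ-multi-patch-allocation": {"easy": (0, 2, 0), "medium": (0, 2, 0)},
    "econ-second-base-race":       {"easy": (2, 0, 0), "medium": (0, 2, 0)},
    "spec-engineer-capture":       {"easy": (2, 0, 0), "medium": (0, 2, 0)},
    "spec-nuke-strike":            {"easy": (0, 2, 0), "medium": (0, 2, 0)},
    "spec-spy-infiltrate":         {"easy": (2, 0, 0), "medium": (0, 2, 0)},
    "spec-tanya-c4-strike":        {"easy": (2, 0, 0), "medium": (2, 0, 0)},
    "spec-thief-steal-cash":       {"easy": (2, 0, 0), "medium": (0, 1, 1)},
}


def tally(level):
    """Sum (wins, losses, draws) over all packs for one level."""
    cells = [by_level[level] for by_level in RESULTS.values()]
    return tuple(sum(column) for column in zip(*cells))


def win_rate(wld):
    """Wins divided by all completed cells (draws count as non-wins)."""
    wins, losses, draws = wld
    return wins / (wins + losses + draws)


easy, medium = tally("easy"), tally("medium")
overall = tuple(e + m for e, m in zip(easy, medium))
for name, wld in [("easy", easy), ("medium", medium), ("overall", overall)]:
    print(f"{name}: {wld[0]}W {wld[1]}L {wld[2]}D, win rate {win_rate(wld):.1%}")
```

Counting draws in the denominator is a choice made here for illustration; the triage script may define win rate differently.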

google/gemma-4-31B-it (9 cells, partial)

| pack | easy | medium |
| --- | --- | --- |
| spec-tanya-c4-strike | 2W | 1W 1L |
| spec-engineer-capture | 2W | - |
(others in flight)
Partial: 5W / 4L / 0D = 55.6% win

Qwen/Qwen3.6-Plus (55 cells, EXCLUDED from headline)

All cells issued Observe only (default fallback) due to the adapter bug described in §F1-RETRACTED. The 0/55 win rate is a measurement artefact, not a model property. Cells remain on disk for future re-analysis once the adapter is fixed.

F1-RETRACTED — Together adapter drops Plus's tool_calls

What we originally claimed (now retracted): Plus exhibited a model-specific "freeze and panic" passivity where it issued only Observe across the entire decision budget on every cell, despite 9B and 31B winning the same packs.

What's actually happening:

Every Plus turn's raw Together response has this exact shape:

{
  "choices": [{
    "message": {"role": "assistant", "reasoning": "I need to move Tanya east to scout..."},
    "finish_reason": "tool_calls"
  }],
  "usage": {
    "completion_tokens": 345,
    "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69}
  }
}

Three pieces of evidence prove Plus DID emit tool calls:

  1. finish_reason: "tool_calls" — the API itself reports the completion ended on tool-call emission.
  2. completion_tokens_details.text_tokens: 69 — Plus produced 69 non-reasoning tokens (the tool-call JSON), but they're absent from message.content and message.tool_calls.
  3. The reasoning channel consistently ends with concrete intent ("I'll move to (50, 20) to scout east") — Plus is reasoning correctly and arriving at a specific action.
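The first two signals can be checked mechanically on any captured response. A minimal sketch, assuming the response has been decoded into a Python dict with the shape shown above; the function name `looks_like_dropped_tool_calls` is illustrative and does not exist in the bench:

```python
def looks_like_dropped_tool_calls(response: dict) -> bool:
    """Return True when a response reports tool-call output in its metadata
    but carries neither tool_calls nor content in the message itself."""
    choice = (response.get("choices") or [{}])[0]
    message = choice.get("message") or {}
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}
    return (
        choice.get("finish_reason") == "tool_calls"
        and details.get("text_tokens", 0) > 0
        and not message.get("tool_calls")
        and not message.get("content")
    )


# The response shape quoted above, as a Python dict.
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "reasoning": "I need to move Tanya east to scout...",
        },
        "finish_reason": "tool_calls",
    }],
    "usage": {
        "completion_tokens": 345,
        "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69},
    },
}

print(looks_like_dropped_tool_calls(sample))
```

Running a check like this over every captured turn would separate adapter-dropped turns from turns where a model genuinely emitted nothing.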

Diagnosis: Together's response adapter for Plus serialises the reasoning channel but DROPS the actual tool-call structure from the returned message. Bench's _reply_from_data parser (openra_bench/providers.py:413-423) reads msg.get("tool_calls") or [] → empty → bench issues default Command::Observe.
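The fallback path can be illustrated as follows. This is a hypothetical stand-in, not the code in openra_bench/providers.py: only the `msg.get("tool_calls") or []` read is taken from the diagnosis, while `OBSERVE`, `commands_from_message`, and the command dict shape are assumptions made for the example.

```python
# Stand-in for the bench's default command (the real type is Command::Observe).
OBSERVE = {"verb": "Observe"}


def commands_from_message(msg: dict) -> list:
    """Turn an assistant message into commands, falling back to Observe
    when the message carries no tool calls."""
    tool_calls = msg.get("tool_calls") or []  # the read quoted in the diagnosis
    if not tool_calls:
        return [OBSERVE]
    return [
        {
            "verb": call["function"]["name"],
            "arguments": call["function"]["arguments"],
        }
        for call in tool_calls
    ]


# A reasoning-only message, as received for Plus, yields the default command.
plus_message = {"role": "assistant", "reasoning": "I'll move to (50, 20) to scout east"}
print(commands_from_message(plus_message))
```

Because a missing key and an empty list are treated identically, the parser cannot tell "the model chose not to act" from "the adapter dropped the action".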

This is a Together backend bug, not a Plus model bug, and not a bench parser bug. Verified by:

  • Direct httpx test outside bench: tool_choice=auto (streamed) → reasoning text only, tool_calls=[], finish_reason=tool_calls.
  • tool_choice=required (streamed) → no completion at all.
  • Bench's existing Plus tool-call scrub (task #84) covered the history-shape side (empty tool_calls: [] rejection); it does NOT recover the dropped server-side tool calls.
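A standalone reproduction along the lines of the httpx test might look like the sketch below. Several parts are assumptions: the `move_units` tool schema and the prompt are invented for illustration, the stream is assumed to use OpenAI-style `data:` lines with `delta.tool_calls`, and the API key is read from a `TOGETHER_API_KEY` environment variable. The network call only runs when that variable is set.

```python
import json
import os

TOGETHER_URL = "https://api.together.xyz/v1/chat/completions"

# Hypothetical tool schema for illustration; the bench's real schema may differ.
MOVE_TOOL = {
    "type": "function",
    "function": {
        "name": "move_units",
        "description": "Move units to a map cell.",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
}


def build_payload(tool_choice: str = "auto", stream: bool = True) -> dict:
    """Build a one-turn chat request that should elicit a tool call."""
    return {
        "model": "Qwen/Qwen3.6-Plus",
        "messages": [{"role": "user", "content": "Move your unit to cell (50, 20)."}],
        "tools": [MOVE_TOOL],
        "tool_choice": tool_choice,
        "stream": stream,
    }


def summarise_chunks(chunks: list) -> dict:
    """Collect tool-call deltas and the last finish_reason from decoded chunks."""
    tool_call_deltas, finish_reason = [], None
    for chunk in chunks:
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            tool_call_deltas.extend(delta.get("tool_calls") or [])
            finish_reason = choice.get("finish_reason") or finish_reason
    return {"tool_call_deltas": tool_call_deltas, "finish_reason": finish_reason}


def run(tool_choice: str) -> dict:
    """Send the streamed request and summarise what came back."""
    import httpx  # third-party; imported lazily so the helpers work without it

    headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}
    chunks = []
    with httpx.stream("POST", TOGETHER_URL, headers=headers,
                      json=build_payload(tool_choice), timeout=60.0) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line.startswith("data: ") and line.strip() != "data: [DONE]":
                chunks.append(json.loads(line[len("data: "):]))
    return summarise_chunks(chunks)


if __name__ == "__main__" and os.environ.get("TOGETHER_API_KEY"):
    for choice in ("auto", "required"):
        print(choice, run(choice))
```

An empty `tool_call_deltas` list alongside `finish_reason` of `tool_calls` would match the behaviour reported above for `tool_choice=auto`.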

Implications:

  • The "Plus is passive" headline is invalid. The bench cannot measure Plus's RTS reasoning at all through the Together endpoint until the adapter is fixed.
  • Per-pack outcomes for Plus on this dataset reflect "what happens when the agent issues Observe every turn for 25 turns" (always a loss/draw for packs that require any action).
  • Paper-side: omit Plus from headline model comparisons. Either add a clearly-labelled "Together adapter excludes Plus" footnote, or rerun Plus through a different endpoint (OpenRouter, direct Anthropic-style, or Together once they fix the adapter).

Next steps:

  1. (Done) Document the adapter bug here and in openra_bench/providers.py (already notes Plus quirks).
  2. File upstream issue with Together support, including the minimal reproduction (see snippet above, plus usage.completion_tokens_details.text_tokens > 0 while message lacks both content and tool_calls).
  3. Optional workaround: write a "reasoning-channel fallback parser" that extracts intent like move_units / attack_unit / numeric coordinates from the reasoning text. Fragile and would conflate model output with NLP-extraction error; better to wait for the adapter fix or use a different endpoint.
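To make the fragility of the optional workaround concrete, here is a deliberately naive sketch of a reasoning-channel fallback. It is not proposed for the bench: the keyword map, the regex, and `guess_command` are all hypothetical, and only the verb names `move_units` and `attack_unit` come from the text above.

```python
import re

# Hypothetical keyword-to-verb map for illustration.
VERB_KEYWORDS = {"move": "move_units", "attack": "attack_unit"}
# Matches coordinate pairs written as "(x, y)".
COORD = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def guess_command(reasoning: str):
    """Guess (verb, (x, y)) from free-text reasoning, or None if nothing matches.

    Uses the last coordinate pair in the text and the last verb keyword
    that appears before it.
    """
    coords = list(COORD.finditer(reasoning))
    if not coords:
        return None
    last = coords[-1]
    prefix = reasoning[: last.start()].lower()
    verb, best_pos = None, -1
    for keyword, name in VERB_KEYWORDS.items():
        pos = prefix.rfind(keyword)
        if pos > best_pos:
            verb, best_pos = name, pos
    if verb is None:
        return None
    return verb, (int(last.group(1)), int(last.group(2)))


# Intended case: the reasoning states a concrete action.
print(guess_command("I'll move to (50, 20) to scout east"))
# Failure case: the model rejects the action, but the extractor still fires.
print(guess_command("I should not move to (50, 20) yet; it is too risky."))
```

Both calls return a `move_units` command for (50, 20), even though the second text argues against moving. That is the conflation of model output with extraction error that the workaround would introduce.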

F2 — Superweapon mis-aim (Reasoning/Action axis)

Qwen3.5-9B loses all spec-nuke-strike easy cells. The model INVOKES the verb (Observe×12-20 + FireSuperweapon×5-8) but targets the wrong cell. Plus's cells on this pack are unusable per §F1-RETRACTED.

Classification: Reasoning-axis spatial-commit failure. The verb is available, the charge timer is met, but cluster-centre identification under partial information fails.

F3 — Target initial visibility predicts win rate (headline)

Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is "target in initial sight":

  • spec-tanya-c4-strike (target adjacent at spawn): 4W / 0L
  • spec-engineer-capture easy (target 4 cells east): 2W / 0L
  • spec-spy-infiltrate easy (proc adjacent): 2W / 0L
  • spec-engineer-capture medium (target 12 cells off-latitude): 0W / 2L
  • spec-spy-infiltrate medium (target fogged): 0W / 2L

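For spec-engineer-capture and spec-spy-infiltrate, the same model wins both easy cells and loses both medium cells; the systematic difference between the two levels is target visibility. spec-tanya-c4-strike, whose target is adjacent at spawn, is won at both levels. With only 2 seeds per cell this is a small sample, but it supports the bench's Perception axis: the model can ACT when the target is given and FAILS when the target requires search.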
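The grouping behind F3 can be written out as a small tally. A minimal sketch: the win/loss counts follow the outcome matrix, while the visible/search label on each row is a reading of the parenthetical notes in the list above, not a field recorded by the bench.

```python
from collections import defaultdict

# (pack, level, target_visible_at_start, wins, losses)
CELLS = [
    ("spec-tanya-c4-strike", "easy+medium", True, 4, 0),
    ("spec-engineer-capture", "easy", True, 2, 0),
    ("spec-spy-infiltrate", "easy", True, 2, 0),
    ("spec-engineer-capture", "medium", False, 0, 2),
    ("spec-spy-infiltrate", "medium", False, 0, 2),
]


def win_rate_by_visibility(cells):
    """Return {visible: win_rate}, pooling wins and games per visibility group."""
    totals = defaultdict(lambda: [0, 0])  # visible -> [wins, games]
    for _pack, _level, visible, wins, losses in cells:
        totals[visible][0] += wins
        totals[visible][1] += wins + losses
    return {visible: wins / games for visible, (wins, games) in totals.items()}


rates = win_rate_by_visibility(CELLS)
print(f"target visible at start: {rates[True]:.0%}")
print(f"target requires search:  {rates[False]:.0%}")
```

The split is 8 of 8 wins against 0 of 4, drawn from three packs and one model, so it is a pattern to test further and not yet a measured effect size.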

This is now the headline Phase-5 finding, since F1 has been retracted.

Engine vs Scenario vs Model attribution

  • Engine bugs: 0 attributable to the engine in the sample. (3 pre-existing engine P0s — per-player cash race, proc auto-spawn, production completion — were FIXED earlier this session: commits a5014a5 + a84a3d7 + b77e43d. All Rust integration tests + bench engine-feature tests now green.)
  • Scenario defects: 0 attributable to scenarios in the sample. (3 audit-flagged defects fixed: commits 4ebeee5 + 9000fe3. Bench's defensive cash-strip commit b77e43d preempts the entire regression class for 62 packs.)
  • Provider/adapter bugs: 1 confirmed (Together drops Plus tool_calls). Class: PROVIDER, not MODEL, not BENCH. See §F1-RETRACTED.
  • Model failures (9B + gemma only): losses cluster on packs where target requires search (F3). Plus excluded.

Per-pack difficulty ranking (Qwen3.5-9B, easy tier)

Wins out of 2 seeds per pack:

  • 2W: combat-naval-shore-strike, econ-harvester-defense-raid, econ-second-base-race, spec-engineer-capture, spec-spy-infiltrate, spec-tanya-c4-strike, spec-thief-steal-cash
  • 1W: def-bridge-chokepoint
  • 0W: econ-contested-expansion, econ-mine-and-grow, econ-multi-patch-allocation, spec-nuke-strike

Economy packs (build-or-die throughput) dominate the 0W list — a signal that the model struggles with multi-step build chains under time pressure. spec-nuke-strike's 0W aligns with F2 (mis-aim).

Cell-count asymmetry note

The three models have different completed-cell counts (9B=48, Plus=55, gemma=9) because the collection ran models sequentially through the main 240-cell plan, then added side runs for Plus (paper-v1-plus-medium/, 8 medium cells) and gemma (paper-v1-gemma-medium/, 6 medium cells) to fill in coverage on the discriminating spec-tanya-c4-strike medium cell. Collection remains in flight; cells accumulate via scripts/collect_eval_data.py --resume.

Data integrity

  • All 112 cells captured in full untruncated JSONL with per-turn PNG snapshots at data/runs/paper-v1-*/. No data loss. Plus's cells remain available for re-analysis once the Together adapter is fixed.
  • Plumbing pinned by tests/test_data_collection.py (3 sub-tests).
  • Resume-safe: scripts/collect_eval_data.py --resume skips cells with a terminal: line; partial cells re-run cleanly.
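The resume rule can be sketched as a completeness check over each cell's JSONL file. This is an assumption-laden illustration, not the logic in scripts/collect_eval_data.py: it assumes one JSON object per line and that a finished cell contains a record with a top-level `terminal` key, and the function names are hypothetical.

```python
import json
from pathlib import Path


def is_complete(cell_path: Path) -> bool:
    """Return True if the cell's JSONL holds a record with a "terminal" key."""
    if not cell_path.exists():
        return False
    with cell_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a truncated final line from an interrupted run
            if isinstance(record, dict) and "terminal" in record:
                return True
    return False


def cells_to_run(cell_paths):
    """Keep only the cells that still need to be run or re-run."""
    return [path for path in cell_paths if not is_complete(path)]
```

Under these assumptions, a cell interrupted mid-write has no terminal record and is re-run from scratch, which matches the "partial cells re-run cleanly" behaviour described above.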

Phase 5 status: COMPLETE (F1 retracted, F3 promoted)

The collection continues to accumulate cells in the background. The provider-bug finding is the most actionable next step: file with Together, optionally implement a reasoning-channel fallback, and rerun Plus through a different endpoint to get a real Plus signal.

Next paper-prep steps

  1. Cross-link F3 (perception-axis target visibility) into PAPER_PLAN.md §3 Findings as the headline result.
  2. Add a "Provider failures we found" section to the paper covering the Together-Plus adapter bug as an empirical observation about the maturity of OSS-model tool-calling adapters — that itself is a finding of interest for the agent-benchmark community.
  3. Rerun Plus through an alternative endpoint (OpenRouter or fixed Together) for the real Plus comparison once available.
  4. Add Kimi-K2.6 as a fourth model; verify Kimi's tool-calls are not adapter-dropped before drawing conclusions.
  5. Run perception-sweep cells (structured/vision/image × fog/no-fog) on the same packs to strengthen F3 with controlled visibility variation.