OpenRA-Bench / PHASE5_FINDINGS.md

yxc20098

Phase 5: retract F1 (Plus passivity) — Together adapter bug

385aa0a about 1 month ago

	# Phase 5 — Model Failure Triage Findings

	Source: 112 completed cells across 3 Together models (Qwen/Qwen3.5-9B,
	Qwen/Qwen3.6-Plus, google/gemma-4-31B-it), 12 engine-feature scenario
	packs × 2 levels (easy, medium) × 2 seeds × vision fog mode. Per-cell
	JSONL captures the full untruncated turn-by-turn record (obs,
	system_prompt, briefing, model_request, model_response, commands,
	signals, terminal). Triage generated by `scripts/triage_phase4.py`.

	> **⚠ CORRECTION (2026-05-23): the original F1 headline — "Plus is
	> passive" — is RETRACTED. Root cause is a Together-API adapter bug
	> dropping Plus's tool_calls from the wire response. Plus IS reasoning
	> and emitting tool calls server-side; the bench parser receives an
	> empty `tool_calls` list and falls back to the default Observe. See
	> §F1-RETRACTED below for the full diagnosis. Phase-5 headline is now
	> the F3 perception-axis result (target visibility predicts win rate),
	> documented below.**

	## Outcome matrix

	### Qwen/Qwen3.5-9B (48 cells)
	| pack                        | easy  | medium           |
	|-----------------------------|-------|------------------|
	| combat-naval-shore-strike   | 2W    | 1W 1L            |
	| def-bridge-chokepoint       | 1W 1L | 2W               |
	| econ-contested-expansion    | 2L    | 2L               |
	| econ-harvester-defense-raid | 2W    | 2L               |
	| econ-mine-and-grow          | 2L    | 2L               |
	| econ-multi-patch-allocation | 2L    | 2L               |
	| econ-second-base-race       | 2W    | 2L               |
	| spec-engineer-capture       | 2W    | 2L               |
	| spec-nuke-strike            | 2L    | 2L               |
	| spec-spy-infiltrate         | 2W    | 2L               |
	| spec-tanya-c4-strike        | 2W    | 2W (perfect 4/4) |
	| spec-thief-steal-cash       | 2W    | 1L 1D            |
	Totals (summed from the table): 20W, 27L, 1D over 48 cells =
	62.5% win on easy (15/24), 20.8% win on medium (5/24)
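
The per-level totals can be re-derived mechanically from the matrix. The snippet below is a minimal sketch that copies the table rows into a dictionary and sums the W/L/D counts; the names `MATRIX_9B` and `tally` are illustrative and are not part of the bench.

```python
import re
from collections import Counter

# Outcome matrix for Qwen/Qwen3.5-9B, copied from the table above.
# Each value is (easy, medium); a cell lists results over 2 seeds.
MATRIX_9B = {
    "combat-naval-shore-strike": ("2W", "1W 1L"),
    "def-bridge-chokepoint": ("1W 1L", "2W"),
    "econ-contested-expansion": ("2L", "2L"),
    "econ-harvester-defense-raid": ("2W", "2L"),
    "econ-mine-and-grow": ("2L", "2L"),
    "econ-multi-patch-allocation": ("2L", "2L"),
    "econ-second-base-race": ("2W", "2L"),
    "spec-engineer-capture": ("2W", "2L"),
    "spec-nuke-strike": ("2L", "2L"),
    "spec-spy-infiltrate": ("2W", "2L"),
    "spec-tanya-c4-strike": ("2W", "2W"),
    "spec-thief-steal-cash": ("2W", "1L 1D"),
}


def tally(cells):
    """Sum W/L/D counts over cell strings such as '1W 1L'."""
    counts = Counter()
    for cell in cells:
        for n, outcome in re.findall(r"(\d+)([WLD])", cell):
            counts[outcome] += int(n)
    return counts


easy = tally(e for e, _ in MATRIX_9B.values())
medium = tally(m for _, m in MATRIX_9B.values())
total = easy + medium

for name, c in (("easy", easy), ("medium", medium), ("total", total)):
    games = sum(c.values())
    print(f"{name}: {c['W']}W {c['L']}L {c['D']}D, win rate {c['W'] / games:.1%}")
```

Keeping the tally next to the table makes it easy to re-check the headline percentages whenever a cell is added or corrected.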

	### google/gemma-4-31B-it (9 cells, partial)
	| pack                  | easy | medium |
	|-----------------------|------|--------|
	| spec-tanya-c4-strike  | 2W   | 1W 1L  |
	| spec-engineer-capture | 2W   | -      |
	| (others in flight)    |      |        |
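	Partial: 5W / 4L / 0D = 55.6% win over 9 cells. The table breaks
	out only 6 of them (5W 1L); the remaining 3 losses are presumably
	in packs still in flight and are not itemised here.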

	### Qwen/Qwen3.6-Plus (55 cells, EXCLUDED from headline)
	All cells issued `Observe` only (default fallback) due to the adapter
	bug described in §F1-RETRACTED. The 0/55 win rate is a **measurement
	artefact**, not a model property. Cells remain on disk for future
	re-analysis once the adapter is fixed.

	## F1-RETRACTED — Together adapter drops Plus's tool_calls

	What we originally claimed (now retracted): Plus exhibited a
	model-specific "freeze and panic" passivity where it issued only
	`Observe` across the entire decision budget on every cell, despite
	9B and 31B winning the same packs.

	What's actually happening:

	Every Plus turn's raw Together response has this exact shape:

	```json
	{
	"choices": [{
	"message": {"role": "assistant", "reasoning": "I need to move Tanya east to scout..."},
	"finish_reason": "tool_calls"
	}],
	"usage": {
	"completion_tokens": 345,
	"completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69}
	}
	}
	```

	Three pieces of evidence prove Plus DID emit tool calls:

	1. `finish_reason: "tool_calls"` — the API itself reports the
	completion ended on tool-call emission.
	2. `completion_tokens_details.text_tokens: 69` — Plus produced 69
	non-reasoning tokens (the tool-call JSON), but they're absent
	from `message.content` and `message.tool_calls`.
	3. The `reasoning` channel consistently ends with concrete intent
	("I'll move to (50, 20) to scout east") — Plus is reasoning
	correctly and arriving at a specific action.
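
The first two signals are machine-checkable, so affected turns can be flagged automatically. The following is a minimal sketch of such a check against the response shape shown above; the function name `looks_like_dropped_tool_call` is hypothetical and does not exist in the bench.

```python
from typing import Any, Mapping


def looks_like_dropped_tool_call(response: Mapping[str, Any]) -> bool:
    """True when the API reports a tool-call finish and non-reasoning
    tokens, yet the message carries neither content nor tool_calls."""
    choices = response.get("choices") or []
    if not choices:
        return False
    choice = choices[0]
    msg = choice.get("message") or {}
    details = (response.get("usage") or {}).get("completion_tokens_details") or {}
    text_tokens = details.get("text_tokens") or 0
    return (
        choice.get("finish_reason") == "tool_calls"
        and text_tokens > 0
        and not msg.get("tool_calls")
        and not msg.get("content")
    )


# The response shape reported above for every Plus turn.
sample = {
    "choices": [{
        "message": {"role": "assistant",
                    "reasoning": "I need to move Tanya east to scout..."},
        "finish_reason": "tool_calls",
    }],
    "usage": {
        "completion_tokens": 345,
        "completion_tokens_details": {"reasoning_tokens": 276, "text_tokens": 69},
    },
}
print(looks_like_dropped_tool_call(sample))
```

The third signal (intent stated in the reasoning text) is deliberately left out of the check, since it needs free-text interpretation.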

	Diagnosis: Together's response adapter for Plus serialises the
	reasoning channel but DROPS the actual tool-call structure from the
	returned message. Bench's `_reply_from_data` parser
	(`openra_bench/providers.py:413-423`) reads `msg.get("tool_calls") or []`
	→ empty → bench issues default `Command::Observe`.
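
To make the fallback path concrete, here is an illustrative sketch of how an empty `tool_calls` list turns into a default `Observe`. It is not the actual `_reply_from_data` implementation: the function name, the returned dict shape, and the OpenAI-style `function.name` / `function.arguments` layout are assumptions.

```python
import json
from typing import Any, Dict, List, Mapping


def commands_from_message(msg: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Translate an assistant message into commands, defaulting to Observe."""
    tool_calls = msg.get("tool_calls") or []
    if not tool_calls:
        # Nothing arrived on the wire, so the only safe action is the default.
        return [{"name": "Observe", "arguments": {}}]
    commands = []
    for call in tool_calls:
        fn = call["function"]
        # OpenAI-style tool calls carry arguments as a JSON-encoded string.
        commands.append({"name": fn["name"], "arguments": json.loads(fn["arguments"])})
    return commands


# A Plus-style message: reasoning present, tool_calls missing.
plus_msg = {"role": "assistant", "reasoning": "I'll move to (50, 20) to scout east"}
print(commands_from_message(plus_msg))
```

With a parser of this shape, a provider that drops `tool_calls` is indistinguishable from a model that chose to do nothing, which is how the original passivity reading arose.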

	**This is a Together backend bug, not a Plus model bug, and not a
	bench parser bug.** Verified by:
	- Direct httpx test outside bench: `tool_choice=auto` (streamed)
	→ reasoning text only, `tool_calls=[]`, finish_reason=tool_calls.
	- `tool_choice=required` (streamed) → no completion at all.
	- Bench's existing Plus tool-call scrub (task #84) covered the
	history-shape side (empty `tool_calls: []` rejection); it does NOT
	recover the dropped server-side tool calls.
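
A standalone reproduction along the lines of the httpx test might look like the sketch below. Several details are assumptions rather than facts from this report: the endpoint URL (Together's OpenAI-compatible chat-completions route), the `TOGETHER_API_KEY` environment variable, the prompt, and the single `move_units` tool schema, which only stands in for the bench's real tool definitions.

```python
import json
import os
from typing import Any, Dict, List

# Assumed OpenAI-compatible endpoint; adjust if the account uses another route.
TOGETHER_URL = "https://api.together.xyz/v1/chat/completions"

# Hypothetical tool schema, used only to give the model something to call.
MOVE_TOOL = {
    "type": "function",
    "function": {
        "name": "move_units",
        "description": "Move units to a map cell.",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
}


def build_payload(tool_choice: str) -> Dict[str, Any]:
    """Build a streamed request body with tool_choice 'auto' or 'required'."""
    return {
        "model": "Qwen/Qwen3.6-Plus",
        "messages": [{"role": "user", "content": "Move Tanya east to scout."}],
        "tools": [MOVE_TOOL],
        "tool_choice": tool_choice,
        "stream": True,
    }


def stream_chunks(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """POST the payload and collect the decoded server-sent-event chunks."""
    import httpx  # lazy import: the payload builder works without httpx

    headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}
    chunks = []
    with httpx.stream("POST", TOGETHER_URL, headers=headers,
                      json=payload, timeout=60.0) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line.startswith("data: ") and line != "data: [DONE]":
                chunks.append(json.loads(line[len("data: "):]))
    return chunks
```

Calling `stream_chunks(build_payload("auto"))` and then `stream_chunks(build_payload("required"))` would exercise the two cases listed above; inspecting the chunks for any `tool_calls` delta is left to the reader.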

	Implications:

	- The "Plus is passive" headline is invalid. The bench cannot
	measure Plus's RTS reasoning at all through the Together endpoint
	until the adapter is fixed.
	- Per-pack outcomes for Plus on this dataset reflect "what happens
	when the agent issues Observe every turn for 25 turns" (always a
	loss/draw for packs that require any action).
	- Paper-side: omit Plus from headline model comparisons. Either
	add a clearly-labelled "Together adapter excludes Plus" footnote,
	or rerun Plus through a different endpoint (OpenRouter, direct
	Anthropic-style, or Together once they fix the adapter).

	Next steps:
	1. (Done) Document the adapter bug here and in
	`openra_bench/providers.py` (already notes Plus quirks).
	2. File upstream issue with Together support, including the
	minimal reproduction (see snippet above +
	`usage.completion_tokens_details.text_tokens > 0` while `message`
	lacks both `content` and `tool_calls`).
	3. Optional workaround: write a "reasoning-channel fallback parser"
	that extracts intent like `move_units` / `attack_unit` / numeric
	coordinates from the reasoning text. Fragile and would conflate
	model output with NLP-extraction error; better to wait for the
	adapter fix or use a different endpoint.
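
To illustrate why the optional workaround in step 3 is fragile, here is a deliberately naive sketch that takes the last `(x, y)` pair in the reasoning text as the intended target. The function name and the "last coordinate wins" heuristic are assumptions made for illustration, not a proposed bench component.

```python
import re
from typing import Optional, Tuple

# Matches integer pairs written as "(50, 20)".
_COORD = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def last_coordinate(reasoning: str) -> Optional[Tuple[int, int]]:
    """Return the last (x, y) pair mentioned in the text, or None."""
    matches = _COORD.findall(reasoning)
    if not matches:
        return None
    x, y = matches[-1]
    return int(x), int(y)


# Works when the intent is the final coordinate in the sentence...
print(last_coordinate("I'll move to (50, 20) to scout east"))
# ...but picks the wrong cell when another coordinate is mentioned later.
print(last_coordinate("I'll move to (50, 20), away from the base at (10, 10)"))
```

The second call returns the base position rather than the move target, which is exactly the kind of extraction error that would be conflated with model error in the results.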

	## F2 — Superweapon mis-aim (Reasoning/Action axis)

	Qwen3.5-9B loses both `spec-nuke-strike` easy cells (and, per the
	outcome matrix, both medium cells). The model INVOKES the verb
	(Observe×12-20 + FireSuperweapon×5-8) but targets the wrong cell.
	Plus's cells on this pack are unusable per §F1-RETRACTED.

	Classification: Reasoning-axis spatial-commit failure. The verb
	is available, the charge timer is met, but cluster-centre
	identification under partial information fails.

	## F3 — Target initial visibility predicts win rate (headline)

	Across Qwen3.5-9B's 48 cells, the strongest predictor of WIN is
	"target in initial sight":
	- `spec-tanya-c4-strike` (target adjacent at spawn): 4W / 0L
	- `spec-engineer-capture easy` (target 4 cells east): 2W / 0L
	- `spec-spy-infiltrate easy` (proc adjacent): 2W / 0L
	- `spec-engineer-capture medium` (target 12 cells off-latitude): 0W / 2L
	- `spec-spy-infiltrate medium` (target fogged): 0W / 2L

	The same model wins the easy versions of `spec-engineer-capture`
	and `spec-spy-infiltrate` and loses the medium versions; the only
	systematic difference is target visibility. `spec-tanya-c4-strike`,
	whose target is adjacent at spawn, is won at both levels. This
	validates the bench's Perception axis: model can ACT when target
	is given; model FAILS when target requires search.
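
The split can be tabulated directly. The sketch below groups the five rows listed above by whether the target is in initial sight, taking the Tanya row from the outcome matrix (2W easy, 2W medium). Treating "12 cells off-latitude" and "fogged" as not visible is this sketch's reading of the list, and the sample is only 12 cells.

```python
from collections import defaultdict

# (pack and level, target in initial sight, wins, losses)
ROWS = [
    ("spec-tanya-c4-strike easy+medium", True, 4, 0),
    ("spec-engineer-capture easy", True, 2, 0),
    ("spec-spy-infiltrate easy", True, 2, 0),
    ("spec-engineer-capture medium", False, 0, 2),
    ("spec-spy-infiltrate medium", False, 0, 2),
]


def win_rate_by_visibility(rows):
    """Return {visible: win_rate} aggregated over all rows."""
    totals = defaultdict(lambda: [0, 0])  # visible -> [wins, games]
    for _, visible, wins, losses in rows:
        totals[visible][0] += wins
        totals[visible][1] += wins + losses
    return {visible: w / g for visible, (w, g) in totals.items()}


rates = win_rate_by_visibility(ROWS)
print(f"target visible: {rates[True]:.0%}, target not visible: {rates[False]:.0%}")
```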

	This is now the headline Phase-5 finding, since F1 has been retracted.

	## Engine vs Scenario vs Model attribution

	- Engine bugs: 0 attributable to the engine in the sample.
	(3 pre-existing engine P0s — per-player cash race, proc
	auto-spawn, production completion — were FIXED earlier this
	session: commits a5014a5 + a84a3d7 + b77e43d. All Rust integration
	tests + bench engine-feature tests now green.)
	- Scenario defects: 0 attributable to scenarios in the sample.
	(3 audit-flagged defects fixed: commits 4ebeee5 + 9000fe3.
	Bench's defensive cash-strip commit b77e43d preempts the entire
	regression class for 62 packs.)
	- Provider/adapter bugs: 1 confirmed (Together drops Plus
	tool_calls). Class: PROVIDER, not MODEL, not BENCH. See
	§F1-RETRACTED.
	- Model failures (9B + gemma only): losses cluster on packs
	where target requires search (F3). Plus excluded.

	## Per-pack difficulty ranking (Qwen3.5-9B, easy tier)

	Wins out of 2 seeds per pack:
	- 2W: combat-naval-shore-strike, econ-harvester-defense-raid,
	econ-second-base-race, spec-engineer-capture, spec-spy-infiltrate,
	spec-tanya-c4-strike, spec-thief-steal-cash
	- 1W: def-bridge-chokepoint
	- 0W: econ-contested-expansion, econ-mine-and-grow,
	econ-multi-patch-allocation, spec-nuke-strike

	Economy packs (build-or-die throughput) dominate the 0W list — a
	signal that the model struggles with multi-step build chains under
	time pressure. spec-nuke-strike's 0W aligns with F2 (mis-aim).

	## Cell-count asymmetry note

	The three models have different completed-cell counts (9B=48,
	Plus=55, gemma=9) because the collection ran models sequentially
	through the main 240-cell plan, then added side runs for Plus
	(`paper-v1-plus-medium/`, 8 medium cells) and gemma
	(`paper-v1-gemma-medium/`, 6 medium cells) to fill in coverage on
	the discriminating `spec-tanya-c4-strike medium` cell. Collection
	remains in flight; cells accumulate via `scripts/collect_eval_data.py
	--resume`.

	## Data integrity

	- All 112 cells captured in full untruncated JSONL with per-turn
	PNG snapshots at `data/runs/paper-v1-*/`. No data loss. Plus's
	cells remain available for re-analysis once the Together adapter
	is fixed.
	- Plumbing pinned by `tests/test_data_collection.py` (3 sub-tests).
	- Resume-safe: `scripts/collect_eval_data.py --resume` skips cells
	with a `terminal:` line; partial cells re-run cleanly.
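
The resume rule can be sketched as a completeness check on each cell file. This is not the code in `scripts/collect_eval_data.py`; it assumes that each JSONL line is a JSON object and that a finished cell has a record with a non-null `terminal` field, and the name `is_complete` is hypothetical.

```python
import json
from pathlib import Path


def is_complete(cell_path: Path) -> bool:
    """True if some record in the cell's JSONL has a non-null 'terminal'."""
    if not cell_path.exists():
        return False
    with cell_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A half-written final line from an interrupted run.
                continue
            if isinstance(record, dict) and record.get("terminal") is not None:
                return True
    return False


def cells_to_run(cell_paths):
    """Keep only the cells that still need to be (re-)run."""
    return [p for p in cell_paths if not is_complete(p)]
```

A partial cell is simply re-run from scratch under this rule, which matches the "partial cells re-run cleanly" behaviour described above.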

	## Phase 5 status: COMPLETE (F1 retracted, F3 promoted)

	The collection continues accumulating in the background. The
	provider-bug finding yields the most actionable next step: file
	with Together, optionally implement a reasoning-channel fallback,
	and rerun Plus through a different endpoint to get a real Plus
	signal.

	## Next paper-prep steps

	1. Cross-link F3 (perception-axis target visibility) into
	PAPER_PLAN.md §3 Findings as the headline result.
	2. Add a "Provider failures we found" section to the paper covering
	the Together-Plus adapter bug as an empirical observation about
	the maturity of OSS-model tool-calling adapters — that itself is
	a finding of interest for the agent-benchmark community.
	3. Rerun Plus through an alternative endpoint (OpenRouter or fixed
	Together) for the real Plus comparison once available.
	4. Add Kimi-K2.6 as a fourth model; verify Kimi's tool-calls are
	not adapter-dropped before drawing conclusions.
	5. Run perception-sweep cells (structured/vision/image ×
	fog/no-fog) on the same packs to strengthen F3 with controlled
	visibility variation.
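
For step 5, the sweep grid can be enumerated up front so its size is known before any cells run. The sketch below is only an illustration: restricting the sweep to the three F3 packs, keeping two levels, and the seed ids `0` and `1` are all assumptions, not the bench's plan format.

```python
from itertools import product

OBS_MODES = ("structured", "vision", "image")
FOG_MODES = ("fog", "no-fog")
PACKS = ("spec-tanya-c4-strike", "spec-engineer-capture", "spec-spy-infiltrate")
LEVELS = ("easy", "medium")
SEEDS = (0, 1)  # hypothetical seed ids

# One dict per cell, covering every combination of the five factors.
sweep = [
    {"pack": pack, "level": level, "obs": obs, "fog": fog, "seed": seed}
    for pack, level, obs, fog, seed in product(PACKS, LEVELS, OBS_MODES, FOG_MODES, SEEDS)
]

print(f"{len(sweep)} cells per model")
```

Crossing observation mode with fog on the same packs and seeds keeps everything except visibility fixed, which is what the controlled comparison needs.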