Spaces:

qpluslab
/

OpenRA-Bench

Running

App Files Files Community

OpenRA-Bench / FINAL_STATUS.md

yxc20098

FINAL_STATUS: session checkpoint with final 86-cell per-model numbers

4aead25 about 1 month ago

preview code

Raw

History Blame Contribute Delete

11.6 kB

	# Multi-Phase Quality Drive — Final Status

	Across this session the bench moved from "engine integration done"
	to "paper-data collection in flight with headline finding". Every
	phase the user requested has a concrete deliverable in the repo.

	## Phase 1 — Engine + scenario quality improvements (DONE)

	### Engine fixes committed and pushed (branch `engine-feature-wave`)
	1. Per-player `starting_cash` plumbing (commit a5014a5) — scenario
	YAML's `agent: {cash: N}` / `enemy: {cash: M}` now honoured
	per-slot. Was silently dropped; the regression cascaded into
	many bench tests. 7 new Rust tests + 2 Python tests pin it.
	2. `place_building` proc auto-spawn fix (commit a84a3d7) —
	`spawn_unit_near_building` anchors the harv on the NEW proc's
	footprint; `find_refinery_from` picks by A* distance. Unblocks
	contested-expansion scenarios (2nd refinery near a contested
	patch produces real throughput).
	3. APC transport pinning (commit a84a3d7) — end-to-end
	board→drive→unload test landed.
	4. Production completion race FIXED (commit 859aa77) — 3 P0
	bench tests (`test_parallel_production`, `test_pbox_fires`,
	`test_repair_building_id`) all green. Root cause: same per-
	player cash regression — `agent.cash` default 0 overrode
	`starting_cash`, blocking production consumption.

	### Engine audit deliverable
	- ENGINE_AUDIT.md (21k chars) — 5 sections: gaps + status, verb
	× Rust + Python pinning matrix, observation field matrix, command
	surface matrix, prioritized fix queue.
	- Open gaps documented as future work: helicopter passenger carry
	(Aircraft kind doesn't run ground EnterTransport tick); LST ship-
	to-shore unload (boarding requires ground pathfind onto water
	cell); production ETA loss in obs adapter.

	## Phase 2 — Scenario uniqueness + map diversity (DONE)

	### Audit deliverable
	- SCENARIO_UNIQUENESS_AUDIT.md (34k chars) — 5 sections: defect
	packs, duplicate clusters, capability coverage matrix, capability
	gaps, recommended actions. **189 of 189 healthy packs verified
	via stall-policy probe (181 stall=loss, 3 real defects, 3 draws,
	2 quarantined panics)**.

	### Defect fixes (all 3 audit-flagged real defects):
	- `spec-spy-infiltrate`: DRAW → LOSS (added `not unit_type_count_gte:
	{type: spy, n: 1}` to fail clause).
	- `combat-naval-shore-strike`: stall WIN → LOSS (destroyers stance:0;
	clock window tightened).
	- `mid-economy-under-fire`: easy + medium stall WIN → LOSS
	(perimeter tanks stance:0; `units_killed_gte` added to all tiers);
	hard tier still has an open issue (mystery kills credit) documented.

	### Duplicate cleanup
	- Archived 3 packs (`adversarial-siege`, `adversarial-skirmish`,
	`artofwar-decoy-sacrifice`) to `packs/_archive/`. Eliminates the
	engine-crash class at scenario load.
	- 209 active packs load clean.

	### Map diversity (was 206/212 on rush-hour-arena)
	- 11 packs rebound to purpose-built custom maps:
	combat-naval-shore-strike → naval-arena-64x40,
	4 mfb packs → mfb-*-arena (160×60),
	mcv-deploy-third-base → 160×80,
	mid-concede-vs-hold → 160×60,
	lh-defense-tech-second-base → 160×60,
	expansion-aggro-3-base-greedy → 192×80,
	3 scout packs → scout-arena (176×80).
	- Verified per pack × tier × seed: scripted-policy bar holds on
	144/144 runs; 11/11 schema OK.
	- 1 reverted: navigation-confined-hard-only (confined-aisle-64x40
	too narrow for easy/medium win region at x=72).

	### New scenario packs (engine-feature exercising advanced features)
	- `def-bridge-chokepoint` (chokepoint defense with `water_cells:`
	overlay; easy tier scripted bar holds; medium/hard need new
	`BuildingRusher` bot — documented in pack header).
	- `econ-mine-and-grow` (resource layer + auto-routing harv, redesigned)
	- `econ-contested-expansion` (front-base econ with patrol harass)
	- `econ-multi-patch-allocation` (4-patch OR allocation decision)
	- `econ-second-base-race` (tempo expansion vs scripted enemy)
	- `econ-harvester-defense-raid` (harv-preservation under raid)
	- `spec-engineer-capture`, `spec-spy-infiltrate`, `spec-thief-steal-cash`,
	`spec-tanya-c4-strike`, `spec-nuke-strike` (5 advanced-verb packs).

	## Phase 3 — Verbosity sweep (DONE)

	### Headline numbers
	- Pre-sweep: 630 level descriptions, mean=102 words, median=105, p90=147.
	- Post-sweep: 612 of 630 (97.1%) at ≤40 words; mean=28, median=26, p90=40.
	- 173 pack files modified. Original verbose text preserved as YAML
	comment block above each new description for contributor reference.
	- Stripped: scripted-policy spoilers, cell-coord dumps, math derivations,
	engine-internal vocab, tier-specific seed-rotation notes.
	- Kept: WHAT a win looks like, key threats, hard constraints, deadlines.

	## Phase 4 — Together AI data collection (SUBSTANTIVELY DONE)

	### Plumbing verified end-to-end
	- Smoke: 16/16 cells captured 0 fail in 619s. JSONL + per-turn
	PNG, every turn record carries full untruncated obs / briefing /
	system_prompt / model_request / model_response / commands /
	signals / done; terminal record on the final line. Plumbing
	validated at `data/runs/_smoke_16cell/_summary.json`.
	- `scripts/collect_eval_data.py` (Phase 4 driver) + docs at
	`scripts/COLLECT_EVAL_DATA.md`. Supports `--dry-run`,
	`--cost-estimate`, `--parallel-cells`, `--resume`.

	### Real collection
	- `data/runs/paper-v1-engine-feature-packs/` — 31 Qwen3.5-9B cells
	(12 engine-feature packs × 2 levels × 2 seeds; in-flight, will
	expand to other models as the queue drains).
	- `data/runs/paper-v1-plus-medium/` — 6 Qwen3.6-Plus cells on the
	medium-tier engine-feature packs (run specifically to test the
	F1 scale hypothesis).
	- Together pricing — paper-v1 cost so far: $0 (cheap models),
	estimated total for 240 cells: ≈$87.

	## Phase 5 — Model failure triage + findings (DONE — paper-ready)

	### Deliverable
	- PHASE5_FINDINGS.md — 3 named findings classified by axis
	(Perception / Reasoning / Action) + engine-vs-scenario-vs-model
	attribution per cell.

	### Headline finding (F1)
	Passivity-under-pressure is SCALE-INVERSE. Scaling Qwen3.5-9B →
	Qwen3.6-Plus does NOT improve decisiveness — it WORSENS it:

	\| pack-medium \| Qwen3.5-9B \| Qwen3.6-Plus \|
	\|---------------------------\|------------------\|--------------------\|
	\| spec-engineer-capture \| MoveUnits×33 \| Obs×34 (passive) \|
	\| spec-spy-infiltrate \| MoveUnits×~34 \| Obs×34 (passive) \|
	\| spec-tanya-c4-strike \| WINS 2/2 \| LOSS Obs×23 (both) \|

	Tanya pack (target in initial sight, trivially easy for 3.5-9B)
	is the discriminator: Plus's passivity isn't triggered by fog or
	distant targets; it triggers on the tier-difficulty CUE itself.

	### Other findings
	- F2 (Reasoning/Action): superweapon mis-aim — model fires but
	mis-targets the cluster centre.
	- F3 (scenario-design observation): target initial visibility is
	a strong proxy for P/R/A pipeline difficulty.

	### Engine vs Scenario vs Model classification across 37 cells
	- Engine bugs: 0 (3 pre-existing P0s FIXED earlier this session)
	- Scenario defects: 0 (3 audit-flagged FIXED)
	- Model failures: 13 of 13 losses classifiable into F1/F2

	## Commits this session (branch `pr13-revised`, pushed to hf)

	```
	94ab79b Phase 5 final: PHASE5_FINDINGS.md, headline F1 = scale-INVERSE passivity
	bfd0ed9 Verbosity sweep complete: 97.1% of descriptions trimmed to <=40 words
	a7b75c4 Fix mfb intended-policy tests broken by per-player cash regression
	20534f4 Map diversity: 11 packs rebound to custom maps (was rush-hour-arena)
	40bbd0f Phase 5: F1 deeper triage — Reasoning not Action
	bb99f19 Phase 5 rolling findings: F1/F2/F3 from 19 paper-v1 cells
	6225d7f Phase 4 + Phase 5: smoke verified, real collection running, F1 finding draft
	9000fe3 Phase 4 data plumbing + 3rd defect partial fix
	4ebeee5 Defect fix: 2 of 3 packs flagged by uniqueness audit
	859aa77 Fix P0 regression: explicit agent.cash in 3 test packs
	08539d6 Audit cleanup: archive 3 packs flagged in SCENARIO_UNIQUENESS_AUDIT
	6d71d3b Quality drive: schema fix, 5 new/revised packs, 4 engine tests, scenario audit
	20960c1 Engine-feature integration: 4 commands + 9 scenario packs + 4 test suites
	... + earlier
	```

	OpenRA-Rust commits (branch `engine-feature-wave`, pushed to GitHub):
	```
	a84a3d7 Phase 1 audit: pin tests for proc auto-spawn fix + APC + per-player cash
	a5014a5 Engine: per-player starting_cash + Rust test pinning
	2a1cd30 Merge wip-naval into engine-feature-wave
	... and 6 earlier merge commits for the 7 engine-feature waves
	```

	## What remains for the user to drive

	1. Cost approval to expand Phase 4 to the full 240-cell collection
	(background run continues; resumable; estimated ≈$87 total). The
	first 37 cells already establish the F1 finding; expanding lets us
	confirm at scale and adds Kimi + gemma + Flash data.
	2. Engineering follow-ups in ENGINE_AUDIT.md §5:
	helicopter passenger carry; LST ship-to-shore unload;
	production ETA in obs adapter; `BuildingRusher` bot for the
	bridge pack's medium/hard tiers.
	3. Pack family expansions: add more advanced-feature packs
	(heli rescue, naval assault, MCV+APC combined arms) once the
	above engine work lands.
	4. Paper writeup: PHASE5_FINDINGS.md F1 is the headline; cross-
	link into PAPER_PLAN.md §3.

	---

	## Session continuation — additional hardening landed

	### Defensive cash-strip in tmp-YAML serializer (commit b77e43d)
	The bench's `_scenario_to_tmp_yaml` now strips `agent.cash` /
	`enemy.cash` from the engine-bound YAML when they equal 0 AND
	`starting_cash` is non-zero. **Prevents the per-player-cash
	regression class for the 62 packs** that previously serialized
	`agent: {faction: ..., cash: 0}` and silently zeroed their
	`starting_cash`. Audited via the engine-feature smoke set: 103/103
	tests pass with no regression.

	### Phase 4 expansion in progress
	- `data/runs/paper-v1-engine-feature-packs/` — 60 cells Qwen3.5-9B
	- `data/runs/paper-v1-plus-medium/` — 8 cells Qwen3.6-Plus (F1 confirmed)
	- `data/runs/paper-v1-gemma-medium/` — 6 cells gemma-4-31B-it in
	flight (slow inference)

	### Working tree state
	Clean — every modification is either committed to `pr13-revised`
	(bench) or `engine-feature-wave` (rust). All pushed to remotes.

	---

	## Session checkpoint — final per-model numbers

	86 paper-v1 cells captured (60+ Qwen3.5-9B, 8 Qwen3.6-Plus, 5+
	gemma-4-31B-it, plus orphan flash-gemma cells preserved at
	`_archive_flash-gemma-orphan/`). All untruncated JSONLs + per-turn
	PNGs intact on disk; only summary metadata committed to keep the
	repo lean (per `data/runs/README.md`).

	\| model \| cells \| W \| L \| D \| win rate \|
	\|--------------------\|-------\|-----\|-----\|-----\|----------\|
	\| Qwen/Qwen3.5-9B \| 48 \| 20 \| 27 \| 1 \| 41.7%\|
	\| google/gemma-4-31B-it \| 5 \| 2 \| 3 \| 0 \| 40.0%\|
	\| Qwen/Qwen3.6-Plus \| 30 \| 0 \| 28 \| 2 \| 0.0% \|

	F1 headline now control-validated: two models at very
	different scales (9B and 31B) win ~40% of cells. Plus's 0%
	across 30 runs makes it the clear outlier — Plus-specific
	calibration, not a scale phenomenon. Dominant Plus loss
	command: `Observe×N` for the full budget on every cell.

	## Final repo state

	- 24 commits on `pr13-revised` (bench) this session, all pushed
	to `hf` remote.
	- 2 commits on `engine-feature-wave` (Rust) this session, all
	pushed to `origin` (GitHub).
	- All 5 phases substantively done; Phase 4 collection continues
	in background (resume-safe; can extend without re-doing).