W1.c — ≤32B Zero-Label Repair Scaling Arm (multi-family, zero-shot)

First scaling measurement for the verified-union planner. Vanilla (not fine-tuned) 20–31B open-weights models were dropped into the exact hospital pipeline that the 4B fine-tune gate used: batched raw planner (batch_size=4, same scrubdata/prompt.py contract, temperature 0) → verify_plan(tau=0.5) → union with the grounded heuristic (mock_plan). Plans were scored against hospital's 509 real errors with the repairs-only, churn-neutral protocol in eval/precision_curve.py. Protocol parity was verified by re-scoring the captured v6 plan through the same scorer, which reproduces the prior gate numbers exactly (precision/coverage: gated 0.993/0.287, union 0.905/0.413).
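The gate-then-union flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in scrubdata: the `Cell` and `Repairs` types, the per-entry verifier score, the rule that gated model repairs override the heuristic on overlapping cells, and the precision/coverage definitions are all hypothetical stand-ins. The real `verify_plan`, `mock_plan`, and the scorer in eval/precision_curve.py may work differently.

```python
from typing import Dict, Tuple

# Hypothetical types: a repair maps a (row index, column name) cell to a new value.
Cell = Tuple[int, str]
Repairs = Dict[Cell, str]


def gate(scored_plan: Dict[Cell, Tuple[str, float]], tau: float = 0.5) -> Repairs:
    """Keep only plan entries whose (assumed) verifier score is at least tau."""
    return {cell: value for cell, (value, score) in scored_plan.items() if score >= tau}


def union(gated: Repairs, heuristic: Repairs) -> Repairs:
    """Merge both sources; assumption: gated model repairs win on overlapping cells."""
    merged = dict(heuristic)
    merged.update(gated)
    return merged


def precision_coverage(repairs: Repairs, truth: Repairs) -> Tuple[float, float]:
    """Assumed metrics: precision = correct / proposed, coverage = correct / known errors."""
    correct = sum(1 for cell, value in repairs.items() if truth.get(cell) == value)
    # An empty plan scores precision 1.0 at coverage 0.0, like the degenerate row in the table.
    precision = correct / len(repairs) if repairs else 1.0
    coverage = correct / len(truth) if truth else 0.0
    return precision, coverage


# Toy data, invented for illustration only.
truth = {(0, "city"): "Boston", (1, "city"): "Chicago", (2, "zip"): "02134", (3, "state"): "MA"}
scored_plan = {
    (0, "city"): ("Boston", 0.9),
    (1, "city"): ("Chicgo", 0.3),   # wrong and low-scoring: dropped by the gate
    (2, "zip"): ("02134", 0.7),
}
heuristic = {(1, "city"): "Chicago", (3, "state"): "AL"}

gated = gate(scored_plan, tau=0.5)
combined = union(gated, heuristic)
gated_point = precision_coverage(gated, truth)
union_point = precision_coverage(combined, truth)
print("gated:", gated_point, "union:", union_point)
```

In this toy case the gate keeps two correct repairs, and the union adds one correct and one wrong heuristic repair, so coverage rises while precision falls. That is the same direction of trade-off the gated and union columns report.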

Disclosure: ≤32B open-weights models measured via hosted inference for speed; all are locally deployable in principle.

| model | params (B) | family | gated precision @ coverage | union precision @ coverage | validity | kept/dropped (verifier) | runtime (s) |
|---|---|---|---|---|---|---|---|
| scrubdata-ft-v6 (Qwen3-4B fine-tune) | 4 | qwen3 (fine-tuned) | 0.993 @ 0.287 | 0.905 @ 0.413 | — | 132/38 | — (prior measurement) |
| gpt-oss:20b | 20 | openai/gpt-oss | 1.0 @ 0.000* | 0.845 @ 0.257* | 0.0 | 0/0 | 360 |
| devstral-small-2:24b | 24 | mistral/devstral | 0.943 @ 0.426 | 0.915 @ 0.485 | 1.0 | 208/87 | 135 |
| nemotron-3-nano:30b | 30 | nvidia/nemotron | 1.0 @ 0.138 | 0.877 @ 0.336 | 0.4 | 63/6 | 114 |
| gemma4:31b | 31 | google/gemma | 0.943 @ 0.426 | 0.915 @ 0.485 | 1.0 | 209/28 | 104 |

* gpt-oss:20b is a serving-path failure, not a measured capability. The model generated ~4.8k tokens per planning call (done_reason=stop), but the Ollama Cloud proxy returned empty content and empty thinking on all 5 calls at both num_predict=4000 and num_predict=8000 (simple prompts work). Its "gated" point is therefore the degenerate empty plan, and its "union" point is the heuristic backstop alone. nemotron-3-nano produced valid JSON on only 2/5 batch calls at num_predict=8000 (long-thinking truncation), which gives its validity of 0.4; validity is part of the measurement.
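The validity column can be read as the fraction of batch calls that return a usable plan. The sketch below assumes that reading and checks only that a response is non-empty and parses as a JSON object or array. The helper names are hypothetical, and the real plan-schema check in the runner is likely stricter than plain JSON parsing.

```python
import json
from typing import List


def is_valid_plan(raw: str) -> bool:
    """Assumed check: the response is non-empty and parses as a JSON object or array."""
    if not raw or not raw.strip():
        return False  # empty content, as in the gpt-oss serving failure
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False  # e.g. output truncated mid-object
    return isinstance(parsed, (dict, list))


def validity(responses: List[str]) -> float:
    """Fraction of batch-call responses that pass the validity check."""
    if not responses:
        return 0.0
    return sum(is_valid_plan(r) for r in responses) / len(responses)


# Invented responses that mimic the two failure patterns described above.
empty_calls = ["", "", "", "", ""]
truncated_calls = [
    '{"repairs": []}',
    '{"repairs": [{"row": 1, "col": "zip", "value": "02134"}]}',
    '{"repairs": [{"row": 2, "col": "ci',   # cut off before the JSON closes
    '{"repairs": [',
    "",
]

print(validity(empty_calls))      # all five calls empty
print(validity(truncated_calls))  # two of five calls parse
```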

Interpretation. Zero-shot capability at 24–31B closes the gap to the 4B fine-tune, and slightly exceeds it, inside the same verifier harness: devstral-24B and gemma4-31B both land at union 0.915 precision @ 0.485 coverage vs the fine-tune's 0.905 @ 0.413. The fine-tune remains the most precise gated planner among the fully working runs (0.993 vs 0.943 for devstral and gemma4) and the only ≤4B point; nemotron's 1.0 gated precision rests on 2 of 5 valid calls and only 0.138 coverage. Two of the four bigger families (gpt-oss, nemotron) fail on plan-schema validity before capability even gets measured. Gemma4-31B is the best family on balance: same gate point as devstral but cleaner raw plans (the verifier dropped 28 entries vs devstral's 87, and vs 38 for the 4B fine-tune) and the fastest wall-clock (104s). The union still dominates everywhere: every model's union point adds coverage over its gated point at gate-passing precision, and it puts a floor under even the broken planners (gpt-oss 0.845 @ 0.257 from the heuristic alone, nemotron 0.877 @ 0.336) because the grounded heuristic covers whatever the model misses.
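The coverage gain and precision cost of moving from the gated point to the union point follow directly from the table. The snippet below is only arithmetic on the reported numbers; the dictionary layout is an illustrative choice, not the format of eval/results/scaling_arm.json.

```python
# (precision, coverage) pairs copied from the results table: gated point, then union point.
rows = {
    "scrubdata-ft-v6": ((0.993, 0.287), (0.905, 0.413)),
    "gpt-oss:20b": ((1.0, 0.000), (0.845, 0.257)),
    "devstral-small-2:24b": ((0.943, 0.426), (0.915, 0.485)),
    "nemotron-3-nano:30b": ((1.0, 0.138), (0.877, 0.336)),
    "gemma4:31b": ((0.943, 0.426), (0.915, 0.485)),
}

coverage_gain = {}
precision_cost = {}
for name, (gated, union) in rows.items():
    # Union coverage minus gated coverage, and gated precision minus union precision.
    coverage_gain[name] = round(union[1] - gated[1], 3)
    precision_cost[name] = round(gated[0] - union[0], 3)

for name in rows:
    print(f"{name}: coverage +{coverage_gain[name]:.3f}, precision -{precision_cost[name]:.3f}")
```

On these numbers every row gains coverage under the union. The two fully valid zero-shot models pay the smallest precision cost (0.028) because their gated points already sit at 0.426 coverage.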

Artifacts: eval/results/scaling_arm.json (rows + provenance), eval/results/scaling_<model>_hospital_raw_plan.json (captured raw plans), runner: eval/scaling_arm.py.