scrubdata / docs /SCALING_ARM.md
OpenAI Codex
deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build
16dc556
|
Raw
History Blame Contribute Delete
3.16 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

W1.c β€” ≀32B Zero-Label Repair Scaling Arm (multi-family, zero-shot)

First scaling measurement for the verified-union planner: vanilla (NOT fine-tuned) 20–31B open-weights models dropped into the EXACT hospital pipeline the 4B fine-tune gate used β€” batched raw planner (batch_size=4, same scrubdata/prompt.py contract, temperature 0) β†’ verify_plan(tau=0.5) β†’ union with the grounded heuristic (mock_plan). Scored against hospital's 509 real errors with the eval/precision_curve.py repairs-only churn-neutral protocol. Protocol parity was verified by re-scoring the captured v6 plan through the same scorer: it reproduces the prior gate numbers exactly (gated 0.993/0.287, union 0.905/0.413).

Disclosure: ≀32B open-weights models measured via hosted inference for speed; all are locally deployable in principle.

model params (B) family gated P @ C union P @ C validity kept/dropped runtime (s)
scrubdata-ft-v6 (Qwen3-4B fine-tune) 4 qwen3 (fine-tuned) 0.993 @ 0.287 0.905 @ 0.413 β€” 132/38 β€” (prior measurement)
gpt-oss:20b 20 openai/gpt-oss 1.0 @ 0.000* 0.845 @ 0.257* 0.0 0/0 360
devstral-small-2:24b 24 mistral/devstral 0.943 @ 0.426 0.915 @ 0.485 1.0 208/87 135
nemotron-3-nano:30b 30 nvidia/nemotron 1.0 @ 0.138 0.877 @ 0.336 0.4 63/6 114
gemma4:31b 31 google/gemma 0.943 @ 0.426 0.915 @ 0.485 1.0 209/28 104

* gpt-oss:20b is a serving-path failure, not a measured capability: the model generated ~4.8k tokens per planning call (done_reason=stop) but the Ollama Cloud proxy returned empty content and empty thinking on all 5 calls at both num_predict=4000 and 8000 (simple prompts work) β€” its "gated" point is the degenerate empty plan and its "union" point is the heuristic backstop alone. nemotron-3-nano produced valid JSON on only 2/5 batch calls at num_predict=8000 (long-thinking truncation); validity is part of the measurement.

Interpretation. Zero-shot capability at 24–31B does close β€” and slightly exceed β€” the 4B fine-tune's gap inside the same verifier harness: devstral-24B and gemma4-31B both land at union 0.915 precision @ 0.485 coverage vs the fine-tune's 0.905 @ 0.413, though the fine-tune remains the most precise gated planner (0.993 vs 0.943) and the only ≀4B point, while two of the four bigger families (gpt-oss, nemotron) fail on plan-schema validity before capability even gets measured. Gemma4-31B is the best family on balance: same gate point as devstral but cleaner raw plans (verifier dropped 28 entries vs devstral's 87 β€” vs 38 for the 4B fine-tune) and the fastest wall-clock (104s). The union still dominates everywhere: every model's union point adds coverage over its gated point at gate-passing precision, and it floors even the broken planners (nemotron 0.877 @ 0.336) because the grounded heuristic covers whatever the model misses.

Artifacts: eval/results/scaling_arm.json (rows + provenance), eval/results/scaling_<model>_hospital_raw_plan.json (captured raw plans), runner: eval/scaling_arm.py.