Spaces:
Running
A newer version of the Gradio SDK is available: 6.19.0
W1.c β β€32B Zero-Label Repair Scaling Arm (multi-family, zero-shot)
First scaling measurement for the verified-union planner: vanilla (NOT fine-tuned)
20β31B open-weights models dropped into the EXACT hospital pipeline the 4B fine-tune
gate used β batched raw planner (batch_size=4, same scrubdata/prompt.py contract,
temperature 0) β verify_plan(tau=0.5) β union with the grounded heuristic
(mock_plan). Scored against hospital's 509 real errors with the
eval/precision_curve.py repairs-only churn-neutral protocol. Protocol parity was
verified by re-scoring the captured v6 plan through the same scorer: it reproduces the
prior gate numbers exactly (gated 0.993/0.287, union 0.905/0.413).
Disclosure: β€32B open-weights models measured via hosted inference for speed; all are locally deployable in principle.
| model | params (B) | family | gated P @ C | union P @ C | validity | kept/dropped | runtime (s) |
|---|---|---|---|---|---|---|---|
| scrubdata-ft-v6 (Qwen3-4B fine-tune) | 4 | qwen3 (fine-tuned) | 0.993 @ 0.287 | 0.905 @ 0.413 | β | 132/38 | β (prior measurement) |
| gpt-oss:20b | 20 | openai/gpt-oss | 1.0 @ 0.000* | 0.845 @ 0.257* | 0.0 | 0/0 | 360 |
| devstral-small-2:24b | 24 | mistral/devstral | 0.943 @ 0.426 | 0.915 @ 0.485 | 1.0 | 208/87 | 135 |
| nemotron-3-nano:30b | 30 | nvidia/nemotron | 1.0 @ 0.138 | 0.877 @ 0.336 | 0.4 | 63/6 | 114 |
| gemma4:31b | 31 | google/gemma | 0.943 @ 0.426 | 0.915 @ 0.485 | 1.0 | 209/28 | 104 |
* gpt-oss:20b is a serving-path failure, not a measured capability: the model
generated ~4.8k tokens per planning call (done_reason=stop) but the Ollama Cloud
proxy returned empty content and empty thinking on all 5 calls at both
num_predict=4000 and 8000 (simple prompts work) β its "gated" point is the degenerate
empty plan and its "union" point is the heuristic backstop alone. nemotron-3-nano
produced valid JSON on only 2/5 batch calls at num_predict=8000 (long-thinking
truncation); validity is part of the measurement.
Interpretation. Zero-shot capability at 24β31B does close β and slightly exceed β the 4B fine-tune's gap inside the same verifier harness: devstral-24B and gemma4-31B both land at union 0.915 precision @ 0.485 coverage vs the fine-tune's 0.905 @ 0.413, though the fine-tune remains the most precise gated planner (0.993 vs 0.943) and the only β€4B point, while two of the four bigger families (gpt-oss, nemotron) fail on plan-schema validity before capability even gets measured. Gemma4-31B is the best family on balance: same gate point as devstral but cleaner raw plans (verifier dropped 28 entries vs devstral's 87 β vs 38 for the 4B fine-tune) and the fastest wall-clock (104s). The union still dominates everywhere: every model's union point adds coverage over its gated point at gate-passing precision, and it floors even the broken planners (nemotron 0.877 @ 0.336) because the grounded heuristic covers whatever the model misses.
Artifacts: eval/results/scaling_arm.json (rows + provenance),
eval/results/scaling_<model>_hospital_raw_plan.json (captured raw plans),
runner: eval/scaling_arm.py.