Spaces:
Running
Running
| # W1.c β β€32B Zero-Label Repair Scaling Arm (multi-family, zero-shot) | |
| First scaling measurement for the verified-union planner: vanilla (NOT fine-tuned) | |
| 20β31B open-weights models dropped into the EXACT hospital pipeline the 4B fine-tune | |
| gate used β batched raw planner (batch_size=4, same `scrubdata/prompt.py` contract, | |
| temperature 0) β `verify_plan(tau=0.5)` β union with the grounded heuristic | |
| (`mock_plan`). Scored against hospital's 509 real errors with the | |
| `eval/precision_curve.py` repairs-only churn-neutral protocol. Protocol parity was | |
| verified by re-scoring the captured v6 plan through the same scorer: it reproduces the | |
| prior gate numbers exactly (gated 0.993/0.287, union 0.905/0.413). | |
| Disclosure: β€32B open-weights models measured via hosted inference for speed; all are | |
| locally deployable in principle. | |
| | model | params (B) | family | gated P @ C | union P @ C | validity | kept/dropped | runtime (s) | | |
| |---|---|---|---|---|---|---|---| | |
| | scrubdata-ft-v6 (Qwen3-4B fine-tune) | 4 | qwen3 (fine-tuned) | **0.993** @ 0.287 | 0.905 @ 0.413 | β | 132/38 | β (prior measurement) | | |
| | gpt-oss:20b | 20 | openai/gpt-oss | 1.0 @ 0.000* | 0.845 @ 0.257* | 0.0 | 0/0 | 360 | | |
| | devstral-small-2:24b | 24 | mistral/devstral | 0.943 @ 0.426 | 0.915 @ **0.485** | 1.0 | 208/87 | 135 | | |
| | nemotron-3-nano:30b | 30 | nvidia/nemotron | 1.0 @ 0.138 | 0.877 @ 0.336 | 0.4 | 63/6 | 114 | | |
| | gemma4:31b | 31 | google/gemma | 0.943 @ 0.426 | **0.915 @ 0.485** | 1.0 | 209/28 | 104 | | |
| \* gpt-oss:20b is a serving-path failure, not a measured capability: the model | |
| generated ~4.8k tokens per planning call (`done_reason=stop`) but the Ollama Cloud | |
| proxy returned empty `content` and empty `thinking` on all 5 calls at both | |
| num_predict=4000 and 8000 (simple prompts work) β its "gated" point is the degenerate | |
| empty plan and its "union" point is the heuristic backstop alone. nemotron-3-nano | |
| produced valid JSON on only 2/5 batch calls at num_predict=8000 (long-thinking | |
| truncation); validity is part of the measurement. | |
| **Interpretation.** Zero-shot capability at 24β31B does close β and slightly | |
| exceed β the 4B fine-tune's gap inside the same verifier harness: devstral-24B and | |
| gemma4-31B both land at union 0.915 precision @ 0.485 coverage vs the fine-tune's | |
| 0.905 @ 0.413, though the fine-tune remains the most precise gated planner | |
| (0.993 vs 0.943) and the only β€4B point, while two of the four bigger families | |
| (gpt-oss, nemotron) fail on plan-schema validity before capability even gets | |
| measured. Gemma4-31B is the best family on balance: same gate point as devstral but | |
| cleaner raw plans (verifier dropped 28 entries vs devstral's 87 β vs 38 for the 4B | |
| fine-tune) and the fastest wall-clock (104s). The union still dominates everywhere: | |
| every model's union point adds coverage over its gated point at gate-passing | |
| precision, and it floors even the broken planners (nemotron 0.877 @ 0.336) because | |
| the grounded heuristic covers whatever the model misses. | |
| Artifacts: `eval/results/scaling_arm.json` (rows + provenance), | |
| `eval/results/scaling_<model>_hospital_raw_plan.json` (captured raw plans), | |
| runner: `eval/scaling_arm.py`. | |