scrubdata / docs /SCALING_ARM.md
OpenAI Codex
deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build
16dc556
|
Raw
History Blame Contribute Delete
3.16 kB
# W1.c β€” ≀32B Zero-Label Repair Scaling Arm (multi-family, zero-shot)
First scaling measurement for the verified-union planner: vanilla (NOT fine-tuned)
20–31B open-weights models dropped into the EXACT hospital pipeline the 4B fine-tune
gate used β€” batched raw planner (batch_size=4, same `scrubdata/prompt.py` contract,
temperature 0) β†’ `verify_plan(tau=0.5)` β†’ union with the grounded heuristic
(`mock_plan`). Scored against hospital's 509 real errors with the
`eval/precision_curve.py` repairs-only churn-neutral protocol. Protocol parity was
verified by re-scoring the captured v6 plan through the same scorer: it reproduces the
prior gate numbers exactly (gated 0.993/0.287, union 0.905/0.413).
Disclosure: ≀32B open-weights models measured via hosted inference for speed; all are
locally deployable in principle.
| model | params (B) | family | gated P @ C | union P @ C | validity | kept/dropped | runtime (s) |
|---|---|---|---|---|---|---|---|
| scrubdata-ft-v6 (Qwen3-4B fine-tune) | 4 | qwen3 (fine-tuned) | **0.993** @ 0.287 | 0.905 @ 0.413 | β€” | 132/38 | β€” (prior measurement) |
| gpt-oss:20b | 20 | openai/gpt-oss | 1.0 @ 0.000* | 0.845 @ 0.257* | 0.0 | 0/0 | 360 |
| devstral-small-2:24b | 24 | mistral/devstral | 0.943 @ 0.426 | 0.915 @ **0.485** | 1.0 | 208/87 | 135 |
| nemotron-3-nano:30b | 30 | nvidia/nemotron | 1.0 @ 0.138 | 0.877 @ 0.336 | 0.4 | 63/6 | 114 |
| gemma4:31b | 31 | google/gemma | 0.943 @ 0.426 | **0.915 @ 0.485** | 1.0 | 209/28 | 104 |
\* gpt-oss:20b is a serving-path failure, not a measured capability: the model
generated ~4.8k tokens per planning call (`done_reason=stop`) but the Ollama Cloud
proxy returned empty `content` and empty `thinking` on all 5 calls at both
num_predict=4000 and 8000 (simple prompts work) β€” its "gated" point is the degenerate
empty plan and its "union" point is the heuristic backstop alone. nemotron-3-nano
produced valid JSON on only 2/5 batch calls at num_predict=8000 (long-thinking
truncation); validity is part of the measurement.
**Interpretation.** Zero-shot capability at 24–31B does close β€” and slightly
exceed β€” the 4B fine-tune's gap inside the same verifier harness: devstral-24B and
gemma4-31B both land at union 0.915 precision @ 0.485 coverage vs the fine-tune's
0.905 @ 0.413, though the fine-tune remains the most precise gated planner
(0.993 vs 0.943) and the only ≀4B point, while two of the four bigger families
(gpt-oss, nemotron) fail on plan-schema validity before capability even gets
measured. Gemma4-31B is the best family on balance: same gate point as devstral but
cleaner raw plans (verifier dropped 28 entries vs devstral's 87 β€” vs 38 for the 4B
fine-tune) and the fastest wall-clock (104s). The union still dominates everywhere:
every model's union point adds coverage over its gated point at gate-passing
precision, and it floors even the broken planners (nemotron 0.877 @ 0.336) because
the grounded heuristic covers whatever the model misses.
Artifacts: `eval/results/scaling_arm.json` (rows + provenance),
`eval/results/scaling_<model>_hospital_raw_plan.json` (captured raw plans),
runner: `eval/scaling_arm.py`.