Spaces:
Sleeping
Sleeping
| # Ablation summary | |
| We report five configurations. The headline results in the README come from | |
| Run A; the other runs isolate specific design choices so a reviewer can | |
| attribute each improvement to its cause. | |
| All numbers come from held-out evaluation runs whose raw artefacts are | |
| stored under `training_runs/` on the training machine. Reward values are | |
| mean reward over the 24-scenario standard tech track unless a forced-outcome | |
| track is present, in which case the forced-track scenarios are included in | |
| the scenario count. Numbers are reported to three decimal places from the | |
| `eval/results.json` of each run. | |
| --- | |
| ## Table of runs | |
| | Label | What it varied | SFT reward | RL reward | Lift | Eval acc | Confusion coverage (tech track, grpo_trained) | Catastrophes | | |
| |---|---|---|---|---|---|---|---| | |
| | **A (headline)** | Baseline pipeline β forced-variant curriculum during training, beta_rank=0.25 unlikeliness shaping, standard scenarios at eval | +0.418 | +0.664 | +0.246 | 100 % | R2 populated | 0 | | |
| | B | Run A adapter re-evaluated with a forced-outcome eval track added (no-safe-path variants at eval) | +0.406 | +0.628 | +0.222 | 70.8 % | R2, R4, R5 populated β broadest coverage | 0 | | |
| | C | Run B with env precondition fix: destructive DB ops on missing tables now short-circuit rather than resolving to R1 silently | +0.414 | +0.591 | +0.176 | 75.0 % | R2, R5 populated | 0 | | |
| | D | Disable rank-based unlikeliness shaping (beta_rank=0.25 β 0.0) | +0.623 | +0.675 | +0.052 | 100 % | R2 populated | 0 | | |
| | E | Run D adapter, re-evaluated with a forced-outcome eval track | +0.623 | +0.675 | +0.052 | 100 % | R2, R5 populated | 0 | | |
| *Lift = RL reward β SFT reward. All catastrophe counts are zero across every | |
| configuration.* | |
| --- | |
| ## What each row teaches | |
| **A β headline.** Largest RL-over-SFT lift observed (+0.246). Demonstrates | |
| that reinforcement learning contributes meaningfully above and beyond | |
| supervised warmup when the warmup policy has not already saturated the task | |
| distribution. The unlikeliness shaping (beta_rank=0.25) keeps the policy from | |
| collapsing to "always pick the safe R1 action" by penalising high-probability | |
| trajectories relative to lower-probability-but-still-correct ones. | |
| **B.** Added forced-outcome variants at eval time: scenarios where the safe | |
| path (snapshot, forward-commit) is locked and the destructive action is the | |
| only correct answer. The eval confusion matrix now populates R4 β a level | |
| absent in standard-track evaluation. The modest RL lift drop (+0.222 vs | |
| +0.246) reflects the harder eval distribution: scenarios that require | |
| correctly predicting and taking R5 actions are genuinely harder than | |
| scenarios where the safe path is available. | |
| **C.** Fixed an environment precondition so that destructive DB operations | |
| (DROP TABLE, TRUNCATE) on a table that does not exist short-circuit with a | |
| β0.1 penalty rather than executing the simulator no-op and returning R1. | |
| This is an eval-hygiene fix: the accuracy number regains honesty (rows where | |
| the model hardcodes the wrong table name are no longer counted as correct R1 | |
| predictions) at the cost of a small reward dip on those rows. | |
| **D.** Disabled the rank-based unlikeliness shaping (BETA_RANK=0.25 β 0.0). | |
| This is an empirical finding: the unlikeliness technique (He et al., | |
| arXiv:2506.02355) was designed for binary-verifier RL tasks in formal theorem | |
| proving. Our task is continuous partial-credit classification β the shaping | |
| inverted the gradient signal in one observed batch (a low-reward incorrect | |
| prediction ranked above a high-reward correct one, earning a larger advantage | |
| weight). The SFT-to-RL delta collapsed to +0.052. Documenting the failure | |
| mode is the contribution; it motivates keeping unlikeliness shaping in the | |
| headline configuration (Run A) where the SFT policy has not yet saturated and | |
| the shaping keeps gradient variance alive. | |
| **E.** Same adapter as Run D, re-evaluated with the forced-outcome track | |
| added. Populates the R5 row of the confusion matrix and confirms that the | |
| trained policy predicts destructive R5 actions correctly when required to take | |
| them. Lift is identical to Run D (+0.052) because the adapter is unchanged. | |
| --- | |
| ## What this pattern tells a reviewer | |
| - **RL adds measurable value above SFT.** Run A shows +0.246 lift; even | |
| Run D (hardest SFT ceiling) shows +0.052. The direction is consistent. | |
| - **Zero catastrophic miscalls across every configuration.** The reward | |
| architecture's catastrophe penalty is effective regardless of which eval | |
| distribution or shaping variant is used. | |
| - **Unlikeliness shaping matters at early SFT skill levels.** When the | |
| warmup policy has not saturated (Runs AβC, SFT reward ~0.41), shaping | |
| preserves gradient variance and RL lifts strongly. When SFT is already | |
| at ~0.62 (Runs DβE), the ceiling is close and shaping has little room. | |
| - **Broadest confusion-matrix coverage** comes from the forced-outcome eval | |
| track (Runs B and C: R2, R4, R5; Run E: R2, R5). The standard-only eval | |
| track (Runs A and D) resolves to R2 under almost all seeds. | |
| --- | |
| ## Exact numbers from eval/results.json | |
| | Label | scripted reward | sft_only reward | grpo_trained reward | sft acc | grpo acc | sft cats | grpo cats | | |
| |---|---|---|---|---|---|---|---| | |
| | A | β0.025 | +0.418 | +0.664 | 100 % | 100 % | 0 | 0 | | |
| | B | β0.025 | +0.406 | +0.628 | 100 % | 70.8 % | 0 | 0 | | |
| | C | β0.025 | +0.414 | +0.591 | 100 % | 75.0 % | 0 | 0 | | |
| | D | β0.025 | +0.623 | +0.675 | 100 % | 100 % | 0 | 0 | | |
| | E | β0.025 | +0.623 | +0.675 | 100 % | 100 % | 0 | 0 | | |