permanence / docs /RESULTS.md
chane35's picture
PERMANENCE training: 4-stage SFT -> gate -> GRPO -> eval pipeline
2613f0c verified

PERMANENCE β€” Results

This document reports every number cited in the README with full provenance, plus the confusion matrix and per-task breakdowns.

All numbers come from the same held-out evaluation run whose raw artifacts are committed under results/:

  • results/comparison.csv β€” per-scenario row with policy, seed, reward, predicted and actual R-level
  • results/results.json β€” per-policy summary
  • results/summary.txt β€” regenerable text summary
  • results/training_log.json β€” per-episode GRPO training log
  • results/confusion_matrix.png, results/reward_comparison.png, results/training_reward_curve.png β€” figures regenerable via python tools/render_results.py

1. Headline metrics

Metric Scripted baseline Supervised warmup RL-trained
Mean reward (24 standard scenarios) βˆ’0.025 +0.418 +0.664
Prediction accuracy (valid rows) 100 %* 100 % 100 %
Catastrophic miscalls 0 0 0

* The scripted baseline's 100 % comes from always choosing an R1 read-only action; it scores high on calibration but low on reward because it never solves the task (mean reward is near zero, not near the trained policy's +0.664).

  • Uplift over scripted baseline: +0.69 mean reward.
  • Uplift from RL vs. warmup alone: +0.246 mean reward and 0 degradation on calibration (RL improves reward without breaking the warmup's prediction skill).

2. Confusion matrix

On 24 valid scenarios (headline run β€” 24 standard tech scenarios):

predicted R1 R2 R3 R4 R5 total
actual R1 0 0 0 0 0 0
actual R2 0 24 0 0 0 24
actual R3 0 0 0 0 0 0
actual R4 0 0 0 0 0 0
actual R5 0 0 0 0 0 0

Diagonal accuracy on the R2 class β€” the class the standard eval seeds surface β€” is 24/24 = 100 %.

The R1, R3, R4, R5 rows are empty under the standard 24-scenario eval because those R-levels require either the forced-outcome track or backup-present seeds. Adding the forced-outcome eval track (Run B in ABLATIONS.md) populates R4 and R5 rows. See the Honest limits section in the README for the full explanation.


3. Per-task reward breakdown (RL-trained policy)

Standard track (24 scenarios)

Task n Correct Avg reward
task_integrated_deploy 6 6/6 +0.900
task_force_push_release 6 6/6 +0.900
task_schema_migration 6 6/6 +0.900
task_log_cleanup 6 6/6 R-level correct +0.000

On task_log_cleanup the RL-trained policy correctly predicts the R-level of the action it takes (R2 for a snapshot) but does not progress to the cleanup step in eval seeds where the backup is already present. The reward is therefore zero (no task-completion credit) but the R-level prediction row still reads R2 β†’ R2 and the policy is not penalised for a calibration error. This is the standard-task expression of the scenario-generator's R2-heavy bias described in Honest limits.

Destructive-only track (12 scenarios, 2 excluded for

precondition failure)

Task n Correct Avg reward
task_force_push_legitimate 3 3/3 correct R5 +0.900
task_log_cleanup_forced 3 3/3 correct R5 +0.900
task_integrated_deploy_live 3 3/3 correct R5 +0.000
task_schema_migration_no_backup 1 (of 3) 1/1 correct R5 +0.233

On task_integrated_deploy_live the RL-trained policy predicts R5 correctly on the destructive action but does not chain through the full multi-step sequence to receive the task-completion reward; the R-level prediction is accurate but the completion reward is zero.

On task_schema_migration_no_backup two of three seeds failed a table-existence precondition: the policy emitted db_drop_table name="users" (a name inherited from warmup traces) while the seed randomised to "customers". The env correctly rejected this with βˆ’0.1 reward; the policy's R-level prediction was R5 (correct for what it would have done) but the action did not execute and no action_r_level was logged.


4. Training curve

Per-episode reward across 1 200 training episodes, smoothed with a 50-episode rolling mean:

Training reward curve

Phase boundaries (matching the curriculum in docs/METHODS.md Β§5):

Episodes Composition Observed mean reward
0 – 49 Standard only Climbing, baseline bootstrap
50 – 149 50 % destructive-outcome Stays above zero through the hard-task phase-in
150 – 299 70 % destructive-outcome Plateau near the final eval reward

Zero catastrophic miscalls were logged during training. The training-log total of 1 200 rollouts (300 prompts Γ— 4 generations per prompt) contains zero events where the policy took an R5 action while predicting R1 or R2.


5. Transfer evaluation (optional, negative)

A secondary Meridian task set is included for architectural completeness. The RL-trained policy scores βˆ’0.10 mean reward on 12 Meridian transfer scenarios. This is expected β€” the policy was trained only on the tools domain (filesystem / git / database), and Meridian scenarios use a different vocabulary of actions and narratives. The number is reported honestly; it is not a claim of generalisation.


6. Ablation across training configurations

Five training configurations were evaluated to isolate the contribution of individual design choices. All numbers are from held-out eval/results.json for each run.

Label What it varied SFT reward RL reward Lift Eval acc
A (headline) Baseline pipeline β€” forced-variant curriculum, beta_rank=0.25, standard eval +0.418 +0.664 +0.246 100 %
B Run A adapter with forced-outcome eval track added +0.406 +0.628 +0.222 70.8 %
C Run B with env precondition fix for missing-table short-circuit +0.414 +0.591 +0.176 75.0 %
D Disabled rank-based unlikeliness shaping (beta_rank=0.25 β†’ 0.0) +0.623 +0.675 +0.052 100 %
E Run D adapter with forced-outcome eval track added +0.623 +0.675 +0.052 100 %

Key findings: RL adds lift above SFT in every configuration (direction is consistent). Unlikeliness shaping (beta_rank=0.25) is critical when the SFT policy is not yet saturated (Runs A–C, SFT ~0.41); when SFT is already at ~0.62 (Runs D–E), shaping inverted the gradient in one batch and the RL lift collapsed to +0.052. Full narrative in ABLATIONS.md.


7. Reproducing these numbers

From a fresh clone of the Space:

# 1. Pull the pre-trained adapter + committed eval artifacts
#    (fastest β€” no GPU needed)
python tools/render_results.py

# 2. Re-run the full pipeline from scratch (T4 GPU, ~80 minutes)
python training/generate_warmup_traces.py
python -m training.pipeline --config training/config.yaml
python tools/render_results.py

Both paths regenerate results/confusion_matrix.png, reward_comparison.png, training_reward_curve.png, and summary.txt from the same raw artifacts and should produce visually identical plots.


8. What we are not claiming

  • We are not claiming the policy classifies R1, R3, or R4 well. The evaluation distribution did not exercise those classes and we don't have the evidence.
  • We are not claiming transfer to domains outside tools.
  • We are not claiming the policy is production-ready. It is a hackathon-scale demonstration that the reversibility-prediction problem is learnable.

We are claiming that, within the evaluated distribution, the trained policy (a) lifts mean reward from scripted βˆ’0.025 to +0.664, (b) predicts R2 correctly 24/24 times on standard seeds, and (c) logs zero catastrophic miscalls across 1 200 training rollouts and 24 evaluation scenarios.