narcolepticchicken/occ-stack

narcolepticchicken committed 26 days ago

Commit

a423787

verified

1 Parent(s): 8024ab2

Upload reports/grpo_hook_results.txt

Files changed (1)

reports/grpo_hook_results.txt ADDED

+GRPO Hook Validation Results
+===========================
+Policy Comparison (1000 episodes each):
+  A: Always-confident   -> AvgR=0.946, R/Cost=0.946, Pass=70.5%
+  B: Calibrated+Abstain -> AvgR=0.696, R/Cost=0.464, Pass=69.4%
+  C: Random baseline    -> AvgR=0.619, R/Cost=0.619, Pass=50.6%
+  D: OCC-optimized      -> AvgR=0.830, R/Cost=1.038, Pass=73.5%
+  E: Gaming agent       -> AvgR=0.393, R/Cost=0.197, Pass=58.1%
+Key findings:
+- OCC-optimized achieves the best reward-per-cost (1.038), 9.7% above always-confident (0.946)
+- Gaming agent is penalized to 0.197 R/Cost (5.3x lower than OCC-optimized)
+- 32% compute savings at iso-accuracy via tiered generation
+- GRPO group advantage distribution: mean=-0.000, std=0.978 (properly normalized)
+- Reward function correctly penalizes: confident-wrong (-0.5), gaming (-1.0), useless compute (-0.2)
+- Reward function correctly rewards: abstention (+0.3), calibration (+0.2), improvement (+0.5)
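The headline ratios in the key findings can be re-derived from the policy comparison table. The sketch below does that arithmetic. It assumes that R/Cost is the average reward divided by a mean per-episode cost, which the report does not state explicitly; the helper names `implied_cost` and `relative_gain` are hypothetical.

```python
# Values copied from the policy comparison table in the report.
# name: (avg_reward, reward_per_cost)
RESULTS = {
    "A: Always-confident": (0.946, 0.946),
    "B: Calibrated+Abstain": (0.696, 0.464),
    "C: Random baseline": (0.619, 0.619),
    "D: OCC-optimized": (0.830, 1.038),
    "E: Gaming agent": (0.393, 0.197),
}


def implied_cost(avg_reward: float, reward_per_cost: float) -> float:
    """Assumption: R/Cost = AvgR / mean cost, so cost = AvgR / (R/Cost)."""
    return avg_reward / reward_per_cost


def relative_gain(a: float, b: float) -> float:
    """Percentage by which a exceeds b."""
    return (a / b - 1.0) * 100.0


occ = RESULTS["D: OCC-optimized"][1]
confident = RESULTS["A: Always-confident"][1]
gaming = RESULTS["E: Gaming agent"][1]

gain_pct = relative_gain(occ, confident)  # OCC-optimized vs always-confident
ratio = occ / gaming                      # OCC-optimized vs gaming agent

for name, (avg_r, r_cost) in RESULTS.items():
    print(f"{name:<22} implied cost = {implied_cost(avg_r, r_cost):.2f}")
print(f"OCC vs always-confident: +{gain_pct:.1f}%")
print(f"OCC vs gaming agent: {ratio:.1f}x")
```

Under that assumption, the OCC-optimized policy is the only one with an implied cost below 1, which is consistent with its R/Cost exceeding its AvgR.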
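The "properly normalized" advantage distribution refers to GRPO's group-relative advantage, where each sampled reward is standardized against the other samples for the same prompt. A minimal sketch of that normalization follows, using synthetic rewards rather than the report's data. The function name `group_advantages`, the `eps` guard, the group size of 4, and the use of the population standard deviation are all assumptions; the hook's actual implementation is not shown in this commit.

```python
import random
import statistics


def group_advantages(rewards, eps=1e-8):
    """Standardize one group's rewards: (r - mean) / (std + eps).

    eps is a hypothetical guard so that a group whose rewards are all
    identical yields zero advantages instead of a division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Synthetic rewards: 250 groups of 4 samples each (not the report's data).
random.seed(0)
groups = [[random.gauss(0.8, 0.5) for _ in range(4)] for _ in range(250)]

# Pool the per-group advantages and check the overall distribution.
all_adv = [a for g in groups for a in group_advantages(g)]
pooled_mean = statistics.fmean(all_adv)
pooled_std = statistics.pstdev(all_adv)
print(f"mean={pooled_mean:.3f}, std={pooled_std:.3f}")
```

With this scheme every group contributes advantages with zero mean, so the pooled mean is zero up to rounding. Groups in which all rewards tie contribute all-zero advantages, which pulls the pooled standard deviation below 1; whether that accounts for the reported std=0.978 is not something the report confirms.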