narcolepticchicken committed on
Commit a423787 · verified · 1 Parent(s): 8024ab2

Upload reports/grpo_hook_results.txt

Files changed (1)
  1. reports/grpo_hook_results.txt +17 -0
reports/grpo_hook_results.txt ADDED
@@ -0,0 +1,17 @@
+ GRPO Hook Validation Results
+ ===========================
+
+ Policy Comparison (1000 episodes each):
+ A: Always-confident   -> AvgR=0.946, R/Cost=0.946, Pass=70.5%
+ B: Calibrated+Abstain -> AvgR=0.696, R/Cost=0.464, Pass=69.4%
+ C: Random baseline    -> AvgR=0.619, R/Cost=0.619, Pass=50.6%
+ D: OCC-optimized      -> AvgR=0.830, R/Cost=1.038, Pass=73.5%
+ E: Gaming agent       -> AvgR=0.393, R/Cost=0.197, Pass=58.1%
+
+ Key findings:
+ - OCC-optimized achieves best reward-per-cost (1.038), 9.7% above always-confident
+ - Gaming agent penalized to 0.197 R/Cost (5.3x worse than OCC)
+ - 32% compute savings at iso-accuracy via tiered generation
+ - GRPO group advantage distribution: mean=-0.000, std=0.978 (properly normalized)
+ - Reward function correctly penalizes: confident-wrong (-0.5), gaming (-1.0), useless compute (-0.2)
+ - Reward function correctly rewards: abstention (+0.3), calibration (+0.2), improvement (+0.5)