Jarrodbarnes committed
Commit 4fd8b45 · verified · 1 Parent(s): be9461d

Update model card: align with published paper (arXiv:2602.07670), add selection strategy results, fix metrics

Files changed (1):
  1. README.md +35 -15
README.md CHANGED
@@ -23,17 +23,22 @@ model-index:
   name: KernelBench L1
   type: ScalingIntelligence/KernelBench
   metrics:
-  - name: fast_1
+  - name: task_success_rate (K=64, 20 tasks)
     type: custom
-    value: 30.6
-  - name: correctness
+    value: 90.0
+  - name: fast_1 (K=1, per-sample)
+    type: custom
+    value: 53.3
+  - name: correctness (training dist.)
     type: accuracy
-    value: 91.5
+    value: 98.4
 ---
 
 # KernelBench-RLVR-120b
 
-A 120B parameter LLM fine-tuned with GRPO (Group Relative Policy Optimization) for GPU kernel generation, evaluated on KernelBench L1.
+A 120B-parameter model fine-tuned with GRPO (Group Relative Policy Optimization) for GPU kernel generation. This model was used to study compute-optimal test-time strategies in [Surprisal-Guided Selection](http://arxiv.org/abs/2602.07670), where we find that Best-of-N search with surprisal-guided selection recovers oracle performance at zero additional cost.
+
+**Paper**: [arXiv:2602.07670](http://arxiv.org/abs/2602.07670) | **Code**: [GitHub](https://github.com/jbarnes850/test-time-training)
 
 ## Quick Start
 
@@ -75,25 +80,39 @@ This model was trained using an execution-grounded RL framework where:
 - Correctness: 98.4%
 - Mean Speedup: 0.87x on training distribution
 
-**Test-Time Evaluation (3 seeds, 10 tasks):**
+**Best-of-N Search (Full L1 Eval, 20 tasks):**
+- 18/20 tasks (90%) achieve fast_1 = 1 at K=64
+- Performance saturates at K=16 (99.9% on 5-task subsets)
+
+**Selection Strategy Comparison (Subset 1, 5 tasks x 2 seeds):**
+
+| Strategy | fast_1 | std | Mean Speedup |
+|----------|--------|-----|--------------|
+| best-correct (Oracle) | 100% | 0% | 226.9x |
+| **surprisal-guided-top3** | **100%** | **0%** | **139.0x** |
+| **surprisal-guided** | **80%** | **0%** | **41.2x** |
+| random-correct | 59.2% | 2.7% | 30.0x |
+| confidence-guided | 50% | 14.1% | 11.6x |
+
+**Test-Time Training Comparison (Subset 1, 3 seeds):**
 
-| Method | fast_1 | std | Correctness | Rollouts |
-|--------|--------|-----|-------------|----------|
-| **Batch-TTT BoA** | **30.6%** | 11.3% | 91.5% | 960 |
-| **SDPO Prompt-Only** | **30.4%** | 7.6% | 91.9% | 320 |
-| Best-of-N (K=64) | 30.9% | - | 87.2% | 320 |
+| Method | fast_1 | std | Rollouts |
+|--------|--------|-----|----------|
+| Best-of-N (K=64) | 100% | 0% | 320 |
+| Batch-TTT BoA | 30.6% | 11.3% | 960 |
+| SDPO Prompt-Only | 30.4% | 7.6% | 320 |
 
 **Note:** fast_1 = fraction of samples that are both correct AND achieve speedup > 1x.
 
 ## Key Findings
 
-This model was developed as part of research on test-time training efficiency for verifiable execution-grounded tasks. Three key findings emerged:
+This model was developed as part of research on compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks. Three findings:
 
-1. **Efficiency Frontier**: Performance peaks after 1-2 adaptation steps then regresses, indicating that checkpoint selection (Best-of-Adaptation) rather than extended training drives the benefit.
+1. **Surprisal-guided selection recovers oracle performance.** Selecting the highest-surprisal (lowest log-probability) correct sample achieves 80% fast_1 vs. 50% for confidence-guided (+30pp, Cohen's h = 0.64). Extending to surprisal-guided-top3 matches oracle at 100%. The model's probability distribution maps frequency, not quality. Rare, hardware-optimized kernels occupy the Expert Tail that surprisal recovers at zero cost.
 
-2. **Feedback Redundancy**: Rich tokenized execution feedback provides no lift over prompt-only self-distillation (+4.2% advantage for prompt-only across 3 seeds). When the world provides dense continuous rewards, teacher-based interpretation becomes redundant.
+2. **Search outperforms adaptation.** Best-of-N at K=64 achieves 90% task success (18/20 L1 tasks). TTT's Best-of-Adaptation reaches 30.6% (3-seed mean), with "equivalent K" below 1 -- worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions.
 
-3. **Regime Dependence**: Adaptation helps when base policy achieves >30% coverage, but hurts performance below that threshold. Search maintains a diversity advantage in hard regimes.
+3. **Feedback redundancy.** SDPO with execution feedback (26.3%) underperforms prompt-only self-distillation (30.4%). When the world provides dense continuous rewards, teacher-based interpretation becomes redundant.
 
 ## Hardware Requirements
 
@@ -134,6 +153,7 @@ Write an optimized CUDA kernel that computes the same result.
 - Hardware-specific optimizations (A100)
 - Extended test-time adaptation may cause regression (use BoA selection with early stopping)
 - Single model size evaluated (120B)
+- Surprisal-guided selection requires sufficient intra-task logprob variance; on 11/20 L1 tasks with near-identical logprobs, all selection strategies perform equivalently
 
 ## Citation
 
159