---
license: apache-2.0
base_model: openai/gpt-oss-120b
tags:
  - gpu-kernel
  - cuda
  - code-generation
  - reinforcement-learning
  - grpo
  - kernelbench
datasets:
  - ScalingIntelligence/KernelBench
language:
  - en
pipeline_tag: text-generation
model-index:
  - name: KernelBench-RLVR-120b
    results:
      - task:
          type: text-generation
          name: GPU Kernel Generation
        dataset:
          name: KernelBench L1
          type: ScalingIntelligence/KernelBench
        metrics:
          - name: task_success_rate (K=64, 20 tasks)
            type: custom
            value: 90.0
          - name: fast_1 (K=1, per-sample)
            type: custom
            value: 53.3
          - name: correctness (training dist.)
            type: accuracy
            value: 98.4
---

# KernelBench-RLVR-120b

A 120B-parameter model fine-tuned with GRPO (Group Relative Policy Optimization) for GPU kernel generation. This model was used to study compute-optimal test-time strategies in [Surprisal-Guided Selection](http://arxiv.org/abs/2602.07670), where we find that Best-of-N search with surprisal-guided selection recovers oracle performance at zero additional cost.

**Paper**: [arXiv:2602.07670](http://arxiv.org/abs/2602.07670) | **Code**: [GitHub](https://github.com/jbarnes850/test-time-training)

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Jarrodbarnes/KernelBench-RLVR-120b")
```

## Model Description

This model was trained using an execution-grounded RL framework where:

1. **Environment**: KernelBench provides deterministic execution feedback via CUDA compiler and GPU hardware
2. **Reward**: Raw speedup (correctness-gated) normalized by running baseline
3. **Algorithm**: GRPO with group-relative advantages
4. **Evaluation**: The same execution-grounded evaluator as training, so the reported reward is actual compiled-and-run performance rather than a hackable learned proxy

| Parameter | Value |
|-----------|-------|
| Base Model | openai/gpt-oss-120b |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| LoRA Rank | 16 |
| Training Steps | 40 |
| Learning Rate | 1e-5 |
| Temperature | 0.25 |
| Max Tokens | 1024 |
| Training Tasks | 80 (KernelBench L1 train split) |
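
Items 2 and 3 above can be sketched as follows. This is an illustrative reconstruction of the correctness-gated reward and the group-relative advantage, not the actual training code:

```python
# Illustrative sketch of correctness-gated rewards and GRPO group-relative
# advantages, as described above. Not the actual training code.

def reward(correct: bool, speedup: float) -> float:
    """Correctness-gated reward: incorrect kernels earn 0, correct ones their raw speedup."""
    return speedup if correct else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each rollout's reward against its group (mean 0, unit std)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        return [0.0] * n  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]

# Four rollouts for one task: two compile and pass (1.6x and 0.9x speedup), two fail.
rewards = [reward(True, 1.6), reward(True, 0.9), reward(False, 2.0), reward(False, 0.5)]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Note that a fast-but-incorrect rollout earns zero reward, so its advantage is negative relative to any correct rollout in the group.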

## Evaluation Results

**Training Checkpoint (Step 40):**
- Correctness: 98.4%
- Mean Speedup: 0.87x on training distribution

**Best-of-N Search (Full L1 Eval, 20 tasks):**
- 18/20 tasks (90%) achieve fast_1 = 1 at K=64
- Performance saturates at K=16 (99.9% on 5-task subsets)

**Selection Strategy Comparison (Subset 1, 5 tasks x 2 seeds):**

| Strategy | fast_1 | std | Mean Speedup |
|----------|--------|-----|--------------|
| best-correct (Oracle) | 100% | 0% | 226.9x |
| **surprisal-guided-top3** | **100%** | **0%** | **139.0x** |
| **surprisal-guided** | **80%** | **0%** | **41.2x** |
| random-correct | 59.2% | 2.7% | 30.0x |
| confidence-guided | 50% | 14.1% | 11.6x |

**Test-Time Training Comparison (Subset 1, 3 seeds):**

| Method | fast_1 | std | Rollouts |
|--------|--------|-----|----------|
| Best-of-N (K=64) | 100% | 0% | 320 |
| Batch-TTT BoA | 30.6% | 11.3% | 960 |
| SDPO Prompt-Only | 30.4% | 7.6% | 320 |

**Note:** fast_1 = fraction of samples that are both correct AND achieve speedup > 1x.
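
A minimal sketch of this metric over a list of per-sample records (the field names here are illustrative, not the evaluation harness's schema):

```python
def fast_1(samples: list[dict]) -> float:
    """Fraction of samples that are correct AND achieve speedup > 1x."""
    if not samples:
        return 0.0
    hits = sum(1 for s in samples if s["correct"] and s["speedup"] > 1.0)
    return hits / len(samples)

samples = [
    {"correct": True,  "speedup": 2.5},  # counted
    {"correct": True,  "speedup": 0.8},  # correct but slower: not counted
    {"correct": False, "speedup": 3.0},  # fast but wrong: not counted
    {"correct": True,  "speedup": 1.4},  # counted
]
print(fast_1(samples))  # 0.5
```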

## Key Findings

This model was developed as part of research on compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks. Three findings:

1. **Surprisal-guided selection recovers oracle performance.** Selecting the highest-surprisal (lowest log-probability) correct sample achieves 80% fast_1 versus 50% for confidence-guided selection (+30pp, Cohen's h = 0.64); extending to the top 3 by surprisal (surprisal-guided-top3) matches the oracle at 100%. The model's probability distribution tracks frequency, not quality: rare, hardware-optimized kernels occupy an "expert tail" that surprisal-based selection recovers at zero additional cost.

2. **Search outperforms adaptation.** Best-of-N at K=64 achieves 90% task success (18/20 L1 tasks), while TTT's Best-of-Adaptation reaches only 30.6% (3-seed mean), an "equivalent K" below 1, i.e. worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse sample diversity toward mediocre solutions.

3. **Feedback redundancy.** SDPO with execution feedback (26.3%) underperforms prompt-only self-distillation (30.4%). When the environment already provides dense, continuous rewards, teacher-based interpretation of that feedback is redundant.
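
The selection rule in finding 1 can be sketched as follows. The sample fields are illustrative; surprisal is taken here as the negative of a sample's mean token log-probability, so the lowest-log-probability correct sample wins:

```python
def select_surprisal_guided(samples: list[dict], top_k: int = 1) -> list[dict]:
    """Among correct samples, keep the top_k with the highest surprisal,
    i.e. the lowest mean token log-probability (the rarest completions)."""
    correct = [s for s in samples if s["correct"]]
    return sorted(correct, key=lambda s: s["mean_logprob"])[:top_k]

samples = [
    {"id": "a", "correct": True,  "mean_logprob": -0.05},  # common, high-confidence
    {"id": "b", "correct": True,  "mean_logprob": -0.90},  # rare "expert tail" sample
    {"id": "c", "correct": False, "mean_logprob": -1.50},  # rare but incorrect: excluded
]
print([s["id"] for s in select_surprisal_guided(samples)])  # ['b']
```

Because correctness comes from the verifier, not the model, filtering first and ranking by surprisal second is what makes the rare "expert tail" samples safe to pick.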

## Hardware Requirements

- **GPU Memory**: ~240GB for bf16 inference (e.g., 8x A100 40GB, 4x A100 80GB, or 3x H100)
- **Disk Space**: ~240GB for model weights
- **Recommended**: Use `device_map="auto"` for automatic multi-GPU distribution

For single-GPU inference, consider using quantization:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    quantization_config=quantization_config,
    device_map="auto"
)
```

## Intended Use

This model is designed for GPU kernel optimization research. Given a PyTorch reference implementation, it generates optimized CUDA kernel code.

**Input format:**
````
Given the following PyTorch reference implementation:

```python
[reference code]
```

Write an optimized CUDA kernel that computes the same result.
````
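
Programmatically, the same prompt can be assembled with a small helper (a sketch; `build_prompt` is our name for illustration, not part of the released code):

```python
FENCE = "`" * 3  # markdown code fence, built programmatically to avoid nesting issues here

def build_prompt(reference_code: str) -> str:
    """Wrap a PyTorch reference implementation in the prompt template above."""
    return (
        "Given the following PyTorch reference implementation:\n\n"
        f"{FENCE}python\n"
        f"{reference_code.rstrip()}\n"
        f"{FENCE}\n\n"
        "Write an optimized CUDA kernel that computes the same result.\n"
    )

prompt = build_prompt("def forward(x, y):\n    return x + y")
print(prompt)
```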

## Limitations

- Evaluated only on KernelBench Level 1 (100 single-operator ML workloads); Levels 2-3 untested
- Generated kernels are tuned to the evaluation hardware (NVIDIA A100) and may not transfer to other GPUs
- Extended test-time adaptation may cause regression (use BoA selection with early stopping)
- Single model size evaluated (120B)
- Surprisal-guided selection requires sufficient intra-task logprob variance; on 11/20 L1 tasks with near-identical logprobs, all selection strategies perform equivalently

## Citation

If you use this model, please cite [our paper](http://arxiv.org/abs/2602.07670):

```bibtex
@article{barnes2026surprisal,
  title={Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies
         for Execution-Grounded Code Generation},
  author={Barnes, Jarrod},
  journal={arXiv preprint arXiv:2602.07670},
  year={2026},
  url={http://arxiv.org/abs/2602.07670}
}
```

## Related Work

- [KernelBench](https://github.com/ScalingIntelligence/KernelBench) - Ouyang et al., 2025
- [TTT-Discover](https://arxiv.org/abs/2601.16175) - Yuksekgonul et al., 2026
- [SDPO](https://arxiv.org/abs/2601.20802) - Zeng et al., 2026
- [Scalable Power Sampling](https://arxiv.org/abs/2601.21590) - Ji et al., 2026