---
license: apache-2.0
base_model: openai/gpt-oss-120b
tags:
- gpu-kernel
- cuda
- code-generation
- reinforcement-learning
- grpo
- kernelbench
datasets:
- ScalingIntelligence/KernelBench
language:
- en
pipeline_tag: text-generation
model-index:
- name: KernelBench-RLVR-120b
  results:
  - task:
      type: text-generation
      name: GPU Kernel Generation
    dataset:
      name: KernelBench L1
      type: ScalingIntelligence/KernelBench
    metrics:
    - name: task_success_rate (K=64, 20 tasks)
      type: custom
      value: 90.0
    - name: fast_1 (K=1, per-sample)
      type: custom
      value: 53.3
    - name: correctness (training dist.)
      type: accuracy
      value: 98.4
---
# KernelBench-RLVR-120b
A 120B-parameter model fine-tuned with GRPO (Group Relative Policy Optimization) for GPU kernel generation. This model was used to study compute-optimal test-time strategies in [Surprisal-Guided Selection](http://arxiv.org/abs/2602.07670), where we find that Best-of-N search with surprisal-guided selection recovers oracle performance at zero additional cost.
**Paper**: [arXiv:2602.07670](http://arxiv.org/abs/2602.07670) | **Code**: [GitHub](https://github.com/jbarnes850/test-time-training)
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Jarrodbarnes/KernelBench-RLVR-120b")
```
## Model Description
This model was trained using an execution-grounded RL framework where:
1. **Environment**: KernelBench provides deterministic execution feedback via CUDA compiler and GPU hardware
2. **Reward**: Raw speedup, gated on correctness and normalized by a running baseline
3. **Algorithm**: GRPO with group-relative advantages
4. **Evaluation**: Same evaluator as training (no reward hacking possible)
| Parameter | Value |
|-----------|-------|
| Base Model | openai/gpt-oss-120b |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| LoRA Rank | 16 |
| Training Steps | 40 |
| Learning Rate | 1e-5 |
| Temperature | 0.25 |
| Max Tokens | 1024 |
| Training Tasks | 80 (KernelBench L1 train split) |
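The reward and group-relative advantage described above can be sketched as follows. This is a minimal illustration only: the function names, the group size, and the exact normalization details are assumptions, not the released training code.

```python
from statistics import mean, pstdev

def kernel_reward(correct: bool, speedup: float) -> float:
    """Correctness-gated reward: incorrect kernels score 0;
    correct kernels score their raw speedup over the reference."""
    return speedup if correct else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: each rollout's reward minus the group mean,
    scaled by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mu) / sigma for r in rewards]

# One group of 4 rollouts for the same task: (correct, speedup)
group = [(True, 1.4), (True, 0.9), (False, 2.0), (True, 1.1)]
rewards = [kernel_reward(c, s) for c, s in group]
advantages = group_relative_advantages(rewards)
```

Note how the fast-but-incorrect rollout earns zero reward and therefore a negative advantage: correctness gating makes speedup count only when the kernel is right.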
## Evaluation Results
**Training Checkpoint (Step 40):**
- Correctness: 98.4%
- Mean Speedup: 0.87x on training distribution
**Best-of-N Search (Full L1 Eval, 20 tasks):**
- 18/20 tasks (90%) achieve fast_1 = 1 at K=64
- Performance saturates at K=16 (99.9% on 5-task subsets)
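The task-level numbers above count a task as solved if any of its K samples reaches fast_1 = 1. A minimal sketch of that aggregation (the variable names and toy data are illustrative, not from the evaluation harness):

```python
def task_success_rate(samples_per_task: dict[str, list[bool]]) -> float:
    """Fraction of tasks where at least one of the K samples is
    both correct and faster than the reference (fast_1 = 1)."""
    solved = sum(any(flags) for flags in samples_per_task.values())
    return solved / len(samples_per_task)

# Toy example: 3 tasks, K=4 samples each; True = sample achieved fast_1
results = {
    "conv2d":  [False, True, False, False],
    "softmax": [False, False, False, False],
    "matmul":  [True, True, False, True],
}
rate = task_success_rate(results)  # 2 of 3 tasks solved
```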
**Selection Strategy Comparison (Subset 1, 5 tasks x 2 seeds):**
| Strategy | fast_1 | std | Mean Speedup |
|----------|--------|-----|--------------|
| best-correct (Oracle) | 100% | 0% | 226.9x |
| **surprisal-guided-top3** | **100%** | **0%** | **139.0x** |
| **surprisal-guided** | **80%** | **0%** | **41.2x** |
| random-correct | 59.2% | 2.7% | 30.0x |
| confidence-guided | 50% | 14.1% | 11.6x |
**Test-Time Training Comparison (Subset 1, 3 seeds):**
| Method | fast_1 | std | Rollouts |
|--------|--------|-----|----------|
| Best-of-N (K=64) | 100% | 0% | 320 |
| Batch-TTT BoA | 30.6% | 11.3% | 960 |
| SDPO Prompt-Only | 30.4% | 7.6% | 320 |
**Note:** fast_1 = fraction of samples that are both correct AND achieve speedup > 1x.
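A minimal sketch of the fast_1 metric as defined in the note above (the `(correct, speedup)` sample representation is an assumption for illustration):

```python
def fast_1(samples: list[tuple[bool, float]]) -> float:
    """fast_1: fraction of samples that are both correct AND
    achieve speedup > 1x over the reference implementation."""
    hits = sum(1 for correct, speedup in samples if correct and speedup > 1.0)
    return hits / len(samples)

# 4 samples: 2 are correct and faster, 1 correct but slow, 1 fast but wrong
samples = [(True, 1.8), (True, 0.7), (False, 2.3), (True, 1.2)]
score = fast_1(samples)
```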
## Key Findings
This model was developed as part of research on compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks. Three findings:
1. **Surprisal-guided selection recovers oracle performance.** Selecting the highest-surprisal (lowest log-probability) correct sample achieves 80% fast_1 vs. 50% for confidence-guided selection (+30pp, Cohen's h = 0.64); extending to surprisal-guided-top3 matches the oracle at 100%. The model's probability distribution tracks frequency, not quality: rare, hardware-optimized kernels occupy an expert tail that surprisal-guided selection recovers at zero additional compute cost.
2. **Search outperforms adaptation.** Best-of-N at K=64 achieves 90% task success (18/20 L1 tasks), while TTT's Best-of-Adaptation reaches 30.6% (3-seed mean) with an "equivalent K" below 1, i.e., worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse sample diversity toward mediocre solutions.
3. **Feedback redundancy.** SDPO with execution feedback (26.3%) underperforms prompt-only self-distillation (30.4%): when the environment already provides dense, continuous rewards, teacher-based interpretation of that feedback becomes redundant.
## Hardware Requirements
- **GPU Memory**: ~240GB for bf16 inference (e.g., 8x A100 40GB, 4x A100 80GB, or 3x H100)
- **Disk Space**: ~240GB for model weights
- **Recommended**: Use `device_map="auto"` for automatic multi-GPU distribution
For single-GPU inference, consider using quantization:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    quantization_config=quantization_config,
    device_map="auto",
)
```
## Intended Use
This model is designed for GPU kernel optimization research. Given a PyTorch reference implementation, it generates optimized CUDA kernel code.
**Input format:**
````
Given the following PyTorch reference implementation:
```python
[reference code]
```
Write an optimized CUDA kernel that computes the same result.
````
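A small helper that assembles a prompt in the format above (the function name is illustrative, not part of the released code):

```python
def build_prompt(reference_code: str) -> str:
    """Wrap a PyTorch reference implementation in the model's expected prompt."""
    return (
        "Given the following PyTorch reference implementation:\n"
        "```python\n"
        f"{reference_code}\n"
        "```\n"
        "Write an optimized CUDA kernel that computes the same result.\n"
    )

prompt = build_prompt("def relu(x):\n    return x.clamp(min=0)")
```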
## Limitations
- Evaluated on KernelBench L1 only (250 ML workloads)
- Hardware-specific optimizations (A100)
- Extended test-time adaptation may cause regression (use BoA selection with early stopping)
- Single model size evaluated (120B)
- Surprisal-guided selection requires sufficient intra-task logprob variance; on 11/20 L1 tasks with near-identical logprobs, all selection strategies perform equivalently
## Citation
If you use this model, please cite [our paper](http://arxiv.org/abs/2602.07670):
```bibtex
@article{barnes2026surprisal,
title={Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies
for Execution-Grounded Code Generation},
author={Barnes, Jarrod},
journal={arXiv preprint arXiv:2602.07670},
year={2026},
url={http://arxiv.org/abs/2602.07670}
}
```
## Related Work
- [KernelBench](https://github.com/ScalingIntelligence/KernelBench) - Ouyang et al., 2025
- [TTT-Discover](https://arxiv.org/abs/2601.16175) - Yuksekgonul et al., 2026
- [SDPO](https://arxiv.org/abs/2601.20802) - Zeng et al., 2026
- [Scalable Power Sampling](https://arxiv.org/abs/2601.21590) - Ji et al., 2026