---
license: apache-2.0
base_model: openai/gpt-oss-120b
tags:
- gpu-kernel
- cuda
- code-generation
- reinforcement-learning
- grpo
- kernelbench
datasets:
- ScalingIntelligence/KernelBench
language:
- en
pipeline_tag: text-generation
model-index:
- name: KernelBench-RLVR-120b
results:
- task:
type: text-generation
name: GPU Kernel Generation
dataset:
name: KernelBench L1
type: ScalingIntelligence/KernelBench
metrics:
- name: task_success_rate (K=64, 20 tasks)
type: custom
value: 90.0
- name: fast_1 (K=1, per-sample)
type: custom
value: 53.3
- name: correctness (training dist.)
type: accuracy
value: 98.4
---
# KernelBench-RLVR-120b
A 120B-parameter model fine-tuned with GRPO (Group Relative Policy Optimization) for GPU kernel generation. This model was used to study compute-optimal test-time strategies in [Surprisal-Guided Selection](http://arxiv.org/abs/2602.07670), where we find that Best-of-N search with surprisal-guided selection recovers oracle performance at zero additional cost.
**Paper**: [arXiv:2602.07670](http://arxiv.org/abs/2602.07670) | **Code**: [GitHub](https://github.com/jbarnes850/test-time-training)
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Jarrodbarnes/KernelBench-RLVR-120b")
```
## Model Description
This model was trained using an execution-grounded RL framework where:
1. **Environment**: KernelBench provides deterministic execution feedback via CUDA compiler and GPU hardware
2. **Reward**: Raw speedup (correctness-gated) normalized by running baseline
3. **Algorithm**: GRPO with group-relative advantages
4. **Evaluation**: Same evaluator as training (no reward hacking possible)
| Parameter | Value |
|-----------|-------|
| Base Model | openai/gpt-oss-120b |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| LoRA Rank | 16 |
| Training Steps | 40 |
| Learning Rate | 1e-5 |
| Temperature | 0.25 |
| Max Tokens | 1024 |
| Training Tasks | 80 (KernelBench L1 train split) |
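The group-relative advantage at the core of GRPO can be sketched in a few lines: each rollout's reward is normalized against the mean and standard deviation of its group. This is a minimal illustration under stated assumptions, not the training code; the correctness-gated speedup reward is simplified to `speedup if correct else 0`.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

def kernel_reward(correct, speedup):
    """Correctness-gated speedup reward: an incorrect kernel earns zero."""
    return speedup if correct else 0.0

# One hypothetical group of 4 rollouts for a single task.
rewards = [kernel_reward(c, s) for c, s in
           [(True, 1.3), (True, 0.9), (False, 2.0), (True, 1.1)]]
advs = group_relative_advantages(rewards)
print([round(a, 2) for a in advs])  # → [0.83, 0.13, -1.44, 0.48]
```

Note that the incorrect rollout receives a strongly negative advantage even though its raw speedup was highest, which is exactly the correctness gating described above.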
## Evaluation Results
**Training Checkpoint (Step 40):**
- Correctness: 98.4%
- Mean Speedup: 0.87x on training distribution
**Best-of-N Search (Full L1 Eval, 20 tasks):**
- 18/20 tasks (90%) achieve fast_1 = 1 at K=64
- Performance saturates at K=16 (99.9% on 5-task subsets)
**Selection Strategy Comparison (Subset 1, 5 tasks x 2 seeds):**
| Strategy | fast_1 | std | Mean Speedup |
|----------|--------|-----|--------------|
| best-correct (Oracle) | 100% | 0% | 226.9x |
| **surprisal-guided-top3** | **100%** | **0%** | **139.0x** |
| **surprisal-guided** | **80%** | **0%** | **41.2x** |
| random-correct | 59.2% | 2.7% | 30.0x |
| confidence-guided | 50% | 14.1% | 11.6x |
**Test-Time Training Comparison (Subset 1, 3 seeds):**
| Method | fast_1 | std | Rollouts |
|--------|--------|-----|----------|
| Best-of-N (K=64) | 100% | 0% | 320 |
| Batch-TTT BoA | 30.6% | 11.3% | 960 |
| SDPO Prompt-Only | 30.4% | 7.6% | 320 |
**Note:** fast_1 = fraction of samples that are both correct AND achieve speedup > 1x.
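The metric defined in the note can be computed as follows; this is a reading aid over hypothetical per-sample results, not the paper's evaluation harness.

```python
def fast_1(samples):
    """fast_1: fraction of samples that are both correct and faster
    than the reference (speedup > 1.0)."""
    hits = sum(1 for correct, speedup in samples if correct and speedup > 1.0)
    return hits / len(samples)

# Hypothetical per-sample results: (correct, speedup vs. reference).
samples = [(True, 1.8), (True, 0.7), (False, 2.4), (True, 1.2)]
print(fast_1(samples))  # → 0.5
```

The incorrect-but-fast sample and the correct-but-slow sample both count as misses: only the two samples that clear both gates contribute.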
## Key Findings
This model was developed as part of research on compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks. Three findings:
1. **Surprisal-guided selection recovers oracle performance.** Selecting the highest-surprisal (lowest log-probability) correct sample achieves 80% fast_1 vs. 50% for confidence-guided selection (+30pp, Cohen's h = 0.64). Extending to surprisal-guided-top3 matches the oracle at 100%. The model's probability distribution tracks frequency, not quality: rare, hardware-optimized kernels occupy an expert tail of the distribution, which surprisal-guided selection recovers at zero additional cost.
2. **Search outperforms adaptation.** Best-of-N at K=64 achieves 90% task success (18/20 L1 tasks). TTT's Best-of-Adaptation reaches 30.6% (3-seed mean), with an "equivalent K" below 1, i.e., worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse sampling diversity toward mediocre solutions.
3. **Feedback redundancy.** SDPO with execution feedback (26.3%) underperforms prompt-only self-distillation (30.4%). When the environment already provides dense, continuous rewards, teacher-based interpretation of that feedback becomes redundant.
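The selection rule behind finding 1 can be sketched as follows: among the correct samples in a Best-of-N pool, rank by surprisal (lowest total log-probability first) and keep the top candidate; the top-3 variant keeps the three most surprising correct samples and takes the fastest. This is an illustrative sketch over hypothetical `(logprob, correct, speedup)` records, not the paper's code, and the tie-breaking inside the top-3 set is an assumption.

```python
def surprisal_guided(samples, top_k=1):
    """Among correct samples, rank by surprisal (lowest total logprob first)
    and return the fastest of the top_k most surprising candidates.
    Each sample is a dict with 'logprob', 'correct', and 'speedup' keys."""
    correct = [s for s in samples if s["correct"]]
    if not correct:
        return None
    by_surprisal = sorted(correct, key=lambda s: s["logprob"])
    return max(by_surprisal[:top_k], key=lambda s: s["speedup"])

# Hypothetical Best-of-N pool for one task.
pool = [
    {"logprob": -48.2, "correct": True,  "speedup": 1.1},  # typical kernel
    {"logprob": -95.6, "correct": True,  "speedup": 4.7},  # rare, optimized kernel
    {"logprob": -30.1, "correct": False, "speedup": 0.0},  # confident but wrong
    {"logprob": -70.3, "correct": True,  "speedup": 1.4},
]
print(surprisal_guided(pool)["speedup"])           # → 4.7
print(surprisal_guided(pool, top_k=3)["speedup"])  # → 4.7
```

With `top_k=1`, selection needs only correctness and log-probabilities, which is what makes the strategy free at inference time: the rare optimized kernel is chosen precisely because the model found it improbable.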
## Hardware Requirements
- **GPU Memory**: ~240GB for bf16 weights (e.g., 8x A100 40GB, 4x A100 80GB, or 3x H100 80GB), plus headroom for activations and KV cache
- **Disk Space**: ~240GB for model weights
- **Recommended**: Use `device_map="auto"` for automatic multi-GPU distribution
For single-GPU inference, consider using quantization:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    quantization_config=quantization_config,
    device_map="auto",
)
```
## Intended Use
This model is designed for GPU kernel optimization research. Given a PyTorch reference implementation, it generates optimized CUDA kernel code.
**Input format:**
````
Given the following PyTorch reference implementation:
```python
[reference code]
```
Write an optimized CUDA kernel that computes the same result.
````
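A small helper that assembles this prompt from a reference implementation; the function name and string construction are illustrative, and only the template text comes from the format above.

```python
FENCE = "`" * 3  # triple backticks, built programmatically to keep this block well-formed

def build_prompt(reference_code: str) -> str:
    """Wrap a PyTorch reference implementation in the model's expected prompt."""
    return (
        "Given the following PyTorch reference implementation:\n"
        f"{FENCE}python\n"
        f"{reference_code.rstrip()}\n"
        f"{FENCE}\n"
        "Write an optimized CUDA kernel that computes the same result."
    )

print(build_prompt("def relu(x):\n    return x.clamp(min=0)"))
```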
## Limitations
- Evaluated on KernelBench L1 only (250 ML workloads)
- Hardware-specific optimizations (A100)
- Extended test-time adaptation may cause regression (use BoA selection with early stopping)
- Single model size evaluated (120B)
- Surprisal-guided selection requires sufficient intra-task logprob variance; on 11/20 L1 tasks with near-identical logprobs, all selection strategies perform equivalently
## Citation
If you use this model, please cite [our paper](http://arxiv.org/abs/2602.07670):
```bibtex
@article{barnes2026surprisal,
title={Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies
for Execution-Grounded Code Generation},
author={Barnes, Jarrod},
journal={arXiv preprint arXiv:2602.07670},
year={2026},
url={http://arxiv.org/abs/2602.07670}
}
```
## Related Work
- [KernelBench](https://github.com/ScalingIntelligence/KernelBench) - Ouyang et al., 2025
- [TTT-Discover](https://arxiv.org/abs/2601.16175) - Yuksekgonul et al., 2026
- [SDPO](https://arxiv.org/abs/2601.20802) - Zeng et al., 2026
- [Scalable Power Sampling](https://arxiv.org/abs/2601.21590) - Ji et al., 2026