---
license: apache-2.0
base_model: openai/gpt-oss-120b
tags:
- gpu-kernel
- cuda
- code-generation
- reinforcement-learning
- grpo
- kernelbench
datasets:
- ScalingIntelligence/KernelBench
language:
- en
pipeline_tag: text-generation
model-index:
- name: KernelBench-RLVR-120b
  results:
  - task:
      type: text-generation
      name: GPU Kernel Generation
    dataset:
      name: KernelBench L1
      type: ScalingIntelligence/KernelBench
    metrics:
    - name: task_success_rate (K=64, 20 tasks)
      type: custom
      value: 90.0
    - name: fast_1 (K=1, per-sample)
      type: custom
      value: 53.3
    - name: correctness (training dist.)
      type: accuracy
      value: 98.4
---
# KernelBench-RLVR-120b

A 120B-parameter model fine-tuned with GRPO (Group Relative Policy Optimization) for GPU kernel generation. This model was used to study compute-optimal test-time strategies in [Surprisal-Guided Selection](http://arxiv.org/abs/2602.07670), where we find that Best-of-N search with surprisal-guided selection recovers oracle performance at zero additional cost.

**Paper**: [arXiv:2602.07670](http://arxiv.org/abs/2602.07670) | **Code**: [GitHub](https://github.com/jbarnes850/test-time-training)
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Jarrodbarnes/KernelBench-RLVR-120b")
```
## Model Description

This model was trained using an execution-grounded RL framework where:

1. **Environment**: KernelBench provides deterministic execution feedback via the CUDA compiler and GPU hardware
2. **Reward**: Raw speedup (correctness-gated), normalized by a running baseline
3. **Algorithm**: GRPO with group-relative advantages
4. **Evaluation**: Same evaluator as training (no reward hacking possible)
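The correctness-gated reward and group-relative advantage described above can be sketched in a few lines. This is a minimal illustration, not the actual training code: the function names are assumptions, and the running-baseline normalization is omitted.

```python
from statistics import mean, pstdev

def kernel_reward(correct: bool, speedup: float) -> float:
    """Correctness-gated reward: incorrect kernels earn 0,
    correct kernels earn their raw speedup over the baseline."""
    return speedup if correct else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each rollout's reward by the
    mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mu) / sigma for r in rewards]

# One group of 4 rollouts for the same task: (correct, speedup) pairs.
rewards = [kernel_reward(c, s) for c, s in
           [(True, 1.4), (True, 0.9), (False, 0.0), (True, 1.1)]]
advs = group_relative_advantages(rewards)
# The fastest correct kernel receives the largest positive advantage;
# the broken kernel receives the most negative one.
```

The group-relative baseline means a rollout is rewarded for beating its siblings, not an absolute threshold, which keeps the gradient signal informative even when all speedups in a group are modest.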
| Parameter | Value |
|-----------|-------|
| Base Model | openai/gpt-oss-120b |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| LoRA Rank | 16 |
| Training Steps | 40 |
| Learning Rate | 1e-5 |
| Temperature | 0.25 |
| Max Tokens | 1024 |
| Training Tasks | 80 (KernelBench L1 train split) |
## Evaluation Results

**Training Checkpoint (Step 40):**

- Correctness: 98.4%
- Mean Speedup: 0.87x on training distribution

**Best-of-N Search (Full L1 Eval, 20 tasks):**

- 18/20 tasks (90%) achieve fast_1 = 1 at K=64
- Performance saturates at K=16 (99.9% on 5-task subsets)
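The saturation at K=16 is what an independence assumption predicts. Treating samples as i.i.d. (a simplification; real samples within a task are correlated), the chance that at least one of K samples is both correct and fast follows 1 - (1 - p)^K:

```python
def task_success_prob(p: float, k: int) -> float:
    """P(at least one of k i.i.d. samples is correct AND >1x fast),
    given per-sample fast_1 probability p."""
    return 1.0 - (1.0 - p) ** k

# With the observed per-sample fast_1 of 53.3%, a handful of samples
# already nearly guarantees a fast kernel on tasks the model can solve:
for k in (1, 4, 16, 64):
    print(k, task_success_prob(0.533, k))
```

By K=16 the predicted success probability exceeds 99.9%, consistent with the saturation observed above; the residual failures at K=64 come from tasks where the per-sample rate is far below the average.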
**Selection Strategy Comparison (Subset 1, 5 tasks × 2 seeds):**

| Strategy | fast_1 | std | Mean Speedup |
|----------|--------|-----|--------------|
| best-correct (Oracle) | 100% | 0% | 226.9x |
| **surprisal-guided-top3** | **100%** | **0%** | **139.0x** |
| **surprisal-guided** | **80%** | **0%** | **41.2x** |
| random-correct | 59.2% | 2.7% | 30.0x |
| confidence-guided | 50% | 14.1% | 11.6x |
**Test-Time Training Comparison (Subset 1, 3 seeds):**

| Method | fast_1 | std | Rollouts |
|--------|--------|-----|----------|
| Best-of-N (K=64) | 100% | 0% | 320 |
| Batch-TTT BoA | 30.6% | 11.3% | 960 |
| SDPO Prompt-Only | 30.4% | 7.6% | 320 |

**Note:** fast_1 = fraction of samples that are both correct AND achieve speedup > 1x.
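The fast_1 definition above can be pinned down as a one-liner over (correct, speedup) pairs (a hypothetical helper, shown only to make the metric unambiguous):

```python
def fast_1(samples: list[tuple[bool, float]]) -> float:
    """fast_1: fraction of samples that are both correct AND faster
    than the PyTorch baseline (speedup > 1x)."""
    if not samples:
        return 0.0
    hits = sum(1 for correct, speedup in samples if correct and speedup > 1.0)
    return hits / len(samples)

# (correct, speedup) pairs for one task's rollouts.
# Note the third sample: a fast but INCORRECT kernel does not count.
samples = [(True, 1.3), (True, 0.8), (False, 2.0), (True, 1.1)]
print(fast_1(samples))  # 0.5
```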
## Key Findings

This model was developed as part of research on compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks. Three findings:

1. **Surprisal-guided selection recovers oracle performance.** Selecting the highest-surprisal (lowest log-probability) correct sample achieves 80% fast_1 vs. 50% for confidence-guided selection (+30pp, Cohen's h = 0.64). Extending to surprisal-guided-top3 matches the oracle at 100%. The model's probability distribution tracks frequency, not quality: rare, hardware-optimized kernels occupy the "Expert Tail" that surprisal recovers at zero cost.
2. **Search outperforms adaptation.** Best-of-N at K=64 achieves 90% task success (18/20 L1 tasks). TTT's Best-of-Adaptation reaches 30.6% (3-seed mean), with an "equivalent K" below 1, i.e. worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions.
3. **Feedback redundancy.** SDPO with execution feedback (26.3%) underperforms prompt-only self-distillation (30.4%). When the environment provides dense, continuous rewards, teacher-based interpretation of feedback becomes redundant.
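The selection rule from finding 1 can be sketched as follows. The sample schema (`correct`, `logprob`, `code` keys) is an illustrative assumption, not the paper's implementation: among verified-correct samples, keep the top-k with the lowest total sequence log-probability, i.e. the highest surprisal.

```python
def surprisal_guided_select(samples: list[dict], top_k: int = 1) -> list[dict]:
    """Among correct samples, return the top_k with the LOWEST
    sequence log-probability (highest surprisal). Correctness is
    cheap to verify here: the execution environment already grades it."""
    correct = [s for s in samples if s["correct"]]
    return sorted(correct, key=lambda s: s["logprob"])[:top_k]

samples = [
    {"correct": True,  "logprob": -12.0, "code": "common kernel"},
    {"correct": True,  "logprob": -45.0, "code": "rare, hand-tuned kernel"},
    {"correct": False, "logprob": -60.0, "code": "broken kernel"},
]
best = surprisal_guided_select(samples)[0]
# Selects the rare (low-probability) correct kernel, not the common one.
```

Note the gating order: surprisal only ranks *correct* samples. An incorrect sample can be arbitrarily surprising (like the third one above) without ever being selected, which is why the strategy is free once execution feedback already exists.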
## Hardware Requirements

- **GPU Memory**: ~240GB for bf16 inference (e.g., 8x A100 40GB, 4x A100 80GB, or 3x H100)
- **Disk Space**: ~240GB for model weights
- **Recommended**: Use `device_map="auto"` for automatic multi-GPU distribution

For single-GPU inference, consider quantization:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    quantization_config=quantization_config,
    device_map="auto",
)
```
## Intended Use

This model is designed for GPU kernel optimization research. Given a PyTorch reference implementation, it generates optimized CUDA kernel code.

**Input format:**

````
Given the following PyTorch reference implementation:

```python
[reference code]
```

Write an optimized CUDA kernel that computes the same result.
````
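The prompt above can be assembled programmatically. A small helper (hypothetical, not part of the released code), with the markdown fence built from repeated backticks so the snippet stays self-contained:

```python
FENCE = "`" * 3  # literal markdown code fence

def build_prompt(reference_code: str) -> str:
    """Wrap a PyTorch reference implementation in the model's
    expected input format."""
    return (
        "Given the following PyTorch reference implementation:\n\n"
        f"{FENCE}python\n{reference_code.strip()}\n{FENCE}\n\n"
        "Write an optimized CUDA kernel that computes the same result."
    )

prompt = build_prompt("def relu(x):\n    return torch.clamp(x, min=0)")
```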
## Limitations

- Evaluated on KernelBench L1 only (the full benchmark spans 250 ML workloads)
- Hardware-specific optimizations (A100)
- Extended test-time adaptation may cause regression (use BoA selection with early stopping)
- Single model size evaluated (120B)
- Surprisal-guided selection requires sufficient intra-task log-probability variance; on 11/20 L1 tasks with near-identical logprobs, all selection strategies perform equivalently
## Citation

If you use this model, please cite [our paper](http://arxiv.org/abs/2602.07670):

```bibtex
@article{barnes2026surprisal,
  title={Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies
         for Execution-Grounded Code Generation},
  author={Barnes, Jarrod},
  journal={arXiv preprint arXiv:2602.07670},
  year={2026},
  url={http://arxiv.org/abs/2602.07670}
}
```
## Related Work

- [KernelBench](https://github.com/ScalingIntelligence/KernelBench) - Ouyang et al., 2025
- [TTT-Discover](https://arxiv.org/abs/2601.16175) - Yuksekgonul et al., 2026
- [SDPO](https://arxiv.org/abs/2601.20802) - Zeng et al., 2026
- [Scalable Power Sampling](https://arxiv.org/abs/2601.21590) - Ji et al., 2026