---
license: apache-2.0
base_model: openai/gpt-oss-120b
tags:
- gpu-kernel
- cuda
- code-generation
- reinforcement-learning
- grpo
- kernelbench
datasets:
- ScalingIntelligence/KernelBench
language:
- en
pipeline_tag: text-generation
model-index:
- name: KernelBench-RLVR-120b
  results:
  - task:
      type: text-generation
      name: GPU Kernel Generation
    dataset:
      name: KernelBench L1
      type: ScalingIntelligence/KernelBench
    metrics:
    - name: task_success_rate (K=64, 20 tasks)
      type: custom
      value: 90.0
    - name: fast_1 (K=1, per-sample)
      type: custom
      value: 53.3
    - name: correctness (training dist.)
      type: accuracy
      value: 98.4
---
# KernelBench-RLVR-120b

A 120B-parameter model fine-tuned with GRPO (Group Relative Policy Optimization) for GPU kernel generation. This model was used to study compute-optimal test-time strategies in [Surprisal-Guided Selection](http://arxiv.org/abs/2602.07670), where we find that Best-of-N search with surprisal-guided selection recovers oracle performance at zero additional cost.

**Paper**: [arXiv:2602.07670](http://arxiv.org/abs/2602.07670) | **Code**: [GitHub](https://github.com/jbarnes850/test-time-training)
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Jarrodbarnes/KernelBench-RLVR-120b")
```
## Model Description

This model was trained using an execution-grounded RL framework where:

1. **Environment**: KernelBench provides deterministic execution feedback via the CUDA compiler and GPU hardware
2. **Reward**: Raw speedup (correctness-gated), normalized by a running baseline
3. **Algorithm**: GRPO with group-relative advantages
4. **Evaluation**: Same evaluator as training (no reward hacking possible)
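The correctness-gated reward and group-relative advantage described above can be sketched in a few lines. This is a minimal illustration, not the actual training code: the function names are assumptions, and the running-baseline normalization is omitted.

```python
from statistics import mean, pstdev

def kernel_reward(correct: bool, speedup: float) -> float:
    """Correctness-gated reward: incorrect kernels earn 0,
    correct kernels earn their raw speedup over the baseline."""
    return speedup if correct else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each rollout's reward by the
    mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mu) / sigma for r in rewards]

# One group of 4 rollouts for the same task: (correct, speedup) pairs.
rewards = [kernel_reward(c, s) for c, s in
           [(True, 1.4), (True, 0.9), (False, 0.0), (True, 1.1)]]
advs = group_relative_advantages(rewards)
# The fastest correct kernel receives the largest positive advantage;
# the broken kernel receives the most negative one.
```

The group-relative baseline means a rollout is rewarded for beating its siblings, not an absolute threshold, which keeps the gradient signal informative even when all speedups in a group are modest.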
| Parameter | Value |
|-----------|-------|
| Base Model | openai/gpt-oss-120b |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| LoRA Rank | 16 |
| Training Steps | 40 |
| Learning Rate | 1e-5 |
| Temperature | 0.25 |
| Max Tokens | 1024 |
| Training Tasks | 80 (KernelBench L1 train split) |
## Evaluation Results

**Training Checkpoint (Step 40):**

- Correctness: 98.4%
- Mean Speedup: 0.87x on training distribution

**Best-of-N Search (Full L1 Eval, 20 tasks):**

- 18/20 tasks (90%) achieve fast_1 = 1 at K=64
- Performance saturates at K=16 (99.9% on 5-task subsets)
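The saturation at K=16 is what an independence assumption predicts. Treating samples as i.i.d. (a simplification; real samples within a task are correlated), the chance that at least one of K samples is both correct and fast follows 1 - (1 - p)^K:

```python
def task_success_prob(p: float, k: int) -> float:
    """P(at least one of k i.i.d. samples is correct AND >1x fast),
    given per-sample fast_1 probability p."""
    return 1.0 - (1.0 - p) ** k

# With the observed per-sample fast_1 of 53.3%, a handful of samples
# already nearly guarantees a fast kernel on tasks the model can solve:
for k in (1, 4, 16, 64):
    print(k, task_success_prob(0.533, k))
```

By K=16 the predicted success probability exceeds 99.9%, consistent with the saturation observed above; the residual failures at K=64 come from tasks where the per-sample rate is far below the average.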
**Selection Strategy Comparison (Subset 1, 5 tasks × 2 seeds):**

| Strategy | fast_1 | std | Mean Speedup |
|----------|--------|-----|--------------|
| best-correct (Oracle) | 100% | 0% | 226.9x |
| **surprisal-guided-top3** | **100%** | **0%** | **139.0x** |
| **surprisal-guided** | **80%** | **0%** | **41.2x** |
| random-correct | 59.2% | 2.7% | 30.0x |
| confidence-guided | 50% | 14.1% | 11.6x |
**Test-Time Training Comparison (Subset 1, 3 seeds):**

| Method | fast_1 | std | Rollouts |
|--------|--------|-----|----------|
| Best-of-N (K=64) | 100% | 0% | 320 |
| Batch-TTT BoA | 30.6% | 11.3% | 960 |
| SDPO Prompt-Only | 30.4% | 7.6% | 320 |

**Note:** fast_1 = fraction of samples that are both correct AND achieve speedup > 1x.
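The fast_1 definition above can be pinned down as a one-liner over (correct, speedup) pairs (a hypothetical helper, shown only to make the metric unambiguous):

```python
def fast_1(samples: list[tuple[bool, float]]) -> float:
    """fast_1: fraction of samples that are both correct AND faster
    than the PyTorch baseline (speedup > 1x)."""
    if not samples:
        return 0.0
    hits = sum(1 for correct, speedup in samples if correct and speedup > 1.0)
    return hits / len(samples)

# (correct, speedup) pairs for one task's rollouts.
# Note the third sample: a fast but INCORRECT kernel does not count.
samples = [(True, 1.3), (True, 0.8), (False, 2.0), (True, 1.1)]
print(fast_1(samples))  # 0.5
```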
## Key Findings

This model was developed as part of research on compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks. Three findings:

1. **Surprisal-guided selection recovers oracle performance.** Selecting the highest-surprisal (lowest log-probability) correct sample achieves 80% fast_1 vs. 50% for confidence-guided selection (+30pp, Cohen's h = 0.64). Extending to surprisal-guided-top3 matches the oracle at 100%. The model's probability distribution tracks frequency, not quality: rare, hardware-optimized kernels occupy the "Expert Tail" that surprisal recovers at zero cost.
2. **Search outperforms adaptation.** Best-of-N at K=64 achieves 90% task success (18/20 L1 tasks). TTT's Best-of-Adaptation reaches 30.6% (3-seed mean), with an "equivalent K" below 1, i.e. worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions.
3. **Feedback redundancy.** SDPO with execution feedback (26.3%) underperforms prompt-only self-distillation (30.4%). When the environment provides dense, continuous rewards, teacher-based interpretation of feedback becomes redundant.
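The selection rule from finding 1 can be sketched as follows. The sample schema (`correct`, `logprob`, `code` keys) is an illustrative assumption, not the paper's implementation: among verified-correct samples, keep the top-k with the lowest total sequence log-probability, i.e. the highest surprisal.

```python
def surprisal_guided_select(samples: list[dict], top_k: int = 1) -> list[dict]:
    """Among correct samples, return the top_k with the LOWEST
    sequence log-probability (highest surprisal). Correctness is
    cheap to verify here: the execution environment already grades it."""
    correct = [s for s in samples if s["correct"]]
    return sorted(correct, key=lambda s: s["logprob"])[:top_k]

samples = [
    {"correct": True,  "logprob": -12.0, "code": "common kernel"},
    {"correct": True,  "logprob": -45.0, "code": "rare, hand-tuned kernel"},
    {"correct": False, "logprob": -60.0, "code": "broken kernel"},
]
best = surprisal_guided_select(samples)[0]
# Selects the rare (low-probability) correct kernel, not the common one.
```

Note the gating order: surprisal only ranks *correct* samples. An incorrect sample can be arbitrarily surprising (like the third one above) without ever being selected, which is why the strategy is free once execution feedback already exists.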
## Hardware Requirements

- **GPU Memory**: ~240GB for bf16 inference (e.g., 8x A100 40GB, 4x A100 80GB, or 3x H100)
- **Disk Space**: ~240GB for model weights
- **Recommended**: Use `device_map="auto"` for automatic multi-GPU distribution

For single-GPU inference, consider quantization:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    quantization_config=quantization_config,
    device_map="auto",
)
```
## Intended Use

This model is designed for GPU kernel optimization research. Given a PyTorch reference implementation, it generates optimized CUDA kernel code.

**Input format:**

````
Given the following PyTorch reference implementation:

```python
[reference code]
```

Write an optimized CUDA kernel that computes the same result.
````
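The prompt above can be assembled programmatically. A small helper (hypothetical, not part of the released code), with the markdown fence built from repeated backticks so the snippet stays self-contained:

```python
FENCE = "`" * 3  # literal markdown code fence

def build_prompt(reference_code: str) -> str:
    """Wrap a PyTorch reference implementation in the model's
    expected input format."""
    return (
        "Given the following PyTorch reference implementation:\n\n"
        f"{FENCE}python\n{reference_code.strip()}\n{FENCE}\n\n"
        "Write an optimized CUDA kernel that computes the same result."
    )

prompt = build_prompt("def relu(x):\n    return torch.clamp(x, min=0)")
```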
## Limitations

- Evaluated on KernelBench L1 only (the full benchmark spans 250 ML workloads)
- Hardware-specific optimizations (A100)
- Extended test-time adaptation may cause regression (use BoA selection with early stopping)
- Single model size evaluated (120B)
- Surprisal-guided selection requires sufficient intra-task log-probability variance; on 11/20 L1 tasks with near-identical logprobs, all selection strategies perform equivalently
## Citation

If you use this model, please cite [our paper](http://arxiv.org/abs/2602.07670):

```bibtex
@article{barnes2026surprisal,
  title={Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies
         for Execution-Grounded Code Generation},
  author={Barnes, Jarrod},
  journal={arXiv preprint arXiv:2602.07670},
  year={2026},
  url={http://arxiv.org/abs/2602.07670}
}
```
## Related Work

- [KernelBench](https://github.com/ScalingIntelligence/KernelBench) - Ouyang et al., 2025
- [TTT-Discover](https://arxiv.org/abs/2601.16175) - Yuksekgonul et al., 2026
- [SDPO](https://arxiv.org/abs/2601.20802) - Zeng et al., 2026
- [Scalable Power Sampling](https://arxiv.org/abs/2601.21590) - Ji et al., 2026