Update model card: align with published paper (arXiv:2602.07670), add selection strategy results, fix metrics
README.md
CHANGED
@@ -23,17 +23,22 @@ model-index:
       name: KernelBench L1
       type: ScalingIntelligence/KernelBench
     metrics:
-    - name:
       type: custom
-      value:
-    - name:
       type: accuracy
-      value:
 ---
 
 # KernelBench-RLVR-120b
 
-A 120B
 
 ## Quick Start
 
@@ -75,25 +80,39 @@ This model was trained using an execution-grounded RL framework where:
 - Correctness: 98.4%
 - Mean Speedup: 0.87x on training distribution
 
-**
 
-| Method | fast_1 | std |
-|--------|--------|-----|----------
-
-
-
 
 **Note:** fast_1 = fraction of samples that are both correct AND achieve speedup > 1x.
 
 ## Key Findings
 
-This model was developed as part of research on test-time
 
-1. **
 
-2. **
 
-3. **
 
 ## Hardware Requirements
 
@@ -134,6 +153,7 @@ Write an optimized CUDA kernel that computes the same result.
 - Hardware-specific optimizations (A100)
 - Extended test-time adaptation may cause regression (use BoA selection with early stopping)
 - Single model size evaluated (120B)
 
 ## Citation
 
       name: KernelBench L1
       type: ScalingIntelligence/KernelBench
     metrics:
+    - name: task_success_rate (K=64, 20 tasks)
       type: custom
+      value: 90.0
+    - name: fast_1 (K=1, per-sample)
+      type: custom
+      value: 53.3
+    - name: correctness (training dist.)
       type: accuracy
+      value: 98.4
 ---
 
 # KernelBench-RLVR-120b
 
+A 120B-parameter model fine-tuned with GRPO (Group Relative Policy Optimization) for GPU kernel generation. This model was used to study compute-optimal test-time strategies in [Surprisal-Guided Selection](http://arxiv.org/abs/2602.07670), where we find that Best-of-N search with surprisal-guided selection recovers oracle performance at zero additional cost.
+
+**Paper**: [arXiv:2602.07670](http://arxiv.org/abs/2602.07670) | **Code**: [GitHub](https://github.com/jbarnes850/test-time-training)
 
 ## Quick Start
 
 - Correctness: 98.4%
 - Mean Speedup: 0.87x on training distribution
 
+**Best-of-N Search (Full L1 Eval, 20 tasks):**
+- 18/20 tasks (90%) achieve fast_1 = 1 at K=64
+- Performance saturates at K=16 (99.9% on 5-task subsets)
+
+**Selection Strategy Comparison (Subset 1, 5 tasks x 2 seeds):**
+
+| Strategy | fast_1 | std | Mean Speedup |
+|----------|--------|-----|--------------|
+| best-correct (Oracle) | 100% | 0% | 226.9x |
+| **surprisal-guided-top3** | **100%** | **0%** | **139.0x** |
+| **surprisal-guided** | **80%** | **0%** | **41.2x** |
+| random-correct | 59.2% | 2.7% | 30.0x |
+| confidence-guided | 50% | 14.1% | 11.6x |
+
+**Test-Time Training Comparison (Subset 1, 3 seeds):**
 
+| Method | fast_1 | std | Rollouts |
+|--------|--------|-----|----------|
+| Best-of-N (K=64) | 100% | 0% | 320 |
+| Batch-TTT BoA | 30.6% | 11.3% | 960 |
+| SDPO Prompt-Only | 30.4% | 7.6% | 320 |
 
 **Note:** fast_1 = fraction of samples that are both correct AND achieve speedup > 1x.
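[Editor's sketch] The note above defines fast_1; a minimal illustration of the metric, using made-up (correct, speedup) pairs rather than real eval data:

```python
def fast_1(samples):
    """Fraction of samples that are both correct AND faster than baseline (speedup > 1x)."""
    if not samples:
        return 0.0
    hits = sum(1 for correct, speedup in samples if correct and speedup > 1.0)
    return hits / len(samples)

# Illustrative (correct, speedup) pairs per generated kernel -- not eval data.
samples = [(True, 1.4), (True, 0.9), (False, 2.0), (True, 2.1)]
print(fast_1(samples))  # 2 of 4 are correct AND >1x -> 0.5
```

Note that a correct-but-slow sample and a fast-but-incorrect sample both count as failures under this metric.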
 
 ## Key Findings
 
+This model was developed as part of research on compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks. Three findings:
 
+1. **Surprisal-guided selection recovers oracle performance.** Selecting the highest-surprisal (lowest log-probability) correct sample achieves 80% fast_1 vs. 50% for confidence-guided (+30pp, Cohen's h = 0.64). Extending to surprisal-guided-top3 matches the oracle at 100%. The model's probability distribution tracks frequency, not quality: rare, hardware-optimized kernels occupy the expert tail that surprisal-guided selection recovers at zero cost.
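[Editor's sketch] The selection rule in finding 1 above can be sketched as follows; the sample format (mean_logprob, correct, kernel_source) and candidate data are hypothetical, and the last two lines simply re-derive the quoted Cohen's h effect size for 0.80 vs. 0.50:

```python
import math

def select_surprisal_guided(samples):
    """Among verified-correct samples, return the one with the lowest mean
    log-probability, i.e. the highest surprisal."""
    correct = [s for s in samples if s[1]]
    if not correct:
        return None
    return min(correct, key=lambda s: s[0])  # lowest logprob = highest surprisal

# Hypothetical candidates: (mean_logprob, correct, kernel_source).
samples = [
    (-0.2, True, "naive elementwise kernel"),    # common, high-probability
    (-1.7, True, "tiled shared-memory kernel"),  # rare, hardware-optimized
    (-2.5, False, "broken kernel"),              # rare but incorrect: filtered out
]
print(select_surprisal_guided(samples)[2])  # picks the rare correct kernel

# Sanity check of the quoted effect size: Cohen's h for proportions 0.80 vs. 0.50.
h = 2 * math.asin(math.sqrt(0.80)) - 2 * math.asin(math.sqrt(0.50))
print(round(h, 2))  # 0.64
```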
 
+2. **Search outperforms adaptation.** Best-of-N at K=64 achieves 90% task success (18/20 L1 tasks). TTT's Best-of-Adaptation reaches 30.6% (3-seed mean), with an "equivalent K" below 1, i.e. worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions.
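[Editor's sketch] A minimal version of the Best-of-N loop described in finding 2, with toy stand-in generate/verify functions rather than the actual model and KernelBench execution harness:

```python
from itertools import cycle

def best_of_n(generate, verify, k=64):
    """Draw k candidate kernels, keep those the harness verifies as correct,
    and return the fastest. verify(c) returns (is_correct, speedup)."""
    candidates = [generate() for _ in range(k)]
    scored = [(c, verify(c)) for c in candidates]
    correct = [(c, speedup) for c, (ok, speedup) in scored if ok]
    if not correct:
        return None  # no verified-correct sample among the k rollouts
    return max(correct, key=lambda cs: cs[1])[0]

# Toy stand-ins: a "kernel" is just its speedup; correct iff speedup is positive.
pool = cycle([0.5, 1.2, -1.0, 3.0])
gen = lambda: next(pool)
ver = lambda c: (c > 0, c)
print(best_of_n(gen, ver, k=64))  # 3.0
```

The loop is pure search: no gradient updates, so diversity across the k rollouts is preserved, which is the property the adaptation baselines lose.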
 
+3. **Feedback redundancy.** SDPO with execution feedback (26.3%) underperforms prompt-only self-distillation (30.4%). When the world provides dense, continuous rewards, teacher-based interpretation becomes redundant.
 
 ## Hardware Requirements
 
 - Hardware-specific optimizations (A100)
 - Extended test-time adaptation may cause regression (use BoA selection with early stopping)
 - Single model size evaluated (120B)
+- Surprisal-guided selection requires sufficient intra-task logprob variance; on 11/20 L1 tasks with near-identical logprobs, all selection strategies perform equivalently
 
 ## Citation
 