File size: 7,361 Bytes
96c8d72
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
---
title: "kernrl: Teaching LLMs to Write Fast GPU Kernels"
thumbnail: /blog/assets/kernrl/thumbnail.png
authors:
- user: Infatoshi
date: 2026-01-20
tags:
- openenv
- reinforcement-learning
- cuda
- triton
- gpu
- grpo
---

# kernrl: Teaching LLMs to Write Fast GPU Kernels

What if we could train language models to write optimized GPU code? Not just syntactically correct code, but kernels that actually run faster than PyTorch's defaults?

That's the goal of **kernrl** - an RL environment where agents learn to optimize GPU kernels through trial and error, receiving real performance feedback from actual hardware.

## The Problem: GPU Programming is Hard

Writing efficient GPU code requires understanding memory hierarchies, thread synchronization, and hardware-specific optimizations. Even experienced engineers spend significant time tuning kernels for different architectures.

Consider a simple softmax operation. PyTorch's implementation works, but a hand-tuned Triton kernel can be 2-5x faster by:
- Fusing the max, subtract, exp, sum, and divide operations
- Using efficient memory access patterns
- Avoiding unnecessary global memory round-trips

```python
# PyTorch baseline - multiple kernel launches
def softmax(x):
    max_val = x.max(dim=-1, keepdim=True)
    x = x - max_val
    exp_x = torch.exp(x)
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

# Triton kernel - single fused operation
@triton.jit
def softmax_kernel(input_ptr, output_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    row_idx = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols

    row = tl.load(input_ptr + row_idx * n_cols + col_offsets, mask=mask, other=-float('inf'))
    row_max = tl.max(row, axis=0)
    numerator = tl.exp(row - row_max)
    softmax_output = numerator / tl.sum(numerator, axis=0)

    tl.store(output_ptr + row_idx * n_cols + col_offsets, softmax_output, mask=mask)
```

## Enter kernrl

kernrl is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment that frames GPU kernel optimization as an RL problem:

- **State**: A PyTorch reference implementation + GPU info
- **Action**: CUDA/Triton kernel code
- **Reward**: Based on compilation, correctness, and speedup

The environment evaluates submitted kernels on real GPU hardware, providing concrete feedback:

```python
from kernrl import kernrl_env, KernelAction

env = kernrl_env(base_url="http://localhost:8000")
obs = env.reset(problem_id="L1_23_Softmax")

# Submit a kernel
result = env.step(KernelAction(code=triton_kernel_code))

print(f"Compiled: {result.observation.compilation_success}")
print(f"Correct: {result.observation.correctness_pass}")
print(f"Speedup: {result.observation.speedup}x")
```

## 89 Problems Across 10 Difficulty Levels

kernrl includes a diverse problem set spanning from basic operations to cutting-edge architectures:

| Level | Category | Examples |
|-------|----------|----------|
| 1 | Simple Ops | matmul, softmax, conv2d, layernorm |
| 2 | Fused Ops | matmul+GELU+softmax, conv+batchnorm |
| 3 | Attention | Vision attention, causal attention, transformer blocks |
| 4 | Novel Layers | DeepSeek MLA, MoE, GQA, FP8 matmul, INT4 GEMM |
| 5 | Scientific | N-body simulation, stencils, sparse matrix ops |
| 6 | Graphics | Ray tracing, histogram, bilateral filter |
| 7 | Signal | FFT, convolution, median filter |
| 8 | Video | Motion estimation, optical flow, deblocking |
| 9 | Primitives | Prefix scan, radix sort, stream compaction |
| 10 | Cryptography | SHA-256, AES, ChaCha20 |

Level 4 is particularly interesting - it includes architectures like DeepSeek's Multi-head Latent Attention and Mixture of Experts that weren't in most training data, testing whether models can truly reason about kernel optimization rather than memorize solutions.

## Training with GRPO

We use TRL's GRPOTrainer with a custom rollout function that interacts with the kernrl environment:

```python
from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openenv import generate_rollout_completions

def rollout_func(prompts, trainer):
    # Generate kernel code
    outputs = generate_rollout_completions(trainer, prompts)

    # Evaluate in environment
    env_rewards = []
    for completion in outputs:
        code = extract_code(completion)
        result = env.step(KernelAction(code=code))
        env_rewards.append(compute_reward(result))

    return {
        "prompt_ids": [...],
        "completion_ids": [...],
        "logprobs": [...],
        "env_reward": env_rewards,
    }

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Coder-1.5B-Instruct",
    reward_funcs=[reward_from_env],
    rollout_func=rollout_func,
    args=GRPOConfig(use_vllm=True),
)
trainer.train()
```

The reward structure encourages incremental progress:
- **+0.1** for successful compilation
- **+0.3** for correctness (output matches reference within tolerance)
- **+0.3** for beating the baseline
- **+0.0 to +0.6** bonus scaled by log2(speedup)

This means a model can learn even before achieving speedups - first learn to write valid code, then correct code, then fast code.

## Why This Matters

### For AI Research
GPU kernel optimization is a well-defined domain with clear metrics (correctness + speed), making it ideal for studying:
- How LLMs reason about low-level code optimization
- Whether RL can teach models skills not present in training data
- Transfer learning between GPU architectures (H100 vs B200)

### For Practical Applications
As AI models grow, inference costs dominate. A model that can automatically optimize kernels could:
- Reduce serving costs by finding faster implementations
- Adapt to new hardware without manual tuning
- Enable efficient deployment on edge devices

### For the Community
kernrl provides:
- A standardized benchmark for kernel optimization capabilities
- Real hardware evaluation (not just syntax checking)
- Integration with the broader OpenEnv ecosystem

## Try It Yourself

**HuggingFace Space**: [huggingface.co/spaces/Infatoshi/kernrl](https://huggingface.co/spaces/Infatoshi/kernrl)

**Training Notebook**: [huggingface.co/Infatoshi/kernrl-training](https://huggingface.co/Infatoshi/kernrl-training)

**OpenEnv PR**: [github.com/meta-pytorch/OpenEnv/pull/308](https://github.com/meta-pytorch/OpenEnv/pull/308)

To run locally with GPU:

```bash
# Clone OpenEnv
git clone https://github.com/meta-pytorch/OpenEnv.git
cd OpenEnv/envs/kernrl

# Install
pip install -e .

# Start server
uvicorn kernrl.server.app:app --host 0.0.0.0 --port 8000
```

Or with Docker:

```bash
docker build -t kernrl -f server/Dockerfile .
docker run --gpus all -p 8000:8000 kernrl
```

## What's Next

We're excited to see what the community builds with kernrl:

- **Curriculum learning**: Start with L1, progressively add harder problems
- **Multi-turn optimization**: Let models iterate based on profiling feedback
- **Architecture-specific training**: Specialize models for H100 vs B200
- **Novel reward shaping**: Incorporate memory bandwidth, occupancy metrics

The code is open source and contributions are welcome. Whether you're interested in RL, GPU programming, or both - we'd love to see what optimizations your models can discover.

---

*kernrl was built for the OpenEnv Challenge. Special thanks to the Meta PyTorch team for the OpenEnv framework and Hugging Face for TRL.*