The Method Behind OUROBOROS: Make the Reward Hard to Fool

Published June 16, 2026

I have been working on a project called OUROBOROS. The demo version is called Kernel Mint, but the demo is not the main point.

The method is the point.

OUROBOROS is a verifier-guided self-distillation loop for training models to write GPU kernels. The model is not trusted directly. It proposes code. A referee compiles it, checks it against PyTorch, times it against strong compiler baselines, rejects the broken or suspicious attempts, and only lets verified winners become training signal.

That sounds simple, but it changes the shape of the problem. Instead of asking "can a model write impressive looking code?", the question becomes:

Can we build a reward channel that is hard enough to fool that the model can safely learn from it?

For this project, the answer was yes, in a narrow but useful domain: Triton kernels for memory-bound transformer fusion ops.

The broader point

GPU kernels are the proof-of-work, not the boundary of the idea.

The broader claim is that specialist models do not always need huge human-labeled datasets. In domains where answers can be checked cheaply and reliably, the verifier can become the teacher.

The loop is:

model proposes
verifier checks
verified winners become training data
repeat

That applies wherever the output can be executed, tested, or otherwise checked: code optimization, SQL, parsers, theorem proving, data transforms, simulation tasks, compiler tuning, and other narrow domains with hard feedback.

The hard part is not always getting a bigger model or collecting more labels. Sometimes the hard part is building the verifier well enough that the model can safely learn from it.

Why GPU kernels were the test case

GPU kernels are a good place to test this idea because the output has to satisfy two things at once.

First, it has to be correct. A kernel that is fast but returns the wrong answer is useless.

Second, it has to actually be fast. Not just faster than plain eager PyTorch, which is often an easy baseline for fused ops, but faster than PyTorch's compiler path too.

That gives us a rare kind of code-generation task where the reward can be measured instead of guessed. We do not need a human to say whether the code "looks good". We can compile it, run it, compare it to a reference, and time it.

That is why the referee is the core artifact in OUROBOROS.

The referee

Every candidate kernel goes through the same basic path:

Import the candidate in an isolated worker.
Compile the Triton kernel.
Check allclose against a PyTorch reference.
Run the check across awkward shapes, dtypes, magnitudes, and the benchmark shape itself.
Reject kernels that mutate their inputs.
Time with CUDA events after warmup.
Compare against eager PyTorch, torch.compile, and torch.compile max-autotune.
Re-run winners through stability and shape-grid gates.

The important part is that correctness comes before speed. A fast incorrect kernel is not a partial success. It is a failure.

The second important part is that the referee is treated as code that can be attacked. If the reward function can be gamed, the model will eventually find the hole, or at least learn from examples that depend on the hole.

So the referee ships with negative controls.

Negative controls for the reward

I included three exploit kernels as regression tests for the referee.

The first exploit is shape specialization. A kernel can hard-code behavior for the public benchmark shape and return garbage elsewhere. A weak benchmark might still reward it. OUROBOROS folds the benchmark shape into the correctness sweep and checks other shapes too, so this fails.

The second exploit is memoization by input pointer. A kernel can cache the output for an input address and then do no real work on later timed calls. OUROBOROS pokes the input before timed iterations and verifies the final timed output against the live input state, so a stale cached answer fails.

The third exploit is input mutation. A kernel can write into its input so the reused benchmark tensors look partly finished. OUROBOROS enforces the contract that run() returns a fresh output and does not mutate its inputs.

These are not theoretical warnings. They are implemented as tests. The self-test passes good kernels, rejects wrong kernels, and rejects these reward-hacking kernels on both an RTX 4090 and an H200.

That is the main methodological lesson for me: if a model is trained against a reward, the reward should have its own adversarial test suite.

The training loop

The training loop has two stages.

The first stage is supervised fine-tuning on verified kernels. The dataset is not human-labeled in the usual sense. A kernel enters the corpus only after it compiles, matches PyTorch, and beats the compiler baseline under the referee.

The second stage is verifier-guided search and self-distillation.

For each operation, the model samples a group of candidate kernels. The referee scores them. Broken kernels get no credit. Correct and fast kernels get rewarded. The best verified candidate from a round is then distilled back into the model.

The simple version is:

sample kernels
run the referee
keep the best verified winner
train the model to imitate that winner
repeat

This is why I call it self-distillation. The model is not learning from a hidden expert. It is learning from its own verified successes.

The loop also uses canonicalization so cosmetic rewrites of the same kernel are not treated as new discoveries. This matters because otherwise the search can waste time measuring duplicates.

What the results showed

The 27B run produced 76 verified compiler-beating kernels on H200. Of those, 69 went through the five-run stability gate. The other 7 are kept separately as single-shot probes on problems the model had not trained on.

The 69 stability-gated kernels all beat torch.compile max-autotune reproducibly in the recorded gate. Across a 376-cell grid of shapes and dtypes, the trained kernels kept a 1.49x geomean against max-autotune recompiled per cell. The losses were not hidden. About 10 percent of grid cells lost, and those cells are reported.

The 1B result is just as important to me. A MiniCPM5-1B version of the loop, run across 3 seeds and 4 ablation arms on a single RTX 4090, beat torch.compile max-autotune in all 12 runs.

That does not mean the 1B model is a general GPU programmer. It is not. It means the method is not only a large-model story. A small model becomes useful when the task is narrow and the verifier is strong.

What the ablations changed in my head

I expected the RL recipe to be the star.

That was not really what the ablations said.

On familiar operators, the different training arms were close enough that seed noise dominated the ordering. The big thing doing the work was search against the referee. A decent model plus best-of-N sampling, judged by a hard verifier, already gets surprisingly far.

Learning mattered more on operators the model had not seen before.

One example was the family of operators that need overflow-safe exp and log handling, such as softplus-style fusions. Continuing supervised training did not teach them well. The verifier-guided loop did. On the first pass, the valid-kernel rate was low. After self-distilling on verified winners, the second pass reached 8 out of 8 on the softplus fusion cases.

That is the useful split:

For known patterns, search against a good referee does most of the job.

For new patterns, self-distillation on verified winners is where the model actually learns.

The application: Kernel Mint

Kernel Mint is the interactive application of the method.

It lets a user compose a GPU operation, then a small model writes a Triton kernel and the same kind of referee decides whether it counts. The local path uses the MiniCPM5-1B GGUF model through llama.cpp. The Pro path uses the larger 27B model through the Modal-backed backend.

The application exists to make the method inspectable. You can watch the chain:

model proposes kernel
kernel compiles or fails
correctness passes or fails
speed is measured against baselines
only verified wins reach the leaderboard

The demo is not a substitute for the paper. It is a way to see the method behaving in public.

What this is not

The claims are narrow.

These are scheduling wins on memory-bound fusion operators. This is not a claim of new GPU algorithms. It is not a claim against cuBLAS, FlashAttention, or vendor-tuned libraries. It is not a general statement that language models can write any high-performance kernel.

The domain was chosen because it is narrow enough for a small model to hit and strict enough for a referee to check.

That narrowness is a feature. It makes the result easier to inspect.

The part I would reuse elsewhere

The reusable idea is not "use this exact model for kernels".

The reusable idea is:

Pick a domain where correctness can be checked cheaply.
Build the reward channel as a hardened system, not a loose metric.
Ship negative controls for the reward itself.
Let the model search against that referee.
Distill only verified winners.
Report the losses and failure modes.

That recipe should transfer beyond GPU kernels. It is most useful in places where the model's output can be executed, tested, and scored without asking a human to judge it every time.

The proposer can change. In this project it ranged from a 1B model to a 27B model. The part I would keep is the referee.