The Referee Is the Product: teaching a 1B model to write GPU kernels you can trust

Community Article
Published June 15, 2026

Build Small hackathon field notes. Live demo: OUROBOROS Kernel Mint. Two-minute demo: YouTube.

A one-billion-parameter model wrote a GPU kernel that runs faster than PyTorch's own compiler. That sounds like the headline. It is not. The interesting part is the thing that proved it, because that is the part you can actually trust.

The problem with "my small model is good at X"

Everyone has a demo where a small model does something impressive. Almost none of them let you check. The output looks plausible, the numbers are self-reported, and you take it on faith. For code that has to be both correct and fast, faith is worthless.

GPU kernels are a good test case. A kernel is the small program that runs one step of a neural network on the graphics card. Fusing several steps into one kernel is where a lot of real inference speed comes from, and writing them well is expert work. PyTorch ships a whole compiler, torch.compile, to do it for you. So "a small model writes a fast kernel" is only a real claim if something incorruptible checks two things: is it correct, and is it actually faster.

What I built

Kernel Mint lets you compose a fused operation out of blocks: a normalization, an optional residual add, and an activation, or a named operator from a real transformer. A fine-tuned OpenBMB MiniCPM5-1B writes a real Triton kernel for it. Then an immutable referee does the only thing that matters:

  1. compiles the kernel,
  2. checks it gives the same answers as PyTorch on adversarial inputs (odd shapes, huge values, low precision),
  3. times it against torch.compile max-autotune, the strong baseline,
  4. and blocks the specific ways a kernel can cheat the benchmark.

A green tick here is earned, not asserted. Every number you see was measured by that referee, not typed in by a model or a human.
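To make the shape of that loop concrete, here is a minimal sketch in plain Python, not the actual referee. It assumes "kernels" are ordinary functions over lists of floats, stands in for compilation by treating any exception as a failure, and times with time.perf_counter rather than GPU events. Every name in it (Verdict, matches, median_seconds, referee, the example kernels) is hypothetical, and the anti-cheat step is left out.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]
Kernel = Callable[[Vector], Vector]


@dataclass
class Verdict:
    correct: bool
    speedup: float  # baseline time / candidate time; 0.0 when rejected

    @property
    def green_tick(self) -> bool:
        # Earned only if the candidate is correct AND faster than the baseline.
        return self.correct and self.speedup > 1.0


def matches(got: Vector, want: Vector) -> bool:
    """Elementwise comparison with a tolerance; lengths must agree."""
    return len(got) == len(want) and all(
        math.isclose(g, w, rel_tol=1e-6, abs_tol=1e-9) for g, w in zip(got, want)
    )


def median_seconds(fn: Kernel, x: Vector, repeats: int = 5) -> float:
    """Median wall-clock time of fn over a few runs on fresh copies of x."""
    samples = []
    for _ in range(repeats):
        fresh = list(x)
        start = time.perf_counter()
        fn(fresh)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]


def referee(candidate: Kernel, reference: Kernel, baseline: Kernel,
            adversarial: Sequence[Vector], bench_input: Vector) -> Verdict:
    # Steps 1 and 2: any crash or mismatch rejects the candidate. The bench
    # input is part of the correctness sweep, not a separate privileged shape.
    try:
        for x in [*adversarial, bench_input]:
            if not matches(candidate(list(x)), reference(list(x))):
                return Verdict(False, 0.0)
    except Exception:
        return Verdict(False, 0.0)
    # Step 3: time the candidate against the baseline on the bench input.
    t_candidate = max(median_seconds(candidate, bench_input), 1e-12)
    t_baseline = median_seconds(baseline, bench_input)
    return Verdict(True, t_baseline / t_candidate)


# Hypothetical example operation: ReLU.
def relu(x: Vector) -> Vector:
    return [max(v, 0.0) for v in x]


def special_cased(x: Vector) -> Vector:
    # Only right for the one length used by the timer; garbage elsewhere.
    return relu(x) if len(x) == 1000 else [0.0] * len(x)


ADVERSARIAL = [[-1.0, 2.0, 3.0], [1e30, -1e30], [float("inf"), -5.0, 0.5, 1.0, 2.0]]
BENCH = [float(i - 500) for i in range(1000)]
```

Calling referee(special_cased, relu, relu, ADVERSARIAL, BENCH) returns a rejected verdict, because the odd-length inputs expose the special case before any timing happens.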

The referee is the moat

The benchmark-gaming defenses are the part I am proudest of, because they are what make the speed numbers mean anything. A kernel could try to win the timer without doing the work, and the referee has to stop it:

  • Special-casing the timed shape. A kernel could hard-code the answer for the one public shape the timer uses and return garbage everywhere else. The referee folds the bench shape into its adversarial correctness sweep, so this fails the moment it is run on a shape it did not anticipate.
  • Memoizing by input pointer. A kernel could cache its output keyed by the input's memory address and then do no work on later calls. The referee pokes one input element before every timed iteration and verifies the final timed output against the live input state. A stale, cached answer fails.
  • Mutating its inputs. A kernel could write its result back into the input tensor so the reused bench inputs look already done. The referee enforces a contract: run() must write a fresh output and never touch its inputs.

Each of these is shipped as a rejected negative control, and the 30-case self-test (good kernels pass, subtly wrong ones and these three exploits get rejected) is green on both an RTX 4090 and an H200.
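Two of these defenses are easy to illustrate in miniature. The sketch below is a plain-Python analogy, not the referee's code: Python lists stand in for tensors, id(x) stands in for the input's memory address, and the names check_kernel, MemoizingKernel, and mutating_kernel are hypothetical. It pokes one element in place before each iteration, snapshots the input to detect mutation, and checks the final output against the live input state.

```python
from typing import Callable, Dict, List

Vector = List[float]


def double(x: Vector) -> Vector:
    """Honest reference: fresh output, inputs untouched."""
    return [2.0 * v for v in x]


class MemoizingKernel:
    """Exploit: cache the output by input identity, then do no work."""

    def __init__(self) -> None:
        self.cache: Dict[int, Vector] = {}

    def __call__(self, x: Vector) -> Vector:
        key = id(x)  # stand-in for a device pointer
        if key not in self.cache:
            self.cache[key] = [2.0 * v for v in x]
        return list(self.cache[key])


def mutating_kernel(x: Vector) -> Vector:
    """Exploit: write the result back into the input buffer."""
    for i in range(len(x)):
        x[i] = 2.0 * x[i]
    return x


def check_kernel(kernel: Callable[[Vector], Vector],
                 reference: Callable[[Vector], Vector],
                 x: Vector, iters: int = 4) -> str:
    out: Vector = []
    for i in range(iters):
        x[i % len(x)] += 1.0      # poke one element; the buffer stays the same
        snapshot = list(x)
        out = kernel(x)
        if out is x or x != snapshot:
            return "rejected: touched its inputs"
    # Verify the last output against the live (poked) input state.
    if out != reference(x):
        return "rejected: stale output"
    return "ok"


honest_result = check_kernel(double, double, [1.0, 2.0, 3.0])
memo_result = check_kernel(MemoizingKernel(), double, [1.0, 2.0, 3.0])
mutate_result = check_kernel(mutating_kernel, double, [1.0, 2.0, 3.0])
```

The memoizing kernel is correct on its first call and only fails because the input changed underneath its cache, which is exactly why the poke happens before every timed iteration rather than once.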

The results, with the limits stated

  • The 1B MiniCPM5 setup beat torch.compile max-autotune in 12 out of 12 independently seeded runs in the multi-seed ablation. That is the live-small-model result I care about most.
  • The larger Qwen3.6-27B run produced 76 verified compiler-beating kernels on H200. 69 of them held up across five fresh re-benchmark runs (mean of means 1.30x, range 1.11x to 2.04x across that reproducible set). The other 7 are single-shot probes on problems the model had never trained on.
  • Across a 376-cell grid of shapes and dtypes, the trained kernels keep a 1.49x geomean versus max-autotune recompiled per cell.
  • They also beat hand-written expert kernels (Liger, Unsloth, the Triton tutorial) on swiglu, rmsnorm, relu2, and geglu. softmax and layernorm come out as ties within noise.
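For readers who want to see how a grid-level geomean can coexist with per-cell losses, here is a small sketch of the aggregation. The cell keys and speedup values are made up for illustration, and summarize_grid is a hypothetical helper, not part of the project.

```python
from statistics import geometric_mean
from typing import Dict, Tuple

Cell = Tuple[str, int, str]  # (operator, row width, dtype), illustrative only


def summarize_grid(speedups: Dict[Cell, float]) -> dict:
    """Geomean over all cells, plus the losing cells listed individually."""
    losses = {cell: s for cell, s in speedups.items() if s < 1.0}
    return {
        "geomean": geometric_mean(speedups.values()),
        "loss_fraction": len(losses) / len(speedups),
        "losses": losses,  # reported per cell, not folded into the average
    }


# Made-up numbers: speedup = baseline time / kernel time for each cell.
example_grid = {
    ("rmsnorm", 1024, "fp16"): 2.0,
    ("rmsnorm", 4096, "bf16"): 1.0,
    ("swiglu", 1024, "fp16"): 1.0,
    ("swiglu", 4096, "fp32"): 0.5,
}
summary = summarize_grid(example_grid)
```

The geometric mean is the usual choice for ratios because a 2x win and a 0.5x loss cancel to 1.0, where an arithmetic mean would report a misleading 1.125x.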

And the honest bound, because it is the whole point: these are reproducible scheduling wins on memory-bound fusion operations. They are not wins over cuBLAS or FlashAttention, and they are not new algorithms. About 10% of grid cells are losses, reported per cell rather than hidden. Measured end to end on a real MLP sub-block, the gain is a modest single-digit percent, because the matmuls dominate and I do not touch them. The page says so.
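The gap between a per-kernel speedup and the end-to-end gain is Amdahl's law. The sketch below works one example; the 10% share of runtime spent in fused memory-bound operations is an assumed figure for illustration, not a measurement from this project, and end_to_end_speedup is a hypothetical helper.

```python
def end_to_end_speedup(fraction_accelerated: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the runtime gets faster."""
    untouched = 1.0 - fraction_accelerated
    return 1.0 / (untouched + fraction_accelerated / local_speedup)


# Assumption: fused elementwise/normalization ops take 10% of the sub-block's
# time and the matmuls (untouched) take the rest.
overall = end_to_end_speedup(fraction_accelerated=0.10, local_speedup=1.49)
gain_percent = (overall - 1.0) * 100.0  # roughly 3.4 under these assumptions
```

Under that assumption, even a 1.49x win on the fused operations moves the whole sub-block by only a few percent, which is the regime the paragraph above describes.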

How it learns: the referee is the teacher

The model is not trained to imitate a human. It is trained against the referee. Supervised fine-tuning on verified kernels, then reinforcement learning where the only reward is the referee's verdict: a kernel that compiles, matches PyTorch, and beats the compiler. No human labels anywhere. The model learns from its own verified wins.
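As a rough sketch of what "the only reward is the referee's verdict" can look like in code: the version below assumes a binary reward and a simple filter that keeps verified winners as the next round's training data. The actual reward shaping and data pipeline are not described here, so treat RefereeResult, reward, and verified_winners as hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RefereeResult:
    source: str      # the generated kernel source
    compiled: bool
    correct: bool    # matched the reference on the adversarial sweep
    speedup: float   # baseline time / kernel time


def reward(result: RefereeResult) -> float:
    """Assumed binary reward: 1.0 only for a compiling, correct, faster kernel."""
    wins = result.compiled and result.correct and result.speedup > 1.0
    return 1.0 if wins else 0.0


def verified_winners(results: List[RefereeResult]) -> List[str]:
    """Kernels the referee accepted; candidates for the next fine-tuning pass."""
    return [r.source for r in results if reward(r) == 1.0]


# Illustrative samples only.
samples = [
    RefereeResult("kernel_a", compiled=True, correct=True, speedup=1.3),
    RefereeResult("kernel_b", compiled=True, correct=True, speedup=0.9),
    RefereeResult("kernel_c", compiled=True, correct=False, speedup=2.5),
    RefereeResult("kernel_d", compiled=False, correct=False, speedup=0.0),
]
winners = verified_winners(samples)
```

Note that kernel_c scores zero despite the largest speedup: a fast wrong answer earns nothing, which is what keeps the reward from being gamed.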

One thing that surprised me: a long supervised pass could not teach the model some new operators, the ones (mish, softplus) that need exp and log overflow guards. The RL self-distillation loop could. On the first pass, the valid-kernel rate on softplus fusions was one to three out of eight. After self-distilling on that pass's verified winners, the second pass was eight out of eight. The verifier's reward beat corpus imitation for teaching a genuinely new skill.
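The overflow guard in question is a standard numerical trick, shown here in scalar Python rather than Triton. The naive softplus log(1 + exp(x)) overflows for large x; the guarded form rewrites it as max(x, 0) + log1p(exp(-|x|)), which never exponentiates a positive number. In Python the naive version raises OverflowError, whereas in a low-precision GPU kernel it would typically produce inf instead. The function names are illustrative.

```python
import math


def softplus_naive(x: float) -> float:
    # Overflows once exp(x) exceeds the float range.
    return math.log(1.0 + math.exp(x))


def softplus_stable(x: float) -> float:
    # exp() only ever sees a non-positive argument, so it cannot overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish_stable(x: float) -> float:
    # mish(x) = x * tanh(softplus(x)), built on the guarded softplus.
    return x * math.tanh(softplus_stable(x))
```

For moderate inputs the two softplus versions agree to rounding error; they differ only where the naive one breaks, which is exactly the region an adversarial "huge values" test exercises.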

What surprised me

  • The scarce thing is the verifier, not the model. I ran the same loop on MiniCPM5-1B and on Qwen3.6-27B against the same referee, and both learned to beat the compiler. The model is swappable. The referee is the product.
  • Search dominates on familiar operators; learning matters on unseen ones. A multi-seed ablation (three seeds, four arms) on the 1B showed the arms sitting within seed noise on operators the model already knew: sampling against the referee did most of the work. On never-trained operators, the loop's learning is what got there. I kept the null result instead of hiding it.
  • Let the referee overrule you. I kept a ledger of times the verifier contradicted my own predictions about what would work. It reached ten. When the model's losing grid cells flipped to wins, it was not via the split-row schedules I had predicted. The model's simpler whole-row kernels did it, falsifying my hypothesis. When you have a hard oracle, your intuitions are hypotheses, and most of them are wrong.

The stack

The models are OpenBMB's MiniCPM5-1B (the genuinely tiny smith, and yes it really is 1B) and Qwen3.6-27B (the bigger one), both fine-tuned with the OUROBOROS loop. They were trained on Modal H200s and are served on Modal with scale-to-zero. There is also a fully local, no-cloud-API mode that runs the 1B with llama.cpp and the in-process referee on the Space's own GPU. The front end is one Gradio Space whose entire interactive surface is a custom JavaScript machine-builder. Everything is MIT licensed: the models, the verified-kernel corpus, the evidence reports, and the code. The immutable referee that scores your kernel in the demo is the same one that trained the models.

Go mint one yourself. The green tick is earned.
