LLM-generated GPU kernels can pass the standard correctness test and still be wrong.
The industry oracle is one line: torch.allclose at one shape, one dtype, one seed. Every modern kernel benchmark uses it. It is blind to whole bug classes.
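To make the blind spot concrete, here is a minimal sketch of one bug class a single-shape check cannot see. It is an illustration, not a kernel from the corpus: `buggy_scale`, `reference` and `BLOCK` are hypothetical names, and NumPy stands in for PyTorch so it runs on CPU (`np.allclose` applies the same tolerance rule and defaults as `torch.allclose`).

```python
import numpy as np

BLOCK = 128  # hypothetical tile size of the generated kernel


def buggy_scale(x):
    """Stand-in for a generated kernel: doubles x, but only over full blocks.

    The ragged tail (x.size % BLOCK elements) is never written and stays zero.
    """
    out = np.zeros_like(x)
    full = (x.size // BLOCK) * BLOCK
    out[:full] = 2.0 * x[:full]
    return out


def reference(x):
    """Trusted implementation of the same op."""
    return 2.0 * x


rng = np.random.default_rng(0)

# The usual oracle: one shape, one dtype, one seed.
x = rng.standard_normal(1024).astype(np.float32)
passes_single = np.allclose(buggy_scale(x), reference(x))
print("n=1024:", passes_single)  # True: 1024 is a multiple of BLOCK

# Same kernel, a shape that is not a multiple of the tile size.
x_ragged = rng.standard_normal(1000).astype(np.float32)
passes_ragged = np.allclose(buggy_scale(x_ragged), reference(x_ragged))
print("n=1000:", passes_ragged)  # False: the last 104 elements are left at zero
```

The check itself is fine. The problem is that it only ever sees the one input where the bug is invisible.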
So I built the receipts:
- a 26-op corpus of correct and LLM-buggy kernels
- a differential fuzz vs an fp64 reference that catches what allclose misses
- a live demo you can click
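A rough sketch of what a differential fuzz against an fp64 reference can look like, under stated assumptions: this is not the code from the repos, every name here (`naive_softmax`, `reference_softmax_fp64`, `differential_fuzz`, `TOL`) is hypothetical, the tolerances are arbitrary placeholders, and NumPy again stands in for a real GPU kernel. The buggy op is a softmax without max-subtraction, which looks fine on unit-scale inputs and overflows on large ones.

```python
import numpy as np

# Hypothetical per-dtype (rtol, atol); a real harness would tune these per op.
TOL = {"float16": (1e-2, 1e-3), "float32": (1e-4, 1e-6)}


def naive_softmax(x):
    """Stand-in for a buggy kernel: no max-subtraction, so exp() can overflow."""
    with np.errstate(over="ignore", invalid="ignore"):
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)


def reference_softmax_fp64(x):
    """Numerically stable softmax computed in float64."""
    x64 = x.astype(np.float64)
    e = np.exp(x64 - x64.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def differential_fuzz(kernel, reference, shapes, dtypes, scales, seeds):
    """Sweep shape x dtype x input scale x seed; return the configs that disagree."""
    failures = []
    for shape in shapes:
        for dtype in dtypes:
            name = np.dtype(dtype).name
            rtol, atol = TOL[name]
            for scale in scales:
                for seed in seeds:
                    rng = np.random.default_rng(seed)
                    x = (rng.standard_normal(shape) * scale).astype(dtype)
                    got = kernel(x).astype(np.float64)
                    want = reference(x)
                    # NaN and inf count as mismatches (equal_nan defaults to False).
                    if not np.allclose(got, want, rtol=rtol, atol=atol):
                        failures.append((shape, name, scale, seed))
    return failures


failures = differential_fuzz(
    naive_softmax,
    reference_softmax_fp64,
    shapes=[(4, 64), (3, 1000)],
    dtypes=[np.float16, np.float32],
    scales=[1.0, 100.0],
    seeds=range(3),
)
print(len(failures), "failing configurations")
for config in failures[:5]:
    print(config)
```

In this sketch the float32 runs at scale 1.0 agree with the reference, which is exactly what a single allclose call would report, while the scale 100.0 runs produce NaN and are flagged. Each failure is recorded as a (shape, dtype, scale, seed) tuple so it can be replayed.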
The Correctness Illusion in LLM-Generated GPU Kernels (2606.20128)
dipankarsarkar/gpuemu-corpus
dipankarsarkar/the-correctness-illusion
What is your team's actual correctness oracle for generated kernels?