Validates every WGSL compute shader against CPU reference implementations.
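As a rough illustration of what such a validation step might involve, the sketch below compares a shader's read-back output with a CPU reference using a mixed absolute/relative tolerance. The helper name `compareToReference`, the result shape, and the default tolerances are assumptions made for this example, not the tool's actual code.

```typescript
// Hypothetical helper: checks GPU output against a CPU reference, element by element.
interface CompareResult {
  pass: boolean;      // true only if every element is within tolerance
  maxAbsErr: number;  // largest absolute difference observed
  worstIndex: number; // index of that difference, or -1 if none was recorded
}

function compareToReference(
  gpuOut: Float32Array,
  cpuRef: Float32Array,
  absTol = 1e-4, // assumed default; tune per shader
  relTol = 1e-3, // assumed default; tune per shader
): CompareResult {
  // A length mismatch is always a failure.
  if (gpuOut.length !== cpuRef.length) {
    return { pass: false, maxAbsErr: Infinity, worstIndex: -1 };
  }
  let pass = true;
  let maxAbsErr = 0;
  let worstIndex = -1;
  for (let i = 0; i < cpuRef.length; i++) {
    const err = Math.abs(gpuOut[i] - cpuRef[i]);
    const tol = absTol + relTol * Math.abs(cpuRef[i]);
    // Written as !(err <= tol) so that a NaN error counts as a failure.
    if (!(err <= tol)) pass = false;
    if (err > maxAbsErr) {
      maxAbsErr = err;
      worstIndex = i;
    }
  }
  return { pass, maxAbsErr, worstIndex };
}
```

Reporting the worst index alongside the pass flag makes it easier to tell an isolated rounding difference from a systematic indexing bug.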
Profiles a single forward pass with per-operation timing. Requires a loaded model: enter a repo below and load it first.
Measures GPTQ matvec memory bandwidth at model-realistic sizes. Compares regular vs split-K dispatch strategies.
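A memory-bandwidth figure of this kind is typically the number of bytes a kernel must move divided by its elapsed time. The sketch below shows that arithmetic for a quantized matvec. The byte accounting (packed weights and zero points, f16 scales per group, f32 input and output vectors) and the names `gptqMatvecBytes` and `effectiveBandwidthGBps` are assumptions for illustration; the benchmark's own accounting may differ.

```typescript
// Hypothetical shape description for a K x N quantized matrix-vector product.
interface MatvecShape {
  K: number;         // input features (rows of the weight matrix)
  N: number;         // output features (columns)
  bits: number;      // bits per quantized weight, e.g. 4
  groupSize: number; // rows sharing one scale/zero pair, e.g. 128
}

// Estimates the bytes read and written by one matvec under the assumptions above.
function gptqMatvecBytes({ K, N, bits, groupSize }: MatvecShape): number {
  const groups = Math.ceil(K / groupSize);
  const weightBytes = (K * N * bits) / 8;    // packed quantized weights
  const zeroBytes = (groups * N * bits) / 8; // packed zero points (assumed layout)
  const scaleBytes = groups * N * 2;         // f16 scales (assumed)
  const inputBytes = K * 4;                  // f32 input vector (assumed)
  const outputBytes = N * 4;                 // f32 output vector (assumed)
  return weightBytes + zeroBytes + scaleBytes + inputBytes + outputBytes;
}

// Converts a byte count and an elapsed time in milliseconds to GB/s (1 GB = 1e9 bytes).
function effectiveBandwidthGBps(bytes: number, elapsedMs: number): number {
  return bytes / (elapsedMs * 1e-3) / 1e9;
}
```

Because both dispatch strategies touch the same data, applying the same byte count to each measured time gives directly comparable bandwidth numbers.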