Ternary GEMV kernel lab — your GPU

Runs the batch-1 BitNet decode kernel (weights read once per token) in several variants against a ~0.69 GB ternary matrix on your actual GPU, and reports the decode rate each variant reaches against the 152 GB/s memory-bandwidth roofline (about 220 tok/s).
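
The 220 tok/s figure follows from the two numbers above: if each token reads the whole matrix once, the ceiling is bandwidth divided by bytes read per token. A minimal sketch of that arithmetic, assuming weight reads dominate memory traffic (activation traffic is ignored) and that both quantities use the same GB unit. The function names are hypothetical and not part of the lab itself.

```python
# Roofline estimate for batch-1 decode (illustrative sketch, not the lab's code).
# Assumption: each token streams the full weight matrix exactly once, so the
# decode rate is bounded by memory bandwidth / bytes read per token.

def roofline_tok_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s when weight reads dominate memory traffic."""
    return bandwidth_gb_s / weights_gb


def roofline_fraction(measured_tok_per_s: float,
                      bandwidth_gb_s: float,
                      weights_gb: float) -> float:
    """Fraction of the roofline that a measured decode rate achieves."""
    return measured_tok_per_s / roofline_tok_per_s(bandwidth_gb_s, weights_gb)


BANDWIDTH_GB_S = 152.0  # roofline bandwidth quoted above
WEIGHTS_GB = 0.69       # approximate ternary matrix size quoted above

ceiling = roofline_tok_per_s(BANDWIDTH_GB_S, WEIGHTS_GB)
print(f"roofline: {ceiling:.0f} tok/s")  # prints "roofline: 220 tok/s"
```

A variant's measured rate can be passed to roofline_fraction to express it as a share of the ceiling; a value near 1.0 would mean the kernel is close to bandwidth-bound.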
