The kernel is memory-latency-bound: it reaches 45 GB/s against a 152 GB/s roofline, about 30%. Here each 64-thread group computes B output rows, so each thread issues B independent weight loads per step, which puts more loads in flight to hide latency, while the activation reads are shared across those rows. B=1 is today's kernel. Higher B should climb toward the 152 GB/s roofline until register pressure drops occupancy.
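The loop structure can be sketched on the CPU as follows. This is a minimal emulation and not the actual kernel: it assumes the operation is a matrix-vector product with weights stored row-major, and the names `group_matvec`, `matvec_blocked`, `W`, `x`, and `y` are hypothetical. The serial loop over lanes stands in for the 64 threads of a group, so it shows only the access pattern (one activation read feeding B independent weight loads and B accumulators), not the latency behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kGroupSize = 64;  // threads per group, as in the text

// Emulates one thread group computing B consecutive rows of y = W * x.
// Assumption: W is row-major with `cols` elements per row.
template <std::size_t B>
void group_matvec(const std::vector<float>& W, const std::vector<float>& x,
                  std::vector<float>& y, std::size_t row0, std::size_t cols) {
    static_assert(B > 0, "B must be at least 1");
    float group_sum[B] = {};
    for (std::size_t lane = 0; lane < kGroupSize; ++lane) {
        float acc[B] = {};  // B accumulators per thread: the register cost of B
        for (std::size_t k = lane; k < cols; k += kGroupSize) {
            const float a = x[k];  // one activation read, shared by all B rows
            for (std::size_t b = 0; b < B; ++b) {
                // B weight loads per step, none depending on another
                acc[b] += W[(row0 + b) * cols + k] * a;
            }
        }
        // Stand-in for the cross-thread reduction a real kernel would perform.
        for (std::size_t b = 0; b < B; ++b) group_sum[b] += acc[b];
    }
    for (std::size_t b = 0; b < B; ++b) y[row0 + b] = group_sum[b];
}

// Runs one emulated group per block of B rows. B = 1 matches the
// one-row-per-group arrangement the text calls today's kernel.
template <std::size_t B>
std::vector<float> matvec_blocked(const std::vector<float>& W,
                                  const std::vector<float>& x,
                                  std::size_t rows, std::size_t cols) {
    assert(rows % B == 0);  // sketch only: no handling of a partial row block
    assert(W.size() == rows * cols && x.size() == cols);
    std::vector<float> y(rows, 0.0f);
    for (std::size_t row0 = 0; row0 < rows; row0 += B) {
        group_matvec<B>(W, x, y, row0, cols);
    }
    return y;
}
```

Each increment of B adds one accumulator and one in-flight weight value per thread, which is where the register pressure in the trade-off comes from.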