Native CUDA BF16 GEMV kernels from FlashRT for the M=1 decode case (a single activation row per call).
Available functions:
bf16_decode_gemv_bf16
bf16_decode_gemv_unrolled_bf16
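
For orientation, an M=1 GEMV multiplies a single activation vector of length K by a weight matrix to produce N outputs. The sketch below is a plain-Python reference of that arithmetic with BF16 rounding on inputs and outputs. It is not the FlashRT kernel and does not reproduce its signature: the names `to_bf16` and `gemv_m1_reference`, the N x K row-major weight layout, the round-to-nearest-even conversion, and the high-precision accumulation are all assumptions made for illustration. See README.md for the actual calling convention.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value (ties to even).

    Hypothetical helper: BF16 keeps the top 16 bits of an IEEE-754
    float32, so we round the float32 bit pattern and clear the low half.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Rounding bias: 0x7FFF plus the lowest kept bit gives ties-to-even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gemv_m1_reference(x, w):
    """Reference y = W @ x for one activation row (M=1).

    x: list of K floats (the activation vector).
    w: list of N rows, each a list of K floats (assumed N x K layout).
    Inputs are rounded to BF16, products are accumulated in Python
    floats (an assumption about accumulation precision), and each
    output is rounded back to BF16.
    """
    xq = [to_bf16(v) for v in x]
    y = []
    for row in w:
        if len(row) != len(xq):
            raise ValueError("each weight row must have K elements")
        acc = 0.0
        for a, b in zip(xq, row):
            acc += a * to_bf16(b)
        y.append(to_bf16(acc))
    return y


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    w = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
    print(gemv_m1_reference(x, w))
```

A reference like this can serve as a slow oracle when comparing kernel output on small shapes, provided the tolerance accounts for BF16 having only 8 bits of significand precision.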
See README.md and VALIDATION.md.