Native CUDA FlashRT grouped MoE GEMV kernels for BF16 activations and NVFP4 weights.
Available functions:
  w4a16_decode_gemv_bf16
  grouped_w4a16_gemv_bf16
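
For orientation, the arithmetic behind a W4A16 GEMV can be sketched in plain Python. This is a slow reference under stated assumptions, not the signature or implementation of the functions listed above. NVFP4 is commonly described as 4-bit E2M1 values that share one scale per block of 16 elements. The sketch assumes block scales are already decoded to Python floats (with any per-tensor scale folded in), that two codes are packed per byte with the low nibble first, and that "grouped" means each input vector is multiplied by the weight matrix of the expert it was routed to. All names here (decode_e2m1, unpack_row, reference_gemv, reference_grouped_gemv) are hypothetical.

```python
# Reference sketch only: names, packing order, and scale layout are assumptions.

# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # number of weights sharing one block scale


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code to a float."""
    mag = E2M1_MAGNITUDES[code & 0x7]
    return -mag if code & 0x8 else mag


def unpack_row(packed):
    """Unpack bytes into floats, assuming low nibble first in each byte."""
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0xF))
        out.append(decode_e2m1(byte >> 4))
    return out


def reference_gemv(packed_rows, block_scales, x):
    """Compute y[n] = sum_k W[n][k] * scale[n][k // BLOCK] * x[k]."""
    y = []
    for packed, scales in zip(packed_rows, block_scales):
        acc = 0.0
        for k, (wk, xk) in enumerate(zip(unpack_row(packed), x)):
            acc += wk * scales[k // BLOCK] * xk
        y.append(acc)
    return y


def reference_grouped_gemv(expert_weights, expert_scales, xs, expert_ids):
    """Run one GEMV per input vector, using the weights of its assigned expert."""
    return [
        reference_gemv(expert_weights[e], expert_scales[e], x)
        for x, e in zip(xs, expert_ids)
    ]
```

A real kernel would keep activations in BF16 and accumulate in higher precision on the GPU. The sketch uses Python floats throughout, so it only illustrates the dequantize-then-accumulate structure.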