---
license: apache-2.0
library_name: kernels
tags:
- kernels
- cuda
- quantization
- llm
---

# flute_kernels

CUDA matmul kernels for LUT-quantized LLMs, packaged for the [`kernels`](https://huggingface.co/docs/kernels/index) library.

Upstream: [hanguo97/flute](https://github.com/hanguo97/flute) (Han Guo et al., Apache-2.0).

## Use

```python
import torch
from kernels import get_kernel

flute = get_kernel("galqiwi/flute_kernels", version=1)

# qgemm: y = x · dequant(Q, table)·s
y = flute.qgemm(x, Q, scales, table, table2, workspace,
                num_bits, group_size, template_id, num_sms)

# fused HadaCore rotation + qgemm (HIGGS path)
y = flute.qgemm_hadamard(x, Q, scales, table, table2, workspace,
                         num_bits, group_size, hadamard_size,
                         template_id, num_sms)

# stand-alone Hadamard transform (HadaCore, fp16/bf16, pow-2 dim ≤ 32768)
y = flute.hadamard_transform(x, inplace=False)
```

## Load-time helpers

```python
flute.utils.pack(W, num_bits, template_ids, num_sms)
flute.utils.make_qmap2_from_qmap(qmap)
flute.utils.get_workspace_streamk(device)
flute.utils.get_template_config(num_bits, template_id, num_sms)
flute.utils.get_template_ids(num_bits)
flute.utils.is_template_supported(M, N, K, num_bits, template_id, num_sms)
flute.utils.get_device_num_sms(device)
flute.TEMPLATE_CONFIGS  # the pre-tuned config dict
```

## Attribution

The CUDA code is adapted from [hanguo97/flute](https://github.com/hanguo97/flute) (Apache-2.0). The HadaCore kernel is borrowed from [pytorch-labs/applied-ai](https://github.com/pytorch-labs/applied-ai). Built against [NVIDIA CUTLASS](https://github.com/NVIDIA/cutlass) v3.5 (BSD-3-Clause); upstream FLUTE pins v3.4.1, but the CuTe API is stable across 3.x.
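
## Reference sketch

The `qgemm` formula above, y = x · dequant(Q, table)·s, can be illustrated with a small pure-Python reference. This is only a sketch of the arithmetic, not the kernel's implementation. It assumes an unpacked K×N matrix of integer codes, groups of `group_size` rows along K, and one scale per group and output column. The function names `lut_dequant` and `qgemm_reference` are hypothetical, and the packed layout produced by `flute.utils.pack`, the `table2` argument, and fp16/bf16 rounding are not modeled.

```python
def lut_dequant(codes, table, scales, group_size):
    """Look up each code in the table and apply its group scale.

    Assumed shapes (illustrative, not the packed kernel layout):
      codes:  K x N integers in [0, 2**num_bits)
      table:  2**num_bits floats (the lookup table)
      scales: (K // group_size) x N floats
    """
    K, N = len(codes), len(codes[0])
    return [
        [table[codes[k][n]] * scales[k // group_size][n] for n in range(N)]
        for k in range(K)
    ]


def qgemm_reference(x, codes, table, scales, group_size):
    """Compute y = x @ W, where W is the dequantized K x N weight matrix."""
    W = lut_dequant(codes, table, scales, group_size)
    K, N = len(W), len(W[0])
    return [[sum(row[k] * W[k][n] for k in range(K)) for n in range(N)] for row in x]


# Toy 2-bit example: 4 table entries, K = 4, N = 2, group_size = 2.
table = [-1.0, -0.5, 0.5, 1.0]
codes = [[0, 3], [1, 2], [2, 1], [3, 0]]
scales = [[2.0, 1.0], [1.0, 4.0]]
x = [[1.0, 2.0, 3.0, 4.0]]

y = qgemm_reference(x, codes, table, scales, group_size=2)
print(y)  # [[1.5, -20.0]]
```

A reference like this can serve as a mental model when checking shapes and scale grouping. Comparing it against the CUDA kernel would additionally require matching the packed weight layout and the half-precision tolerances.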