---
license: apache-2.0
library_name: kernels
tags:
- kernels
- cuda
- quantization
- llm
---

# flute_kernels

CUDA matmul kernels for LUT-quantized LLMs, packaged for the
[`kernels`](https://huggingface.co/docs/kernels/index) library.

Upstream: [hanguo97/flute](https://github.com/hanguo97/flute)
(Han Guo et al., Apache-2.0).

## Use

```python
import torch
from kernels import get_kernel

flute = get_kernel("galqiwi/flute_kernels", version=1)

# qgemm: y = x · (dequant(Q, table) * scales)
y = flute.qgemm(x, Q, scales, table, table2, workspace,
                num_bits, group_size, template_id, num_sms)

# fused HadaCore rotation + qgemm (HIGGS path)
y = flute.qgemm_hadamard(x, Q, scales, table, table2, workspace,
                         num_bits, group_size, hadamard_size,
                         template_id, num_sms)

# stand-alone Hadamard transform (HadaCore, fp16/bf16, pow-2 dim ≤ 32768)
y = flute.hadamard_transform(x, inplace=False)
```
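To make the comments above concrete, here is a minimal pure-Python reference for the math only, not for the kernel. The unpacked `q` layout, the `scales[k // group_size][n]` indexing, and the `1/sqrt(n)` Hadamard normalization are assumptions chosen for illustration; the packed format that `flute.utils.pack` produces, and the scaling HadaCore applies, may differ. `lut_dequant`, `matmul`, and `hadamard` are hypothetical helper names, not part of `flute`.

```python
import math

def lut_dequant(q, table, scales, group_size):
    """Dequantize a K x N matrix of integer codes via a lookup table.

    q[k][n] indexes into `table`; scales[k // group_size][n] rescales each
    group of `group_size` rows. This unpacked layout is an assumption.
    """
    return [
        [table[code] * scales[k // group_size][n] for n, code in enumerate(row)]
        for k, row in enumerate(q)
    ]

def matmul(x, w):
    """Plain (M x K) @ (K x N) product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def hadamard(v):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(v)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(v)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # assumed normalization
    return [value * scale for value in out]

table = [-1.0, -0.25, 0.25, 1.0]       # 2-bit lookup table (made-up values)
q = [[0, 3], [1, 2], [3, 3], [2, 0]]   # K=4, N=2 codes
scales = [[2.0, 1.0], [0.5, 4.0]]      # one row of scales per group of 2 along K
x = [[1.0, 2.0, 3.0, 4.0]]             # M=1, K=4

w = lut_dequant(q, table, scales, group_size=2)
y = matmul(x, w)                        # reference for the qgemm math
x_rot = hadamard(x[0])                  # reference for the Hadamard rotation
```

With this normalization the transform is its own inverse, which gives a cheap sanity check when comparing against a GPU result on small inputs.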

## Load-time helpers

```python
flute.utils.pack(W, num_bits, template_ids, num_sms)
flute.utils.make_qmap2_from_qmap(qmap)
flute.utils.get_workspace_streamk(device)
flute.utils.get_template_config(num_bits, template_id, num_sms)
flute.utils.get_template_ids(num_bits)
flute.utils.is_template_supported(M, N, K, num_bits, template_id, num_sms)
flute.utils.get_device_num_sms(device)
flute.TEMPLATE_CONFIGS    # the pre-tuned config dict
```
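One way these helpers might be combined at load time is to filter the template ids down to those usable for a given problem shape. The sketch below assumes that `get_template_ids` returns an iterable of ids and that `is_template_supported` returns a truthy value; neither is confirmed here. `supported_template_ids` is a hypothetical helper, and `fake_utils` is a stand-in with made-up values so the example runs without a GPU.

```python
from types import SimpleNamespace

def supported_template_ids(utils, M, N, K, num_bits, num_sms):
    """Return the template ids that `utils` reports as usable for this shape.

    `utils` is expected to expose get_template_ids(num_bits) and
    is_template_supported(M, N, K, num_bits, template_id, num_sms).
    """
    return [
        template_id
        for template_id in utils.get_template_ids(num_bits)
        if utils.is_template_supported(M, N, K, num_bits, template_id, num_sms)
    ]

# Stand-in for flute.utils; the ids and the support rule are invented.
fake_utils = SimpleNamespace(
    get_template_ids=lambda num_bits: [0, 1, 2],
    is_template_supported=(
        lambda M, N, K, num_bits, template_id, num_sms: template_id != 1
    ),
)

candidates = supported_template_ids(fake_utils, 1, 4096, 4096, 4, 108)
```

In real use, `flute.utils` would be passed in place of `fake_utils`, with `num_sms` taken from `flute.utils.get_device_num_sms(device)`.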

## Attribution

The CUDA code is adapted from
[hanguo97/flute](https://github.com/hanguo97/flute) (Apache-2.0).
The HadaCore kernel is borrowed from
[pytorch-labs/applied-ai](https://github.com/pytorch-labs/applied-ai).
Built against [NVIDIA CUTLASS](https://github.com/NVIDIA/cutlass) v3.5
(BSD-3-Clause); upstream FLUTE pins v3.4.1, but the CuTe API is stable across 3.x.