metadata
tags:
- llama-cpp
- turboquant
- triattention
- kv-cache
- windows
- cuda
license: mit
llama.cpp TurboQuant + TriAttention — Windows CUDA 13 Binaries
Pre-built Windows x64 Release binaries for the atomicmilkshake/llama-cpp-turboquant fork.
This builds adds TurboQuant (custom quantization) and TriAttention (GPU-accelerated KV cache pruning based on arXiv 2604.04921) on top of llama.cpp.
Download
llama-turboquant-triattention-win-cu13-x64.zip (~179 MB)
Requirements
- Windows 10/11 x64
- NVIDIA GPU (Turing+, GTX 1600 / RTX 2000 series or newer)
- CUDA 13.x runtime — install from developer.nvidia.com/cuda-downloads (the cublasLt64_13.dll is NOT included in the zip due to its 432 MB size)
Usage
llama-server.exe -m YourModel.gguf -c 32768 -ngl 99 --port 8080 ^ --triattention-stats model.triattention ^ --triattention-budget 4096 ^ --triattention-window 256 ^ --triattention-log
TriAttention Performance
Tested on Qwen3-8B Q4_K_M, RTX 3080, -c 512, udget=256:
| Mode | Prune time | Generation |
|---|---|---|
| No pruning | — | 17.5 tok/s |
| CPU scoring | ~5900 ms/event | 17.5 tok/s |
| GPU scoring | ~4-9 ms/event | 75.0 tok/s |
~1000x speedup on pruning events; 4.3x overall throughput improvement.
TriAttention Flags
| Flag | Description | Default |
|---|---|---|
| --triattention-stats | Calibration file (required to enable) | — |
| --triattention-budget | Max KV tokens to retain | 512 |
| --triattention-window | Recent-token protection window | 64 |
| --triattention-trigger | slack|interval|ill | slack |
| --triattention-log | Log each prune event | off |
| --triattention-no-protect-prefill | Allow evicting prompt tokens | off |
Source
github.com/atomicmilkshake/llama-cpp-turboquant — branch eature/triattention