| --- |
| tags: |
| - llama-cpp |
| - turboquant |
| - triattention |
| - kv-cache |
| - windows |
| - cuda |
| license: mit |
| --- |
| |
| # llama.cpp TurboQuant + TriAttention β Windows CUDA 13 Binaries |
|
|
| Pre-built Windows x64 Release binaries for the [atomicmilkshake/llama-cpp-turboquant](https://github.com/atomicmilkshake/llama-cpp-turboquant) fork. |
|
|
| This builds adds **TurboQuant** (custom quantization) and **TriAttention** (GPU-accelerated KV cache pruning based on [arXiv 2604.04921](https://arxiv.org/abs/2604.04921)) on top of llama.cpp. |
|
|
| ## Download |
|
|
| **[llama-turboquant-triattention-win-cu13-x64.zip](llama-turboquant-triattention-win-cu13-x64.zip)** (~179 MB) |
|
|
| ## Requirements |
|
|
| - Windows 10/11 x64 |
| - NVIDIA GPU (Turing+, GTX 1600 / RTX 2000 series or newer) |
| - CUDA 13.x runtime β install from [developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads) (the cublasLt64_13.dll is NOT included in the zip due to its 432 MB size) |
| |
| ## Usage |
| |
| ` |
| llama-server.exe -m YourModel.gguf -c 32768 -ngl 99 --port 8080 ^ |
| --triattention-stats model.triattention ^ |
| --triattention-budget 4096 ^ |
| --triattention-window 256 ^ |
| --triattention-log |
| ` |
| |
| ## TriAttention Performance |
| |
| Tested on Qwen3-8B Q4_K_M, RTX 3080, -c 512, udget=256: |
| |
| | Mode | Prune time | Generation | |
| |------|-----------|------------| |
| | No pruning | β | 17.5 tok/s | |
| | CPU scoring | ~5900 ms/event | 17.5 tok/s | |
| | **GPU scoring** | **~4-9 ms/event** | **75.0 tok/s** | |
| |
| ~1000x speedup on pruning events; 4.3x overall throughput improvement. |
| |
| ## TriAttention Flags |
| |
| | Flag | Description | Default | |
| |------|-------------|---------| |
| | --triattention-stats <file> | Calibration file (**required** to enable) | β | |
| | --triattention-budget <n> | Max KV tokens to retain | 512 | |
| | --triattention-window <n> | Recent-token protection window | 64 | |
| | --triattention-trigger | slack\|interval\|ill | slack | |
| | --triattention-log | Log each prune event | off | |
| | --triattention-no-protect-prefill | Allow evicting prompt tokens | off | |
| |
| ## Source |
| |
| [github.com/atomicmilkshake/llama-cpp-turboquant](https://github.com/atomicmilkshake/llama-cpp-turboquant) β branch eature/triattention |
| |