atomicmilkshake
/

llama-cpp-turboquant-binaries

Model card Files Files and versions

llama-cpp-turboquant-binaries / README.md

atomicmilkshake's picture

atomicmilkshake

Add README

402c910 verified 2 days ago

|

history blame contribute delete

2.14 kB

	---
	tags:
	- llama-cpp
	- turboquant
	- triattention
	- kv-cache
	- windows
	- cuda
	license: mit
	---

	# llama.cpp TurboQuant + TriAttention — Windows CUDA 13 Binaries

	Pre-built Windows x64 Release binaries for the [atomicmilkshake/llama-cpp-turboquant](https://github.com/atomicmilkshake/llama-cpp-turboquant) fork.

	This builds adds TurboQuant (custom quantization) and TriAttention (GPU-accelerated KV cache pruning based on [arXiv 2604.04921](https://arxiv.org/abs/2604.04921)) on top of llama.cpp.

	## Download

	[llama-turboquant-triattention-win-cu13-x64.zip](llama-turboquant-triattention-win-cu13-x64.zip) (~179 MB)

	## Requirements

	- Windows 10/11 x64
	- NVIDIA GPU (Turing+, GTX 1600 / RTX 2000 series or newer)
	- CUDA 13.x runtime — install from [developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads) (the cublasLt64_13.dll is NOT included in the zip due to its 432 MB size)

	## Usage

	`
	llama-server.exe -m YourModel.gguf -c 32768 -ngl 99 --port 8080 ^
	--triattention-stats model.triattention ^
	--triattention-budget 4096 ^
	--triattention-window 256 ^
	--triattention-log
	`

	## TriAttention Performance

	Tested on Qwen3-8B Q4_K_M, RTX 3080, -c 512, udget=256:

	\| Mode \| Prune time \| Generation \|
	\|------\|-----------\|------------\|
	\| No pruning \| — \| 17.5 tok/s \|
	\| CPU scoring \| ~5900 ms/event \| 17.5 tok/s \|
	\| GPU scoring \| ~4-9 ms/event \| 75.0 tok/s \|

	~1000x speedup on pruning events; 4.3x overall throughput improvement.

	## TriAttention Flags

	\| Flag \| Description \| Default \|
	\|------\|-------------\|---------\|
	\| --triattention-stats <file> \| Calibration file (required to enable) \| — \|
	\| --triattention-budget <n> \| Max KV tokens to retain \| 512 \|
	\| --triattention-window <n> \| Recent-token protection window \| 64 \|
	\| --triattention-trigger \| slack\\|interval\\|ill \| slack \|
	\| --triattention-log \| Log each prune event \| off \|
	\| --triattention-no-protect-prefill \| Allow evicting prompt tokens \| off \|

	## Source

	[github.com/atomicmilkshake/llama-cpp-turboquant](https://github.com/atomicmilkshake/llama-cpp-turboquant) — branch eature/triattention