tags:
  - llama-cpp
  - turboquant
  - triattention
  - kv-cache
  - windows
  - cuda
license: mit

llama.cpp TurboQuant + TriAttention — Windows CUDA 13 Binaries

Pre-built Windows x64 Release binaries for the atomicmilkshake/llama-cpp-turboquant fork.

This build adds TurboQuant (custom quantization) and TriAttention (GPU-accelerated KV cache pruning, based on arXiv 2604.04921) on top of llama.cpp.

Download

llama-turboquant-triattention-win-cu13-x64.zip (~179 MB)

Requirements

Windows 10/11 x64
NVIDIA GPU (Turing+, GTX 1600 / RTX 2000 series or newer)
CUDA 13.x runtime: install it from developer.nvidia.com/cuda-downloads. cublasLt64_13.dll is NOT included in the zip because of its 432 MB size, so it must come from the CUDA installation.
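Because the zip does not ship cublasLt64_13.dll, it can help to confirm that the CUDA runtime is discoverable before launching. The following is a minimal sketch, assuming the DLL is expected in one of the directories listed in PATH; Windows also searches other locations (such as the executable's own directory), so a miss here is a hint rather than proof. The helper name find_on_path is hypothetical and not part of this build.

```python
import os


def find_on_path(filename, path_value=None):
    """Return the first directory in a PATH-style string that contains filename, or None.

    Hypothetical helper for illustration; it only inspects PATH entries.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    for directory in path_value.split(os.pathsep):
        # Skip empty entries produced by stray separators.
        if directory and os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None


location = find_on_path("cublasLt64_13.dll")
if location is None:
    print("cublasLt64_13.dll not found on PATH; check the CUDA 13.x runtime install")
else:
    print("cublasLt64_13.dll found in", location)
```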

Usage

llama-server.exe -m YourModel.gguf -c 32768 -ngl 99 --port 8080 ^
  --triattention-stats model.triattention ^
  --triattention-budget 4096 ^
  --triattention-window 256 ^
  --triattention-log
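If the server is started from a script rather than a cmd prompt, the same command can be assembled as an argument list, which avoids the ^ line continuations. This is a rough sketch that uses only the flags shown above; the function names build_server_args and launch_server, and the default values, are illustrative assumptions mirroring the example command.

```python
import subprocess


def build_server_args(model, stats_file, ctx=32768, budget=4096,
                      window=256, port=8080, log=True):
    """Build the llama-server.exe argument list from the usage example.

    Hypothetical helper; defaults mirror the example command, not the
    binary's built-in defaults.
    """
    args = [
        "llama-server.exe",
        "-m", model,
        "-c", str(ctx),
        "-ngl", "99",
        "--port", str(port),
        "--triattention-stats", stats_file,
        "--triattention-budget", str(budget),
        "--triattention-window", str(window),
    ]
    if log:
        # Log each prune event.
        args.append("--triattention-log")
    return args


def launch_server(args):
    """Start the server as a child process and return the Popen handle."""
    return subprocess.Popen(args)


print(" ".join(build_server_args("YourModel.gguf", "model.triattention")))
```

Calling launch_server(build_server_args(...)) from the directory that contains the extracted binaries would start the server; it is left uncalled here so the snippet runs without the executable present.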

TriAttention Performance

Tested on Qwen3-8B Q4_K_M, RTX 3080, -c 512, budget=256:

Mode	Prune time	Generation
No pruning	—	17.5 tok/s
CPU scoring	~5900 ms/event	17.5 tok/s
GPU scoring	~4-9 ms/event	75.0 tok/s

GPU scoring makes each pruning event roughly 1000x faster than CPU scoring and raises generation throughput about 4.3x (17.5 to 75.0 tok/s).
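The summary figures follow directly from the table. A short check of the arithmetic, using only the numbers reported above (the variable names are illustrative):

```python
# Values copied from the benchmark table above.
cpu_prune_ms = 5900.0                 # CPU scoring, ms per prune event
gpu_prune_ms_low, gpu_prune_ms_high = 4.0, 9.0  # GPU scoring range, ms per event
baseline_tok_s = 17.5                 # no pruning / CPU scoring
gpu_tok_s = 75.0                      # GPU scoring

# Pruning speedup at both ends of the reported GPU range.
prune_speedup_best = cpu_prune_ms / gpu_prune_ms_low
prune_speedup_worst = cpu_prune_ms / gpu_prune_ms_high

# Overall generation throughput ratio.
throughput_gain = gpu_tok_s / baseline_tok_s

print(f"prune speedup: {prune_speedup_worst:.0f}x to {prune_speedup_best:.0f}x")
print(f"throughput gain: {throughput_gain:.1f}x")
```

The pruning speedup spans several hundred to well over a thousand times depending on where in the 4-9 ms range an event falls, which is where the approximate 1000x figure comes from.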

TriAttention Flags

Flag	Description	Default
--triattention-stats	Calibration file (required to enable)	—
--triattention-budget	Max KV tokens to retain	512
--triattention-window	Recent-token protection window	64
--triattention-trigger	slack|interval|fill	slack
--triattention-log	Log each prune event	off
--triattention-no-protect-prefill	Allow evicting prompt tokens	off
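To make the budget and window flags more concrete, here is a conceptual sketch of budgeted KV cache pruning with a protected recent-token window. It is not the fork's implementation: the scoring values, the function name select_kept_tokens, and the selection rule are assumptions chosen only to illustrate how a budget and a protection window can interact. Prefill protection and trigger modes are not modeled.

```python
def select_kept_tokens(scores, budget, window):
    """Pick which token positions to keep in the KV cache.

    Illustrative only: always keeps the most recent `window` positions,
    then fills the rest of `budget` with the highest-scoring older ones.
    `scores` is a hypothetical per-token importance list.
    """
    n = len(scores)
    if n <= budget:
        # Cache is within budget; nothing to evict.
        return list(range(n))
    window = min(window, budget)
    recent_start = n - window
    recent = list(range(recent_start, n))
    remaining = budget - window
    # Rank older positions by score, highest first, and keep the top ones.
    ranked_older = sorted(range(recent_start), key=lambda i: scores[i], reverse=True)
    kept_older = sorted(ranked_older[:remaining])
    return kept_older + recent


example_scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4]
print(select_kept_tokens(example_scores, budget=4, window=2))
```

In this toy run the last two positions are protected by the window, and the two highest-scoring older positions fill the rest of the budget.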

Source

github.com/atomicmilkshake/llama-cpp-turboquant, branch feature/triattention