AI Engineering Lab committed on
Commit ·
740999f
1
Parent(s): a99aedc
docs: overhaul README — architecture diagram, fixed anchors, VRAM visualization, bilingual
README.md
CHANGED
|
@@ -24,223 +24,277 @@ tags:
|
|
| 24 |
<div align="center">
|
| 25 |
|
| 26 |
[](https://github.com/AI-Engineerings-at/llama-cpp-turboquant-guide)
|
|
|
|
| 27 |

|
| 28 |
-
 — Mistral-Small-3.2-24B Q4_K_M
|
| 53 |
|
| 54 |
-
*4 independent benchmark runs, 15 total measurements.*
|
| 55 |
|
| 56 |
| | Baseline (f16) | TurboQuant turbo3 | Delta |
|
| 57 |
|--|:--------------:|:-----------------:|:-----:|
|
| 58 |
-
| **Context** | 8,192
|
| 59 |
-
| **VRAM** | 15.3 GB | 17.1 GB | +1.8 GB
|
| 60 |
| **Tokens/s** | 51.0 | 47.2 | **−7.5%** |
|
| 61 |
-
| **KV-Cache
|
| 62 |
|
| 63 |
-
|
| 64 |
|
| 65 |
-
|
| 66 |
|
| 67 |
### RTX 4070 Laptop (8 GB) — Llama-3.1-8B-Instruct Q4_K_M
|
| 68 |
|
| 69 |
-
*2 independent benchmark sessions
|
| 70 |
|
| 71 |
| | Baseline (f16) | TurboQuant turbo3 | Delta |
|
| 72 |
|--|:--------------:|:-----------------:|:-----:|
|
| 73 |
-
| **Context** | 8,192
|
| 74 |
-
| **VRAM** | 5.7 GB | 6.2 GB | +0.54 GB
|
| 75 |
| **Tokens/s (avg)** | 49.8 | 47.5 | **−4.6%** |
|
| 76 |
|
| 77 |
-
>
|
| 78 |
|
| 79 |
Raw data: [`results/turboquant-4070-results-2026-04-01.json`](results/turboquant-4070-results-2026-04-01.json) · [`results/turboquant-4070-laptop-2026-04-01.json`](results/turboquant-4070-laptop-2026-04-01.json)
|
| 80 |
|
|
|
|
|
|
|
| 81 |
### Cross-GPU Summary
|
| 82 |
|
| 83 |
-
| GPU | VRAM | Model |
|
| 84 |
-
|-----|------|-------|---------------------|----------
|
| 85 |
-
| RTX 3090 | 24 GB | Mistral-Small-3.2 24B | 100,000
|
| 86 |
-
| RTX 4070 Laptop | 8 GB | Llama-3.1 8B | 64,000
|
| 87 |
|
| 88 |
-
|
| 89 |
|
| 90 |
---
|
| 91 |
|
| 92 |
-
|
|
|
|
|
|
|
|
|
|
| 93 |
|
| 94 |
-
### 1. Build the Docker
|
| 95 |
|
| 96 |
```bash
|
| 97 |
docker build -t turboquant:feature .
|
| 98 |
|
| 99 |
-
# Verify TurboQuant is compiled in:
|
| 100 |
-
docker run --rm turboquant:feature llama-server -h 2>&1 | grep
|
| 101 |
-
# Must show: turbo2, turbo3, turbo4
|
| 102 |
```
|
| 103 |
|
| 104 |
-
### 2. Download
|
| 105 |
|
| 106 |
```bash
|
| 107 |
-
# Set your HuggingFace token
|
| 108 |
export HF_TOKEN=hf_your_token_here
|
| 109 |
-
|
| 110 |
bash scripts/download-model.sh
|
| 111 |
```
|
| 112 |
|
| 113 |
-
### 3. Run
|
| 114 |
|
| 115 |
```bash
|
| 116 |
bash scripts/run-baseline.sh
|
| 117 |
-
# Server
|
| 118 |
```
|
| 119 |
|
| 120 |
### 4. Run TurboQuant (turbo3, 100K context)
|
| 121 |
|
| 122 |
```bash
|
| 123 |
bash scripts/run-turbo.sh
|
| 124 |
-
# Server
|
| 125 |
```
|
| 126 |
|
| 127 |
-
### 5. Test
|
| 128 |
|
| 129 |
```bash
|
| 130 |
-
# Check
|
| 131 |
-
curl -s http://localhost:
|
| 132 |
-
|
|
|
|
| 133 |
|
| 134 |
-
|
| 135 |
-
|
|
|
|
|
|
|
| 136 |
|
| 137 |
-
#
|
| 138 |
-
curl http://localhost:8182/v1/chat/completions \
|
| 139 |
-H "Content-Type: application/json" \
|
| 140 |
-
-d '{"model":"local","messages":[{"role":"user","content":"
|
|
|
|
| 141 |
```
|
| 142 |
|
| 143 |
---
|
| 144 |
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
### The KV-Cache Problem
|
| 148 |
-
|
| 149 |
-
When an LLM runs, it caches Key-Value pairs for every token in the context window.
|
| 150 |
-
This cache grows **linearly** with context length:
|
| 151 |
-
|
| 152 |
-
```
|
| 153 |
-
Mistral-Small-3.2 24B on RTX 3090 (24 GB total, ~14.4 GB for model weights):
|
| 154 |
-
|
| 155 |
-
Context KV-Cache (f16) Available after model Fits?
|
| 156 |
-
8,192 ~1 GB 9.6 GB ✅
|
| 157 |
-
32,000 ~4 GB 9.6 GB ✅
|
| 158 |
-
100,000 ~12 GB 9.6 GB ❌ OOM without TurboQuant
|
| 159 |
-
100,000 ~2.8 GB (turbo3) 9.6 GB ✅
|
| 160 |
-
```
|
| 161 |
-
|
| 162 |
-
### What TurboQuant Does
|
| 163 |
-
|
| 164 |
-
TurboQuant compresses the KV-cache from 16-bit floats to 2–4-bit integers.
|
| 165 |
-
**It does NOT compress the model weights** — only the runtime cache.
|
| 166 |
-
|
| 167 |
-
```
|
| 168 |
-
f16 KV-Cache → turbo3 KV-Cache
|
| 169 |
-
16 bits → 3 bits = 4.3× compression
|
| 170 |
-
```
|
| 171 |
-
|
| 172 |
-
The model reads the quantized cache and generates text normally.
|
| 173 |
-
Quality loss: <1% perplexity increase at turbo3 (per paper).
|
| 174 |
-
|
| 175 |
-
### Two Repos — Critical Distinction
|
| 176 |
|
| 177 |
-
|
| 178 |
|
| 179 |
-
|
| 180 |
-
|------|-----------|-------------|
|
| 181 |
-
| `TheTom/turboquant_plus` | Python library for research | HuggingFace models, Python API |
|
| 182 |
-
| `TheTom/llama-cpp-turboquant` | llama.cpp fork | **This guide — llama-server** |
|
| 183 |
|
| 184 |
-
**
|
|
|
|
|
|
|
| 185 |
|
| 186 |
---
|
| 187 |
|
| 188 |
-
##
|
| 189 |
-
|
| 190 |
-
Every error we hit during setup, documented so you don't repeat them:
|
| 191 |
-
|
| 192 |
-
### E1: Wrong Repository
|
| 193 |
|
| 194 |
-
**Symptom:**
|
| 195 |
-
**Cause:**
|
| 196 |
-
**Fix:** Use the correct repo. See Dockerfile.
|
| 197 |
|
| 198 |
-
### E2: Wrong cmake Flag
|
| 199 |
-
|
| 200 |
-
**Symptom:** CUDA not used during inference, slow CPU fallback.
|
| 201 |
-
**Cause:** Old flag `-DLLAMA_CUBLAS=ON` was renamed in llama.cpp post-GGML-refactor.
|
| 202 |
-
**Fix:**
|
| 203 |
```dockerfile
|
| 204 |
-
#
|
| 205 |
-
cmake -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON
|
| 206 |
|
| 207 |
-
#
|
| 208 |
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
|
| 209 |
```
|
| 210 |
|
| 211 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 212 |
|
| 213 |
-
**Symptom:** Build fails with `cannot find -lcuda` or linker error for `libcuda.so.1`.
|
| 214 |
-
**Cause:** CUDA devel images have a stub `libcuda.so` but not `libcuda.so.1` (the runtime driver is injected at container start, not build time).
|
| 215 |
-
**Fix:** Add symlink before cmake:
|
| 216 |
```dockerfile
|
|
|
|
| 217 |
RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
|
| 218 |
/usr/local/cuda/lib64/stubs/libcuda.so.1 \
|
| 219 |
&& echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf \
|
| 220 |
&& ldconfig
|
| 221 |
```
|
| 222 |
|
| 223 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 224 |
|
| 225 |
-
**Symptom:** `Unsupported cache type: turbo3` at runtime despite clean build.
|
| 226 |
-
**Cause:** Cloning the default `master` branch of `llama-cpp-turboquant` — which is a standard llama.cpp fork **without** TurboQuant. The implementation is on `feature/turboquant-kv-cache`.
|
| 227 |
-
**Fix:**
|
| 228 |
```bash
|
|
|
|
|
|
|
|
|
|
|
|
|
| 229 |
git clone https://github.com/TheTom/llama-cpp-turboquant.git \
|
| 230 |
--branch feature/turboquant-kv-cache --depth=1
|
| 231 |
```
|
| 232 |
-
|
|
|
|
| 233 |
```bash
|
| 234 |
curl -s "https://api.github.com/repos/TheTom/llama-cpp-turboquant/branches" \
|
| 235 |
| python3 -c "import sys,json; [print(b['name']) for b in json.load(sys.stdin)]"
|
| 236 |
```
|
| 237 |
|
| 238 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 239 |
|
| 240 |
-
**Symptom:** 404 or 401 when downloading model.
|
| 241 |
-
**Cause:** Model repo names change. Don't rely on memory or cached context.
|
| 242 |
-
**Fix:** Always query HF Search API before downloading:
|
| 243 |
```bash
|
|
|
|
| 244 |
curl -s -H "Authorization: Bearer $HF_TOKEN" \
|
| 245 |
"https://huggingface.co/api/models?search=bartowski+mistral+small+3.2&limit=5" \
|
| 246 |
| python3 -c "import sys,json; [print(m['modelId']) for m in json.load(sys.stdin)]"
|
|
@@ -248,121 +302,140 @@ curl -s -H "Authorization: Bearer $HF_TOKEN" \
|
|
| 248 |
|
| 249 |
---
|
| 250 |
|
| 251 |
-
|
|
|
|
| 252 |
|
| 253 |
```bash
|
| 254 |
-
# 1. Build
|
| 255 |
docker build -t turboquant:feature .
|
|
|
|
|
|
|
| 256 |
|
| 257 |
-
# 2.
|
| 258 |
-
export HF_TOKEN=hf_your_token
|
| 259 |
-
bash scripts/download-model.sh
|
| 260 |
-
|
| 261 |
-
# 3. Baseline measurement
|
| 262 |
-
bash scripts/run-baseline.sh &
|
| 263 |
-
sleep 45 # wait for server startup
|
| 264 |
-
curl -s http://localhost:8180/v1/chat/completions \
|
| 265 |
-
-H "Content-Type: application/json" \
|
| 266 |
-
-d '{"model":"local","messages":[{"role":"user","content":"Count from 1 to 200, one per line."}],"max_tokens":500}' \
|
| 267 |
-
| python3 -c "import sys,json; d=json.load(sys.stdin); u=d['usage']; print(f'TPS: {u[\"completion_tokens\"] / (d[\"usage\"].get(\"total_time_ms\",10000)/1000):.1f}')"
|
| 268 |
nvidia-smi --query-gpu=memory.used --format=csv,noheader
|
|
|
|
| 269 |
|
| 270 |
-
#
|
| 271 |
docker stop turboquant-baseline
|
| 272 |
-
bash scripts/run-turbo.sh
|
| 273 |
-
sleep 90
|
| 274 |
-
|
|
|
|
| 275 |
```
|
| 276 |
|
| 277 |
-
Expected
|
|
|
|
|
|
|
| 278 |
|
| 279 |
-
---
|
| 280 |
|
| 281 |
-
|
| 282 |
|
| 283 |
-
|
| 284 |
-
|--|---------|----------|
|
| 285 |
-
| GPU VRAM | 16 GB | RTX 3090 24 GB |
|
| 286 |
-
| System RAM | 16 GB | 32 GB |
|
| 287 |
-
| Disk | 30 GB | SSD |
|
| 288 |
-
| CUDA | 12.x | 12.6.3 |
|
| 289 |
-
| OS | Linux / Windows + Docker | Windows + Docker Desktop |
|
| 290 |
|
| 291 |
-
|
| 292 |
|
| 293 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 294 |
|
| 295 |
-
|
| 296 |
|
| 297 |
-
|
| 298 |
-
Should work with any GGUF model that fits the VRAM budget after KV-cache allocation.
|
| 299 |
|
| 300 |
-
|
|
| 301 |
-
|-------
|
| 302 |
-
|
|
| 303 |
-
|
|
| 304 |
-
|
|
|
|
|
|
|
|
| 305 |
|
| 306 |
-
*
|
| 307 |
|
| 308 |
---
|
| 309 |
|
| 310 |
-
##
|
| 311 |
|
| 312 |
Content and scripts: [CC BY 4.0](LICENSE)
|
| 313 |
-
Based on [TurboQuant (arXiv:2504.19874)](https://arxiv.org/abs/2504.19874) by Thomas et al.
|
| 314 |
llama.cpp fork: [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
|
| 315 |
|
| 316 |
---
|
| 317 |
|
| 318 |
-
|
| 319 |
-
|
| 320 |
## 🇩🇪 Deutsch
|
| 321 |
|
| 322 |
### TurboQuant auf Consumer-Hardware — Praktischer Guide
|
| 323 |
|
| 324 |
-
Dieses Repository dokumentiert unsere
|
| 325 |
-
auf einer RTX 3090 im Homelab-Betrieb. Wir sind das erste europäische Team,
|
| 326 |
-
das diese Methode praktisch auf Consumer-Hardware veröffentlicht dokumentiert hat.
|
| 327 |
-
|
| 328 |
-
### Das Ergebnis
|
| 329 |
|
| 330 |
-
|
| 331 |
|
| 332 |
-
|
| 333 |
-
- nur **+1.8 GB VRAM** Mehrverbrauch
|
| 334 |
-
- nur **−8.5% Geschwindigkeitsverlust**
|
| 335 |
-
- **gleiche Modellgewichte** — nur der Laufzeit-Cache wird komprimiert
|
| 336 |
|
| 337 |
-
|
|
|
|
|
|
|
|
|
|
| 338 |
|
| 339 |
-
|
| 340 |
-
|
|
|
|
|
|
|
| 341 |
|
| 342 |
-
###
|
| 343 |
|
| 344 |
-
|
| 345 |
-
Der häufigste: falscher Branch (`master` statt `feature/turboquant-kv-cache`).
|
| 346 |
|
| 347 |
-
### Schnellstart
|
| 348 |
|
| 349 |
```bash
|
| 350 |
# Image bauen (~20 Minuten)
|
| 351 |
docker build -t turboquant:feature .
|
| 352 |
|
| 353 |
-
#
|
|
|
|
|
|
|
|
|
|
| 354 |
export HF_TOKEN=dein_token
|
| 355 |
bash scripts/download-model.sh
|
| 356 |
|
| 357 |
-
# Baseline starten (f16, 8K Context)
|
| 358 |
bash scripts/run-baseline.sh
|
| 359 |
|
| 360 |
-
# TurboQuant starten (turbo3, 100K Context)
|
|
|
|
| 361 |
bash scripts/run-turbo.sh
|
| 362 |
```
|
| 363 |
|
| 364 |
Vollständige deutsche Dokumentation: [`WHITEPAPER.de.md`](WHITEPAPER.de.md)
|
| 365 |
|
| 366 |
---
|
| 367 |
|
| 368 |
-
*AI Engineering Lab
|
|
|
|
| 24 |
<div align="center">
|
| 25 |
|
| 26 |
[](https://github.com/AI-Engineerings-at/llama-cpp-turboquant-guide)
|
| 27 |
+
[](https://arxiv.org/abs/2504.19874)
|
| 28 |

|
| 29 |
+

|
| 30 |
+

|
| 31 |
+

|
| 32 |
+

|
| 33 |
+

|
| 34 |
|
| 35 |
+
**TurboQuant (ICLR 2026) quantizes the KV-cache at runtime — not the model weights.**
|
| 36 |
+
**Result: 100,000-token context on an RTX 3090. +1.8 GB VRAM. −7.5% speed.**
|
| 37 |
|
| 38 |
+
*Verified across multiple independent runs. Step-by-step guide with Dockerfile, scripts, and raw benchmark data.*
|
| 39 |
|
| 40 |
+
</div>
|
|
|
|
| 41 |
|
| 42 |
+
---
|
| 43 |
|
| 44 |
+
[TurboQuant (arXiv:2504.19874)](https://arxiv.org/abs/2504.19874) was presented at ICLR 2026. It compresses the KV-cache from 16-bit to 2–4-bit integers during inference. Model weights stay at full precision. This closes the gap between what your GPU can load and what context length it can actually serve.
|
| 45 |
+
|
| 46 |
+
This repo documents our setup on two consumer GPUs — what we ran into, what we fixed, and what we measured.
|
| 47 |
+
|
| 48 |
+
**What's in this repo:**
|
| 49 |
+
| File | Description |
|
| 50 |
+
|------|-------------|
|
| 51 |
+
| `Dockerfile` | Builds llama.cpp with TurboQuant (correct repo, branch, cmake flags) |
|
| 52 |
+
| `scripts/run-baseline.sh` | Starts llama-server with f16 cache, 8K context |
|
| 53 |
+
| `scripts/run-turbo.sh` | Starts llama-server with turbo3 cache, 100K context |
|
| 54 |
+
| `scripts/download-model.sh` | Downloads model from HuggingFace via API |
|
| 55 |
+
| `results/` | Raw benchmark JSON — all runs, both GPUs |
|
| 56 |
+
| `WHITEPAPER.de.md` | German white paper |
|
| 57 |
+
|
| 58 |
+
**→ [Results](#results) · [How It Works](#how-it-works) · [Quick Start](#quick-start) · [Errors & Fixes](#errors) · [Deutsch](#deutsch)**
|
| 59 |
|
| 60 |
---
|
| 61 |
|
| 62 |
+
<a id="results"></a>
|
| 63 |
+
## Results
|
| 64 |
|
| 65 |
+
Verified on two consumer GPUs. April 2026.
|
| 66 |
|
| 67 |
### RTX 3090 (24 GB) — Mistral-Small-3.2-24B Q4_K_M
|
| 68 |
|
| 69 |
+
*4 independent benchmark runs, 15 total measurements. All runs consistent (±0.3% TPS variance).*
|
| 70 |
|
| 71 |
| | Baseline (f16) | TurboQuant turbo3 | Delta |
|
| 72 |
|--|:--------------:|:-----------------:|:-----:|
|
| 73 |
+
| **Context** | 8,192 | **100,000** | **+12.2×** |
|
| 74 |
+
| **VRAM used** | 15.3 GB | 17.1 GB | +1.8 GB |
|
| 75 |
| **Tokens/s** | 51.0 | 47.2 | **−7.5%** |
|
| 76 |
+
| **KV-Cache** | ~1 GB (f16, 8K) | ~2.8 GB (3-bit, 100K) | 4.3× smaller than f16 at 100K (~12 GB) |
|
| 77 |
+
|
| 78 |
+
> 12× more context. +12% VRAM. −7.5% speed. Same model weights.
|
| 79 |
+
|
| 80 |
+
```
|
| 81 |
+
RTX 3090 — 24 GB VRAM
|
| 82 |
+
────────────────────────────────────────────────
|
| 83 |
+
Baseline (f16, 8K ctx): █████████████░░░░░░ 15.3 GB
|
| 84 |
+
TurboQuant (turbo3, 100K ctx): ██████████████░░░░░ 17.1 GB
|
| 85 |
+
↑ weights 14.4 GB fixed
|
| 86 |
+
↑ KV-cache
|
| 87 |
+
```
|
| 88 |
|
| 89 |
+
Raw data: [`results/turboquant-3090-all-runs-2026-04.json`](results/turboquant-3090-all-runs-2026-04.json) — 4 runs, 15 measurements
|
| 90 |
|
| 91 |
+
---
|
| 92 |
|
| 93 |
### RTX 4070 Laptop (8 GB) — Llama-3.1-8B-Instruct Q4_K_M
|
| 94 |
|
| 95 |
+
*2 independent benchmark sessions. VRAM delta stable at ±2 MB between sessions.*
|
| 96 |
|
| 97 |
| | Baseline (f16) | TurboQuant turbo3 | Delta |
|
| 98 |
|--|:--------------:|:-----------------:|:-----:|
|
| 99 |
+
| **Context** | 8,192 | **64,000** | **+7.8×** |
|
| 100 |
+
| **VRAM used** | 5.7 GB | 6.2 GB | +0.54 GB |
|
| 101 |
| **Tokens/s (avg)** | 49.8 | 47.5 | **−4.6%** |
|
| 102 |
|
| 103 |
+
> 7.8× more context. +0.5 GB VRAM. −5% speed.
|
| 104 |
|
| 105 |
Raw data: [`results/turboquant-4070-results-2026-04-01.json`](results/turboquant-4070-results-2026-04-01.json) · [`results/turboquant-4070-laptop-2026-04-01.json`](results/turboquant-4070-laptop-2026-04-01.json)
|
| 106 |
|
| 107 |
+
---
|
| 108 |
+
|
| 109 |
### Cross-GPU Summary
|
| 110 |
|
| 111 |
+
| GPU | VRAM | Model | ctx (turbo3) | VRAM delta | Speed loss | Runs |
|
| 112 |
+
|-----|------|-------|:------------:|:----------:|:----------:|:----:|
|
| 113 |
+
| RTX 3090 | 24 GB | Mistral-Small-3.2 24B | **100,000** | +1.8 GB | −7.5% | 4 |
|
| 114 |
+
| RTX 4070 Laptop | 8 GB | Llama-3.1 8B | **64,000** | +0.5 GB | −4.6% | 2 |
|
| 115 |
+
|
| 116 |
+
The principle scales with the GPU. More VRAM → larger model → larger absolute VRAM savings from compression → more context headroom.
|
| 117 |
+
|
| 118 |
+
---
|
| 119 |
+
|
| 120 |
+
<a id="how-it-works"></a>
|
| 121 |
+
## How It Works
|
| 122 |
+
|
| 123 |
+
### The KV-cache problem
|
| 124 |
+
|
| 125 |
+
Every token you feed into an LLM creates Key-Value vectors that must stay in VRAM for the duration of the request. With f16 (the default), this cache grows linearly with context length:
|
| 126 |
+
|
| 127 |
+
```
|
| 128 |
+
Mistral-Small-3.2 24B on RTX 3090 (24 GB):
|
| 129 |
+
Model weights occupy 14.4 GB. Remaining: ~9.6 GB for KV-cache.
|
| 130 |
+
|
| 131 |
+
Context KV-cache (f16) Remaining Status
|
| 132 |
+
8,192 ~1 GB ~8.6 GB ✅ fine
|
| 133 |
+
32,000 ~4 GB ~5.6 GB ✅ fine
|
| 134 |
+
100,000 ~12 GB −2.4 GB ❌ OOM
|
| 135 |
+
100,000 ~2.8 GB (turbo3) ~6.8 GB ✅ fine
|
| 136 |
+
```
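
The table above is linear arithmetic. Here is a minimal sketch, assuming the per-token cache cost can be back-derived from the ~12 GB (f16) and ~2.8 GB (turbo3) figures at 100,000 tokens. The function and constant names are illustrative and not part of llama.cpp:

```python
# Illustrative arithmetic only: per-token costs are back-derived from the
# figures quoted in this README, not read from llama.cpp.
WEIGHTS_GB = 14.4       # Mistral-Small-3.2 24B Q4_K_M weights
TOTAL_VRAM_GB = 24.0    # RTX 3090
GB_PER_TOKEN = {
    "f16": 12.0 / 100_000,     # ~12 GB at 100K context
    "turbo3": 2.8 / 100_000,   # ~2.8 GB at 100K context
}


def kv_cache_gb(context_tokens: int, cache_type: str) -> float:
    """KV-cache size, assumed to grow linearly with context length."""
    return context_tokens * GB_PER_TOKEN[cache_type]


def remaining_gb(context_tokens: int, cache_type: str) -> float:
    """VRAM left after weights and KV-cache; a negative value means OOM."""
    return TOTAL_VRAM_GB - WEIGHTS_GB - kv_cache_gb(context_tokens, cache_type)


if __name__ == "__main__":
    cases = [(8_192, "f16"), (32_000, "f16"), (100_000, "f16"), (100_000, "turbo3")]
    for ctx, cache_type in cases:
        left = remaining_gb(ctx, cache_type)
        status = "fits" if left > 0 else "OOM"
        print(f"{ctx:>7,} {cache_type:<7} cache {kv_cache_gb(ctx, cache_type):5.1f} GB  "
              f"remaining {left:5.1f} GB  {status}")
```

The sketch ignores compute buffers and other runtime allocations, so the remaining figures are best read as an upper bound.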
|
| 137 |
+
|
| 138 |
+
### What TurboQuant does
|
| 139 |
+
|
| 140 |
+
TurboQuant re-encodes the KV-cache from 16 bits to 2–4 bits on-the-fly. The model reads the quantized cache and generates output normally. The model weights are never touched.
|
| 141 |
+
|
| 142 |
+
```
|
| 143 |
+
f16 KV-cache → turbo3 KV-cache
|
| 144 |
+
16 bits       →   3 bits  ≈  4.3× compression in practice (~12 GB → ~2.8 GB at 100K)
|
| 145 |
+
```
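
The nominal bit ratio and the 4.3× figure are not the same number. A short sketch of the arithmetic, using only the cache sizes quoted in this README (the helper names are illustrative). One plausible reading is that the gap comes from per-block metadata such as scales, but that is an assumption; the README does not say.

```python
def compression_ratio(bits_before: float, bits_after: float) -> float:
    """Nominal compression from the bit widths alone."""
    return bits_before / bits_after


def effective_bits(bits_before: float, size_before_gb: float, size_after_gb: float) -> float:
    """Bits per cached value implied by the observed cache sizes."""
    return bits_before * size_after_gb / size_before_gb


if __name__ == "__main__":
    print(f"nominal 16 -> 3 bits:        {compression_ratio(16, 3):.1f}x")
    print(f"from sizes 12 GB -> 2.8 GB:  {12 / 2.8:.1f}x")
    print(f"effective bits per value:    {effective_bits(16, 12, 2.8):.1f}")
```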
|
| 146 |
+
|
| 147 |
+
Quality loss at turbo3: <1% perplexity increase (per paper). In practice: not noticeable for most tasks.
|
| 148 |
+
|
| 149 |
+
```mermaid
|
| 150 |
+
flowchart LR
|
| 151 |
+
tokens["100K input tokens"] --> model["Model layers\nweights: 14.4 GB\nunchanged"]
|
| 152 |
+
model --> kv_type{KV-Cache format?}
|
| 153 |
+
kv_type -->|"f16 default"| oom["~12 GB cache\n❌ OOM on 24 GB GPU"]
|
| 154 |
+
kv_type -->|"turbo3 (3-bit)"| fits["~2.8 GB cache\n✅ Fits: 7 GB still free"]
|
| 155 |
+
fits --> output["Output tokens\n−7.5% TPS vs baseline"]
|
| 156 |
+
```
|
| 157 |
|
| 158 |
+
### Critical: two repos with confusing names
|
| 159 |
+
|
| 160 |
+
| Repo | What it is | Used here? |
|
| 161 |
+
|------|-----------|:----------:|
|
| 162 |
+
| `TheTom/turboquant_plus` | Python research library — HuggingFace models, Python API | ❌ |
|
| 163 |
+
| `TheTom/llama-cpp-turboquant` | llama.cpp fork with `--cache-type-k turbo3` | ✅ |
|
| 164 |
+
|
| 165 |
+
Branch: `feature/turboquant-kv-cache` — **not `master`** (which is a standard llama.cpp fork, no TurboQuant).
|
| 166 |
|
| 167 |
---
|
| 168 |
|
| 169 |
+
<a id="quick-start"></a>
|
| 170 |
+
## Quick Start
|
| 171 |
+
|
| 172 |
+
**Requirements:** Docker with NVIDIA runtime, CUDA 12.x, HuggingFace account (free).
|
| 173 |
|
| 174 |
+
### 1. Build the Docker image (~20 min)
|
| 175 |
|
| 176 |
```bash
|
| 177 |
docker build -t turboquant:feature .
|
| 178 |
|
| 179 |
+
# Verify TurboQuant is compiled in — must show turbo2, turbo3, turbo4:
|
| 180 |
+
docker run --rm turboquant:feature llama-server -h 2>&1 | grep turbo
|
|
|
|
| 181 |
```
|
| 182 |
|
| 183 |
+
### 2. Download model (~14 GB)
|
| 184 |
|
| 185 |
```bash
|
|
|
|
| 186 |
export HF_TOKEN=hf_your_token_here
|
|
|
|
| 187 |
bash scripts/download-model.sh
|
| 188 |
```
|
| 189 |
|
| 190 |
+
### 3. Run baseline (f16, 8K context)
|
| 191 |
|
| 192 |
```bash
|
| 193 |
bash scripts/run-baseline.sh
|
| 194 |
+
# Server on port 8180. Starts in ~45s.
|
| 195 |
```
|
| 196 |
|
| 197 |
### 4. Run TurboQuant (turbo3, 100K context)
|
| 198 |
|
| 199 |
```bash
|
| 200 |
bash scripts/run-turbo.sh
|
| 201 |
+
# Server on port 8182. Starts in ~90s — 100K context allocation takes longer.
|
| 202 |
```
|
| 203 |
|
| 204 |
+
### 5. Test
|
| 205 |
|
| 206 |
```bash
|
| 207 |
+
# Check context length
|
| 208 |
+
curl -s http://localhost:8182/v1/models \
|
| 209 |
+
| python3 -c "import sys,json; print(json.load(sys.stdin)['data'][0]['context_length'])"
|
| 210 |
+
# → 100000
|
| 211 |
|
| 212 |
+
# Warmup (first request is cold — don't measure this one)
|
| 213 |
+
curl -sf http://localhost:8182/v1/chat/completions \
|
| 214 |
+
-H "Content-Type: application/json" \
|
| 215 |
+
-d '{"model":"local","messages":[{"role":"user","content":"Hi"}],"max_tokens":5}' > /dev/null
|
| 216 |
|
| 217 |
+
# Measure TPS
|
| 218 |
+
curl -s http://localhost:8182/v1/chat/completions \
|
| 219 |
-H "Content-Type: application/json" \
|
| 220 |
+
-d '{"model":"local","messages":[{"role":"user","content":"Explain transformer attention in detail. At least 400 words."}],"max_tokens":500}' \
|
| 221 |
+
| python3 -c "import sys,json; t=json.load(sys.stdin)['timings']; print(f'{t[\"predicted_per_second\"]:.1f} TPS ({t[\"predicted_n\"]} tokens)')"
|
| 222 |
```
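
If you prefer a script over the one-liner, the same measurement can be sketched in Python with only the standard library. It assumes the server exposes the `timings.predicted_per_second` and `timings.predicted_n` fields used in the curl example above; the function names are illustrative.

```python
import json
import urllib.request


def build_request(prompt: str, max_tokens: int = 500) -> bytes:
    """Encode a chat-completion request body as JSON bytes."""
    body = {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body).encode("utf-8")


def extract_tps(response: dict) -> tuple[float, int]:
    """Return (tokens per second, generated tokens) from a server response."""
    timings = response["timings"]
    return timings["predicted_per_second"], timings["predicted_n"]


def measure_tps(base_url: str, prompt: str) -> tuple[float, int]:
    """POST one request and read the timings reported by the server."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as resp:
        return extract_tps(json.load(resp))


if __name__ == "__main__":
    tps, n_tokens = measure_tps(
        "http://localhost:8182",
        "Explain transformer attention in detail. At least 400 words.",
    )
    print(f"{tps:.1f} TPS ({n_tokens} tokens)")
```

As with the curl version, send a warmup request first and discard it.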
|
| 223 |
|
| 224 |
---
|
| 225 |
|
| 226 |
+
<a id="errors"></a>
|
| 227 |
+
## Errors & Fixes
|
| 228 |
|
| 229 |
+
Five errors we ran into during setup, all documented so you can skip them.
|
| 230 |
|
| 231 |
+
### E1: Wrong repository
|
|
|
|
|
|
|
|
|
|
| 232 |
|
| 233 |
+
**Symptom:** Build succeeds. `llama-server -h | grep turbo` returns nothing.
|
| 234 |
+
**Cause:** Built from `TheTom/turboquant_plus` — that's a Python library for HuggingFace-style inference. The llama.cpp fork is `TheTom/llama-cpp-turboquant`.
|
| 235 |
+
**Fix:** Use the Dockerfile in this repo. It clones the correct repo.
|
| 236 |
|
| 237 |
---
|
| 238 |
|
| 239 |
+
### E2: Wrong cmake flag
|
|
|
|
|
|
|
|
|
|
|
|
|
| 240 |
|
| 241 |
+
**Symptom:** CUDA is not used. Inference runs on CPU — extremely slow.
|
| 242 |
+
**Cause:** `-DLLAMA_CUBLAS=ON` was renamed to `-DGGML_CUDA=ON` in llama.cpp after the GGML refactor. The old flag compiles without error but is silently ignored.
|
|
|
|
| 243 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 244 |
```dockerfile
|
| 245 |
+
# Wrong — silently ignored since llama.cpp GGML refactor:
|
| 246 |
+
cmake -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON .
|
| 247 |
|
| 248 |
+
# Correct:
|
| 249 |
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
|
| 250 |
```
|
| 251 |
|
| 252 |
+
---
|
| 253 |
+
|
| 254 |
+
### E3: `libcuda.so.1` not found at build time
|
| 255 |
+
|
| 256 |
+
**Symptom:** Docker build fails with `cannot find -lcuda` or linker error for `libcuda.so.1`.
|
| 257 |
+
**Cause:** CUDA development images ship a stub `libcuda.so` — the actual driver (`libcuda.so.1`) is injected at container runtime, not available during `docker build`.
|
| 258 |
|
|
|
|
|
|
|
|
|
|
| 259 |
```dockerfile
|
| 260 |
+
# Add before cmake in your Dockerfile:
|
| 261 |
RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
|
| 262 |
/usr/local/cuda/lib64/stubs/libcuda.so.1 \
|
| 263 |
&& echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf \
|
| 264 |
&& ldconfig
|
| 265 |
```
|
| 266 |
|
| 267 |
+
---
|
| 268 |
+
|
| 269 |
+
### E4: Wrong branch
|
| 270 |
+
|
| 271 |
+
**Symptom:** `Unsupported cache type: turbo3` at runtime. Build was clean.
|
| 272 |
+
**Cause:** The default `master` branch of `llama-cpp-turboquant` is a plain llama.cpp fork. TurboQuant lives on `feature/turboquant-kv-cache`.
|
| 273 |
|
|
|
|
|
|
|
|
|
|
| 274 |
```bash
|
| 275 |
+
# Wrong — master branch, no TurboQuant:
|
| 276 |
+
git clone https://github.com/TheTom/llama-cpp-turboquant.git
|
| 277 |
+
|
| 278 |
+
# Correct:
|
| 279 |
git clone https://github.com/TheTom/llama-cpp-turboquant.git \
|
| 280 |
--branch feature/turboquant-kv-cache --depth=1
|
| 281 |
```
|
| 282 |
+
|
| 283 |
+
Check branches before cloning:
|
| 284 |
```bash
|
| 285 |
curl -s "https://api.github.com/repos/TheTom/llama-cpp-turboquant/branches" \
|
| 286 |
| python3 -c "import sys,json; [print(b['name']) for b in json.load(sys.stdin)]"
|
| 287 |
```
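
The branch check can also fail fast in a script, for example before a long Docker build. A minimal sketch, assuming the GitHub branches endpoint shown above returns a JSON list of objects with a `name` field; it only looks at the first page of results, and the helper names are illustrative.

```python
import json
import urllib.request

REQUIRED_BRANCH = "feature/turboquant-kv-cache"


def branch_names(branches: list[dict]) -> list[str]:
    """Extract branch names from the GitHub branches API response."""
    return [branch["name"] for branch in branches]


def has_branch(branches: list[dict], name: str = REQUIRED_BRANCH) -> bool:
    """True if the named branch appears in the API response."""
    return name in branch_names(branches)


def fetch_branches(repo: str) -> list[dict]:
    """Fetch the first page (up to 100 entries) of branches for a repo."""
    url = f"https://api.github.com/repos/{repo}/branches?per_page=100"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    branches = fetch_branches("TheTom/llama-cpp-turboquant")
    if not has_branch(branches):
        raise SystemExit(f"Branch {REQUIRED_BRANCH} not found, do not build from master")
    print(f"Found {REQUIRED_BRANCH}")
```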
|
| 288 |
|
| 289 |
+
---
|
| 290 |
+
|
| 291 |
+
### E5: Wrong HuggingFace repo name
|
| 292 |
+
|
| 293 |
+
**Symptom:** 404 or 401 on model download.
|
| 294 |
+
**Cause:** Model repo names on HuggingFace change. Don't rely on memory.
|
| 295 |
|
|
|
|
|
|
|
|
|
|
| 296 |
```bash
|
| 297 |
+
# Always verify first:
|
| 298 |
curl -s -H "Authorization: Bearer $HF_TOKEN" \
|
| 299 |
"https://huggingface.co/api/models?search=bartowski+mistral+small+3.2&limit=5" \
|
| 300 |
| python3 -c "import sys,json; [print(m['modelId']) for m in json.load(sys.stdin)]"
|
|
|
|
| 302 |
|
| 303 |
---
|
| 304 |
|
| 305 |
+
<a id="reproduce"></a>
|
| 306 |
+
## Reproduce Our Results
|
| 307 |
|
| 308 |
```bash
|
| 309 |
+
# 1. Build and start baseline server
|
| 310 |
docker build -t turboquant:feature .
|
| 311 |
+
bash scripts/run-baseline.sh
|
| 312 |
+
sleep 45
|
| 313 |
|
| 314 |
+
# 2. VRAM after server start
|
| 315 |
nvidia-smi --query-gpu=memory.used --format=csv,noheader
|
| 316 |
+
# Expected: ~15300 MB
|
| 317 |
|
| 318 |
+
# 3. Warmup (mandatory — first request is cold and gives wrong TPS)
|
| 319 |
+
curl -sf http://localhost:8180/v1/chat/completions \
|
| 320 |
+
-H "Content-Type: application/json" \
|
| 321 |
+
-d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"max_tokens":5}' > /dev/null
|
| 322 |
+
|
| 323 |
+
# 4. 3× TPS measurement
|
| 324 |
+
PROMPT='{"model":"local","messages":[{"role":"user","content":"Explain in detail how transformer attention mechanisms work. Cover self-attention, multi-head attention, key-query-value matrices, and positional encoding. Write at least 400 words."}],"max_tokens":500}'
|
| 325 |
+
for i in 1 2 3; do
|
| 326 |
+
curl -sf http://localhost:8180/v1/chat/completions \
|
| 327 |
+
-H "Content-Type: application/json" \
|
| 328 |
+
-d "$PROMPT" \
|
| 329 |
+
| python3 -c "import sys,json; t=json.load(sys.stdin)['timings']; print(f'Run $i: {t[\"predicted_per_second\"]:.2f} TPS ({t[\"predicted_n\"]} tokens)')"
|
| 330 |
+
done
|
| 331 |
+
|
| 332 |
+
# 5. Stop, start TurboQuant (100K context needs ~90s to allocate)
|
| 333 |
docker stop turboquant-baseline
|
| 334 |
+
bash scripts/run-turbo.sh
|
| 335 |
+
sleep 90
|
| 336 |
+
|
| 337 |
+
# 6. Repeat steps 2-4 on port 8182
|
| 338 |
```
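
Once both servers have been measured, the comparison is a mean and a percentage. A small sketch of that step; the numbers in the `__main__` block are placeholders set to the averages reported in this README, so replace them with your own three readings per server.

```python
from statistics import mean


def percent_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100


def summarize(baseline_tps: list[float], turbo_tps: list[float],
              baseline_vram_gb: float, turbo_vram_gb: float) -> dict:
    """Average the TPS runs and compute the speed and VRAM deltas."""
    base = mean(baseline_tps)
    turbo = mean(turbo_tps)
    return {
        "baseline_tps": base,
        "turbo_tps": turbo,
        "tps_delta_pct": percent_delta(base, turbo),
        "vram_delta_gb": turbo_vram_gb - baseline_vram_gb,
    }


if __name__ == "__main__":
    # Placeholder inputs: README averages repeated three times, not raw runs.
    result = summarize([51.0, 51.0, 51.0], [47.2, 47.2, 47.2], 15.3, 17.1)
    print(f"TPS {result['baseline_tps']:.1f} -> {result['turbo_tps']:.1f} "
          f"({result['tps_delta_pct']:+.1f}%), VRAM {result['vram_delta_gb']:+.1f} GB")
```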
|
| 339 |
|
| 340 |
+
**Expected on RTX 3090 + Mistral-Small-3.2 24B:**
|
| 341 |
+
- Baseline: 50–52 TPS, VRAM ~15.3 GB
|
| 342 |
+
- turbo3 at 100K: 46–48 TPS, VRAM ~17.1 GB
|
| 343 |
|
| 344 |
+
Full benchmark data for comparison: [`results/turboquant-3090-all-runs-2026-04.json`](results/turboquant-3090-all-runs-2026-04.json)
|
| 345 |
|
| 346 |
+
---
|
| 347 |
|
| 348 |
+
## Hardware & Model Compatibility
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 349 |
|
| 350 |
+
### GPU VRAM requirements
|
| 351 |
|
| 352 |
+
| VRAM | Recommended model | turbo3 context | Notes |
|
| 353 |
+
|------|------------------|:--------------:|-------|
|
| 354 |
+
| 6 GB | Llama-3.2 3B Q4_K_M (~2 GB) | ~200K | Very fast, limited capability |
|
| 355 |
+
| 8 GB | Llama-3.1 8B Q4_K_M (4.7 GB) | ~64K | **Verified** — our RTX 4070 setup |
|
| 356 |
+
| 12 GB | Qwen2.5 14B Q4_K_M (~8.5 GB) | ~80K | Estimated |
|
| 357 |
+
| 24 GB | Mistral-Small-3.2 24B Q4_K_M (14.4 GB) | ~100K | **Verified** — our RTX 3090 setup |
|
| 358 |
|
| 359 |
+
*Estimates for non-verified rows depend on model architecture and batch size.*
|
| 360 |
|
| 361 |
+
### System requirements
|
|
|
|
| 362 |
|
| 363 |
+
| Component | Minimum | Our setups |
|
| 364 |
+
|-----------|---------|-----------|
|
| 365 |
+
| GPU | CUDA-capable, VRAM per table above | RTX 3090 / RTX 4070 Laptop |
|
| 366 |
+
| System RAM | 16 GB | 32 GB / 16 GB |
|
| 367 |
+
| Disk | 20 GB free | SSD |
|
| 368 |
+
| CUDA | 12.x | 12.6.3 |
|
| 369 |
+
| OS | Linux, or Windows with Docker Desktop | Windows + Docker Desktop |
|
| 370 |
|
| 371 |
+
**Windows:** Docker Desktop works. Use named Docker volumes for models — avoid `/tmp/` paths.
|
| 372 |
|
| 373 |
---
|
| 374 |
|
| 375 |
+
## License
|
| 376 |
|
| 377 |
Content and scripts: [CC BY 4.0](LICENSE)
|
| 378 |
+
Based on [TurboQuant (arXiv:2504.19874)](https://arxiv.org/abs/2504.19874) by Zandieh et al., ICLR 2026
|
| 379 |
llama.cpp fork: [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
|
| 380 |
|
| 381 |
---
|
| 382 |
|
| 383 |
+
<a id="deutsch"></a>
|
|
|
|
| 384 |
## 🇩🇪 Deutsch
|
| 385 |
|
| 386 |
### TurboQuant auf Consumer-Hardware — Praktischer Guide
|
| 387 |
|
| 388 |
+
Dieses Repository dokumentiert unsere Ergebnisse mit TurboQuant (ICLR 2026) auf zwei Consumer-GPUs — inklusive aller 5 Fehler die wir gemacht haben und wie wir sie gelöst haben.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 389 |
|
| 390 |
+
**TurboQuant komprimiert den KV-Cache von 16-Bit auf 2–4-Bit während der Inferenz. Die Modellgewichte bleiben unverändert.**
|
| 391 |
|
| 392 |
+
### Ergebnisse
|
|
|
|
|
|
|
|
|
|
| 393 |
|
| 394 |
+
**RTX 3090 (24 GB), Mistral-Small-3.2 24B Q4_K_M** — 4 unabhängige Runs:
|
| 395 |
+
- Context: 8.192 → **100.000 Tokens** (+12,2×)
|
| 396 |
+
- VRAM-Mehrverbrauch: nur **+1,8 GB** (statt ~12 GB die f16 bei 100K bräuchte)
|
| 397 |
+
- Geschwindigkeitsverlust: nur **−7,5%** (51,0 → 47,2 Tokens/s)
|
| 398 |
|
| 399 |
+
**RTX 4070 Laptop (8 GB), Llama-3.1 8B Q4_K_M** — 2 unabhängige Sessions:
|
| 400 |
+
- Context: 8.192 → **64.000 Tokens** (+7,8×)
|
| 401 |
+
- VRAM-Mehrverbrauch: nur **+0,5 GB**
|
| 402 |
+
- Geschwindigkeitsverlust: nur **−4,6%**
|
| 403 |
|
| 404 |
+
### Warum das relevant ist
|
| 405 |
|
| 406 |
+
Größerer Context bedeutet: längere Dokumente verarbeiten, besseres RAG, mehr Gesprächshistorie — alles auf einer einzigen Consumer-GPU. Keine Cloud-Kosten, keine Datenschutz-Probleme.
|
|
|
|
| 407 |
|
| 408 |
+
### Schnellstart
|
| 409 |
|
| 410 |
```bash
|
| 411 |
# Image bauen (~20 Minuten)
|
| 412 |
docker build -t turboquant:feature .
|
| 413 |
|
| 414 |
+
# TurboQuant-Unterstützung prüfen (muss turbo2, turbo3, turbo4 zeigen):
|
| 415 |
+
docker run --rm turboquant:feature llama-server -h 2>&1 | grep turbo
|
| 416 |
+
|
| 417 |
+
# Modell herunterladen (~14 GB)
|
| 418 |
export HF_TOKEN=dein_token
|
| 419 |
bash scripts/download-model.sh
|
| 420 |
|
| 421 |
+
# Baseline starten (f16, 8K Context, Port 8180)
|
| 422 |
bash scripts/run-baseline.sh
|
| 423 |
|
| 424 |
+
# TurboQuant starten (turbo3, 100K Context, Port 8182)
|
| 425 |
+
# Hinweis: Startup dauert ~90s — 100K Context-Allokation braucht länger als 8K
|
| 426 |
bash scripts/run-turbo.sh
|
| 427 |
```
|
| 428 |
|
| 429 |
+
### Die 5 Fehler — Kurzfassung
|
| 430 |
+
|
| 431 |
+
1. **Falsches Repo** — `turboquant_plus` ist eine Python-Bibliothek, nicht der llama.cpp Fork → [E1](#errors)
|
| 432 |
+
2. **Falsches cmake-Flag** — `-DLLAMA_CUBLAS=ON` wird still ignoriert, korrekt: `-DGGML_CUDA=ON` → [E2](#errors)
|
| 433 |
+
3. **`libcuda.so.1` fehlt** — Symlink vor cmake notwendig → [E3](#errors)
|
| 434 |
+
4. **Falscher Branch** — `master` hat kein TurboQuant, korrekt: `feature/turboquant-kv-cache` → [E4](#errors)
|
| 435 |
+
5. **Falscher HF-Repo-Name** — Immer per API prüfen, nie aus dem Gedächtnis → [E5](#errors)
|
| 436 |
+
|
| 437 |
Vollständige deutsche Dokumentation: [`WHITEPAPER.de.md`](WHITEPAPER.de.md)
|
| 438 |
|
| 439 |
---
|
| 440 |
|
| 441 |
+
*[AI Engineering Lab](https://ai-engineering.at) · April 2026*
|