AI Engineering Lab committed on
Commit
740999f
·
1 Parent(s): a99aedc

docs: overhaul README — architecture diagram, fixed anchors, VRAM visualization, bilingual

  1. README.md +251 -178
README.md CHANGED
@@ -24,223 +24,277 @@ tags:
24
  <div align="center">
25
 
26
  [![GitHub](https://img.shields.io/badge/GitHub-AI--Engineerings--at%2Fllama--cpp--turboquant--guide-181717?logo=github)](https://github.com/AI-Engineerings-at/llama-cpp-turboquant-guide)
 
27
  ![TurboQuant](https://img.shields.io/badge/TurboQuant-turbo3%20KV--Cache-blueviolet)
28
- ![Context](https://img.shields.io/badge/Context-100%2C000%20tokens-brightgreen)
29
- ![GPU](https://img.shields.io/badge/GPU-RTX%203090%2024GB-76b900?logo=nvidia)
30
- ![VRAM Overhead](https://img.shields.io/badge/VRAM%20overhead-%2B1.8%20GB%20only-blue)
31
- ![Speed Loss](https://img.shields.io/badge/Speed%20loss--8.5%25%20only-orange)
32
- ![License](https://img.shields.io/badge/license-CC%20BY%204.0-green)
33
 
34
- **Practical guide: TurboQuant KV-cache quantization on consumer hardware.**
35
- **100,000-token context on a single RTX 3090: verified, reproducible, step-by-step.**
36
 
37
- *Based on [TurboQuant (ICLR 2026, arXiv:2504.19874)](https://arxiv.org/abs/2504.19874)*
38
 
39
- **📦 [GitHub: AI-Engineerings-at/llama-cpp-turboquant-guide](https://github.com/AI-Engineerings-at/llama-cpp-turboquant-guide)**
40
- — Dockerfile · Scripts · Raw benchmark JSON · German white paper
41
 
42
- [Results](#-results) · [Quick Start](#-quick-start) · [How It Works](#-how-it-works) · [Errors & Fixes](#-errors--fixes) · [Deutsch](#-deutsch)
43
 
44
- </div>
45
 
46
  ---
47
 
48
- ## 📊 Results
 
49
 
50
- Tested on two consumer GPUs. Results verified across multiple independent runs (April 1, 2026).
51
 
52
  ### RTX 3090 (24 GB) — Mistral-Small-3.2-24B Q4_K_M
53
 
54
- *4 independent benchmark runs, 15 total measurements.*
55
 
56
  | | Baseline (f16) | TurboQuant turbo3 | Delta |
57
  |--|:--------------:|:-----------------:|:-----:|
58
- | **Context** | 8,192 tokens | **100,000 tokens** | **+12.2×** |
59
- | **VRAM** | 15.3 GB | 17.1 GB | +1.8 GB only |
60
  | **Tokens/s** | 51.0 | 47.2 | **−7.5%** |
61
- | **KV-Cache size** | ~1 GB (f16) | ~2.8 GB (3-bit) | **4.3× compression** |
62
 
63
- > **12× more context. +12% VRAM. −7.5% speed. Same model weights.**
64
 
65
- Raw data: [`results/turboquant-3090-all-runs-2026-04.json`](results/turboquant-3090-all-runs-2026-04.json) (all 4 runs)
66
 
67
  ### RTX 4070 Laptop (8 GB) — Llama-3.1-8B-Instruct Q4_K_M
68
 
69
- *2 independent benchmark sessions verified.*
70
 
71
  | | Baseline (f16) | TurboQuant turbo3 | Delta |
72
  |--|:--------------:|:-----------------:|:-----:|
73
- | **Context** | 8,192 tokens | **64,000 tokens** | **+7.8×** |
74
- | **VRAM** | 5.7 GB | 6.2 GB | +0.54 GB only |
75
  | **Tokens/s (avg)** | 49.8 | 47.5 | **−4.6%** |
76
 
77
- > **7.8× more context. +0.5 GB VRAM. −5% speed. Consistent across both runs.**
78
 
79
  Raw data: [`results/turboquant-4070-results-2026-04-01.json`](results/turboquant-4070-results-2026-04-01.json) · [`results/turboquant-4070-laptop-2026-04-01.json`](results/turboquant-4070-laptop-2026-04-01.json)
80
 
 
 
81
  ### Cross-GPU Summary
82
 
83
- | GPU | VRAM | Model | Max Context (turbo3) | Speed Loss | Runs |
84
- |-----|------|-------|---------------------|-----------|------|
85
- | RTX 3090 | 24 GB | Mistral-Small-3.2 24B | 100,000 tokens | −7.5% | 4 |
86
- | RTX 4070 Laptop | 8 GB | Llama-3.1 8B | 64,000 tokens | −4.6% | 2 |
87
 
88
- TurboQuant scales with the GPU: the principle (+7-12× context, minimal speed loss) holds across hardware classes.
89
 
90
  ---
91
 
92
- ## 🚀 Quick Start
 
 
 
93
 
94
- ### 1. Build the Docker Image (~20 minutes)
95
 
96
  ```bash
97
  docker build -t turboquant:feature .
98
 
99
- # Verify TurboQuant is compiled in:
100
- docker run --rm turboquant:feature llama-server -h 2>&1 | grep -A3 "cache-type-k"
101
- # Must show: turbo2, turbo3, turbo4
102
  ```
103
 
104
- ### 2. Download a Model
105
 
106
  ```bash
107
- # Set your HuggingFace token
108
  export HF_TOKEN=hf_your_token_here
109
-
110
  bash scripts/download-model.sh
111
  ```
112
 
113
- ### 3. Run Baseline (f16, 8K context)
114
 
115
  ```bash
116
  bash scripts/run-baseline.sh
117
- # Server starts on port 8180
118
  ```
119
 
120
  ### 4. Run TurboQuant (turbo3, 100K context)
121
 
122
  ```bash
123
  bash scripts/run-turbo.sh
124
- # Server starts on port 8182
125
  ```
126
 
127
- ### 5. Test It
128
 
129
  ```bash
130
- # Check available context
131
- curl -s http://localhost:8180/v1/models | jq '.data[0].context_length'
132
- # Baseline: 8192
 
133
 
134
- curl -s http://localhost:8182/v1/models | jq '.data[0].context_length'
135
- # TurboQuant: 131072 (model max, allocated to 100000)
 
 
136
 
137
- # Generate tokens (measures TPS in response)
138
- curl http://localhost:8182/v1/chat/completions \
139
  -H "Content-Type: application/json" \
140
- -d '{"model":"local","messages":[{"role":"user","content":"Count from 1 to 200"}],"max_tokens":500}'
 
141
  ```
142
 
143
  ---
144
 
145
- ## ⚙️ How It Works
146
-
147
- ### The KV-Cache Problem
148
-
149
- When an LLM runs, it caches Key-Value pairs for every token in the context window.
150
- This cache grows **linearly** with context length:
151
-
152
- ```
153
- Mistral-Small-3.2 24B on RTX 3090 (24 GB total, ~14.4 GB for model weights):
154
-
155
- Context KV-Cache (f16) Available after model Fits?
156
- 8,192 ~1 GB 9.6 GB ✅
157
- 32,000 ~4 GB 9.6 GB ✅
158
- 100,000 ~12 GB 9.6 GB ❌ OOM without TurboQuant
159
- 100,000 ~2.8 GB (turbo3) 9.6 GB ✅
160
- ```
161
-
162
- ### What TurboQuant Does
163
-
164
- TurboQuant compresses the KV-cache from 16-bit floats to 2–4-bit integers.
165
- **It does NOT compress the model weights** — only the runtime cache.
166
-
167
- ```
168
- f16 KV-Cache → turbo3 KV-Cache
169
- 16 bits → 3 bits = 4.3× compression
170
- ```
171
-
172
- The model reads the quantized cache and generates text normally.
173
- Quality loss: <1% perplexity increase at turbo3 (per paper).
174
-
175
- ### Two Repos — Critical Distinction
176
 
177
- There are two TurboQuant repositories with confusing names:
178
 
179
- | Repo | What it is | When to use |
180
- |------|-----------|-------------|
181
- | `TheTom/turboquant_plus` | Python library for research | HuggingFace models, Python API |
182
- | `TheTom/llama-cpp-turboquant` | llama.cpp fork | **This guide — llama-server** |
183
 
184
- **This guide uses `TheTom/llama-cpp-turboquant`, branch `feature/turboquant-kv-cache`.**
 
 
185
 
186
  ---
187
 
188
- ## 🐛 Errors & Fixes
189
-
190
- Every error we hit during setup, documented so you don't repeat them:
191
-
192
- ### E1: Wrong Repository
193
 
194
- **Symptom:** No `turbo2`/`turbo3`/`turbo4` options after building.
195
- **Cause:** Built from `TheTom/turboquant_plus` (Python library) instead of `TheTom/llama-cpp-turboquant`.
196
- **Fix:** Use the correct repo. See Dockerfile.
197
 
198
- ### E2: Wrong cmake Flag
199
-
200
- **Symptom:** CUDA not used during inference, slow CPU fallback.
201
- **Cause:** Old flag `-DLLAMA_CUBLAS=ON` was renamed in llama.cpp post-GGML-refactor.
202
- **Fix:**
203
  ```dockerfile
204
- # WRONG (old, silently ignored):
205
- cmake -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON
206
 
207
- # CORRECT:
208
  cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
209
  ```
210
 
211
- ### E3: libcuda.so.1 Not Found at Build Time
212
 
213
- **Symptom:** Build fails with `cannot find -lcuda` or linker error for `libcuda.so.1`.
214
- **Cause:** CUDA devel images have a stub `libcuda.so` but not `libcuda.so.1` (the runtime driver is injected at container start, not build time).
215
- **Fix:** Add symlink before cmake:
216
  ```dockerfile
 
217
  RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
218
  /usr/local/cuda/lib64/stubs/libcuda.so.1 \
219
  && echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf \
220
  && ldconfig
221
  ```
222
 
223
- ### E4: Wrong Branch
224
 
225
- **Symptom:** `Unsupported cache type: turbo3` at runtime despite clean build.
226
- **Cause:** Cloning the default `master` branch of `llama-cpp-turboquant` — which is a standard llama.cpp fork **without** TurboQuant. The implementation is on `feature/turboquant-kv-cache`.
227
- **Fix:**
228
  ```bash
 
 
 
 
229
  git clone https://github.com/TheTom/llama-cpp-turboquant.git \
230
  --branch feature/turboquant-kv-cache --depth=1
231
  ```
232
- Always verify before building:
 
233
  ```bash
234
  curl -s "https://api.github.com/repos/TheTom/llama-cpp-turboquant/branches" \
235
  | python3 -c "import sys,json; [print(b['name']) for b in json.load(sys.stdin)]"
236
  ```
237
 
238
- ### E5: Wrong HuggingFace Repo Name
239
 
240
- **Symptom:** 404 or 401 when downloading model.
241
- **Cause:** Model repo names change. Don't rely on memory or cached context.
242
- **Fix:** Always query HF Search API before downloading:
243
  ```bash
 
244
  curl -s -H "Authorization: Bearer $HF_TOKEN" \
245
  "https://huggingface.co/api/models?search=bartowski+mistral+small+3.2&limit=5" \
246
  | python3 -c "import sys,json; [print(m['modelId']) for m in json.load(sys.stdin)]"
@@ -248,121 +302,140 @@ curl -s -H "Authorization: Bearer $HF_TOKEN" \
248
 
249
  ---
250
 
251
- ## 🔬 Reproduce Our Results
 
252
 
253
  ```bash
254
- # 1. Build
255
  docker build -t turboquant:feature .
 
 
256
 
257
- # 2. Download model (~14 GB)
258
- export HF_TOKEN=hf_your_token
259
- bash scripts/download-model.sh
260
-
261
- # 3. Baseline measurement
262
- bash scripts/run-baseline.sh &
263
- sleep 45 # wait for server startup
264
- curl -s http://localhost:8180/v1/chat/completions \
265
- -H "Content-Type: application/json" \
266
- -d '{"model":"local","messages":[{"role":"user","content":"Count from 1 to 200, one per line."}],"max_tokens":500}' \
267
- | python3 -c "import sys,json; d=json.load(sys.stdin); u=d['usage']; print(f'TPS: {u[\"completion_tokens\"] / (d[\"usage\"].get(\"total_time_ms\",10000)/1000):.1f}')"
268
  nvidia-smi --query-gpu=memory.used --format=csv,noheader
 
269
 
270
- # 4. Turbo3 measurement
271
  docker stop turboquant-baseline
272
- bash scripts/run-turbo.sh &
273
- sleep 90 # 100K context allocation takes longer
274
- # repeat curl + nvidia-smi on port 8182
 
275
  ```
276
 
277
- Expected results matching our run: see [`results/turboquant-rtx3090-2026-04-01.json`](results/turboquant-rtx3090-2026-04-01.json)
 
 
278
 
279
- ---
280
 
281
- ## Hardware Requirements
282
 
283
- | | Minimum | Our Setup |
284
- |--|---------|----------|
285
- | GPU VRAM | 16 GB | RTX 3090 24 GB |
286
- | System RAM | 16 GB | 32 GB |
287
- | Disk | 30 GB | SSD |
288
- | CUDA | 12.x | 12.6.3 |
289
- | OS | Linux / Windows + Docker | Windows + Docker Desktop |
290
 
291
- > **Note on Windows:** Docker Desktop works fine. Avoid `/tmp/` paths — use named Docker volumes for model storage.
292
 
293
- ---
294
 
295
- ## Model Compatibility
296
 
297
- Tested with **Mistral-Small-3.2-24B Q4_K_M** (14 GB).
298
- Should work with any GGUF model that fits the VRAM budget after KV-cache allocation.
299
 
300
- | Model | Size | VRAM (model) | Max ctx (turbo3) |
301
- |-------|------|-------------|-----------------|
302
- | Mistral-Small-3.2 24B Q4_K_M | 14 GB | 14.4 GB | ~100K on 24 GB GPU |
303
- | Llama-3.1 8B Q4_K_M | 4.7 GB | 5.1 GB | ~200K on 16 GB GPU |
304
- | Qwen2.5 14B Q4_K_M | 8.5 GB | 8.8 GB | ~150K on 16 GB GPU |
 
 
305
 
306
- *Estimates. Actual values depend on architecture and batch size.*
307
 
308
  ---
309
 
310
- ## 📄 License
311
 
312
  Content and scripts: [CC BY 4.0](LICENSE)
313
- Based on [TurboQuant (arXiv:2504.19874)](https://arxiv.org/abs/2504.19874) by Thomas et al. (ICLR 2026)
314
  llama.cpp fork: [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
315
 
316
  ---
317
 
318
- ---
319
-
320
  ## 🇩🇪 Deutsch
321
 
322
  ### TurboQuant auf Consumer-Hardware — Praktischer Guide
323
 
324
- Dieses Repository dokumentiert unsere Erfahrungen beim Einsatz von TurboQuant (ICLR 2026)
325
- auf einer RTX 3090 im Homelab-Betrieb. Wir sind das erste europäische Team,
326
- das diese Methode praktisch auf Consumer-Hardware veröffentlicht dokumentiert hat.
327
-
328
- ### Das Ergebnis
329
 
330
- Mit TurboQuant turbo3 (3-bit KV-Cache) haben wir auf einer RTX 3090 (24 GB):
331
 
332
- - **12× mehr Context** (8.192 → 100.000 Tokens)
333
- - nur **+1.8 GB VRAM** Mehrverbrauch
334
- - nur **−8.5% Geschwindigkeitsverlust**
335
- - **gleiche Modellgewichte** — nur der Laufzeit-Cache wird komprimiert
336
 
337
- ### Warum das wichtig ist
 
 
 
338
 
339
- Größerer Context bedeutet: Längere Dokumente, mehr Gesprächshistorie, besseres RAG,
340
- Code-Analyse ganzer Codebasen alles auf einer einzigen Consumer-GPU.
 
 
341
 
342
- ### Fehler-Protokoll (5 Fehler die wir gemacht haben)
343
 
344
- Alle 5 Fehler aus unserem Setup sind unter [Errors & Fixes](#-errors--fixes) dokumentiert.
345
- Der häufigste: falscher Branch (`master` statt `feature/turboquant-kv-cache`).
346
 
347
- ### Schnellstart (Deutsch)
348
 
349
  ```bash
350
  # Image bauen (~20 Minuten)
351
  docker build -t turboquant:feature .
352
 
353
- # Modell herunterladen (14 GB)
 
 
 
354
  export HF_TOKEN=dein_token
355
  bash scripts/download-model.sh
356
 
357
- # Baseline starten (f16, 8K Context)
358
  bash scripts/run-baseline.sh
359
 
360
- # TurboQuant starten (turbo3, 100K Context)
 
361
  bash scripts/run-turbo.sh
362
  ```
363
 
364
  Vollständige deutsche Dokumentation: [`WHITEPAPER.de.md`](WHITEPAPER.de.md)
365
 
366
  ---
367
 
368
- *AI Engineering Lab · April 2026 · [ai-engineering.at](https://ai-engineering.at)*
 
24
  <div align="center">
25
 
26
  [![GitHub](https://img.shields.io/badge/GitHub-AI--Engineerings--at%2Fllama--cpp--turboquant--guide-181717?logo=github)](https://github.com/AI-Engineerings-at/llama-cpp-turboquant-guide)
27
+ [![Paper](https://img.shields.io/badge/Paper-arXiv%3A2504.19874-b31b1b?logo=arxiv)](https://arxiv.org/abs/2504.19874)
28
  ![TurboQuant](https://img.shields.io/badge/TurboQuant-turbo3%20KV--Cache-blueviolet)
29
+ ![RTX 3090](https://img.shields.io/badge/RTX%203090-100K%20ctx-76b900?logo=nvidia)
30
+ ![RTX 4070 Laptop](https://img.shields.io/badge/RTX%204070%20Laptop-64K%20ctx-76b900?logo=nvidia)
31
+ ![VRAM Overhead](https://img.shields.io/badge/VRAM%20delta-%2B1.8%20GB%20only-blue)
32
+ ![Speed Loss](https://img.shields.io/badge/Speed%20loss-max%208.5%25-orange)
33
+ ![License](https://img.shields.io/badge/license-CC%20BY%204.0-lightgrey)
34
 
35
+ **TurboQuant (ICLR 2026) quantizes the KV-cache at runtime, not the model weights.**
36
+ **Result: 100,000-token context on an RTX 3090. +1.8 GB VRAM. −7.5% speed.**
37
 
38
+ *Verified across multiple independent runs. Step-by-step guide with Dockerfile, scripts, and raw benchmark data.*
39
 
40
+ </div>
 
41
 
42
+ ---
43
 
44
+ [TurboQuant (arXiv:2504.19874)](https://arxiv.org/abs/2504.19874) was presented at ICLR 2026. It compresses the KV-cache from 16-bit floats to 2–4-bit integers during inference. The model weights are left unchanged (in this guide they are already Q4_K_M-quantized GGUF files). This narrows the gap between the model size your GPU can load and the context length it can actually serve.
45
+
46
+ This repo documents our setup on two consumer GPUs — what we ran into, what we fixed, and what we measured.
47
+
48
+ **What's in this repo:**
49
+ | File | Description |
50
+ |------|-------------|
51
+ | `Dockerfile` | Builds llama.cpp with TurboQuant (correct repo, branch, cmake flags) |
52
+ | `scripts/run-baseline.sh` | Starts llama-server with f16 cache, 8K context |
53
+ | `scripts/run-turbo.sh` | Starts llama-server with turbo3 cache, 100K context |
54
+ | `scripts/download-model.sh` | Downloads model from HuggingFace via API |
55
+ | `results/` | Raw benchmark JSON — all runs, both GPUs |
56
+ | `WHITEPAPER.de.md` | German white paper |
57
+
58
+ **→ [Results](#results) · [How It Works](#how-it-works) · [Quick Start](#quick-start) · [Errors & Fixes](#errors) · [Deutsch](#deutsch)**
59
 
60
  ---
61
 
62
+ <a id="results"></a>
63
+ ## Results
64
 
65
+ Verified on two consumer GPUs. April 2026.
66
 
67
  ### RTX 3090 (24 GB) — Mistral-Small-3.2-24B Q4_K_M
68
 
69
+ *4 independent benchmark runs, 15 total measurements. All runs consistent (±0.3% TPS variance).*
70
 
71
  | | Baseline (f16) | TurboQuant turbo3 | Delta |
72
  |--|:--------------:|:-----------------:|:-----:|
73
+ | **Context** | 8,192 | **100,000** | **+12.2×** |
74
+ | **VRAM used** | 15.3 GB | 17.1 GB | +1.8 GB |
75
  | **Tokens/s** | 51.0 | 47.2 | **−7.5%** |
76
+ | **KV-Cache** | ~1 GB (f16, 8K ctx) | ~2.8 GB (3-bit, 100K ctx) | 4.3× smaller than f16 at 100K (~12 GB) |
77
+
78
+ > 12× more context. +12% VRAM. −7.5% speed. Same model weights.
79
+
80
+ ```
81
+ RTX 3090 — 24 GB VRAM
82
+ ────────────────────────────────────────────────
83
+ Baseline (f16, 8K ctx): █████████████░░░░░░ 15.3 GB
84
+ TurboQuant (turbo3, 100K ctx): ██████████████░░░░░ 17.1 GB
85
+ ↑ weights 14.4 GB fixed
86
+ ↑ KV-cache
87
+ ```
88
 
89
+ Raw data: [`results/turboquant-3090-all-runs-2026-04.json`](results/turboquant-3090-all-runs-2026-04.json) (4 runs, 15 measurements)
90
 
91
+ ---
92
 
93
  ### RTX 4070 Laptop (8 GB) — Llama-3.1-8B-Instruct Q4_K_M
94
 
95
+ *2 independent benchmark sessions. VRAM delta stable at ±2 MB between sessions.*
96
 
97
  | | Baseline (f16) | TurboQuant turbo3 | Delta |
98
  |--|:--------------:|:-----------------:|:-----:|
99
+ | **Context** | 8,192 | **64,000** | **+7.8×** |
100
+ | **VRAM used** | 5.7 GB | 6.2 GB | +0.54 GB |
101
  | **Tokens/s (avg)** | 49.8 | 47.5 | **−4.6%** |
102
 
103
+ > 7.8× more context. +0.5 GB VRAM. −5% speed.
104
 
105
  Raw data: [`results/turboquant-4070-results-2026-04-01.json`](results/turboquant-4070-results-2026-04-01.json) · [`results/turboquant-4070-laptop-2026-04-01.json`](results/turboquant-4070-laptop-2026-04-01.json)
106
 
107
+ ---
108
+
109
  ### Cross-GPU Summary
110
 
111
+ | GPU | VRAM | Model | ctx (turbo3) | VRAM delta | Speed loss | Runs |
112
+ |-----|------|-------|:------------:|:----------:|:----------:|:----:|
113
+ | RTX 3090 | 24 GB | Mistral-Small-3.2 24B | **100,000** | +1.8 GB | −7.5% | 4 |
114
+ | RTX 4070 Laptop | 8 GB | Llama-3.1 8B | **64,000** | +0.5 GB | −4.6% | 2 |
115
+
116
+ The principle scales with the GPU. More VRAM → larger model → larger absolute VRAM savings from compression → more context headroom.
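
The Delta columns can be recomputed from the rounded table values. The sketch below is only a sanity check on the published numbers. The `Measurement` class and `summarize` function are illustrative names, not part of this repo's scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    context_tokens: int
    vram_gb: float
    tokens_per_s: float


def summarize(baseline: Measurement, turbo: Measurement) -> dict:
    """Derive the Delta column from a baseline / turbo3 pair."""
    return {
        "context_x": turbo.context_tokens / baseline.context_tokens,
        "vram_delta_gb": turbo.vram_gb - baseline.vram_gb,
        "vram_delta_pct": (turbo.vram_gb / baseline.vram_gb - 1) * 100,
        "speed_delta_pct": (turbo.tokens_per_s / baseline.tokens_per_s - 1) * 100,
    }


# Rounded values copied from the two result tables above.
RTX_3090 = summarize(Measurement(8_192, 15.3, 51.0), Measurement(100_000, 17.1, 47.2))
RTX_4070 = summarize(Measurement(8_192, 5.7, 49.8), Measurement(64_000, 6.2, 47.5))

for name, d in (("RTX 3090", RTX_3090), ("RTX 4070 Laptop", RTX_4070)):
    print(
        f"{name}: {d['context_x']:.1f}x context, "
        f"{d['vram_delta_gb']:+.1f} GB VRAM ({d['vram_delta_pct']:+.0f}%), "
        f"{d['speed_delta_pct']:+.1f}% tokens/s"
    )
```

Because the inputs are rounded, the RTX 4070 VRAM delta comes out as +0.5 GB here, while the table reports +0.54 GB from the raw JSON.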
117
+
118
+ ---
119
+
120
+ <a id="how-it-works"></a>
121
+ ## How It Works
122
+
123
+ ### The KV-cache problem
124
+
125
+ Every token you feed into an LLM creates Key-Value vectors that must stay in VRAM for the duration of the request. With f16 (the default), this cache grows linearly with context length:
126
+
127
+ ```
128
+ Mistral-Small-3.2 24B on RTX 3090 (24 GB):
129
+ Model weights occupy 14.4 GB. Remaining: ~9.6 GB for KV-cache.
130
+
131
+ Context    KV-cache (f16)      Remaining   Status
132
+ 8,192      ~1 GB               ~8.6 GB     ✅ fine
133
+ 32,000     ~4 GB               ~5.6 GB     ✅ fine
134
+ 100,000    ~12 GB              −2.4 GB     ❌ OOM
135
+ 100,000    ~2.8 GB (turbo3)    ~6.8 GB     ✅ fine
136
+ ```
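
The budget table is plain arithmetic and can be reproduced in a few lines. This is a minimal sketch that assumes the README's rounded figures: 24 GB total, 14.4 GB of weights, about 1 GB of f16 cache per 8,192 tokens, and about 2.8 GB of turbo3 cache at 100K tokens. The constant and function names are illustrative.

```python
# Rough KV-cache VRAM budget, using the rounded figures from the table above.
TOTAL_VRAM_GB = 24.0
WEIGHTS_GB = 14.4
F16_GB_PER_TOKEN = 1.0 / 8_192         # ~1 GB of f16 cache per 8,192 tokens
TURBO3_GB_PER_TOKEN = 2.8 / 100_000    # ~2.8 GB of turbo3 cache at 100K tokens


def kv_budget(context_tokens: int, gb_per_token: float) -> tuple[float, float]:
    """Return (cache size, VRAM left after weights and cache), both in GB."""
    cache_gb = context_tokens * gb_per_token
    remaining_gb = TOTAL_VRAM_GB - WEIGHTS_GB - cache_gb
    return cache_gb, remaining_gb


if __name__ == "__main__":
    cases = [
        ("f16", 8_192, F16_GB_PER_TOKEN),
        ("f16", 32_000, F16_GB_PER_TOKEN),
        ("f16", 100_000, F16_GB_PER_TOKEN),
        ("turbo3", 100_000, TURBO3_GB_PER_TOKEN),
    ]
    for label, ctx, rate in cases:
        cache, left = kv_budget(ctx, rate)
        status = "fits" if left > 0 else "OOM"
        print(f"{label:7s}{ctx:>8,} tokens  cache {cache:5.1f} GB  left {left:5.1f} GB  {status}")
```

Real allocations also include compute buffers and driver overhead, so treat the "left" column as an upper bound rather than a guarantee.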
137
+
138
+ ### What TurboQuant does
139
+
140
+ TurboQuant re-encodes the KV-cache from 16 bits to 2–4 bits on-the-fly. The model reads the quantized cache and generates output normally. The model weights are never touched.
141
+
142
+ ```
143
+ f16 KV-cache → turbo3 KV-cache
144
+ 16 bits → 3 bits ≈ 4.3× compression (~12 GB → ~2.8 GB at 100K)
145
+ ```
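
The 4.3× figure matches the ratio of the cache sizes, not the ratio of the nominal bit widths (16/3 would be about 5.3×). The sketch below works through that arithmetic with the README's rounded sizes. The README does not explain the gap between 3 nominal bits and the effective bits per value. Our assumption is that it comes from metadata such as per-block scales.

```python
F16_BITS = 16
NOMINAL_TURBO3_BITS = 3
F16_CACHE_GB = 12.0     # estimated f16 cache at 100K tokens (README figure)
TURBO3_CACHE_GB = 2.8   # turbo3 cache at 100K tokens (README figure)

nominal_ratio = F16_BITS / NOMINAL_TURBO3_BITS
observed_ratio = F16_CACHE_GB / TURBO3_CACHE_GB
# Bits per cached value implied by the observed ratio (assumption: the f16 baseline is exactly 16 bits).
effective_bits = F16_BITS / observed_ratio

print(f"nominal ratio:  {nominal_ratio:.1f}x")
print(f"observed ratio: {observed_ratio:.1f}x")
print(f"effective bits per value: {effective_bits:.2f}")
```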
146
+
147
+ Quality loss at turbo3: <1% perplexity increase (per paper). In practice: not noticeable for most tasks.
148
+
149
+ ```mermaid
150
+ flowchart LR
151
+ tokens["100K input tokens"] --> model["Model layers\nweights: 14.4 GB\nunchanged"]
152
+ model --> kv_type{KV-Cache format?}
153
+ kv_type -->|"f16 default"| oom["~12 GB cache\n❌ OOM on 24 GB GPU"]
154
+ kv_type -->|"turbo3 (3-bit)"| fits["~2.8 GB cache\n✅ Fits: 7 GB still free"]
155
+ fits --> output["Output tokens\n−7.5% TPS vs baseline"]
156
+ ```
157
 
158
+ ### Critical: two repos with confusing names
159
+
160
+ | Repo | What it is | Used here? |
161
+ |------|-----------|:----------:|
162
+ | `TheTom/turboquant_plus` | Python research library — HuggingFace models, Python API | ❌ |
163
+ | `TheTom/llama-cpp-turboquant` | llama.cpp fork with `--cache-type-k turbo3` | ✅ |
164
+
165
+ Branch: `feature/turboquant-kv-cache` — **not `master`** (which is a standard llama.cpp fork, no TurboQuant).
166
 
167
  ---
168
 
169
+ <a id="quick-start"></a>
170
+ ## Quick Start
171
+
172
+ **Requirements:** Docker with NVIDIA runtime, CUDA 12.x, HuggingFace account (free).
173
 
174
+ ### 1. Build the Docker image (~20 min)
175
 
176
  ```bash
177
  docker build -t turboquant:feature .
178
 
179
+ # Verify TurboQuant is compiled in — must show turbo2, turbo3, turbo4:
180
+ docker run --rm turboquant:feature llama-server -h 2>&1 | grep turbo
 
181
  ```
182
 
183
+ ### 2. Download model (~14 GB)
184
 
185
  ```bash
 
186
  export HF_TOKEN=hf_your_token_here
 
187
  bash scripts/download-model.sh
188
  ```
189
 
190
+ ### 3. Run baseline (f16, 8K context)
191
 
192
  ```bash
193
  bash scripts/run-baseline.sh
194
+ # Server on port 8180. Starts in ~45s.
195
  ```
196
 
197
  ### 4. Run TurboQuant (turbo3, 100K context)
198
 
199
  ```bash
200
  bash scripts/run-turbo.sh
201
+ # Server on port 8182. Starts in ~90s — 100K context allocation takes longer.
202
  ```
203
 
204
+ ### 5. Test
205
 
206
  ```bash
207
+ # Check context length
208
+ curl -s http://localhost:8182/v1/models \
209
+ | python3 -c "import sys,json; print(json.load(sys.stdin)['data'][0]['context_length'])"
210
+ # → 100000
211
 
212
+ # Warmup (first request is cold — don't measure this one)
213
+ curl -sf http://localhost:8182/v1/chat/completions \
214
+ -H "Content-Type: application/json" \
215
+ -d '{"model":"local","messages":[{"role":"user","content":"Hi"}],"max_tokens":5}' > /dev/null
216
 
217
+ # Measure TPS
218
+ curl -s http://localhost:8182/v1/chat/completions \
219
  -H "Content-Type: application/json" \
220
+ -d '{"model":"local","messages":[{"role":"user","content":"Explain transformer attention in detail. At least 400 words."}],"max_tokens":500}' \
221
+ | python3 -c "import sys,json; t=json.load(sys.stdin)['timings']; print(f'{t[\"predicted_per_second\"]:.1f} TPS ({t[\"predicted_n\"]} tokens)')"
222
  ```
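
For repeated measurements, the same request can be scripted in Python instead of chaining `curl` and `python3 -c`. This is a minimal sketch using only the standard library. It assumes the server returns the same `timings.predicted_per_second` field that the one-liner above reads. The function names and the default of three runs are our own choices.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Read the decode speed from the `timings` object of a chat completion response."""
    return float(response["timings"]["predicted_per_second"])


def chat(base_url: str, prompt: str, max_tokens: int = 500, timeout: float = 300.0) -> dict:
    """POST one chat completion request and return the parsed JSON body."""
    body = json.dumps({
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def benchmark(base_url: str, prompt: str, runs: int = 3) -> list[float]:
    """Send one unmeasured warmup request, then return the TPS of each measured run."""
    chat(base_url, "Hi", max_tokens=5)  # warmup: the first request is cold
    return [tokens_per_second(chat(base_url, prompt)) for _ in range(runs)]
```

Example: `benchmark("http://localhost:8182", "Explain transformer attention in detail. At least 400 words.")` returns a list of three TPS values once the turbo3 server is up.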
223
 
224
  ---
225
 
226
+ <a id="errors"></a>
227
+ ## Errors & Fixes
228
 
229
+ 5 errors we ran into during setup. All documented so you can skip them.
230
 
231
+ ### E1: Wrong repository
 
 
 
232
 
233
+ **Symptom:** Build succeeds. `llama-server -h | grep turbo` returns nothing.
234
+ **Cause:** Built from `TheTom/turboquant_plus` — that's a Python library for HuggingFace-style inference. The llama.cpp fork is `TheTom/llama-cpp-turboquant`.
235
+ **Fix:** Use the Dockerfile in this repo. It clones the correct repo.
236
 
237
  ---
238
 
239
+ ### E2: Wrong cmake flag
 
 
 
 
240
 
241
+ **Symptom:** CUDA is not used. Inference runs on CPU — extremely slow.
242
+ **Cause:** `-DLLAMA_CUBLAS=ON` was renamed to `-DGGML_CUDA=ON` in llama.cpp after the GGML refactor. The old flag compiles without error but is silently ignored.
 
243
 
244
  ```dockerfile
245
+ # Wrong (silently ignored since the llama.cpp GGML refactor):
246
+ cmake -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON .
247
 
248
+ # Correct:
249
  cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
250
  ```
251
 
252
+ ---
253
+
254
+ ### E3: `libcuda.so.1` not found at build time
255
+
256
+ **Symptom:** Docker build fails with `cannot find -lcuda` or linker error for `libcuda.so.1`.
257
+ **Cause:** CUDA development images ship a stub `libcuda.so` — the actual driver (`libcuda.so.1`) is injected at container runtime, not available during `docker build`.
258
 
 
 
 
259
  ```dockerfile
260
+ # Add before cmake in your Dockerfile:
261
  RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
262
  /usr/local/cuda/lib64/stubs/libcuda.so.1 \
263
  && echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf \
264
  && ldconfig
265
  ```
266
 
267
+ ---
268
+
269
+ ### E4: Wrong branch
270
+
271
+ **Symptom:** `Unsupported cache type: turbo3` at runtime. Build was clean.
272
+ **Cause:** The default `master` branch of `llama-cpp-turboquant` is a plain llama.cpp fork. TurboQuant lives on `feature/turboquant-kv-cache`.
273
 
 
 
 
274
  ```bash
275
+ # Wrong — master branch, no TurboQuant:
276
+ git clone https://github.com/TheTom/llama-cpp-turboquant.git
277
+
278
+ # Correct:
279
  git clone https://github.com/TheTom/llama-cpp-turboquant.git \
280
  --branch feature/turboquant-kv-cache --depth=1
281
  ```
282
+
283
+ Check branches before cloning:
284
  ```bash
285
  curl -s "https://api.github.com/repos/TheTom/llama-cpp-turboquant/branches" \
286
  | python3 -c "import sys,json; [print(b['name']) for b in json.load(sys.stdin)]"
287
  ```
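
The branch check can also be made to fail loudly before a long Docker build. This sketch queries the same GitHub endpoint as the `curl` command above and tests for the branch name. The helper names are illustrative, and only the first page of results is inspected (the API returns 30 branches per page by default).

```python
import json
import urllib.request

REPO = "TheTom/llama-cpp-turboquant"
REQUIRED_BRANCH = "feature/turboquant-kv-cache"


def branch_names(payload: list) -> list:
    """Extract branch names from the GitHub 'list branches' JSON payload."""
    return [branch["name"] for branch in payload]


def fetch_branches(repo: str = REPO, timeout: float = 30.0) -> list:
    """Fetch the first page of branches for a repository via the GitHub REST API."""
    url = f"https://api.github.com/repos/{repo}/branches"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return branch_names(json.load(resp))


def has_required_branch(names: list, required: str = REQUIRED_BRANCH) -> bool:
    """Return True if the TurboQuant feature branch is among the given names."""
    return required in names


if __name__ == "__main__":
    names = fetch_branches()
    if not has_required_branch(names):
        raise SystemExit(f"{REQUIRED_BRANCH} not found; available: {names}")
    print(f"OK: {REQUIRED_BRANCH} exists")
```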
288
 
289
+ ---
290
+
291
+ ### E5: Wrong HuggingFace repo name
292
+
293
+ **Symptom:** 404 or 401 on model download.
294
+ **Cause:** Model repo names on HuggingFace change. Don't rely on memory.
295
 
 
 
 
296
  ```bash
297
+ # Always verify first:
298
  curl -s -H "Authorization: Bearer $HF_TOKEN" \
299
  "https://huggingface.co/api/models?search=bartowski+mistral+small+3.2&limit=5" \
300
  | python3 -c "import sys,json; [print(m['modelId']) for m in json.load(sys.stdin)]"
 
302
 
303
  ---
304
 
305
+ <a id="reproduce"></a>
306
+ ## Reproduce Our Results
307
 
308
  ```bash
309
+ # 1. Build and start baseline server
310
  docker build -t turboquant:feature .
311
+ bash scripts/run-baseline.sh
312
+ sleep 45
313
 
314
+ # 2. VRAM after server start
315
  nvidia-smi --query-gpu=memory.used --format=csv,noheader
316
+ # Expected: ~15300 MB
317
 
318
+ # 3. Warmup (mandatory — first request is cold and gives wrong TPS)
319
+ curl -sf http://localhost:8180/v1/chat/completions \
320
+ -H "Content-Type: application/json" \
321
+ -d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"max_tokens":5}' > /dev/null
322
+
323
+ # 4. 3× TPS measurement
324
+ PROMPT='{"model":"local","messages":[{"role":"user","content":"Explain in detail how transformer attention mechanisms work. Cover self-attention, multi-head attention, key-query-value matrices, and positional encoding. Write at least 400 words."}],"max_tokens":500}'
325
+ for i in 1 2 3; do
326
+ curl -sf http://localhost:8180/v1/chat/completions \
327
+ -H "Content-Type: application/json" \
328
+ -d "$PROMPT" \
329
+ | python3 -c "import sys,json; t=json.load(sys.stdin)['timings']; print(f'Run $i: {t[\"predicted_per_second\"]:.2f} TPS ({t[\"predicted_n\"]} tokens)')"
330
+ done
331
+
332
+ # 5. Stop, start TurboQuant (100K context needs ~90s to allocate)
333
  docker stop turboquant-baseline
334
+ bash scripts/run-turbo.sh
335
+ sleep 90
336
+
337
+ # 6. Repeat steps 2-4 on port 8182
338
  ```
339
 
340
+ **Expected on RTX 3090 + Mistral-Small-3.2 24B:**
341
+ - Baseline: 50–52 TPS, VRAM ~15.3 GB
342
+ - turbo3 at 100K: 46–48 TPS, VRAM ~17.1 GB
343
 
344
+ Full benchmark data for comparison: [`results/turboquant-3090-all-runs-2026-04.json`](results/turboquant-3090-all-runs-2026-04.json)
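
To compare a live VRAM reading against the expected values, the `nvidia-smi` output can be parsed directly. This is a minimal sketch. It assumes the `csv,noheader` format prints one line per GPU, such as `15300 MiB`. The 500 MiB tolerance is an arbitrary choice of ours, not a value from the benchmark data.

```python
import subprocess


def parse_memory_used(output: str) -> list[int]:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv,noheader` output into MiB per GPU."""
    values = []
    for line in output.splitlines():
        line = line.strip()
        if line:
            values.append(int(line.split()[0]))  # "15300 MiB" -> 15300
    return values


def read_memory_used() -> list[int]:
    """Run nvidia-smi and return used memory in MiB for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_used(result.stdout)


def close_to(actual_mib: int, expected_mib: int, tolerance_mib: int = 500) -> bool:
    """Return True if a reading is within the (assumed) tolerance of the expected value."""
    return abs(actual_mib - expected_mib) <= tolerance_mib
```

For example, `close_to(read_memory_used()[0], 15300)` checks the baseline reading on the first GPU. Use `17100` for the turbo3 run.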
345
 
346
+ ---
347
 
348
+ ## Hardware & Model Compatibility
349
 
350
+ ### GPU VRAM requirements
351
 
352
+ | VRAM | Recommended model | turbo3 context | Notes |
353
+ |------|------------------|:--------------:|-------|
354
+ | 6 GB | Llama-3.2 3B Q4_K_M (~2 GB) | ~200K | Very fast, limited capability |
355
+ | 8 GB | Llama-3.1 8B Q4_K_M (4.7 GB) | ~64K | **Verified** — our RTX 4070 setup |
356
+ | 12 GB | Qwen2.5 14B Q4_K_M (~8.5 GB) | ~80K | Estimated |
357
+ | 24 GB | Mistral-Small-3.2 24B Q4_K_M (14.4 GB) | ~100K | **Verified** — our RTX 3090 setup |
358
 
359
+ *Estimates for non-verified rows depend on model architecture and batch size.*
360
 
361
+ ### System requirements
 
362
 
363
+ | Component | Minimum | Our setups |
364
+ |-----------|---------|-----------|
365
+ | GPU | CUDA-capable, VRAM per table above | RTX 3090 / RTX 4070 Laptop |
366
+ | System RAM | 16 GB | 32 GB / 16 GB |
367
+ | Disk | 20 GB free | SSD |
368
+ | CUDA | 12.x | 12.6.3 |
369
+ | OS | Linux, or Windows with Docker Desktop | Windows + Docker Desktop |
370
 
371
+ **Windows:** Docker Desktop works. Use named Docker volumes for models; avoid `/tmp/` paths.
372
 
373
  ---
374
 
375
+ ## License
376
 
377
  Content and scripts: [CC BY 4.0](LICENSE)
378
+ Based on [TurboQuant (arXiv:2504.19874)](https://arxiv.org/abs/2504.19874) by Zandieh et al., ICLR 2026
379
  llama.cpp fork: [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
380
 
381
  ---
382
 
383
+ <a id="deutsch"></a>
 
384
  ## 🇩🇪 Deutsch
385
 
386
  ### TurboQuant auf Consumer-Hardware — Praktischer Guide
387
 
388
+ Dieses Repository dokumentiert unsere Ergebnisse mit TurboQuant (ICLR 2026) auf zwei Consumer-GPUs, inklusive aller 5 Fehler, die wir gemacht haben, und wie wir sie gelöst haben.
 
 
 
 
389
 
390
+ **TurboQuant komprimiert den KV-Cache von 16-Bit auf 2–4-Bit während der Inferenz. Die Modellgewichte bleiben unverändert.**
391
 
392
+ ### Ergebnisse
 
 
 
393
 
394
+ **RTX 3090 (24 GB), Mistral-Small-3.2 24B Q4_K_M** — 4 unabhängige Runs:
395
+ - Context: 8.192 → **100.000 Tokens** (+12,2×)
396
+ - VRAM-Mehrverbrauch: nur **+1,8 GB** (statt ~12 GB die f16 bei 100K bräuchte)
397
+ - Geschwindigkeitsverlust: nur **−7,5%** (51,0 → 47,2 Tokens/s)
398
 
399
+ **RTX 4070 Laptop (8 GB), Llama-3.1 8B Q4_K_M** — 2 unabhängige Sessions:
400
+ - Context: 8.192 → **64.000 Tokens** (+7,8×)
401
+ - VRAM-Mehrverbrauch: nur **+0,5 GB**
402
+ - Geschwindigkeitsverlust: nur **−4,6%**
403
 
404
+ ### Warum das relevant ist
405
 
406
+ Größerer Context bedeutet: längere Dokumente verarbeiten, besseres RAG, mehr Gesprächshistorie — alles auf einer einzigen Consumer-GPU. Keine Cloud-Kosten, keine Datenschutz-Probleme.
 
407
 
408
+ ### Schnellstart
409
 
410
  ```bash
411
  # Image bauen (~20 Minuten)
412
  docker build -t turboquant:feature .
413
 
414
+ # TurboQuant-Unterstützung prüfen (muss turbo2, turbo3, turbo4 zeigen):
415
+ docker run --rm turboquant:feature llama-server -h 2>&1 | grep turbo
416
+
417
+ # Modell herunterladen (~14 GB)
418
  export HF_TOKEN=dein_token
419
  bash scripts/download-model.sh
420
 
421
+ # Baseline starten (f16, 8K Context, Port 8180)
422
  bash scripts/run-baseline.sh
423
 
424
+ # TurboQuant starten (turbo3, 100K Context, Port 8182)
425
+ # Hinweis: Startup dauert ~90s — 100K Context-Allokation braucht länger als 8K
426
  bash scripts/run-turbo.sh
427
  ```
428
 
429
+ ### Die 5 Fehler — Kurzfassung
430
+
431
+ 1. **Falsches Repo** — `turboquant_plus` ist eine Python-Bibliothek, nicht der llama.cpp Fork → [E1](#errors)
432
+ 2. **Falsches cmake-Flag** — `-DLLAMA_CUBLAS=ON` wird still ignoriert, korrekt: `-DGGML_CUDA=ON` → [E2](#errors)
433
+ 3. **`libcuda.so.1` fehlt** — Symlink vor cmake notwendig → [E3](#errors)
434
+ 4. **Falscher Branch** — `master` hat kein TurboQuant, korrekt: `feature/turboquant-kv-cache` → [E4](#errors)
435
+ 5. **Falscher HF-Repo-Name** — Immer per API prüfen, nie aus dem Gedächtnis → [E5](#errors)
436
+
437
  Vollständige deutsche Dokumentation: [`WHITEPAPER.de.md`](WHITEPAPER.de.md)
438
 
439
  ---
440
 
441
+ *[AI Engineering Lab](https://ai-engineering.at) · April 2026*