Spaces:

ai-engineering-at
/

llama-cpp-turboquant-guide

Running

App Files Files Community

AI Engineering Lab commited on Apr 2

Commit

740999f

1 Parent(s): a99aedc

docs: overhaul README — architecture diagram, fixed anchors, VRAM visualization, bilingual

Browse files

Files changed (1) hide show

README.md +251 -178

README.md CHANGED Viewed

@@ -24,223 +24,277 @@ tags:
 <div align="center">
 [![GitHub](https://img.shields.io/badge/GitHub-AI--Engineerings--at%2Fllama--cpp--turboquant--guide-181717?logo=github)](https://github.com/AI-Engineerings-at/llama-cpp-turboquant-guide)
 ![TurboQuant](https://img.shields.io/badge/TurboQuant-turbo3%20KV--Cache-blueviolet)
-![Context](https://img.shields.io/badge/Context-100%2C000%20tokens-brightgreen)
-![GPU](https://img.shields.io/badge/GPU-RTX%203090%2024GB-76b900?logo=nvidia)
-![VRAM Overhead](https://img.shields.io/badge/VRAM%20overhead-%2B1.8%20GB%20only-blue)
-![Speed Loss](https://img.shields.io/badge/Speed%20loss--8.5%25%20only-orange)
-![License](https://img.shields.io/badge/license-CC%20BY%204.0-green)
-**Practical guide: TurboQuant KV-cache quantization on consumer hardware.**
-**100,000 token context on a single RTX 3090 — verified, reproducible, step-by-step.**
-*Based on [TurboQuant (ICLR 2026, arXiv:2504.19874)](https://arxiv.org/abs/2504.19874)*
-**📦 [GitHub: AI-Engineerings-at/llama-cpp-turboquant-guide](https://github.com/AI-Engineerings-at/llama-cpp-turboquant-guide)**
-— Dockerfile · Scripts · Raw benchmark JSON · German white paper
-[Results](#-results) · [Quick Start](#-quick-start) · [How It Works](#-how-it-works) · [Errors & Fixes](#-errors--fixes) · [Deutsch](#-deutsch)
-</div>
 ---
-## 📊 Results
-Tested on two consumer GPUs. Results verified across multiple independent runs (April 1, 2026).
 ### RTX 3090 (24 GB) — Mistral-Small-3.2-24B Q4_K_M
-*4 independent benchmark runs, 15 total measurements.*
 | | Baseline (f16) | TurboQuant turbo3 | Delta |
 |--|:--------------:|:-----------------:|:-----:|
-| **Context** | 8,192 tokens | **100,000 tokens** | **+12.2×** |
-| **VRAM** | 15.3 GB | 17.1 GB | +1.8 GB only |
 | **Tokens/s** | 51.0 | 47.2 | **−7.5%** |
-| **KV-Cache size** | ~1 GB (f16) | ~2.8 GB (3-bit) | **4.3× compression** |
-> **12× more context. +12% VRAM. −7.5% speed. Same model weights.**
-Raw data: [`results/turboquant-3090-all-runs-2026-04.json`](results/turboquant-3090-all-runs-2026-04.json) (all 4 runs)
 ### RTX 4070 Laptop (8 GB) — Llama-3.1-8B-Instruct Q4_K_M
-*2 independent benchmark sessions verified.*
 | | Baseline (f16) | TurboQuant turbo3 | Delta |
 |--|:--------------:|:-----------------:|:-----:|
-| **Context** | 8,192 tokens | **64,000 tokens** | **+7.8×** |
-| **VRAM** | 5.7 GB | 6.2 GB | +0.54 GB only |
 | **Tokens/s (avg)** | 49.8 | 47.5 | **−4.6%** |
-> **7.8× more context. +0.5 GB VRAM. −5% speed. Consistent across both runs.**
 Raw data: [`results/turboquant-4070-results-2026-04-01.json`](results/turboquant-4070-results-2026-04-01.json) · [`results/turboquant-4070-laptop-2026-04-01.json`](results/turboquant-4070-laptop-2026-04-01.json)
 ### Cross-GPU Summary
-| GPU | VRAM | Model | Max Context (turbo3) | Speed Loss | Runs |
-|-----|------|-------|---------------------|-----------|------|
-| RTX 3090 | 24 GB | Mistral-Small-3.2 24B | 100,000 tokens | −7.5% | 4 |
-| RTX 4070 Laptop | 8 GB | Llama-3.1 8B | 64,000 tokens | −4.6% | 2 |
-TurboQuant scales with the GPU: the principle (+7-12× context, minimal speed loss) holds across hardware classes.
 ---
-## 🚀 Quick Start
-### 1. Build the Docker Image (~20 minutes)
 ```bash
 docker build -t turboquant:feature .
-# Verify TurboQuant is compiled in:
-docker run --rm turboquant:feature llama-server -h 2>&1 | grep -A3 "cache-type-k"
-# Must show: turbo2, turbo3, turbo4
 ```
-### 2. Download a Model
 ```bash
-# Set your HuggingFace token
 export HF_TOKEN=hf_your_token_here
 bash scripts/download-model.sh
 ```
-### 3. Run Baseline (f16, 8K context)
 ```bash
 bash scripts/run-baseline.sh
-# Server starts on port 8180
 ```
 ### 4. Run TurboQuant (turbo3, 100K context)
 ```bash
 bash scripts/run-turbo.sh
-# Server starts on port 8182
 ```
-### 5. Test It
 ```bash
-# Check available context
-curl -s http://localhost:8180/v1/models | jq '.data[0].context_length'
-# Baseline: 8192
-curl -s http://localhost:8182/v1/models | jq '.data[0].context_length'
-# TurboQuant: 131072 (model max, allocated to 100000)
-# Generate tokens (measures TPS in response)
-curl http://localhost:8182/v1/chat/completions \
   -H "Content-Type: application/json" \
-  -d '{"model":"local","messages":[{"role":"user","content":"Count from 1 to 200"}],"max_tokens":500}'
 ```
 ---
-## ⚙️ How It Works
-### The KV-Cache Problem
-When an LLM runs, it caches Key-Value pairs for every token in the context window.
-This cache grows **linearly** with context length:
-```
-Mistral-Small-3.2 24B on RTX 3090 (24 GB total, ~14.4 GB for model weights):
-Context    KV-Cache (f16)   Available after model   Fits?
-  8,192        ~1 GB             9.6 GB               ✅
- 32,000        ~4 GB             9.6 GB               ✅
-100,000       ~12 GB             9.6 GB               ❌ OOM without TurboQuant
-100,000     ~2.8 GB (turbo3)     9.6 GB               ✅
-```
-### What TurboQuant Does
-TurboQuant compresses the KV-cache from 16-bit floats to 2–4-bit integers.
-**It does NOT compress the model weights** — only the runtime cache.
-```
-f16 KV-Cache  →  turbo3 KV-Cache
-  16 bits     →    3 bits  =  4.3× compression
-```
-The model reads the quantized cache and generates text normally.
-Quality loss: <1% perplexity increase at turbo3 (per paper).
-### Two Repos — Critical Distinction
-There are two TurboQuant repositories with confusing names:
-| Repo | What it is | When to use |
-|------|-----------|-------------|
-| `TheTom/turboquant_plus` | Python library for research | HuggingFace models, Python API |
-| `TheTom/llama-cpp-turboquant` | llama.cpp fork | **This guide — llama-server** |
-**This guide uses `TheTom/llama-cpp-turboquant`, branch `feature/turboquant-kv-cache`.**
 ---
-## 🐛 Errors & Fixes
-Every error we hit during setup, documented so you don't repeat them:
-### E1: Wrong Repository
-**Symptom:** No `turbo2`/`turbo3`/`turbo4` options after building.
-**Cause:** Built from `TheTom/turboquant_plus` (Python library) instead of `TheTom/llama-cpp-turboquant`.
-**Fix:** Use the correct repo. See Dockerfile.
-### E2: Wrong cmake Flag
-**Symptom:** CUDA not used during inference, slow CPU fallback.
-**Cause:** Old flag `-DLLAMA_CUBLAS=ON` was renamed in llama.cpp post-GGML-refactor.
-**Fix:**
 ```dockerfile
-# WRONG (old, silently ignored):
-cmake -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON
-# CORRECT:
 cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
 ```
-### E3: libcuda.so.1 Not Found at Build Time
-**Symptom:** Build fails with `cannot find -lcuda` or linker error for `libcuda.so.1`.
-**Cause:** CUDA devel images have a stub `libcuda.so` but not `libcuda.so.1` (the runtime driver is injected at container start, not build time).
-**Fix:** Add symlink before cmake:
 ```dockerfile
 RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
            /usr/local/cuda/lib64/stubs/libcuda.so.1 \
     && echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf \
     && ldconfig
 ```
-### E4: Wrong Branch
-**Symptom:** `Unsupported cache type: turbo3` at runtime despite clean build.
-**Cause:** Cloning the default `master` branch of `llama-cpp-turboquant` — which is a standard llama.cpp fork **without** TurboQuant. The implementation is on `feature/turboquant-kv-cache`.
-**Fix:**
 ```bash
 git clone https://github.com/TheTom/llama-cpp-turboquant.git \
   --branch feature/turboquant-kv-cache --depth=1
 ```
-Always verify before building:
 ```bash
 curl -s "https://api.github.com/repos/TheTom/llama-cpp-turboquant/branches" \
   | python3 -c "import sys,json; [print(b['name']) for b in json.load(sys.stdin)]"
 ```
-### E5: Wrong HuggingFace Repo Name
-**Symptom:** 404 or 401 when downloading model.
-**Cause:** Model repo names change. Don't rely on memory or cached context.
-**Fix:** Always query HF Search API before downloading:
 ```bash
 curl -s -H "Authorization: Bearer $HF_TOKEN" \
   "https://huggingface.co/api/models?search=bartowski+mistral+small+3.2&limit=5" \
   | python3 -c "import sys,json; [print(m['modelId']) for m in json.load(sys.stdin)]"
@@ -248,121 +302,140 @@ curl -s -H "Authorization: Bearer $HF_TOKEN" \
 ---
-## 🔬 Reproduce Our Results
 ```bash
-# 1. Build
 docker build -t turboquant:feature .
-# 2. Download model (~14 GB)
-export HF_TOKEN=hf_your_token
-bash scripts/download-model.sh
-# 3. Baseline measurement
-bash scripts/run-baseline.sh &
-sleep 45  # wait for server startup
-curl -s http://localhost:8180/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{"model":"local","messages":[{"role":"user","content":"Count from 1 to 200, one per line."}],"max_tokens":500}' \
-  | python3 -c "import sys,json; d=json.load(sys.stdin); u=d['usage']; print(f'TPS: {u[\"completion_tokens\"] / (d[\"usage\"].get(\"total_time_ms\",10000)/1000):.1f}')"
 nvidia-smi --query-gpu=memory.used --format=csv,noheader
-# 4. Turbo3 measurement
 docker stop turboquant-baseline
-bash scripts/run-turbo.sh &
-sleep 90  # 100K context allocation takes longer
-# repeat curl + nvidia-smi on port 8182
 ```
-Expected results matching our run: see [`results/turboquant-rtx3090-2026-04-01.json`](results/turboquant-rtx3090-2026-04-01.json)
----
-## Hardware Requirements
-| | Minimum | Our Setup |
-|--|---------|----------|
-| GPU VRAM | 16 GB | RTX 3090 24 GB |
-| System RAM | 16 GB | 32 GB |
-| Disk | 30 GB | SSD |
-| CUDA | 12.x | 12.6.3 |
-| OS | Linux / Windows + Docker | Windows + Docker Desktop |
-> **Note on Windows:** Docker Desktop works fine. Avoid `/tmp/` paths — use named Docker volumes for model storage.
----
-## Model Compatibility
-Tested with **Mistral-Small-3.2-24B Q4_K_M** (14 GB).
-Should work with any GGUF model that fits the VRAM budget after KV-cache allocation.
-| Model | Size | VRAM (model) | Max ctx (turbo3) |
-|-------|------|-------------|-----------------|
-| Mistral-Small-3.2 24B Q4_K_M | 14 GB | 14.4 GB | ~100K on 24 GB GPU |
-| Llama-3.1 8B Q4_K_M | 4.7 GB | 5.1 GB | ~200K on 16 GB GPU |
-| Qwen2.5 14B Q4_K_M | 8.5 GB | 8.8 GB | ~150K on 16 GB GPU |
-*Estimates. Actual values depend on architecture and batch size.*
 ---
-## 📄 License
 Content and scripts: [CC BY 4.0](LICENSE)
-Based on [TurboQuant (arXiv:2504.19874)](https://arxiv.org/abs/2504.19874) by Thomas et al. (ICLR 2026)
 llama.cpp fork: [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
 ---
----
 ## 🇩🇪 Deutsch
 ### TurboQuant auf Consumer-Hardware — Praktischer Guide
-Dieses Repository dokumentiert unsere Erfahrungen beim Einsatz von TurboQuant (ICLR 2026)
-auf einer RTX 3090 im Homelab-Betrieb. Wir sind das erste europäische Team,
-das diese Methode praktisch auf Consumer-Hardware veröffentlicht dokumentiert hat.
-### Das Ergebnis
-Mit TurboQuant turbo3 (3-bit KV-Cache) haben wir auf einer RTX 3090 (24 GB):
-- **12× mehr Context** (8.192 → 100.000 Tokens)
-- nur **+1.8 GB VRAM** Mehrverbrauch
-- nur **−8.5% Geschwindigkeitsverlust**
-- **gleiche Modellgewichte** — nur der Laufzeit-Cache wird komprimiert
-### Warum das wichtig ist
-Größerer Context bedeutet: Längere Dokumente, mehr Gesprächshistorie, besseres RAG,
-Code-Analyse ganzer Codebasen — alles auf einer einzigen Consumer-GPU.
-### Fehler-Protokoll (5 Fehler die wir gemacht haben)
-Alle 5 Fehler aus unserem Setup sind unter [Errors & Fixes](#-errors--fixes) dokumentiert.
-Der häufigste: falscher Branch (`master` statt `feature/turboquant-kv-cache`).
-### Schnellstart (Deutsch)
 ```bash
 # Image bauen (~20 Minuten)
 docker build -t turboquant:feature .
-# Modell herunterladen (14 GB)
 export HF_TOKEN=dein_token
 bash scripts/download-model.sh
-# Baseline starten (f16, 8K Context)
 bash scripts/run-baseline.sh
-# TurboQuant starten (turbo3, 100K Context)
 bash scripts/run-turbo.sh
 ```
 Vollständige deutsche Dokumentation: [`WHITEPAPER.de.md`](WHITEPAPER.de.md)
 ---
-*AI Engineering Lab · April 2026 · [ai-engineering.at](https://ai-engineering.at)*

 <div align="center">
 [![GitHub](https://img.shields.io/badge/GitHub-AI--Engineerings--at%2Fllama--cpp--turboquant--guide-181717?logo=github)](https://github.com/AI-Engineerings-at/llama-cpp-turboquant-guide)
+[![Paper](https://img.shields.io/badge/Paper-arXiv%3A2504.19874-b31b1b?logo=arxiv)](https://arxiv.org/abs/2504.19874)
 ![TurboQuant](https://img.shields.io/badge/TurboQuant-turbo3%20KV--Cache-blueviolet)
+![RTX 3090](https://img.shields.io/badge/RTX%203090-100K%20ctx-76b900?logo=nvidia)
+![RTX 4070 Laptop](https://img.shields.io/badge/RTX%204070%20Laptop-64K%20ctx-76b900?logo=nvidia)
+![VRAM Overhead](https://img.shields.io/badge/VRAM%20delta-%2B1.8%20GB%20only-blue)
+![Speed Loss](https://img.shields.io/badge/Speed%20loss-max%208.5%25-orange)
+![License](https://img.shields.io/badge/license-CC%20BY%204.0-lightgrey)
+**TurboQuant (ICLR 2026) quantizes the KV-cache at runtime — not the model weights.**
+**Result: 100,000 token context on an RTX 3090. +1.8 GB VRAM. −8% speed.**
+*Verified across multiple independent runs. Step-by-step guide with Dockerfile, scripts, and raw benchmark data.*
+</div>
+---
+[TurboQuant (arXiv:2504.19874)](https://arxiv.org/abs/2504.19874) was presented at ICLR 2026. It compresses the KV-cache from 16-bit to 2–4-bit integers during inference. Model weights stay at full precision. This closes the gap between what your GPU can load and what context length it can actually serve.
+This repo documents our setup on two consumer GPUs — what we ran into, what we fixed, and what we measured.
+**What's in this repo:**
+| File | Description |
+|------|-------------|
+| `Dockerfile` | Builds llama.cpp with TurboQuant (correct repo, branch, cmake flags) |
+| `scripts/run-baseline.sh` | Starts llama-server with f16 cache, 8K context |
+| `scripts/run-turbo.sh` | Starts llama-server with turbo3 cache, 100K context |
+| `scripts/download-model.sh` | Downloads model from HuggingFace via API |
+| `results/` | Raw benchmark JSON — all runs, both GPUs |
+| `WHITEPAPER.de.md` | German white paper |
+**→ [Results](#results) · [How It Works](#how-it-works) · [Quick Start](#quick-start) · [Errors & Fixes](#errors) · [Deutsch](#deutsch)**
 ---
+<a id="results"></a>
+## Results
+Verified on two consumer GPUs. April 2026.
 ### RTX 3090 (24 GB) — Mistral-Small-3.2-24B Q4_K_M
+*4 independent benchmark runs, 15 total measurements. All runs consistent (±0.3% TPS variance).*
 | | Baseline (f16) | TurboQuant turbo3 | Delta |
 |--|:--------------:|:-----------------:|:-----:|
+| **Context** | 8,192 | **100,000** | **+12.2×** |
+| **VRAM used** | 15.3 GB | 17.1 GB | +1.8 GB |
 | **Tokens/s** | 51.0 | 47.2 | **−7.5%** |
+| **KV-Cache** | ~1 GB (f16) | ~2.8 GB (3-bit) | 4.3× smaller |
+> 12× more context. +12% VRAM. −7.5% speed. Same model weights.
+```
+RTX 3090 — 24 GB VRAM
+────────────────────────────────────────────────
+Baseline   (f16,    8K ctx):  █████████████░░░░░░  15.3 GB
+TurboQuant (turbo3, 100K ctx): ██████████████░░░░░  17.1 GB
+                               ↑ weights 14.4 GB fixed
+                                              ↑ KV-cache
+```
+Raw data: [`results/turboquant-3090-all-runs-2026-04.json`](results/turboquant-3090-all-runs-2026-04.json) — 4 runs, 15 measurements
+---
 ### RTX 4070 Laptop (8 GB) — Llama-3.1-8B-Instruct Q4_K_M
+*2 independent benchmark sessions. VRAM delta stable at ±2 MB between sessions.*
 | | Baseline (f16) | TurboQuant turbo3 | Delta |
 |--|:--------------:|:-----------------:|:-----:|
+| **Context** | 8,192 | **64,000** | **+7.8×** |
+| **VRAM used** | 5.7 GB | 6.2 GB | +0.54 GB |
 | **Tokens/s (avg)** | 49.8 | 47.5 | **−4.6%** |
+> 7.8× more context. +0.5 GB VRAM. −5% speed.
 Raw data: [`results/turboquant-4070-results-2026-04-01.json`](results/turboquant-4070-results-2026-04-01.json) · [`results/turboquant-4070-laptop-2026-04-01.json`](results/turboquant-4070-laptop-2026-04-01.json)
+---
 ### Cross-GPU Summary
+| GPU | VRAM | Model | ctx (turbo3) | VRAM delta | Speed loss | Runs |
+|-----|------|-------|:------------:|:----------:|:----------:|:----:|
+| RTX 3090 | 24 GB | Mistral-Small-3.2 24B | **100,000** | +1.8 GB | −7.5% | 4 |
+| RTX 4070 Laptop | 8 GB | Llama-3.1 8B | **64,000** | +0.5 GB | −4.6% | 2 |
+The principle scales with the GPU. More VRAM → larger model → larger absolute VRAM savings from compression → more context headroom.
+---
+<a id="how-it-works"></a>
+## How It Works
+### The KV-cache problem
+Every token you feed into an LLM creates Key-Value vectors that must stay in VRAM for the duration of the request. With f16 (default), this cache grows linearly:
+```
+Mistral-Small-3.2 24B on RTX 3090 (24 GB):
+  Model weights occupy 14.4 GB.  Remaining: ~9.6 GB for KV-cache.
+  Context     KV-cache (f16)   Remaining   Status
+    8,192          ~1 GB         ~8.6 GB     ✅ fine
+   32,000          ~4 GB         ~5.6 GB     ✅ fine
+  100,000         ~12 GB        −2.4 GB      ❌ OOM
+  100,000      ~2.8 GB (turbo3)  ~6.8 GB     ✅ fine
+```
+### What TurboQuant does
+TurboQuant re-encodes the KV-cache from 16 bits to 2–4 bits on-the-fly. The model reads the quantized cache and generates output normally. The model weights are never touched.
+```
+f16 KV-cache  →  turbo3 KV-cache
+  16 bits     →     3 bits   =  4.3× compression
+```
+Quality loss at turbo3: <1% perplexity increase (per paper). In practice: not noticeable for most tasks.
+```mermaid
+flowchart LR
+    tokens["100K input tokens"] --> model["Model layers\nweights: 14.4 GB\nunchanged"]
+    model --> kv_type{KV-Cache format?}
+    kv_type -->|"f16 default"| oom["~12 GB cache\n❌ OOM on 24 GB GPU"]
+    kv_type -->|"turbo3 (3-bit)"| fits["~2.8 GB cache\n✅ Fits: 7 GB still free"]
+    fits --> output["Output tokens\n−8% TPS vs baseline"]
+```
+### Critical: two repos with confusing names
+| Repo | What it is | Used here? |
+|------|-----------|:----------:|
+| `TheTom/turboquant_plus` | Python research library — HuggingFace models, Python API | ❌ |
+| `TheTom/llama-cpp-turboquant` | llama.cpp fork with `--cache-type-k turbo3` | ✅ |
+Branch: `feature/turboquant-kv-cache` — **not `master`** (which is a standard llama.cpp fork, no TurboQuant).
 ---
+<a id="quick-start"></a>
+## Quick Start
+**Requirements:** Docker with NVIDIA runtime, CUDA 12.x, HuggingFace account (free).
+### 1. Build the Docker image (~20 min)
 ```bash
 docker build -t turboquant:feature .
+# Verify TurboQuant is compiled in — must show turbo2, turbo3, turbo4:
+docker run --rm turboquant:feature llama-server -h 2>&1 | grep turbo
 ```
+### 2. Download model (~14 GB)
 ```bash
 export HF_TOKEN=hf_your_token_here
 bash scripts/download-model.sh
 ```
+### 3. Run baseline (f16, 8K context)
 ```bash
 bash scripts/run-baseline.sh
+# Server on port 8180. Starts in ~45s.
 ```
 ### 4. Run TurboQuant (turbo3, 100K context)
 ```bash
 bash scripts/run-turbo.sh
+# Server on port 8182. Starts in ~90s — 100K context allocation takes longer.
 ```
+### 5. Test
 ```bash
+# Check context length
+curl -s http://localhost:8182/v1/models \
+  | python3 -c "import sys,json; print(json.load(sys.stdin)['data'][0]['context_length'])"
+# → 100000
+# Warmup (first request is cold — don't measure this one)
+curl -sf http://localhost:8182/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"local","messages":[{"role":"user","content":"Hi"}],"max_tokens":5}' > /dev/null
+# Measure TPS
+curl -s http://localhost:8182/v1/chat/completions \
   -H "Content-Type: application/json" \
+  -d '{"model":"local","messages":[{"role":"user","content":"Explain transformer attention in detail. At least 400 words."}],"max_tokens":500}' \
+  | python3 -c "import sys,json; t=json.load(sys.stdin)['timings']; print(f'{t[\"predicted_per_second\"]:.1f} TPS ({t[\"predicted_n\"]} tokens)')"
 ```
 ---
+<a id="errors"></a>
+## Errors & Fixes
+5 errors we ran into during setup. All documented so you can skip them.
+### E1: Wrong repository
+**Symptom:** Build succeeds. `llama-server -h | grep turbo` returns nothing.
+**Cause:** Built from `TheTom/turboquant_plus` — that's a Python library for HuggingFace-style inference. The llama.cpp fork is `TheTom/llama-cpp-turboquant`.
+**Fix:** Use the Dockerfile in this repo. It clones the correct repo.
 ---
+### E2: Wrong cmake flag
+**Symptom:** CUDA is not used. Inference runs on CPU — extremely slow.
+**Cause:** `-DLLAMA_CUBLAS=ON` was renamed to `-DGGML_CUDA=ON` in llama.cpp after the GGML refactor. The old flag compiles without error but is silently ignored.
 ```dockerfile
+# Wrong — silently ignored since llama.cpp GGML refactor:
+cmake -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON .
+# Correct:
 cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
 ```
+---
+### E3: `libcuda.so.1` not found at build time
+**Symptom:** Docker build fails with `cannot find -lcuda` or linker error for `libcuda.so.1`.
+**Cause:** CUDA development images ship a stub `libcuda.so` — the actual driver (`libcuda.so.1`) is injected at container runtime, not available during `docker build`.
 ```dockerfile
+# Add before cmake in your Dockerfile:
 RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
            /usr/local/cuda/lib64/stubs/libcuda.so.1 \
     && echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf \
     && ldconfig
 ```
+---
+### E4: Wrong branch
+**Symptom:** `Unsupported cache type: turbo3` at runtime. Build was clean.
+**Cause:** The default `master` branch of `llama-cpp-turboquant` is a plain llama.cpp fork. TurboQuant lives on `feature/turboquant-kv-cache`.
 ```bash
+# Wrong — master branch, no TurboQuant:
+git clone https://github.com/TheTom/llama-cpp-turboquant.git
+# Correct:
 git clone https://github.com/TheTom/llama-cpp-turboquant.git \
   --branch feature/turboquant-kv-cache --depth=1
 ```
+Check branches before cloning:
 ```bash
 curl -s "https://api.github.com/repos/TheTom/llama-cpp-turboquant/branches" \
   | python3 -c "import sys,json; [print(b['name']) for b in json.load(sys.stdin)]"
 ```
+---
+### E5: Wrong HuggingFace repo name
+**Symptom:** 404 or 401 on model download.
+**Cause:** Model repo names on HuggingFace change. Don't rely on memory.
 ```bash
+# Always verify first:
 curl -s -H "Authorization: Bearer $HF_TOKEN" \
   "https://huggingface.co/api/models?search=bartowski+mistral+small+3.2&limit=5" \
   | python3 -c "import sys,json; [print(m['modelId']) for m in json.load(sys.stdin)]"
 ---
+<a id="reproduce"></a>
+## Reproduce Our Results
 ```bash
+# 1. Build and start baseline server
 docker build -t turboquant:feature .
+bash scripts/run-baseline.sh
+sleep 45
+# 2. VRAM after server start
 nvidia-smi --query-gpu=memory.used --format=csv,noheader
+# Expected: ~15300 MB
+# 3. Warmup (mandatory — first request is cold and gives wrong TPS)
+curl -sf http://localhost:8180/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"max_tokens":5}' > /dev/null
+# 4. 3× TPS measurement
+PROMPT='{"model":"local","messages":[{"role":"user","content":"Explain in detail how transformer attention mechanisms work. Cover self-attention, multi-head attention, key-query-value matrices, and positional encoding. Write at least 400 words."}],"max_tokens":500}'
+for i in 1 2 3; do
+  curl -sf http://localhost:8180/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d "$PROMPT" \
+    | python3 -c "import sys,json; t=json.load(sys.stdin)['timings']; print(f'Run $i: {t[\"predicted_per_second\"]:.2f} TPS ({t[\"predicted_n\"]} tokens)')"
+done
+# 5. Stop, start TurboQuant (100K context needs ~90s to allocate)
 docker stop turboquant-baseline
+bash scripts/run-turbo.sh
+sleep 90
+# 6. Repeat steps 2-4 on port 8182
 ```
+**Expected on RTX 3090 + Mistral-Small-3.2 24B:**
+- Baseline: 50–52 TPS, VRAM ~15.3 GB
+- turbo3 at 100K: 46–48 TPS, VRAM ~17.1 GB
+Full benchmark data for comparison: [`results/turboquant-3090-all-runs-2026-04.json`](results/turboquant-3090-all-runs-2026-04.json)
+---
+## Hardware & Model Compatibility
+### GPU VRAM requirements
+| VRAM | Recommended model | turbo3 context | Notes |
+|------|------------------|:--------------:|-------|
+| 6 GB | Llama-3.2 3B Q4_K_M (~2 GB) | ~200K | Very fast, limited capability |
+| 8 GB | Llama-3.1 8B Q4_K_M (4.7 GB) | ~64K | **Verified** — our RTX 4070 setup |
+| 12 GB | Qwen2.5 14B Q4_K_M (~8.5 GB) | ~80K | Estimated |
+| 24 GB | Mistral-Small-3.2 24B Q4_K_M (14.4 GB) | ~100K | **Verified** — our RTX 3090 setup |
+*Estimates for non-verified rows depend on model architecture and batch size.*
+### System requirements
+| Component | Minimum | Our setups |
+|-----------|---------|-----------|
+| GPU | CUDA-capable, VRAM per table above | RTX 3090 / RTX 4070 Laptop |
+| System RAM | 16 GB | 32 GB / 16 GB |
+| Disk | 20 GB free | SSD |
+| CUDA | 12.x | 12.6.3 |
+| OS | Linux, or Windows with Docker Desktop | Windows + Docker Desktop |
+**Windows:** Docker Desktop works. Use named Docker volumes for models — avoid `/tmp/` paths.
 ---
+## License
 Content and scripts: [CC BY 4.0](LICENSE)
+Based on [TurboQuant (arXiv:2504.19874)](https://arxiv.org/abs/2504.19874) by Thomas et al., ICLR 2026
 llama.cpp fork: [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
 ---
+<a id="deutsch"></a>
 ## 🇩🇪 Deutsch
 ### TurboQuant auf Consumer-Hardware — Praktischer Guide
+Dieses Repository dokumentiert unsere Ergebnisse mit TurboQuant (ICLR 2026) auf zwei Consumer-GPUs — inklusive aller 5 Fehler die wir gemacht haben und wie wir sie gelöst haben.
+**TurboQuant komprimiert den KV-Cache von 16-Bit auf 2–4-Bit während der Inferenz. Die Modellgewichte bleiben unverändert.**
+### Ergebnisse
+**RTX 3090 (24 GB), Mistral-Small-3.2 24B Q4_K_M** — 4 unabhängige Runs:
+- Context: 8.192 → **100.000 Tokens** (+12,2×)
+- VRAM-Mehrverbrauch: nur **+1,8 GB** (statt ~12 GB die f16 bei 100K bräuchte)
+- Geschwindigkeitsverlust: nur **−7,5%** (51,0 → 47,2 Tokens/s)
+**RTX 4070 Laptop (8 GB), Llama-3.1 8B Q4_K_M** — 2 unabhängige Sessions:
+- Context: 8.192 → **64.000 Tokens** (+7,8×)
+- VRAM-Mehrverbrauch: nur **+0,5 GB**
+- Geschwindigkeitsverlust: nur **−4,6%**
+### Warum das relevant ist
+Größerer Context bedeutet: längere Dokumente verarbeiten, besseres RAG, mehr Gesprächshistorie — alles auf einer einzigen Consumer-GPU. Keine Cloud-Kosten, keine Datenschutz-Probleme.
+### Schnellstart
 ```bash
 # Image bauen (~20 Minuten)
 docker build -t turboquant:feature .
+# TurboQuant-Unterstützung prüfen (muss turbo2, turbo3, turbo4 zeigen):
+docker run --rm turboquant:feature llama-server -h 2>&1 | grep turbo
+# Modell herunterladen (~14 GB)
 export HF_TOKEN=dein_token
 bash scripts/download-model.sh
+# Baseline starten (f16, 8K Context, Port 8180)
 bash scripts/run-baseline.sh
+# TurboQuant starten (turbo3, 100K Context, Port 8182)
+# Hinweis: Startup dauert ~90s — 100K Context-Allokation braucht länger als 8K
 bash scripts/run-turbo.sh
 ```
+### Die 5 Fehler — Kurzfassung
+1. **Falsches Repo** — `turboquant_plus` ist eine Python-Bibliothek, nicht der llama.cpp Fork → [E1](#errors)
+2. **Falsches cmake-Flag** — `-DLLAMA_CUBLAS=ON` wird still ignoriert, korrekt: `-DGGML_CUDA=ON` → [E2](#errors)
+3. **`libcuda.so.1` fehlt** — Symlink vor cmake notwendig → [E3](#errors)
+4. **Falscher Branch** — `master` hat kein TurboQuant, korrekt: `feature/turboquant-kv-cache` → [E4](#errors)
+5. **Falscher HF-Repo-Name** — Immer per API prüfen, nie aus dem Gedächtnis → [E5](#errors)
 Vollständige deutsche Dokumentation: [`WHITEPAPER.de.md`](WHITEPAPER.de.md)
 ---
+*[AI Engineering Lab](https://ai-engineering.at) · April 2026*