Commit 87efc66, committed by AI Engineering Lab
Initial release: TurboQuant practical guide for consumer hardware
- README.md: bilingual (EN/DE), benchmark results, quick start, 5 error fixes
- WHITEPAPER.de.md: full German white paper
- Dockerfile: reproducible CUDA build (feature/turboquant-kv-cache branch)
- scripts/: download-model.sh, run-baseline.sh, run-turbo.sh
- results/: RTX 3090 benchmark JSON (April 2026)
Results: 100K context on RTX 3090 with +1.8 GB VRAM and -8.5% TPS
Based on TurboQuant (ICLR 2026, arXiv:2504.19874)
- Dockerfile +56 -0
- LICENSE +35 -0
- README.md +316 -0
- WHITEPAPER.de.md +186 -0
- results/turboquant-rtx3090-2026-04-01.json +92 -0
- scripts/download-model.sh +68 -0
- scripts/run-baseline.sh +42 -0
- scripts/run-turbo.sh +48 -0
Dockerfile
ADDED
|
@@ -0,0 +1,56 @@
|
| 1 |
+
# TurboQuant llama.cpp — CUDA Build
|
| 2 |
+
# Builds llama-server with TurboQuant KV-cache quantization support
|
| 3 |
+
# turbo2 / turbo3 / turbo4 cache types enabled
|
| 4 |
+
#
|
| 5 |
+
# CRITICAL: Use --branch feature/turboquant-kv-cache (NOT master!)
|
| 6 |
+
# The master branch is a standard llama.cpp without TurboQuant support.
|
| 7 |
+
#
|
| 8 |
+
# Usage:
|
| 9 |
+
# docker build -t turboquant:feature .
|
| 10 |
+
# docker run --rm turboquant:feature llama-server -h 2>&1 | grep -A3 "cache-type-k"
|
| 11 |
+
# # Must show: turbo2, turbo3, turbo4
|
| 12 |
+
|
| 13 |
+
FROM nvidia/cuda:12.6.3-devel-ubuntu22.04
|
| 14 |
+
|
| 15 |
+
ENV DEBIAN_FRONTEND=noninteractive
|
| 16 |
+
RUN apt-get update && apt-get install -y \
|
| 17 |
+
cmake \
|
| 18 |
+
build-essential \
|
| 19 |
+
git \
|
| 20 |
+
wget \
|
| 21 |
+
curl \
|
| 22 |
+
python3 \
|
| 23 |
+
python3-pip \
|
| 24 |
+
&& rm -rf /var/lib/apt/lists/*
|
| 25 |
+
|
| 26 |
+
WORKDIR /build
|
| 27 |
+
|
| 28 |
+
# CRITICAL: Must use --branch feature/turboquant-kv-cache
|
| 29 |
+
# Default 'master' does NOT have turbo2/turbo3/turbo4 cache types!
|
| 30 |
+
RUN git clone https://github.com/TheTom/llama-cpp-turboquant.git \
|
| 31 |
+
--branch feature/turboquant-kv-cache \
|
| 32 |
+
--depth=1
|
| 33 |
+
|
| 34 |
+
WORKDIR /build/llama-cpp-turboquant
|
| 35 |
+
|
| 36 |
+
# Fix: libcuda.so.1 is not available at build time (driver is injected at runtime only)
|
| 37 |
+
# The devel image provides a stub at /usr/local/cuda/lib64/stubs/libcuda.so
|
| 38 |
+
# Symlink to .1 so the linker finds it during cmake build
|
| 39 |
+
RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
|
| 40 |
+
/usr/local/cuda/lib64/stubs/libcuda.so.1 \
|
| 41 |
+
&& echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf \
|
| 42 |
+
&& ldconfig
|
| 43 |
+
|
| 44 |
+
# IMPORTANT: Use -DGGML_CUDA=ON (not -DLLAMA_CUBLAS=ON which was renamed in ~2024)
|
| 45 |
+
RUN cmake -B build \
|
| 46 |
+
-DGGML_CUDA=ON \
|
| 47 |
+
-DCMAKE_BUILD_TYPE=Release \
|
| 48 |
+
&& cmake --build build --config Release -j4 --target llama-server
|
| 49 |
+
|
| 50 |
+
RUN cp build/bin/llama-server /usr/local/bin/llama-server
|
| 51 |
+
|
| 52 |
+
WORKDIR /models
|
| 53 |
+
EXPOSE 8180
|
| 54 |
+
|
| 55 |
+
# Default: show help. Override CMD in docker run to actually serve a model.
|
| 56 |
+
CMD ["llama-server", "--help"]
|
LICENSE
ADDED
|
@@ -0,0 +1,35 @@
|
| 1 |
+
Creative Commons Attribution 4.0 International (CC BY 4.0)
|
| 2 |
+
|
| 3 |
+
Copyright (c) 2026 AI Engineering Lab (ai-engineering.at)
|
| 4 |
+
|
| 5 |
+
You are free to:
|
| 6 |
+
Share — copy and redistribute the material in any medium or format
|
| 7 |
+
Adapt — remix, transform, and build upon the material for any purpose, even commercially
|
| 8 |
+
|
| 9 |
+
Under the following terms:
|
| 10 |
+
Attribution — You must give appropriate credit, provide a link to the license,
|
| 11 |
+
and indicate if changes were made.
|
| 12 |
+
|
| 13 |
+
Full license text: https://creativecommons.org/licenses/by/4.0/legalcode
|
| 14 |
+
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
The Dockerfile and shell scripts in this repository are additionally available under MIT License:
|
| 18 |
+
|
| 19 |
+
MIT License
|
| 20 |
+
|
| 21 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy of this software
|
| 22 |
+
and associated documentation files, to deal in the Software without restriction, including
|
| 23 |
+
without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
|
| 24 |
+
and/or sell copies of the Software, and to permit persons to whom the Software is furnished
|
| 25 |
+
to do so, subject to the following conditions:
|
| 26 |
+
|
| 27 |
+
The above copyright notice and this permission notice shall be included in all copies or
|
| 28 |
+
substantial portions of the Software.
|
| 29 |
+
|
| 30 |
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
|
| 31 |
+
|
| 32 |
+
---
|
| 33 |
+
|
| 34 |
+
Based on TurboQuant (arXiv:2504.19874) by Zandieh et al., licensed under their respective terms.
|
| 35 |
+
llama.cpp fork: https://github.com/TheTom/llama-cpp-turboquant
|
README.md
ADDED
|
@@ -0,0 +1,316 @@
|
| 1 |
+
# llama-cpp-turboquant-guide
|
| 2 |
+
|
| 3 |
+
<div align="center">
|
| 4 |
+
|
| 5 |
+

|
| 6 |
+

|
| 7 |
+

|
| 8 |
+

|
| 9 |
+

|
| 10 |
+

|
| 11 |
+
|
| 12 |
+
**Practical guide: TurboQuant KV-cache quantization on consumer hardware.**
|
| 13 |
+
**100,000 token context on a single RTX 3090 — verified, reproducible, step-by-step.**
|
| 14 |
+
|
| 15 |
+
*Based on [TurboQuant (ICLR 2026, arXiv:2504.19874)](https://arxiv.org/abs/2504.19874)*
|
| 16 |
+
|
| 17 |
+
[Results](#-results) · [Quick Start](#-quick-start) · [How It Works](#-how-it-works) · [Errors & Fixes](#-errors--fixes) · [Deutsch](#-deutsch)
|
| 18 |
+
|
| 19 |
+
</div>
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## 📊 Results
|
| 24 |
+
|
| 25 |
+
Tested on **NVIDIA RTX 3090 (24 GB VRAM)** with **Mistral-Small-3.2-24B Q4_K_M**.
|
| 26 |
+
|
| 27 |
+
| | Baseline (f16) | TurboQuant turbo3 | Delta |
|
| 28 |
+
|--|:--------------:|:-----------------:|:-----:|
|
| 29 |
+
| **Context** | 8,192 tokens | **100,000 tokens** | **12.2×** |
|
| 30 |
+
| **VRAM** | 15.4 GB | 17.2 GB | +1.8 GB (+12%) |
|
| 31 |
+
| **Tokens/s** | 49.2 | 45.0 | −8.5% |
|
| 32 |
+
| **KV-Cache size** | ~1 GB (f16, 8K context) | ~2.8 GB (3-bit, 100K context) | **4.3× smaller than f16 at 100K (~12 GB)** |
|
| 33 |
+
|
| 34 |
+
> **12× more context. +12% VRAM. −8.5% speed. Same model weights.**
|
| 35 |
+
|
| 36 |
+
Raw data: [`results/turboquant-rtx3090-2026-04-01.json`](results/turboquant-rtx3090-2026-04-01.json)
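The deltas in the table can be recomputed from the raw figures. The following is a minimal Python sketch, assuming the `max_context`, `vram_total_mb` and `tps_avg` values recorded in the results JSON; the dictionary layout and the `summarize` helper are illustrative and not part of this repository.

```python
# Figures copied from the raw results file (RTX 3090, Mistral-Small-3.2-24B Q4_K_M).
baseline = {"context": 8192, "vram_mb": 15748, "tps": 49.2}
turbo3 = {"context": 100000, "vram_mb": 17634, "tps": 45.0}


def summarize(base: dict, new: dict) -> dict:
    """Return the context factor, VRAM delta (GB and %), and TPS change (%)."""
    vram_delta_mb = new["vram_mb"] - base["vram_mb"]
    return {
        "context_factor": round(new["context"] / base["context"], 1),
        "vram_delta_gb": round(vram_delta_mb / 1024, 2),
        "vram_delta_pct": round(100 * vram_delta_mb / base["vram_mb"], 1),
        "tps_delta_pct": round(100 * (new["tps"] - base["tps"]) / base["tps"], 1),
    }


print(summarize(baseline, turbo3))
```

With these inputs the sketch yields a 12.2× context factor, +1.84 GB (+12.0%) VRAM and −8.5% tokens per second.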
|
| 37 |
+
|
| 38 |
+
---
|
| 39 |
+
|
| 40 |
+
## 🚀 Quick Start
|
| 41 |
+
|
| 42 |
+
### 1. Build the Docker Image (~20 minutes)
|
| 43 |
+
|
| 44 |
+
```bash
|
| 45 |
+
docker build -t turboquant:feature .
|
| 46 |
+
|
| 47 |
+
# Verify TurboQuant is compiled in:
|
| 48 |
+
docker run --rm turboquant:feature llama-server -h 2>&1 | grep -A3 "cache-type-k"
|
| 49 |
+
# Must show: turbo2, turbo3, turbo4
|
| 50 |
+
```
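The verification step can also be scripted. This is a small sketch, assuming only that the three cache type names appear somewhere in the `llama-server -h` output; the `missing_cache_types` helper and the sample help line are hypothetical, and the real help text may be worded differently.

```python
REQUIRED_CACHE_TYPES = ("turbo2", "turbo3", "turbo4")


def missing_cache_types(help_text: str) -> list[str]:
    """Return the TurboQuant cache types that do not appear in the help text."""
    return [name for name in REQUIRED_CACHE_TYPES if name not in help_text]


# Hypothetical help excerpt for illustration only.
sample = "--cache-type-k TYPE  allowed values: f16, q8_0, q4_0, turbo2, turbo3, turbo4"
print(missing_cache_types(sample))                       # nothing missing
print(missing_cache_types("allowed values: f16, q8_0"))  # all three missing
```

In practice the help text would come from `docker run --rm turboquant:feature llama-server -h 2>&1`, for example read from standard input; a non-empty result suggests the image was built from the wrong branch or repository.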
|
| 51 |
+
|
| 52 |
+
### 2. Download a Model
|
| 53 |
+
|
| 54 |
+
```bash
|
| 55 |
+
# Set your HuggingFace token
|
| 56 |
+
export HF_TOKEN=hf_your_token_here
|
| 57 |
+
|
| 58 |
+
bash scripts/download-model.sh
|
| 59 |
+
```
|
| 60 |
+
|
| 61 |
+
### 3. Run Baseline (f16, 8K context)
|
| 62 |
+
|
| 63 |
+
```bash
|
| 64 |
+
bash scripts/run-baseline.sh
|
| 65 |
+
# Server starts on port 8180
|
| 66 |
+
```
|
| 67 |
+
|
| 68 |
+
### 4. Run TurboQuant (turbo3, 100K context)
|
| 69 |
+
|
| 70 |
+
```bash
|
| 71 |
+
bash scripts/run-turbo.sh
|
| 72 |
+
# Server starts on port 8182
|
| 73 |
+
```
|
| 74 |
+
|
| 75 |
+
### 5. Test It
|
| 76 |
+
|
| 77 |
+
```bash
|
| 78 |
+
# Check available context
|
| 79 |
+
curl -s http://localhost:8180/v1/models | jq '.data[0].context_length'
|
| 80 |
+
# Baseline: 8192
|
| 81 |
+
|
| 82 |
+
curl -s http://localhost:8182/v1/models | jq '.data[0].context_length'
|
| 83 |
+
# TurboQuant: 131072 (model max, allocated to 100000)
|
| 84 |
+
|
| 85 |
+
# Generate tokens (measures TPS in response)
|
| 86 |
+
curl http://localhost:8182/v1/chat/completions \
|
| 87 |
+
-H "Content-Type: application/json" \
|
| 88 |
+
-d '{"model":"local","messages":[{"role":"user","content":"Count from 1 to 200"}],"max_tokens":500}'
|
| 89 |
+
```
|
| 90 |
+
|
| 91 |
+
---
|
| 92 |
+
|
| 93 |
+
## ⚙️ How It Works
|
| 94 |
+
|
| 95 |
+
### The KV-Cache Problem
|
| 96 |
+
|
| 97 |
+
When an LLM runs, it caches Key-Value pairs for every token in the context window.
|
| 98 |
+
This cache grows **linearly** with context length:
|
| 99 |
+
|
| 100 |
+
```
|
| 101 |
+
Mistral-Small-3.2 24B on RTX 3090 (24 GB total, ~14.4 GB for model weights):
|
| 102 |
+
|
| 103 |
+
Context    KV-Cache            Available after model   Fits?
|
| 104 |
+
8,192      ~1 GB (f16)         9.6 GB                  ✅
|
| 105 |
+
32,000     ~4 GB (f16)         9.6 GB                  ✅
|
| 106 |
+
100,000    ~12 GB (f16)        9.6 GB                  ❌ OOM without TurboQuant
|
| 107 |
+
100,000    ~2.8 GB (turbo3)    9.6 GB                  ✅
|
| 108 |
+
```
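The f16 rows of the table follow from linear scaling. Here is a minimal sketch, assuming the ~0.98 GB f16 KV-cache estimate at 8,192 tokens and the ~14.4 GB weight estimate from the results file; the function names are illustrative, and the fit check ignores compute buffers and other overhead.

```python
GPU_VRAM_GB = 24.0        # RTX 3090
MODEL_WEIGHTS_GB = 14.4   # estimated VRAM used by the model weights
F16_KV_AT_8K_GB = 0.98    # estimated f16 KV-cache at 8,192 tokens
REFERENCE_CONTEXT = 8192


def f16_kv_cache_gb(context_tokens: int) -> float:
    """Linearly scale the 8K f16 KV-cache estimate to another context length."""
    return F16_KV_AT_8K_GB * context_tokens / REFERENCE_CONTEXT


def fits(kv_cache_gb: float) -> bool:
    """True if the cache fits in the VRAM left after loading the weights."""
    return kv_cache_gb <= GPU_VRAM_GB - MODEL_WEIGHTS_GB


for ctx in (8192, 32000, 100000):
    kv = f16_kv_cache_gb(ctx)
    print(f"{ctx:>7} tokens: ~{kv:.1f} GB f16 KV-cache, fits={fits(kv)}")

# The turbo3 cache measured at 100K context (~2.82 GB) fits comfortably.
print(f"turbo3 at 100K: fits={fits(2.82)}")
```

The linear extrapolation gives roughly 12 GB for f16 at 100K tokens, which exceeds the ~9.6 GB left after the weights are loaded.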
|
| 109 |
+
|
| 110 |
+
### What TurboQuant Does
|
| 111 |
+
|
| 112 |
+
TurboQuant compresses the KV-cache from 16-bit floats to 2–4-bit integers.
|
| 113 |
+
**It does NOT compress the model weights** — only the runtime cache.
|
| 114 |
+
|
| 115 |
+
```
|
| 116 |
+
f16 KV-Cache → turbo3 KV-Cache
|
| 117 |
+
16 bits → 3 bits (nominal) ≈ 4.3× smaller in our estimate (~12 GB → ~2.8 GB at 100K)
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
The model reads the quantized cache and generates text normally.
|
| 121 |
+
Quality loss: <1% perplexity increase at turbo3 (per paper).
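The 4.3× figure is lower than the nominal 16/3 ratio. The sketch below works through the arithmetic using the two estimates from the results file (`f16_100k_estimate_gb` and `turbo3_100k_gb`). The reading that the gap comes from per-block metadata such as scales is our assumption and is not confirmed by the source.

```python
F16_BITS = 16
TURBO3_NOMINAL_BITS = 3
F16_100K_ESTIMATE_GB = 12.25   # extrapolated f16 KV-cache at 100K context
TURBO3_100K_GB = 2.82          # estimated turbo3 KV-cache at 100K context

nominal_ratio = F16_BITS / TURBO3_NOMINAL_BITS
estimated_ratio = F16_100K_ESTIMATE_GB / TURBO3_100K_GB
effective_bits = F16_BITS / estimated_ratio  # implied storage cost per cached value

print(f"nominal ratio:   {nominal_ratio:.2f}x")
print(f"estimated ratio: {estimated_ratio:.2f}x")
print(f"effective bits per value: {effective_bits:.2f}")
```

Under these numbers the cache behaves as if each value cost about 3.7 bits rather than exactly 3.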
|
| 122 |
+
|
| 123 |
+
### Two Repos — Critical Distinction
|
| 124 |
+
|
| 125 |
+
There are two TurboQuant repositories with confusing names:
|
| 126 |
+
|
| 127 |
+
| Repo | What it is | When to use |
|
| 128 |
+
|------|-----------|-------------|
|
| 129 |
+
| `TheTom/turboquant_plus` | Python library for research | HuggingFace models, Python API |
|
| 130 |
+
| `TheTom/llama-cpp-turboquant` | llama.cpp fork | **This guide — llama-server** |
|
| 131 |
+
|
| 132 |
+
**This guide uses `TheTom/llama-cpp-turboquant`, branch `feature/turboquant-kv-cache`.**
|
| 133 |
+
|
| 134 |
+
---
|
| 135 |
+
|
| 136 |
+
## 🐛 Errors & Fixes
|
| 137 |
+
|
| 138 |
+
Every error we hit during setup, documented so you don't repeat them:
|
| 139 |
+
|
| 140 |
+
### E1: Wrong Repository
|
| 141 |
+
|
| 142 |
+
**Symptom:** No `turbo2`/`turbo3`/`turbo4` options after building.
|
| 143 |
+
**Cause:** Built from `TheTom/turboquant_plus` (Python library) instead of `TheTom/llama-cpp-turboquant`.
|
| 144 |
+
**Fix:** Use the correct repo. See Dockerfile.
|
| 145 |
+
|
| 146 |
+
### E2: Wrong cmake Flag
|
| 147 |
+
|
| 148 |
+
**Symptom:** CUDA not used during inference, slow CPU fallback.
|
| 149 |
+
**Cause:** Old flag `-DLLAMA_CUBLAS=ON` was renamed in llama.cpp post-GGML-refactor.
|
| 150 |
+
**Fix:**
|
| 151 |
+
```dockerfile
|
| 152 |
+
# WRONG (old, silently ignored):
|
| 153 |
+
cmake -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON
|
| 154 |
+
|
| 155 |
+
# CORRECT:
|
| 156 |
+
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
|
| 157 |
+
```
|
| 158 |
+
|
| 159 |
+
### E3: libcuda.so.1 Not Found at Build Time
|
| 160 |
+
|
| 161 |
+
**Symptom:** Build fails with `cannot find -lcuda` or linker error for `libcuda.so.1`.
|
| 162 |
+
**Cause:** CUDA devel images have a stub `libcuda.so` but not `libcuda.so.1` (the runtime driver is injected at container start, not build time).
|
| 163 |
+
**Fix:** Add symlink before cmake:
|
| 164 |
+
```dockerfile
|
| 165 |
+
RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
|
| 166 |
+
/usr/local/cuda/lib64/stubs/libcuda.so.1 \
|
| 167 |
+
&& echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf \
|
| 168 |
+
&& ldconfig
|
| 169 |
+
```
|
| 170 |
+
|
| 171 |
+
### E4: Wrong Branch
|
| 172 |
+
|
| 173 |
+
**Symptom:** `Unsupported cache type: turbo3` at runtime despite clean build.
|
| 174 |
+
**Cause:** Cloning the default `master` branch of `llama-cpp-turboquant` — which is a standard llama.cpp fork **without** TurboQuant. The implementation is on `feature/turboquant-kv-cache`.
|
| 175 |
+
**Fix:**
|
| 176 |
+
```bash
|
| 177 |
+
git clone https://github.com/TheTom/llama-cpp-turboquant.git \
|
| 178 |
+
--branch feature/turboquant-kv-cache --depth=1
|
| 179 |
+
```
|
| 180 |
+
Always verify before building:
|
| 181 |
+
```bash
|
| 182 |
+
curl -s "https://api.github.com/repos/TheTom/llama-cpp-turboquant/branches" \
|
| 183 |
+
| python3 -c "import sys,json; [print(b['name']) for b in json.load(sys.stdin)]"
|
| 184 |
+
```
|
| 185 |
+
|
| 186 |
+
### E5: Wrong HuggingFace Repo Name
|
| 187 |
+
|
| 188 |
+
**Symptom:** 404 or 401 when downloading model.
|
| 189 |
+
**Cause:** Model repo names change. Don't rely on memory or cached context.
|
| 190 |
+
**Fix:** Always query HF Search API before downloading:
|
| 191 |
+
```bash
|
| 192 |
+
curl -s -H "Authorization: Bearer $HF_TOKEN" \
|
| 193 |
+
"https://huggingface.co/api/models?search=bartowski+mistral+small+3.2&limit=5" \
|
| 194 |
+
| python3 -c "import sys,json; [print(m['modelId']) for m in json.load(sys.stdin)]"
|
| 195 |
+
```
|
| 196 |
+
|
| 197 |
+
---
|
| 198 |
+
|
| 199 |
+
## 🔬 Reproduce Our Results
|
| 200 |
+
|
| 201 |
+
```bash
|
| 202 |
+
# 1. Build
|
| 203 |
+
docker build -t turboquant:feature .
|
| 204 |
+
|
| 205 |
+
# 2. Download model (~14 GB)
|
| 206 |
+
export HF_TOKEN=hf_your_token
|
| 207 |
+
bash scripts/download-model.sh
|
| 208 |
+
|
| 209 |
+
# 3. Baseline measurement
|
| 210 |
+
bash scripts/run-baseline.sh &
|
| 211 |
+
sleep 45 # wait for server startup
|
| 212 |
+
curl -s http://localhost:8180/v1/chat/completions \
|
| 213 |
+
-H "Content-Type: application/json" \
|
| 214 |
+
-d '{"model":"local","messages":[{"role":"user","content":"Count from 1 to 200, one per line."}],"max_tokens":500}' \
|
| 215 |
+
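| python3 -c "import sys,json; d=json.load(sys.stdin); t=d.get('timings',{}); print(f'TPS: {t.get(\"predicted_per_second\", float(\"nan\")):.1f}')"
# Assumes this build returns llama-server's `timings` object; the OpenAI-style `usage` object carries token counts only, so it cannot be used to derive TPS.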
|
| 216 |
+
nvidia-smi --query-gpu=memory.used --format=csv,noheader
|
| 217 |
+
|
| 218 |
+
# 4. Turbo3 measurement
|
| 219 |
+
docker stop turboquant-baseline
|
| 220 |
+
bash scripts/run-turbo.sh &
|
| 221 |
+
sleep 90 # 100K context allocation takes longer
|
| 222 |
+
# repeat curl + nvidia-smi on port 8182
|
| 223 |
+
```
|
| 224 |
+
|
| 225 |
+
Expected results matching our run: see [`results/turboquant-rtx3090-2026-04-01.json`](results/turboquant-rtx3090-2026-04-01.json)
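The fixed `sleep 45` and `sleep 90` waits in the script above can be replaced by polling the server until it is ready. This is a minimal sketch, assuming the fork keeps upstream llama-server's `GET /health` endpoint (HTTP 200 once the model is loaded); the `wait_for_server` helper is illustrative and not part of this repository.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, or a non-200 status while the model is still loading.
            pass
        time.sleep(interval_s)
    return False
```

For example, `wait_for_server("http://localhost:8182", timeout_s=180)` would block until the turbo3 server answers, then return `True`, so the measurement starts as soon as the 100K context has been allocated.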
|
| 226 |
+
|
| 227 |
+
---
|
| 228 |
+
|
| 229 |
+
## Hardware Requirements
|
| 230 |
+
|
| 231 |
+
| | Minimum | Our Setup |
|
| 232 |
+
|--|---------|----------|
|
| 233 |
+
| GPU VRAM | 16 GB | RTX 3090 24 GB |
|
| 234 |
+
| System RAM | 16 GB | 32 GB |
|
| 235 |
+
| Disk | 30 GB | SSD |
|
| 236 |
+
| CUDA | 12.x | 12.6.3 |
|
| 237 |
+
| OS | Linux / Windows + Docker | Windows + Docker Desktop |
|
| 238 |
+
|
| 239 |
+
> **Note on Windows:** Docker Desktop works fine. Avoid `/tmp/` paths — use named Docker volumes for model storage.
|
| 240 |
+
|
| 241 |
+
---
|
| 242 |
+
|
| 243 |
+
## Model Compatibility
|
| 244 |
+
|
| 245 |
+
Tested with **Mistral-Small-3.2-24B Q4_K_M** (14 GB).
|
| 246 |
+
Should work with any GGUF model that fits the VRAM budget after KV-cache allocation.
|
| 247 |
+
|
| 248 |
+
| Model | Size | VRAM (model) | Max ctx (turbo3) |
|
| 249 |
+
|-------|------|-------------|-----------------|
|
| 250 |
+
| Mistral-Small-3.2 24B Q4_K_M | 14 GB | 14.4 GB | ~100K on 24 GB GPU |
|
| 251 |
+
| Llama-3.1 8B Q4_K_M | 4.7 GB | 5.1 GB | ~200K on 16 GB GPU |
|
| 252 |
+
| Qwen2.5 14B Q4_K_M | 8.5 GB | 8.8 GB | ~150K on 16 GB GPU |
|
| 253 |
+
|
| 254 |
+
*Estimates. Actual values depend on architecture and batch size.*
|
| 255 |
+
|
| 256 |
+
---
|
| 257 |
+
|
| 258 |
+
## 📄 License
|
| 259 |
+
|
| 260 |
+
Content and scripts: [CC BY 4.0](LICENSE); the Dockerfile and shell scripts are additionally available under the MIT License
|
| 261 |
+
Based on [TurboQuant (arXiv:2504.19874)](https://arxiv.org/abs/2504.19874) by Zandieh et al. (ICLR 2026)
|
| 262 |
+
llama.cpp fork: [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
|
| 263 |
+
|
| 264 |
+
---
|
| 265 |
+
|
| 266 |
+
---
|
| 267 |
+
|
| 268 |
+
## 🇩🇪 Deutsch
|
| 269 |
+
|
| 270 |
+
### TurboQuant on Consumer Hardware: A Practical Guide
|
| 271 |
+
|
| 272 |
+
This repository documents our experience running TurboQuant (ICLR 2026)
|
| 273 |
+
on an RTX 3090 in a homelab setup. To our knowledge, we are the first European team
|
| 274 |
+
to publish practical documentation of this method on consumer hardware.
|
| 275 |
+
|
| 276 |
+
### The Result
|
| 277 |
+
|
| 278 |
+
With TurboQuant turbo3 (3-bit KV-cache) on an RTX 3090 (24 GB) we get:
|
| 279 |
+
|
| 280 |
+
- **12× more context** (8,192 → 100,000 tokens)
|
| 281 |
+
- only **+1.8 GB VRAM** additional usage
|
| 282 |
+
- only an **8.5% drop in speed**
|
| 283 |
+
- **the same model weights**: only the runtime cache is compressed
|
| 284 |
+
|
| 285 |
+
### Why It Matters
|
| 286 |
+
|
| 287 |
+
A larger context means longer documents, more conversation history, better RAG,
|
| 288 |
+
and analysis of entire codebases, all on a single consumer GPU.
|
| 289 |
+
|
| 290 |
+
### Error Log (5 Mistakes We Made)
|
| 291 |
+
|
| 292 |
+
All 5 errors from our setup are documented under [Errors & Fixes](#-errors--fixes).
|
| 293 |
+
The most common one: the wrong branch (`master` instead of `feature/turboquant-kv-cache`).
|
| 294 |
+
|
| 295 |
+
### Quick Start (Condensed)
|
| 296 |
+
|
| 297 |
+
```bash
|
| 298 |
+
# Build the image (~20 minutes)
|
| 299 |
+
docker build -t turboquant:feature .
|
| 300 |
+
|
| 301 |
+
# Download the model (14 GB)
|
| 302 |
+
export HF_TOKEN=your_token
|
| 303 |
+
bash scripts/download-model.sh
|
| 304 |
+
|
| 305 |
+
# Start the baseline (f16, 8K context)
|
| 306 |
+
bash scripts/run-baseline.sh
|
| 307 |
+
|
| 308 |
+
# Start TurboQuant (turbo3, 100K context)
|
| 309 |
+
bash scripts/run-turbo.sh
|
| 310 |
+
```
|
| 311 |
+
|
| 312 |
+
Full white paper: [`WHITEPAPER.de.md`](WHITEPAPER.de.md)
|
| 313 |
+
|
| 314 |
+
---
|
| 315 |
+
|
| 316 |
+
*AI Engineering Lab · April 2026 · [ai-engineering.at](https://ai-engineering.at)*
|
WHITEPAPER.de.md
ADDED
|
@@ -0,0 +1,186 @@
|
| 1 |
+
# TurboQuant on Consumer Hardware
|
| 2 |
+
## 100,000-Token Context on an RTX 3090, Step by Step
|
| 3 |
+
|
| 4 |
+
> AI Engineering Lab | April 2026
|
| 5 |
+
> Tested on: NVIDIA RTX 3090 (24 GB VRAM), Windows + Docker Desktop
|
| 6 |
+
> Model: Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## Executive Summary
|
| 11 |
+
|
| 12 |
+
TurboQuant is a quantization method, applied here to the KV-cache, from the paper
|
| 13 |
+
"TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (ICLR 2026, arXiv:2504.19874).
|
| 14 |
+
|
| 15 |
+
**The result in one line:**
|
| 16 |
+
With TurboQuant turbo3 (3-bit KV-cache) we reach a context of
|
| 17 |
+
**100,000 tokens** on an RTX 3090, with only an 8.5% loss in speed
|
| 18 |
+
and no change to the model weights.
|
| 19 |
+
|
| 20 |
+
| | Baseline (f16) | TurboQuant turbo3 | Delta |
|
| 21 |
+
|--|:---:|:---:|:---:|
|
| 22 |
+
| **Context** | 8,192 | **100,000** | **12.2×** |
|
| 23 |
+
| **VRAM** | 15.4 GB | 17.2 GB | +1.8 GB |
|
| 24 |
+
| **Tokens/s** | 49.2 | 45.0 | −8.5% |
|
| 25 |
+
| **KV-Cache** | ~1 GB (f16, 8K context) | ~2.8 GB (3-bit, 100K context) | 4.3× smaller than f16 at 100K (~12 GB) |
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
## 1. The Problem: The KV-Cache Eats VRAM
|
| 30 |
+
|
| 31 |
+
When an LLM runs, it computes key-value (KV) pairs for every token.
|
| 32 |
+
These are cached in VRAM so that later tokens can attend to them.
|
| 33 |
+
|
| 34 |
+
The problem: the KV-cache **grows linearly with context length**.
|
| 35 |
+
|
| 36 |
+
For Mistral-Small-3.2 24B on an RTX 3090 (24 GB, of which ~14.4 GB goes to model weights):
|
| 37 |
+
|
| 38 |
+
```
|
| 39 |
+
Context    KV-Cache            Available after model   Fits?
|
| 40 |
+
8,192      ~1 GB (f16)         9.6 GB                  ✅
|
| 41 |
+
32,000     ~4 GB (f16)         9.6 GB                  ✅
|
| 42 |
+
100,000    ~12 GB (f16)        9.6 GB                  ❌ OOM
|
| 43 |
+
100,000    ~2.8 GB (turbo3)    9.6 GB                  ✅
|
| 44 |
+
```
|
| 45 |
+
|
| 46 |
+
Without optimization, 100K context is simply not possible with this model on a 24 GB GPU.
|
| 47 |
+
|
| 48 |
+
---
|
| 49 |
+
|
| 50 |
+
## 2. What TurboQuant Does
|
| 51 |
+
|
| 52 |
+
TurboQuant compresses the KV-cache from 16-bit to 2–4-bit.
|
| 53 |
+
**NOT the model weights**, only the runtime cache.
|
| 54 |
+
|
| 55 |
+
```
|
| 56 |
+
f16 KV-cache (16-bit) → turbo3 KV-cache (3-bit, nominal) ≈ 4.3× less memory in our estimate
|
| 57 |
+
```
|
| 58 |
+
|
| 59 |
+
The model reads the quantized cache and generates text as usual.
|
| 60 |
+
Quality loss: per the paper, <1% perplexity increase at turbo3.
|
| 61 |
+
|
| 62 |
+
---
|
| 63 |
+
|
| 64 |
+
## 3. The Ecosystem: Two Repos, One Common Mistake
|
| 65 |
+
|
| 66 |
+
There are two TurboQuant repositories with confusingly similar names:
|
| 67 |
+
|
| 68 |
+
| Repo | What it is | When to use |
|
| 69 |
+
|------|-----------|--------------|
|
| 70 |
+
| `TheTom/turboquant_plus` | Python library | HuggingFace models, research |
|
| 71 |
+
| `TheTom/llama-cpp-turboquant` | llama.cpp fork | **This guide: llama-server** |
|
| 72 |
+
|
| 73 |
+
**This guide uses `TheTom/llama-cpp-turboquant`, branch `feature/turboquant-kv-cache`.**
|
| 74 |
+
|
| 75 |
+
Critical: the default branch `master` is a regular llama.cpp **without TurboQuant**.
|
| 76 |
+
The implementation lives on `feature/turboquant-kv-cache`.
|
| 77 |
+
|
| 78 |
+
---
|
| 79 |
+
|
| 80 |
+
## 4. Setup: Step by Step
|
| 81 |
+
|
| 82 |
+
### 4.1 Verify the Branch (Before Building!)
|
| 83 |
+
|
| 84 |
+
```bash
|
| 85 |
+
curl -s "https://api.github.com/repos/TheTom/llama-cpp-turboquant/branches" \
|
| 86 |
+
| python3 -c "import sys,json; [print(b['name']) for b in json.load(sys.stdin)]"
|
| 87 |
+
# Expected: feature/turboquant-kv-cache, master
|
| 88 |
+
```
|
| 89 |
+
|
| 90 |
+
### 4.2 Build the Docker Image (~20 Minutes)
|
| 91 |
+
|
| 92 |
+
```bash
|
| 93 |
+
docker build -t turboquant:feature .
|
| 94 |
+
|
| 95 |
+
# Verify: turbo2, turbo3, turbo4 must appear
|
| 96 |
+
docker run --rm turboquant:feature llama-server -h 2>&1 | grep -A3 "cache-type-k"
|
| 97 |
+
```
|
| 98 |
+
|
| 99 |
+
### 4.3 Download the Model (~14 GB)
|
| 100 |
+
|
| 101 |
+
```bash
|
| 102 |
+
export HF_TOKEN=hf_your_token
|
| 103 |
+
bash scripts/download-model.sh
|
| 104 |
+
```
|
| 105 |
+
|
| 106 |
+
### 4.4 Start the Baseline (Reference Value)
|
| 107 |
+
|
| 108 |
+
```bash
|
| 109 |
+
bash scripts/run-baseline.sh
|
| 110 |
+
# → Port 8180, f16 KV-cache, 8,192 context
|
| 111 |
+
```
|
| 112 |
+
|
| 113 |
+
### 4.5 Start TurboQuant
|
| 114 |
+
|
| 115 |
+
```bash
|
| 116 |
+
bash scripts/run-turbo.sh
|
| 117 |
+
# → Port 8182, turbo3 KV-cache, 100,000 context
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
---
|
| 121 |
+
|
| 122 |
+
## 5. Error Log
|
| 123 |
+
|
| 124 |
+
All 5 errors from our setup, so you don't repeat them:
|
| 125 |
+
|
| 126 |
+
### Error 1: Wrong Repository
|
| 127 |
+
**Symptom:** No `turbo2`/`turbo3`/`turbo4` after the build.
|
| 128 |
+
**Cause:** Built `TheTom/turboquant_plus` (Python library) instead of `TheTom/llama-cpp-turboquant`.
|
| 129 |
+
**Fix:** Use the correct repo; see the Dockerfile.
|
| 130 |
+
|
| 131 |
+
### Error 2: Wrong cmake Flag
|
| 132 |
+
**Symptom:** No CUDA, CPU fallback.
|
| 133 |
+
**Cause:** `-DLLAMA_CUBLAS=ON` was renamed.
|
| 134 |
+
**Fix:** `-DGGML_CUDA=ON` (modern llama.cpp, post-GGML refactor).
|
| 135 |
+
|
| 136 |
+
### Error 3: libcuda.so.1 Missing at Build Time
|
| 137 |
+
**Symptom:** Linker error during the cmake build.
|
| 138 |
+
**Cause:** There is no NVIDIA driver at Docker build time, only a stub without the `.1` suffix.
|
| 139 |
+
**Fix:** Create the symlink BEFORE running cmake (see the Dockerfile).
|
| 140 |
+
|
| 141 |
+
### Error 4: Wrong Branch
|
| 142 |
+
**Symptom:** `Unsupported cache type: turbo3` at runtime.
|
| 143 |
+
**Cause:** Cloned the master branch (standard llama.cpp, no TurboQuant).
|
| 144 |
+
**Fix:** `git clone --branch feature/turboquant-kv-cache`
|
| 145 |
+
|
| 146 |
+
### Error 5: Wrong HuggingFace Repo Name
|
| 147 |
+
**Symptom:** 404 when downloading the model.
|
| 148 |
+
**Cause:** Repo name reconstructed from memory, and it was wrong.
|
| 149 |
+
**Fix:** Always verify live via the HF Search API (never take the name from context).
|
| 150 |
+
|
| 151 |
+
---
|
| 152 |
+
|
| 153 |
+
## 6. Benchmark Methodology
|
| 154 |
+
|
| 155 |
+
Each measurement:
|
| 156 |
+
1. Measure VRAM after server start (nvidia-smi)
|
| 157 |
+
2. 3× curl to `/v1/chat/completions` with "Count from 1 to 200"
|
| 158 |
+
3. Average over the 3 runs
|
| 159 |
+
4. Stop the container, wait 30 s, start the next measurement
|
| 160 |
+
|
| 161 |
+
TPS calculation: `completion_tokens / (total_duration_ms / 1000)`
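The TPS formula and the three-run average can be sketched in a few lines of Python. The per-run values are the ones recorded in the results JSON. The `tokens_per_second` helper and its example inputs are illustrative, and where the duration comes from (client-side timing or a server-reported field) is left open here.

```python
from statistics import mean


def tokens_per_second(completion_tokens: int, total_duration_ms: float) -> float:
    """TPS as defined above: generated tokens divided by duration in seconds."""
    return completion_tokens / (total_duration_ms / 1000)


# Per-run TPS values from results/turboquant-rtx3090-2026-04-01.json
baseline_runs = [49.6, 48.9, 49.2]
turbo3_runs = [45.7, 44.2, 45.2]

print(f"example: {tokens_per_second(500, 10_000):.1f} TPS for 500 tokens in 10 s")
print(f"baseline avg: {mean(baseline_runs):.1f} TPS")
print(f"turbo3 avg:   {mean(turbo3_runs):.1f} TPS")
```

The averages come out at 49.2 and 45.0 TPS, matching the `tps_avg` fields in the raw data.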
|
| 162 |
+
|
| 163 |
+
---
|
| 164 |
+
|
| 165 |
+
## 7. Production Checklist
|
| 166 |
+
|
| 167 |
+
Before you deploy TurboQuant in production:
|
| 168 |
+
|
| 169 |
+
- [ ] Image built from `feature/turboquant-kv-cache` (NOT master)
|
| 170 |
+
- [ ] Verified: `llama-server -h | grep turbo` shows turbo2, turbo3, turbo4
|
| 171 |
+
- [ ] VRAM budget calculated: model + KV-cache + overhead ≤ GPU VRAM
|
| 172 |
+
- [ ] Port conflicts checked (no other service on the port)
|
| 173 |
+
- [ ] Startup time planned for: 100K context needs ~90 s to start
|
| 174 |
+
- [ ] Quality tested: sample outputs compared between turbo3 and f16
|
| 175 |
+
- [ ] Model download verified via the HF Search API (not from memory)
|
| 176 |
+
|
| 177 |
+
---
|
| 178 |
+
|
| 179 |
+
## 8. Raw Data
|
| 180 |
+
|
| 181 |
+
All raw benchmark data: [`results/turboquant-rtx3090-2026-04-01.json`](results/turboquant-rtx3090-2026-04-01.json)
|
| 182 |
+
|
| 183 |
+
---
|
| 184 |
+
|
| 185 |
+
*AI Engineering Lab · April 2026 · [ai-engineering.at](https://ai-engineering.at)*
|
| 186 |
+
*Based on TurboQuant (arXiv:2504.19874) by Zandieh et al. (ICLR 2026)*
|
results/turboquant-rtx3090-2026-04-01.json
ADDED
|
@@ -0,0 +1,92 @@
|
| 1 |
+
{
|
| 2 |
+
"date": "2026-04-01",
|
| 3 |
+
"hardware": {
|
| 4 |
+
"node": ".90",
|
| 5 |
+
"gpu": "NVIDIA GeForce RTX 3090",
|
| 6 |
+
"vram_total_mb": 24576,
|
| 7 |
+
"vram_total_gb": 24.0,
|
| 8 |
+
"driver": "CUDA 12.6"
|
| 9 |
+
},
|
| 10 |
+
"model": {
|
| 11 |
+
"name": "Mistral-Small-3.2-24B-Instruct-2506",
|
| 12 |
+
"file": "mistralai_Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M.gguf",
|
| 13 |
+
"hf_repo": "bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF",
|
| 14 |
+
"quantization": "Q4_K_M",
|
| 15 |
+
"size_gb": 14.0
|
| 16 |
+
},
|
| 17 |
+
"llama_cpp_build": {
|
| 18 |
+
"repo": "TheTom/llama-cpp-turboquant",
|
| 19 |
+
"branch": "feature/turboquant-kv-cache",
|
| 20 |
+
"commit": "feature/turboquant-kv-cache@depth1",
|
| 21 |
+
"image_baseline": "turboquant-plus:latest",
|
| 22 |
+
"image_turboquant": "turboquant-plus:feature",
|
| 23 |
+
"note_e116": "turboquant-plus:latest was built from master branch WITHOUT turbo cache types. turboquant-plus:feature built from feature/turboquant-kv-cache has turbo2/turbo3/turbo4."
|
| 24 |
+
},
|
| 25 |
+
"baseline": {
|
| 26 |
+
"cache_type_k": "f16",
|
| 27 |
+
"cache_type_v": "f16",
|
| 28 |
+
"max_context": 8192,
|
| 29 |
+
"port": 8180,
|
| 30 |
+
"vram_total_mb": 15748,
|
| 31 |
+
"vram_total_gb": 15.38,
|
| 32 |
+
"vram_model_only_estimate_gb": 14.4,
|
| 33 |
+
"vram_kv_cache_estimate_gb": 0.98,
|
| 34 |
+
"tps_run1": 49.6,
|
| 35 |
+
"tps_run2": 48.9,
|
| 36 |
+
"tps_run3": 49.2,
|
| 37 |
+
"tps_avg": 49.2,
|
| 38 |
+
"toolcall_note": "INCOMPATIBLE: 'Failed to parse input at pos 70: </s>' — Mistral chat template + this llama.cpp version. tool_choice=auto triggers grammar-constrained parsing which fails on Mistral EOS token.",
|
| 39 |
+
"toolcall_score": null,
|
| 40 |
+
"toolcall_max": 10
|
| 41 |
+
},
|
| 42 |
+
"turbo3": {
|
| 43 |
+
"cache_type_k": "turbo3",
|
| 44 |
+
"cache_type_v": "turbo3",
|
| 45 |
+
"max_context": 100000,
|
| 46 |
+
"port": 8182,
|
| 47 |
+
"vram_total_mb": 17634,
|
| 48 |
+
"vram_total_gb": 17.22,
|
| 49 |
+
"vram_model_only_estimate_gb": 14.4,
|
| 50 |
+
"vram_kv_cache_estimate_gb": 2.82,
|
| 51 |
+
"tps_run1": 45.7,
|
| 52 |
+
"tps_run2": 44.2,
|
| 53 |
+
"tps_run3": 45.2,
|
| 54 |
+
"tps_avg": 45.0,
|
| 55 |
+
"toolcall_note": "NOT TESTED (same toolcall issue as baseline — Mistral template incompatibility, not TurboQuant-related)",
|
| 56 |
+
"toolcall_score": null,
|
| 57 |
+
"toolcall_max": 15
|
| 58 |
+
},
|
| 59 |
+
"analysis": {
|
| 60 |
+
"kv_cache_compression": {
|
| 61 |
+
"f16_8k_gb": 0.98,
|
| 62 |
+
"turbo3_100k_gb": 2.82,
|
| 63 |
+
"f16_100k_estimate_gb": 12.25,
|
| 64 |
+
"note": "f16 100K would require ~12.25GB KV-cache (OOM on RTX 3090 after 14.4GB model). turbo3 at 100K uses only 2.82GB.",
|
| 65 |
+
"turbo3_vs_f16_at_100k_ratio": 4.34,
|
| 66 |
+
"context_extension_factor": 12.2,
|
| 67 |
+
"context_extension_x": "8192 → 100000 (12.2x)"
|
| 68 |
+
},
|
| 69 |
+
"vram_delta": {
|
| 70 |
+
"baseline_mb": 15748,
|
| 71 |
+
"turbo3_mb": 17634,
|
| 72 |
+
"delta_mb": 1886,
|
| 73 |
+
"delta_gb": 1.84,
|
| 74 |
+
"note": "Turbo3 at 100K uses only 1.84GB MORE VRAM than baseline at 8K, despite 12x larger context"
|
| 75 |
+
},
|
| 76 |
+
"tps_delta": {
|
| 77 |
+
"baseline_avg": 49.2,
|
| 78 |
+
"turbo3_avg": 45.0,
|
| 79 |
+
"delta_pct": -8.5,
|
| 80 |
+
"note": "8.5% TPS reduction with turbo3 — within expected range (<10%). KV-cache reads are slightly more complex."
|
| 81 |
+
},
|
| 82 |
+
"recommendation": "TurboQuant turbo3 is production-ready for workloads requiring >8K context. At 100K context, it uses only 2.82GB KV-cache vs ~12GB for f16 (4.3x compression). TPS drops from 49.2 to 45.0 (-8.5%), which is acceptable for long-context use cases. For short-context (<8K) high-throughput workloads, f16 is still preferred."
|
| 83 |
+
},
|
| 84 |
+
"errors": ["E111", "E112", "E113", "E114", "E116"],
|
| 85 |
+
"learnings": ["L191", "L192", "L193"],
|
| 86 |
+
"notes": [
|
| 87 |
+
"E116: Default master branch of TheTom/llama-cpp-turboquant does NOT have TurboQuant. Must use --branch feature/turboquant-kv-cache",
|
| 88 |
+
"Port 8181 is occupied by ocr-api-waitress on .90. TurboQuant turbo3 uses port 8182.",
|
| 89 |
+
"ToolCall-15 test not completed: Mistral chat template incompatibility with grammar-constrained tool parsing in this llama.cpp version. Not a TurboQuant issue.",
|
| 90 |
+
"Model used: Mistral-Small-3.2-24B (instead of planned Qwen3.5-27B due to E114 hallucinated repo name)"
|
| 91 |
+
]
|
| 92 |
+
}
|
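The derived figures in the `analysis` block of the results file can be recomputed from the raw measurements. The following is a minimal sketch, not part of the commit: the dictionary layout and the `derive` helper are hypothetical, and only the numeric inputs are taken from the JSON. It assumes the `*_gb` fields are MiB divided by 1024 and that averages are rounded to one decimal place, which is consistent with the reported values.

```python
# Raw measurements copied from the results JSON; the structure below is illustrative.
MIB_PER_GIB = 1024

baseline = {"vram_mb": 15748, "tps_runs": [49.6, 48.9, 49.2], "context": 8192}
turbo3 = {"vram_mb": 17634, "tps_runs": [45.7, 44.2, 45.2], "context": 100000}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def derive(base, turbo):
    """Recompute VRAM, throughput, and context deltas between two runs."""
    base_tps = round(mean(base["tps_runs"]), 1)
    turbo_tps = round(mean(turbo["tps_runs"]), 1)
    delta_mb = turbo["vram_mb"] - base["vram_mb"]
    return {
        "baseline_vram_gb": round(base["vram_mb"] / MIB_PER_GIB, 2),
        "turbo3_vram_gb": round(turbo["vram_mb"] / MIB_PER_GIB, 2),
        "delta_mb": delta_mb,
        "delta_gb": round(delta_mb / MIB_PER_GIB, 2),
        "baseline_tps_avg": base_tps,
        "turbo3_tps_avg": turbo_tps,
        # Relative throughput change of the turbo3 run versus the baseline.
        "tps_delta_pct": round((turbo_tps - base_tps) / base_tps * 100, 1),
        "context_factor": round(turbo["context"] / base["context"], 1),
    }


if __name__ == "__main__":
    for key, value in derive(baseline, turbo3).items():
        print(f"{key}: {value}")
```

Note that the two throughput averages were measured at different context sizes (8192 versus 100000), so the percentage compares the two configurations as run rather than isolating the cost of the cache type alone.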
scripts/download-model.sh
ADDED
@@ -0,0 +1,68 @@
+#!/usr/bin/env bash
+# Download Mistral-Small-3.2-24B-Instruct Q4_K_M GGUF model
+#
+# Usage: export HF_TOKEN=hf_... && bash scripts/download-model.sh
+#
+# Always verify the repo name via HF Search API before downloading.
+# HF repo names change and compressed context can reconstruct them incorrectly.
+
+set -e
+
+MODEL_REPO="bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF"
+MODEL_FILE="mistralai_Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M.gguf"
+VOLUME_NAME="${VOLUME_NAME:-turboquant-models}"
+
+if [ -z "$HF_TOKEN" ]; then
+  echo "ERROR: HF_TOKEN not set."
+  echo "Get a free token at: https://huggingface.co/settings/tokens"
+  echo "Then: export HF_TOKEN=hf_..."
+  exit 1
+fi
+
+echo "=== Verifying repo exists ==="
+HF_CHECK=$(curl -s -o /dev/null -w "%{http_code}" \
+  -H "Authorization: Bearer $HF_TOKEN" \
+  "https://huggingface.co/api/models/${MODEL_REPO}")
+
+if [ "$HF_CHECK" != "200" ]; then
+  echo "ERROR: Repo not found or unauthorized (HTTP $HF_CHECK)"
+  echo "Search for available repos:"
+  curl -s -H "Authorization: Bearer $HF_TOKEN" \
+    "https://huggingface.co/api/models?search=bartowski+mistral+small+3.2&limit=5" \
+    | python3 -c "import sys,json; [print(m['modelId']) for m in json.load(sys.stdin)]"
+  exit 1
+fi
+
+echo "Repo: ${MODEL_REPO} ✓"
+echo "File: ${MODEL_FILE}"
+echo "Volume: ${VOLUME_NAME}"
+echo ""
+
+docker volume create "${VOLUME_NAME}" 2>/dev/null || true
+
+echo "=== Downloading (~14 GB, may take 20-30 min) ==="
+docker run --rm \
+  -v "${VOLUME_NAME}:/models" \
+  -e HF_TOKEN="${HF_TOKEN}" \
+  python:3.11-slim \
+  bash -c "
+pip install -q huggingface_hub && \
+python -c \"
+import os
+from huggingface_hub import hf_hub_download
+path = hf_hub_download(
+    repo_id='${MODEL_REPO}',
+    filename='${MODEL_FILE}',
+    local_dir='/models',
+    # resume_download omitted: deprecated in recent huggingface_hub, which resumes interrupted downloads automatically
+    token=os.environ.get('HF_TOKEN')
+)
+print('Downloaded to:', path)
+print('Size: {:.1f} GB'.format(os.path.getsize(path) / 1e9))
+\"
+"
+
+echo ""
+echo "=== Done ==="
+echo "Model ready at: /models/${MODEL_FILE} (in Docker volume '${VOLUME_NAME}')"
+echo "Run: bash scripts/run-baseline.sh"
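The repo check in `download-model.sh` comes down to two URLs and one JSON field. As a rough illustration of that logic outside the shell, the sketch below builds the same model and search URLs and extracts the `modelId` values the script prints. The helper names are hypothetical and the sample payload is made up for the example; no network request is made.

```python
import json
from urllib.parse import urlencode

HF_API = "https://huggingface.co/api/models"


def repo_url(repo_id):
    """URL whose HTTP status the script checks (200 means the repo is visible)."""
    return f"{HF_API}/{repo_id}"


def search_url(terms, limit=5):
    """Search URL equivalent to the one used in the script's fallback branch."""
    query = urlencode({"search": " ".join(terms), "limit": limit})
    return f"{HF_API}?{query}"


def model_ids(payload):
    """Extract the modelId of every entry in a search response body."""
    return [entry["modelId"] for entry in json.loads(payload)]


if __name__ == "__main__":
    print(repo_url("bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF"))
    print(search_url(["bartowski", "mistral", "small", "3.2"]))
    # Hypothetical response body, shaped like the list the script iterates over.
    sample = '[{"modelId": "example-user/example-model-GGUF"}]'
    print(model_ids(sample))
```

Keeping the candidate list visible on failure is the point of the fallback: a wrong repo name is corrected by reading real search results instead of guessing a second time.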
scripts/run-baseline.sh
ADDED
@@ -0,0 +1,42 @@
+#!/usr/bin/env bash
+# TurboQuant Baseline — f16 KV-Cache, context=8192
+# Reference measurement for comparison with TurboQuant run
+#
+# Usage: bash scripts/run-baseline.sh [model-path] [port]
+# Default model: /models/mistralai_Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M.gguf
+# Default port: 8180
+
+MODEL="${1:-/models/mistralai_Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M.gguf}"
+PORT="${2:-8180}"
+VOLUME="${VOLUME_NAME:-turboquant-models}"
+IMAGE="${IMAGE:-turboquant:feature}"
+
+echo "=== TurboQuant Baseline Run ==="
+echo "Model: $MODEL"
+echo "Cache: f16 (full precision)"
+echo "Context: 8192 tokens"
+echo "Port: $PORT"
+echo ""
+
+# Stop any existing baseline container
+docker rm -f turboquant-baseline 2>/dev/null || true
+
+docker run -d --rm --gpus all \
+  -v "${VOLUME}:/models" \
+  -p "${PORT}:8180" \
+  --name turboquant-baseline \
+  "${IMAGE}" \
+  llama-server \
+  --model "${MODEL}" \
+  --cache-type-k f16 \
+  --cache-type-v f16 \
+  -c 8192 \
+  --host 0.0.0.0 \
+  --port 8180 \
+  -ngl 99
+
+echo ""
+echo "Baseline serving at: http://localhost:${PORT}"
+echo "OpenAI-compatible: http://localhost:${PORT}/v1/chat/completions"
+echo ""
+echo "After startup (~45s), measure VRAM: nvidia-smi --query-gpu=memory.used --format=csv,noheader"
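The VRAM figures in the results file come from the `nvidia-smi` query printed at the end of the script, which emits one line per GPU in the form `15748 MiB`. A small sketch of turning that output into the MB and GB values used in the JSON follows; the function names are hypothetical, and the sample string stands in for real command output.

```python
def parse_memory_used(output):
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv,noheader` output.

    Returns one integer (MiB) per GPU line; raises ValueError on unexpected units.
    """
    values = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        number, unit = line.split()
        if unit != "MiB":
            raise ValueError(f"unexpected unit in line: {line!r}")
        values.append(int(number))
    return values


def mib_to_gib(mib):
    """Convert MiB to GiB, rounded to two decimals as in the results file."""
    return round(mib / 1024, 2)


if __name__ == "__main__":
    sample_output = "15748 MiB\n"  # stand-in for real nvidia-smi output
    for used in parse_memory_used(sample_output):
        print(used, "MiB =", mib_to_gib(used), "GiB")
```

The reading covers everything resident on the GPU, so other processes using the same card would inflate it; measuring on an otherwise idle GPU keeps the baseline and turbo3 numbers comparable.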
scripts/run-turbo.sh
ADDED
@@ -0,0 +1,48 @@
+#!/usr/bin/env bash
+# TurboQuant turbo3 — 3-bit KV-Cache, context=100000
+# 12× more context than baseline, +1.8 GB VRAM only
+#
+# Usage: bash scripts/run-turbo.sh [model-path] [port]
+# Default model: /models/mistralai_Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M.gguf
+# Default port: 8182
+#
+# NOTE: Port 8180 is used by the baseline run. Use a different port here.
+
+MODEL="${1:-/models/mistralai_Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M.gguf}"
+PORT="${2:-8182}"
+VOLUME="${VOLUME_NAME:-turboquant-models}"
+IMAGE="${IMAGE:-turboquant:feature}"
+
+echo "=== TurboQuant turbo3 Run ==="
+echo "Model: $MODEL"
+echo "Cache: turbo3 (3-bit KV quantization)"
+echo "Context: 100,000 tokens"
+echo "Port: $PORT"
+echo ""
+echo "Expected VRAM: ~17.2 GB (+1.8 GB vs baseline)"
+echo "Expected TPS: ~45 (-8.5% vs baseline)"
+echo ""
+
+# Stop any existing turbo container
+docker rm -f turboquant-turbo3 2>/dev/null || true
+
+docker run -d --rm --gpus all \
+  -v "${VOLUME}:/models" \
+  -p "${PORT}:8182" \
+  --name turboquant-turbo3 \
+  "${IMAGE}" \
+  llama-server \
+  --model "${MODEL}" \
+  --cache-type-k turbo3 \
+  --cache-type-v turbo3 \
+  -c 100000 \
+  --host 0.0.0.0 \
+  --port 8182 \
+  -ngl 99
+
+echo ""
+echo "TurboQuant serving at: http://localhost:${PORT}"
+echo "OpenAI-compatible: http://localhost:${PORT}/v1/chat/completions"
+echo ""
+echo "After startup (~90s, 100K context allocation takes longer):"
+echo " VRAM: nvidia-smi --query-gpu=memory.used --format=csv,noheader"
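Why the 100K run needs a quantized cache at all can be seen with a back-of-the-envelope budget check. The sketch below assumes the KV cache grows linearly with context length and ignores compute buffers and other overhead, so it is an estimate only; the constants are the estimates from the results file and the helper names are hypothetical.

```python
GPU_VRAM_GB = 24.0           # RTX 3090 total VRAM
MODEL_GB = 14.4              # model-only estimate from the results file
F16_KV_GB_AT_8K = 0.98       # f16 KV-cache estimate at 8192 tokens
TURBO3_KV_GB_AT_100K = 2.82  # turbo3 KV-cache estimate at 100000 tokens


def scale_kv(kv_gb, from_ctx, to_ctx):
    """Extrapolate KV-cache size, assuming linear growth with context length."""
    return kv_gb * to_ctx / from_ctx


def fits(model_gb, kv_gb, budget_gb=GPU_VRAM_GB):
    """True if model weights plus KV cache stay within the VRAM budget."""
    return model_gb + kv_gb <= budget_gb


if __name__ == "__main__":
    f16_at_100k = scale_kv(F16_KV_GB_AT_8K, 8192, 100000)
    print(f"f16 KV at 100K: ~{f16_at_100k:.2f} GB, fits: {fits(MODEL_GB, f16_at_100k)}")
    print(f"turbo3 KV at 100K: {TURBO3_KV_GB_AT_100K} GB, "
          f"fits: {fits(MODEL_GB, TURBO3_KV_GB_AT_100K)}")
    print(f"compression ratio: ~{f16_at_100k / TURBO3_KV_GB_AT_100K:.2f}x")
```

With exactly 100000 tokens the extrapolated f16 cache comes to about 11.96 GB and the ratio to about 4.24x. The 12.25 GB and 4.34 figures in the results file appear to correspond to a 12.5x scaling factor (0.98 × 12.5, that is 102,400 tokens); either way, model plus f16 cache exceeds the 24 GB budget while model plus turbo3 cache does not.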