# TurboQuant: First Open-Source Implementation
First open-source implementation of [TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate](https://arxiv.org/abs/2504.19874) (Zandieh, Daliri, Hadian, and Mirrokni; Google Research / Google DeepMind / NYU, April 2025).
TurboQuant compresses LLM KV caches **4-7x** at inference time using random rotation + optimal scalar quantization, with **near-zero quality loss**. No training, no calibration data, fully data-oblivious. Drop-in replacement for HuggingFace Transformers cache.
## Key Results
Benchmarked across **5 model families, 6 models (7B to 70B)** on NVIDIA H100 NVL (96GB):
| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity | Saved @8K |
|---|---|---|---|---|---|---|
| **Qwen2.5-7B** | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact | 380 MB |
| **Llama-3.1-8B** | 32L, llama | 8 | 128 | none | exact | 890 MB |
| **Gemma-2-9B** | 42L, gemma2 | 8 | 256 | none | exact | 2,323 MB |
| **Phi-4-14B** | 40L, phi3 | 10 | 128 | none | exact | 1,392 MB |
| **Qwen2.5-32B** | 64L, qwen2 | 8 | 128 | none | exact | 1,791 MB |
| **Llama-3.3-70B** | 80L, llama | 8 | 128 | none | exact | 501 MB (@2K) |
**Prefill logits are bit-identical (0.0 difference)** across all 6 tested models. Output quality is coherent and semantically correct: divergence from uncompressed output is purely greedy-decoding drift, not quality degradation.
### Needle-in-a-Haystack: 100% Recall
Tested on Qwen2.5-7B across 5 context lengths (1K-16K) and 3 needle positions (25%, 50%, 75%):
| | Default Cache | TurboQuant Cache |
|---|---|---|
| **Recall** | **15/15 (100%)** | **15/15 (100%)** |
TurboQuant preserves retrieval quality perfectly, consistent with the paper's reported 0.997 recall.
### Memory Savings Scale with Context
Qwen2.5-32B (4-bit weights) on H100:
| Context | Default KV | TurboQuant KV | Saved |
|---|---|---|---|
| 1K tokens | 19.97 GB | 19.79 GB | 186 MB |
| 4K tokens | 21.23 GB | 20.42 GB | 833 MB |
| 8K tokens | 23.16 GB | 21.41 GB | 1,791 MB |
| 32K tokens | ~27.5 GB | ~21.8 GB | ~5,700 MB (projected) |
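These savings follow from per-vector arithmetic: each head_dim-sized K or V vector shrinks from head_dim x 2 bytes (BF16) to head_dim x 4/8 bytes of packed indices plus a 2-byte norm. A back-of-envelope sketch (the helper below is hypothetical, not part of the package; the table's figures are measured end-to-end GPU memory, which includes allocator and runtime overhead, so they will not match this formula exactly):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, nbits=4):
    """Rough per-token KV-cache footprint (hypothetical helper).

    BF16 baseline: 2 bytes per element. TurboQuant: nbits/8 bytes per
    element plus a 2-byte norm per head_dim-sized vector.
    """
    vecs = layers * 2 * kv_heads                 # K and V vectors per token
    bf16 = vecs * head_dim * 2
    turbo = vecs * (head_dim * nbits // 8 + 2)
    return bf16, turbo

# Qwen2.5-32B: 64 layers, 8 KV heads, head_dim=128
bf16, turbo = kv_bytes_per_token(64, 8, 128)
print(bf16, turbo, round(bf16 / turbo, 2))  # 262144 67584 3.88
```

The 3.88x per-vector ratio is exactly the figure reported in the verification table below; it grows toward the 4x asymptote as head_dim increases, since the 2-byte norm is amortized over more coordinates.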
## Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
# Auto-detect outlier layers, create compressed cache
skip = TurboQuantCache.calibrate_skip_layers(model, tokenizer)
cache = TurboQuantCache(model.config, nbits=4, skip_layers=skip)
# Use exactly like default cache
inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100, past_key_values=cache)
```
## How It Works
TurboQuant implements Algorithm 1 (TurboQuant_mse) from the paper:
1. **Random rotation** (QR decomposition): transforms each KV vector so coordinates follow a known Beta distribution
2. **Optimal scalar quantization** (Lloyd-Max): quantizes each coordinate to 4 bits using precomputed codebook
3. **Bit packing**: stores 128-dim vectors as 64 bytes (uint4) + 2 bytes (norm) = **66 bytes vs 256 bytes BF16**
Theoretical guarantee: MSE distortion ≤ 0.009 at 4-bit, within **2.7x** of the information-theoretic optimum (Shannon lower bound).
Our measured MSE of **0.0093** matches the paper.
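The three steps above can be sketched in a few lines of PyTorch. This is illustrative only: the uniform codebook stands in for the real Lloyd-Max codebook, and the function names are hypothetical rather than the package's actual API.

```python
import torch

def make_rotation(d, seed=0):
    # Random orthogonal rotation via QR decomposition of a Gaussian matrix
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    return q

def quantize_vector(x, rot, codebook):
    # 1) Normalize and rotate: coordinates now follow a known distribution
    norm = x.norm()
    r = (x / norm) @ rot
    # 2) Scalar-quantize each coordinate to its nearest codebook entry
    idx = (r.unsqueeze(-1) - codebook).abs().argmin(dim=-1)
    # 3) Pack two 4-bit indices per byte; keep the norm as a 2-byte float16
    packed = (idx[0::2] | (idx[1::2] << 4)).to(torch.uint8)
    return packed, norm.to(torch.float16)

def dequantize(packed, norm, rot, codebook):
    lo, hi = packed & 0x0F, packed >> 4
    idx = torch.stack([lo, hi], dim=1).flatten().long()
    return codebook[idx] @ rot.T * norm.float()

d = 128
rot = make_rotation(d)
codebook = torch.linspace(-0.35, 0.35, 16)  # placeholder for the Lloyd-Max codebook
x = torch.randn(d)
packed, norm = quantize_vector(x, rot, codebook)
# 64 bytes of packed indices + 2-byte norm = 66 bytes, vs 256 bytes in BF16
x_hat = dequantize(packed, norm, rot, codebook)
```

With the real Lloyd-Max codebook in place of the uniform one, the reconstruction error drops to the distortion figures quoted above.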
## What We Found Beyond the Paper
### Outlier Layer Norms
The paper mentions "splitting channels into outlier and non-outlier sets" without specifying how. We discovered:
- **Qwen2.5-7B**: Layer 0 key norms = 273.8 (16.2x median); layer 27 is also an outlier.
- **Qwen2.5-32B**: Layer 0 = 37.8 (2.35x median). Mild, no skip needed.
- **Llama-3.1-8B**: Max/median ratio = 1.18x. No outliers at all.
- **Gemma-2-9B**: Max/median ratio = 1.19x. No outliers.
- **Phi-4-14B**: Max/median ratio = 1.38x. No outliers.
**Finding**: Smaller Qwen models have severe outlier layers. Larger models and non-Qwen architectures are well-balanced. Our `calibrate_skip_layers()` auto-detects outliers and keeps them in full precision.
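A minimal sketch of this kind of norm-based detection (hypothetical logic; the shipped `calibrate_skip_layers()` takes a model and tokenizer and runs a real forward pass, whereas this version just takes pre-collected key tensors):

```python
import torch

torch.manual_seed(0)

def detect_outlier_layers(per_layer_keys, ratio_threshold=2.0):
    # Flag layers whose mean key norm sits far above the across-layer median
    norms = torch.stack([k.norm(dim=-1).mean() for k in per_layer_keys])
    median = norms.median()
    return [i for i, n in enumerate(norms) if n > ratio_threshold * median]

# Toy stand-in for collected key states: layer 0 has 16x-larger norms,
# mimicking the Qwen2.5-7B pattern described above
keys = [torch.randn(32, 128) for _ in range(4)]
keys[0] = keys[0] * 16
print(detect_outlier_layers(keys))  # [0]
```

Layers flagged this way are passed as `skip_layers` and cached in full precision, which costs a little memory but avoids quantizing the heavy-tailed distributions where the Beta-distribution assumption breaks down.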
### head_dim Compatibility
The paper tested only head_dim=128 (Llama, Mistral). We verified TurboQuant works with **head_dim=256** (Gemma-2): the Lloyd-Max codebook adapts to any dimension, since it is computed from the Beta distribution parameterized by d.
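Since the rotated coordinates' distribution depends only on d, the codebook can be built once per head_dim. Here is a Monte-Carlo sketch of the Lloyd-Max construction (illustrative only; the repo's `codebook.py` may work from the Beta density analytically rather than from samples):

```python
import torch

def lloyd_max_codebook(d=128, levels=16, iters=30, n=100_000, seed=0):
    # One coordinate of a uniformly random unit vector in R^d; its square
    # is Beta(1/2, (d-1)/2)-distributed, so we sample it directly
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, d, generator=g)
    s = x[:, 0] / x.norm(dim=-1)
    # Initialize centroids at evenly spaced quantiles, then run Lloyd
    # iterations: assign each sample to its nearest centroid, recenter
    centroids = torch.quantile(s, torch.linspace(0.03, 0.97, levels))
    for _ in range(iters):
        idx = (s.unsqueeze(1) - centroids).abs().argmin(dim=1)
        for k in range(levels):
            cell = s[idx == k]
            if cell.numel():
                centroids[k] = cell.mean()
    return centroids

cb = lloyd_max_codebook(d=128)
# Estimate distortion on fresh unit-vector coordinates
x = torch.randn(50_000, 128)
s = x[:, 0] / x.norm(dim=-1)
q = cb[(s.unsqueeze(1) - cb).abs().argmin(dim=1)]
dist = float(128 * ((s - q) ** 2).mean())
print(dist)  # should land near the 0.009 distortion bound
```

Nothing in this construction assumes d=128; swapping in d=256 yields the Gemma-2 codebook the same way.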
### Architecture Coverage
| Architecture | Paper Tested | We Tested | Works |
|---|---|---|---|
| Llama | Llama-3.1-8B | Llama-3.1-8B, 3.3-70B | Yes |
| Mistral | Ministral-7B | – | – |
| Qwen | – | Qwen2.5-7B, 32B | Yes (with outlier handling) |
| Gemma | – | Gemma-2-9B | Yes (head_dim=256) |
| Phi | – | Phi-4-14B | Yes |
## Files
```
turboquant/
├── __init__.py            # Public API
├── codebook.py            # Lloyd-Max solver for Beta distribution
├── quantizer.py           # Core TurboQuantizer: rotate → quantize → pack
├── packing.py             # uint4/uint2 bit packing
├── cache.py               # TurboQuantCache for HF Transformers
scripts/
├── verify.py              # Unit tests (MSE bounds, packing, fixed-point)
├── test_cache.py          # Cache API integration tests
├── benchmark_models.py    # Multi-model benchmark suite
├── run_inference.py       # Interactive inference demo
benchmark_results.json     # Raw benchmark data (all benchmarked models)
```
## Verified Against Paper
| Metric | Paper | Ours |
|---|---|---|
| MSE at 4-bit (unit vectors) | ≤ 0.009 | 0.0093 |
| MSE at 2-bit (unit vectors) | ≤ 0.117 | 0.116 |
| Compression ratio (per-vector) | ~4x | 3.88x |
| System compression @8K+ | 4-7x | 7.2x |
| Prefill fidelity | "quality neutral" | exact (0.0 logit diff) |
| Double quantization | fixed point | verified (indices identical) |
## Requirements
- Python 3.10+
- PyTorch 2.7+ (CUDA 12.8 compatible)
- HuggingFace Transformers 5.0+
- scipy (for codebook computation)
- bitsandbytes (optional, for 4-bit model loading)
## Citation
If you use this implementation, please cite the original paper:
```bibtex
@article{zandieh2025turboquant,
title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
journal={arXiv preprint arXiv:2504.19874},
year={2025}
}
```
## License
This implementation is released under MIT License. The TurboQuant algorithm is described in the paper above.