# DeepSeek-V3.2-TurboQuant

KV cache quantization for DeepSeek-V3.2 using TurboQuant compression.
This repository provides TurboQuant-compressed KV cache configurations for deepseek-ai/DeepSeek-V3.2, one of the most capable open-weight large language models available. TurboQuant dramatically reduces KV cache memory during inference while preserving generation quality.
## Overview
| Attribute | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-V3.2 |
| Architecture | Mixture of Experts (MoE) |
| Total parameters | ~671B |
| Compression | TurboQuant KV cache (4-bit default, 2-bit aggressive) |
| License | MIT |
| Task | Text generation |
## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model_id = "deepseek-ai/DeepSeek-V3.2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto",
)

# Apply TurboQuant KV cache compression
cache = TurboQuantCache(
    model,
    bits=4,  # 4-bit default; use bits=2 for aggressive compression
    residual_length=128,
)

inputs = tokenizer("Explain mixture of experts architectures.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    past_key_values=cache,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## What is TurboQuant?
TurboQuant (arXiv: 2504.19874) is a KV cache quantization method that compresses the key-value cache used during autoregressive generation. Rather than quantizing model weights, TurboQuant targets the runtime memory consumed by the KV cache, which grows linearly with sequence length and becomes the dominant memory cost at long contexts.
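To make the idea concrete, the sketch below shows a generic round-to-nearest 4-bit quantizer with a per-channel scale. This is an illustration of KV cache quantization in general, not TurboQuant's actual scheme; see the paper (arXiv: 2504.19874) for the real method.

```python
import numpy as np

def quantize_kv_4bit(kv: np.ndarray):
    """Round-to-nearest 4-bit quantization along the last axis.

    Generic illustration only; TurboQuant's actual quantizer
    (arXiv: 2504.19874) differs in detail.
    """
    # Per-channel scale maps values into the signed 4-bit range [-8, 7].
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(kv / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map 4-bit integers back to approximate float values."""
    return q.astype(np.float32) * scale

# Toy cache slice: (heads, tokens, head_dim)
kv = np.random.randn(2, 16, 64).astype(np.float32)
q, scale = quantize_kv_4bit(kv)
err = np.abs(dequantize_kv(q, scale) - kv).max()
```

Each stored value shrinks from 16 bits to 4, while the per-channel scales add only a small overhead; the reconstruction error is bounded by half a quantization step per channel.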
## Compression Modes
| Mode | KV Cache Bits | Quality Impact | Use Case |
|---|---|---|---|
| Default | 4-bit | Minimal degradation | General-purpose inference |
| Aggressive | 2-bit | Slight degradation | Memory-constrained deployments |
## Why KV Cache Compression Matters for DeepSeek-V3.2
DeepSeek-V3.2 is a massive 671B-parameter MoE model. At long context lengths, the KV cache can consume hundreds of gigabytes of memory. TurboQuant makes it feasible to serve this model with substantially reduced hardware requirements, especially for long-context workloads.
## Memory Estimates
| Configuration | Approximate Size |
|---|---|
| FP16 weights | ~1.3 TB |
| FP8 weights (base) | ~671 GB |
| KV cache (FP16, 128K context) | Scales linearly with sequence length; can reach hundreds of GB |
| KV cache (TurboQuant 4-bit) | ~4x reduction vs FP16 cache |
| KV cache (TurboQuant 2-bit) | ~8x reduction vs FP16 cache |
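The reduction ratios follow directly from the bit widths (16/4 = 4x, 16/2 = 8x). A back-of-the-envelope estimator makes this explicit; the dimensions below are hypothetical placeholders, not DeepSeek-V3.2's actual attention configuration (the model uses Multi-head Latent Attention, which already stores a compressed latent cache rather than full per-head keys and values):

```python
def kv_cache_bytes(num_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: int) -> int:
    """Approximate KV cache size in bytes: keys + values across all layers."""
    total_bits = 2 * num_layers * kv_heads * head_dim * seq_len * bits_per_value
    return total_bits // 8

# Illustrative dimensions only -- not DeepSeek-V3.2's real config.
fp16 = kv_cache_bytes(61, 128, 128, 131_072, 16)
tq4 = kv_cache_bytes(61, 128, 128, 131_072, 4)
tq2 = kv_cache_bytes(61, 128, 128, 131_072, 2)
print(f"FP16: {fp16 / 1e9:.0f} GB, "
      f"4-bit: {fp16 // tq4}x smaller, "
      f"2-bit: {fp16 // tq2}x smaller")
```

Whatever the absolute numbers, the relative savings hold: the quantized cache is 4x (or 8x) smaller than its FP16 counterpart for the same sequence length.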
> **Note:** TurboQuant compresses the KV cache only. Model weights remain in their original precision. For weight quantization, see the MLX variants below.
## See Also
- deepseek-ai/DeepSeek-V3.2 -- Base model
- TurboQuant paper (arXiv: 2504.19874) -- Method details
- majentik/DeepSeek-V3.2-RotorQuant -- Alternative KV cache compression
- majentik/DeepSeek-V3.2-TurboQuant-MLX-2bit -- MLX 2-bit weight quant + TurboQuant
- majentik/DeepSeek-V3.2-TurboQuant-MLX-1bit -- MLX 1-bit weight quant + TurboQuant
## Model Tree

Base model: deepseek-ai/DeepSeek-V3.2-Exp-Base