# Mistral-Small-4-119B-TurboQuant
KV cache quantization for Mistral Small 4 using TurboQuant -- dramatically reduce memory usage at long context lengths while preserving model quality.
This repository provides TurboQuant KV cache quantization support for mistralai/Mistral-Small-4-119B-2603. Model weights are unchanged (FP16); only the KV cache is quantized during inference.
## Model Specs
| Property | Value |
|---|---|
| Base Model | Mistral Small 4 (March 2026) |
| Total Parameters | 119B |
| Active Parameters | 6.5B per token (Sparse MoE) |
| Architecture | Sparse MoE -- 128 experts, 4 active per token |
| Context Length | 256K tokens |
| Modality | Text + Images (multimodal) |
| Capabilities | Thinking / reasoning, tool use, multilingual |
| License | Apache 2.0 |
| Quantization | KV cache only (TurboQuant) |
## What is TurboQuant?
TurboQuant (arXiv: 2504.19874) is a KV cache quantization method that compresses the key-value cache used during autoregressive generation. It supports:
- 4-bit (default) -- minimal quality loss, ~4x KV cache reduction
- 2-bit (aggressive) -- higher compression with modest quality trade-off
Because only the KV cache is quantized (not the model weights), this is complementary to weight quantization and particularly impactful at long context lengths where KV cache dominates memory.
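To make the idea concrete, here is a toy round trip of 4-bit per-token quantization of cached key/value vectors: each vector is stored as int4 codes plus a single scale. This is not the TurboQuant algorithm itself (see arXiv: 2504.19874 for the actual method); it only illustrates the generic quantize/dequantize step that any KV cache quantizer builds on.

```python
import numpy as np

def quantize_4bit(kv):
    """kv: (tokens, head_dim). Returns int4-range codes + per-token scale."""
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 7  # int4 range: -8..7
    scale = np.where(scale == 0, 1.0, scale)            # avoid div-by-zero
    codes = np.clip(np.round(kv / scale), -8, 7).astype(np.int8)
    return codes, scale.astype(np.float16)

def dequantize_4bit(codes, scale):
    return codes.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 64)).astype(np.float32)    # fake K/V slice
codes, scale = quantize_4bit(kv)
err = np.abs(dequantize_4bit(codes, scale) - kv).max()  # small round-trip error
```

In a real cache the int4 codes would be bit-packed two per byte, which is where the ~4x memory saving over FP16 comes from.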
## Memory Estimates
| Component | FP16 Baseline | TurboQuant 4-bit | TurboQuant 2-bit |
|---|---|---|---|
| Model Weights | ~238 GB | ~238 GB | ~238 GB |
| KV Cache (256K ctx) | ~32 GB | ~8 GB | ~4 GB |
| Total | ~270 GB | ~246 GB | ~242 GB |
Note: This is a Sparse MoE model -- only 6.5B parameters are active per token, so inference is fast despite the 119B total parameter count.
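The KV cache rows in the table follow from simple scaling. The per-token figure below is back-derived from the table's ~32 GB FP16 entry at 256K context, not from published architecture details:

```python
CTX = 256 * 1024      # 256K-token context
FP16_KV_GIB = 32      # FP16 KV cache size from the table above

# ~128 KiB of KV cache per token at FP16, back-derived from the table
bytes_per_token_fp16 = FP16_KV_GIB * 1024**3 / CTX

def kv_cache_gib(ctx_tokens, bits):
    """KV cache size in GiB at a given context length and bit width."""
    return bytes_per_token_fp16 * (bits / 16) * ctx_tokens / 1024**3

print(kv_cache_gib(CTX, 16))  # 32.0 GiB -- FP16 baseline
print(kv_cache_gib(CTX, 4))   # 8.0 GiB  -- TurboQuant 4-bit
print(kv_cache_gib(CTX, 2))   # 4.0 GiB  -- TurboQuant 2-bit
```

The same function also shows why the savings matter less at short contexts: at 8K tokens the FP16 cache is only ~1 GB to begin with.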
## Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model_id = "majentik/Mistral-Small-4-119B-TurboQuant"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Enable TurboQuant KV cache (4-bit default)
cache = TurboQuantCache(model, bits=4)

messages = [
    {"role": "user", "content": "Explain sparse mixture-of-experts architectures."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 2-bit Aggressive Mode

```python
cache = TurboQuantCache(model, bits=2)
```
## See Also
- mistralai/Mistral-Small-4-119B-2603 -- Base model
- majentik/Mistral-Small-4-119B-RotorQuant -- RotorQuant KV cache variant
- majentik/Mistral-Small-4-119B-TurboQuant-MLX-4bit -- MLX 4-bit weight-quantized + TurboQuant
- TurboQuant Paper (arXiv: 2504.19874)