| --- |
| base_model: mistralai/Mistral-Small-4-119B-2603 |
| library_name: transformers |
| license: apache-2.0 |
| tags: |
| - rotorquant |
| - kv-cache-quantization |
| - mistral |
| - moe |
| - sparse-moe |
| - multimodal |
| - quantized |
| - 256k-context |
| - thinking |
| pipeline_tag: text-generation |
| --- |
| |
| # Mistral-Small-4-119B-RotorQuant |
|
|
| **KV cache quantization for Mistral Small 4 using RotorQuant** -- 5.3x faster prefill and 28% faster decode, with no measured quality loss (perplexity 6.91 vs. 7.07 for the unquantized baseline; lower is better). |
|
|
| This repository provides RotorQuant KV cache quantization support for [mistralai/Mistral-Small-4-119B-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603). Model weights are unchanged (FP16); only the KV cache is quantized during inference. |
|
|
| ## Model Specs |
|
|
| | Property | Value | |
| |---|---| |
| | Base Model | Mistral Small 4 (March 2026) | |
| | Total Parameters | 119B | |
| | Active Parameters | 6.5B per token (Sparse MoE) | |
| | Architecture | Sparse MoE -- 128 experts, 4 active per token | |
| | Context Length | 256K tokens | |
| | Modality | Text + Images (multimodal) | |
| | Capabilities | Thinking / reasoning, tool use, multilingual | |
| | License | Apache 2.0 | |
| | Quantization | KV cache only (RotorQuant) | |
|
|
| ## What is RotorQuant? |
|
|
| [RotorQuant](https://github.com/scrya-com/rotorquant) is a rotation-based KV cache quantization method that applies learned rotations before quantizing the key-value cache. Key results: |
|
|
| - **5.3x faster prefill** than the unquantized baseline |
| - **28% higher decode throughput** |
| - **Perplexity: 6.91** vs. 7.07 for the unquantized baseline (lower is better; the slight improvement is attributed to outlier suppression) |
| - 3-bit quantization by default, with no perplexity degradation in the measurement above |
|
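The effect of rotating before quantizing can be illustrated with a minimal sketch. It does not use RotorQuant's learned rotations: a fixed, normalized Walsh-Hadamard transform stands in as an example of an orthogonal rotation, and the helper names `hadamard_rotate` and `peak_to_rms` are hypothetical. The sketch only shows that a rotation spreads a single outlier's energy across all coordinates while preserving the vector's norm; it does not measure quantization error, which depends on the data and the quantizer.

```python
import math

def hadamard_rotate(x):
    """Normalized Walsh-Hadamard transform: orthogonal and its own inverse.

    Stand-in for a learned rotation (assumption, not RotorQuant's method).
    """
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h positions apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in y]

def peak_to_rms(x):
    """Ratio of the largest magnitude to the root-mean-square value."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms

# A toy vector with one large outlier at index 3 (illustrative values).
x = [0.1, -0.2, 0.15, 8.0, -0.1, 0.05, 0.2, -0.15]
y = hadamard_rotate(x)        # rotated vector: outlier energy is spread out
x_back = hadamard_rotate(y)   # applying the rotation again undoes it

print(f"peak/RMS before rotation: {peak_to_rms(x):.2f}")
print(f"peak/RMS after rotation:  {peak_to_rms(y):.2f}")
```

A lower peak-to-RMS ratio means a fixed number of quantization levels is no longer stretched to cover one extreme value, which is the usual motivation for rotating before low-bit quantization. Because the rotation is orthogonal, it can be undone exactly after dequantization.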
|
| ## Memory Estimates |
|
|
| | Component | FP16 Baseline | RotorQuant 3-bit | |
| |---|---|---| |
| | Model Weights | ~238 GB | ~238 GB | |
| | KV Cache (256K ctx) | ~32 GB | ~6.5 GB | |
| | **Total** | **~270 GB** | **~244.5 GB** | |
|
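The arithmetic behind the table can be reproduced with a short sketch. All inputs are the figures stated above; attributing the gap between the ideal 3-bit size and the listed ~6.5 GB to per-block metadata (such as scales) is an assumption, not something this card states.

```python
# Inputs taken from the table above (decimal GB).
total_params = 119e9
bytes_per_weight = 2                      # FP16
kv_fp16_gb = 32.0                         # KV cache at 256K context, FP16
kv_rotor_gb = 6.5                         # KV cache at 256K context, RotorQuant 3-bit

weights_gb = total_params * bytes_per_weight / 1e9   # 238.0

# Pure 3-bit storage would be 3/16 of the FP16 size.
ideal_3bit_gb = kv_fp16_gb * 3 / 16                  # 6.0
# Remainder; assumed here to be quantization metadata such as scales.
overhead_gb = kv_rotor_gb - ideal_3bit_gb            # 0.5

total_fp16_gb = weights_gb + kv_fp16_gb              # 270.0
total_rotor_gb = weights_gb + kv_rotor_gb            # 244.5

kv_reduction = kv_fp16_gb / kv_rotor_gb
total_saving = (total_fp16_gb - total_rotor_gb) / total_fp16_gb

print(f"weights: {weights_gb:.1f} GB")
print(f"totals: {total_fp16_gb:.1f} GB -> {total_rotor_gb:.1f} GB")
print(f"KV cache reduction: {kv_reduction:.1f}x")
print(f"total memory saved: {total_saving:.1%}")
```

The KV cache shrinks by roughly 4.9x, but because the unquantized weights dominate, total memory drops by only about 9%.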
|
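| > **Note:** This is a Sparse MoE model -- only 6.5B parameters are active per token, so per-token compute is far lower than the 119B total suggests. All 119B parameters must still be held in memory, which is why the weights dominate the totals above. |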
|
|
| ## Quickstart |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| from turboquant import IsoQuantCache |
| |
| model_id = "majentik/Mistral-Small-4-119B-RotorQuant" |
| |
| # Load the tokenizer and the unmodified model weights |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| model = AutoModelForCausalLM.from_pretrained( |
|     model_id, |
|     torch_dtype="auto", |
|     device_map="auto", |
| ) |
| |
| # Enable RotorQuant KV cache |
| cache = IsoQuantCache(model) |
| |
| messages = [ |
|     {"role": "user", "content": "Explain sparse mixture-of-experts architectures."} |
| ] |
| |
| # add_generation_prompt=True appends the assistant turn marker so the model replies |
| inputs = tokenizer.apply_chat_template( |
|     messages, |
|     add_generation_prompt=True, |
|     return_dict=True, |
|     return_tensors="pt", |
| ).to(model.device) |
| outputs = model.generate( |
|     **inputs, |
|     max_new_tokens=512, |
|     past_key_values=cache, |
| ) |
| print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
| ``` |
|
|
| ## See Also |
|
|
| - [mistralai/Mistral-Small-4-119B-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603) -- Base model |
| - [majentik/Mistral-Small-4-119B-TurboQuant](https://huggingface.co/majentik/Mistral-Small-4-119B-TurboQuant) -- TurboQuant KV cache variant |
| - [majentik/Mistral-Small-4-119B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/Mistral-Small-4-119B-RotorQuant-MLX-4bit) -- MLX 4-bit weight-quantized + RotorQuant |
| - [RotorQuant GitHub](https://github.com/scrya-com/rotorquant) |
|
|