Mistral-Small-4-119B-RotorQuant

KV cache quantization for Mistral Small 4 using RotorQuant -- 5.3x faster prefill, 28% faster decode, with near-lossless quality (perplexity 6.91 vs 7.07 baseline).

This repository provides RotorQuant KV cache quantization support for mistralai/Mistral-Small-4-119B-2603. Model weights are unchanged (FP16); only the KV cache is quantized during inference.

Model Specs

| Property | Value |
|---|---|
| Base Model | Mistral Small 4 (March 2026) |
| Total Parameters | 119B |
| Active Parameters | 6.5B per token (Sparse MoE) |
| Architecture | Sparse MoE -- 128 experts, 4 active per token |
| Context Length | 256K tokens |
| Modality | Text + images (multimodal) |
| Capabilities | Thinking / reasoning, tool use, multilingual |
| License | Apache 2.0 |
| Quantization | KV cache only (RotorQuant) |
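The routing behind the 6.5B active-parameter figure can be sketched in a few lines. The toy NumPy example below shows top-4-of-128 gating; the hidden size, weights, and function names are made up for illustration and are not the model's actual router.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, d = 128, 4, 16  # 128 experts, 4 active; toy hidden size

router_w = rng.normal(size=(d, num_experts))          # router projection
experts = rng.normal(size=(num_experts, d, d)) * 0.1  # toy expert weights

def moe_layer(x):
    logits = x @ router_w              # score every expert for this token
    top = np.argsort(logits)[-top_k:]  # keep only the 4 highest-scoring
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()               # renormalize over the chosen experts
    # Only the selected experts run, so per-token compute scales with
    # top_k (4), not with num_experts (128).
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_layer(rng.normal(size=d))
print(y.shape)  # (16,)
```

Because only 4 of the 128 expert FFNs execute per token, the FLOPs per token track the 6.5B active parameters rather than the 119B total.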

What is RotorQuant?

RotorQuant is a rotation-based KV cache quantization method that applies learned rotations before quantizing the key-value cache. Key results:

  • 5.3x faster prefill compared to unquantized baseline
  • 28% faster decode throughput
  • Perplexity: 6.91 vs 7.07 for unquantized (lower is better -- RotorQuant actually improves quality due to outlier suppression)
  • Default 3-bit quantization with minimal quality loss
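The intuition behind rotation-based quantization can be shown with a toy example. A single outlier channel forces absmax quantization to use a coarse step, wrecking every other channel; an orthogonal rotation spreads the outlier's energy across channels first, so the same 3-bit budget lands much closer to the original values. The sketch below uses a fixed Hadamard rotation as a stand-in for RotorQuant's learned rotations, which are not public; treat it as an illustration of the principle, not the actual algorithm.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal Hadamard matrix (n = power of 2)
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quant_3bit(x):
    # plain symmetric absmax quantization to the 3-bit levels [-4, 3]
    scale = np.abs(x).max() / 3.0
    return np.clip(np.round(x / scale), -4, 3) * scale

# A key vector with one outlier channel, typical of transformer activations
x = np.array([10.0, 1.5, -0.8, 1.2, -1.4, 0.9, -1.1, 1.3])

H = hadamard(8)
plain_mse = np.mean((quant_3bit(x) - x) ** 2)
rot_mse = np.mean((H.T @ quant_3bit(H @ x) - x) ** 2)  # rotate, quantize, undo

print(rot_mse < plain_mse)  # True: the rotation spreads the outlier
```

Because the rotation is orthonormal, undoing it preserves the quantization error's magnitude exactly, so the comparison in the original space is fair.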

Memory Estimates

| Component | FP16 Baseline | RotorQuant 3-bit |
|---|---|---|
| Model Weights | ~238 GB | ~238 GB |
| KV Cache (256K ctx) | ~32 GB | ~6.5 GB |
| Total | ~270 GB | ~244.5 GB |

Note: This is a Sparse MoE model -- only 6.5B parameters are active per token, so inference is fast despite the 119B total parameter count.
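The ~6.5 GB cache figure can be backed out from the FP16 number. The group size (64) and FP16 per-group scales below are assumptions chosen to match the table, not published RotorQuant parameters.

```python
# Back out the table's ~6.5 GB KV-cache figure from the FP16 baseline.
fp16_cache_gb = 32.0                # FP16 KV cache at 256K context (table)
bits = 3                            # RotorQuant default bit width
group_size = 64                     # assumed elements per quantization group

n_elems = fp16_cache_gb * 1e9 / 2   # FP16 stores 2 bytes per element
payload_gb = n_elems * bits / 8 / 1e9       # packed 3-bit values
scales_gb = n_elems / group_size * 2 / 1e9  # one FP16 scale per group
print(payload_gb + scales_gb)       # 6.5
```

The 3-bit payload alone is 6.0 GB (32 GB x 3/16); the remaining ~0.5 GB is quantization metadata under these assumptions.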

Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from rotorquant import RotorQuantCache

model_id = "majentik/Mistral-Small-4-119B-RotorQuant"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Enable RotorQuant KV cache quantization
cache = RotorQuantCache(model)

messages = [
    {"role": "user", "content": "Explain sparse mixture-of-experts architectures."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(
    inputs,
    max_new_tokens=512,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
