# Mistral-Small-4-119B-TurboQuant

KV cache quantization for Mistral Small 4 using TurboQuant -- dramatically reduce memory usage at long context lengths while preserving model quality.

This repository provides TurboQuant KV cache quantization support for `mistralai/Mistral-Small-4-119B-2603`. Model weights are unchanged (FP16); only the KV cache is quantized during inference.

## Model Specs

| Property | Value |
|---|---|
| Base Model | Mistral Small 4 (March 2026) |
| Total Parameters | 119B |
| Active Parameters | 6.5B per token (Sparse MoE) |
| Architecture | Sparse MoE -- 128 experts, 4 active per token |
| Context Length | 256K tokens |
| Modality | Text + Images (multimodal) |
| Capabilities | Thinking / reasoning, tool use, multilingual |
| License | Apache 2.0 |
| Quantization | KV cache only (TurboQuant) |
## What is TurboQuant?

TurboQuant (arXiv:2504.19874) is a KV cache quantization method that compresses the key-value cache used during autoregressive generation. It supports two bit-widths:

  • 4-bit (default) -- minimal quality loss, ~4x KV cache reduction
  • 2-bit (aggressive) -- higher compression with modest quality trade-off

Because only the KV cache is quantized (not the model weights), this is complementary to weight quantization and particularly impactful at long context lengths where KV cache dominates memory.
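Because KV cache size grows linearly with context length, the savings can be estimated directly from the cache's dimensions. The sketch below uses illustrative layer and head counts (not the actual Mistral Small 4 configuration), so the absolute byte counts are placeholders; the compression ratios are what carry over.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bits):
    # 2 tensors per layer (keys and values), each seq_len x num_kv_heads x head_dim
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bits // 8

# Illustrative config (NOT the real Mistral Small 4 dimensions)
cfg = dict(num_layers=48, num_kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(256_000, bits=16, **cfg)
q4 = kv_cache_bytes(256_000, bits=4, **cfg)
q2 = kv_cache_bytes(256_000, bits=2, **cfg)

print(f"FP16 -> 4-bit: {fp16 // q4}x, FP16 -> 2-bit: {fp16 // q2}x")
# FP16 -> 4-bit: 4x, FP16 -> 2-bit: 8x
```

Only the bit-width changes between the three calls, so the ratio is exactly 16/4 = 4x and 16/2 = 8x regardless of the model's real dimensions.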

## Memory Estimates

| Component | FP16 Baseline | TurboQuant 4-bit | TurboQuant 2-bit |
|---|---|---|---|
| Model Weights | ~238 GB | ~238 GB | ~238 GB |
| KV Cache (256K ctx) | ~32 GB | ~8 GB | ~4 GB |
| Total | ~270 GB | ~246 GB | ~242 GB |

Note: This is a Sparse MoE model -- only 6.5B parameters are active per token, so inference is fast despite the 119B total parameter count.
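The estimates above are back-of-envelope arithmetic: FP16 stores 2 bytes per parameter, and each halving of the KV bit-width halves the cache. A quick check:

```python
weights_gb = 119e9 * 2 / 1e9   # 119B params x 2 bytes (FP16) = 238.0 GB
kv_fp16_gb = 32                # baseline KV cache at 256K ctx, from the table
kv_4bit_gb = kv_fp16_gb / 4    # 16-bit -> 4-bit: 4x smaller = 8.0 GB
kv_2bit_gb = kv_fp16_gb / 8    # 16-bit -> 2-bit: 8x smaller = 4.0 GB

print(weights_gb + kv_4bit_gb)  # 246.0
print(weights_gb + kv_2bit_gb)  # 242.0
```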

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model_id = "majentik/Mistral-Small-4-119B-TurboQuant"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Enable TurboQuant KV cache (4-bit default)
cache = TurboQuantCache(model, bits=4)

messages = [
    {"role": "user", "content": "Explain sparse mixture-of-experts architectures."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # open the assistant turn so the model generates a reply
    return_tensors="pt",
).to(model.device)
outputs = model.generate(
    inputs,
    max_new_tokens=512,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 2-bit Aggressive Mode

```python
cache = TurboQuantCache(model, bits=2)
```
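To see why 2-bit is a trade-off, here is a toy round-trip using plain absmax symmetric quantization. This is NOT TurboQuant's actual algorithm (see the paper linked above); it only illustrates how reconstruction error grows as bit-width shrinks.

```python
import random

def fake_quant(values, bits):
    # Absmax symmetric quantization: scale so the largest magnitude maps to
    # the top integer level, round to the nearest level, then map back.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

random.seed(0)
x = [random.gauss(0, 1) for _ in range(4096)]

def mean_abs_err(bits):
    return sum(abs(a - b) for a, b in zip(x, fake_quant(x, bits))) / len(x)

# 2-bit (4 levels) loses noticeably more than 4-bit (16 levels)
print(mean_abs_err(4), mean_abs_err(2))
```

With only 4 representable levels at 2 bits versus 16 at 4 bits, the rounding error is several times larger, which is the source of the "modest quality trade-off" noted above.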
