	---
	base_model: mistralai/Mistral-Small-4-119B-2603
	library_name: transformers
	license: apache-2.0
	tags:
	- rotorquant
	- kv-cache-quantization
	- mistral
	- moe
	- sparse-moe
	- multimodal
	- quantized
	- 256k-context
	- thinking
	pipeline_tag: text-generation
	---

	# Mistral-Small-4-119B-RotorQuant

	KV cache quantization for Mistral Small 4 using RotorQuant -- reported results are 5.3x faster prefill and 28% faster decode than the unquantized baseline, with no measured quality loss (perplexity 6.91 vs 7.07 for the baseline; lower is better).

	This repository provides RotorQuant KV cache quantization support for [mistralai/Mistral-Small-4-119B-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603). Model weights are unchanged (FP16); only the KV cache is quantized during inference.

	## Model Specs

	| Property | Value |
	|---|---|
	| Base Model | Mistral Small 4 (March 2026) |
	| Total Parameters | 119B |
	| Active Parameters | 6.5B per token (Sparse MoE) |
	| Architecture | Sparse MoE -- 128 experts, 4 active per token |
	| Context Length | 256K tokens |
	| Modality | Text + Images (multimodal) |
	| Capabilities | Thinking / reasoning, tool use, multilingual |
	| License | Apache 2.0 |
	| Quantization | KV cache only (RotorQuant) |

	## What is RotorQuant?

	[RotorQuant](https://github.com/scrya-com/rotorquant) is a rotation-based KV cache quantization method that applies learned rotations before quantizing the key-value cache. Key results:

	- 5.3x faster prefill than the unquantized baseline
	- 28% higher decode throughput
	- Perplexity of 6.91 vs 7.07 for the unquantized cache (lower is better); the lower figure is attributed to outlier suppression by the rotations
	- 3-bit quantization by default
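
The rotate-then-quantize idea described above can be sketched in a few lines of NumPy. This is a generic illustration only, not RotorQuant's actual algorithm: it assumes a random orthogonal matrix as a stand-in for the learned rotations and simple per-vector symmetric quantization, and the helper names (`random_rotation`, `quantize`, `rotate_quantize_roundtrip`) are hypothetical.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR (a stand-in for learned rotations)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal

def quantize(x, bits=3):
    """Symmetric per-vector uniform quantization to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 3 for 3-bit codes in [-3, 3]
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to floating-point values."""
    return codes.astype(np.float32) * scale

def rotate_quantize_roundtrip(keys, rotation, bits=3):
    """Rotate, quantize, dequantize, then undo the rotation."""
    rotated = keys @ rotation                  # spread energy across channels
    codes, scale = quantize(rotated, bits)     # what the cache would store
    restored = dequantize(codes, scale) @ rotation.T  # inverse is the transpose
    return codes, restored

keys = np.random.default_rng(1).standard_normal((8, 64))
codes, restored = rotate_quantize_roundtrip(keys, random_rotation(64))
print("mean squared error:", float(np.mean((restored - keys) ** 2)))
```

Because the rotation is orthogonal, it preserves vector norms, so the reconstruction error in the original space equals the quantization error in the rotated space. Whether rotating actually lowers that error depends on how the cache values are distributed.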

	## Memory Estimates

	| Component | FP16 Baseline | RotorQuant 3-bit |
	|---|---|---|
	| Model Weights | ~238 GB | ~238 GB |
	| KV Cache (256K ctx) | ~32 GB | ~6.5 GB |
	| Total | ~270 GB | ~244.5 GB |

	> Note: This is a Sparse MoE model -- only 6.5B parameters are active per token, so per-token compute is much lower than the 119B total suggests. All 119B parameters must still be held in memory.
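
The figures in the table can be reproduced with simple arithmetic. The sketch below assumes 2 bytes per FP16 parameter and decimal gigabytes. It treats the gap between the ideal 3/16 compression (6 GB) and the table's ~6.5 GB as per-block metadata such as scales; that overhead figure is an assumption, not something documented here. The function names are illustrative.

```python
GB = 1e9  # decimal gigabytes

def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory for model weights; FP16 uses 2 bytes per parameter."""
    return total_params * bytes_per_param / GB

def quantized_kv_gb(fp16_kv_gb, bits, overhead_gb=0.0):
    """Scale an FP16 (16-bit) KV cache size down to `bits` per value."""
    return fp16_kv_gb * bits / 16 + overhead_gb

weights = weight_memory_gb(119e9)                 # 119B params at FP16
kv_fp16 = 32.0                                    # table value at 256K context
kv_ideal = quantized_kv_gb(kv_fp16, bits=3)       # pure 3/16 compression
kv_table = quantized_kv_gb(kv_fp16, bits=3, overhead_gb=0.5)  # assumed overhead

print(f"weights:         {weights:.1f} GB")
print(f"FP16 total:      {weights + kv_fp16:.1f} GB")
print(f"3-bit KV ideal:  {kv_ideal:.1f} GB")
print(f"3-bit total:     {weights + kv_table:.1f} GB")
```

Weights dominate the footprint, so quantizing only the KV cache reduces total memory by roughly 9% at full context. The saving matters most when long contexts or large batches make the cache grow.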

	## Quickstart

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from turboquant import IsoQuantCache

	model_id = "majentik/Mistral-Small-4-119B-RotorQuant"

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype="auto",  # keep the checkpoint's native weight dtype
	    device_map="auto",   # place layers across the available devices
	)

	# Enable RotorQuant KV cache
	cache = IsoQuantCache(model)

	messages = [
	    {"role": "user", "content": "Explain sparse mixture-of-experts architectures."}
	]

	# add_generation_prompt=True appends the assistant-turn marker so the model replies
	inputs = tokenizer.apply_chat_template(
	    messages,
	    add_generation_prompt=True,
	    return_tensors="pt",
	).to(model.device)
	outputs = model.generate(
	    inputs,
	    max_new_tokens=512,
	    past_key_values=cache,  # use the quantized cache during generation
	)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	## See Also

	- [mistralai/Mistral-Small-4-119B-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603) -- Base model
	- [majentik/Mistral-Small-4-119B-TurboQuant](https://huggingface.co/majentik/Mistral-Small-4-119B-TurboQuant) -- TurboQuant KV cache variant
	- [majentik/Mistral-Small-4-119B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/Mistral-Small-4-119B-RotorQuant-MLX-4bit) -- MLX 4-bit weight-quantized + RotorQuant
	- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)