Mistral-Small-4-119B-RotorQuant-MLX-4bit

Dual compression: 4-bit MLX weight quantization + RotorQuant KV cache quantization for Mistral Small 4 on Apple Silicon.

This repository provides a 4-bit weight-quantized MLX conversion of mistralai/Mistral-Small-4-119B-2603 with RotorQuant KV cache quantization support. Designed for efficient inference on Apple Silicon Macs.

Overview

This model applies two complementary compression techniques:

  1. 4-bit weight quantization (MLX) -- reduces model weights from ~238 GB to ~60 GB
  2. RotorQuant KV cache quantization -- reduces KV cache from ~32 GB to ~6.5 GB at 256K context

Together, these cut the total memory footprint from roughly 270 GB to roughly 66.5 GB, making it feasible to run a 119B-parameter MoE model on high-memory Apple Silicon machines with good throughput.

Model Specs

Property               Value
Base Model             Mistral Small 4 (March 2026)
Total Parameters       119B
Active Parameters      6.5B per token (sparse MoE)
Architecture           Sparse MoE -- 128 experts, 4 active per token
Context Length         256K tokens
Modality               Text + images (multimodal)
Capabilities           Thinking / reasoning, tool use, multilingual
License                Apache 2.0
Weight Quantization    4-bit (MLX)
KV Cache Quantization  RotorQuant 3-bit

Memory Estimates

Configuration                         Weights   KV Cache (256K)   Total
FP16 baseline                         ~238 GB   ~32 GB            ~270 GB
This model (4-bit MLX + RotorQuant)   ~60 GB    ~6.5 GB           ~66.5 GB

Note: This is a Sparse MoE model -- only 6.5B parameters are active per token, so inference is fast despite the 119B total parameter count.
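As a sanity check, the figures in the table follow from bits-per-parameter arithmetic. The sketch below is illustrative only; real on-disk sizes also include small overheads for quantization scales and metadata, which is why ~59.5 GB is quoted as ~60 GB and ~6 GB as ~6.5 GB.

```python
# Back-of-the-envelope check of the memory table above (a sketch;
# exact totals include small overheads for quantization scales).
GB = 1e9  # decimal gigabytes, as the table appears to use

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight memory in GB for a given parameter count and bit width."""
    return n_params * bits_per_weight / 8 / GB

total_params = 119e9
fp16_weights = weight_gb(total_params, 16)  # ~238 GB
q4_weights = weight_gb(total_params, 4)     # ~59.5 GB, quoted as ~60 GB

kv_fp16 = 32.0              # GB at 256K context, from the table
kv_3bit = kv_fp16 * 3 / 16  # ~6 GB; per-block scales push this toward ~6.5 GB

print(f"FP16 weights: {fp16_weights:.1f} GB, "
      f"4-bit weights: {q4_weights:.1f} GB, 3-bit KV: {kv_3bit:.1f} GB")
```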

Quickstart

from mlx_lm import load, generate

model, tokenizer = load("majentik/Mistral-Small-4-119B-RotorQuant-MLX-4bit")

prompt = "Explain sparse mixture-of-experts architectures."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=text, max_tokens=512)
print(response)

What is RotorQuant?

RotorQuant is a rotation-based KV cache quantization method that applies learned rotations before quantizing the key-value cache. Key results on the base model:

  • 5.3x faster prefill compared to unquantized baseline
  • 28% faster decode throughput
  • Perplexity: 6.91 vs 7.07 for unquantized (lower is better)

Because it targets the KV cache rather than weights, it stacks with weight quantization for compounding memory savings.
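The general mechanism of rotation-based KV cache quantization can be illustrated with a toy NumPy sketch. RotorQuant's rotations are learned; here a random orthogonal matrix stands in for them, followed by symmetric 3-bit quantization with a per-token absmax scale. Everything below (function name, shapes, scaling scheme) is illustrative, not the actual implementation.

```python
import numpy as np

def rotorquant_3bit_sketch(kv: np.ndarray, rng: np.random.Generator):
    """Toy rotate-then-quantize sketch (hypothetical, not the real RotorQuant).

    Rotates each cached KV vector with a random orthogonal matrix, then
    quantizes to 3 bits (signed codes in [-4, 3]) with a per-token scale.
    """
    d = kv.shape[-1]
    # Random orthogonal rotation via QR (stand-in for a learned rotation).
    q_mat, _ = np.linalg.qr(rng.standard_normal((d, d)))
    rotated = kv @ q_mat
    # Per-token absmax scale mapping values into [-3, 3].
    scale = np.abs(rotated).max(axis=-1, keepdims=True) / 3
    codes = np.clip(np.round(rotated / scale), -4, 3).astype(np.int8)
    # Dequantize and undo the rotation to recover an approximation of kv.
    recon = (codes * scale) @ q_mat.T
    return codes, recon

rng = np.random.default_rng(0)
kv = rng.standard_normal((16, 64))  # 16 cached tokens, head_dim 64
codes, recon = rotorquant_3bit_sketch(kv, rng)
```

The rotation spreads energy evenly across dimensions, which tames outlier channels before the aggressive 3-bit rounding; because the rotation is orthogonal, it is exactly invertible at dequantization time.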
