Mistral-Small-4-119B-TurboQuant-MLX-2bit

Dual compression: 2-bit MLX weight quantization + TurboQuant KV cache quantization for Mistral Small 4 on Apple Silicon.

This repository provides a 2-bit weight-quantized MLX conversion of mistralai/Mistral-Small-4-119B-2603 with TurboQuant KV cache quantization support. The compression is aggressive and is intended for running the model on consumer Apple Silicon hardware.

Overview

This model applies two complementary compression techniques:

  1. 2-bit weight quantization (MLX) -- reduces model weights from ~238 GB to ~30 GB
  2. TurboQuant KV cache quantization -- reduces KV cache from ~32 GB to ~8 GB at 256K context

This enables running a 119B-parameter MoE model on Apple Silicon Macs with 64 GB+ unified memory.
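
The figures above follow from simple bits-per-value arithmetic. The sketch below reproduces them under stated assumptions: 1 GB is taken as 1e9 bytes, the per-group scales and biases that a real quantized checkpoint also stores are ignored, and the KV cache is assumed to shrink in proportion to its bit width. The helper names `weight_gb` and `scaled_cache_gb` are illustrative and are not part of MLX or TurboQuant.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores quantization metadata such as per-group scales and biases.
    """
    return n_params * bits_per_weight / 8 / 1e9


def scaled_cache_gb(baseline_gb: float, baseline_bits: int, target_bits: int) -> float:
    """Scale a cache size linearly with the number of bits per stored value."""
    return baseline_gb * target_bits / baseline_bits


TOTAL_PARAMS = 119e9        # total parameter count from the model card
FP16_KV_GB_AT_256K = 32     # FP16 KV cache estimate from the model card

fp16_weights = weight_gb(TOTAL_PARAMS, 16)
q2_weights = weight_gb(TOTAL_PARAMS, 2)
kv_4bit = scaled_cache_gb(FP16_KV_GB_AT_256K, 16, 4)

print(f"FP16 weights:   ~{fp16_weights:.0f} GB")
print(f"2-bit weights:  ~{q2_weights:.0f} GB")
print(f"4-bit KV cache: ~{kv_4bit:.0f} GB")
print(f"Compressed total: ~{q2_weights + kv_4bit:.0f} GB")
```

Actual on-disk and in-memory sizes will be somewhat higher than this estimate because of the quantization metadata and runtime buffers the sketch leaves out.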

Model Specs

| Property | Value |
| --- | --- |
| Base Model | Mistral Small 4 (March 2026) |
| Total Parameters | 119B |
| Active Parameters | 6.5B per token (Sparse MoE) |
| Architecture | Sparse MoE -- 128 experts, 4 active per token |
| Context Length | 256K tokens |
| Modality | Text + Images (multimodal) |
| Capabilities | Thinking / reasoning, tool use, multilingual |
| License | Apache 2.0 |
| Weight Quantization | 2-bit (MLX) |
| KV Cache Quantization | TurboQuant 4-bit |

Memory Estimates

| Configuration | Weights | KV Cache (256K) | Total |
| --- | --- | --- | --- |
| FP16 baseline | ~238 GB | ~32 GB | ~270 GB |
| This model (2-bit MLX + TurboQuant) | ~30 GB | ~8 GB | ~38 GB |

Note: This is a Sparse MoE model -- only 6.5B of the 119B parameters are active per token, so inference is fast despite the total parameter count, although all weights must still fit in memory. The 2-bit quantization trades some quality for significantly reduced memory. Expect modest degradation on complex reasoning tasks compared to 4-bit weight quantization.
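
To illustrate why only a fraction of the parameters are used per token, here is a minimal sketch of top-k expert routing with 128 experts and 4 active, matching the counts in the specs table. The routing rule (softmax over the selected logits) is a common choice and an assumption here; the source does not describe the router this model actually uses. The function name `top_k_route` and the random logits are hypothetical.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)          # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


NUM_EXPERTS = 128   # from the specs table
ACTIVE = 4          # from the specs table

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores

routes = top_k_route(logits, ACTIVE)
for expert, weight in routes:
    print(f"expert {expert:3d}  weight {weight:.3f}")
print(f"Experts evaluated for this token: {len(routes)} of {NUM_EXPERTS}")
```

Only the selected experts' feed-forward weights take part in the computation for a given token, which is why per-token cost tracks the active parameter count rather than the total.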

Quickstart

```python
from mlx_lm import load, generate

# Download (on first use) and load the 2-bit MLX weights and tokenizer
model, tokenizer = load("majentik/Mistral-Small-4-119B-TurboQuant-MLX-2bit")

# Wrap the user prompt in the model's chat template
prompt = "Explain sparse mixture-of-experts architectures."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate up to 512 new tokens
response = generate(model, tokenizer, prompt=text, max_tokens=512)
print(response)
```

What is TurboQuant?

TurboQuant (arXiv: 2504.19874) is a KV cache quantization method that compresses the key-value cache used during autoregressive generation. It supports 4-bit (default) and 2-bit (aggressive) modes. Because it targets the KV cache rather than weights, it stacks with weight quantization for compounding memory savings.
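
To give a feel for what storing cache entries at 4 bits involves, the sketch below applies plain min/max uniform quantization to one group of values. This is a generic illustration and is not the TurboQuant algorithm, which is described in the paper cited above. The function names, the group size of 64, and the random input are all assumptions made for the example.

```python
import random


def quantize_group(values, bits=4):
    """Uniformly quantize a group of floats to integer codes in [0, 2**bits - 1].

    Returns (codes, scale, offset); generic min/max quantization, not TurboQuant.
    """
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate float values."""
    return [c * scale + offset for c in codes]


rng = random.Random(0)
group = [rng.gauss(0.0, 1.0) for _ in range(64)]   # stand-in for 64 cached key values

codes, scale, offset = quantize_group(group, bits=4)
restored = dequantize_group(codes, scale, offset)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(f"max reconstruction error: {max_err:.4f} (step size {scale:.4f})")
print(f"bits per value: 4 instead of 16, before scale/offset overhead")
```

Going from 16 bits to 4 bits per cached value is the 4x reduction behind the ~32 GB to ~8 GB estimate, at the cost of a bounded rounding error per value.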
