Gemma 4 26B-A4B - TurboQuant MLX 4-bit

4-bit weight-quantized MLX build of google/gemma-4-26B-A4B with TurboQuant KV-cache quantization, optimized for Apple Silicon inference via the MLX framework. The 4-bit variant balances model quality against memory use. Because the model is a Mixture-of-Experts, only about 4B of its 26B parameters are active per token, so per-token compute is much lower than the total parameter count suggests. All 26B weights must still be resident in memory.

Approximate model size: ~14 GB

Model Specifications

Property	Value
Base Model	google/gemma-4-26B-A4B
Parameters	26 billion total (4 billion active per token)
Architecture	Mixture-of-Experts (MoE)
Modality	Multimodal: image + text input, text output
License	Apache 2.0
Weight Quantization	4-bit (~14 GB)
KV-Cache Quantization	TurboQuant
Framework	MLX (Apple Silicon)

Quickstart

from mlx_lm import load, generate

# Downloads the weights from the Hugging Face Hub on first use.
model, tokenizer = load("majentik/gemma-4-26B-A4B-TurboQuant-MLX-4bit")

prompt = "The history of artificial intelligence began"
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)

For multimodal usage with images, use the mlx-vlm package instead of mlx-lm:

from mlx_vlm import load, generate

model, processor = load("majentik/gemma-4-26B-A4B-TurboQuant-MLX-4bit")

prompt = "Describe the contents of this image."
output = generate(model, processor, prompt=prompt, image="path/to/image.jpg", max_tokens=512)
print(output)

What is TurboQuant?

TurboQuant (arXiv: 2504.19874) is a KV-cache quantization technique that compresses the key-value cache used during autoregressive generation. Combined with 4-bit weight quantization in MLX, this provides a dual compression strategy: smaller model weights for reduced disk and memory footprint, plus compressed KV cache for efficient long-context generation.
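
To make the long-context benefit concrete, here is a back-of-the-envelope sketch of KV-cache size at two precisions. The layer count, KV-head count, head dimension, and context length are hypothetical placeholders, not Gemma 4's actual configuration. The estimate assumes every layer caches the full context and ignores any metadata a real quantizer may store, so treat the numbers as rough orders of magnitude only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV-cache size in bytes for a single sequence.

    Counts one key and one value vector per layer, per KV head, per token.
    Quantization metadata (scales, norms, and similar) is not included.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # keys + values
    return values * bits_per_value / 8


# Hypothetical configuration for illustration, NOT the real Gemma 4 values.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)

fp16_bytes = kv_cache_bytes(**cfg, bits_per_value=16)
q4_bytes = kv_cache_bytes(**cfg, bits_per_value=4)

print(f"FP16 cache:  {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit cache: {q4_bytes / 2**30:.2f} GiB")
```

The cache grows linearly with context length, so the saving from a lower-precision cache matters most for long prompts and long generations.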

KV-Cache Quantization Comparison

Method	Prefill Speed	Decode Speed	Memory Savings	Reference
TurboQuant	1x (baseline)	1x (baseline)	High	arXiv: 2504.19874
RotorQuant	5.3x faster	28% faster	High	GitHub

Memory Estimates (Gemma 4 26B-A4B)

Precision	Approximate Size	MLX Variant
FP16 (original)	~52 GB	--
8-bit quantized	~26 GB	TurboQuant-MLX-8bit
4-bit quantized	~14 GB	This model
2-bit quantized	~7 GB	TurboQuant-MLX-2bit
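
The figures in the table follow from simple arithmetic: total parameter count times bits per weight, divided by 8 bits per byte. The sketch below reproduces that calculation. It is a lower bound, since it counts raw weights only. The table's slightly higher values for 4-bit and 2-bit (~14 GB and ~7 GB) presumably include overhead such as quantization scales or tensors kept at higher precision, which is an assumption on our part.

```python
def weight_size_gb(num_params, bits_per_weight):
    """Raw weight storage in decimal gigabytes, with no quantization overhead."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count from the specifications table

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: ~{weight_size_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Note that the total parameter count is used here, not the 4B active parameters: every expert has to be loaded even though only some are used for a given token.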

Hardware Requirements

This model requires approximately 14 GB of unified memory. Recommended hardware:

Apple M2 Pro (24 GB+)
Apple M3 Pro (24 GB+)
Apple M4 Pro (24 GB+)
Any Apple Silicon Mac with 24 GB+ unified memory
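
The gap between the ~14 GB model size and the 24 GB recommendation leaves room for the KV cache, the operating system, and other applications. A minimal sketch of that budget check follows. The `kv_cache_gb` and `reserved_gb` defaults are illustrative assumptions, not measured values, and `fits_in_memory` is a hypothetical helper rather than part of MLX.

```python
def fits_in_memory(total_gb, weights_gb=14.0, kv_cache_gb=1.0, reserved_gb=6.0):
    """Return True if weights, KV cache, and a system reserve fit in total_gb.

    kv_cache_gb and reserved_gb are rough placeholder values; adjust them
    for your context length and for whatever else is running on the machine.
    """
    return weights_gb + kv_cache_gb + reserved_gb <= total_gb


for total in (16, 24, 36):
    status = "ok" if fits_in_memory(total) else "too small"
    print(f"{total} GB unified memory: {status}")
```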

Paper

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arXiv: 2504.19874, published Apr 28, 2025.
