Gemma 4 26B-A4B-it - RotorQuant MLX 2-bit
2-bit weight-quantized MLX version of google/gemma-4-26B-A4B-it with RotorQuant KV-cache quantization. Optimized for Apple Silicon inference via the MLX framework. RotorQuant delivers 5.3x faster prefill and 28% faster decode compared to TurboQuant. The most aggressive quantization, fitting the full model in the smallest possible footprint. Only 4B parameters are active per token despite 26B total, making this model significantly more efficient at inference time than its parameter count suggests.
Approximate model size: ~7 GB
Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-26B-A4B-it |
| Parameters | 26 billion total (4 billion active per token) |
| Architecture | Mixture-of-Experts (MoE) (4B active per token) |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | 2-bit (~7 GB) |
| KV-Cache Quantization | RotorQuant |
| Framework | MLX (Apple Silicon) |
Quickstart
import mlx.core as mx
from mlx_lm import load, generate
model, tokenizer = load("majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit")
prompt = "Describe this image in detail."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
For multimodal usage with images:
from mlx_vlm import load, generate
model, processor = load("majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit")
prompt = "What do you see in this image?"
output = generate(model, processor, prompt=prompt, image="path/to/image.jpg", max_tokens=512)
print(output)
What is RotorQuant?
RotorQuant is a high-performance KV-cache quantization method that achieves significantly better throughput than TurboQuant. Combined with 2-bit weight quantization in MLX, this provides maximum compression with the best available KV-cache performance: the smallest possible model footprint plus the fastest compressed KV cache for efficient long-context generation.
Key advantages over TurboQuant:
- 5.3x faster prefill
- 28% faster decode
- Equivalent memory savings
Note: 2-bit quantization is the most aggressive option and may result in some quality degradation compared to higher-precision variants. It is best suited for experimentation, rapid prototyping, or hardware-constrained environments.
KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
Memory Estimates (Gemma 4 26B-A4B-it)
| Precision | Approximate Size | MLX Variant |
|---|---|---|
| FP16 (original) | ~52 GB | -- |
| 8-bit quantized | ~26 GB | RotorQuant-MLX-8bit |
| 4-bit quantized | ~14 GB | RotorQuant-MLX-4bit |
| 2-bit quantized | ~7 GB | This model |
Hardware Requirements
This model requires approximately 7 GB of unified memory. Recommended hardware:
- Apple M1 (16 GB+)
- Apple M2 (16 GB+)
- Apple M3 (16 GB+)
- Apple M4 (16 GB+)
- Any Apple Silicon Mac with 16 GB+ unified memory
See Also
- google/gemma-4-26B-A4B-it -- Base model
- majentik/gemma-4-26B-A4B-it-RotorQuant -- RotorQuant KV-cache only (transformers)
- majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-8bit -- MLX 8-bit variant
- majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-4bit -- MLX 4-bit variant
- majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-2bit -- TurboQuant MLX 2-bit variant
- RotorQuant GitHub
- MLX Framework
- Downloads last month
- 61
2-bit
Model tree for majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit
Base model
google/gemma-4-26B-A4B-it