# gemma-4-19b-a4b-it-REAP-MLX-4bit

PLE-safe MLX 4-bit weights for 0xSero/gemma-4-19b-a4b-it-REAP on Apple Silicon.

REAP (Router-weighted Expert Activation Pruning) removes 30% of the MoE experts while keeping the same number of active parameters per token (8 of 90 experts selected). Combined with PLE-safe 4-bit quantization, this model runs in 12.6 GB and fits on Macs with 24 GB+ of unified memory.

| | Original 26B | REAP (19B MoE, 30% pruned) | This model |
|---|---|---|---|
| Experts/layer | 128 | 90 | 90 |
| Precision | BF16 | BF16 | 4-bit |
| Disk size | ~52 GB | ~36 GB | 12.6 GB |
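The table's numbers can be sanity-checked with back-of-envelope arithmetic (illustrative only; assumes ~0.5 bytes per parameter for 4-bit weights, ignoring scale/bias overhead):

```python
# Expert pruning: 128 -> 90 experts per layer.
pruned_fraction = 1 - 90 / 128
print(f"experts pruned: {pruned_fraction:.0%}")  # 30%

# 4-bit weights take roughly 0.5 bytes/param, so 19B params land near 9.5 GB
# before counting the layers kept in bf16.
approx_4bit_gb = 19e9 * 0.5 / 1e9
print(f"approx 4-bit weight size: {approx_4bit_gb:.1f} GB")
```

The gap to the 12.6 GB on-disk size is plausibly accounted for by the layers kept in bf16 (embeddings, PLE pathway, vision encoder) plus quantization scales and biases.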

## Also available

## Accuracy Benchmarks

0-shot generative, thinking enabled, 50 samples per task, identical eval harness for both models. Hardware: Apple M4 Max, 36 GB.

| Task | 26B-A4B 4-bit (16.4 GB) | This model (12.6 GB) |
|---|---|---|
| Elementary Mathematics | 84% | 44% |
| Philosophy | 66% | 54% |
| World Religions | 66% | 34% |
| College Computer Science | 58% | 34% |
| High School Mathematics | 26% | 22% |
| Abstract Algebra | 44% | 36% |
| College Mathematics | 36% | 16% |
| GSM8K | 64% | 62% |

Accuracy loss from the 30% expert pruning compounds with 4-bit quantization error. Note also the high extraction failure rate: unparseable responses are counted as incorrect, and on some tasks up to 60% of responses fail extraction because the model generates verbose explanations instead of single-letter answers. Overall for REAP-19B, 113/400 responses (28%) failed extraction, so true accuracy may be higher than reported. Consider the 21B variant for better accuracy. Full methodology: GitHub.

## Quantization Details

- Bits: 4
- Group size: 64
- Strategy: PLE-safe; only large `nn.Linear` and `SwitchLinear` (MoE expert) layers are quantized. All PLE/`ScaledLinear`/vision layers stay in bf16.
| Quantized (4-bit) | Kept in bf16 |
|---|---|
| Attention projections (`q/k/v/o_proj`) | `ScaledEmbedding` (`embed_tokens`) |
| MLP layers (`gate/up/down_proj`) | `ScaledLinear` (PLE pathway) |
| MoE expert layers (`SwitchLinear`) | Per-layer embeddings (`per_layer_*`) |
| | Vision encoder |
| | All norms and scalars |
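The selection rules above amount to a filtering predicate over module paths and types. The sketch below shows that logic only; `ple_safe_predicate` is a hypothetical name, and the path markers are assumptions about this checkpoint's layer naming, so the published quantization script may differ:

```python
def ple_safe_predicate(path: str, module) -> bool:
    """Return True if `module` at `path` should be quantized to 4-bit.

    Sketch of the rules in the table above: quantize large linear and
    MoE expert layers; keep PLE, embedding, and vision layers in bf16.
    """
    # Must support quantization at all (true for nn.Linear and SwitchLinear).
    if not hasattr(module, "to_quantized"):
        return False
    # Keep per-layer embeddings, token embeddings, and the vision tower in bf16.
    if any(marker in path for marker in ("per_layer", "embed_tokens", "vision")):
        return False
    # Keep ScaledLinear / ScaledEmbedding (PLE pathway) in bf16.
    if type(module).__name__.startswith("Scaled"):
        return False
    return True
```

In recent MLX versions a predicate in this style can be passed as the `class_predicate` argument of `mlx.nn.quantize(model, group_size=64, bits=4, class_predicate=...)`.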

## Usage

```shell
pip install -U mlx-vlm
```

### Vision

```python
from mlx_vlm import load, generate

model, processor = load("ukint-vs/gemma-4-19b-a4b-it-REAP-MLX-4bit")
tokenizer = processor.tokenizer

messages = [{"role": "user", "content": [
    {"type": "image", "url": "photo.jpg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, processor, prompt, ["photo.jpg"],
    max_tokens=200, repetition_penalty=1.2, temperature=0.7)
print(out.text)
```

### Text

```python
from mlx_vlm import load, generate

model, processor = load("ukint-vs/gemma-4-19b-a4b-it-REAP-MLX-4bit")
tokenizer = processor.tokenizer

messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, processor, prompt, max_tokens=100, temperature=0.0)
print(out.text)
```

## Validation

Trimodal validation: 10/10 vision, 3/3 chat (EN/ZH/JA). Full results: GitHub.

## Bugs Fixed in mlx-vlm

| # | Bug | Fix |
|---|---|---|
| 1 | `ScaledLinear` inherits `nn.Module`, not `nn.Linear` | Change to `class ScaledLinear(nn.Linear)` |
| 2 | Standard quantization quantizes PLE layers | PLE-safe `class_predicate` |
| 3 | `processor.save_pretrained()` strips the audio config | Copy `processor_config.json` from the source model |
| 4 | `SwitchLinear` (MoE) not quantized | Check `hasattr(module, 'to_quantized')` |
| 5 | `embed_scale` double-scaling (mlx-vlm 0.4.4+) | Set `Gemma4TextModel.embed_scale = 1.0` |
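For bug 3, the workaround amounts to a file copy after saving the quantized checkpoint. A minimal sketch (the function name and directory arguments are hypothetical, for illustration):

```python
import shutil
from pathlib import Path

def restore_processor_config(src_dir: str, dst_dir: str) -> None:
    """Copy processor_config.json from the source checkpoint into the
    quantized output, restoring the audio section that
    processor.save_pretrained() strips (bug 3)."""
    shutil.copy(Path(src_dir) / "processor_config.json",
                Path(dst_dir) / "processor_config.json")
```

Call it once after writing the quantized weights, pointing `src_dir` at the original bf16 checkpoint.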

## License

Model weights: Google Gemma License. Quantization scripts: MIT.
