# gemma-4-19b-a4b-it-REAP-MLX-4bit
PLE-safe MLX 4-bit weights for 0xSero/gemma-4-19b-a4b-it-REAP on Apple Silicon.
REAP (Router-weighted Expert Activation Pruning) removes 30% of MoE experts while keeping the same active parameters per token (8 of 90 experts selected). Combined with PLE-safe 4-bit quantization, this model runs in 12.6 GB — fits on 24GB+ Macs.
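The pruning step can be sketched as a top-k selection over per-expert saliency scores. This is an illustrative reconstruction, not the REAP authors' code; using the average router weight per expert as the saliency score is an assumption.

```python
def reap_prune(expert_scores, prune_fraction=0.30):
    """Return the indices of experts that survive pruning.

    `expert_scores` holds one saliency value per expert (here assumed to be
    the average router weight that expert received over a calibration set).
    The lowest-scoring `prune_fraction` of experts are dropped; the router's
    top-k selection at inference time is unchanged.
    """
    n = len(expert_scores)
    n_keep = n - int(n * prune_fraction)
    ranked = sorted(range(n), key=lambda i: expert_scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# 128 experts pruned by 30% leaves 90, matching the table below.
surviving = reap_prune([float(i) for i in range(128)])
```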
| | Original 26B | REAP 19B MoE (30% pruned) | This model |
|---|---|---|---|
| Experts/layer | 128 | 90 | 90 |
| Precision | BF16 | BF16 | 4-bit |
| Disk size | ~52 GB | ~36 GB | 12.6 GB |
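A back-of-envelope check on those disk sizes (approximate only; real checkpoints also carry metadata, per-group quantization scales, and the bf16 layers listed under Quantization Details):

```python
# bf16 stores 2 bytes per parameter; 4-bit stores 0.5 byte per parameter.
bf16_bytes_per_param = 2.0
gb = 1e9

size_26b_bf16 = 26e9 * bf16_bytes_per_param / gb   # ~52 GB, matches the table
size_19b_bf16 = 19e9 * bf16_bytes_per_param / gb   # ~38 GB (card reports ~36 GB)

# Lower bound for the 4-bit file: quantized weights alone.
# The gap up to the reported 12.6 GB comes from layers kept in bf16
# plus quantization scales and biases.
size_19b_4bit_core = 19e9 * 0.5 / gb               # ~9.5 GB
```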
## Also available
- gemma-4-21b-a4b-it-REAP-MLX-4bit — 21B MoE (20% pruned), 13.9 GB
## Accuracy Benchmarks

0-shot generative evaluation, thinking enabled, 50 samples per task, identical eval harness for both models. Run on an Apple M4 Max (36 GB).
| Task | 26B-A4B 4-bit (16.4 GB) | This model (12.6 GB) |
|---|---|---|
| Elementary Mathematics | 84% | 44% |
| Philosophy | 66% | 54% |
| World Religions | 66% | 34% |
| College Computer Science | 58% | 34% |
| High School Mathematics | 26% | 22% |
| Abstract Algebra | 44% | 36% |
| College Mathematics | 36% | 16% |
| GSM8K | 64% | 62% |
The 30% expert pruning compounds with 4-bit quantization losses. Note the high extraction-failure rates: responses that cannot be parsed into an answer letter (the model writing a verbose explanation instead) are counted as incorrect. REAP-19B produced 113/400 (28%) such failures overall, up to 60% on some tasks, so true accuracy may be higher than reported. Consider the 21B variant for better accuracy. Full methodology: GitHub.
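The failure-as-incorrect accounting works roughly as follows; the answer-letter regex is a hypothetical stand-in, not the harness's actual extraction pattern.

```python
import re

def extract_choice(response: str):
    """Pull the final standalone A-D answer letter from a model response.

    Returns None (an extraction failure) when no such letter is found,
    e.g. when the model produced a verbose explanation instead.
    Illustrative pattern only; the real harness regex is not shown here.
    """
    m = re.search(r"\b([A-D])\b(?!.*\b[A-D]\b)", response.strip(), re.S)
    return m.group(1) if m else None

def grade(predictions, golds):
    """Accuracy with extraction failures (None) counted as incorrect."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)
```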
## Quantization Details

- Bits: 4
- Group size: 64
- Strategy: PLE-safe: only large `nn.Linear` and `SwitchLinear` (MoE expert) layers are quantized. All PLE/`ScaledLinear`/vision layers stay in bf16.
| Quantized (4-bit) | Kept in bf16 |
|---|---|
| Attention projections (q/k/v/o_proj) | ScaledEmbedding (embed_tokens) |
| MLP layers (gate/up/down_proj) | ScaledLinear (PLE pathway) |
| MoE expert layers (SwitchLinear) | Per-layer embeddings (per_layer_*) |
| | Vision encoder |
| | All norms and scalars |
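In MLX, this kind of selective quantization is expressed as a `class_predicate` callback passed to the quantizer. The sketch below mimics the PLE-safe filter in pure Python so it runs without MLX installed; the simplified signature and the path substrings are assumptions, not the card's actual script.

```python
# Class names that should be quantized, per the table above.
# ScaledLinear is deliberately absent: the PLE pathway stays in bf16.
QUANT_CLASSES = {"Linear", "SwitchLinear"}

# Path substrings assumed to mark layers that must stay in bf16.
SKIP_SUBSTRINGS = ("per_layer", "embed", "vision")

def ple_safe_predicate(path: str, class_name: str, weight_dim: int) -> bool:
    """Return True if a layer should be 4-bit quantized.

    Mirrors the card's rule: only large nn.Linear / SwitchLinear layers are
    quantized; PLE, embedding, vision, and norm layers stay in bf16.
    (MLX's real hook receives the module itself, not a class-name string.)
    """
    if class_name not in QUANT_CLASSES:
        return False
    if any(s in path for s in SKIP_SUBSTRINGS):
        return False
    # Group size 64 requires the weight dimension to be divisible by 64.
    return weight_dim % 64 == 0
```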
## Usage

```bash
pip install -U mlx-vlm
```
### Vision

```python
from mlx_vlm import load, generate

model, processor = load("ukint-vs/gemma-4-19b-a4b-it-REAP-MLX-4bit")
tokenizer = processor.tokenizer

messages = [{"role": "user", "content": [
    {"type": "image", "url": "photo.jpg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

out = generate(model, processor, prompt, ["photo.jpg"],
               max_tokens=200, repetition_penalty=1.2, temperature=0.7)
print(out.text)
```
### Text

Reuses the `model`, `processor`, and `tokenizer` loaded above; simply omit the image argument.

```python
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, processor, prompt, max_tokens=100, temperature=0.0)
print(out.text)
```
## Validation
Trimodal validation: 10/10 vision, 3/3 chat (EN/ZH/JA). Full results: GitHub.
## Bugs Fixed in mlx-vlm

| # | Bug | Fix |
|---|---|---|
| 1 | `ScaledLinear` inherits `nn.Module`, not `nn.Linear` | Change to `ScaledLinear(nn.Linear)` |
| 2 | Standard quantization quantizes PLE layers | PLE-safe `class_predicate` |
| 3 | `processor.save_pretrained()` strips audio config | Copy `processor_config.json` from source |
| 4 | `SwitchLinear` (MoE) not quantized | Check `hasattr(module, 'to_quantized')` |
| 5 | `embed_scale` double-scaling (mlx-vlm 0.4.4+) | Set `Gemma4TextModel.embed_scale = 1.0` |
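Bug #4's fix is a duck-typing check: MLX layers that support quantization expose a `to_quantized` method, so testing for that attribute catches `SwitchLinear` even though it is not an `nn.Linear` subclass. A minimal sketch with stand-in classes (MLX itself not required):

```python
class Linear:
    """Stand-in for mlx.nn.Linear: quantizable."""
    def to_quantized(self, group_size=64, bits=4): ...

class SwitchLinear:
    """Stand-in for the MoE expert layer: quantizable but NOT a Linear subclass,
    so an isinstance(module, Linear) filter would silently skip it (bug #4)."""
    def to_quantized(self, group_size=64, bits=4): ...

class RMSNorm:
    """Stand-in for a norm layer: has no quantized form."""

def is_quantizable(module) -> bool:
    # The fix: test for the capability instead of the class hierarchy.
    return hasattr(module, "to_quantized")
```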
## License
Model weights: Google Gemma License. Quantization scripts: MIT.