# gemma-4-19b-a4b-it-REAP-MLX-4bit
PLE-safe MLX 4-bit weights for 0xSero/gemma-4-19b-a4b-it-REAP on Apple Silicon.
REAP (Router-weighted Expert Activation Pruning) removes 30% of MoE experts while keeping the same active parameters per token (8 of 90 experts selected). Combined with PLE-safe 4-bit quantization, this model runs in 12.6 GB — fits on 24GB+ Macs.
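The pruning step can be sketched as a top-k selection over per-expert saliency scores. This is an illustrative reconstruction, not the REAP authors' code; using the average router weight per expert as the saliency score is an assumption.

```python
def reap_prune(expert_scores, prune_fraction=0.30):
    """Return the indices of experts that survive pruning.

    `expert_scores` holds one saliency value per expert (here assumed to be
    the average router weight that expert received over a calibration set).
    The lowest-scoring `prune_fraction` of experts are dropped; the router's
    top-k selection at inference time is unchanged.
    """
    n = len(expert_scores)
    n_keep = n - int(n * prune_fraction)
    ranked = sorted(range(n), key=lambda i: expert_scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# 128 experts pruned by 30% leaves 90, matching the table below.
surviving = reap_prune([float(i) for i in range(128)])
```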
| | Original 26B | REAP 19B MoE (30% pruned) | This model |
|---|---|---|---|
| Experts/layer | 128 | 90 | 90 |
| Precision | BF16 | BF16 | 4-bit |
| Disk size | ~52 GB | ~36 GB | 12.6 GB |
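A back-of-envelope check on those disk sizes (approximate only; real checkpoints also carry metadata, per-group quantization scales, and the bf16 layers listed under Quantization Details):

```python
# bf16 stores 2 bytes per parameter; 4-bit stores 0.5 byte per parameter.
bf16_bytes_per_param = 2.0
gb = 1e9

size_26b_bf16 = 26e9 * bf16_bytes_per_param / gb   # ~52 GB, matches the table
size_19b_bf16 = 19e9 * bf16_bytes_per_param / gb   # ~38 GB (card reports ~36 GB)

# Lower bound for the 4-bit file: quantized weights alone.
# The gap up to the reported 12.6 GB comes from layers kept in bf16
# plus quantization scales and biases.
size_19b_4bit_core = 19e9 * 0.5 / gb               # ~9.5 GB
```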
## Also available
- gemma-4-21b-a4b-it-REAP-MLX-4bit — 21B MoE (20% pruned), 13.9 GB
## Accuracy Benchmarks

0-shot generative evaluation, thinking enabled, 50 samples per task, identical eval harness for both models. Run on an Apple M4 Max (36 GB).
| Task | 26B-A4B 4-bit (16.4 GB) | This model (12.6 GB) |
|---|---|---|
| Elementary Mathematics | 84% | 44% |
| Philosophy | 66% | 54% |
| World Religions | 66% | 34% |
| College Computer Science | 58% | 34% |
| High School Mathematics | 26% | 22% |
| Abstract Algebra | 44% | 36% |
| College Mathematics | 36% | 16% |
| GSM8K | 64% | 62% |
The 30% expert pruning compounds with 4-bit quantization losses. Note the high extraction-failure rates: responses that cannot be parsed into an answer letter (the model writing a verbose explanation instead) are counted as incorrect. REAP-19B produced 113/400 (28%) such failures overall, up to 60% on some tasks, so true accuracy may be higher than reported. Consider the 21B variant for better accuracy. Full methodology: GitHub.
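The failure-as-incorrect accounting works roughly as follows; the answer-letter regex is a hypothetical stand-in, not the harness's actual extraction pattern.

```python
import re

def extract_choice(response: str):
    """Pull the final standalone A-D answer letter from a model response.

    Returns None (an extraction failure) when no such letter is found,
    e.g. when the model produced a verbose explanation instead.
    Illustrative pattern only; the real harness regex is not shown here.
    """
    m = re.search(r"\b([A-D])\b(?!.*\b[A-D]\b)", response.strip(), re.S)
    return m.group(1) if m else None

def grade(predictions, golds):
    """Accuracy with extraction failures (None) counted as incorrect."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)
```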
## Quantization Details

- Bits: 4
- Group size: 64
- Strategy: PLE-safe: only large `nn.Linear` and `SwitchLinear` (MoE expert) layers are quantized. All PLE/`ScaledLinear`/vision layers stay in bf16.
| Quantized (4-bit) | Kept in bf16 |
|---|---|
| Attention projections (q/k/v/o_proj) | ScaledEmbedding (embed_tokens) |
| MLP layers (gate/up/down_proj) | ScaledLinear (PLE pathway) |
| MoE expert layers (SwitchLinear) | Per-layer embeddings (per_layer_*) |
| | Vision encoder |
| | All norms and scalars |
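In MLX, this kind of selective quantization is expressed as a `class_predicate` callback passed to the quantizer. The sketch below mimics the PLE-safe filter in pure Python so it runs without MLX installed; the simplified signature and the path substrings are assumptions, not the card's actual script.

```python
# Class names that should be quantized, per the table above.
# ScaledLinear is deliberately absent: the PLE pathway stays in bf16.
QUANT_CLASSES = {"Linear", "SwitchLinear"}

# Path substrings assumed to mark layers that must stay in bf16.
SKIP_SUBSTRINGS = ("per_layer", "embed", "vision")

def ple_safe_predicate(path: str, class_name: str, weight_dim: int) -> bool:
    """Return True if a layer should be 4-bit quantized.

    Mirrors the card's rule: only large nn.Linear / SwitchLinear layers are
    quantized; PLE, embedding, vision, and norm layers stay in bf16.
    (MLX's real hook receives the module itself, not a class-name string.)
    """
    if class_name not in QUANT_CLASSES:
        return False
    if any(s in path for s in SKIP_SUBSTRINGS):
        return False
    # Group size 64 requires the weight dimension to be divisible by 64.
    return weight_dim % 64 == 0
```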
## Usage

```bash
pip install -U mlx-vlm
```
### Vision

```python
from mlx_vlm import load, generate

model, processor = load("ukint-vs/gemma-4-19b-a4b-it-REAP-MLX-4bit")
tokenizer = processor.tokenizer

messages = [{"role": "user", "content": [
    {"type": "image", "url": "photo.jpg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

out = generate(model, processor, prompt, ["photo.jpg"],
               max_tokens=200, repetition_penalty=1.2, temperature=0.7)
print(out.text)
```
### Text

Reuses the `model`, `processor`, and `tokenizer` loaded above; simply omit the image argument.

```python
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, processor, prompt, max_tokens=100, temperature=0.0)
print(out.text)
```
## Validation
Trimodal validation: 10/10 vision, 3/3 chat (EN/ZH/JA). Full results: GitHub.
## Bugs Fixed in mlx-vlm

| # | Bug | Fix |
|---|---|---|
| 1 | `ScaledLinear` inherits `nn.Module`, not `nn.Linear` | Change to `ScaledLinear(nn.Linear)` |
| 2 | Standard quantization quantizes PLE layers | PLE-safe `class_predicate` |
| 3 | `processor.save_pretrained()` strips audio config | Copy `processor_config.json` from source |
| 4 | `SwitchLinear` (MoE) not quantized | Check `hasattr(module, 'to_quantized')` |
| 5 | `embed_scale` double-scaling (mlx-vlm 0.4.4+) | Set `Gemma4TextModel.embed_scale = 1.0` |
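Bug #4's fix is a duck-typing check: MLX layers that support quantization expose a `to_quantized` method, so testing for that attribute catches `SwitchLinear` even though it is not an `nn.Linear` subclass. A minimal sketch with stand-in classes (MLX itself not required):

```python
class Linear:
    """Stand-in for mlx.nn.Linear: quantizable."""
    def to_quantized(self, group_size=64, bits=4): ...

class SwitchLinear:
    """Stand-in for the MoE expert layer: quantizable but NOT a Linear subclass,
    so an isinstance(module, Linear) filter would silently skip it (bug #4)."""
    def to_quantized(self, group_size=64, bits=4): ...

class RMSNorm:
    """Stand-in for a norm layer: has no quantized form."""

def is_quantizable(module) -> bool:
    # The fix: test for the capability instead of the class hierarchy.
    return hasattr(module, "to_quantized")
```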
## License
Model weights: Google Gemma License. Quantization scripts: MIT.