𓌳 REAP𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression
📄 Paper • 💻 Code • 📝 Blog
GLM-4.7-REAP-09
⚠️ Model Status & Deployment Note
Important Update regarding GGUF support:
A critical bug was recently identified in llama.cpp's `scoring_func` handling for GLM models, which caused looping and poor output quality. While the base weights are functional, the GGUF files for this model are currently being re-generated to ensure full compatibility with the latest fixes.
- Status: GGUF files are scheduled for re-upload by January 24, 2026.
- Recommendation: If you are using local inference via `llama.cpp` or Unsloth, please refer to the official Unsloth GLM-4.7-Flash documentation for the most stable configuration parameters.
- Native Support: The BF16/FP16 weights remain compatible with `transformers` and `vLLM` for immediate use.
✨ Highlights
50% Expert-Pruned GLM-4.7 Flash optimized for code generation, function calling, and agentic workflows.
Created using REAP (Router-weighted Expert Activation Pruning) by Cerebras:
- Calibrated for Code & Tools: Preserves coding and function-calling capabilities
- One-Shot Compression: No fine-tuning required
- Drop-in Compatible: Works with vLLM, Transformers, SGLang
🙏 Acknowledgments
- Runpod — Compute for REAP
- Cerebras — REAP methodology
---
The Science Behind Dataset Selection
REAP Algorithm:
1. Forward pass calibration samples through model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune lowest-saliency experts
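The four steps above can be sketched in a few lines. This is an illustrative reimplementation of the saliency rule (router weight × activation norm, averaged over the tokens that actually activated each expert), not Cerebras' actual code; all names and shapes are assumptions.

```python
import numpy as np

def expert_saliency(router_weights, activation_norms, topk_mask):
    """REAP-style saliency sketch.
    router_weights:   (tokens, experts) softmax gate values
    activation_norms: (tokens, experts) L2 norm of each expert's output
    topk_mask:        (tokens, experts) 1 where the expert was in the top-k
    """
    # Per-token contribution: gate weight times output magnitude,
    # zeroed out for experts the router never selected.
    contrib = router_weights * activation_norms * topk_mask
    # Average over the tokens that actually used each expert.
    counts = topk_mask.sum(axis=0).clip(min=1)
    return contrib.sum(axis=0) / counts

def keep_top_experts(saliency, keep_ratio=0.5):
    """Return sorted indices of the experts to KEEP (highest saliency)."""
    n_keep = max(1, int(len(saliency) * keep_ratio))
    return np.sort(np.argsort(saliency)[-n_keep:])
```

Pruning the complement of `keep_top_experts(...)` removes the lowest-saliency experts in one shot, which is why calibration coverage matters: an expert that never fires on the calibration set scores zero regardless of its real-world importance.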
Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context
If calibration lacks code → code-specialized experts appear "unused" → get pruned → model loses coding ability
Cerebras' Original Mix (from paper)
Cerebras used these 3 datasets in their GLM-4.6 REAP experiments:
- evol-codealpaca-v1 for code generation
- xlam-function-calling-60k for tool calling
- SWE-smith-trajectories for agentic tasks
We followed this exact recipe for reproducibility.
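Assembling the calibration set from those three sources can be as simple as an even interleave. The dataset names come from the paper; the 1:1:1 round-robin proportion here is an illustrative assumption, not the paper's exact sampling scheme.

```python
from itertools import islice

def round_robin_mix(*sources, n_samples):
    """Interleave calibration samples evenly across sources
    (e.g. evol-codealpaca-v1, xlam-function-calling-60k,
    SWE-smith-trajectories), stopping at n_samples or when
    any source runs dry."""
    iters = [iter(s) for s in sources]

    def gen():
        while True:
            for it in iters:
                try:
                    yield next(it)
                except StopIteration:
                    return  # shortest source exhausted

    return list(islice(gen(), n_samples))
```

For example, `round_robin_mix(code, tools, agentic, n_samples=600)` yields a calibration batch that alternates code, tool-calling, and agentic samples, so no expert family is starved during scoring.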
🚀 Deployment
vLLM (Recommended)
```bash
vllm serve Akicou/GLM-4.7-Flash-REAP-09 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --dtype bfloat16
```
Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Akicou/GLM-4.7-Flash-REAP-09"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=512,
    do_sample=True,  # temperature only takes effect with sampling enabled
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
⚖️ License
MIT (inherited from GLM-4.7 Flash)
🧾 Citation
```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```