🌳 REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
📄 Paper • 💻 Code • 📝 Blog
GLM-4.7-Flash-REAP-39
⚠️ Model Status & Deployment Note
Important Update regarding GGUF support:
A critical bug was recently identified in llama.cpp's handling of the scoring_func for GLM models, which caused output looping and poor generation quality. The base weights are unaffected, but the GGUF files for this model are currently being re-generated to ensure full compatibility with the latest fixes.
- Status: GGUF files are scheduled for re-upload by January 24, 2026.
- Recommendation: If you are using local inference via llama.cpp or Unsloth, please refer to the official Unsloth GLM-4.7-Flash documentation for the most stable configuration parameters.
- Native Support: The BF16/FP16 weights remain compatible with transformers and vLLM for immediate use.
✨ Highlights
50% Expert-Pruned GLM-4.7 Flash optimized for code generation, function calling, and agentic workflows.
Created using REAP (Router-weighted Expert Activation Pruning) by Cerebras:
- Calibrated for Code & Tools: Preserves coding and function-calling capabilities
- One-Shot Compression: No fine-tuning required
- Drop-in Compatible: Works with vLLM, Transformers, SGLang
🙏 Acknowledgments
- Runpod: compute for REAP
- Cerebras: REAP methodology
---
The Science Behind Dataset Selection
REAP Algorithm:
1. Forward pass calibration samples through model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune lowest-saliency experts
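The four steps above can be sketched in a few lines; the tensors below are hypothetical stand-ins for one MoE layer's recorded routing statistics over the calibration set, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts, d_model = 512, 8, 16

# Steps 1-2: hypothetical calibration statistics for one MoE layer —
# router gate values and expert output activations per token.
router_weights = rng.random((n_tokens, n_experts))
expert_outputs = rng.standard_normal((n_tokens, n_experts, d_model))

# Step 3: saliency = mean over tokens of router_weight * ||activation||
activation_norms = np.linalg.norm(expert_outputs, axis=-1)   # (n_tokens, n_experts)
saliency = (router_weights * activation_norms).mean(axis=0)  # (n_experts,)

# Step 4: prune the lowest-saliency half of the experts (50%, as in this model)
n_keep = n_experts // 2
kept = np.sort(np.argsort(saliency)[n_keep:])
print("kept experts:", kept)
```

Experts that rarely fire on the calibration mix, or fire with low router weight, end up at the bottom of the saliency ranking, which is exactly why the calibration data must cover every task family the model should retain.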
Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context
If calibration lacks code → code-specialized experts appear "unused" → they get pruned → the model loses coding ability
Cerebras' Original Mix (from paper)
Cerebras used these same three datasets in their GLM-4.6 REAP experiments:
- evol-codealpaca-v1 for code generation
- xlam-function-calling-60k for tool calling
- SWE-smith-trajectories for agentic tasks
We followed this exact recipe for reproducibility.
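A balanced calibration mix along these lines can be sketched as below. The round-robin interleaving is illustrative (not the exact REAP recipe), and the toy strings stand in for samples drawn from the three datasets named above:

```python
def build_calibration_mix(code, tools, agentic, n_samples):
    """Interleave the three task sources round-robin so that code,
    function-calling, and agentic experts all see calibration traffic."""
    interleaved = [sample for triple in zip(code, tools, agentic) for sample in triple]
    return interleaved[:n_samples]

# Toy stand-ins; in practice these lists would be filled from
# evol-codealpaca-v1, xlam-function-calling-60k, and SWE-smith-trajectories.
mix = build_calibration_mix(
    code=["Write a quicksort in Python.", "Fix the off-by-one bug in this loop."],
    tools=['{"tool": "get_weather", "args": {"city": "Paris"}}',
           '{"tool": "search", "args": {"query": "vLLM docs"}}'],
    agentic=["Open the repo, run the tests, and report failures.",
             "Edit utils.py, then re-run the failing test."],
    n_samples=6,
)
print(mix)
```

Interleaving rather than concatenating keeps any truncated calibration run from silently dropping an entire task family.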
🚀 Deployment
vLLM (Recommended)
```bash
vllm serve Akicou/GLM-4.7-Flash-REAP-39 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --dtype bfloat16
```
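Once up, the server exposes an OpenAI-compatible chat endpoint. A minimal request sketch using only the standard library (the localhost:8000 address is vLLM's default and assumed unchanged here):

```python
import json
import urllib.request

# Chat-completions payload for vLLM's OpenAI-compatible endpoint.
payload = {
    "model": "Akicou/GLM-4.7-Flash-REAP-39",
    "messages": [
        {"role": "user", "content": "Write a Python function to merge two sorted lists."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server from the command above running:
# body = json.load(urllib.request.urlopen(req))
# print(body["choices"][0]["message"]["content"])
```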
Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Akicou/GLM-4.7-Flash-REAP-39",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Akicou/GLM-4.7-Flash-REAP-39", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
# do_sample=True is required for temperature to take effect
outputs = model.generate(inputs.to(model.device), max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
⚖️ License
MIT (inherited from GLM-4.7 Flash)
🧾 Citation
@article{lasby2025reap,
title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
journal={arXiv preprint arXiv:2510.13999},
year={2025},
url={https://arxiv.org/abs/2510.13999}
}