---
language:
- en
library_name: transformers
tags:
- glm
- glm4.7
- MOE
- pruning
- compression
- reap
- cerebras
- code
- function-calling
- agentic
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
---
# 🌾 REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression

📄 Paper • 💻 Code • 📝 Blog

# GLM-4.7-Flash-REAP-39
## ⚠️ Model Status & Deployment Note

**Important update regarding GGUF support:**

A critical bug was recently identified in llama.cpp's handling of the `scoring_func` for GLM models, which caused looping and poor output quality. While the base weights are functional, the GGUF files for this model are currently being re-generated to ensure full compatibility with the latest fixes.

- **Status:** GGUF files are scheduled for re-upload by January 24, 2026.
- **Recommendation:** If you run local inference via llama.cpp or Unsloth, please refer to the official Unsloth GLM-4.7-Flash documentation for the most stable configuration parameters.
- **Native support:** The BF16/FP16 weights remain compatible with `transformers` and vLLM for immediate use.
## ✨ Highlights

**50% expert-pruned GLM-4.7 Flash**, optimized for code generation, function calling, and agentic workflows.

Created using **REAP** (Router-weighted Expert Activation Pruning) by Cerebras:

- **Calibrated for code & tools:** Preserves coding and function-calling capabilities
- **One-shot compression:** No fine-tuning required
- **Drop-in compatible:** Works with vLLM, Transformers, SGLang
## 🙏 Acknowledgments

- Runpod – compute for REAP
- Cerebras – REAP methodology
---
## The Science Behind Dataset Selection

**REAP algorithm:**

1. Forward-pass calibration samples through the model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune the lowest-saliency experts
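The saliency step above can be sketched with a toy example. This is an illustrative simplification, not the official REAP implementation; the `reap_saliency` helper and the tensor shapes are our own assumptions:

```python
import torch

def reap_saliency(router_weights, expert_outputs):
    """Toy REAP-style saliency (illustrative, not the official code):
    saliency[e] = mean over calibration tokens of gate weight * ||expert output||.
    router_weights: (tokens, experts) softmax gate values
    expert_outputs: (tokens, experts, hidden) per-expert outputs
    """
    norms = expert_outputs.norm(dim=-1)          # (tokens, experts)
    return (router_weights * norms).mean(dim=0)  # (experts,)

torch.manual_seed(0)
T, E, H = 128, 8, 16  # tokens, experts, hidden size (toy values)
gates = torch.softmax(torch.randn(T, E), dim=-1)
outs = torch.randn(T, E, H)

saliency = reap_saliency(gates, outs)
# Keep the 50% of experts with the highest saliency; prune the rest
keep = torch.topk(saliency, k=E // 2).indices
print(sorted(keep.tolist()))
```

Because the score combines the router's gate weight with the expert's actual output magnitude, an expert that is rarely routed to (or that produces weak activations) on the calibration set ends up with low saliency and is pruned.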
**Key insight:** Experts are *task-specific*:

```
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context
```

If calibration lacks code → code-specialized experts appear "unused" → they get pruned → the model loses coding ability.
### Cerebras' Original Mix (from the paper)

Cerebras used the same three datasets in their GLM-4.6 REAP experiments:

- **evol-codealpaca-v1** for code generation
- **xlam-function-calling-60k** for tool calling
- **SWE-smith-trajectories** for agentic tasks

We followed this exact recipe for reproducibility.
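As a rough illustration of assembling such a calibration set, the sketch below draws an equal share from each domain. The pool contents, proportions, and `build_calibration_mix` helper are our own assumptions, not details from the paper:

```python
import random

# Hypothetical stand-ins for samples drawn from evol-codealpaca-v1,
# xlam-function-calling-60k, and SWE-smith-trajectories.
pools = {
    "code": [f"code_{i}" for i in range(1000)],
    "function_calling": [f"tool_{i}" for i in range(1000)],
    "agentic": [f"traj_{i}" for i in range(1000)],
}

def build_calibration_mix(pools, total=512, seed=0):
    """Draw an equal number of calibration samples from each domain pool."""
    rng = random.Random(seed)
    per_domain = total // len(pools)
    mix = []
    for pool in pools.values():
        mix.extend(rng.sample(pool, per_domain))
    rng.shuffle(mix)  # interleave domains so no expert sees a single block
    return mix

mix = build_calibration_mix(pools)
print(len(mix))  # 510 (512 // 3 == 170 samples per domain)
```

The point of balancing domains is exactly the insight above: if one domain dominated the mix, experts specialized in the underrepresented domains would score low saliency and be pruned.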
## 🚀 Deployment

### vLLM (Recommended)

```bash
vllm serve Akicou/GLM-4.7-Flash-REAP-39 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --dtype bfloat16
```
### Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Akicou/GLM-4.7-Flash-REAP-39",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Akicou/GLM-4.7-Flash-REAP-39", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=512,
    do_sample=True,  # required for temperature to take effect
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## ⚖️ License

MIT (inherited from GLM-4.7 Flash)
## 🧾 Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```