---
language:
- en
library_name: transformers
tags:
- glm
- glm4.7
- MOE
- pruning
- compression
- reap
- cerebras
- code
- function-calling
- agentic
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
---
<p align="center">
<em>𓌳 <strong>REAP</strong> 𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
<a href="https://arxiv.org/abs/2510.13999">📄 Paper</a> • <a href="https://github.com/CerebrasResearch/reap">💻 Code</a> • <a href="https://www.cerebras.ai/blog/reap">📝 Blog</a>
</p>
# GLM-4.7-Flash-REAP-39
## ⚠️ Model Status & Deployment Note
**Important Update regarding GGUF support:**
A critical bug was recently identified in `llama.cpp` regarding the `scoring_func` for GLM models (which caused looping and poor output quality). While the base weights are functional, the GGUF files for this model are currently being re-generated to ensure full compatibility with the latest fixes.
* **Status:** GGUF files are scheduled for re-upload by **January 24, 2026**.
* **Recommendation:** If you are using local inference via `llama.cpp` or Unsloth, please refer to the [official Unsloth GLM-4.7-Flash documentation](https://unsloth.ai/docs/models/glm-4.7-flash) for the most stable configuration parameters.
* **Native Support:** The BF16/FP16 weights remain compatible with `transformers` and `vLLM` for immediate use.
## ✨ Highlights
**50% Expert-Pruned** GLM-4.7 Flash optimized for **code generation**, **function calling**, and **agentic workflows**.
Created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)** by Cerebras:
- **Calibrated for Code & Tools**: Preserves coding and function-calling capabilities
- **One-Shot Compression**: No fine-tuning required
- **Drop-in Compatible**: Works with vLLM, Transformers, SGLang
### πŸ™ Acknowledgments
- **[Runpod](https://www.runpod.io/)** β€” Compute for REAP
- **[Cerebras](https://www.cerebras.net/)** β€” [REAP methodology](https://arxiv.org/abs/2510.13999)
---
### The Science Behind Dataset Selection
```
REAP Algorithm:
1. Forward pass calibration samples through model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight Γ— activation_norm
4. Prune lowest-saliency experts
Key Insight: Experts are TASK-SPECIFIC
β”œβ”€β”€ Some experts specialize in natural language
β”œβ”€β”€ Some experts specialize in code syntax
β”œβ”€β”€ Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context
If calibration lacks code β†’ code-specialized experts appear "unused" β†’ get pruned β†’ model loses coding ability
```
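The saliency scoring in steps 2–3 can be sketched in a few lines of NumPy. This is an illustrative toy, not the official REAP implementation: the calibration statistics here are random placeholders, and the real method operates on recorded activations from forward passes.

```python
# Toy sketch of REAP-style expert saliency scoring (illustrative only):
# saliency = mean over calibration tokens of router_weight * ||expert output||,
# then the lowest-saliency experts are pruned.
import numpy as np

rng = np.random.default_rng(0)
num_experts, num_tokens, hidden = 8, 64, 16

# Hypothetical calibration statistics: per-token router gate values and
# per-token expert outputs (in practice these come from forward passes).
router_weights = rng.random((num_tokens, num_experts))
expert_outputs = rng.normal(size=(num_tokens, num_experts, hidden))

activation_norms = np.linalg.norm(expert_outputs, axis=-1)   # ||output|| per token/expert
saliency = (router_weights * activation_norms).mean(axis=0)  # average over tokens

keep = 4  # e.g. prune 50% of experts
kept_experts = np.argsort(saliency)[-keep:]  # highest-saliency experts survive
print(sorted(kept_experts.tolist()))
```

Note how a code-free calibration set would leave code-specialized experts with near-zero gate values, pushing their saliency to the bottom of the ranking.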
### Cerebras' Original Mix (from paper)
Cerebras used three calibration datasets in their GLM-4.6 REAP experiments:
- `evol-codealpaca-v1` for code generation
- `xlam-function-calling-60k` for tool calling
- `SWE-smith-trajectories` for agentic tasks
We followed this exact recipe for reproducibility.
---
## 🚀 Deployment
### vLLM (Recommended)
```bash
vllm serve Akicou/GLM-4.7-Flash-REAP-39 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --dtype bfloat16
```
### Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Akicou/GLM-4.7-Flash-REAP-39",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Akicou/GLM-4.7-Flash-REAP-39", trust_remote_code=True)
messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
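Since the model is calibrated for function calling, requests typically attach an OpenAI-style `tools` list alongside the messages. A minimal sketch of building such a request payload (the `get_weather` tool here is hypothetical, and the exact schema the GLM chat template expects may differ):

```python
# Hypothetical tool definition in the OpenAI-style function schema; the
# GLM-4.7 chat template's exact expectations may differ.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# With transformers, this would typically be rendered via:
#   tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True)
payload = {"messages": messages, "tools": tools}
print(json.dumps(payload, indent=2))
```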
---
## ⚖️ License
MIT (inherited from GLM-4.7 Flash)
---
## 🧾 Citation
```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```