---
language:
- en
library_name: transformers
tags:
- glm
- glm4.7
- MOE
- pruning
- compression
- reap
- cerebras
- code
- function-calling
- agentic
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
---

𓌳 REAP 𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression

📄 Paper • 💻 Code • 📝 Blog

# GLM-4.7-REAP-39

## ⚠️ Model Status & Deployment Note

**Important update regarding GGUF support:** A critical bug was recently identified in `llama.cpp` in the `scoring_func` handling for GLM models, which caused looping and poor output quality. While the base weights are functional, the GGUF files for this model are being re-generated to ensure full compatibility with the latest fixes.

* **Status:** GGUF files are scheduled for re-upload by **January 24, 2026**.
* **Recommendation:** If you are running local inference via `llama.cpp` or Unsloth, refer to the [official Unsloth GLM-4.7-Flash documentation](https://unsloth.ai/docs/models/glm-4.7-flash) for the most stable configuration parameters.
* **Native support:** The BF16/FP16 weights remain compatible with `transformers` and `vLLM` for immediate use.

## ✨ Highlights

**50% expert-pruned** GLM-4.7 Flash, optimized for **code generation**, **function calling**, and **agentic workflows**. Created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)** by Cerebras:

- **Calibrated for code & tools**: Preserves coding and function-calling capabilities
- **One-shot compression**: No fine-tuning required
- **Drop-in compatible**: Works with vLLM, Transformers, and SGLang

### 🙏 Acknowledgments

- **[Runpod](https://www.runpod.io/)** – Compute for REAP
- **[Cerebras](https://www.cerebras.net/)** – [REAP methodology](https://arxiv.org/abs/2510.13999)

---

### The Science Behind Dataset Selection

```
REAP Algorithm:
1. Forward-pass calibration samples through the model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune the lowest-saliency experts

Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context

If calibration lacks code → code-specialized experts appear "unused"
→ they get pruned → the model loses coding ability
```

### Cerebras' Original Mix (from the paper)

Cerebras used the same three datasets in their GLM-4.6 REAP experiments:

- evol-codealpaca-v1 for code generation
- xlam-function-calling-60k for tool calling
- SWE-smith-trajectories for agentic tasks

We followed this exact recipe for reproducibility.

---

## 🚀 Deployment

### vLLM (Recommended)

```bash
vllm serve Akicou/GLM-4.7-Flash-REAP-39 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --dtype bfloat16
```

### Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Akicou/GLM-4.7-Flash-REAP-39"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## ⚖️ License

MIT (inherited from GLM-4.7 Flash)

---

## 🧾 Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```
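As a concrete illustration of the saliency computation described in the dataset-selection section above, here is a minimal NumPy sketch of REAP-style scoring and one-shot pruning. The expert count, tensor shapes, and variable names are hypothetical, chosen only to make the idea runnable; this is not the official Cerebras implementation.

```python
import numpy as np

# Illustrative sketch of REAP-style expert saliency (hypothetical shapes;
# not the official implementation).
rng = np.random.default_rng(0)
n_tokens, n_experts, top_k = 1000, 8, 2

# Per-token router gate values (softmax over experts) from a calibration pass.
logits = rng.normal(size=(n_tokens, n_experts))
gates = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# In a routed MoE, only the top-k experts per token fire; zero out the rest.
topk_idx = np.argsort(gates, axis=1)[:, -top_k:]
mask = np.zeros_like(gates)
np.put_along_axis(mask, topk_idx, 1.0, axis=1)
routed_gates = gates * mask

# Hypothetical per-token expert-output norms recorded during calibration.
act_norms = rng.uniform(0.5, 1.5, size=(n_tokens, n_experts))

# Saliency per expert = mean over tokens of router_weight x activation_norm.
saliency = (routed_gates * act_norms).mean(axis=0)

# One-shot 50% pruning: keep the half of the experts with highest saliency.
keep = np.argsort(saliency)[n_experts // 2:]
print("kept experts:", sorted(keep.tolist()))
```

An expert that never fires on the calibration mix contributes zero saliency, which is exactly why a calibration set without code would cause code-specialized experts to be pruned.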