---
language:
- en
library_name: transformers
tags:
- glm
- glm4.7
- MOE
- pruning
- compression
- reap
- cerebras
- code
- function-calling
- agentic
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
---

<p align="center">
<em>🌳 <strong>REAP</strong>🌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
<a href="https://arxiv.org/abs/2510.13999">📄 Paper</a> • <a href="https://github.com/CerebrasResearch/reap">💻 Code</a> • <a href="https://www.cerebras.ai/blog/reap">📝 Blog</a>
</p>

# GLM-4.7-Flash-REAP-39

## ⚠️ Model Status & Deployment Note

**Important update regarding GGUF support:**

A critical bug was recently identified in `llama.cpp`'s handling of the `scoring_func` for GLM models, which caused looping and poor output quality. The base weights are unaffected, but the GGUF files for this model are being re-generated to pick up the latest fixes.

* **Status:** GGUF files are scheduled for re-upload by **January 24, 2026**.
* **Recommendation:** If you run local inference via `llama.cpp` or Unsloth, refer to the [official Unsloth GLM-4.7-Flash documentation](https://unsloth.ai/docs/models/glm-4.7-flash) for the most stable configuration parameters.
* **Native support:** The BF16/FP16 weights remain compatible with `transformers` and `vLLM` for immediate use.

## ✨ Highlights

**50% expert-pruned** GLM-4.7-Flash optimized for **code generation**, **function calling**, and **agentic workflows**.

Created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)** by Cerebras:

- **Calibrated for code & tools**: preserves coding and function-calling capabilities
- **One-shot compression**: no fine-tuning required
- **Drop-in compatible**: works with vLLM, Transformers, and SGLang

### 🙏 Acknowledgments

- **[Runpod](https://www.runpod.io/)** – compute for the REAP run
- **[Cerebras](https://www.cerebras.net/)** – [REAP methodology](https://arxiv.org/abs/2510.13999)

---

### The Science Behind Dataset Selection

```
REAP Algorithm:
1. Forward-pass calibration samples through the model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune the lowest-saliency experts

Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context

If calibration lacks code → code-specialized experts appear "unused" → get pruned → model loses coding ability
```
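
The saliency scoring above can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the idea, not the Cerebras code; the array shapes, the per-token averaging, and both function names are assumptions.

```python
import numpy as np

def reap_saliency(router_weights, expert_outputs):
    """Illustrative REAP-style saliency: for each expert, average
    (router gate weight x expert output norm) over the tokens that
    were actually routed to it.

    router_weights: (num_tokens, num_experts) gate values, 0 where not routed
    expert_outputs: (num_tokens, num_experts, hidden) expert activations
    """
    norms = np.linalg.norm(expert_outputs, axis=-1)    # (tokens, experts)
    contrib = router_weights * norms                   # weight x activation norm
    routed = (router_weights > 0).sum(axis=0).clip(min=1)
    return contrib.sum(axis=0) / routed                # mean over routed tokens

def prune_lowest(saliency, keep_ratio=0.5):
    """Return indices of the experts kept after dropping the
    lowest-saliency fraction (e.g. keep_ratio=0.5 for 50% pruning)."""
    k = max(1, int(len(saliency) * keep_ratio))
    return np.argsort(saliency)[::-1][:k]              # top-k by saliency
```

With `keep_ratio=0.5` this mirrors the 50% expert pruning used for this model: experts whose router-weighted activations never mattered on the calibration set are the ones removed.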

### Cerebras' Original Mix (from the paper)

Cerebras used the same three datasets in their GLM-4.6 REAP experiments:

- evol-codealpaca-v1 for code generation
- xlam-function-calling-60k for tool calling
- SWE-smith-trajectories for agentic tasks

We followed this exact recipe for reproducibility.
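
As a sketch of how such a calibration set can be assembled from the three sources, here is a small mixing helper. The equal one-third split, the sample count, and the function name are assumptions for illustration, not the paper's exact parameters.

```python
import random

def build_calibration_mix(code, tools, agentic, n_samples=512, seed=0):
    """Interleave equal shares of the three calibration sources
    (code generation, function calling, agentic trajectories) into
    one shuffled calibration set.

    Assumes an equal 1/3 split per source; the actual ratios used
    for this model may differ."""
    rng = random.Random(seed)
    per_source = n_samples // 3
    mix = (rng.sample(code, min(per_source, len(code)))
           + rng.sample(tools, min(per_source, len(tools)))
           + rng.sample(agentic, min(per_source, len(agentic))))
    rng.shuffle(mix)  # avoid ordering bias during calibration passes
    return mix
```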

---

## 🚀 Deployment

### vLLM (Recommended)

```bash
vllm serve Akicou/GLM-4.7-Flash-REAP-39 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --dtype bfloat16
```
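
Since the model is calibrated for function calling, you will typically send OpenAI-style `tools` payloads to the vLLM server started above. The helper below builds such a request body; the `get_weather` tool and the helper itself are hypothetical examples, not part of this repo.

```python
import json

def make_tool_call_request(user_msg):
    """Build an OpenAI-style chat-completions body with one example
    tool, suitable for POSTing to vLLM's /v1/chat/completions."""
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # illustrative tool, not a real API
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]
    return {
        "model": "Akicou/GLM-4.7-Flash-REAP-39",
        "messages": [{"role": "user", "content": user_msg}],
        "tools": tools,
        "tool_choice": "auto",
    }

print(json.dumps(make_tool_call_request("Weather in Paris?"), indent=2))
```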

### Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Akicou/GLM-4.7-Flash-REAP-39",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Akicou/GLM-4.7-Flash-REAP-39", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## ⚖️ License

MIT (inherited from GLM-4.7-Flash)

---

## 🧾 Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```