---
language:
- en
library_name: transformers
tags:
- glm
- glm4.7
- MOE
- pruning
- compression
- reap
- cerebras
- code
- function-calling
- agentic
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
---

<p align="center">
<em>🌳 <strong>REAP</strong>🌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
<a href="https://arxiv.org/abs/2510.13999">📄 Paper</a> • <a href="https://github.com/CerebrasResearch/reap">💻 Code</a> • <a href="https://www.cerebras.ai/blog/reap">📝 Blog</a>
</p>

# GLM-4.7-Flash-REAP-39

## ⚠️ Model Status & Deployment Note

**Important update regarding GGUF support:**

A critical bug was recently identified in `llama.cpp`'s handling of the `scoring_func` for GLM models, which caused looping and poor output quality. The base weights are unaffected, but the GGUF files for this model are being re-generated to pick up the latest fixes.

* **Status:** GGUF files are scheduled for re-upload by **January 24, 2026**.
* **Recommendation:** If you run local inference via `llama.cpp` or Unsloth, refer to the [official Unsloth GLM-4.7-Flash documentation](https://unsloth.ai/docs/models/glm-4.7-flash) for the most stable configuration parameters.
* **Native support:** The BF16/FP16 weights remain compatible with `transformers` and `vLLM` for immediate use.

## ✨ Highlights

**50% expert-pruned** GLM-4.7-Flash optimized for **code generation**, **function calling**, and **agentic workflows**.

Created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)** by Cerebras:

- **Calibrated for code & tools**: preserves coding and function-calling capabilities
- **One-shot compression**: no fine-tuning required
- **Drop-in compatible**: works with vLLM, Transformers, and SGLang

### 🙏 Acknowledgments

- **[Runpod](https://www.runpod.io/)** – compute for the REAP run
- **[Cerebras](https://www.cerebras.net/)** – [REAP methodology](https://arxiv.org/abs/2510.13999)

---

### The Science Behind Dataset Selection

```
REAP Algorithm:
1. Forward-pass calibration samples through the model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune the lowest-saliency experts

Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context

If calibration lacks code → code-specialized experts appear "unused" → get pruned → model loses coding ability
```
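
The saliency scoring above can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the idea, not the Cerebras code; the array shapes, the per-token averaging, and both function names are assumptions.

```python
import numpy as np

def reap_saliency(router_weights, expert_outputs):
    """Illustrative REAP-style saliency: for each expert, average
    (router gate weight x expert output norm) over the tokens that
    were actually routed to it.

    router_weights: (num_tokens, num_experts) gate values, 0 where not routed
    expert_outputs: (num_tokens, num_experts, hidden) expert activations
    """
    norms = np.linalg.norm(expert_outputs, axis=-1)    # (tokens, experts)
    contrib = router_weights * norms                   # weight x activation norm
    routed = (router_weights > 0).sum(axis=0).clip(min=1)
    return contrib.sum(axis=0) / routed                # mean over routed tokens

def prune_lowest(saliency, keep_ratio=0.5):
    """Return indices of the experts kept after dropping the
    lowest-saliency fraction (e.g. keep_ratio=0.5 for 50% pruning)."""
    k = max(1, int(len(saliency) * keep_ratio))
    return np.argsort(saliency)[::-1][:k]              # top-k by saliency
```

With `keep_ratio=0.5` this mirrors the 50% expert pruning used for this model: experts whose router-weighted activations never mattered on the calibration set are the ones removed.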

### Cerebras' Original Mix (from the paper)

Cerebras used the same three datasets in their GLM-4.6 REAP experiments:

- evol-codealpaca-v1 for code generation
- xlam-function-calling-60k for tool calling
- SWE-smith-trajectories for agentic tasks

We followed this exact recipe for reproducibility.
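
As a sketch of how such a calibration set can be assembled from the three sources, here is a small mixing helper. The equal one-third split, the sample count, and the function name are assumptions for illustration, not the paper's exact parameters.

```python
import random

def build_calibration_mix(code, tools, agentic, n_samples=512, seed=0):
    """Interleave equal shares of the three calibration sources
    (code generation, function calling, agentic trajectories) into
    one shuffled calibration set.

    Assumes an equal 1/3 split per source; the actual ratios used
    for this model may differ."""
    rng = random.Random(seed)
    per_source = n_samples // 3
    mix = (rng.sample(code, min(per_source, len(code)))
           + rng.sample(tools, min(per_source, len(tools)))
           + rng.sample(agentic, min(per_source, len(agentic))))
    rng.shuffle(mix)  # avoid ordering bias during calibration passes
    return mix
```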

---

## 🚀 Deployment

### vLLM (Recommended)

```bash
vllm serve Akicou/GLM-4.7-Flash-REAP-39 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --dtype bfloat16
```
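
Since the model is calibrated for function calling, you will typically send OpenAI-style `tools` payloads to the vLLM server started above. The helper below builds such a request body; the `get_weather` tool and the helper itself are hypothetical examples, not part of this repo.

```python
import json

def make_tool_call_request(user_msg):
    """Build an OpenAI-style chat-completions body with one example
    tool, suitable for POSTing to vLLM's /v1/chat/completions."""
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # illustrative tool, not a real API
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]
    return {
        "model": "Akicou/GLM-4.7-Flash-REAP-39",
        "messages": [{"role": "user", "content": user_msg}],
        "tools": tools,
        "tool_choice": "auto",
    }

print(json.dumps(make_tool_call_request("Weather in Paris?"), indent=2))
```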

### Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Akicou/GLM-4.7-Flash-REAP-39",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Akicou/GLM-4.7-Flash-REAP-39", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## ⚖️ License

MIT (inherited from GLM-4.7-Flash)

---

## 🧾 Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```