---
language:
- en
library_name: transformers
tags:
- glm
- glm4.7
- MOE
- pruning
- compression
- reap
- cerebras
- code
- function-calling
- agentic
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
---

𓌳 REAP 𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression

📄 Paper • 💻 Code • 📝 Blog

# GLM-4.7-REAP-39

## ⚠️ Model Status & Deployment Note

**Important update regarding GGUF support:** A critical bug was recently identified in `llama.cpp` in the `scoring_func` handling for GLM models, which caused looping and poor output quality. While the base weights are functional, the GGUF files for this model are being re-generated to ensure full compatibility with the latest fixes.

* **Status:** GGUF files are scheduled for re-upload by **January 24, 2026**.
* **Recommendation:** If you are running local inference via `llama.cpp` or Unsloth, refer to the [official Unsloth GLM-4.7-Flash documentation](https://unsloth.ai/docs/models/glm-4.7-flash) for the most stable configuration parameters.
* **Native support:** The BF16/FP16 weights remain compatible with `transformers` and `vLLM` for immediate use.

## ✨ Highlights

**50% expert-pruned** GLM-4.7 Flash, optimized for **code generation**, **function calling**, and **agentic workflows**. Created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)** by Cerebras:

- **Calibrated for code & tools**: Preserves coding and function-calling capabilities
- **One-shot compression**: No fine-tuning required
- **Drop-in compatible**: Works with vLLM, Transformers, and SGLang

### 🙏 Acknowledgments

- **[Runpod](https://www.runpod.io/)** – Compute for REAP
- **[Cerebras](https://www.cerebras.net/)** – [REAP methodology](https://arxiv.org/abs/2510.13999)

---

### The Science Behind Dataset Selection

```
REAP Algorithm:
1. Forward-pass calibration samples through the model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune the lowest-saliency experts

Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context

If calibration lacks code → code-specialized experts appear "unused"
→ they get pruned → the model loses coding ability
```

### Cerebras' Original Mix (from the paper)

Cerebras used the same three datasets in their GLM-4.6 REAP experiments:

- evol-codealpaca-v1 for code generation
- xlam-function-calling-60k for tool calling
- SWE-smith-trajectories for agentic tasks

We followed this exact recipe for reproducibility.

---

## 🚀 Deployment

### vLLM (Recommended)

```bash
vllm serve Akicou/GLM-4.7-Flash-REAP-39 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --dtype bfloat16
```

### Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Akicou/GLM-4.7-Flash-REAP-39"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## ⚖️ License

MIT (inherited from GLM-4.7 Flash)

---

## 🧾 Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```
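As a concrete illustration of the saliency computation described in the dataset-selection section above, here is a minimal NumPy sketch of REAP-style scoring and one-shot pruning. The expert count, tensor shapes, and variable names are hypothetical, chosen only to make the idea runnable; this is not the official Cerebras implementation.

```python
import numpy as np

# Illustrative sketch of REAP-style expert saliency (hypothetical shapes;
# not the official implementation).
rng = np.random.default_rng(0)
n_tokens, n_experts, top_k = 1000, 8, 2

# Per-token router gate values (softmax over experts) from a calibration pass.
logits = rng.normal(size=(n_tokens, n_experts))
gates = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# In a routed MoE, only the top-k experts per token fire; zero out the rest.
topk_idx = np.argsort(gates, axis=1)[:, -top_k:]
mask = np.zeros_like(gates)
np.put_along_axis(mask, topk_idx, 1.0, axis=1)
routed_gates = gates * mask

# Hypothetical per-token expert-output norms recorded during calibration.
act_norms = rng.uniform(0.5, 1.5, size=(n_tokens, n_experts))

# Saliency per expert = mean over tokens of router_weight x activation_norm.
saliency = (routed_gates * act_norms).mean(axis=0)

# One-shot 50% pruning: keep the half of the experts with highest saliency.
keep = np.argsort(saliency)[n_experts // 2:]
print("kept experts:", sorted(keep.tolist()))
```

An expert that never fires on the calibration mix contributes zero saliency, which is exactly why a calibration set without code would cause code-specialized experts to be pruned.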