0xSero committed on
Commit 5deeeba · verified · 1 Parent(s): 49d65fb

Deep technical update: calibration rationale, embedded scripts, dataset science

Files changed (1)
  1. README.md +142 -41
README.md CHANGED
@@ -12,6 +12,7 @@ tags:
  - cerebras
  - code
  - function-calling
  license: apache-2.0
  pipeline_tag: text-generation
  base_model:
@@ -27,14 +28,14 @@ base_model:
 
  ## ✨ Highlights
 
- Introducing **GLM-4.7-REAP-50**, a **memory-efficient compressed variant** of GLM-4.7 optimized for **code generation and function calling**.
 
- This model was created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)**, developed by Cerebras. Key features:
 
- - **50% Expert Pruning**: Compressed from 358B to 179B parameters
- - **Optimized for Code & Tools**: Calibrated specifically on code generation and function calling datasets
- - **One-Shot Compression**: No fine-tuning required - ready for immediate deployment
- - **Drop-in Compatibility**: Works with vLLM, Transformers, and other standard frameworks
 
  ### 🙏 Acknowledgments
 
@@ -43,40 +44,88 @@ This model was created using **[REAP (Router-weighted Expert Activation Pruning)
 
  ---
 
- ## 📋 Model Overview
 
  | Property | Value |
  |----------|-------|
  | **Base Model** | [zai/glm-4.7](https://huggingface.co/zai/glm-4.7) |
- | **Compression Method** | REAP (Router-weighted Expert Activation Pruning) |
- | **Compression Ratio** | 50% expert pruning |
- | **Type** | Sparse Mixture-of-Experts (SMoE) |
- | **Total Parameters** | 179B (was 358B) |
  | **Experts per Layer** | 80 (was 160) |
  | **MoE Layers** | 92 |
  | **Precision** | BF16 |
  | **Disk Size** | ~345GB |
  | **VRAM Required** | ~345GB |
  ---
 
  ## 📦 Related Models
 
- | Model | Params | Experts | Size | Format | Link |
- |-------|--------|---------|------|--------|------|
- | GLM-4.7-REAP-30 | 251B | 112 | ~470GB | BF16 | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-30) |
- | GLM-4.7-REAP-35 | 233B | 104 | ~439GB | BF16 | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-35) |
- | GLM-4.7-REAP-40 | 218B | 96 | ~407GB | BF16 | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-40) |
- | GLM-4.7-REAP-45 | 197B | 88 | ~370GB | BF16 | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-45) |
- | GLM-4.7-REAP-50 | 179B | 80 | ~345GB | BF16 | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-50) |
- | GLM-4.7-REAP-40-W4A16 | 218B | 96 | ~108GB | GPTQ | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-40-W4A16) |
- | GLM-4.7-REAP-50-W4A16 | 179B | 80 | ~92GB | GPTQ | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-50-W4A16) |
 
  ---
 
  ## 🚀 Deployment
 
- ### With vLLM (Recommended)
 
  ```bash
  vllm serve 0xSero/GLM-4.7-REAP-50 \
@@ -85,7 +134,7 @@ vllm serve 0xSero/GLM-4.7-REAP-50 \
  --dtype bfloat16
  ```
 
- ### With Transformers
 
  ```python
  import torch
@@ -99,48 +148,100 @@ model = AutoModelForCausalLM.from_pretrained(
  )
  tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-REAP-50", trust_remote_code=True)
 
- messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
  inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
- outputs = model.generate(inputs.to(model.device), max_new_tokens=512)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
 
  ---
 
- ## 🧩 Model Creation
 
- This model was created by applying **REAP** uniformly across all MoE blocks with a **50% pruning rate**.
 
- ### How REAP Works
 
- REAP selects experts to prune based on a **saliency criterion** that considers:
- - **Router gate values**: How frequently and strongly the router activates each expert
- - **Expert activation norms**: The magnitude of each expert's output contributions
 
- ### Calibration for Code & Function Calling
 
- This model was specifically calibrated on datasets optimized for **code generation** and **function/tool calling**:
 
- | Dataset | Samples | Purpose |
- |---------|---------|---------|
- | [evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 700 | Code generation |
- | [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 330 | Function/tool calling |
- | [SWE-smith-trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories) | 330 | Agentic multi-turn |
 
- Combined calibration dataset: [0xSero/glm47-reap-calibration-v2](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2)
 
  ---
 
  ## ⚖️ License
 
- Apache 2.0 (inherited from base GLM-4 model)
 
  ---
 
  ## 🧾 Citation
 
- If you use this model, please cite the REAP paper:
-
  ```bibtex
  @article{lasby2025reap,
    title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
 
  - cerebras
  - code
  - function-calling
+ - agentic
  license: apache-2.0
  pipeline_tag: text-generation
  base_model:
 
 
  ## ✨ Highlights
 
+ **50% Expert-Pruned** GLM-4.7 optimized for **code generation**, **function calling**, and **agentic workflows**.
 
+ Created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)** by Cerebras:
 
+ - **358B → 179B**: 50% of MoE experts pruned (80/160 remaining)
+ - **Calibrated for Code & Tools**: Preserves coding and function-calling capabilities
+ - **One-Shot Compression**: No fine-tuning required
+ - **Drop-in Compatible**: Works with vLLM, Transformers, SGLang
 
  ### 🙏 Acknowledgments
 
 
 
  ---
 
+ ## 📋 Model Specifications
 
  | Property | Value |
  |----------|-------|
  | **Base Model** | [zai/glm-4.7](https://huggingface.co/zai/glm-4.7) |
+ | **Architecture** | Sparse Mixture-of-Experts (SMoE) |
+ | **Original Parameters** | 358B |
+ | **Pruned Parameters** | 179B |
+ | **Compression** | 50% of experts removed |
  | **Experts per Layer** | 80 (was 160) |
  | **MoE Layers** | 92 |
+ | **Activated Experts** | 8 per token |
  | **Precision** | BF16 |
  | **Disk Size** | ~345GB |
  | **VRAM Required** | ~345GB |
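 
+ To verify the pruned expert count after download, the config can be printed; a quick check only, since the exact field names depend on the GLM MoE architecture:
+
+ ```python
+ from transformers import AutoConfig
+
+ # Prints the full model config; look for the routed-expert count (80 per layer after pruning)
+ cfg = AutoConfig.from_pretrained("0xSero/GLM-4.7-REAP-50", trust_remote_code=True)
+ print(cfg)
+ ```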
 
+ ---
+
+ ## 🔬 Calibration Dataset: Deep Dive
+
+ REAP's effectiveness depends critically on **calibration data that represents the target use case**. We specifically optimized for **code generation**, **function/tool calling**, and **agentic workflows**.
+
+ ### Why These 3 Datasets?
+
+ | Dataset | Samples | Purpose | Why It Matters |
+ |---------|---------|---------|----------------|
+ | [evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 700 | Code generation | **51% of mix**: code tasks activate specific expert pathways; pruning without code calibration destroys coding ability |
+ | [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 330 | Function/tool calling | **24% of mix**: tool use requires structured JSON output, so experts handling schema generation must be preserved |
+ | [SWE-smith-trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories) | 330 | Agentic multi-turn | **24% of mix**: real SWE-bench trajectories with tool calls, file edits, and multi-step reasoning |
+
+ ### The Science Behind Dataset Selection
+
+ ```
+ REAP Algorithm:
+ 1. Forward pass calibration samples through model
+ 2. Record which experts activate and their magnitudes
+ 3. Compute saliency = router_weight × activation_norm
+ 4. Prune lowest-saliency experts
+
+ Key Insight: Experts are TASK-SPECIFIC
+ ├── Some experts specialize in natural language
+ ├── Some experts specialize in code syntax
+ ├── Some experts specialize in JSON/structured output
+ └── Some experts specialize in multi-turn context
+
+ If calibration lacks code → code-specialized experts appear "unused" → get pruned → model loses coding ability
+ ```
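+
+ A minimal sketch of that saliency computation, for intuition only (not the Cerebras implementation; tensor names and shapes are assumptions):
+
+ ```python
+ import torch
+
+ def expert_saliency(router_weights: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
+     """router_weights: [tokens, experts] gate values (zero for experts not routed).
+     expert_outputs: [tokens, experts, hidden] per-expert outputs per token.
+     Returns [experts] scores; the lowest-scoring experts get pruned."""
+     norms = expert_outputs.norm(dim=-1)           # activation magnitude per token/expert
+     return (router_weights * norms).mean(dim=0)   # router weight * activation norm, averaged
+
+ def experts_to_keep(saliency: torch.Tensor, compression_ratio: float) -> torch.Tensor:
+     # Keep the top (1 - ratio) fraction of experts, e.g. 80 of 160 at ratio 0.5
+     n_keep = int(saliency.numel() * (1.0 - compression_ratio))
+     return saliency.topk(n_keep).indices.sort().values
+ ```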
+
+ ### Cerebras' Original Mix (from paper)
+
+ Cerebras used the same 3 datasets in their GLM-4.6 REAP experiments:
+ - evol-codealpaca-v1 for code generation
+ - xlam-function-calling-60k for tool calling
+ - SWE-smith-trajectories for agentic tasks
+
+ We followed this exact recipe for reproducibility.
+
+ ### Combined Dataset
+
+ Our calibration mix: [0xSero/glm47-reap-calibration-v2](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2)
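+
+ To inspect the mix locally (a quick sketch; the `train` split name and the schema are assumptions, see the dataset card):
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the combined calibration mix and peek at one sample
+ ds = load_dataset("0xSero/glm47-reap-calibration-v2", split="train")
+ print(ds)
+ print(ds[0])
+ ```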
 
  ---
 
  ## 📦 Related Models
 
+ | Model | Params | Experts | Size | Format |
+ |-------|--------|---------|------|--------|
+ | [GLM-4.7-REAP-30](https://huggingface.co/0xSero/GLM-4.7-REAP-30) | 251B | 112 | ~470GB | BF16 |
+ | [GLM-4.7-REAP-35](https://huggingface.co/0xSero/GLM-4.7-REAP-35) | 233B | 104 | ~439GB | BF16 |
+ | [GLM-4.7-REAP-40](https://huggingface.co/0xSero/GLM-4.7-REAP-40) | 218B | 96 | ~407GB | BF16 |
+ | [GLM-4.7-REAP-45](https://huggingface.co/0xSero/GLM-4.7-REAP-45) | 197B | 88 | ~370GB | BF16 |
+ | [GLM-4.7-REAP-50](https://huggingface.co/0xSero/GLM-4.7-REAP-50) | 179B | 80 | ~345GB | BF16 |
+ | [GLM-4.7-REAP-40-W4A16](https://huggingface.co/0xSero/GLM-4.7-REAP-40-W4A16) | 218B | 96 | ~108GB | GPTQ |
+ | [GLM-4.7-REAP-50-W4A16](https://huggingface.co/0xSero/GLM-4.7-REAP-50-W4A16) | 179B | 80 | ~92GB | GPTQ |
 
  ---
 
  ## 🚀 Deployment
 
+ ### vLLM (Recommended)
 
  ```bash
  vllm serve 0xSero/GLM-4.7-REAP-50 \
 
  --dtype bfloat16
  ```
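 
+ Once the server is up, it exposes an OpenAI-compatible API (on port 8000 by default). A minimal client query, assuming the `openai` Python package and default settings:
+
+ ```python
+ from openai import OpenAI
+
+ # Point the standard OpenAI client at the local vLLM server
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ resp = client.chat.completions.create(
+     model="0xSero/GLM-4.7-REAP-50",
+     messages=[{"role": "user", "content": "Write a Python function to check if a number is prime."}],
+     max_tokens=512,
+ )
+ print(resp.choices[0].message.content)
+ ```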
 
+ ### Transformers
 
  ```python
  import torch
 
  )
  tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-REAP-50", trust_remote_code=True)
 
+ messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
  inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
+ outputs = model.generate(inputs.to(model.device), max_new_tokens=512, do_sample=True, temperature=0.7)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
 
157
  ---
158
 
159
+ ## 🧩 Reproduction
160
+
161
+ ### REAP Pruning Script
162
 
 
163
 
164
+ ```python
165
+ #!/usr/bin/env python3
166
+ """
167
+ REAP Pruning Script for MoE Models
168
+ Adapted from: https://github.com/CerebrasResearch/reap
169
+ """
170
+
171
+ import subprocess
172
+ import sys
173
+
174
+ def run_reap(
175
+ model_path: str,
176
+ compression_ratio: float,
177
+ dataset: str = "0xSero/glm47-reap-calibration-v2",
178
+ samples: int = 1360,
179
+ seed: int = 42,
180
+ distance: str = "angular",
181
+ reuse_observations: str = None,
182
+ ):
183
+ """
184
+ Run REAP expert pruning.
185
+
186
+ Args:
187
+ model_path: Path to base model
188
+ compression_ratio: 0.30 = prune 30%, keep 70%
189
+ dataset: Calibration dataset (code + tools + agentic)
190
+ samples: Number of calibration samples
191
+ seed: Random seed for reproducibility
192
+ distance: Distance metric for expert clustering
193
+ reuse_observations: Path to pre-computed observations for instant pruning
194
+ """
195
+ cmd = [
196
+ sys.executable, "src/reap/prune.py",
197
+ "--model-name", model_path,
198
+ "--dataset-name", dataset,
199
+ "--compression-ratio", str(compression_ratio),
200
+ "--prune-method", "reap",
201
+ "--seed", str(seed),
202
+ "--samples_per_category", str(samples),
203
+ "--model_max_length", "2048",
204
+ "--distance_measure", distance,
205
+ "--record_pruning_metrics_only", "true",
206
+ ]
207
+
208
+ if reuse_observations:
209
+ # Instant pruning: skip calibration, reuse precomputed expert scores
210
+ cmd.extend(["--load_observations", reuse_observations])
211
+
212
+ subprocess.run(cmd, check=True)
213
+
214
+ # Example: Create 40% pruned model
215
+ run_reap(
216
+ model_path="/path/to/GLM-4.7",
217
+ compression_ratio=0.40, # Prune 40% of experts
218
+ )
219
+ ```
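+
+ With the function above, a full multi-ratio sweep can reuse a single calibration pass, as described in the next section (assuming the first run saved its observations to `observations.pt` via the CLI's output flag):
+
+ ```python
+ # One calibration pass, then near-instant pruning at the other ratios
+ run_reap("/path/to/GLM-4.7", 0.40)
+ for ratio in (0.30, 0.35, 0.45, 0.50):
+     run_reap("/path/to/GLM-4.7", ratio, reuse_observations="observations.pt")
+ ```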
 
+ ### Observation Reuse (Instant Multi-Ratio Pruning)
 
+ REAP computes expert saliency scores during calibration. These scores are **compression-ratio independent**, enabling instant pruning at any ratio:
 
+ ```bash
+ # First run: compute observations (~5 hours)
+ python prune.py --compression-ratio 0.40 --output_file_name observations.pt
+
+ # Subsequent runs: instant pruning (<5 minutes)
+ python prune.py --compression-ratio 0.30 --load_observations observations.pt
+ python prune.py --compression-ratio 0.50 --load_observations observations.pt
+ ```
 
  ---
 
  ## ⚖️ License
 
+ Apache 2.0 (inherited from GLM-4)
 
  ---
 
  ## 🧾 Citation
 
  ```bibtex
  @article{lasby2025reap,
    title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},