0xSero committed on
Commit 5deeeba · verified · 1 Parent(s): 49d65fb

Deep technical update: calibration rationale, embedded scripts, dataset science

Files changed (1)
  1. README.md +142 -41
README.md CHANGED
@@ -12,6 +12,7 @@ tags:
  - cerebras
  - code
  - function-calling
  license: apache-2.0
  pipeline_tag: text-generation
  base_model:
@@ -27,14 +28,14 @@ base_model:
 
  ## ✨ Highlights
 
- Introducing **GLM-4.7-REAP-50**, a **memory-efficient compressed variant** of GLM-4.7 optimized for **code generation and function calling**.
 
- This model was created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)**, developed by Cerebras. Key features:
 
- - **50% Expert Pruning**: Compressed from 358B to 179B parameters
- - **Optimized for Code & Tools**: Calibrated specifically on code generation and function calling datasets
- - **One-Shot Compression**: No fine-tuning required - ready for immediate deployment
- - **Drop-in Compatibility**: Works with vLLM, Transformers, and other standard frameworks
 
  ### 🙏 Acknowledgments
 
@@ -43,40 +44,88 @@ This model was created using **[REAP (Router-weighted Expert Activation Pruning)
 
  ---
 
- ## 📋 Model Overview
 
  | Property | Value |
  |----------|-------|
  | **Base Model** | [zai/glm-4.7](https://huggingface.co/zai/glm-4.7) |
- | **Compression Method** | REAP (Router-weighted Expert Activation Pruning) |
- | **Compression Ratio** | 50% expert pruning |
- | **Type** | Sparse Mixture-of-Experts (SMoE) |
- | **Total Parameters** | 179B (was 358B) |
  | **Experts per Layer** | 80 (was 160) |
  | **MoE Layers** | 92 |
  | **Precision** | BF16 |
  | **Disk Size** | ~345GB |
  | **VRAM Required** | ~345GB |
  ---
 
  ## 📦 Related Models
 
- | Model | Params | Experts | Size | Format | Link |
- |-------|--------|---------|------|--------|------|
- | GLM-4.7-REAP-30 | 251B | 112 | ~470GB | BF16 | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-30) |
- | GLM-4.7-REAP-35 | 233B | 104 | ~439GB | BF16 | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-35) |
- | GLM-4.7-REAP-40 | 218B | 96 | ~407GB | BF16 | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-40) |
- | GLM-4.7-REAP-45 | 197B | 88 | ~370GB | BF16 | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-45) |
- | GLM-4.7-REAP-50 | 179B | 80 | ~345GB | BF16 | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-50) |
- | GLM-4.7-REAP-40-W4A16 | 218B | 96 | ~108GB | GPTQ | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-40-W4A16) |
- | GLM-4.7-REAP-50-W4A16 | 179B | 80 | ~92GB | GPTQ | [Link](https://huggingface.co/0xSero/GLM-4.7-REAP-50-W4A16) |
 
  ---
 
  ## 🚀 Deployment
 
- ### With vLLM (Recommended)
 
  ```bash
  vllm serve 0xSero/GLM-4.7-REAP-50 \
@@ -85,7 +134,7 @@ vllm serve 0xSero/GLM-4.7-REAP-50 \
  --dtype bfloat16
  ```
 
- ### With Transformers
 
  ```python
  import torch
@@ -99,48 +148,100 @@ model = AutoModelForCausalLM.from_pretrained(
  )
  tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-REAP-50", trust_remote_code=True)
 
- messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
  inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
- outputs = model.generate(inputs.to(model.device), max_new_tokens=512)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
 
  ---
 
- ## 🧩 Model Creation
 
- This model was created by applying **REAP** uniformly across all MoE blocks with a **50% pruning rate**.
 
- ### How REAP Works
 
- REAP selects experts to prune based on a **saliency criterion** that considers:
- - **Router gate values**: How frequently and strongly the router activates each expert
- - **Expert activation norms**: The magnitude of each expert's output contributions
 
- ### Calibration for Code & Function Calling
 
- This model was specifically calibrated on datasets optimized for **code generation** and **function/tool calling**:
 
- | Dataset | Samples | Purpose |
- |---------|---------|---------|
- | [evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 700 | Code generation |
- | [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 330 | Function/tool calling |
- | [SWE-smith-trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories) | 330 | Agentic multi-turn |
 
- Combined calibration dataset: [0xSero/glm47-reap-calibration-v2](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2)
 
  ---
 
  ## ⚖️ License
 
- Apache 2.0 (inherited from base GLM-4 model)
 
  ---
 
  ## 🧾 Citation
 
- If you use this model, please cite the REAP paper:
-
  ```bibtex
  @article{lasby2025reap,
    title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
 
  - cerebras
  - code
  - function-calling
+ - agentic
  license: apache-2.0
  pipeline_tag: text-generation
  base_model:
 
 
  ## ✨ Highlights
 
+ **50% Expert-Pruned** GLM-4.7 optimized for **code generation**, **function calling**, and **agentic workflows**.
 
+ Created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)** by Cerebras:
 
+ - **358B → 179B**: 50% of MoE experts pruned (80/160 remaining)
+ - **Calibrated for Code & Tools**: Preserves coding and function-calling capabilities
+ - **One-Shot Compression**: No fine-tuning required
+ - **Drop-in Compatible**: Works with vLLM, Transformers, SGLang
 
  ### 🙏 Acknowledgments
 
 
 
  ---
 
+ ## 📋 Model Specifications
 
  | Property | Value |
  |----------|-------|
  | **Base Model** | [zai/glm-4.7](https://huggingface.co/zai/glm-4.7) |
+ | **Architecture** | Sparse Mixture-of-Experts (SMoE) |
+ | **Original Parameters** | 358B |
+ | **Pruned Parameters** | 179B |
+ | **Compression** | 50% of experts removed |
  | **Experts per Layer** | 80 (was 160) |
  | **MoE Layers** | 92 |
+ | **Activated Experts** | 8 per token |
  | **Precision** | BF16 |
  | **Disk Size** | ~345GB |
  | **VRAM Required** | ~345GB |
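 
+ To verify the pruned expert count after download, the config can be printed; a quick check only, since the exact field names depend on the GLM MoE architecture:
+
+ ```python
+ from transformers import AutoConfig
+
+ # Prints the full model config; look for the routed-expert count (80 per layer after pruning)
+ cfg = AutoConfig.from_pretrained("0xSero/GLM-4.7-REAP-50", trust_remote_code=True)
+ print(cfg)
+ ```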
 
+ ---
+
+ ## 🔬 Calibration Dataset: Deep Dive
+
+ REAP's effectiveness depends critically on **calibration data that represents the target use case**. We specifically optimized for **code generation**, **function/tool calling**, and **agentic workflows**.
+
+ ### Why These 3 Datasets?
+
+ | Dataset | Samples | Purpose | Why It Matters |
+ |---------|---------|---------|----------------|
+ | [evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 700 | Code generation | **51% of mix**: code tasks activate specific expert pathways; pruning without code calibration destroys coding ability |
+ | [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 330 | Function/tool calling | **24% of mix**: tool use requires structured JSON output, so experts handling schema generation must be preserved |
+ | [SWE-smith-trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories) | 330 | Agentic multi-turn | **24% of mix**: real SWE-bench trajectories with tool calls, file edits, and multi-step reasoning |
+
+ ### The Science Behind Dataset Selection
+
+ ```
+ REAP Algorithm:
+ 1. Forward pass calibration samples through model
+ 2. Record which experts activate and their magnitudes
+ 3. Compute saliency = router_weight × activation_norm
+ 4. Prune lowest-saliency experts
+
+ Key Insight: Experts are TASK-SPECIFIC
+ ├── Some experts specialize in natural language
+ ├── Some experts specialize in code syntax
+ ├── Some experts specialize in JSON/structured output
+ └── Some experts specialize in multi-turn context
+
+ If calibration lacks code → code-specialized experts appear "unused" → get pruned → model loses coding ability
+ ```
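+
+ A minimal sketch of that saliency computation, for intuition only (not the Cerebras implementation; tensor names and shapes are assumptions):
+
+ ```python
+ import torch
+
+ def expert_saliency(router_weights: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
+     """router_weights: [tokens, experts] gate values (zero for experts not routed).
+     expert_outputs: [tokens, experts, hidden] per-expert outputs per token.
+     Returns [experts] scores; the lowest-scoring experts get pruned."""
+     norms = expert_outputs.norm(dim=-1)           # activation magnitude per token/expert
+     return (router_weights * norms).mean(dim=0)   # router weight * activation norm, averaged
+
+ def experts_to_keep(saliency: torch.Tensor, compression_ratio: float) -> torch.Tensor:
+     # Keep the top (1 - ratio) fraction of experts, e.g. 80 of 160 at ratio 0.5
+     n_keep = int(saliency.numel() * (1.0 - compression_ratio))
+     return saliency.topk(n_keep).indices.sort().values
+ ```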
+
+ ### Cerebras' Original Mix (from paper)
+
+ Cerebras used the same 3 datasets in their GLM-4.6 REAP experiments:
+ - evol-codealpaca-v1 for code generation
+ - xlam-function-calling-60k for tool calling
+ - SWE-smith-trajectories for agentic tasks
+
+ We followed this exact recipe for reproducibility.
+
+ ### Combined Dataset
+
+ Our calibration mix: [0xSero/glm47-reap-calibration-v2](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2)
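+
+ To inspect the mix locally (a quick sketch; the `train` split name and the schema are assumptions, see the dataset card):
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the combined calibration mix and peek at one sample
+ ds = load_dataset("0xSero/glm47-reap-calibration-v2", split="train")
+ print(ds)
+ print(ds[0])
+ ```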
 
  ---
 
  ## 📦 Related Models
 
+ | Model | Params | Experts | Size | Format |
+ |-------|--------|---------|------|--------|
+ | [GLM-4.7-REAP-30](https://huggingface.co/0xSero/GLM-4.7-REAP-30) | 251B | 112 | ~470GB | BF16 |
+ | [GLM-4.7-REAP-35](https://huggingface.co/0xSero/GLM-4.7-REAP-35) | 233B | 104 | ~439GB | BF16 |
+ | [GLM-4.7-REAP-40](https://huggingface.co/0xSero/GLM-4.7-REAP-40) | 218B | 96 | ~407GB | BF16 |
+ | [GLM-4.7-REAP-45](https://huggingface.co/0xSero/GLM-4.7-REAP-45) | 197B | 88 | ~370GB | BF16 |
+ | [GLM-4.7-REAP-50](https://huggingface.co/0xSero/GLM-4.7-REAP-50) | 179B | 80 | ~345GB | BF16 |
+ | [GLM-4.7-REAP-40-W4A16](https://huggingface.co/0xSero/GLM-4.7-REAP-40-W4A16) | 218B | 96 | ~108GB | GPTQ |
+ | [GLM-4.7-REAP-50-W4A16](https://huggingface.co/0xSero/GLM-4.7-REAP-50-W4A16) | 179B | 80 | ~92GB | GPTQ |
 
  ---
 
  ## 🚀 Deployment
 
+ ### vLLM (Recommended)
 
  ```bash
  vllm serve 0xSero/GLM-4.7-REAP-50 \
 
  --dtype bfloat16
  ```
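 
+ Once the server is up, it exposes an OpenAI-compatible API (on port 8000 by default). A minimal client query, assuming the `openai` Python package and default settings:
+
+ ```python
+ from openai import OpenAI
+
+ # Point the standard OpenAI client at the local vLLM server
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ resp = client.chat.completions.create(
+     model="0xSero/GLM-4.7-REAP-50",
+     messages=[{"role": "user", "content": "Write a Python function to check if a number is prime."}],
+     max_tokens=512,
+ )
+ print(resp.choices[0].message.content)
+ ```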
 
+ ### Transformers
 
  ```python
  import torch
 
  )
  tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-REAP-50", trust_remote_code=True)
 
+ messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
  inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
+ outputs = model.generate(inputs.to(model.device), max_new_tokens=512, do_sample=True, temperature=0.7)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
 
157
  ---
158
 
159
+ ## 🧩 Reproduction
160
+
161
+ ### REAP Pruning Script
162
 
 
163
 
164
+ ```python
165
+ #!/usr/bin/env python3
166
+ """
167
+ REAP Pruning Script for MoE Models
168
+ Adapted from: https://github.com/CerebrasResearch/reap
169
+ """
170
+
171
+ import subprocess
172
+ import sys
173
+
174
+ def run_reap(
175
+ model_path: str,
176
+ compression_ratio: float,
177
+ dataset: str = "0xSero/glm47-reap-calibration-v2",
178
+ samples: int = 1360,
179
+ seed: int = 42,
180
+ distance: str = "angular",
181
+ reuse_observations: str = None,
182
+ ):
183
+ """
184
+ Run REAP expert pruning.
185
+
186
+ Args:
187
+ model_path: Path to base model
188
+ compression_ratio: 0.30 = prune 30%, keep 70%
189
+ dataset: Calibration dataset (code + tools + agentic)
190
+ samples: Number of calibration samples
191
+ seed: Random seed for reproducibility
192
+ distance: Distance metric for expert clustering
193
+ reuse_observations: Path to pre-computed observations for instant pruning
194
+ """
195
+ cmd = [
196
+ sys.executable, "src/reap/prune.py",
197
+ "--model-name", model_path,
198
+ "--dataset-name", dataset,
199
+ "--compression-ratio", str(compression_ratio),
200
+ "--prune-method", "reap",
201
+ "--seed", str(seed),
202
+ "--samples_per_category", str(samples),
203
+ "--model_max_length", "2048",
204
+ "--distance_measure", distance,
205
+ "--record_pruning_metrics_only", "true",
206
+ ]
207
+
208
+ if reuse_observations:
209
+ # Instant pruning: skip calibration, reuse precomputed expert scores
210
+ cmd.extend(["--load_observations", reuse_observations])
211
+
212
+ subprocess.run(cmd, check=True)
213
+
214
+ # Example: Create 40% pruned model
215
+ run_reap(
216
+ model_path="/path/to/GLM-4.7",
217
+ compression_ratio=0.40, # Prune 40% of experts
218
+ )
219
+ ```
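+
+ With the function above, a full multi-ratio sweep can reuse a single calibration pass, as described in the next section (assuming the first run saved its observations to `observations.pt` via the CLI's output flag):
+
+ ```python
+ # One calibration pass, then near-instant pruning at the other ratios
+ run_reap("/path/to/GLM-4.7", 0.40)
+ for ratio in (0.30, 0.35, 0.45, 0.50):
+     run_reap("/path/to/GLM-4.7", ratio, reuse_observations="observations.pt")
+ ```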
 
+ ### Observation Reuse (Instant Multi-Ratio Pruning)
 
+ REAP computes expert saliency scores during calibration. These scores are **compression-ratio independent**, enabling instant pruning at any ratio:
 
+ ```bash
+ # First run: compute observations (~5 hours)
+ python prune.py --compression-ratio 0.40 --output_file_name observations.pt
+
+ # Subsequent runs: instant pruning (<5 minutes)
+ python prune.py --compression-ratio 0.30 --load_observations observations.pt
+ python prune.py --compression-ratio 0.50 --load_observations observations.pt
+ ```
 
  ---
 
  ## ⚖️ License
 
+ Apache 2.0 (inherited from GLM-4)
 
  ---
 
  ## 🧾 Citation
 
  ```bibtex
  @article{lasby2025reap,
    title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},