---
license: apache-2.0
base_model: PrimeIntellect/INTELLECT-3
tags:
- REAP
- pruned
- MoE
- glm4_moe
- Cerebras
library_name: transformers
---

# INTELLECT-3-REAP-50

**50% expert-pruned version of [PrimeIntellect/INTELLECT-3](https://huggingface.co/PrimeIntellect/INTELLECT-3) using Cerebras REAP (Router-weighted Expert Activation Pruning).**

## Model Details

| Property | Value |
|----------|-------|
| Base Model | PrimeIntellect/INTELLECT-3 (248B MoE) |
| Architecture | GLM-4 MoE (glm4_moe) |
| Compression | 50% (64 of 128 experts pruned per layer) |
| Remaining Experts | 64 per layer |
| Parameters | ~124B |
| Format | BF16 SafeTensors |
| Size | 107 GB |

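Because only the routed experts are removed, the pruning is visible directly in the model config. A minimal sketch for sanity-checking the expert count without downloading the weights (the field name `n_routed_experts` is an assumption for the glm4_moe architecture; print the full config if it differs):

```python
from transformers import AutoConfig

# Fetches only config.json, not the 107 GB of weights
config = AutoConfig.from_pretrained("0xSero/INTELLECT-3-REAP-50", trust_remote_code=True)

# glm4_moe configs typically expose the routed-expert count as `n_routed_experts`
# (assumed name); after 50% REAP pruning this should read 64 instead of 128.
print(getattr(config, "n_routed_experts", config))
```
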
## REAP Configuration

```yaml
dataset: 0xSero/glm47-reap-calibration-v2
samples: 1360
#  - evol-codealpaca-v1: 700 (code generation)
#  - xlam-function-calling-60k: 330 (function calling)
#  - SWE-smith-trajectories: 330 (agentic multi-turn)
distance_measure: angular
seed: 42
model_max_length: 2048
compression_ratio: 0.50
prune_method: reap
```

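As background, REAP scores each expert by how much it actually contributes on the calibration set: the router's gate weight for the expert multiplied by the norm of that expert's output, averaged over tokens, with the lowest-scoring experts removed layer by layer. The sketch below illustrates that saliency criterion only; it is not the Cerebras implementation, and the tensor names and shapes are illustrative:

```python
import torch

def reap_saliency(gate_weights: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """Router-weighted expert activation saliency (illustrative).

    gate_weights:   [tokens, num_experts] routing weights (zero for experts outside the top-k)
    expert_outputs: [tokens, num_experts, hidden] expert outputs on calibration data
    Returns one score per expert; low scores mark pruning candidates.
    """
    output_norms = expert_outputs.norm(dim=-1)          # [tokens, num_experts]
    return (gate_weights * output_norms).mean(dim=0)    # average contribution per expert

# Toy example: keep the top 50% of 128 experts, matching compression_ratio: 0.50
scores = reap_saliency(torch.rand(4096, 128), torch.rand(4096, 128, 64))
keep = torch.topk(scores, k=scores.numel() // 2).indices
```

In the real pipeline only the top-k routed experts per token carry nonzero gate weight, and the calibration data is the code / function-calling / agentic mix listed above.
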
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/INTELLECT-3-REAP-50",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/INTELLECT-3-REAP-50", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

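For serving rather than single-shot generation, the checkpoint can also be loaded with an inference engine such as vLLM, assuming a recent build with glm4_moe support; the tensor-parallel size below is a placeholder for however many GPUs are needed to hold the ~107 GB of BF16 weights:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size is an assumption: size it to fit ~107 GB of BF16 weights
llm = LLM(model="0xSero/INTELLECT-3-REAP-50", tensor_parallel_size=4, trust_remote_code=True)

outputs = llm.chat(
    [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}],
    SamplingParams(temperature=0.7, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```
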
## Related Models

| Model | Compression | Format | Size |
|-------|-------------|--------|------|
| [INTELLECT-3-REAP-50](https://huggingface.co/0xSero/INTELLECT-3-REAP-50) | 50% | BF16 | 107 GB |
| INTELLECT-3-REAP-50-W4A16 | 50% | W4A16 GPTQ | ~30 GB (coming soon) |

## Citation

```bibtex
@article{cerebras2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for MoE Models},
  author={Cerebras Systems},
  year={2025}
}
```

79
+
80
+ ## Acknowledgments
81
+
82
+ - **[Prime Intellect](https://primeintellect.ai/)** - For sponsoring compute and creating INTELLECT-3
83
+ - **[Cerebras](https://www.cerebras.net/)** - For the REAP pruning methodology
84
+ - Pruned using the [Cerebras REAP implementation](https://github.com/Cerebras/reap)
85
+
---
*This model was created as part of efficiency research for large MoE models.*