lujangusface committed on
Commit f600f87 · verified · 1 Parent(s): 73afa42

Release EAGLE3 draft head for GLM-4.7-FP8 (exp-e, acc=0.97)

Files changed (3):
1. README.md +167 -0
2. config.json +39 -0
3. model.safetensors +3 -0
README.md ADDED
---
library_name: transformers
license: apache-2.0
language:
- en
base_model: THUDM/GLM-4.7
pipeline_tag: text-generation
tags:
- eagle3
- speculative-decoding
- sglang
- draft-model
- moe
- mixture-of-experts
- fp8
---

<!-- Internal: exp-e (gpu/glm47-fp8) -->

# EAGLE3 Draft Head — GLM-4.7-FP8

A lightweight EAGLE3 draft head for [GLM-4.7](https://huggingface.co/THUDM/GLM-4.7) (~218B-parameter MoE, 160 experts, sigmoid top-8 routing, ~40B active parameters per token). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.

GLM-4.7 uses sigmoid top-8 routing, activating 8 of its 160 experts per token rather than the 1-2 typical of most MoE models. This preserves high representational capacity at the cost of extra compute, which makes speculative decoding especially valuable: the draft head is tiny relative to the 218B target.
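To illustrate the routing described above, here is a toy sketch. The scores are random stand-ins; a real router projects each token's hidden state. Only the sigmoid gating and top-8 selection mirror the description:

```python
import math
import random

# Toy sketch of sigmoid top-8 routing over 160 experts.
# Scores are random stand-ins for the router's per-expert logits.
random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(160)]

# Sigmoid gating (not softmax): each expert's gate is independent of the others.
gates = [1.0 / (1.0 + math.exp(-s)) for s in scores]

# The token is routed to the 8 highest-gated experts.
top8 = sorted(range(160), key=lambda i: gates[i], reverse=True)[:8]
print(sorted(top8))
```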

**Blog post**: [TODO: link after publication]

## Usage

### SGLang (GPU)

Requires our [SGLang fork](https://github.com/tails-mpt/sglang) for GLM-4.7 Eagle3 support.

**B=1 server** (wide tree, optimal for single-user, real-time requests):

```bash
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
  --model-path THUDM/GLM-4.7 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```

**B=32 server** (the wide tree is also recommended at B=32 for this model; the flags are identical to the B=1 launch):

```bash
python -m sglang.launch_server \
  --model-path THUDM/GLM-4.7 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```

**Note**: Unlike other MoE models, where a narrow tree helps at B=32, GLM-4.7-FP8 performs marginally better with a wide tree (1.16x vs 1.14x). Use the wide tree for all workloads.

### Python Client

```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

## Training Details

| Parameter | Value |
|-----------|-------|
| Framework | [SpecForge](https://github.com/tails-mpt/SpecForge) (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 144GB (TP=8, DP=1) |
| Pre-training | 6 epochs on a 54K-sample mix (ShareGPT / UltraChat / PerfectBlend), LR=1e-4 |
| Fine-tuning | 3 epochs on regenerated data (target-model responses at temp=0.8), LR=5e-5 |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 1024 |
| TTT (tree training tokens) | 7 |
| Precision | bfloat16 |
| Training accuracy (acc_0) | 0.97 |

### Training Method

EAGLE3 trains a single-layer draft head that predicts the next token from hidden states captured at three auxiliary layers of the target model (layers 2, 46, and 89: early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.
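A minimal, scalar sketch of the TTT unroll described above. The `draft_step` and `cross_entropy` callables are hypothetical stand-ins for SpecForge's real batched-tensor implementation; only the unroll structure is the point:

```python
# Toy version of the training-time-test (TTT) objective: unroll the draft
# head over several future positions, feeding its own hidden state back to
# itself (as happens at inference), and sum per-step losses against the
# target model's tokens.

def ttt_loss(draft_step, h0, first_token, target_tokens, cross_entropy):
    h, tok = h0, first_token
    total = 0.0
    for t in target_tokens:               # TTT=7 unroll steps in this run
        h, logits = draft_step(h, tok)    # draft reuses its own hidden state
        total += cross_entropy(logits, t)
        tok = t                           # teacher-force the ground-truth token
    return total

# Smoke test with dummy callables: each of the 3 steps contributes loss 1.0.
loss = ttt_loss(lambda h, tok: (h + 1, tok), 0.0, 0, [1, 2, 3], lambda lg, t: 1.0)
print(loss)  # 3.0
```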

### Regenerated Data

The final fine-tuning stage uses training data in which the assistant responses were generated by GLM-4.7 itself (at temp=0.8) rather than taken from generic ShareGPT/UltraChat responses. This aligns the draft head's predicted distribution with the target model's actual output, improving acceptance rates, especially at high batch sizes (B=32) where every accepted token matters more.
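A hypothetical regeneration script could reuse the OpenAI-style endpoint from the Usage section; the function names and payload defaults here are illustrative, not SpecForge's actual tooling:

```python
import requests

def build_payload(prompt, temperature=0.8, max_tokens=1024):
    # temperature=0.8 matches the sampling temperature used for the regenerated data
    return {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def regenerate(prompt, url="http://localhost:30000/v1/chat/completions"):
    """Replace a dataset's assistant turn with the target model's own response."""
    r = requests.post(url, json=build_payload(prompt), timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```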

## Performance

### B=1 Inference Benchmarks (temp=0, FP8, TP=8)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | Accept Rate | Accept Length |
|---------|-----------------|----------------|---------|-------------|---------------|
| Terminal-Bench | 55.0 | 113.6 | **2.07x** | 42.5% | 2.55 |
| MT-Bench | 66.5 | 106.7 | **1.60x** | 42.5% | 2.55 |
| SWEBench-Verified | 66.1 | 104.0 | **1.57x** | 45.0% | 2.70 |
| HumanEval | 66.8 | 102.2 | **1.53x** | 54.2% | 3.25 |
| **Mean** | **63.6** | **106.6** | **1.69x** | **46.1%** | **2.76** |

### B=32 Inference Benchmarks (temp=0, FP8, TP=8, wide tree)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|-----------------|----------------|---------|
| SWEBench-Verified | 922.7 | 1,108.4 | **1.20x** |
| MT-Bench | 954.2 | 1,109.7 | **1.16x** |
| Terminal-Bench | 952.3 | 1,104.3 | **1.16x** |
| HumanEval | 915.1 | 1,035.9 | **1.13x** |
| **Mean** | **936.1** | **1,089.6** | **1.16x** |

*Config: steps=3, topk=4, draft_tokens=6. Hardware: 8x H200 (TP=8), FlashInfer backend. SGLang commit `63291f7f51`.*
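As a sanity check, the B=1 "Mean" row averages the per-dataset speedups rather than dividing the mean throughputs; the two differ slightly:

```python
# Recompute the B=1 "Mean" row from the per-dataset numbers in the table above.
baseline = [55.0, 66.5, 66.1, 66.8]      # tok/s
eagle3   = [113.6, 106.7, 104.0, 102.2]  # tok/s

# Mean of per-dataset speedups (what the table reports).
mean_speedup = sum(e / b for e, b in zip(eagle3, baseline)) / len(baseline)

# Ratio of mean throughputs (a slightly different quantity).
ratio_of_means = (sum(eagle3) / len(eagle3)) / (sum(baseline) / len(baseline))

print(round(mean_speedup, 2))    # 1.69, as reported
print(round(ratio_of_means, 2))  # 1.68
```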

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 5120 |
| Num hidden layers | 1 |
| Num attention heads | 40 (8 KV heads) |
| head_dim | 128 |
| Intermediate size | 16384 |
| Auxiliary layers | [2, 46, 89] |
| Vocab size | 151552 (target) / 32000 (draft) |
| Checkpoint size | ~1.2 GB |
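The vocab-size row reflects EAGLE3's reduced draft vocabulary: the head scores only 32,000 candidate tokens, and a lookup table maps each draft id back into the target's 151,552-entry vocabulary before verification. A toy illustration follows; the random table is a stand-in (the trained checkpoint ships its own mapping, named `d2t` in the EAGLE3 reference code — an assumption here, not read from this checkpoint):

```python
import random

DRAFT_VOCAB, TARGET_VOCAB = 32000, 151552

# Stand-in draft->target id table; the real mapping is fixed at training time,
# not random. Each draft id resolves to exactly one target-vocabulary id.
random.seed(0)
d2t = random.sample(range(TARGET_VOCAB), DRAFT_VOCAB)

def to_target_id(draft_id):
    """Map a draft-head prediction into the target model's vocabulary."""
    return d2t[draft_id]
```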

## Limitations

- **TP=8 required.** FP8 block constraint: shared_expert intermediate_size=512, and 512/8=64 is not divisible by block_n=128. TP=4 fails at this boundary.
- **Temperature sensitivity.** Best performance at temp=0 (greedy). MoE expert routing is non-deterministic at temp>0, which reduces draft acceptance rates. Deploy at temp=0 for coding and factual workloads.
- **FP8 quantization.** The target model runs in FP8. The draft head itself is bfloat16 but depends on the target's FP8 hidden states during inference.
- **Requires SGLang fork.** Upstream SGLang does not yet include all patches needed for Eagle3 on this model.
- **JIT deep_gemm incompatible.** Training requires `SGLANG_ENABLE_JIT_DEEPGEMM=0` to avoid kernel assertion failures.
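For training runs, the JIT deep_gemm kernels mentioned in the last bullet are disabled with an environment variable before launching SpecForge:

```shell
# Disable JIT deep_gemm kernels to avoid the kernel assertion failures above
export SGLANG_ENABLE_JIT_DEEPGEMM=0
```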

## License

This draft head is released under Apache 2.0. The target model is covered by its own terms; review the [GLM-4.7 license](https://huggingface.co/THUDM/GLM-4.7) separately.

## Citation

```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```
config.json ADDED
{
  "architectures": [
    "LlamaForCausalLMEagle3"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151329,
  "draft_vocab_size": 32000,
  "dtype": "bfloat16",
  "eagle_aux_hidden_state_layer_ids": [
    2,
    46,
    89
  ],
  "eos_token_id": 151336,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 4096,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_hidden_layers": 1,
  "num_key_value_heads": 8,
  "pad_token_id": null,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "rope_theta": 1000000.0,
    "rope_type": "default"
  },
  "target_hidden_size": 5120,
  "tie_word_embeddings": false,
  "transformers_version": "5.3.0",
  "use_cache": true,
  "vocab_size": 151552
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:770b811326f1be2f2881d47b00871d9ef724dad72dcffbdf20a574824043522f
size 1187962360