lujangusface committed on
Commit 5652f19 · verified · Parent(s): 0335a31

docs: polish model card, remove internal details, fix citation

Files changed (1): README.md (+17 -36)
README.md CHANGED
@@ -3,17 +3,11 @@ license: apache-2.0
 library_name: transformers
 pipeline_tag: text-generation
 tags:
-
 - speculative-decoding
-
 - eagle3
-
 - glm
-
 - draft-model
-
 - text-generation
-
 ---
 
 # EAGLE3 Draft Model for GLM-4.7-Flash
@@ -29,7 +23,7 @@ GLM-4.7-Flash-Eagle3 is an EAGLE3 draft model trained for speculative decoding w
 
 ## Model Overview
 
-This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) through speculative decoding. The draft model predicts multiple tokens ahead, achieving **1.39× TPOT speedup** for single requests and **1.7× throughput improvement** under concurrent load.
+This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) through speculative decoding. The draft model predicts multiple tokens ahead, achieving **1.39× TPOT speedup** for single requests and **1.70× throughput improvement** under concurrent load.
 
 **Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) - Mixture-of-Experts language model with 3B active parameters
 **Draft Model Size**: 277.4 MB
@@ -39,8 +33,8 @@ This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https:
 
 - **FlashInfer Compatible**: head_dim=128 ✓
 - **Acceptance Rate**: 40.0% (MT-Bench, B=1)
-- **Speedup**: 1.39× TPOT (B=1), 1.7× throughput (B=32)
-- **Hardware**: Optimized for TP=4 deployment
+- **Speedup**: 1.39× TPOT (B=1), 1.70× throughput (B=32)
+- **Hardware**: Optimized for single GPU (TP=1) deployment
 
 ---
 
@@ -68,14 +62,10 @@ This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https:
 **Mixed Diversity** — 54K samples
 
 Composition:
-
 - 45% ShareGPT
-
 - 35% UltraChat
-
 - 20% PerfectBlend
 
-
 Average tokens per sample: 1300
 
 ### Hyperparameters
@@ -87,15 +77,10 @@ Average tokens per sample: 1300
 | Learning Rate | 1e-4 |
 | Warmup Ratio | 0.03 |
 | Max Length | 1024 |
-| TP Size | 4 |
-
-
 
 ### Training Results
 
-- **Training Acceptance Rate**: 79.2% (at position k=0)
-- **Best Checkpoint**: epoch_2_step_37323
-- **Experiment ID**: exp-K
+- **Training Acceptance Rate**: 79.2% at position k=0 (first draft token; inference average across all 6 positions is ~40%)
 
 ---
 
@@ -123,8 +108,8 @@ Average tokens per sample: 1300
 | TTFT (ms) | 76.1 | 74.74 | **1.02×** |
 | TPOT (ms) | 8.18 | 5.89 | **1.39×** |
 | Throughput (tok/s) | 120.3 | 167.75 | **1.39×** |
-| Acceptance Rate | -- | **40.0%** | -- |
-| Acceptance Length | -- | **2.4** | -- |
+| Acceptance Rate (%) | | **40.0%** | |
+| Acceptance Length | | **2.4** | |
 
 ### Batch Size 32 (Concurrent Load - Throughput Optimization)
 
@@ -132,9 +117,13 @@ Average tokens per sample: 1300
 
 | Metric | Baseline | EAGLE3 | Speedup |
 |--------|----------|--------|---------|
-| TTFT (ms) | 2988 | 3210 | **0.93×** |
-| TPOT (ms) | 22.57 | 17.33 | **1.3×** |
-| Throughput (tok/s) | 258.61 | 440.15 | **1.7×** |
+| TTFT (ms) | 2988 | 3210 | 0.93× |
+| TPOT (ms) | 22.57 | 17.33 | **1.30×** |
+| Throughput (tok/s) | 258.61 | 440.15 | **1.70×** |
+| Acceptance Rate (%) | — | **40.0%†** | — |
+| Acceptance Length | — | **2.4†** | — |
+
+†Same server session as B=1; concurrent benchmark does not collect per-request accept stats.
 
 **Key Insight**: Batch size 1 optimizes for interactive latency (TPOT matters most), while batch size 32 optimizes for serving capacity (throughput matters most).
 
@@ -162,7 +151,6 @@ python -m sglang.launch_server \
   --trust-remote-code \
   --port 30000 \
   --enable-metrics
-
 ```
 
 ### Python API
@@ -193,15 +181,10 @@ print(response.json())
 
 ## Limitations
 
-
 - Requires SGLang backend with EAGLE3 support
-
 - Optimized for TP=1 inference (single GPU deployment)
-
 - FlashInfer backend recommended for optimal performance
 
-- Head dimension 128 ensures FlashInfer compatibility
-
 
 ---
 
@@ -219,11 +202,11 @@ print(response.json())
 ### EAGLE3 Paper
 
 ```bibtex
-@article{wang2024eagle3,
+@article{wang2025eagle3,
   title={EAGLE-3: Lossless Acceleration of LLM Decoding by Adaptive Draft Heads},
   author={Wang, Yuhui and others},
-  journal={arXiv preprint arXiv:2501.XXXXX},
-  year={2024}
+  journal={arXiv preprint arXiv:2503.01840},
+  year={2025}
 }
 ```
 
@@ -231,8 +214,6 @@ print(response.json())
 
 ## Additional Resources
 
-- **Benchmark Results**: [https://github.com/thoughtworks/baby-shark/blob/main/benchmark/docs/mtbench_results.md](https://github.com/thoughtworks/baby-shark/blob/main/benchmark/docs/mtbench_results.md)
-- **Training Guide**: [https://github.com/thoughtworks/baby-shark/blob/main/training/docs/EXPERIMENT_EVOLUTION.md](https://github.com/thoughtworks/baby-shark/blob/main/training/docs/EXPERIMENT_EVOLUTION.md)
 - **Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
 
 ---
@@ -245,4 +226,4 @@ apache-2.0
 
 ## Contact
 
-For questions or issues, please contact ThoughtWorks or open an issue in the [baby-shark repository](https://github.com/thoughtworks/baby-shark).
+For questions or issues, open a discussion on the [model page](https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3/discussions).
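
For context on the "Python API" section touched by this diff: once the SGLang server from the launch command is up, speculative decoding is transparent to clients, so a request looks like any ordinary completion call. A minimal sketch, assuming the server exposes SGLang's OpenAI-compatible `/v1/completions` endpoint on the `--port 30000` shown above and that the model is addressed by the target repo id (both assumptions, not taken from the diff):

```python
# Minimal client sketch for an SGLang server running with the EAGLE3 draft model.
# Assumed: OpenAI-compatible /v1/completions endpoint, port 30000, model name
# "zai-org/GLM-4.7-Flash" (check your server's served model name).
import json
import urllib.request

SERVER_URL = "http://localhost:30000/v1/completions"  # matches --port 30000 above


def build_request(prompt: str, max_tokens: int = 128) -> dict:
    """Assemble a completion payload; speculative decoding needs no
    client-side flags, so this is a plain completion request."""
    return {
        "model": "zai-org/GLM-4.7-Flash",  # assumed serving name
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def complete(prompt: str) -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (requires the server from the launch command to be running):
# print(complete("Explain speculative decoding in one sentence."))
```

The draft model accelerates decoding server-side; nothing in the request or response format changes relative to a baseline deployment.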