# EAGLE3 Draft Model for GLM-4.7-Flash

GLM-4.7-Flash-Eagle3 is an EAGLE3 draft model trained for speculative decoding with **GLM-4.7-Flash**. It speeds up inference by predicting several future tokens in parallel, which the target model then verifies in a single forward pass.

**Version:** 1.0
**Release Date:** 2026-02-16
**Organization:** ThoughtWorks
**License:** Apache-2.0

---

## Model Overview

This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) through speculative decoding. The draft model predicts multiple tokens ahead, achieving a **1.39× TPOT speedup** for single requests and a **1.7× throughput improvement** under concurrent load.

**Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a Mixture-of-Experts language model with 3B active parameters
**Draft Model Size**: 277.4 MB
**Architecture**: 1-layer transformer with a hidden size of 2048

### Key Features

- **FlashInfer Compatible**: head_dim=128 ✓
- **Acceptance Rate**: 40.0% (MT-Bench, B=1)
- **Speedup**: 1.39× TPOT (B=1), 1.7× throughput (B=32)
- **Hardware**: Benchmarked on a single NVIDIA H100 (TP=1); trained with TP=4

---

## Architecture Specifications

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2048 |
| Attention Heads | 16 |
| KV Heads (GQA) | 4 |
| Head Dimension | 128 |
| Intermediate Size | 8192 |
| Layers | 1 |
| Vocabulary Size | 154880 |
| Draft Vocab Size | 32000 |

**Note**: The hidden size matches the target model (GLM-4.7-Flash), which allows embedding weight sharing.

---

## Training Details

### Dataset

**Mixed Diversity** — 54K samples

Composition:

- 45% ShareGPT
- 35% UltraChat
- 20% PerfectBlend

Average tokens per sample: 1300
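
The mix above works out to the following per-source sample counts and overall token budget (a quick arithmetic sanity check that treats 54K and 1300 as exact round figures):

```python
# Per-source sample counts for the 54K-sample "Mixed Diversity" set.
total_samples = 54_000
mix = {"ShareGPT": 0.45, "UltraChat": 0.35, "PerfectBlend": 0.20}

samples_per_source = {name: round(total_samples * frac) for name, frac in mix.items()}
total_tokens = total_samples * 1_300  # at ~1300 tokens per sample

print(samples_per_source)  # {'ShareGPT': 24300, 'UltraChat': 18900, 'PerfectBlend': 10800}
print(f"~{total_tokens / 1e6:.1f}M training tokens")  # ~70.2M training tokens
```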

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch Size | 1 |
| Learning Rate | 1e-4 |
| Warmup Ratio | 0.03 |
| Max Length | 1024 |
| TP Size | 4 |

### Training Results

- **Training Acceptance Rate**: 79.2% (at position k=0)
- **Best Checkpoint**: epoch_2_step_37323
- **Experiment ID**: exp-K

---

## Benchmark Results

**Dataset**: MT-Bench (154 prompts, max_tokens=512, temperature=0.7)
**Hardware**: Single NVIDIA H100 (79 GB), TP=1
**Backend**: FlashInfer
**Spec Config**: num_steps=3, num_draft_tokens=6, eagle_topk=4

### Metric Definitions

- **Acceptance Rate**: Percentage of draft tokens accepted by the target model, averaged across all verification steps (not position-specific). Example: 40% means that, on average, 2.4 of the 6 drafted tokens are accepted.
- **Acceptance Length**: Average number of consecutive draft tokens accepted per verification step; this directly determines the speedup.
- **TTFT**: Time To First Token (prefill latency), in milliseconds
- **TPOT**: Time Per Output Token (decode latency), in milliseconds
- **Throughput**: Tokens generated per second
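
The definitions above can be tied together with a little arithmetic, using the B=1 figures reported in this card (a sanity check on the numbers, not a model of speculative decoding):

```python
# Acceptance rate x draft budget = mean accepted tokens (the acceptance length).
num_draft_tokens = 6
acceptance_rate = 0.40
acceptance_length = round(acceptance_rate * num_draft_tokens, 1)  # 2.4

# TPOT speedup is simply the ratio of per-token decode latencies.
baseline_tpot_ms, eagle3_tpot_ms = 8.18, 5.89
tpot_speedup = round(baseline_tpot_ms / eagle3_tpot_ms, 2)  # 1.39

print(acceptance_length, tpot_speedup)  # 2.4 1.39
```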

### Batch Size 1 (Single Request - Latency Optimization)

#### Server-Side Metrics (Prometheus — Ground Truth)

| Metric | Baseline | EAGLE3 | Speedup |
|--------|----------|--------|---------|
| TTFT (ms) | 76.1 | 74.74 | **1.02×** |
| TPOT (ms) | 8.18 | 5.89 | **1.39×** |
| Throughput (tok/s) | 120.3 | 167.75 | **1.39×** |
| Acceptance Rate | -- | **40.0%** | -- |
| Acceptance Length | -- | **2.4** | -- |

### Batch Size 32 (Concurrent Load - Throughput Optimization)

#### Server-Side Metrics (Prometheus — Ground Truth)

| Metric | Baseline | EAGLE3 | Speedup |
|--------|----------|--------|---------|
| TTFT (ms) | 2988 | 3210 | **0.93×** |
| TPOT (ms) | 22.57 | 17.33 | **1.3×** |
| Throughput (tok/s) | 258.61 | 440.15 | **1.7×** |

**Key Insight**: Batch size 1 optimizes for interactive latency (TPOT matters most), while batch size 32 optimizes for serving capacity (throughput matters most).
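
The batch-32 trade-off can be quantified directly from the table: EAGLE3 buys throughput at a small prefill cost (numbers copied from the table above):

```python
# Batch-size-32 server-side metrics from the table above.
baseline = {"ttft_ms": 2988, "tpot_ms": 22.57, "throughput_tok_s": 258.61}
eagle3 = {"ttft_ms": 3210, "tpot_ms": 17.33, "throughput_tok_s": 440.15}

throughput_gain = round(eagle3["throughput_tok_s"] / baseline["throughput_tok_s"], 2)  # 1.7
ttft_ratio = round(baseline["ttft_ms"] / eagle3["ttft_ms"], 2)  # 0.93: prefill gets slightly slower

print(throughput_gain, ttft_ratio)  # 1.7 0.93
```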

---

## Usage

### Installation

```bash
pip install sglang transformers
```

### Basic Usage

```bash
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-Flash-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 1 \
  --trust-remote-code \
  --port 30000 \
  --enable-metrics
```

### Python API

```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
        "temperature": 0.7,
    },
)
print(response.json())
```
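
For interactive clients, the same endpoint also supports OpenAI-style streaming. A sketch, assuming the server launched above is listening on localhost:30000 and emits standard `data: ...` server-sent-event chunks:

```python
import json

import requests


def extract_delta(chunk: dict) -> str:
    """Return the incremental text from one OpenAI-style streaming chunk."""
    return chunk["choices"][0]["delta"].get("content") or ""


def stream_chat(prompt: str, base_url: str = "http://localhost:30000") -> str:
    """Stream a chat completion and return the assembled reply."""
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={
            "model": "default",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 100,
            "temperature": 0.7,
            "stream": True,
        },
        stream=True,
    )
    pieces = []
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip keep-alives and blank lines
        payload = line[len(b"data: "):]
        if payload.strip() == b"[DONE]":
            break
        pieces.append(extract_delta(json.loads(payload)))
    return "".join(pieces)


# Example (requires a running server): print(stream_chat("Hello!"))
```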

### Performance Tips

1. **Backend Selection**: Use the FlashInfer backend (the default) for best performance.
2. **Tuning**: Adjust `num_draft_tokens` to the workload (3-6 recommended).
3. **Monitoring**: Pass `--enable-metrics` and watch the `/metrics` endpoint for acceptance rates.
4. **Validation**: After server startup, verify that the acceptance rate is above 0% to confirm the draft model loaded correctly.
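
Tips 3 and 4 can be automated with a small check against the metrics endpoint. A sketch: the exact series names SGLang exports vary by version, so this filters for anything acceptance-related rather than assuming a specific metric name:

```python
import requests


def acceptance_metrics(metrics_text: str) -> dict:
    """Pick acceptance-related series out of a Prometheus /metrics dump."""
    found = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        if "accept" in line.lower():
            name, _, value = line.rpartition(" ")
            try:
                found[name] = float(value)
            except ValueError:
                pass  # not a simple "name value" sample line
    return found


def check_draft_model(base_url: str = "http://localhost:30000") -> dict:
    """Fetch /metrics; an empty result suggests the draft model did not load."""
    return acceptance_metrics(requests.get(f"{base_url}/metrics").text)


# Example (requires a running server):
# assert check_draft_model(), "no acceptance metrics found -- is the draft model loaded?"
```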

---

## Limitations

- Requires an SGLang backend with EAGLE3 support
- Optimized for TP=1 inference (single-GPU deployment)
- FlashInfer backend recommended for best performance
- Head dimension of 128 ensures FlashInfer compatibility

---

## Citation

```bibtex
@misc{glm_4.7_flash_eagle3_2026,
  title={EAGLE3 Draft Model for GLM-4.7-Flash},
  author={ThoughtWorks},
  year={2026},
  howpublished={\url{https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3}},
}
```

### EAGLE3 Paper

```bibtex
@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}
```

---

## Additional Resources

- **Benchmark Results**: [mtbench_results.md](https://github.com/thoughtworks/baby-shark/blob/main/benchmark/docs/mtbench_results.md)
- **Training Guide**: [EXPERIMENT_EVOLUTION.md](https://github.com/thoughtworks/baby-shark/blob/main/training/docs/EXPERIMENT_EVOLUTION.md)
- **Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)

---

## License

Apache-2.0

---

## Contact

For questions or issues, please contact ThoughtWorks or open an issue in the [baby-shark repository](https://github.com/thoughtworks/baby-shark).