lujangusface committed
Commit c5e920c · verified · 1 Parent(s): 42b815f

Upload README.md with huggingface_hub

  1. README.md +163 -0
README.md ADDED
@@ -0,0 +1,163 @@
---
library_name: transformers
license: apache-2.0
language:
- en
base_model: MiniMaxAI/MiniMax-M2.5
pipeline_tag: text-generation
tags:
- eagle3
- speculative-decoding
- sglang
- draft-model
- moe
- mixture-of-experts
---

<!-- Internal: exp-f (gpu/minimax-m2) -->

# EAGLE3 Draft Head for MiniMax-M2.5

A lightweight EAGLE3 draft head for [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) (229B MoE, ~10B active parameters). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.

**Blog post**: [2x Faster on a 229B MoE: EAGLE3 Speculative Decoding for MiniMax-M2.5](https://huggingface.co/blog/lujangusface/tw-eagle3-minimax)

## Usage

### SGLang (GPU)

Requires our [SGLang fork](https://github.com/tails-mpt/sglang), which adds MiniMax-M2.5 Eagle3 support and FP8 dtype fixes.

**B=1 server** (wide tree, optimal for single-user, real-time requests):

```bash
pip install git+https://github.com/tails-mpt/sglang.git

python -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.5 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 8 \
  --speculative-eagle-topk 4 \
  --dtype fp8 \
  --tp 4 \
  --port 30000
```

**B=32 server** (narrow tree, optimal for batch workloads):

```bash
python -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.5 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --dtype fp8 \
  --tp 4 \
  --port 30002
```

**Important**: Use different speculative configs for B=1 and B=32. A wider tree (topk=4) exploits idle GPU compute at low batch sizes; a narrow tree (topk=1) minimizes MoE expert dispatch overhead at high batch sizes.
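
The two profiles differ only in three speculative flags and the port, which makes them easy to mix up. As a minimal sketch, assuming you start the server from a Python script, the helper below builds the argument list for either profile. `COMMON`, `PROFILES`, and `launch_args` are illustrative names and not part of SGLang; the flag values are copied from the two commands above.

```python
import shlex

# Flags shared by both server profiles (copied from the commands above).
COMMON = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "MiniMaxAI/MiniMax-M2.5",
    "--speculative-algorithm", "EAGLE3",
    "--speculative-draft-model-path", "thoughtworks/MiniMax-M2.5-Eagle3",
    "--dtype", "fp8",
    "--tp", "4",
]

# Hypothetical profile table: only these values differ between B=1 and B=32.
PROFILES = {
    "b1": {"num_steps": 3, "num_draft_tokens": 8, "eagle_topk": 4, "port": 30000},
    "b32": {"num_steps": 5, "num_draft_tokens": 6, "eagle_topk": 1, "port": 30002},
}


def launch_args(profile):
    """Return the full argv list for the given profile name ("b1" or "b32")."""
    cfg = PROFILES[profile]
    return COMMON + [
        "--speculative-num-steps", str(cfg["num_steps"]),
        "--speculative-num-draft-tokens", str(cfg["num_draft_tokens"]),
        "--speculative-eagle-topk", str(cfg["eagle_topk"]),
        "--port", str(cfg["port"]),
    ]


# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(launch_args("b32")))
```

The returned list can be passed to `subprocess.Popen` unchanged if you prefer to launch the server programmatically.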

### Python Client

```python
import requests

# Query the B=1 server (port 30000) through its OpenAI-compatible chat endpoint.
response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,  # greedy decoding, the setting used in the benchmarks
    },
    timeout=300,
)
response.raise_for_status()  # surface HTTP errors before parsing the body
print(response.json()["choices"][0]["message"]["content"])
```

## Training Details

| Parameter | Value |
|-----------|-------|
| Framework | [SpecForge](https://github.com/tails-mpt/SpecForge) (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 141GB (TP=4, DP=2) |
| Dataset | 20K regenerated samples (target-model responses at temp=0.8) |
| Pre-training | 9 epochs on 54K mixed data (ShareGPT 45% / UltraChat 35% / PerfectBlend 20%) |
| Fine-tuning | 6 epochs on 20K regenerated data |
| Learning rate | 2e-5 (final stage) |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 2048 |
| TTT length (training-time test steps) | 7 |
| Precision | bfloat16 |

### Training Method

EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 1, 30, and 58: early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates multi-step drafting during training by unrolling the head on its own earlier predictions, with the goal of maximizing the expected number of accepted tokens at inference time.
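
To make "accepted tokens" concrete, here is a minimal sketch of greedy (temp=0) verification for a single draft chain, the simplest form of the accept/reject step in speculative decoding. It is an illustration only and not the SGLang or SpecForge implementation; `verify_greedy` and its arguments are hypothetical names.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Return the tokens emitted by one greedy verification step.

    draft_tokens:  k tokens proposed by the draft head.
    target_tokens: k + 1 greedy tokens from the target model, where
                   target_tokens[i] is the target's choice given the prefix
                   followed by draft_tokens[:i].
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First mismatch: keep the target's own token and stop.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)

    # Every draft token matched, so the target's extra token is kept as well.
    emitted.append(target_tokens[-1])
    return emitted


# Two of three draft tokens match; the third is replaced by the target's token.
print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))  # [5, 7, 2]
```

Each verification step therefore emits between 1 and k + 1 tokens for one target forward pass, which is where the speedup comes from when the draft head is accurate.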

## Performance

### Training Accuracy (base checkpoint, before regenerated data fine-tuning)

| Position | Accuracy |
|----------|----------|
| acc_0 | 0.820 |
| acc_1 | 0.809 |
| acc_2 | 0.781 |
| acc_3 | 0.789 |
| acc_4 | 0.777 |
| acc_5 | 0.761 |
| acc_6 | 0.730 |

*The released model was fine-tuned for 6 additional epochs on 20K regenerated samples from the target model. The fine-tuned accuracy is expected to be equal to or higher than these base values.*
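
As a rough way to read the accuracy table, the per-position values can be turned into an expected number of accepted draft tokens for a single chain (topk=1). The sketch below assumes each `acc_i` is the probability that position i is correct given that all earlier positions were correct, and that training accuracy carries over to inference. Neither assumption is guaranteed, so treat the output as a ballpark estimate, not a measured acceptance length. `expected_accepted` is a hypothetical helper.

```python
# Per-position training accuracies copied from the table above.
ACC = [0.820, 0.809, 0.781, 0.789, 0.777, 0.761, 0.730]


def expected_accepted(acc, num_steps):
    """Expected accepted draft tokens for a chain of num_steps drafted tokens.

    Position i is only accepted if positions 0..i-1 were accepted too, so the
    expectation is the sum of the running products of the accuracies.
    """
    expected = 0.0
    survive = 1.0  # probability that the chain is still intact
    for p in acc[:num_steps]:
        survive *= p
        expected += survive
    return expected


for steps in (3, 5, 7):
    print(f"steps={steps}: ~{expected_accepted(ACC, steps):.2f} accepted draft tokens")
```

The diminishing gain per extra step is one reason deeper chains do not pay off indefinitely: each additional position is accepted only if every earlier one was.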

### Inference Benchmarks (B=1, temp=0, FP8, TP=4)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|------------------|----------------|---------|
| HumanEval | 109.3 | 230.6 | **2.11x** |
| MT-Bench | 109.9 | 195.6 | **1.78x** |
| SWEBench-Verified | 109.6 | 191.8 | **1.75x** |
| Aider | 109.9 | 186.8 | **1.70x** |

*Config: steps=3, topk=4, draft_tokens=8. All datasets at temp=0 on 8x H200 (TP=4).*
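
The speedup column is the ratio of the two throughput columns. The short sketch below recomputes it from the table so the figures can be checked or extended with your own measurements; `RESULTS` and `speedup` are illustrative names, and the numbers are copied from the table above.

```python
# (baseline tok/s, EAGLE3 tok/s, reported speedup) copied from the table above.
RESULTS = {
    "HumanEval": (109.3, 230.6, 2.11),
    "MT-Bench": (109.9, 195.6, 1.78),
    "SWEBench-Verified": (109.6, 191.8, 1.75),
    "Aider": (109.9, 186.8, 1.70),
}


def speedup(baseline, accelerated):
    """Throughput ratio of the speculative run over the baseline run."""
    return accelerated / baseline


for name, (base, fast, reported) in RESULTS.items():
    print(f"{name:18s} {speedup(base, fast):.2f}x (reported {reported:.2f}x)")

# Unweighted mean across the four datasets.
mean_speedup = sum(speedup(b, f) for b, f, _ in RESULTS.values()) / len(RESULTS)
print(f"mean speedup: {mean_speedup:.2f}x")
```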

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 3072 |
| Num hidden layers | 1 |
| Num attention heads | 24 (8 KV heads) |
| Intermediate size | 8192 |
| Auxiliary layers | [1, 30, 58] |
| Vocab size | 200064 (target) / 32000 (draft) |
| Checkpoint size | ~464 MB |

## Limitations

- **TP=4 only.** TP=8 fails an FP8 block-size constraint: `intermediate_size / 8 = 192`, which is not divisible by `block_n=128`.
- **Temperature sensitivity.** Best performance at temp=0 (greedy). At temp=0.7, B=1 speedup drops to 1.27-1.80x and some B=32 datasets regress below baseline.
- **Coding-focused benchmarks.** Three of the four benchmarks are coding-oriented (HumanEval, SWEBench-Verified, Aider); MT-Bench is the only conversational one, so other conversational workloads may show different patterns.
- **SPEC_V2 incompatible.** The overlap scheduler (`SGLANG_ENABLE_SPEC_V2=true`) is not supported; only standard (non-overlapped) speculation works.
- **Requires SGLang fork.** Upstream SGLang does not yet include the FP8 dtype patches needed for Eagle3 on this model.
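
The TP constraint in the first bullet is plain divisibility arithmetic. The sketch below assumes the sharded dimension is 1536, which is inferred from the `intermediate_size / 8 = 192` figure (it cannot be the draft head's 8192, since 8192 / 8 = 1024), and that each TP shard must be a whole number of `block_n=128` blocks. `shard_is_fp8_aligned` is a hypothetical helper, not an SGLang function.

```python
BLOCK_N = 128
# Assumption: inferred from "intermediate_size / 8 = 192" in the bullet above.
INTERMEDIATE_SIZE = 192 * 8  # 1536


def shard_is_fp8_aligned(intermediate_size, tp, block_n=BLOCK_N):
    """True if each TP shard of the dimension is a whole number of FP8 blocks."""
    if intermediate_size % tp != 0:
        return False
    return (intermediate_size // tp) % block_n == 0


for tp in (1, 2, 4, 8):
    shard = INTERMEDIATE_SIZE // tp
    status = "ok" if shard_is_fp8_aligned(INTERMEDIATE_SIZE, tp) else "misaligned"
    print(f"TP={tp}: shard width {shard} -> {status}")
```

Under these assumptions TP=4 gives a shard width of 384 (three blocks of 128), while TP=8 gives 192, which leaves half a block over.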

## License

This draft head is released under Apache 2.0, matching the [MiniMax-M2.5 license](https://huggingface.co/MiniMaxAI/MiniMax-M2.5).

## Citation

```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```