---
license: mit
datasets:
- AQ-MedAI/Ling-flash-2.0-open-perfectblend-regenerate
---
# Ling-Flash-2.0-eagle3

## Model Overview

**Ling-Flash-2.0-eagle3** is a high-performance draft model designed for inference acceleration, using EAGLE3 speculative sampling to balance inference performance and model stability.

The model is trained on **1.4 million high-quality instruction examples from the Open-PerfectBlend dataset**, significantly boosting inference throughput while maintaining high accuracy, making it well suited to high-load production environments.

## Key Features

- **Speculative Sampling Optimization**: Built on EAGLE3, achieving a high verification pass rate at a speculative length of 4 (see the sketch after this list)
- **Outstanding Throughput**: FP8 quantization combined with EAGLE3 improves throughput by up to 94%
- **High Accuracy**: Maintains 93%+ accuracy on mainstream benchmarks
- **Production-Grade Optimization**: Reaches 3954 tokens/s output throughput on a single NVIDIA H200

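To make the draft-and-verify idea behind these features concrete, here is a minimal, self-contained Python sketch of a speculative decoding loop. It is schematic only: `draft_next` and `target_next` are toy stand-ins rather than Ling or EAGLE3 models, and a real system verifies all drafted tokens in a single batched forward pass of the target model.

```python
# Schematic speculative decoding loop (toy models; NOT the EAGLE3/SGLang code).
import random

random.seed(0)
VOCAB = list(range(100))

def draft_next(token: int) -> int:
    # Toy stand-in for the cheap draft model.
    return (token * 31 + 7) % 100

def target_next(token: int) -> int:
    # Toy stand-in for the expensive target model; agrees with the draft
    # most of the time, mimicking a well-trained draft.
    return draft_next(token) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(token: int, draft_len: int = 4) -> list:
    """Draft `draft_len` tokens, then verify them against the target.

    The first disagreement is replaced by the target's own token, so every
    step yields at least one target-quality token.
    """
    drafted, t = [], token
    for _ in range(draft_len):
        t = draft_next(t)
        drafted.append(t)

    accepted, t = [], token
    for d in drafted:
        expected = target_next(t)
        if expected == d:
            accepted.append(d)         # drafted token accepted "for free"
            t = d
        else:
            accepted.append(expected)  # mismatch: keep the target's token, stop
            break
    return accepted

out, tok, steps = [], 1, 0
while len(out) < 64:
    chunk = speculative_step(tok)
    out.extend(chunk)
    tok = out[-1]
    steps += 1

print(f"{len(out)} tokens in {steps} target steps "
      f"(average acceptance length {len(out) / steps:.2f})")
```

The `--speculative-num-draft-tokens 4` flag in the Quick Start section plays the role of `draft_len` here.
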
## Performance

### Speculative Sampling Efficiency

Average acceptance length at a speculative length of 4:

| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 3.100 |
| GSM8K | 3.412 |
| Math-500 | 3.428 |

An average acceptance length of about 3.4 means each target-model verification step emits roughly 3.4 tokens instead of one, before accounting for draft-model overhead.

### Throughput Improvement

With **FP8 quantization + EAGLE3**, throughput improvement over FP8-only at a concurrency of 32:

| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+71%** |
| GSM8K | **+45%** |
| Math-500 | **+94%** |

### Peak Inference Performance

- **Hardware Environment**: single NVIDIA H200 GPU
- **Peak Throughput**: Math-500 reaches **3954 tokens/s** at a concurrency of 64
- **Accuracy**: Maintains 93%-97% accuracy on mainstream benchmarks

![H200_Accuracy_Refined](https://hackmd.io/_uploads/r1zVyhM7Zg.png)
![H200_Final_Poster_Math-500](https://hackmd.io/_uploads/rkfVJ2zmWg.png)
![H200_Final_Poster_HumanEval](https://hackmd.io/_uploads/H1fN13G7-g.png)
![H200_Final_Poster_GSM8K](https://hackmd.io/_uploads/H1MVyhzmbx.png)

*Figure: Throughput comparison and accuracy metrics under equal compute on 1xH200*

## Technical Specifications

- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 (draft model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16

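To cross-check these values against the shipped configuration, a quick sketch (assumes `huggingface_hub` is installed; it simply prints the repository's raw `config.json`):

```python
# Download and print the draft model's config.json to verify the specs above.
# Assumes: pip install huggingface_hub
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download("AQ-MedAI/Ling-Flash-2.0-eagle3", "config.json")
with open(path) as f:
    print(json.dumps(json.load(f), indent=2))
```
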
## Quick Start

### Requirements

- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+

### Installation

```bash
pip install sglang==0.5.6
```

### Inference with SGLang

```bash
python3 -m sglang.launch_server \
    --model-path /models/Ling-flash-2.0-FP8 \
    --host 0.0.0.0 --port 30012 \
    --trust-remote-code \
    --attention-backend fa3 \
    --mem-fraction-static 0.9 \
    --tp-size 1 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path AQ-MedAI/Ling-Flash-2.0-eagle3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```

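Once the server is up, it exposes an OpenAI-compatible API on the port chosen above. A minimal client sketch (assumes `pip install openai`; the `model` value mirrors the `--model-path` used at launch):

```python
# Minimal client for the SGLang server launched above (OpenAI-compatible API).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30012/v1",  # --port from the launch command
    api_key="EMPTY",                       # no --api-key was set at launch
)

response = client.chat.completions.create(
    model="/models/Ling-flash-2.0-FP8",    # the --model-path used at launch
    messages=[{"role": "user", "content": "Write a haiku about fast inference."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Speculative decoding is transparent to the client: requests are served exactly as without EAGLE3, only faster.
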
## Evaluation Results

### Accuracy Comparison

| Dataset | FP8 | FP8 + EAGLE3 |
|---------|-----|--------------|
| HumanEval | 93.29% | 93.29% |
| GSM8K | 96.59% | 96.74% |
| Math-500 | 95.80% | 96.20% |

### Detailed Throughput Data (tokens/s on 1xH200, FP8 → FP8 + EAGLE3)

| Concurrency | HumanEval | GSM8K | Math-500 |
|-------------|-----------|-------|----------|
| 1 | 196 → 330 (+68%) | 186 → 328 (+76%) | 197 → 364 (+85%) |
| 4 | 513 → 807 (+57%) | 469 → 721 (+54%) | 521 → 896 (+72%) |
| 8 | 725 → 1187 (+64%) | 673 → 1023 (+52%) | 755 → 1354 (+79%) |
| 16 | 1029 → 1704 (+66%) | 955 → 1412 (+48%) | 1103 → 2048 (+86%) |
| 32 | 1432 → 2451 (+71%) | 1364 → 1982 (+45%) | 1612 → 3120 (+94%) |
| 64 | 1931 → 3005 (+56%) | 2020 → 2420 (+20%) | 2415 → 3954 (+64%) |

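The percentage gains above follow directly from the raw throughputs; for example, a quick recomputation of the Math-500 column:

```python
# Recompute the Math-500 percentage gains from the raw throughputs above.
math500 = {1: (197, 364), 4: (521, 896), 8: (755, 1354),
           16: (1103, 2048), 32: (1612, 3120), 64: (2415, 3954)}

for concurrency, (fp8, fp8_eagle3) in math500.items():
    gain = (fp8_eagle3 - fp8) / fp8 * 100
    print(f"concurrency {concurrency:>2}: "
          f"{fp8} -> {fp8_eagle3} tokens/s (+{gain:.0f}%)")
```
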
## Training Data

- **Open-PerfectBlend Instruction Set**: 1.4 million high-quality instruction examples
- **Data Quality**: Rigorously filtered and cleaned

## Use Cases

- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses

## Open Source Contribution

We actively contribute back to the open-source community. Related optimizations have been submitted to the **SGLang community**:
- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)

## Limitations and Notes

- This is a draft model; it must be paired with a target model to perform speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across hardware platforms
- Medical domain applications must comply with relevant regulations; model outputs are for reference only

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{Ling-flash-2-eagle3,
  title={Ling-Flash-2.0-eagle3: High-Performance Draft Model for Speculative Decoding},
  author={Ant AQ Team},
  year={2025},
}
```

## License

The model weights are released under the MIT License.

---