---
license: mit
datasets:
- AQ-MedAI/Ling-flash-2.0-open-perfectblend-regenerate
---
# Ling-Flash-2.0-eagle3

## Model Overview

**Ling-Flash-2.0-eagle3** is a high-performance draft model designed for inference acceleration. Built on EAGLE3 speculative sampling, it balances inference speed with output stability.

The model is trained on **1.4 million high-quality instruction samples from the Open-PerfectBlend dataset**, significantly boosting inference throughput while maintaining high accuracy, which makes it well suited for high-load production environments.

## Key Features

- **Speculative Sampling Optimization**: Built on EAGLE3, achieving a high verification pass rate at a speculative length of 4
- **Outstanding Throughput**: With FP8 quantization + EAGLE3, throughput improves by up to 94%
- **High Accuracy**: Maintains 93%+ accuracy on mainstream benchmarks
- **Production-Grade Optimization**: Up to 3,954 tokens/s output throughput on a single NVIDIA H200

## Efficient Download Guide

To minimize download time and storage usage, note what each file in the repository is for:

**For inference**: you only need `config.json` and `model.safetensors`.

**For continued training**: `training_state.pt` contains optimizer states for resuming training. If you only intend to run inference, you can skip this file.
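As a sketch, the inference-only files can be fetched with the Hugging Face CLI (this assumes `huggingface_hub` is installed and uses the repository id from this model card; the local directory name is arbitrary):

```shell
# Inference-only download: fetch just the config and weights,
# skipping training_state.pt
huggingface-cli download AQ-MedAI/Ling-Flash-2.0-eagle3 \
    config.json model.safetensors \
    --local-dir ./Ling-Flash-2.0-eagle3
```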


## Performance

### Speculative Sampling Efficiency

Average acceptance length at a speculative length of 4:

| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 3.100 |
| GSM8K | 3.412 |
| Math-500 | 3.428 |
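As a rough illustration of what these acceptance lengths mean: if each target-model verification step accepts `accept_len` tokens on average, an idealized per-step speedup can be estimated once a relative drafting cost is assumed. The cost model below (`draft_cost_ratio`) is a hypothetical assumption for illustration, not a measured number:

```python
def ideal_speedup(accept_len, num_draft=4, draft_cost_ratio=0.1):
    """Idealized speculative-decoding speedup.

    One target forward pass verifies `accept_len` tokens on average,
    while drafting `num_draft` tokens is assumed to cost
    `num_draft * draft_cost_ratio` target-forward-equivalents.
    """
    return accept_len / (1 + num_draft * draft_cost_ratio)

# Acceptance lengths from the table above
for name, al in [("HumanEval", 3.100), ("GSM8K", 3.412), ("Math-500", 3.428)]:
    print(f"{name}: ~{ideal_speedup(al):.2f}x (idealized)")
```

Real-world gains are lower than this idealized bound, since scheduling and batching overheads are not modeled; the measured throughput numbers below are the authoritative figures.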

### Throughput Improvement

Using **FP8 quantization + EAGLE3 optimization**, throughput improvement compared to FP8-only at 32 concurrency:

| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+71%** |
| GSM8K | **+45%** |
| Math-500 | **+94%** |

### Peak Inference Performance

- **Hardware Environment**: single NVIDIA H200 GPU
- **Peak Throughput**: Math-500 reaches **3,954 tokens/s** at 64 concurrency
- **Accuracy**: maintains 93%-97% accuracy on mainstream benchmarks

![H200_Accuracy_Refined](https://hackmd.io/_uploads/r1zVyhM7Zg.png)
![H200_Final_Poster_Math-500](https://hackmd.io/_uploads/rkfVJ2zmWg.png)
![H200_Final_Poster_HumanEval](https://hackmd.io/_uploads/H1fN13G7-g.png)
![H200_Final_Poster_GSM8K](https://hackmd.io/_uploads/H1MVyhzmbx.png)




*Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200*

## Technical Specifications

- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 layer (Draft Model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16
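Put together, these specifications correspond to a draft-model configuration along the following lines. This is a sketch using common Llama-style `config.json` field names; the actual file in this repository may differ:

```json
{
  "architectures": ["LlamaForCausalLMEagle3"],
  "num_hidden_layers": 1,
  "hidden_size": 4096,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "intermediate_size": 14336,
  "vocab_size": 157184,
  "max_position_embeddings": 32768,
  "torch_dtype": "bfloat16"
}
```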

## Quick Start

### Requirements

- NVIDIA GPU 
- CUDA 12.0+
- PyTorch 2.0+

### Installation

```bash
pip install sglang==0.5.6
```
You will also need the changes from SGLang PR [#15119](https://github.com/sgl-project/sglang/pull/15119).

### Inference with SGLang

```bash
python3 -m sglang.launch_server \
    --model-path /models/Ling-flash-2.0-FP8 \
    --host 0.0.0.0 --port 30012 \
    --trust-remote-code \
    --attention-backend fa3 \
    --mem-fraction-static 0.9 \
    --tp-size 1 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path AQ-MedAI/Ling-Flash-2.0-eagle3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```
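Once the server is running, SGLang exposes an OpenAI-compatible HTTP API. A minimal request sketch against the port from the launch command above (the `model` field is illustrative and should match the served model):

```shell
curl -s http://localhost:30012/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Ling-flash-2.0-FP8",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 64
        }'
```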

## Evaluation Results

### Accuracy Comparison

| Dataset | FP8 | FP8 + EAGLE3 |
|---------|-----|--------------|
| HumanEval | 93.29% | 93.29% |
| GSM8K | 96.59% | 96.74% |
| Math-500 | 95.80% | 96.20% |

### Detailed Throughput Data (tokens/s on 1xH200)

**HumanEval:**
- Concurrency 1: 196 β†’ 330 (+68%)
- Concurrency 4: 513 β†’ 807 (+57%)
- Concurrency 8: 725 β†’ 1187 (+64%)
- Concurrency 16: 1029 β†’ 1704 (+66%)
- Concurrency 32: 1432 β†’ 2451 (+71%)
- Concurrency 64: 1931 β†’ 3005 (+56%)

**GSM8K:**
- Concurrency 1: 186 β†’ 328 (+76%)
- Concurrency 4: 469 β†’ 721 (+54%)
- Concurrency 8: 673 β†’ 1023 (+52%)
- Concurrency 16: 955 β†’ 1412 (+48%)
- Concurrency 32: 1364 β†’ 1982 (+45%)
- Concurrency 64: 2020 β†’ 2420 (+20%)

**Math-500:**
- Concurrency 1: 197 β†’ 364 (+85%)
- Concurrency 4: 521 β†’ 896 (+72%)
- Concurrency 8: 755 β†’ 1354 (+79%)
- Concurrency 16: 1103 β†’ 2048 (+86%)
- Concurrency 32: 1612 β†’ 3120 (+94%)
- Concurrency 64: 2415 β†’ 3954 (+64%)
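The per-concurrency improvement percentages follow directly from the baseline and accelerated throughput numbers; a quick sanity check using the Math-500 figures copied from the list above:

```python
def pct_gain(base, accel):
    """Throughput improvement in percent, rounded to the nearest integer."""
    return round((accel / base - 1) * 100)

# concurrency: (FP8-only, FP8 + EAGLE3) tokens/s, from the Math-500 list
math500 = {1: (197, 364), 4: (521, 896), 8: (755, 1354),
           16: (1103, 2048), 32: (1612, 3120), 64: (2415, 3954)}
for conc, (base, accel) in math500.items():
    print(f"Concurrency {conc}: +{pct_gain(base, accel)}%")
```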

## Training Data

- **Open-PerfectBlend Instruction Set**: 1.4 million high-quality instruction samples
- **Data Quality**: Rigorously filtered and cleaned to ensure high-quality training data

## Use Cases

- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses

## Open Source Contribution

We actively contribute back to the open-source community; the related optimizations have been submitted to the **SGLang community**:
- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)

## Limitations and Notes

- This is a draft model; it must be paired with a target model (e.g., Ling-Flash-2.0) to perform speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across different hardware platforms
- Medical domain applications must comply with relevant regulations; model outputs are for reference only

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{Ling-flash-2-eagle3,
  title={Ling-Flash-2.0-eagle3: High-Performance Draft Model for Speculative Decoding},
  author={Ant AQ Team},
  year={2025},
}
```

## License

The model weights are released under the MIT License. 

---