---
license: mit
datasets:
- AQ-MedAI/Ling-flash-2.0-open-perfectblend-regenerate
---
# Ling-Flash-2.0-eagle3
## Model Overview
**Ling-Flash-2.0-eagle3** is a high-performance draft model designed for inference acceleration. Built on EAGLE3 speculative sampling, it balances inference speed with model stability.
The model is trained on **1.4 million high-quality instruction samples from the Open-PerfectBlend dataset**, significantly boosting inference throughput while maintaining high accuracy in high-load production environments.
## Key Features
- **Speculative Sampling Optimization**: Based on EAGLE3, achieving a high verification pass rate at a speculative length of 4 (a toy sketch of the draft-then-verify loop follows this list)
- **Outstanding Throughput Performance**: FP8 quantization + EAGLE3 improves throughput by up to 94%
- **High Accuracy Guarantee**: Maintains 93%+ accuracy on mainstream benchmarks
- **Production-Grade Optimization**: Reaches 3,954 tokens/s output throughput on a single NVIDIA H200
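EAGLE3 itself reuses target-model hidden states through a trained one-layer draft head, which is beyond a few lines of code, but the core draft-then-verify loop it accelerates is easy to sketch. Below is a toy greedy variant, assuming Hugging Face-style causal LMs that return `.logits`; it is illustrative only, not the EAGLE3 implementation.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids, k=4):
    """One greedy draft-then-verify step (toy sketch, batch size 1)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = input_ids
    for _ in range(k):
        next_tok = draft(proposal).logits[:, -1, :].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, next_tok], dim=-1)

    # 2. The target model scores all k drafted tokens in ONE forward pass
    #    (this is where the speedup comes from).
    logits = target(proposal).logits
    start = input_ids.shape[1]
    target_choice = logits[:, start - 1:-1, :].argmax(-1)  # target's pick per drafted slot
    drafted = proposal[:, start:]

    # 3. Keep the longest prefix where draft and target agree, then append
    #    one "bonus" token from the target, so every step emits >= 1 token.
    n_accept = int((target_choice == drafted).int().cumprod(-1).sum())
    bonus = logits[:, start - 1 + n_accept, :].argmax(-1, keepdim=True)
    return torch.cat([input_ids, drafted[:, :n_accept], bonus], dim=-1)
```

When the acceptance rate is high, the target model runs far fewer forward passes per generated token, which is what the throughput numbers below reflect.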
## Efficient Download Guide
To minimize download time and storage usage, note what each file in the repository is for:
**For inference**: you only need `config.json` and `model.safetensors` (see the download sketch below).
**For continued training**: `training_state.pt` contains optimizer states for resuming training; skip it if you only intend to run inference.
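With `huggingface_hub`, one way to fetch only the inference files is to whitelist them (a minimal sketch; the `local_dir` path is an arbitrary choice):

```python
from huggingface_hub import snapshot_download

# Download only the inference files; training_state.pt is skipped because
# it matches no entry in allow_patterns.
snapshot_download(
    repo_id="AQ-MedAI/Ling-Flash-2.0-eagle3",
    allow_patterns=["config.json", "*.safetensors"],
    local_dir="./Ling-Flash-2.0-eagle3",
)
```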
## Performance
### Speculative Sampling Efficiency
Average acceptance length at a speculative length of 4:
| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 3.100 |
| GSM8K | 3.412 |
| Math-500 | 3.428 |
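As rough intuition for what these numbers buy (an estimate under stated assumptions, not a reported metric): if each verify step costs about one target forward pass plus a drafting overhead `c`, and commits `accept_len` tokens on average, the ideal per-step speedup is roughly `accept_len / (1 + c)`:

```python
def rough_speedup(accept_len: float, draft_overhead: float = 0.15) -> float:
    """Back-of-the-envelope speedup bound from average acceptance length.

    draft_overhead is an assumed drafting cost as a fraction of one target
    forward pass; 0.15 is a placeholder, the real value depends on hardware,
    batch size, and the draft model itself.
    """
    return accept_len / (1.0 + draft_overhead)

for name, tau in [("HumanEval", 3.100), ("GSM8K", 3.412), ("Math-500", 3.428)]:
    print(f"{name}: ~{rough_speedup(tau):.2f}x ideal speedup")
```

Actual end-to-end gains also depend on batching and memory bandwidth, which is why the measured improvements below vary with concurrency.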
### Throughput Improvement
Throughput improvement of **FP8 quantization + EAGLE3 optimization** over FP8-only at 32 concurrency:
| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+71%** |
| GSM8K | **+45%** |
| Math-500 | **+94%** |
### Ultimate Inference Performance
- **Hardware Environment**: NVIDIA H200 single GPU
- **Peak Throughput**: Math-500 reaches **3954 tokens/s** at 64 concurrency
- **Accuracy**: Maintains 93%-97% high accuracy on mainstream benchmarks

*Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200*
## Technical Specifications
- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 (draft model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16
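To sanity-check these values against the shipped `config.json`, you can fetch just that file (the field names below assume a standard Llama-style config and may differ; verify against the actual file):

```python
import json
from huggingface_hub import hf_hub_download

# Fetch only the config and print the architecture fields listed above.
path = hf_hub_download("AQ-MedAI/Ling-Flash-2.0-eagle3", "config.json")
with open(path) as f:
    cfg = json.load(f)
for key in ("num_hidden_layers", "hidden_size", "num_attention_heads",
            "num_key_value_heads", "intermediate_size", "vocab_size",
            "max_position_embeddings", "torch_dtype"):
    print(f"{key}: {cfg.get(key)}")
```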
## Quick Start
### Requirements
- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+
### Installation
```bash
pip install sglang==0.5.6
```
and apply the changes from PR https://github.com/sgl-project/sglang/pull/15119 on top of the installed version.
### Inference with SGLang
```bash
python3 -m sglang.launch_server \
--model-path /models/Ling-flash-2.0-FP8 \
--host 0.0.0.0 --port 30012 \
--trust-remote-code \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--tp-size 1 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path AQ-MedAI/Ling-Flash-2.0-eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
```
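The server exposes an OpenAI-compatible API, so any standard client works; for example (the `model` value must match the `--model-path` used at launch, and the prompt is arbitrary):

```python
from openai import OpenAI

# SGLang serves an OpenAI-compatible endpoint on the port chosen above.
client = OpenAI(base_url="http://localhost:30012/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/models/Ling-flash-2.0-FP8",  # matches --model-path at launch
    messages=[{"role": "user", "content": "Implement binary search in Python."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```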
## Evaluation Results
### Accuracy Comparison
| Dataset | FP8 | FP8 + EAGLE3 |
|---------|-----|--------------|
| HumanEval | 93.29% | 93.29% |
| GSM8K | 96.59% | 96.74% |
| Math-500 | 95.80% | 96.20% |
### Detailed Throughput Data (tokens/s on 1xH200)
**HumanEval:**

| Concurrency | FP8 | FP8 + EAGLE3 | Improvement |
|-------------|-----|--------------|-------------|
| 1 | 196 | 330 | +68% |
| 4 | 513 | 807 | +57% |
| 8 | 725 | 1187 | +64% |
| 16 | 1029 | 1704 | +66% |
| 32 | 1432 | 2451 | +71% |
| 64 | 1931 | 3005 | +56% |

**GSM8K:**

| Concurrency | FP8 | FP8 + EAGLE3 | Improvement |
|-------------|-----|--------------|-------------|
| 1 | 186 | 328 | +76% |
| 4 | 469 | 721 | +54% |
| 8 | 673 | 1023 | +52% |
| 16 | 955 | 1412 | +48% |
| 32 | 1364 | 1982 | +45% |
| 64 | 2020 | 2420 | +20% |

**Math-500:**

| Concurrency | FP8 | FP8 + EAGLE3 | Improvement |
|-------------|-----|--------------|-------------|
| 1 | 197 | 364 | +85% |
| 4 | 521 | 896 | +72% |
| 8 | 755 | 1354 | +79% |
| 16 | 1103 | 2048 | +86% |
| 32 | 1612 | 3120 | +94% |
| 64 | 2415 | 3954 | +64% |
## Training Data
- **Open-PerfectBlend Instruction Set**: 1.4 million high-quality instruction samples
- **Data Quality**: rigorously filtered and cleaned before training
## Use Cases
- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses
## Open Source Contribution
We actively contribute back to the open-source community; the related optimizations have been submitted upstream to the **SGLang** project:
- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)
## Limitations and Notes
- This is a draft model; it must be paired with a target model (e.g., Ling-flash-2.0) to perform speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across different hardware platforms
- Medical domain applications must comply with relevant regulations; model outputs are for reference only
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{Ling-flash-2-eagle3,
  title  = {Ling-Flash-2.0-eagle3: High-Performance Draft Model for Speculative Decoding},
  author = {Ant AQ Team},
  year   = {2025}
}
```
## License
The model weights are released under the MIT License.