---
license: mit
datasets:
- AQ-MedAI/Ling-flash-2.0-open-perfectblend-regenerate
---
# Ling-Flash-2.0-eagle3
## Model Overview
**Ling-Flash-2.0-eagle3** is a high-performance draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to balance inference speed with output stability.
The model is trained on **1.4 million high-quality instruction samples from the Open-PerfectBlend dataset**, significantly boosting inference throughput while maintaining high accuracy in high-load production environments.
## Key Features
- **Speculative Sampling Optimization**: Built on EAGLE3, achieving a high verification pass rate with a speculative length of 4
- **Outstanding Throughput**: With FP8 quantization + EAGLE3, throughput improves by up to 94%
- **High Accuracy**: Maintains 93%+ accuracy on mainstream benchmarks
- **Production-Grade Optimization**: Reaches 3954 tokens/s output throughput on a single NVIDIA H200
## Efficient Download Guide
To minimize download time and storage usage, please note the function of the files in the repository:
**For Inference**: You only need to download `config.json` and `model.safetensors`.
**For Continued Training**: `training_state.pt` contains optimizer states for resuming training. If you only intend to run inference, you can skip this file.
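A minimal download sketch using `huggingface_hub` (an assumption; any download method works). It restricts the snapshot to the two inference files so `training_state.pt` is never fetched; the repo id matches the one used in the Quick Start section:

```python
# Sketch: download only the files needed for inference, skipping
# training_state.pt. Assumes the huggingface_hub package is installed.

# Only these files are needed for inference.
INFERENCE_FILES = ["config.json", "model.safetensors"]

def download_for_inference(repo_id: str = "AQ-MedAI/Ling-Flash-2.0-eagle3") -> str:
    """Download only the inference artifacts and return the local directory."""
    # Import kept local so the file list can be inspected without the dependency.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, allow_patterns=INFERENCE_FILES)

if __name__ == "__main__":
    print(download_for_inference())
```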
## Performance
### Speculative Sampling Efficiency
Average acceptance length with a speculative length of 4:
| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 3.100 |
| GSM8K | 3.412 |
| Math-500 | 3.428 |
### Throughput Improvement
Using **FP8 quantization + EAGLE3 optimization**, throughput improvement compared to FP8-only at 32 concurrency:
| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+71%** |
| GSM8K | **+45%** |
| Math-500 | **+94%** |
### Peak Inference Performance
- **Hardware**: single NVIDIA H200 GPU
- **Peak Throughput**: **3954 tokens/s** on Math-500 at 64 concurrency
- **Accuracy**: maintains 93%–97% accuracy on mainstream benchmarks
![H200_Accuracy_Refined](https://hackmd.io/_uploads/r1zVyhM7Zg.png)
![H200_Final_Poster_Math-500](https://hackmd.io/_uploads/rkfVJ2zmWg.png)
![H200_Final_Poster_HumanEval](https://hackmd.io/_uploads/H1fN13G7-g.png)
![H200_Final_Poster_GSM8K](https://hackmd.io/_uploads/H1MVyhzmbx.png)
*Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200*
## Technical Specifications
- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 (draft model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16
## Quick Start
### Requirements
- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+
### Installation
```bash
pip install sglang==0.5.6
```
You also need to include the changes from PR https://github.com/sgl-project/sglang/pull/15119.
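If the PR is not yet in your installed release, one way to pick it up is to build SGLang from source with the PR's branch checked out. This is a sketch, not an official install path; it assumes `git` and `pip`, and relies on GitHub exposing each pull request's head at `pull/<number>/head`:

```bash
# Sketch: install SGLang from source with PR #15119 applied.
git clone https://github.com/sgl-project/sglang.git
cd sglang
# GitHub exposes every PR's head ref at pull/<number>/head.
git fetch origin pull/15119/head:pr-15119
git checkout pr-15119
# Editable install from the repo's python/ directory (per the SGLang README).
pip install -e "python[all]"
```

Once the PR is merged into a tagged release, the plain `pip install` above is sufficient.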
### Inference with SGLang
```bash
python3 -m sglang.launch_server \
--model-path /models/Ling-flash-2.0-FP8 \
--host 0.0.0.0 --port 30012 \
--trust-remote-code \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--tp-size 1 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path AQ-MedAI/Ling-Flash-2.0-eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
```
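Once the server is up, it exposes an OpenAI-compatible API. A minimal client sketch using only the standard library; the host/port come from the launch command above, while the model name is a placeholder (the server reports its served model ids at `/v1/models`):

```python
# Sketch: query the OpenAI-compatible /v1/chat/completions endpoint
# exposed by sglang.launch_server.
import json
import urllib.request

def build_request(prompt: str, host: str = "127.0.0.1", port: int = 30012):
    """Build the URL and JSON payload for a chat completion request."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "model": "Ling-flash-2.0-FP8",  # placeholder; use the id your server reports
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return url, payload

def chat(prompt: str) -> str:
    """Send one chat request and return the generated text."""
    url, payload = build_request(prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Write a Python function that reverses a string."))
```

Speculative decoding is transparent to the client: requests and responses are identical to a non-speculative deployment, only faster.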
## Evaluation Results
### Accuracy Comparison
| Dataset | FP8 | FP8 + EAGLE3 |
|---------|-----|--------------|
| HumanEval | 93.29% | 93.29% |
| GSM8K | 96.59% | 96.74% |
| Math-500 | 95.80% | 96.20% |
### Detailed Throughput Data (tokens/s on 1xH200)
**HumanEval:**
- Concurrency 1: 196 β†’ 330 (+68%)
- Concurrency 4: 513 β†’ 807 (+57%)
- Concurrency 8: 725 β†’ 1187 (+64%)
- Concurrency 16: 1029 β†’ 1704 (+66%)
- Concurrency 32: 1432 β†’ 2451 (+71%)
- Concurrency 64: 1931 β†’ 3005 (+56%)
**GSM8K:**
- Concurrency 1: 186 β†’ 328 (+76%)
- Concurrency 4: 469 β†’ 721 (+54%)
- Concurrency 8: 673 β†’ 1023 (+52%)
- Concurrency 16: 955 β†’ 1412 (+48%)
- Concurrency 32: 1364 β†’ 1982 (+45%)
- Concurrency 64: 2020 β†’ 2420 (+20%)
**Math-500:**
- Concurrency 1: 197 β†’ 364 (+85%)
- Concurrency 4: 521 β†’ 896 (+72%)
- Concurrency 8: 755 β†’ 1354 (+79%)
- Concurrency 16: 1103 β†’ 2048 (+86%)
- Concurrency 32: 1612 β†’ 3120 (+94%)
- Concurrency 64: 2415 β†’ 3954 (+64%)
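The percentage gains above follow directly from each baseline/optimized pair. A quick sketch recomputing the Math-500 rows from the listed tokens/s figures:

```python
# Recompute Math-500 throughput improvements from the FP8-only vs
# FP8 + EAGLE3 pairs listed above (tokens/s on 1xH200).
MATH500 = {1: (197, 364), 4: (521, 896), 8: (755, 1354),
           16: (1103, 2048), 32: (1612, 3120), 64: (2415, 3954)}

def improvement(base: float, optimized: float) -> int:
    """Percent throughput gain, rounded to the nearest whole percent."""
    return round((optimized - base) / base * 100)

for conc, (base, opt) in MATH500.items():
    print(f"Concurrency {conc}: {base} -> {opt} (+{improvement(base, opt)}%)")
```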
## Training Data
- **Open-PerfectBlend Instruction Set**: 1.4 million high-quality instruction samples
- **Data Quality**: rigorously filtered and cleaned before training
## Use Cases
- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses
## Open Source Contribution
We actively contribute back to the open-source community. Related optimizations have been submitted to the **SGLang community**:
- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)
## Limitations and Notes
- This is a draft model; it must be paired with a target model (e.g., Ling-flash-2.0) for speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across different hardware platforms
- Medical domain applications must comply with relevant regulations; model outputs are for reference only
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{Ling-flash-2-eagle3,
  title={Ling-Flash-2.0-eagle3: High-Performance Draft Model for Speculative Decoding},
  author={Ant AQ Team},
  year={2025},
}
```
```
## License
The model weights are released under the MIT License.