---
license: mit
datasets:
- AQ-MedAI/Ling-flash-2.0-open-perfectblend-regenerate
---

# Ling-Flash-2.0-eagle3

## Model Overview

**Ling-Flash-2.0-eagle3** is a high-performance draft model designed to accelerate inference. Built on EAGLE3 speculative sampling, it balances inference speed with model stability.

The model is trained on **1.4 million high-quality instruction samples from the Open-PerfectBlend dataset**, substantially increasing inference throughput while maintaining high accuracy, which makes it well suited for high-load production environments.

## Key Features

- **Speculative Sampling Optimization**: Built on EAGLE3, achieving a high acceptance (verification pass) rate with a speculative length of 4
- **Outstanding Throughput**: The FP8 quantization + EAGLE3 combination improves throughput by up to 94%
- **High Accuracy**: Maintains 93%+ accuracy on mainstream benchmarks
- **Production-Grade Optimization**: Reaches 3954 tokens/s output throughput on a single NVIDIA H200

## Efficient Download Guide

To minimize download time and storage usage, note the purpose of each file in the repository:

**For Inference**: You only need to download `config.json` and `model.safetensors`.

**For Continued Training**: The file `training_state.pt` contains optimizer states for resuming training. If you only intend to run inference, you can skip this file.
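
As a minimal sketch of a selective download (assuming the standard `huggingface_hub` client is installed), the inference-only files can be fetched with `snapshot_download` and an `allow_patterns` filter:

```python
# Minimal sketch: download only the files needed for inference,
# skipping training_state.pt. Assumes the huggingface_hub package is installed.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="AQ-MedAI/Ling-Flash-2.0-eagle3",
    allow_patterns=["config.json", "model.safetensors"],
)
print(f"Draft model files downloaded to: {local_dir}")
```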

## Performance

### Speculative Sampling Efficiency

Average acceptance length with a speculative length of 4:

| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 3.100 |
| GSM8K | 3.412 |
| Math-500 | 3.428 |
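
As a rough back-of-the-envelope reading of this table: with `--speculative-num-draft-tokens 4`, each target-model verification step accepts about 3.1-3.4 tokens instead of 1, which bounds the attainable speedup before accounting for draft-model overhead. A minimal sketch of that arithmetic (the 10% draft overhead below is purely illustrative, not a measured figure):

```python
# Back-of-the-envelope upper bound on speculative-decoding speedup.
# Assumes drafting costs a fixed fraction of one target-model forward pass;
# the 0.1 value is a hypothetical illustration, not measured data.
acceptance_length = {"HumanEval": 3.100, "GSM8K": 3.412, "Math-500": 3.428}
draft_overhead = 0.1  # hypothetical draft cost relative to one target forward

for bench, tokens_per_step in acceptance_length.items():
    ideal = tokens_per_step                       # speedup ignoring draft cost
    with_overhead = tokens_per_step / (1 + draft_overhead)
    print(f"{bench}: ideal <= {ideal:.2f}x, with 10% draft overhead ~ {with_overhead:.2f}x")
```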

### Throughput Improvement

Throughput gain of **FP8 quantization + EAGLE3** over FP8-only at a concurrency of 32:

| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+71%** |
| GSM8K | **+45%** |
| Math-500 | **+94%** |

### Peak Inference Performance

- **Hardware**: Single NVIDIA H200 GPU
- **Peak Throughput**: Math-500 reaches **3954 tokens/s** at a concurrency of 64
- **Accuracy**: Maintains 93%-97% accuracy on mainstream benchmarks





*Figure: Throughput comparison and accuracy metrics under equal compute on a single H200*

## Technical Specifications

- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 (draft model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16
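
These hyperparameters map onto standard Llama-style `config.json` fields. The snippet below is an illustrative sketch of that mapping (key names follow the common Llama configuration schema and are assumptions, not a verbatim copy of the repository's file):

```python
# Illustrative sketch of the draft model's key configuration fields.
# Key names follow the standard Llama config schema and are assumptions,
# not a verbatim copy of the shipped config.json.
draft_config = {
    "architectures": ["LlamaForCausalLMEagle3"],
    "num_hidden_layers": 1,
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "intermediate_size": 14336,
    "vocab_size": 157184,
    "max_position_embeddings": 32768,
    "torch_dtype": "bfloat16",
}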

## Quick Start

### Requirements

- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+

### Installation

```bash
pip install sglang==0.5.6
```

Your SGLang build must also include the changes from PR [#15119](https://github.com/sgl-project/sglang/pull/15119).

### Inference with SGLang

```bash
python3 -m sglang.launch_server \
  --model-path /models/Ling-flash-2.0-FP8 \
  --host 0.0.0.0 --port 30012 \
  --trust-remote-code \
  --attention-backend fa3 \
  --mem-fraction-static 0.9 \
  --tp-size 1 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path AQ-MedAI/Ling-Flash-2.0-eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```
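
Once the server is up, it can be queried through SGLang's OpenAI-compatible API. A minimal client sketch, assuming the host/port from the launch command above and the `openai` Python package; the model name shown is an assumption based on the target model path used at launch:

```python
# Minimal sketch: query the SGLang server launched above via its
# OpenAI-compatible endpoint. Host, port, and model name mirror the
# launch command and are assumptions about your local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30012/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/models/Ling-flash-2.0-FP8",  # target model path used at launch
    messages=[{"role": "user", "content": "Write a Python function that checks whether a number is prime."}],
    max_tokens=256,
    temperature=0.0,
)
print(response.choices[0].message.content)
```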

## Evaluation Results

### Accuracy Comparison

| Dataset | FP8 | FP8 + EAGLE3 |
|---------|-----|--------------|
| HumanEval | 93.29% | 93.29% |
| GSM8K | 96.59% | 96.74% |
| Math-500 | 95.80% | 96.20% |

### Detailed Throughput Data (tokens/s on 1xH200)

Values show FP8 baseline → FP8 + EAGLE3.

**HumanEval:**
- Concurrency 1: 196 → 330 (+68%)
- Concurrency 4: 513 → 807 (+57%)
- Concurrency 8: 725 → 1187 (+64%)
- Concurrency 16: 1029 → 1704 (+66%)
- Concurrency 32: 1432 → 2451 (+71%)
- Concurrency 64: 1931 → 3005 (+56%)

**GSM8K:**
- Concurrency 1: 186 → 328 (+76%)
- Concurrency 4: 469 → 721 (+54%)
- Concurrency 8: 673 → 1023 (+52%)
- Concurrency 16: 955 → 1412 (+48%)
- Concurrency 32: 1364 → 1982 (+45%)
- Concurrency 64: 2020 → 2420 (+20%)

**Math-500:**
- Concurrency 1: 197 → 364 (+85%)
- Concurrency 4: 521 → 896 (+72%)
- Concurrency 8: 755 → 1354 (+79%)
- Concurrency 16: 1103 → 2048 (+86%)
- Concurrency 32: 1612 → 3120 (+94%)
- Concurrency 64: 2415 → 3954 (+64%)

## Training Data

- **Open-PerfectBlend Instruction Set**: 1.4 million high-quality instruction samples
- **Data Quality**: Rigorously filtered and cleaned to ensure high training-data quality

## Use Cases

- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses

## Open Source Contribution

We actively contribute back to the open-source community. The related optimizations have been submitted to the **SGLang community**:

- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)

## Limitations and Notes

- This is a draft model; it must be paired with a target model to perform speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across hardware platforms
- Medical-domain applications must comply with relevant regulations; model outputs are for reference only

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{Ling-flash-2-eagle3,
  title={Ling-Flash-2.0-eagle3: High-Performance Draft Model for Speculative Decoding},
  author={Ant AQ Team},
  year={2025}
}
```

## License

The model weights are released under the MIT License.

---