---
license: mit
datasets:
- AQ-MedAI/Ling-flash-2.0-open-perfectblend-regenerate
---
# Ling-Flash-2.0-eagle3
## Model Overview
**Ling-Flash-2.0-eagle3** is a high-performance draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to balance inference speed with output stability.
The model is trained on **1.4 million high-quality instruction samples from the Open-PerfectBlend dataset**, significantly boosting inference throughput while maintaining high accuracy in high-load production environments.
## Key Features
- **Speculative Sampling Optimization**: Built on EAGLE3, achieving a high verification pass rate with a speculative length of 4
- **Outstanding Throughput**: With FP8 quantization + EAGLE3, throughput improves by up to 94%
- **High Accuracy**: Maintains 93%+ accuracy on mainstream benchmarks
- **Production-Grade Optimization**: Reaches 3954 tokens/s output throughput on a single NVIDIA H200
## Efficient Download Guide
To minimize download time and storage usage, please note the function of the files in the repository:
**For Inference**: You only need to download `config.json` and `model.safetensors`.
**For Continued Training**: `training_state.pt` contains optimizer states for resuming training. If you only intend to run inference, you can skip this file.
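A minimal download sketch using `huggingface_hub` (an assumption; any download method works). It restricts the snapshot to the two inference files so `training_state.pt` is never fetched; the repo id matches the one used in the Quick Start section:

```python
# Sketch: download only the files needed for inference, skipping
# training_state.pt. Assumes the huggingface_hub package is installed.

# Only these files are needed for inference.
INFERENCE_FILES = ["config.json", "model.safetensors"]

def download_for_inference(repo_id: str = "AQ-MedAI/Ling-Flash-2.0-eagle3") -> str:
    """Download only the inference artifacts and return the local directory."""
    # Import kept local so the file list can be inspected without the dependency.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, allow_patterns=INFERENCE_FILES)

if __name__ == "__main__":
    print(download_for_inference())
```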
## Performance
### Speculative Sampling Efficiency
Average acceptance length with a speculative length of 4:
| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 3.100 |
| GSM8K | 3.412 |
| Math-500 | 3.428 |
### Throughput Improvement
Using **FP8 quantization + EAGLE3 optimization**, throughput improvement compared to FP8-only at 32 concurrency:
| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+71%** |
| GSM8K | **+45%** |
| Math-500 | **+94%** |
### Peak Inference Performance
- **Hardware**: single NVIDIA H200 GPU
- **Peak Throughput**: **3954 tokens/s** on Math-500 at 64 concurrency
- **Accuracy**: maintains 93%–97% accuracy on mainstream benchmarks
![H200_Accuracy_Refined](https://hackmd.io/_uploads/r1zVyhM7Zg.png)
![H200_Final_Poster_Math-500](https://hackmd.io/_uploads/rkfVJ2zmWg.png)
![H200_Final_Poster_HumanEval](https://hackmd.io/_uploads/H1fN13G7-g.png)
![H200_Final_Poster_GSM8K](https://hackmd.io/_uploads/H1MVyhzmbx.png)
*Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200*
## Technical Specifications
- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 (draft model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16
## Quick Start
### Requirements
- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+
### Installation
```bash
pip install sglang==0.5.6
```
You also need to include the changes from PR https://github.com/sgl-project/sglang/pull/15119.
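If the PR is not yet in your installed release, one way to pick it up is to build SGLang from source with the PR's branch checked out. This is a sketch, not an official install path; it assumes `git` and `pip`, and relies on GitHub exposing each pull request's head at `pull/<number>/head`:

```bash
# Sketch: install SGLang from source with PR #15119 applied.
git clone https://github.com/sgl-project/sglang.git
cd sglang
# GitHub exposes every PR's head ref at pull/<number>/head.
git fetch origin pull/15119/head:pr-15119
git checkout pr-15119
# Editable install from the repo's python/ directory (per the SGLang README).
pip install -e "python[all]"
```

Once the PR is merged into a tagged release, the plain `pip install` above is sufficient.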
### Inference with SGLang
```bash
python3 -m sglang.launch_server \
--model-path /models/Ling-flash-2.0-FP8 \
--host 0.0.0.0 --port 30012 \
--trust-remote-code \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--tp-size 1 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path AQ-MedAI/Ling-Flash-2.0-eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
```
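Once the server is up, it exposes an OpenAI-compatible API. A minimal client sketch using only the standard library; the host/port come from the launch command above, while the model name is a placeholder (the server reports its served model ids at `/v1/models`):

```python
# Sketch: query the OpenAI-compatible /v1/chat/completions endpoint
# exposed by sglang.launch_server.
import json
import urllib.request

def build_request(prompt: str, host: str = "127.0.0.1", port: int = 30012):
    """Build the URL and JSON payload for a chat completion request."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "model": "Ling-flash-2.0-FP8",  # placeholder; use the id your server reports
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return url, payload

def chat(prompt: str) -> str:
    """Send one chat request and return the generated text."""
    url, payload = build_request(prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Write a Python function that reverses a string."))
```

Speculative decoding is transparent to the client: requests and responses are identical to a non-speculative deployment, only faster.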
## Evaluation Results
### Accuracy Comparison
| Dataset | FP8 | FP8 + EAGLE3 |
|---------|-----|--------------|
| HumanEval | 93.29% | 93.29% |
| GSM8K | 96.59% | 96.74% |
| Math-500 | 95.80% | 96.20% |
### Detailed Throughput Data (tokens/s on 1xH200)
**HumanEval:**
- Concurrency 1: 196 β†’ 330 (+68%)
- Concurrency 4: 513 β†’ 807 (+57%)
- Concurrency 8: 725 β†’ 1187 (+64%)
- Concurrency 16: 1029 β†’ 1704 (+66%)
- Concurrency 32: 1432 β†’ 2451 (+71%)
- Concurrency 64: 1931 β†’ 3005 (+56%)
**GSM8K:**
- Concurrency 1: 186 β†’ 328 (+76%)
- Concurrency 4: 469 β†’ 721 (+54%)
- Concurrency 8: 673 β†’ 1023 (+52%)
- Concurrency 16: 955 β†’ 1412 (+48%)
- Concurrency 32: 1364 β†’ 1982 (+45%)
- Concurrency 64: 2020 β†’ 2420 (+20%)
**Math-500:**
- Concurrency 1: 197 β†’ 364 (+85%)
- Concurrency 4: 521 β†’ 896 (+72%)
- Concurrency 8: 755 β†’ 1354 (+79%)
- Concurrency 16: 1103 β†’ 2048 (+86%)
- Concurrency 32: 1612 β†’ 3120 (+94%)
- Concurrency 64: 2415 β†’ 3954 (+64%)
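The percentage gains above follow directly from each baseline/optimized pair. A quick sketch recomputing the Math-500 rows from the listed tokens/s figures:

```python
# Recompute Math-500 throughput improvements from the FP8-only vs
# FP8 + EAGLE3 pairs listed above (tokens/s on 1xH200).
MATH500 = {1: (197, 364), 4: (521, 896), 8: (755, 1354),
           16: (1103, 2048), 32: (1612, 3120), 64: (2415, 3954)}

def improvement(base: float, optimized: float) -> int:
    """Percent throughput gain, rounded to the nearest whole percent."""
    return round((optimized - base) / base * 100)

for conc, (base, opt) in MATH500.items():
    print(f"Concurrency {conc}: {base} -> {opt} (+{improvement(base, opt)}%)")
```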
## Training Data
- **Open-PerfectBlend Instruction Set**: 1.4 million high-quality instruction samples
- **Data Quality**: rigorously filtered and cleaned before training
## Use Cases
- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses
## Open Source Contribution
We actively contribute back to the open-source community. Related optimizations have been submitted to the **SGLang community**:
- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)
## Limitations and Notes
- This is a draft model; it must be paired with a target model (e.g., Ling-flash-2.0) for speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across different hardware platforms
- Medical domain applications must comply with relevant regulations; model outputs are for reference only
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{Ling-flash-2-eagle3,
  title={Ling-Flash-2.0-eagle3: High-Performance Draft Model for Speculative Decoding},
  author={Ant AQ Team},
  year={2025},
}
```
```
## License
The model weights are released under the MIT License.