---
license: mit
datasets:
- AQ-MedAI/Ling-flash-2.0-open-perfectblend-regenerate
---

# Ling-Flash-2.0-eagle3

## Model Overview

**Ling-Flash-2.0-eagle3** is a high-performance draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to balance inference speed with model stability. The model is trained on **1.4 million high-quality instruction examples from the Open-PerfectBlend dataset**, significantly boosting inference throughput while maintaining high accuracy, making it well suited to high-load production environments.

## Key Features

- **Speculative Sampling Optimization**: Built on EAGLE3, achieving a high verification pass rate with a speculative length of 4
- **Outstanding Throughput**: FP8 quantization + EAGLE3 yields throughput improvements of up to 94%
- **High Accuracy**: Maintains 93%+ accuracy on mainstream benchmarks
- **Production-Grade Optimization**: Reaches 3954 tokens/s output throughput on a single NVIDIA H200

## Efficient Download Guide

To minimize download time and storage usage, note the function of the files in this repository:

- **For inference**: you only need to download `config.json` and `model.safetensors`.
- **For continued training**: `training_state.pt` contains optimizer states for resuming training. If you only intend to run inference, you can skip this file.
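The inference-only download above can be scripted with the `huggingface_hub` library (a sketch: `snapshot_download` and its `allow_patterns`/`local_dir` parameters are standard `huggingface_hub` APIs, but the local directory name here is an arbitrary choice):

```python
# Files needed for inference only; training_state.pt is deliberately excluded.
INFERENCE_PATTERNS = ["config.json", "model.safetensors"]

def download_for_inference(local_dir: str = "./Ling-Flash-2.0-eagle3") -> str:
    """Download only the inference files and return the local snapshot path."""
    # Imported lazily so the pattern list is usable without huggingface_hub.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    # allow_patterns restricts the snapshot to the listed files.
    return snapshot_download(
        repo_id="AQ-MedAI/Ling-Flash-2.0-eagle3",
        allow_patterns=INFERENCE_PATTERNS,
        local_dir=local_dir,
    )

if __name__ == "__main__":
    print(download_for_inference())
```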
## Performance

### Speculative Sampling Efficiency

Average acceptance length with a speculative length of 4:

| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 3.100 |
| GSM8K | 3.412 |
| Math-500 | 3.428 |

### Throughput Improvement

Using **FP8 quantization + EAGLE3 optimization**, throughput improvement over FP8-only at 32 concurrency:

| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+71%** |
| GSM8K | **+45%** |
| Math-500 | **+94%** |

### Ultimate Inference Performance

- **Hardware**: single NVIDIA H200 GPU
- **Peak throughput**: Math-500 reaches **3954 tokens/s** at 64 concurrency
- **Accuracy**: maintains 93%-97% accuracy on mainstream benchmarks

![H200_Accuracy_Refined](https://hackmd.io/_uploads/r1zVyhM7Zg.png)
![H200_Final_Poster_Math-500](https://hackmd.io/_uploads/rkfVJ2zmWg.png)
![H200_Final_Poster_HumanEval](https://hackmd.io/_uploads/H1fN13G7-g.png)
![H200_Final_Poster_GSM8K](https://hackmd.io/_uploads/H1MVyhzmbx.png)

*Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200*

## Technical Specifications

- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 (draft model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16

## Quick Start

### Requirements

- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+

### Installation

```bash
pip install sglang==0.5.6
```

Then apply [PR #15119](https://github.com/sgl-project/sglang/pull/15119) to your SGLang installation.

### Inference with SGLang

```bash
python3 -m sglang.launch_server \
--model-path /models/Ling-flash-2.0-FP8 \
--host 0.0.0.0 --port 30012 \
--trust-remote-code \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--tp-size 1 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path AQ-MedAI/Ling-Flash-2.0-eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
```

## Evaluation Results

### Accuracy Comparison

| Dataset | FP8 | FP8 + EAGLE3 |
|---------|-----|--------------|
| HumanEval | 93.29% | 93.29% |
| GSM8K | 96.59% | 96.74% |
| Math-500 | 95.80% | 96.20% |

### Detailed Throughput Data (tokens/s on 1xH200)

Values are FP8 → FP8 + EAGLE3:

| Concurrency | HumanEval | GSM8K | Math-500 |
|-------------|-----------|-------|----------|
| 1 | 196 → 330 (+68%) | 186 → 328 (+76%) | 197 → 364 (+85%) |
| 4 | 513 → 807 (+57%) | 469 → 721 (+54%) | 521 → 896 (+72%) |
| 8 | 725 → 1187 (+64%) | 673 → 1023 (+52%) | 755 → 1354 (+79%) |
| 16 | 1029 → 1704 (+66%) | 955 → 1412 (+48%) | 1103 → 2048 (+86%) |
| 32 | 1432 → 2451 (+71%) | 1364 → 1982 (+45%) | 1612 → 3120 (+94%) |
| 64 | 1931 → 3005 (+56%) | 2020 → 2420 (+20%) | 2415 → 3954 (+64%) |

## Training Data

- **Open-PerfectBlend instruction set**: 1.4 million high-quality instruction examples
- **Data quality**: rigorously filtered and cleaned

## Use Cases

- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses

## Open Source Contribution

We actively contribute back to the open-source community.
Related optimization work has been submitted to the **SGLang community**:

- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)

## Limitations and Notes

- This is a draft model: it must be paired with a target model to perform speculative sampling
- FP8 quantization of the target model is recommended for best performance
- Performance may vary across hardware platforms
- Medical-domain applications must comply with relevant regulations; model outputs are for reference only

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{Ling-flash-2-eagle3,
  title={Ling-Flash-2.0-eagle3: High-Performance Draft Model for Speculative Decoding},
  author={Ant AQ Team},
  year={2025},
}
```

## License

The model weights are released under the MIT License.
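For intuition, the average acceptance lengths reported in the Performance section can be related to an idealized speedup bound over plain autoregressive decoding (a back-of-the-envelope sketch, not the methodology behind the measured numbers; `draft_overhead`, the relative cost of drafting per target forward pass, is an assumed parameter):

```python
# Idealized speculative-decoding speedup from average acceptance length.
# Each verification step of the target model yields `acceptance_length`
# tokens on average; the draft model adds a relative cost per step.

def estimated_speedup(acceptance_length: float, draft_overhead: float = 0.0) -> float:
    """Tokens per target forward pass, discounted by the draft-model cost."""
    return acceptance_length / (1.0 + draft_overhead)

if __name__ == "__main__":
    # Acceptance lengths from the table above (speculative length 4).
    for bench, tau in [("HumanEval", 3.100), ("GSM8K", 3.412), ("Math-500", 3.428)]:
        print(f"{bench}: ideal {estimated_speedup(tau):.2f}x, "
              f"with 30% draft overhead {estimated_speedup(tau, 0.3):.2f}x")
```

The measured gains (+45% to +94% at 32 concurrency) sit below this ideal bound, as expected, since drafting, verification, and batching overheads are nonzero in practice.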