---
license: mit
datasets:
- AQ-MedAI/Ling-flash-2.0-open-perfectblend-regenerate
---
# Ling-Flash-2.0-eagle3

## Model Overview

**Ling-Flash-2.0-eagle3** is a high-performance draft model designed for inference acceleration, using EAGLE3 speculative sampling to balance inference performance and model stability.

The model is trained on **1.4 million high-quality instruction examples from the Open-PerfectBlend dataset**, significantly boosting inference throughput while maintaining high accuracy, making it well suited to high-load production environments.

## Key Features

- **Speculative Sampling Optimization**: Built on EAGLE3, achieving a high verification pass rate at a speculative length of 4 (see the sketch after this list)
- **Outstanding Throughput**: FP8 quantization combined with EAGLE3 improves throughput by up to 94%
- **High Accuracy**: Maintains 93%+ accuracy on mainstream benchmarks
- **Production-Grade Optimization**: Reaches 3954 tokens/s output throughput on a single NVIDIA H200

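To make the draft-and-verify idea behind these features concrete, here is a minimal, self-contained Python sketch of a speculative decoding loop. It is schematic only: `draft_next` and `target_next` are toy stand-ins rather than Ling or EAGLE3 models, and a real system verifies all drafted tokens in a single batched forward pass of the target model.

```python
# Schematic speculative decoding loop (toy models; NOT the EAGLE3/SGLang code).
import random

random.seed(0)
VOCAB = list(range(100))

def draft_next(token: int) -> int:
    # Toy stand-in for the cheap draft model.
    return (token * 31 + 7) % 100

def target_next(token: int) -> int:
    # Toy stand-in for the expensive target model; agrees with the draft
    # most of the time, mimicking a well-trained draft.
    return draft_next(token) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(token: int, draft_len: int = 4) -> list:
    """Draft `draft_len` tokens, then verify them against the target.

    The first disagreement is replaced by the target's own token, so every
    step yields at least one target-quality token.
    """
    drafted, t = [], token
    for _ in range(draft_len):
        t = draft_next(t)
        drafted.append(t)

    accepted, t = [], token
    for d in drafted:
        expected = target_next(t)
        if expected == d:
            accepted.append(d)         # drafted token accepted "for free"
            t = d
        else:
            accepted.append(expected)  # mismatch: keep the target's token, stop
            break
    return accepted

out, tok, steps = [], 1, 0
while len(out) < 64:
    chunk = speculative_step(tok)
    out.extend(chunk)
    tok = out[-1]
    steps += 1

print(f"{len(out)} tokens in {steps} target steps "
      f"(average acceptance length {len(out) / steps:.2f})")
```

The `--speculative-num-draft-tokens 4` flag in the Quick Start section plays the role of `draft_len` here.
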
## Performance

### Speculative Sampling Efficiency

Average acceptance length at a speculative length of 4:

| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 3.100 |
| GSM8K | 3.412 |
| Math-500 | 3.428 |

An average acceptance length of about 3.4 means each target-model verification step emits roughly 3.4 tokens instead of one, before accounting for draft-model overhead.

### Throughput Improvement

With **FP8 quantization + EAGLE3**, throughput improvement over FP8-only at a concurrency of 32:

| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+71%** |
| GSM8K | **+45%** |
| Math-500 | **+94%** |

### Peak Inference Performance

- **Hardware Environment**: single NVIDIA H200 GPU
- **Peak Throughput**: Math-500 reaches **3954 tokens/s** at a concurrency of 64
- **Accuracy**: Maintains 93%-97% accuracy on mainstream benchmarks

![H200_Accuracy_Refined](https://hackmd.io/_uploads/r1zVyhM7Zg.png)
![H200_Final_Poster_Math-500](https://hackmd.io/_uploads/rkfVJ2zmWg.png)
![H200_Final_Poster_HumanEval](https://hackmd.io/_uploads/H1fN13G7-g.png)
![H200_Final_Poster_GSM8K](https://hackmd.io/_uploads/H1MVyhzmbx.png)

*Figure: Throughput comparison and accuracy metrics under equal compute on 1xH200*

## Technical Specifications

- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 (draft model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16

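To cross-check these values against the shipped configuration, a quick sketch (assumes `huggingface_hub` is installed; it simply prints the repository's raw `config.json`):

```python
# Download and print the draft model's config.json to verify the specs above.
# Assumes: pip install huggingface_hub
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download("AQ-MedAI/Ling-Flash-2.0-eagle3", "config.json")
with open(path) as f:
    print(json.dumps(json.load(f), indent=2))
```
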
## Quick Start

### Requirements

- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+

### Installation

```bash
pip install sglang==0.5.6
```

### Inference with SGLang

```bash
python3 -m sglang.launch_server \
    --model-path /models/Ling-flash-2.0-FP8 \
    --host 0.0.0.0 --port 30012 \
    --trust-remote-code \
    --attention-backend fa3 \
    --mem-fraction-static 0.9 \
    --tp-size 1 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path AQ-MedAI/Ling-Flash-2.0-eagle3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```

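Once the server is up, it exposes an OpenAI-compatible API on the port chosen above. A minimal client sketch (assumes `pip install openai`; the `model` value mirrors the `--model-path` used at launch):

```python
# Minimal client for the SGLang server launched above (OpenAI-compatible API).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30012/v1",  # --port from the launch command
    api_key="EMPTY",                       # no --api-key was set at launch
)

response = client.chat.completions.create(
    model="/models/Ling-flash-2.0-FP8",    # the --model-path used at launch
    messages=[{"role": "user", "content": "Write a haiku about fast inference."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Speculative decoding is transparent to the client: requests are served exactly as without EAGLE3, only faster.
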
## Evaluation Results

### Accuracy Comparison

| Dataset | FP8 | FP8 + EAGLE3 |
|---------|-----|--------------|
| HumanEval | 93.29% | 93.29% |
| GSM8K | 96.59% | 96.74% |
| Math-500 | 95.80% | 96.20% |

### Detailed Throughput Data (tokens/s on 1xH200, FP8 → FP8 + EAGLE3)

| Concurrency | HumanEval | GSM8K | Math-500 |
|-------------|-----------|-------|----------|
| 1 | 196 → 330 (+68%) | 186 → 328 (+76%) | 197 → 364 (+85%) |
| 4 | 513 → 807 (+57%) | 469 → 721 (+54%) | 521 → 896 (+72%) |
| 8 | 725 → 1187 (+64%) | 673 → 1023 (+52%) | 755 → 1354 (+79%) |
| 16 | 1029 → 1704 (+66%) | 955 → 1412 (+48%) | 1103 → 2048 (+86%) |
| 32 | 1432 → 2451 (+71%) | 1364 → 1982 (+45%) | 1612 → 3120 (+94%) |
| 64 | 1931 → 3005 (+56%) | 2020 → 2420 (+20%) | 2415 → 3954 (+64%) |

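The percentage gains above follow directly from the raw throughputs; for example, a quick recomputation of the Math-500 column:

```python
# Recompute the Math-500 percentage gains from the raw throughputs above.
math500 = {1: (197, 364), 4: (521, 896), 8: (755, 1354),
           16: (1103, 2048), 32: (1612, 3120), 64: (2415, 3954)}

for concurrency, (fp8, fp8_eagle3) in math500.items():
    gain = (fp8_eagle3 - fp8) / fp8 * 100
    print(f"concurrency {concurrency:>2}: "
          f"{fp8} -> {fp8_eagle3} tokens/s (+{gain:.0f}%)")
```
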
## Training Data

- **Open-PerfectBlend Instruction Set**: 1.4 million high-quality instruction examples
- **Data Quality**: Rigorously filtered and cleaned

## Use Cases

- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses

## Open Source Contribution

We actively contribute back to the open-source community. Related optimizations have been submitted to the **SGLang community**:
- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)

## Limitations and Notes

- This is a draft model; it must be paired with a target model to perform speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across hardware platforms
- Medical domain applications must comply with relevant regulations; model outputs are for reference only

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{Ling-flash-2-eagle3,
  title={Ling-Flash-2.0-eagle3: High-Performance Draft Model for Speculative Decoding},
  author={Ant AQ Team},
  year={2025},
}
```

## License

The model weights are released under the MIT License.

---