---
license: apache-2.0
base_model:
- MedAIBase/AntAngelMed
---
# AntAngelMed-eagle3
## Model Overview
**AntAngelMed-eagle3** is a high-performance draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to balance inference speed with output stability.
The model is trained on **high-quality medical datasets**, significantly boosting inference throughput while maintaining high accuracy, making it well suited to high-load production environments.
## Key Features
- **Speculative Sampling Optimization**: Built on EAGLE3, achieving a high verification pass rate with a speculative length of 4
- **Outstanding Throughput**: The FP8 quantization + EAGLE3 combination improves throughput by up to ~90%
- **Production-Grade Optimization**: 3267 tokens/s output throughput on a single NVIDIA H200
## Performance
### Speculative Sampling Efficiency
Average acceptance length at a speculative length of 4:
| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 2.816 |
| GSM8K | 3.24 |
| Math-500 | 3.326 |
| Med_MCPA | 2.600 |
| Health_Bench | 2.446 |
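To give intuition for these numbers, the sketch below relates average acceptance length to an idealized speedup. This is illustrative only: the `draft_cost_ratio` value is an assumption, and the model ignores verification and scheduling overhead, so real gains (such as those in the throughput table below) will be lower.

```python
# Illustrative only: relates average acceptance length to an idealized speedup.
# Assumes each target-model verification step emits `avg_acceptance_length`
# tokens on average, versus 1 token per step without speculation.

def ideal_speedup(avg_acceptance_length: float, draft_cost_ratio: float = 0.0) -> float:
    """Idealized tokens emitted per unit of target-model compute.

    `draft_cost_ratio` is the assumed cost of one draft-model pass relative
    to one target-model pass (a hypothetical value, not a measured one).
    """
    # One target pass plus the draft passes needed to propose the tokens.
    cost = 1.0 + draft_cost_ratio * avg_acceptance_length
    return avg_acceptance_length / cost

# Average acceptance lengths from the table above.
for name, tau in [("HumanEval", 2.816), ("GSM8K", 3.24), ("Math-500", 3.326),
                  ("Med_MCPA", 2.600), ("Health_Bench", 2.446)]:
    print(f"{name}: ~{ideal_speedup(tau, draft_cost_ratio=0.05):.2f}x ideal speedup")
```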
### Throughput Improvement
Throughput improvement of **FP8 quantization + EAGLE3 optimization** over FP8-only, at a concurrency of 16:
| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+67.3%** |
| GSM8K | **+58.6%** |
| Math-500 | **+89.8%** |
| Med_MCPA | **+46%** |
| Health_Bench | **+45.3%** |
### Ultimate Inference Performance
- **Hardware Environment**: NVIDIA H200 single GPU
![1](https://hackmd.io/_uploads/BJF9a7MNZe.png)
![2](https://hackmd.io/_uploads/H15K1NMV-e.png)
![3](https://hackmd.io/_uploads/H16nT7fN-e.png)
*Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200*
## Technical Specifications
- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 layer (Draft Model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16
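The specifications above correspond to a draft-model configuration along the following lines. This is an illustrative reconstruction using Llama-family field names, not the shipped file; the `config.json` in the model repository is authoritative.

```python
# Illustrative sketch of the draft model's configuration, reconstructed from
# the Technical Specifications above (the repository's config.json is
# authoritative).
draft_config = {
    "architectures": ["LlamaForCausalLMEagle3"],
    "num_hidden_layers": 1,         # single-layer draft model
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,       # grouped-query attention
    "intermediate_size": 14336,
    "vocab_size": 157184,
    "max_position_embeddings": 32768,
    "torch_dtype": "bfloat16",
}
```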
## Quick Start
### Requirements
- NVIDIA H200-class GPU (reported throughput was measured on a single H200)
- CUDA 12.0+
- PyTorch 2.0+
### Installation
```bash
pip install sglang==0.5.6
```
Then apply [PR #15119](https://github.com/sgl-project/sglang/pull/15119), which contains the EAGLE3 optimizations this model relies on.
### Inference with SGLang
```bash
python3 -m sglang.launch_server \
--model-path MedAIBase/AntAngelMed-FP8 \
--host 0.0.0.0 --port 30012 \
--trust-remote-code \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--tp-size 1 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
```
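Once the server is running, SGLang exposes an OpenAI-compatible HTTP API. A minimal client sketch is shown below; the host, port, and model path match the launch command above, and the prompt is just an example.

```python
import json
import urllib.request

# Build an OpenAI-compatible chat-completion request for the SGLang server
# launched above (host/port taken from the launch command).
payload = {
    "model": "MedAIBase/AntAngelMed-FP8",
    "messages": [
        {"role": "user", "content": "Briefly explain what speculative decoding is."}
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://127.0.0.1:30012/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```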
## Training Data
- **Data Quality**: Training data was rigorously filtered and cleaned
## Use Cases
- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses
## Open Source Contribution
We actively contribute back to the open-source community. Related optimization achievements have been submitted to the **SGLang community**:
- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)
## Limitations and Notes
- This is a draft model; it must be paired with a target model to perform speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across different hardware platforms
- Medical domain applications must comply with relevant regulations; model outputs are for reference only
## License
This code repository is licensed under [the MIT License](https://github.com/inclusionAI/Ling-V2/blob/master/LICENCE).