--- |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- MedAIBase/AntAngelMed |
|
|
--- |
|
|
|
|
|
# AntAngelMed-eagle3 |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
**AntAngelMed-eagle3** is a high-performance draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to balance inference speed with model stability.
|
|
|
|
|
The model is trained on **high-quality medical datasets** and significantly boosts inference throughput while maintaining accuracy, making it well suited for high-load production environments.
|
|
|
|
|
## Key Features |
|
|
|
|
|
- **Speculative Sampling Optimization**: Built on EAGLE3, achieving a high verification pass rate with a speculative length of 4
|
|
- **Outstanding Throughput Performance**: the FP8 quantization + EAGLE3 combination delivers throughput improvements of up to ~90%
|
|
- **Production-Grade Optimization**: Achieves 3267 tokens/s output throughput on a single NVIDIA H200
|
|
|
|
|
|
|
|
## Performance |
|
|
|
|
|
### Speculative Sampling Efficiency |
|
|
|
|
|
Average acceptance length (roughly, the number of tokens produced per target-model verification step) with a speculative length of 4; a rough interpretation is sketched after the table:
|
|
|
|
|
| Benchmark | Average Acceptance Length | |
|
|
|-----------|---------------------------| |
|
|
| HumanEval | 2.816 | |
|
|
| GSM8K | 3.240 |
|
|
| Math-500 | 3.326 | |
|
|
| Med_MCPA | 2.600 | |
|
|
| Health_Bench | 2.446 | |
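
As a rough, hedged interpretation (an assumption, not a measured result): with a speculative length of 4, the average acceptance length approximates the number of tokens the target model emits per verification step, so it serves as an idealized upper bound on the speedup before draft-model and verification overhead. The snippet below merely restates the table in those terms.

```bash
# Back-of-the-envelope only: acceptance length ~ tokens emitted per target
# forward pass, i.e. an idealized upper bound on speedup (overheads ignored).
for pair in "HumanEval:2.816" "GSM8K:3.240" "Math-500:3.326" "Med_MCPA:2.600" "Health_Bench:2.446"; do
  name=${pair%%:*}; len=${pair##*:}
  printf '%-13s idealized upper-bound speedup ~ %sx\n' "$name" "$len"
done
```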
|
|
|
|
|
### Throughput Improvement |
|
|
|
|
|
Throughput improvement of **FP8 quantization + EAGLE3** over FP8 quantization alone at a concurrency of 16 (a hedged reproduction sketch follows the table):
|
|
|
|
|
| Benchmark | Throughput Improvement | |
|
|
|-----------|------------------------| |
|
|
| HumanEval | **+67.3%** | |
|
|
| GSM8K | **+58.6%** | |
|
|
| Math-500 | **+89.8%** | |
|
|
| Med_MCPA | **+46.0%** |
|
|
| Health_Bench | **+45.3%** | |
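
For reference, this kind of comparison can be reproduced with SGLang's bundled serving benchmark against the server from the Quick Start section. The sketch below is hedged: flag names may differ between SGLang versions, and the dataset and length settings shown here are placeholders rather than the configuration behind the numbers above.

```bash
# Hedged sketch: drive the server at a fixed concurrency of 16; run once with
# FP8-only and once with EAGLE3 enabled, then compare output token throughput.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30012 \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 512 \
  --num-prompts 256 \
  --max-concurrency 16
```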
|
|
|
|
|
### Peak Inference Performance
|
|
|
|
|
- **Hardware Environment**: single NVIDIA H200 GPU
|
|
|
|
|
 |
|
|
 |
|
|
 |
|
|
|
|
|
|
|
|
*Figures: Throughput comparison and accuracy metrics under equal compute on a single H200*
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
- **Model Architecture**: LlamaForCausalLMEagle3 |
|
|
- **Number of Layers**: 1 layer (Draft Model) |
|
|
- **Hidden Size**: 4096 |
|
|
- **Attention Heads**: 32 (KV heads: 8) |
|
|
- **Intermediate Size**: 14336 |
|
|
- **Vocabulary Size**: 157,184 |
|
|
- **Max Position Embeddings**: 32,768 |
|
|
- **Data Type**: bfloat16 |
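
These values can be checked locally by printing the repository's `config.json`; the short sketch below assumes the file is laid out as a standard Hugging Face model config.

```bash
# Hedged sketch: download and pretty-print the draft model's config.json.
python3 - <<'EOF'
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("MedAIBase/AntAngelMed-eagle3", "config.json")
print(json.dumps(json.load(open(path)), indent=2))
EOF
```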
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### Requirements |
|
|
|
|
|
- NVIDIA H200-class GPU (or comparable compute)
|
|
- CUDA 12.0+ |
|
|
- PyTorch 2.0+ |
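
A quick, hedged way to confirm the environment meets these requirements:

```bash
# Check the available GPU and the installed PyTorch / CUDA versions.
nvidia-smi --query-gpu=name,memory.total --format=csv
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
```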
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install sglang==0.5.6 |
|
|
``` |
|
|
Your installation must also include PR https://github.com/sgl-project/sglang/pull/15119 (a hedged from-source sketch follows).
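
Until that change ships in a release, one hedged way to include it is to install SGLang from source with the PR's changes applied; the commands below are illustrative and may require resolving conflicts against the pinned version.

```bash
# Illustrative only: fetch the PR head from GitHub and install from source.
git clone https://github.com/sgl-project/sglang.git
cd sglang
git fetch origin pull/15119/head:pr-15119
git checkout pr-15119
pip install -e "python[all]"
```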
|
|
|
|
|
### Inference with SGLang |
|
|
|
|
|
```bash
|
|
python3 -m sglang.launch_server \ |
|
|
--model-path MedAIBase/AntAngelMed-FP8 \ |
|
|
--host 0.0.0.0 --port 30012 \ |
|
|
--trust-remote-code \ |
|
|
--attention-backend fa3 \ |
|
|
--mem-fraction-static 0.9 \ |
|
|
--tp-size 1 \ |
|
|
--speculative-algorithm EAGLE3 \ |
|
|
--speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \ |
|
|
--speculative-num-steps 3 \ |
|
|
--speculative-eagle-topk 1 \ |
|
|
--speculative-num-draft-tokens 4 |
|
|
``` |
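
Once the server is running, it exposes an OpenAI-compatible API and applies speculative decoding with the draft model transparently on the server side. A minimal request might look like the following (the prompt is a placeholder):

```bash
curl -s http://localhost:30012/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MedAIBase/AntAngelMed-FP8",
        "messages": [{"role": "user", "content": "Summarize first-line treatments for type 2 diabetes."}],
        "max_tokens": 256,
        "temperature": 0
      }'
```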
|
|
|
|
|
## Training Data |
|
|
|
|
|
- **Data Quality**: Rigorously filtered and cleaned to ensure high-quality training data |
|
|
|
|
|
## Use Cases |
|
|
|
|
|
- High-concurrency inference services |
|
|
- Real-time dialogue systems |
|
|
- Code generation and completion |
|
|
- Mathematical reasoning and computation |
|
|
- Production environments requiring low-latency responses |
|
|
|
|
|
## Open Source Contribution |
|
|
|
|
|
We actively contribute back to the open-source community. The related optimizations have been submitted upstream to **SGLang**:
|
|
- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119) |
|
|
|
|
|
|
|
|
## Limitations and Notes |
|
|
|
|
|
- This is a draft model; it must be paired with its target model (e.g., MedAIBase/AntAngelMed-FP8) to perform speculative sampling
|
|
- FP8 quantization of the target model is recommended for optimal performance
|
|
- Performance may vary across different hardware platforms |
|
|
- Medical domain applications must comply with relevant regulations; model outputs are for reference only |
|
|
|
|
|
|
|
|
## License |
|
|
|
|
|
This code repository is licensed under [the MIT License](https://github.com/inclusionAI/Ling-V2/blob/master/LICENCE). |