---
license: apache-2.0
base_model:
- MedAIBase/AntAngelMed
---
# AntAngelMed-eagle3
## Model Overview
**AntAngelMed-eagle3** is a high-performance draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to balance inference speed with output stability.
The model is trained on **high-quality medical datasets**, significantly boosting inference throughput while maintaining high accuracy, making it well suited to high-load production environments.
## Key Features
- **Speculative Sampling Optimization**: Built on EAGLE3, achieving a high verification pass rate with a speculative length of 4
- **Outstanding Throughput Performance**: The FP8 quantization + EAGLE3 configuration improves throughput by up to ~90%
- **Production-Grade Optimization**: Reaches 3,267 tokens/s output throughput on a single NVIDIA H200
## Performance
### Speculative Sampling Efficiency
Average acceptance length with a speculative length of 4:
| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 2.816 |
| GSM8K | 3.24 |
| Math-500 | 3.326 |
| Med_MCPA | 2.600 |
| Health_Bench | 2.446 |
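The acceptance lengths above bound the achievable speedup: each target-model verification step emits roughly `avg_accept_len` tokens instead of one. A minimal sketch of that arithmetic, where the draft-model overhead is modeled as a single relative-cost parameter (an assumption for illustration, not a figure from this card):

```python
# Acceptance lengths reported above for this model.
acceptance = {
    "HumanEval": 2.816,
    "GSM8K": 3.24,
    "Math-500": 3.326,
    "Med_MCPA": 2.600,
    "Health_Bench": 2.446,
}

def ideal_speedup(avg_accept_len: float, draft_overhead: float = 0.0) -> float:
    """Estimate decoding speedup: each verify step yields ~avg_accept_len
    tokens instead of 1. draft_overhead is the relative cost of drafting
    per verify step (0.0 = drafting is free, an idealized assumption)."""
    return avg_accept_len / (1.0 + draft_overhead)

for bench, a in acceptance.items():
    print(f"{bench}: ~{ideal_speedup(a):.2f}x upper bound")
```

Real-world gains are lower than this upper bound because the draft model, verification batching, and scheduling all add overhead, which is consistent with the measured throughput improvements reported below.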
### Throughput Improvement
Using **FP8 quantization + EAGLE3 optimization**, throughput improvement over the FP8-only baseline at a concurrency of 16:
| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+67.3%** |
| GSM8K | **+58.6%** |
| Math-500 | **+89.8%** |
| Med_MCPA | **+46%** |
| Health_Bench | **+45.3%** |
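As a sanity check on how the percentages above relate to raw throughput, a minimal helper (the numbers fed to it below are illustrative, not additional measurements from this card):

```python
def improvement_pct(with_eagle_tps: float, baseline_tps: float) -> float:
    """Relative throughput improvement, in percent, of the FP8 + EAGLE3
    configuration over an FP8-only baseline."""
    return (with_eagle_tps / baseline_tps - 1.0) * 100.0

# Illustrative: a run at 150 tokens/s vs. a 100 tokens/s baseline is +50%.
print(f"{improvement_pct(150.0, 100.0):+.1f}%")
```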
### Ultimate Inference Performance
- **Hardware Environment**: single NVIDIA H200 GPU

*Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200*
## Technical Specifications
- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 (draft model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16
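The specifications above can be collected into a Hugging Face-style config sketch. Field names here follow common Llama config conventions; the model's actual `config.json` may differ:

```python
# Sketch of the draft model's key hyperparameters, mirroring the
# specifications listed above (field names are conventional, not verbatim).
draft_config = {
    "architectures": ["LlamaForCausalLMEagle3"],
    "num_hidden_layers": 1,        # single-layer draft model
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,      # grouped-query attention
    "intermediate_size": 14336,
    "vocab_size": 157184,
    "max_position_embeddings": 32768,
    "torch_dtype": "bfloat16",
}

# Derived: per-head dimension = hidden_size / num_attention_heads
head_dim = draft_config["hidden_size"] // draft_config["num_attention_heads"]
print(head_dim)  # 128
```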
## Quick Start
### Requirements
- H200-class GPU compute
- CUDA 12.0+
- PyTorch 2.0+
### Installation
```bash
pip install sglang==0.5.6
```
and apply [PR #15119](https://github.com/sgl-project/sglang/pull/15119), which contains the required EAGLE3 optimizations.
### Inference with SGLang
```bash
python3 -m sglang.launch_server \
--model-path MedAIBase/AntAngelMed-FP8 \
--host 0.0.0.0 --port 30012 \
--trust-remote-code \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--tp-size 1 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
```
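Once the server is up, requests can be sent to its OpenAI-compatible endpoint. A minimal client sketch; the prompt is illustrative, and the endpoint path assumes SGLang's standard OpenAI-compatible API:

```python
import json

def build_chat_request(prompt: str,
                       model: str = "MedAIBase/AntAngelMed-FP8") -> dict:
    """Build an OpenAI-style chat-completions payload for the SGLang server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

payload = build_chat_request("Briefly explain speculative decoding.")
print(json.dumps(payload, indent=2))

# POST this to the server launched above, e.g. with the `requests` library:
#   requests.post("http://localhost:30012/v1/chat/completions", json=payload)
```

Speculative decoding is transparent to the client: the draft model accelerates generation server-side, and responses arrive in the same format as without EAGLE3.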
## Training Data
- **Data Quality**: Rigorously filtered and cleaned to ensure high-quality training data
## Use Cases
- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses
## Open Source Contribution
We actively contribute back to the open-source community. Related optimization achievements have been submitted to the **SGLang community**:
- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)
## Limitations and Notes
- This model is a draft model and must be paired with a target model to perform speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across different hardware platforms
- Medical domain applications must comply with relevant regulations; model outputs are for reference only
## License
This code repository is licensed under [the MIT License](https://github.com/inclusionAI/Ling-V2/blob/master/LICENCE).