MedAIBase
/

AntAngelMed-eagle3

Safetensors

llama


yarkcy committed on Jan 1

Commit

530ef2f

verified ·

1 Parent(s): 04c8dfa

Upload README.md with huggingface_hub


README.md +125 -3


@@ -1,3 +1,125 @@
----
-license: apache-2.0
----

+# AntAngelMed-eagle3
+## Model Overview
+**AntAngelMed-eagle3** is a draft model built for inference acceleration. It uses EAGLE3 speculative sampling to raise inference throughput while keeping model behavior stable.
+The model is trained on **high-quality medical datasets** and significantly boosts inference throughput while maintaining high accuracy, which makes it suited to high-load production environments.
+## Key Features
+- **Speculative Sampling Optimization**: Based on EAGLE3, achieving a high verification pass rate with a speculative length of 4
+- **Outstanding Throughput Performance**: FP8 quantization + EAGLE3 improves throughput by up to 89.8% over FP8 alone (Math-500, 16 concurrency)
+- **Production-Grade Optimization**: 3267 tokens/s output throughput on a single NVIDIA H200
+## Performance
+### Speculative Sampling Efficiency
+Average acceptance length with a speculative length of 4:
+| Benchmark | Average Acceptance Length |
+|-----------|---------------------------|
+| HumanEval | 2.816 |
+| GSM8K | 3.24 |
+| Math-500 | 3.326 |
+| Med_MCPA | 2.600 |
+| Health_Bench | 2.446 |
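To put the acceptance lengths in context, the sketch below expresses each one as a fraction of the speculative length. It assumes the acceptance length counts tokens emitted per verification step and that its maximum equals the speculative length of 4; the card does not state this, so treat the percentages as illustrative. The names `acceptance_length` and `fraction_of_maximum` are hypothetical.

```python
# Average acceptance lengths copied from the table above.
acceptance_length = {
    "HumanEval": 2.816,
    "GSM8K": 3.24,
    "Math-500": 3.326,
    "Med_MCPA": 2.600,
    "Health_Bench": 2.446,
}

# Speculative length stated in the model card.
# Assumption: this is also the maximum achievable acceptance length.
SPECULATIVE_LENGTH = 4


def fraction_of_maximum(avg_len: float, max_len: int = SPECULATIVE_LENGTH) -> float:
    """Return the acceptance length as a fraction of the assumed maximum."""
    return avg_len / max_len


for name, length in acceptance_length.items():
    share = fraction_of_maximum(length)
    print(f"{name:<13} {length:.3f} tokens/step  {share:.1%} of maximum")

# Unweighted mean across the five benchmarks.
mean_length = sum(acceptance_length.values()) / len(acceptance_length)
print(f"Unweighted mean: {mean_length:.3f}")
```

The mean is unweighted because the card does not report how many samples each benchmark contributes.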
+### Throughput Improvement
+Throughput improvement of **FP8 quantization + EAGLE3 optimization** over FP8 alone, at a concurrency of 16:
+| Benchmark | Throughput Improvement |
+|-----------|------------------------|
+| HumanEval | **+67.3%** |
+| GSM8K | **+58.6%** |
+| Math-500 | **+89.8%** |
+| Med_MCPA | **+46%** |
+| Health_Bench | **+45.3%** |
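A percentage improvement can be easier to reason about as a multiplicative speedup. The sketch below converts the table values; it assumes the percentages are relative to the FP8-only throughput, as the sentence above the table states, and the helper names are hypothetical.

```python
# Throughput improvements (percent over FP8-only) copied from the table above.
improvement_pct = {
    "HumanEval": 67.3,
    "GSM8K": 58.6,
    "Math-500": 89.8,
    "Med_MCPA": 46.0,
    "Health_Bench": 45.3,
}


def speedup_factor(pct: float) -> float:
    """Convert a percentage improvement into a multiplicative speedup."""
    return 1.0 + pct / 100.0


def relative_time(pct: float) -> float:
    """Time to emit a fixed number of tokens, relative to the FP8-only baseline."""
    return 1.0 / speedup_factor(pct)


for name, pct in improvement_pct.items():
    print(
        f"{name:<13} {speedup_factor(pct):.3f}x throughput, "
        f"{relative_time(pct):.1%} of baseline time"
    )
```

For example, +89.8% corresponds to a 1.898x speedup, so the same output would take a little over half the baseline time at the same concurrency.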
+### Ultimate Inference Performance
+- **Hardware Environment**: Single NVIDIA H200 GPU
+![1](https://hackmd.io/_uploads/BJF9a7MNZe.png)
+![2](https://hackmd.io/_uploads/H15K1NMV-e.png)
+![3](https://hackmd.io/_uploads/H16nT7fN-e.png)
+*Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200*
+## Technical Specifications
+- **Model Architecture**: LlamaForCausalLMEagle3
+- **Number of Layers**: 1 layer (Draft Model)
+- **Hidden Size**: 4096
+- **Attention Heads**: 32 (KV heads: 8)
+- **Intermediate Size**: 14336
+- **Vocabulary Size**: 157,184
+- **Max Position Embeddings**: 32,768
+- **Data Type**: bfloat16
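A few derived quantities follow from the specifications above. The sketch below assumes a standard grouped-query attention layout in which the head dimension equals hidden size divided by the number of attention heads; the card does not state the head dimension, so this is an assumption, and `DraftSpec` is a hypothetical helper rather than part of the model code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftSpec:
    """Values taken from the Technical Specifications list above."""

    num_layers: int = 1
    hidden_size: int = 4096
    num_attention_heads: int = 32
    num_key_value_heads: int = 8
    max_position_embeddings: int = 32768
    bytes_per_element: int = 2  # bfloat16 uses 2 bytes per value

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_attention_heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def gqa_group_size(self) -> int:
        # Number of query heads that share one key/value head.
        return self.num_attention_heads // self.num_key_value_heads

    @property
    def kv_cache_bytes_per_token(self) -> int:
        # Keys and values (factor 2) for every layer and KV head.
        return (
            2
            * self.num_layers
            * self.num_key_value_heads
            * self.head_dim
            * self.bytes_per_element
        )


spec = DraftSpec()
print("head_dim:", spec.head_dim)
print("query heads per KV head:", spec.gqa_group_size)
print("draft KV cache bytes per token:", spec.kv_cache_bytes_per_token)
full_context_mib = spec.kv_cache_bytes_per_token * spec.max_position_embeddings / 2**20
print(f"draft KV cache at max context: {full_context_mib:.0f} MiB")
```

Under these assumptions the single draft layer adds only a small KV cache on top of the target model's, which is consistent with its role as a lightweight drafter.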
+## Quick Start
+### Requirements
+- NVIDIA H200-class GPU (the benchmarks above were measured on a single H200)
+- CUDA 12.0+
+- PyTorch 2.0+
+### Installation
+```bash
+pip install sglang==0.5.6
+```
+The SGLang installation must also include the changes from PR #15119: https://github.com/sgl-project/sglang/pull/15119
+### Inference with SGLang
+```bash
+# Launch the FP8 target model with this repository as the EAGLE3 draft model.
+python3 -m sglang.launch_server \
+    --model-path MedAIBase/AntAngelMed-FP8 \
+    --host 0.0.0.0 --port 30012 \
+    --trust-remote-code \
+    --attention-backend fa3 \
+    --mem-fraction-static 0.9 \
+    --tp-size 1 \
+    --speculative-algorithm EAGLE3 \
+    --speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \
+    --speculative-num-steps 3 \
+    --speculative-eagle-topk 1 \
+    --speculative-num-draft-tokens 4
+```
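Once the server is running, it can be queried like any other SGLang server; speculative decoding happens on the server side and needs no client changes. The following is a minimal client sketch, assuming the server from the launch command above is reachable at `localhost:30012` and exposes SGLang's native `/generate` endpoint. The helper names, the default sampling values, and the example prompt are illustrative and not part of the model card.

```python
import json
import urllib.request

# Port taken from the launch command above; the host is an assumption.
SERVER_URL = "http://localhost:30012"


def build_generate_payload(
    prompt: str, max_new_tokens: int = 256, temperature: float = 0.0
) -> dict:
    """Build the JSON body for a request to SGLang's native /generate endpoint."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def generate(
    prompt: str, base_url: str = SERVER_URL, timeout: float = 60.0, **sampling
) -> str:
    """Send one prompt to the server and return the generated text."""
    payload = build_generate_payload(prompt, **sampling)
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["text"]


# Example call (requires a running server):
#   print(generate("List two common causes of iron-deficiency anemia.", max_new_tokens=128))
```

Only the standard library is used here so the sketch has no dependencies beyond Python itself.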
+## Training Data
+- **Data Quality**: Rigorously filtered and cleaned to ensure high-quality training data
+## Use Cases
+- High-concurrency inference services
+- Real-time dialogue systems
+- Code generation and completion
+- Mathematical reasoning and computation
+- Production environments requiring low-latency responses
+## Open Source Contribution
+We actively contribute back to the open-source community. The related optimizations have been submitted to the **SGLang community**:
+- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)
+## Limitations and Notes
+- This is a draft model; it must be paired with a target model to perform speculative sampling and cannot be served on its own
+- FP8 quantization is recommended for optimal performance
+- Performance may vary across different hardware platforms
+- Medical domain applications must comply with relevant regulations; model outputs are for reference only
+## License
+This code repository is licensed under [the MIT License](https://github.com/inclusionAI/Ling-V2/blob/master/LICENCE).