--- |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- MedAIBase/AntAngelMed |
|
|
--- |
|
|
|
|
|
# AntAngelMed-eagle3 |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
**AntAngelMed-eagle3** is a high-performance draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to balance inference speed with model stability.
|
|
|
|
|
The model is trained on **high-quality medical datasets** and significantly boosts inference throughput while maintaining accuracy, making it well suited for high-load production environments.
|
|
|
|
|
## Key Features |
|
|
|
|
|
- **Speculative Sampling Optimization**: Built on EAGLE3, achieving a high verification pass rate with a speculative length of 4
|
|
- **Outstanding Throughput Performance**: the FP8 quantization + EAGLE3 combination delivers throughput improvements of up to ~90%
|
|
- **Production-Grade Optimization**: Achieves 3267 tokens/s output throughput on a single NVIDIA H200
|
|
|
|
|
|
|
|
## Performance |
|
|
|
|
|
### Speculative Sampling Efficiency |
|
|
|
|
|
Average acceptance length (roughly, the number of tokens produced per target-model verification step) with a speculative length of 4; a rough interpretation is sketched after the table:
|
|
|
|
|
| Benchmark | Average Acceptance Length | |
|
|
|-----------|---------------------------| |
|
|
| HumanEval | 2.816 | |
|
|
| GSM8K | 3.240 |
|
|
| Math-500 | 3.326 | |
|
|
| Med_MCPA | 2.600 | |
|
|
| Health_Bench | 2.446 | |
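
As a rough, hedged interpretation (an assumption, not a measured result): with a speculative length of 4, the average acceptance length approximates the number of tokens the target model emits per verification step, so it serves as an idealized upper bound on the speedup before draft-model and verification overhead. The snippet below merely restates the table in those terms.

```bash
# Back-of-the-envelope only: acceptance length ~ tokens emitted per target
# forward pass, i.e. an idealized upper bound on speedup (overheads ignored).
for pair in "HumanEval:2.816" "GSM8K:3.240" "Math-500:3.326" "Med_MCPA:2.600" "Health_Bench:2.446"; do
  name=${pair%%:*}; len=${pair##*:}
  printf '%-13s idealized upper-bound speedup ~ %sx\n' "$name" "$len"
done
```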
|
|
|
|
|
### Throughput Improvement |
|
|
|
|
|
Throughput improvement of **FP8 quantization + EAGLE3** over FP8 quantization alone at a concurrency of 16 (a hedged reproduction sketch follows the table):
|
|
|
|
|
| Benchmark | Throughput Improvement | |
|
|
|-----------|------------------------| |
|
|
| HumanEval | **+67.3%** | |
|
|
| GSM8K | **+58.6%** | |
|
|
| Math-500 | **+89.8%** | |
|
|
| Med_MCPA | **+46.0%** |
|
|
| Health_Bench | **+45.3%** | |
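
For reference, this kind of comparison can be reproduced with SGLang's bundled serving benchmark against the server from the Quick Start section. The sketch below is hedged: flag names may differ between SGLang versions, and the dataset and length settings shown here are placeholders rather than the configuration behind the numbers above.

```bash
# Hedged sketch: drive the server at a fixed concurrency of 16; run once with
# FP8-only and once with EAGLE3 enabled, then compare output token throughput.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30012 \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 512 \
  --num-prompts 256 \
  --max-concurrency 16
```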
|
|
|
|
|
### Peak Inference Performance
|
|
|
|
|
- **Hardware Environment**: single NVIDIA H200 GPU
|
|
|
|
|
 |
|
|
 |
|
|
 |
|
|
|
|
|
|
|
|
*Figures: Throughput comparison and accuracy metrics under equal compute on a single H200*
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
- **Model Architecture**: LlamaForCausalLMEagle3 |
|
|
- **Number of Layers**: 1 layer (Draft Model) |
|
|
- **Hidden Size**: 4096 |
|
|
- **Attention Heads**: 32 (KV heads: 8) |
|
|
- **Intermediate Size**: 14336 |
|
|
- **Vocabulary Size**: 157,184 |
|
|
- **Max Position Embeddings**: 32,768 |
|
|
- **Data Type**: bfloat16 |
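
These values can be checked locally by printing the repository's `config.json`; the short sketch below assumes the file is laid out as a standard Hugging Face model config.

```bash
# Hedged sketch: download and pretty-print the draft model's config.json.
python3 - <<'EOF'
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("MedAIBase/AntAngelMed-eagle3", "config.json")
print(json.dumps(json.load(open(path)), indent=2))
EOF
```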
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### Requirements |
|
|
|
|
|
- NVIDIA H200-class GPU (or comparable compute)
|
|
- CUDA 12.0+ |
|
|
- PyTorch 2.0+ |
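
A quick, hedged way to confirm the environment meets these requirements:

```bash
# Check the available GPU and the installed PyTorch / CUDA versions.
nvidia-smi --query-gpu=name,memory.total --format=csv
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
```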
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install sglang==0.5.6 |
|
|
``` |
|
|
Your installation must also include PR https://github.com/sgl-project/sglang/pull/15119 (a hedged from-source sketch follows).
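
Until that change ships in a release, one hedged way to include it is to install SGLang from source with the PR's changes applied; the commands below are illustrative and may require resolving conflicts against the pinned version.

```bash
# Illustrative only: fetch the PR head from GitHub and install from source.
git clone https://github.com/sgl-project/sglang.git
cd sglang
git fetch origin pull/15119/head:pr-15119
git checkout pr-15119
pip install -e "python[all]"
```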
|
|
|
|
|
### Inference with SGLang |
|
|
|
|
|
```bash
|
|
python3 -m sglang.launch_server \ |
|
|
--model-path MedAIBase/AntAngelMed-FP8 \ |
|
|
--host 0.0.0.0 --port 30012 \ |
|
|
--trust-remote-code \ |
|
|
--attention-backend fa3 \ |
|
|
--mem-fraction-static 0.9 \ |
|
|
--tp-size 1 \ |
|
|
--speculative-algorithm EAGLE3 \ |
|
|
--speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \ |
|
|
--speculative-num-steps 3 \ |
|
|
--speculative-eagle-topk 1 \ |
|
|
--speculative-num-draft-tokens 4 |
|
|
``` |
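
Once the server is running, it exposes an OpenAI-compatible API and applies speculative decoding with the draft model transparently on the server side. A minimal request might look like the following (the prompt is a placeholder):

```bash
curl -s http://localhost:30012/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MedAIBase/AntAngelMed-FP8",
        "messages": [{"role": "user", "content": "Summarize first-line treatments for type 2 diabetes."}],
        "max_tokens": 256,
        "temperature": 0
      }'
```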
|
|
|
|
|
## Training Data |
|
|
|
|
|
- **Data Quality**: Rigorously filtered and cleaned to ensure high-quality training data |
|
|
|
|
|
## Use Cases |
|
|
|
|
|
- High-concurrency inference services |
|
|
- Real-time dialogue systems |
|
|
- Code generation and completion |
|
|
- Mathematical reasoning and computation |
|
|
- Production environments requiring low-latency responses |
|
|
|
|
|
## Open Source Contribution |
|
|
|
|
|
We actively contribute back to the open-source community. The related optimizations have been submitted upstream to **SGLang**:
|
|
- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119) |
|
|
|
|
|
|
|
|
## Limitations and Notes |
|
|
|
|
|
- This is a draft model; it must be paired with its target model (e.g., MedAIBase/AntAngelMed-FP8) to perform speculative sampling
|
|
- FP8 quantization of the target model is recommended for optimal performance
|
|
- Performance may vary across different hardware platforms |
|
|
- Medical domain applications must comply with relevant regulations; model outputs are for reference only |
|
|
|
|
|
|
|
|
## License |
|
|
|
|
|
This code repository is licensed under [the MIT License](https://github.com/inclusionAI/Ling-V2/blob/master/LICENCE). |