# AntAngelMed-eagle3

## Model Overview

**AntAngelMed-eagle3** is a high-performance draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to balance inference speed with model stability.

The model is trained on **high-quality medical datasets**, significantly boosting inference throughput while maintaining high accuracy, making it well suited to high-load production environments.

## Key Features

- **Speculative Sampling Optimization**: Based on EAGLE3, achieving a high verification pass rate with a speculative length of 4
- **Outstanding Throughput Performance**: FP8 quantization combined with EAGLE3 improves throughput by up to nearly 90%
- **Production-Grade Optimization**: 3267 tokens/s output throughput on a single NVIDIA H200

## Performance

### Speculative Sampling Efficiency

Average acceptance length with a speculative length of 4:

| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 2.816 |
| GSM8K | 3.24 |
| Math-500 | 3.326 |
| Med_MCPA | 2.600 |
| Health_Bench | 2.446 |

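As a rough, illustrative calculation (not from the model card), the average acceptance length bounds the achievable decoding speedup: each target-model verification step yields that many tokens instead of one, minus the relative cost of running the draft model. The 10% draft overhead below is a hypothetical placeholder.

```python
def ideal_speedup(acceptance_length: float, draft_overhead: float = 0.1) -> float:
    """Rough speedup bound: tokens produced per verification step,
    discounted by the relative cost of drafting (hypothetical 10% here)."""
    return acceptance_length / (1.0 + draft_overhead)

# Acceptance lengths from the table above
for bench, tau in {"HumanEval": 2.816, "GSM8K": 3.24, "Math-500": 3.326}.items():
    print(f"{bench}: ~{ideal_speedup(tau):.2f}x ideal speedup")
```

Real-world gains are lower than this bound, since verification batches and scheduling also cost time; the measured throughput numbers below reflect that.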
### Throughput Improvement

Throughput improvement of **FP8 quantization + EAGLE3** over FP8-only at a concurrency of 16:

| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+67.3%** |
| GSM8K | **+58.6%** |
| Math-500 | **+89.8%** |
| Med_MCPA | **+46%** |
| Health_Bench | **+45.3%** |

### Ultimate Inference Performance

- **Hardware Environment**: single NVIDIA H200 GPU

![1](https://hackmd.io/_uploads/BJF9a7MNZe.png)
![2](https://hackmd.io/_uploads/H15K1NMV-e.png)
![3](https://hackmd.io/_uploads/H16nT7fN-e.png)

*Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200*

## Technical Specifications

- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 (draft model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16

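The specifications above imply a standard grouped-query attention layout; a quick sanity check of the derived dimensions (illustrative only, not an official config file):

```python
# Dimensions taken from the Technical Specifications list above;
# this is an illustrative check, not the model's actual config.json.
spec = {
    "num_hidden_layers": 1,
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "intermediate_size": 14336,
    "vocab_size": 157184,
    "max_position_embeddings": 32768,
}

# Per-head dimension and grouped-query ratio follow from the spec
head_dim = spec["hidden_size"] // spec["num_attention_heads"]          # 128
gqa_group = spec["num_attention_heads"] // spec["num_key_value_heads"] # 4 query heads per KV head
print(head_dim, gqa_group)
```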
## Quick Start

### Requirements

- NVIDIA H200-class compute (or comparable hardware)
- CUDA 12.0+
- PyTorch 2.0+

### Installation

```bash
pip install sglang==0.5.6
```

Then apply PR https://github.com/sgl-project/sglang/pull/15119 on top of this release.

### Inference with SGLang

```bash
python3 -m sglang.launch_server \
    --model-path MedAIBase/AntAngelMed-FP8 \
    --host 0.0.0.0 --port 30012 \
    --trust-remote-code \
    --attention-backend fa3 \
    --mem-fraction-static 0.9 \
    --tp-size 1 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```

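Once the server is up, it can be queried through SGLang's OpenAI-compatible HTTP API. A minimal client sketch using only the standard library — the prompt and generation parameters are illustrative, and the port matches the launch command above:

```python
import json
import urllib.request

# Illustrative chat payload; the served model name matches the target
# model in the launch command above.
payload = {
    "model": "MedAIBase/AntAngelMed-FP8",
    "messages": [{"role": "user", "content": "Summarize the contraindications of ibuprofen."}],
    "max_tokens": 256,
}

def build_request(url: str = "http://localhost:30012/v1/chat/completions") -> urllib.request.Request:
    """Package the chat payload as a POST request; send it with
    urllib.request.urlopen(build_request()) once the server is running."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request()
print(req.full_url, req.get_method())
```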
## Training Data

- **Data Quality**: Rigorously filtered and cleaned to ensure high-quality training data

## Use Cases

- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses

## Open Source Contribution

We actively contribute back to the open-source community. Related optimizations have been submitted to the **SGLang** project:

- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)

## Limitations and Notes

- This is a draft model; it must be paired with a target model to perform speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across hardware platforms
- Medical domain applications must comply with relevant regulations; model outputs are for reference only

## License

This code repository is licensed under the [MIT License](https://github.com/inclusionAI/Ling-V2/blob/master/LICENCE).