HuNoNo commited on
Commit
3137106
·
verified ·
1 Parent(s): dcdca7f

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +482 -0
README.md ADDED
@@ -0,0 +1,482 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+ <p align="center">
3
+ <picture>
4
+ <source media="(prefers-color-scheme: dark)" srcset="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/logos/angelslim_logo_light.png?raw=true">
5
+ <img alt="AngelSlim" src="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/logos/angelslim_logo.png?raw=true" width=55%>
6
+ </picture>
7
+ </p>
8
+
9
+ <h3 align="center">
10
+ Dedicated to building a more intuitive, comprehensive, and efficient LLMs compression toolkit.
11
+ </h3>
12
+
13
+ <p align="center">
14
+ 📖 <a href="https://angelslim.readthedocs.io/">Documentation</a>&nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/AngelSlim">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/AngelSlim">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp💬 <a href="./docs/source/assets/angel_slim_wechat.png">WeChat</a>
15
+ <br>
16
+ </p>
17
+
18
+
19
+ ## Table of Contents
20
+
21
+ - [Latest Updates](#latest-updates)
22
+ - [Key Features](#key-features)
23
+ - [Supported Models](#supported-models)
24
+ - [How to Use](#how-to-use)
25
+ - [Install AngelSlim](#install-angelslim)
26
+ - [Quick Start](#quick-start)
27
+ - [deployment & Evaluation](#deployment)
28
+ - [Benchmark](#benchmark)
29
+ - [License](#license)
30
+ - [Citation](#citation)
31
+ - [Technical Discussion](#technical-discussion)
32
+
33
+ ## 📣Latest Updates
34
+ - [25/11/05] We have released v0.2. Quantization support for new models, such as `GLM-4.6`, `Qwen3-VL` and `Qwen3-Omni`, open-sources the Eagle3 speculative decoding training framework, and updates the Diffusion model quantization tools.
35
+ - [25/09/30] We have released **SpecExit**, the reasoning early-exit algorithm: [[Paper]](http://arxiv.org/abs/2509.24248) | [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/spec_exit.html) | [[vLLM Code]](https://github.com/vllm-project/vllm/pull/27192)🔥🔥🔥
36
+ - [25/09/26] We have released **TEQUILA**, the ternary quantization algorithm [[Paper]](https://arxiv.org/abs/2509.23809) | [[Code]](https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant)🔥🔥🔥
37
+ - [25/09/24] We now support the PTQ quantification of NVFP4 for the Qwen3 series models. We also opensource [Qwen3-32B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-32B_nvfp4) and [Qwen3-235B-A22B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-235B-A22B_nvfp4) weights.
38
+ - [25/09/01] We now support ​FP8 quantization​ of the [Hunyuan-MT-7B](https://huggingface.co/tencent/Hunyuan-MT-7B-fp8) translation model. And enabled ​Torch inference and Benchmark evaluation​ for Eagle3. And implemented support for ​quantization and Cache​ for [FLUX](https://github.com/Tencent/AngelSlim/tree/main/configs/flux). And support ​quantization​ for the [Seed-OSS](https://github.com/Tencent/AngelSlim/tree/main/configs/seed_oss).
39
+ - [25/08/06] We now support quantization for `Hunyuan 0.5B/1.8B/4B/7B` and multimodal model `Qwen2.5VL 3B/7B/32B/72B`, including `FP8/INT4` algorithms, and quantization for `DeepSeek-R1/V3` and `Kimi-K2`, including `FP8-Static` and `W4A8-FP8` algorithms. We also opensource `Hunyuan 1.8B/4B/7B` series Eagle3 model weight.
40
+ - [25/07/04] We now support quantization for `Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen` and other models, including `INT8/FP8/INT4` algorithms. We also opensource `Qwen3` series Eagle3 model weight.
41
+
42
+ Coming soon:
43
+ - [ ] Diffusion model compression support.
44
+ - [ ] Release of new algorithm for speculative sampling.
45
+
46
+ ## 🌟Key Features
47
+
48
+ - **Highly Integrated**: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
49
+ - **Continuous Innovation**: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
50
+ - **Performance-Driven**: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.
51
+
52
+ ## 💼Supported Models
53
+
54
+ ### Quantization
55
+ Currently supports the following LLMs, including Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1 distilled Qwen models, and QwQ::
56
+
57
+ | Model | FP8-Dynamic | FP8-Static | INT8-Dynamic | INT4-GPTQ | INT4-AWQ |
58
+ | --------------------------------------------------------------------------------------------------------------------------- | ----------- | ---------- | ------------ | --------- | -------- |
59
+ | [Hunyuan-Dense](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7) | ✅ | ✅ | ✅ | ✅ | ✅ |
60
+ | [Hunyuan-MoE](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | ✅ | ✅ | ✅ | ✅ | ✅ |
61
+ | [Qwen3-Dense](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
62
+ | [Qwen3-MoE](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
63
+ | [Qwen2.5](https://huggingface.co/collections/AngelSlim/qwen2-25-quant-68652d6cbdf5c0d4b1c4499a) | ✅ | ✅ | ✅ | ✅ | ✅ |
64
+ | [DeepSeek-R1-Distill-Qwen](https://huggingface.co/collections/AngelSlim/deepseek-r1-distill-quant-68652f16a9c206b030b05f7f) | ✅ | ✅ | ✅ | ✅ | ✅ |
65
+ | [QwQ](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
66
+
67
+ ### Speculative Decoding
68
+
69
+ #### Eagle3
70
+ The Eagle3 weights for the Qwen3 series model are now available.
71
+
72
+ | Qwen3 Models | Hunyuan Models |
73
+ | ----------|----------|
74
+ | ✅ [Qwen3-1.7B](https://huggingface.co/AngelSlim/Qwen3-1.7B_eagle3) |✅ [Hunyuan-1.8B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-1.8B-Instruct_eagle3) |
75
+ | ✅ [Qwen3-4B](https://huggingface.co/AngelSlim/Qwen3-4B_eagle3) |✅ [Hunyuan-4B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-4B-Instruct_eagle3) |
76
+ | ✅ [Qwen3-8B](https://huggingface.co/AngelSlim/Qwen3-8B_eagle3) |✅ [Hunyuan-7B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-7B-Instruct_eagle3) |
77
+ | ✅ [Qwen3-14B](https://huggingface.co/AngelSlim/Qwen3-14B_eagle3) |
78
+ | ✅ [Qwen3-32B](https://huggingface.co/AngelSlim/Qwen3-32B_eagle3) |
79
+ | ✅ [Qwen3-30B-A3B](https://huggingface.co/AngelSlim/Qwen3-a3B_eagle3) |
80
+
81
+ ## 🛎️How to Use
82
+
83
+ ### Install AngelSlim
84
+
85
+ We recommend using `pip` to install the latest stable version of `AngelSlim`:
86
+
87
+ ```shell
88
+ pip install angelslim
89
+ ```
90
+
91
+ Alternatively, you can clone the repository and install from source in editable mode:
92
+
93
+ ```shell
94
+ cd AngelSlim && python setup.py install
95
+ ```
96
+
97
+ For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).
98
+
99
+ ### Quick Start
100
+
101
+ #### Quantization
102
+
103
+ After installing `AngelSlim`, you can quickly start by running the following script to perform static `FP8` quantization on the `Qwen3-1.7B` model:
104
+
105
+ * One-click Start
106
+
107
+ ```shell
108
+ python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
109
+ ```
110
+
111
+ This example will load the HuggingFace model and perform activation value calibration using the `dataset` specified in the config file, saving the quantized model weights.
112
+
113
+ * Code-based Start
114
+
115
+ To perform dynamic `FP8` quantization on `Qwen3-1.7B`:
116
+
117
+ ```python
118
+ from angelslim.engine import Engine
119
+
120
+ slim_engine = Engine()
121
+ # Prepare model
122
+ slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B",)
123
+ # Initialize compressor
124
+ slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
125
+ # Compress model
126
+ slim_engine.run()
127
+ # Save compressed model
128
+ slim_engine.save("./output")
129
+ ```
130
+
131
+ For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).
132
+
133
+ #### Speculative_Decoding
134
+
135
+ ##### Eagle3 PyTorch Performance Testing
136
+
137
+ After installing `AngelSlim`, you can quickly start Eagle3 PyTorch performance testing with the following script:
138
+
139
+ ```bash
140
+ python3 tools/spec_benchmark.py \
141
+ --base-model-path /path/to/base/model \
142
+ --eagle-model-path /path/to/eagle/model \
143
+ --model-id your_model_id \
144
+ --mode both
145
+ ```
146
+
147
+ For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).
148
+
149
+ ### Deployment and Testing
150
+
151
+ ### 1. Offline Inference
152
+
153
+ If you need to load a quantized model via `transformers`, please set the `deploy_backend: huggingface` in the `global` configuration before quantizing the model, or manually modify the `ignored_layers` field in the `config.json` file located in the quantized model output directory to `ignore`.
154
+
155
+ To test offline inference with a quantized model loaded via `transformers`, run the following command:
156
+
157
+ ```shell
158
+ python deploy/offline.py $MODEL_PATH
159
+ ```
160
+
161
+ Where `MODEL_PATH` is the path to the quantized model output.
162
+
163
+ #### 2. API Service Deployment
164
+
165
+ After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service using the following LLMs inference frameworks:
166
+
167
+ **vLLM**
168
+
169
+ Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm) server, recommended version `vllm>=0.8.5.post1`. For MOE INT8 quantized models, vllm>=0.9.0 is required.
170
+
171
+
172
+ ```shell
173
+ bash deploy/run_vllm.sh $MODEL_PATH
174
+ ```
175
+
176
+ **SGLang**
177
+
178
+
179
+ Use the following script to launch a [SGLang](https://github.com/sgl-project/sglang) server, recommended version `sglang>=0.4.6.post1`.
180
+
181
+ ```shell
182
+ bash deploy/run_sglang.sh $MODEL_PATH
183
+ ```
184
+
185
+ #### 3. Service Invocation
186
+
187
+ Invoke requests via [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):
188
+
189
+ ```shell
190
+ bash deploy/openai.sh $MODEL_PATH
191
+ ```
192
+
193
+ #### 4. Performance Evaluation
194
+
195
+ Evaluate the performance of quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), recommended version`lm-eval>=0.4.8`:
196
+
197
+ ```shell
198
+ bash deploy/lm_eval.sh $MODEL_PATH
199
+ ```
200
+
201
+ For more detaileds, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).
202
+
203
+
204
+ ## 📈 Benchmark
205
+
206
+ ### (1) Quantization
207
+
208
+ The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html)
209
+
210
+ #### Hunyuan Series Models
211
+
212
+ Benchmark results for the `Hunyuan-Instruct` model with `FP8`, `INT4-AWQ` and `INT4-GPTQ` quantization algorithms on datasets including`OlympiadBench`, `AIME 2024` and `DROP`:
213
+
214
+ <table>
215
+ <thead>
216
+ <tr><th>Model</th><th>Quantization</th><th>OlympiadBench</th><th>AIME 2024</th><th>DROP</th><th>GPQA-Diamond</th></tr>
217
+ </thead>
218
+ <tbody>
219
+ <tr><td rowspan="4">Hunyuan-A13B-Instruct</td>
220
+ <td>BF16</td><td>82.7</td><td>87.30</td><td>91.1</td><td>71.2</td></tr>
221
+ <tr><td>FP8-Static</td><td>83.0</td><td>86.7</td><td>91.1</td><td>-</td></tr>
222
+ <tr><td>Int4-GPTQ</td><td>82.7</td><td>86.7</td><td>91.1</td><td>-</td></tr>
223
+ <tr><td>Int4-AWQ</td><td>82.6</td><td>85.6</td><td>91.0</td><td>-</td></tr>
224
+ </tbody>
225
+ <tbody>
226
+ <tr><td rowspan="4">Hunyuan-7B-Instruct</td>
227
+ <td>BF16</td> <td>76.5</td><td>81.1</td><td>85.9</td><td>60.1</td></tr>
228
+ <tr><td>FP8-Static</td><td>76.6</td><td>80.9</td><td>86.0</td><td>60.1</td></tr>
229
+ <tr><td>Int4-GPTQ</td><td>76.2</td><td>81.0</td><td>85.7</td><td>60.0</td></tr>
230
+ <tr><td>Int4-AWQ</td><td>76.4</td><td>80.9</td><td>85.9</td><td>60.1</td></tr>
231
+ </tbody>
232
+ <tbody>
233
+ <tr><td rowspan="4">Hunyuan-4B-Instruct</td>
234
+ <td>BF16</td> <td>73.1</td><td>78.3</td><td>78.2</td><td>61.1</td></tr>
235
+ <tr><td>FP8-Static</td><td>73.1</td><td>76.6</td><td>78.3</td><td>60.2</td></tr>
236
+ <tr><td>Int4-GPTQ</td><td>72.9</td><td>-</td><td>78.1</td><td>58.1</td></tr>
237
+ <tr><td>Int4-AWQ</td><td>72.8</td><td>-</td><td>78.2</td><td>-</td></tr>
238
+ </tbody>
239
+ <tbody>
240
+ <tr><td rowspan="4">Hunyuan-1.8B-Instruct</td>
241
+ <td>BF16</td> <td>63.4</td><td>56.7</td><td>76.7</td><td>47.2</td></tr>
242
+ <tr><td>FP8-Static</td><td>62.5</td><td>55.2</td><td>75.1</td><td>47.7</td></tr>
243
+ <tr><td>Int4-GPTQ</td><td>60.9</td><td>-</td><td>73.0</td><td>44.4</td></tr>
244
+ <tr><td>Int4-AWQ</td><td>61.7</td><td>-</td><td>71.7</td><td>43.6</td></tr>
245
+ </tbody>
246
+ <tbody>
247
+ <tr><td rowspan="4">Hunyuan-0.5B-Instruct</td>
248
+ <td>BF16</td> <td>29.6</td><td>17.2</td><td>52.8</td><td>23.3</td></tr>
249
+ <tr><td>FP8-Static</td><td>29.6</td><td>17.2</td><td>51.6</td><td>22.5</td></tr>
250
+ <tr><td>Int4-GPTQ</td><td>26.8</td><td>-</td><td>50.9</td><td>23.3</td></tr>
251
+ <tr><td>Int4-AWQ</td><td>26.3</td><td>-</td><td>48.9</td><td>23.3</td></tr>
252
+ </tbody>
253
+ </table>
254
+
255
+ #### Qwen3 Series Models
256
+
257
+ Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:
258
+
259
+ <table>
260
+ <thead>
261
+ <tr><th>Model</th><th>Quantization</th><th>CEVAL</th><th>MMLU</th><th>GSM8K</th><th>HUMANEVAL</th></tr>
262
+ </thead>
263
+ <tbody>
264
+ <tr><td rowspan="4">Qwen3-0.6B</td><td>BF16</td><td>45.84</td><td>47.21</td><td>42.99</td><td>19.51</td></tr>
265
+ <tr><td>FP8-Static</td><td>45.99</td><td>46.87</td><td>38.06</td><td>18.90</td></tr>
266
+ <tr><td>FP8-Dynamic</td><td>45.99</td><td>46.93</td><td>38.29</td><td>20.73</td></tr>
267
+ <tr><td>INT8-Dynamic</td><td>45.17</td><td>46.95</td><td>41.17</td><td>21.34</td></tr>
268
+ <tr><td rowspan="6">Qwen3-8B</td><td>BF16</td><td>79.27</td><td>74.78</td><td>87.79</td><td>63.41</td></tr>
269
+ <tr><td>FP8-Static</td><td>78.23</td><td>74.79</td><td>86.96</td><td>62.20</td></tr>
270
+ <tr><td>FP8-Dynamic</td><td>78.45</td><td>74.75</td><td>87.64</td><td>62.80</td></tr>
271
+ <tr><td>INT8-Dynamic</td><td>78.01</td><td>74.84</td><td>86.96</td><td>67.07</td></tr>
272
+ <tr><td>INT4-GPTQ</td><td>77.19</td><td>73.26</td><td>86.43</td><td>62.20</td></tr>
273
+ <tr><td>INT4-AWQ</td><td>76.15</td><td>73.59</td><td>86.96</td><td>63.41</td></tr>
274
+ <tr><td rowspan="6">Qwen3-14B</td><td>BF16</td><td>83.06</td><td>78.90</td><td>88.40</td><td>55.49</td></tr>
275
+ <tr><td>FP8-Static</td><td>82.62</td><td>78.57</td><td>89.46</td><td>57.32</td></tr>
276
+ <tr><td>FP8-Dynamic</td><td>82.24</td><td>78.92</td><td>88.32</td><td>52.44</td></tr>
277
+ <tr><td>INT8-Dynamic</td><td>81.87</td><td>78.13</td><td>86.28</td><td>56.10</td></tr>
278
+ <tr><td>INT4-GPTQ</td><td>81.05</td><td>78.02</td><td>87.34</td><td>57.93</td></tr>
279
+ <tr><td>INT4-AWQ</td><td>82.02</td><td>77.68</td><td>84.23</td><td>61.59</td></tr>
280
+ <tr><td rowspan="5">Qwen3-32B</td><td>BF16</td><td>86.55</td><td>82.00</td><td>74.53</td><td>37.80</td></tr>
281
+ <tr><td>FP8-Static</td><td>86.92</td><td>81.78</td><td>70.20</td><td>39.63</td></tr>
282
+ <tr><td>FP8-Dynamic</td><td>86.55</td><td>81.89</td><td>70.43</td><td>38.41</td></tr>
283
+ <tr><td>INT4-GPTQ</td><td>86.18</td><td>81.01</td><td>-</td><td>43.29</td></tr>
284
+ <tr><td>INT4-AWQ</td><td>86.18</td><td>81.54</td><td>-</td><td>36.59</td></tr>
285
+ <tr><td rowspan="4">Qwen3-30B-A3B</td><td>BF16</td><td>83.66</td><td>79.36</td><td>89.99</td><td>31.71</td></tr>
286
+ <tr><td>FP8-Static</td><td>83.95</td><td>79.47</td><td>89.01</td><td>31.10</td></tr>
287
+ <tr><td>FP8-Dynamic</td><td>84.10</td><td>79.40</td><td>89.16</td><td>32.93</td></tr>
288
+ <tr><td>INT8-Dynamic</td><td>83.36</td><td>79.48</td><td>89.16</td><td>34.15</td></tr>
289
+ <tr><td rowspan="4">Qwen3-235B-A22B</td><td>BF16</td><td>89.60</td><td>86.28</td><td>85.29</td><td>27.44</td></tr>
290
+ <tr><td>FP8-Static</td><td>89.67</td><td>86.19</td><td>86.96</td><td>27.44</td></tr>
291
+ <tr><td>FP8-Dynamic</td><td>89.67</td><td>86.18</td><td>85.22</td><td>28.05</td></tr>
292
+ <tr><td>INT8-Dynamic</td><td>88.93</td><td>86.20</td><td>86.20</td><td>23.78</td></tr>
293
+ <tr><td rowspan="5">QwQ-32B</td><td>BF16</td><td>85.74</td><td>82.03</td><td>73.31</td><td>42.68</td></tr>
294
+ <tr><td>FP8-Static</td><td>85.44</td><td>81.91</td><td>75.36</td><td>42.68</td></tr>
295
+ <tr><td>FP8-Dynamic</td><td>85.07</td><td>81.93</td><td>75.66</td><td>42.07</td></tr>
296
+ <tr><td>INT4-GPTQ</td><td>84.03</td><td>81.26</td><td>68.23</td><td>45.73</td></tr>
297
+ <tr><td>INT4-AWQ</td><td>83.58</td><td>81.01</td><td>68.69</td><td>43.29</td></tr>
298
+ </tbody>
299
+ </table>
300
+
301
+ #### Qwen2.5VL Series Models
302
+
303
+ Benchmark results for Qwen2.5VL series models with `BF16`、`FP8-Static`、`FP8-Dynamic`、`INT4-GPTQ`、`INT4-AWQ` quantization algorithms on datasets including `MMMU_VAL`、`DocVQA_VAL` and `ChartQA_TEST`:
304
+
305
+ <table>
306
+ <thead>
307
+ <tr><th>Model</th><th>Quantization</th><th>MMMU_VAL</th><th>MMLDocVQA_VALU</th><th>ChartQA_TEST</th></tr>
308
+ </thead>
309
+ <tbody>
310
+ <tr><td rowspan="5">Qwen2.5VL-3B</td><td>BF16</td><td>47.11</td><td>78.57</td><td>80.32</td></tr>
311
+ <tr><td>FP8-Static</td><td>47.33</td><td>79.34</td><td>79.68</td></tr>
312
+ <tr><td>FP8-Dynamic</td><td>45.99</td><td>46.93</td><td>38.29</td></tr>
313
+ <tr><td>INT4-GPTQ</td><td>46.56</td><td>77.20</td><td>78.96</td></tr>
314
+ <tr><td>INT4-AWQ</td><td>45.78</td><td>-</td><td>79.60</td></tr>
315
+ <tr><td rowspan="5">Qwen2.5VL-7B</td><td>BF16</td><td>45.44</td><td>89.71</td><td>84.64</td></tr>
316
+ <tr><td>FP8-Static</td><td>47.00</td><td>89.83</td><td>85.92</td></tr>
317
+ <tr><td>FP8-Dynamic</td><td>47.22</td><td>89.80</td><td>88.64</td></tr>
318
+ <tr><td>INT4-GPTQ</td><td>46.67</td><td>90.45</td><td>-</td></tr>
319
+ <tr><td>INT4-AWQ</td><td>45.67</td><td>89.28</td><td>-</td></tr>
320
+ <tr><td rowspan="5">Qwen2.5VL-32B</td><td>BF16</td><td>57.00</td><td>90.03</td><td>-</td></tr>
321
+ <tr><td>FP8-Static</td><td>57.00</td><td>89.88</td><td>-</td></tr>
322
+ <tr><td>FP8-Dynamic</td><td>56.44</td><td>89.88</td><td>-</td></tr>
323
+ <tr><td>INT4-GPTQ</td><td>55.22</td><td>89.80 </td><td>-</td></tr>
324
+ <tr><td>INT4-AWQ</td><td>55.22</td><td>90.30</td><td>-</td></tr>
325
+ <tr><td rowspan="5">Qwen2.5VL-72B</td><td>BF16</td><td>58.78</td><td>94.39</td><td>85.60</td></tr>
326
+ <tr><td>FP8-Static</td><td>57.89</td><td>94.41</td><td>85.84</td></tr>
327
+ <tr><td>FP8-Dynamic</td><td>58.67</td><td>94.38</td><td>85.60</td></tr>
328
+ <tr><td>INT4-GPTQ</td><td>57.56</td><td>94.46</td><td>86.48</td></tr>
329
+ <tr><td>INT4-AWQ</td><td>58.78</td><td>94.19</td><td>87.28</td></tr>
330
+ </tbody>
331
+ </table>
332
+
333
+ #### DeepSeek Series Models
334
+
335
+ Benchmark results for DeepSeek-R1-0528 series models with `FP8-Block-Wise` and `W4A8-FP8` quantization algorithms on datasets including `GPQA Diamond`、`AIME 2024`、`SimpleQA` and `LiveCodeBench`:
336
+
337
+ <table>
338
+ <thead>
339
+ <tr><th>Model</th><th>Quantization</th><th>GPQA Diamond</th><th>AIME 2024</th><th>SimpleQA</th><th>LiveCodeBench</th></tr>
340
+ </thead>
341
+ <tbody>
342
+ <tr><td rowspan="6">DeepSeek-R1-0528</td><td>FP8-Block-Wise</td><td>78.28</td><td>88.67</td><td>27.8</td><td>77.1</td></tr>
343
+ <tr><td>W4A8-FP8</td><td>77.37</td><td>88.67</td><td>26.83</td><td>78.86</td></tr>
344
+ </tbody>
345
+ </table>
346
+
347
+ > **Note**:
348
+ > - The above results are based on the average of 5 test runs deployed with TRT-LLM
349
+ > - The hyperparameters used during evaluation are as follows:
350
+ > ```json
351
+ >{
352
+ > "top_k": 20,
353
+ > "top_p": 0.6,
354
+ > "temperature": 0.7,
355
+ > "output_seq_len": 32768,
356
+ > "max_input_seq_len": 16384
357
+ >}
358
+ >```
359
+
360
+ #### Other Models
361
+
362
+ Benchmark results for other models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU` and `GSM8K`:
363
+
364
+ <table>
365
+ <thead>
366
+ <tr><th>Model</th><th>Quantization</th><th>CEVAL</th><th>MMLU</th><th>GSM8K</th></tr>
367
+ </thead>
368
+ <tbody>
369
+ <tr><td rowspan="3">Qwen2.5-1.5B-Instruct</td><td>BF16</td><td>67.01</td><td>60.05</td><td>54.28</td></tr>
370
+ <tr><td>FP8-Static</td><td>66.27</td><td>60.23</td><td>-</td></tr>
371
+ <tr><td>FP8-Dynamic</td><td>66.79</td><td>60.08</td><td>51.71</td></tr>
372
+ <tr><td rowspan="5">Qwen2.5-7B-Instruct</td><td>BF16</td><td>81.20</td><td>74.55</td><td>79.98</td></tr>
373
+ <tr><td>FP8-Static</td><td>81.13</td><td>74.03</td><td>79.30</td></tr>
374
+ <tr><td>FP8-Dynamic</td><td>80.31</td><td>74.07</td><td>79.00</td></tr>
375
+ <tr><td>INT4-GPTQ</td><td>79.05</td><td>73.05</td><td>74.75</td></tr>
376
+ <tr><td>INT4-AWQ</td><td>79.35</td><td>73.22</td><td>79.38</td></tr>
377
+ <tr><td rowspan="5">Qwen2.5-32B-Instruct</td><td>BF16</td><td>87.30</td><td>83.21</td><td>81.73</td></tr>
378
+ <tr><td>FP8-Static</td><td>87.59</td><td>83.08</td><td>81.58</td></tr>
379
+ <tr><td>FP8-Dynamic</td><td>87.30</td><td>83.04</td><td>81.58</td></tr>
380
+ <tr><td>INT4-GPTQ</td><td>86.70</td><td>82.45</td><td>82.03</td></tr>
381
+ <tr><td>INT4-AWQ</td><td>87.00</td><td>82.64</td><td>-</td></tr>
382
+ <tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-7B</td><td>BF16</td><td>53.49</td><td>53.80</td><td>75.74</td></tr>
383
+ <tr><td>FP8-Static</td><td>53.57</td><td>54.17</td><td>76.19</td></tr>
384
+ <tr><td>FP8-Dynamic</td><td>52.97</td><td>54.13</td><td>74.15</td></tr>
385
+ <tr><td>INT4-GPTQ</td><td>51.86</td><td>52.44</td><td>75.89</td></tr>
386
+ <tr><td>INT4-AWQ</td><td>53.49</td><td>53.70</td><td>-</td></tr>
387
+ <tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-14B</td><td>BF16</td><td>77.71</td><td>74.28</td><td>85.67</td></tr>
388
+ <tr><td>FP8-Static</td><td>77.56</td><td>74.66</td><td>86.73</td></tr>
389
+ <tr><td>FP8-Dynamic</td><td>76.82</td><td>74.63</td><td>87.11</td></tr>
390
+ <tr><td>INT4-GPTQ</td><td>74.29</td><td>72.37</td><td>84.61</td></tr>
391
+ <tr><td>INT4-AWQ</td><td>74.81</td><td>73.00</td><td>86.05</td></tr>
392
+ <tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-32B</td><td>BF16</td><td>84.18</td><td>80.89</td><td>87.41</td></tr>
393
+ <tr><td>FP8-Static</td><td>83.43</td><td>80.90</td><td>87.57</td></tr>
394
+ <tr><td>FP8-Dynamic</td><td>83.73</td><td>81.10</td><td>86.43</td></tr>
395
+ <tr><td>INT4-GPTQ</td><td>84.10</td><td>79.80</td><td>86.73</td></tr>
396
+ <tr><td>INT4-AWQ</td><td>82.84</td><td>80.15</td><td>87.19</td></tr>
397
+ </tbody>
398
+ </table>
399
+
400
+ ### (2) Speculative Decoding
401
+
402
+ #### Qwen3 Series Models
403
+ Benchmark results for Qwen3 series models with `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HunmanEval`, `GSM8K`, and `Alpaca`:
404
+
405
+ <table>
406
+ <thead>
407
+ <tr>
408
+ <th>&nbsp</th><th>&nbsp</th>
409
+ <th colspan="2" style="text-align: center; vertical-align: middle;">MT-bench</th>
410
+ <th colspan="2" style="text-align: center; vertical-align: middle;">HumanEval</th>
411
+ <th colspan="2" style="text-align: center; vertical-align: middle;">GSM8K</th>
412
+ <th colspan="2" style="text-align: center; vertical-align: middle;">Alpaca</th>
413
+ <th colspan="2" style="text-align: center; vertical-align: middle;">Mean</th></tr>
414
+ <tr><th>Temperature</th><th>Model</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th></tr>
415
+ </thead>
416
+ <tbody>
417
+ <!-- <tr><td colspan="12" style="text-align: center; vertical-align: middle;"><strong>Temperature=0</strong></td></tr> -->
418
+ <tr><td rowspan="6"><strong>T=0</strong></td>
419
+ <td>Qwen3-1.7B</td><td>2.05x</td><td>2.81</td><td>2.07x</td><td>2.93</td><td>2.11x</td><td>2.98</td><td>1.93x</td><td>2.69</td><td>2.04x</td><td>2.85</td></tr>
420
+ <tr> <td>Qwen3-4B</td><td>2.21x</td><td>3.01</td><td>2.36x</td><td>3.24</td><td>2.42x</td><td>3.13</td><td>2.32x</td><td>2.75</td><td>2.33x</td><td>3.03</td></tr>
421
+ <tr><td>Qwen3-8B</td><td>2.63x</td><td>3.65</td><td>2.76x</td><td>3.85</td><td>2.82x</td><td>3.90</td><td>2.62x</td><td>3.48</td><td>2.70x</td><td>3.72</td></tr>
422
+ <tr><td>Qwen3-14B</td><td>2.23x</td><td>3.30</td><td>2.53x</td><td>3.74</td><td>2.56x</td><td>3.79</td><td>2.16x</td><td>3.13</td><td>2.37x</td><td>3.49</td></tr>
423
+ <tr><td>Qwen3-32B</td><td>2.39x</td><td>2.78</td><td>2.37x</td><td>2.81</td><td>2.47x</td><td>2.92</td><td>2.42x</td><td>2.53</td><td>2.41x</td><td>2.76</td></tr>
424
+ <tr><td>Qwen3-30B-A3B</td><td>2.84x</td><td>3.63</td><td>2.27x</td><td>3.09</td><td>2.64x</td><td>3.42</td><td>2.83x</td><td>3.56</td><td>2.64x</td><td>3.42</td></tr>
425
+ <!-- <tr><td colspan="12" style="text-align: center; vertical-align: middle;"><strong>Temperature=1</strong></td></tr> -->
426
+ <tr><td rowspan="6"><strong>T=1</strong></td>
427
+ <td>Qwen3-1.7B</td><td>1.74x</td><td>2.53</td><td>1.86x</td><td>2.70</td><td>1.82x</td><td>2.69</td><td>1.72x</td><td>2.46</td><td>1.93x</td><td>2.60</td></tr>
428
+ <tr><td>Qwen3-4B</td><td>1.93x</td><td>2.60</td><td>2.00x</td><td>2.84</td><td>2.11x</td><td>2.82</td><td>2.34x</td><td>2.50</td><td>1.75x</td><td>2.69</td></tr>
429
+ <tr><td>Qwen3-8B</td><td>1.98x</td><td>2.75</td><td>2.25x</td><td>3.11</td><td>2.31x</td><td>3.15</td><td>2.10x</td><td>2.76</td><td>2.90x</td><td>2.94</td></tr>
430
+ <tr><td>Qwen3-14B</td><td>1.71x</td><td>2.61</td><td>1.95x</td><td>2.87</td><td>2.04x</td><td>3.08</td><td>1.68x</td><td>2.55</td><td>2.90x</td><td>2.78</td></tr>
431
+ <tr><td>Qwen3-32B</td><td>1.62x</td><td>1.91</td><td>1.71x</td><td>2.05</td><td>1.78x</td><td>2.10</td><td>1.80x</td><td>1.95</td><td>1.62x</td><td>2.00</td></tr>
432
+ <tr><td>Qwen3-30B-A3B</td><td>1.91x</td><td>2.46</td><td>2.00x</td><td>2.64</td><td>1.90x</td><td>2.53</td><td>1.80x</td><td>2.32</td><td>1.90x</td><td>2.48</td></tr>
433
+ </tbody>
434
+ </table>
435
+
436
+ #### Hunyuan Series Models
437
+ Benchmark results for Hunyuan series models with `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HunmanEval`, `GSM8K`, and `Alpaca`:
438
+
439
+ <table>
440
+ <thead>
441
+ <tr>
442
+ <th>&nbsp</th><th>&nbsp</th>
443
+ <th colspan="2" style="text-align: center; vertical-align: middle;">MT-bench</th>
444
+ <th colspan="2" style="text-align: center; vertical-align: middle;">HumanEval</th>
445
+ <th colspan="2" style="text-align: center; vertical-align: middle;">GSM8K</th>
446
+ <th colspan="2" style="text-align: center; vertical-align: middle;">Alpaca</th>
447
+ <th colspan="2" style="text-align: center; vertical-align: middle;">Mean</th></tr>
448
+ <tr><th>Temperature</th><th>Model</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th></tr>
449
+ </thead>
450
+ <tbody>
451
+ <!-- <tr><td colspan="12" style="text-align: center; vertical-align: middle;"><strong>Temperature=0</strong></td></tr> -->
452
+ <tr><td rowspan="3"><strong>T=0</strong></td>
453
+ <td>Hunyuan-1.8B-Instruct</td><td>1.97x</td><td>2.90</td><td>2.58x</td><td>3.73</td><td>2.61x</td><td>3.71</td><td>1.71x</td><td>2.43</td><td>2.22x</td><td>3.19</td></tr>
454
+ <tr> <td>Hunyuan-4B-Instruct</td><td>1.77x</td><td>2.60</td><td>2.64x</td><td>3.35</td><td>2.14x</td><td>3.17</td><td>1.72x</td><td>2.57</td><td>2.07x</td><td>2.92</td></tr>
455
+ <tr><td>Hunyuan-7B-Instruct</td><td>2.22x</td><td>3.58</td><td>3.59x</td><td>5.47</td><td>2.96x</td><td>4.68</td><td>1.64x</td><td>2.56</td><td>2.60x</td><td>4.07</td></tr>
456
+ <!-- <tr><td colspan="12" style="text-align: center; vertical-align: middle;"><strong>Temperature=1</strong></td></tr> -->
457
+ <tr><td rowspan="3"><strong>T=1</strong></td>
458
+ <td>Hunyuan-1.8B-Instruct</td><td>1.58x</td><td>2.36</td><td>2.35x</td><td>3.56</td><td>2.23x</td><td>3.38</td><td>1.26x</td><td>1.87</td><td>1.86x</td><td>2.79</td></tr>
459
+ <tr><td>Hunyuan-4B-Instruct</td><td>1.36x</td><td>2.05</td><td>1.97x</td><td>2.86</td><td>1.72x</td><td>2.68</td><td>1.14x</td><td>1.76</td><td>1.55x</td><td>2.34</td></tr>
460
+ <tr><td>Hunyuan-7B-Instruct</td><td>1.90x</td><td>3.11</td><td>3.12x</td><td>5.09</td><td>2.74x</td><td>4.34</td><td>1.47x</td><td>2.39</td><td>2.31x</td><td>3.73</td></tr>
461
+ </tbody>
462
+ </table>
463
+
464
+ ## 📝 License
465
+
466
+ The code for this project is open-sourced under the [License for AngelSlim](LICENSE).
467
+
468
+ ## 🔗 Citation
469
+
470
+ ```
471
+ @software{AngelSlim2025,
472
+ title={{AngelSlim}},
473
+ author={Tencent AngelSlim Project Contributors},
474
+ year={2025},
475
+ month={6},
476
+ url={https://github.com/Tencent/AngelSlim},
477
+ }
478
+ ```
479
+
480
+ ## 💬 Technical Discussion
481
+
482
+ * AngelSlim is continuously iterating and new features will be released soon. If you have any questions or suggestions, please open an issue on [GitHub Issues](https://github.com/Tencent/AngelSlim/issues) or join our [WeChat technical discussion group](./docs/source/assets/angel_slim_wechat.png).