liuhong10 committed · Commit 0ec1d39 · verified · 1 parent: 79ef99f

Update README.md

Files changed (1):
  1. README.md +818 -153
README.md CHANGED
@@ -22,30 +22,21 @@ Dedicated to building a more intuitive, comprehensive, and efficient LLMs compre
  </p>


- ## Table of Contents

- - [Latest Updates](#latest-updates)
- - [Key Features](#key-features)
- - [Supported Models](#supported-models)
- - [How to Use](#how-to-use)
- - [Install AngelSlim](#install-angelslim)
- - [Quick Start](#quick-start)
- - [Deployment & Evaluation](#deployment)
- - [Benchmark](#benchmark)
- - [License](#license)
- - [Citation](#citation)
- - [Technical Discussion](#technical-discussion)

- ## 📣Latest Updates

- - [25/07/04] We now support quantization for Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen and other models, including INT8/FP8/INT4 algorithms.
-   We also open-source the Qwen3-8B Eagle3 model weights.
-
- Coming soon:
-
- - [ ] Support W4A8 quantization for DeepSeek-R1.
- - [ ] Support quantization for multimodal models like Qwen-VL.
- - [ ] Release of a new algorithm for speculative sampling.

  ## 🌟Key Features
 
@@ -53,36 +44,170 @@ Coming soon:
  - **Continuous Innovation**: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
  - **Performance-Driven**: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.

- ## 💼Supported Models
-
- ### Quantization
- Currently supports the following LLMs, including Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1-distilled Qwen models, and QwQ:
-
- | Model | FP8-Dynamic | FP8-Static | INT8-Dynamic | INT4-GPTQ | INT4-AWQ |
- | ----- | ----------- | ---------- | ------------ | --------- | -------- |
- | [Hunyuan-Dense](https://huggingface.co/tencent/Hunyuan-7B-Instruct) | ✅ | ✅ | ✅ | ✅ | ✅ |
- | [Hunyuan-MoE](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | ✅ | ✅ | ✅ | ✅ | ✅ |
- | [Qwen3-Dense](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
- | [Qwen3-MoE](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
- | [Qwen2.5](https://huggingface.co/collections/AngelSlim/qwen2-25-quant-68652d6cbdf5c0d4b1c4499a) | ✅ | ✅ | ✅ | ✅ | ✅ |
- | [DeepSeek-R1-Distill-Qwen](https://huggingface.co/collections/AngelSlim/deepseek-r1-distill-quant-68652f16a9c206b030b05f7f) | ✅ | ✅ | ✅ | ✅ | ✅ |
- | [QwQ](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
-
- ### Speculative Decoding
- Eagle3 weights for the Qwen3 series models are now available.
-
- | Qwen3 Models | Hunyuan Models |
- | ------------ | -------------- |
- | ✅ [Qwen3-1.7B](https://huggingface.co/AngelSlim/Qwen3-1.7B_eagle3) | ✅ [Hunyuan-1.8B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-1.8B-Instruct_eagle3) |
- | ✅ [Qwen3-4B](https://huggingface.co/AngelSlim/Qwen3-4B_eagle3) | ✅ [Hunyuan-4B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-4B-Instruct_eagle3) |
- | ✅ [Qwen3-8B](https://huggingface.co/AngelSlim/Qwen3-8B_eagle3) | ✅ [Hunyuan-7B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-7B-Instruct_eagle3) |
- | ✅ [Qwen3-14B](https://huggingface.co/AngelSlim/Qwen3-14B_eagle3) | |
- | ✅ [Qwen3-32B](https://huggingface.co/AngelSlim/Qwen3-32B_eagle3) | |
- | ✅ [Qwen3-30B-A3B](https://huggingface.co/AngelSlim/Qwen3-a3B_eagle3) | |

  ## 🛎️How to Use

- ### Install AngelSlim

  We recommend using `pip` to install the latest stable version of `AngelSlim`:

@@ -98,19 +223,35 @@ cd AngelSlim && python setup.py install

  For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).

- ### Quick Start

- After installing `AngelSlim`, you can quickly start by running the following script to perform static `FP8` quantization on the `Qwen3-1.7B` model:

- * One-click Start

- ```shell
- python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
- ```

- This example loads the Hugging Face model, performs activation calibration with the `dataset` specified in the config file, and saves the quantized model weights.

- * Code-based Start

  To perform dynamic `FP8` quantization on `Qwen3-1.7B`:
@@ -128,69 +269,539 @@ After installing `AngelSlim`, you can quickly start by running the following scr
  slim_engine.save("./output")
  ```

  For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).

- ### 🖥️ Deployment and Testing

- #### 1. API Service Deployment

- After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service with the following LLM inference frameworks:

- **vLLM**

- Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm) server; version `vllm>=0.8.5.post1` is recommended, and `vllm>=0.9.0` is required for MoE INT8 quantized models.

  ```shell
- bash deploy/run_vllm.sh $MODEL_PATH
  ```

- **SGLang**

- Use the following script to launch an [SGLang](https://github.com/sgl-project/sglang) server; version `sglang>=0.4.6.post1` is recommended.

- ```shell
- bash deploy/run_sglang.sh $MODEL_PATH
- ```

- #### 2. Service Invocation

  Invoke requests via [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):

  ```shell
- bash deploy/openai.sh $MODEL_PATH
  ```

- #### 3. Performance Evaluation

- Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); version `lm-eval>=0.4.8` is recommended:

  ```shell
- bash deploy/lm_eval.sh $MODEL_PATH
  ```

  For more details, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).

  ## 📈 Benchmark

- ### (1) Quantization


  The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark Documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html).

- #### Hunyuan Series Models

- Benchmark results for the `Hunyuan-A13B-Instruct` model with `FP8` and `INT4-GPTQ` quantization algorithms on datasets including `AIME 2024`, `GSM8K`, `BBH`, and `DROP`:

- | Bench | Hunyuan-A13B-Instruct | Hunyuan-A13B-Instruct-FP8 | Hunyuan-A13B-Instruct-Int4-GPTQ |
- |:---------:|:---------------------:|:-------------------------:|:-------------------------------:|
- | AIME 2024 | 87.3 | 86.7 | 86.7 |
- | GSM8K | 94.39 | 94.01 | 94.24 |
- | BBH | 89.1 | 88.34 | 87.91 |
- | DROP | 91.1 | 91.1 | 91.05 |

- #### Qwen3 Series Models

  Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:
 
@@ -228,17 +839,133 @@ Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT
  <tr><td>FP8-Static</td><td>89.67</td><td>86.19</td><td>86.96</td><td>27.44</td></tr>
  <tr><td>FP8-Dynamic</td><td>89.67</td><td>86.18</td><td>85.22</td><td>28.05</td></tr>
  <tr><td>INT8-Dynamic</td><td>88.93</td><td>86.20</td><td>86.20</td><td>23.78</td></tr>
- <tr><td rowspan="5">QwQ-32B</td><td>BF16</td><td>85.74</td><td>82.03</td><td>73.31</td><td>42.68</td></tr>
- <tr><td>FP8-Static</td><td>85.44</td><td>81.91</td><td>75.36</td><td>42.68</td></tr>
- <tr><td>FP8-Dynamic</td><td>85.07</td><td>81.93</td><td>75.66</td><td>42.07</td></tr>
- <tr><td>INT4-GPTQ</td><td>84.03</td><td>81.26</td><td>68.23</td><td>45.73</td></tr>
- <tr><td>INT4-AWQ</td><td>83.58</td><td>81.01</td><td>68.69</td><td>43.29</td></tr>
  </tbody>
  </table>

- #### Other Models

- Benchmark results for other models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, and `GSM8K`:

  <table>
  <thead>
@@ -276,69 +1003,7 @@ Benchmark results for other models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`
  </tbody>
  </table>
 
- ### (2) Speculative Decoding
-
- #### Qwen3 Series Models
- Benchmark results for Qwen3 series models with the `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca`:
-
- <table>
- <thead>
- <tr>
- <th>&nbsp;</th><th>&nbsp;</th>
- <th colspan="2" style="text-align: center; vertical-align: middle;">MT-bench</th>
- <th colspan="2" style="text-align: center; vertical-align: middle;">HumanEval</th>
- <th colspan="2" style="text-align: center; vertical-align: middle;">GSM8K</th>
- <th colspan="2" style="text-align: center; vertical-align: middle;">Alpaca</th>
- <th colspan="2" style="text-align: center; vertical-align: middle;">Mean</th></tr>
- <tr><th>Temperature</th><th>Model</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th></tr>
- </thead>
- <tbody>
- <tr><td rowspan="6"><strong>T=0</strong></td>
- <td>Qwen3-1.7B</td><td>2.05x</td><td>2.81</td><td>2.07x</td><td>2.93</td><td>2.11x</td><td>2.98</td><td>1.93x</td><td>2.69</td><td>2.04x</td><td>2.85</td></tr>
- <tr><td>Qwen3-4B</td><td>2.21x</td><td>3.01</td><td>2.36x</td><td>3.24</td><td>2.42x</td><td>3.13</td><td>2.32x</td><td>2.75</td><td>2.33x</td><td>3.03</td></tr>
- <tr><td>Qwen3-8B</td><td>2.65x</td><td>3.87</td><td>2.64x</td><td>3.82</td><td>2.86x</td><td>4.10</td><td>2.58x</td><td>3.55</td><td>2.68x</td><td>3.83</td></tr>
- <tr><td>Qwen3-14B</td><td>2.42x</td><td>3.38</td><td>2.57x</td><td>3.58</td><td>2.75x</td><td>3.77</td><td>2.27x</td><td>3.11</td><td>2.50x</td><td>3.46</td></tr>
- <tr><td>Qwen3-32B</td><td>2.39x</td><td>2.78</td><td>2.37x</td><td>2.81</td><td>2.47x</td><td>2.92</td><td>2.42x</td><td>2.53</td><td>2.41x</td><td>2.76</td></tr>
- <tr><td>Qwen3-30B-A3B</td><td>2.84x</td><td>3.63</td><td>2.27x</td><td>3.09</td><td>2.64x</td><td>3.42</td><td>2.83x</td><td>3.56</td><td>2.64x</td><td>3.42</td></tr>
- <tr><td rowspan="6"><strong>T=1</strong></td>
- <td>Qwen3-1.7B</td><td>1.74x</td><td>2.53</td><td>1.86x</td><td>2.70</td><td>1.82x</td><td>2.69</td><td>1.72x</td><td>2.46</td><td>1.93x</td><td>2.60</td></tr>
- <tr><td>Qwen3-4B</td><td>1.93x</td><td>2.60</td><td>2.00x</td><td>2.84</td><td>2.11x</td><td>2.82</td><td>2.34x</td><td>2.50</td><td>1.75x</td><td>2.69</td></tr>
- <tr><td>Qwen3-8B</td><td>1.91x</td><td>2.84</td><td>2.07x</td><td>3.05</td><td>2.34x</td><td>3.26</td><td>2.09x</td><td>2.92</td><td>2.10x</td><td>3.02</td></tr>
- <tr><td>Qwen3-14B</td><td>1.81x</td><td>2.58</td><td>1.96x</td><td>2.81</td><td>2.16x</td><td>3.09</td><td>1.76x</td><td>2.49</td><td>1.92x</td><td>2.74</td></tr>
- <tr><td>Qwen3-32B</td><td>1.62x</td><td>1.91</td><td>1.71x</td><td>2.05</td><td>1.78x</td><td>2.10</td><td>1.80x</td><td>1.95</td><td>1.62x</td><td>2.00</td></tr>
- <tr><td>Qwen3-30B-A3B</td><td>1.91x</td><td>2.46</td><td>2.00x</td><td>2.64</td><td>1.90x</td><td>2.53</td><td>1.80x</td><td>2.32</td><td>1.90x</td><td>2.48</td></tr>
- </tbody>
- </table>
-
- #### Hunyuan Series Models
- Benchmark results for Hunyuan series models with the `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca`:
-
- <table>
- <thead>
- <tr>
- <th>&nbsp;</th><th>&nbsp;</th>
- <th colspan="2" style="text-align: center; vertical-align: middle;">MT-bench</th>
- <th colspan="2" style="text-align: center; vertical-align: middle;">HumanEval</th>
- <th colspan="2" style="text-align: center; vertical-align: middle;">GSM8K</th>
- <th colspan="2" style="text-align: center; vertical-align: middle;">Alpaca</th>
- <th colspan="2" style="text-align: center; vertical-align: middle;">Mean</th></tr>
- <tr><th>Temperature</th><th>Model</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th></tr>
- </thead>
- <tbody>
- <tr><td rowspan="3"><strong>T=0</strong></td>
- <td>Hunyuan-1.8B-Instruct</td><td>1.97x</td><td>2.90</td><td>2.58x</td><td>3.73</td><td>2.61x</td><td>3.71</td><td>1.71x</td><td>2.43</td><td>2.22x</td><td>3.19</td></tr>
- <tr><td>Hunyuan-4B-Instruct</td><td>1.77x</td><td>2.60</td><td>2.64x</td><td>3.35</td><td>2.14x</td><td>3.17</td><td>1.72x</td><td>2.57</td><td>2.07x</td><td>2.92</td></tr>
- <tr><td>Hunyuan-7B-Instruct</td><td>2.22x</td><td>3.58</td><td>3.59x</td><td>5.47</td><td>2.96x</td><td>4.68</td><td>1.64x</td><td>2.56</td><td>2.60x</td><td>4.07</td></tr>
- <tr><td rowspan="3"><strong>T=1</strong></td>
- <td>Hunyuan-1.8B-Instruct</td><td>1.58x</td><td>2.36</td><td>2.35x</td><td>3.56</td><td>2.23x</td><td>3.38</td><td>1.26x</td><td>1.87</td><td>1.86x</td><td>2.79</td></tr>
- <tr><td>Hunyuan-4B-Instruct</td><td>1.36x</td><td>2.05</td><td>1.97x</td><td>2.86</td><td>1.72x</td><td>2.68</td><td>1.14x</td><td>1.76</td><td>1.55x</td><td>2.34</td></tr>
- <tr><td>Hunyuan-7B-Instruct</td><td>1.90x</td><td>3.11</td><td>3.12x</td><td>5.09</td><td>2.74x</td><td>4.34</td><td>1.47x</td><td>2.39</td><td>2.31x</td><td>3.73</td></tr>
- </tbody>
- </table>
 
  ## 📝 License

@@ -358,4 +1023,4 @@ The code for this project is open-sourced under the [License for AngelSlim](LICE

  ## 💬 Technical Discussion

- * AngelSlim is continuously iterating, and new features will be released soon. If you have any questions or suggestions, please open an issue on GitHub or join our [WeChat technical discussion group](https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/angel_slim_wechat.png?raw=true).

  </p>


+ ## 📣Latest News
+ - [26/01/13] We have released v0.3, which supports the training and deployment of Eagle3 for LLM/VLM/Audio models at all scales, as detailed in the [guidance documentation](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/index.html). We also released **Sherry**, a hardware-efficient 1.25-bit quantization algorithm: [Paper coming soon] | [[Code]](https://github.com/Tencent/AngelSlim/tree/sherry/Sherry) 🔥🔥🔥
+ - [25/11/05] We have released v0.2, which adds quantization support for new models such as `GLM-4.6`, `Qwen3-VL`, and `Qwen3-Omni`, open-sources the Eagle3 speculative decoding training framework, and updates the Diffusion model quantization tools.
+ - [25/09/30] We have released **SpecExit**, a reasoning early-exit algorithm: [[Paper]](http://arxiv.org/abs/2509.24248) | [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/spec_exit.html) | [[vLLM Code]](https://github.com/vllm-project/vllm/pull/27192)
+ - [25/09/26] We have released **TEQUILA**, a ternary quantization algorithm: [[Paper]](https://arxiv.org/abs/2509.23809) | [[Code]](https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant)
+ - [25/09/24] We now support NVFP4 PTQ quantization for the Qwen3 series models. We also open-source [Qwen3-32B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-32B_nvfp4) and [Qwen3-235B-A22B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-235B-A22B_nvfp4) weights.

+ <details>
+ <summary>Previous News</summary>

+ - [25/09/01] We now support FP8 quantization of the [Hunyuan-MT-7B](https://huggingface.co/tencent/Hunyuan-MT-7B-fp8) translation model, Torch inference and benchmark evaluation for Eagle3, quantization and Cache for [FLUX](https://github.com/Tencent/AngelSlim/tree/main/configs/flux), and quantization for [Seed-OSS](https://github.com/Tencent/AngelSlim/tree/main/configs/seed_oss).
+ - [25/08/06] We now support quantization for `Hunyuan 0.5B/1.8B/4B/7B` and the multimodal model `Qwen2.5VL 3B/7B/32B/72B`, including `FP8/INT4` algorithms, as well as quantization for `DeepSeek-R1/V3` and `Kimi-K2`, including `FP8-Static` and `W4A8-FP8` algorithms. We also open-source `Hunyuan 1.8B/4B/7B` series Eagle3 model weights.
+ - [25/07/04] We now support quantization for `Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen` and other models, including `INT8/FP8/INT4` algorithms. We also open-source `Qwen3` series Eagle3 model weights.

+ </details>

  ## 🌟Key Features

  - **Continuous Innovation**: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
  - **Performance-Driven**: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.

+ ## 💼Technical Overview
+
+ <table>
+ <thead>
+ <tr>
+ <th rowspan="2" style="text-align: center; vertical-align: middle;">Scenario</th>
+ <th rowspan="2" style="text-align: center; vertical-align: middle;">Model</th>
+ <th colspan="3" style="text-align: center; vertical-align: middle;">Compression Strategy</th>
+ </tr>
+ <tr>
+ <th style="text-align: center; vertical-align: middle;">Quantization</th>
+ <th style="text-align: center; vertical-align: middle;">Speculative Decoding</th>
+ <th style="text-align: center; vertical-align: middle;">Other Techniques</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><strong>Large Language Models (LLMs)</strong></td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li><a href="https://huggingface.co/collections/tencent/hunyuan-dense-model">Hunyuan-Dense</a></li>
+ <li><a href="https://huggingface.co/collections/tencent/hunyuan-a13b">Hunyuan-MoE</a></li>
+ <li><a href="https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8">Qwen3</a></li>
+ <li><a href="https://huggingface.co/AngelSlim/DeepSeek-R1-0528_w4a8_fp8">DeepSeek-V3/R1</a></li>
+ <li><a href="https://huggingface.co/AngelSlim/Glm4_6-fp8_static">GLM-4.6</a></li>
+ <li><a href="https://huggingface.co/collections/AngelSlim/qwen2-25-quant-68652d6cbdf5c0d4b1c4499a">Qwen2.5</a></li>
+ </ul>
+ </td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li><a href="https://github.com/Tencent/AngelSlim/tree/main/configs/qwen3">FP8-Static/Dynamic</a></li>
+ <li><a href="https://github.com/Tencent/AngelSlim/tree/main/configs/qwen3">INT8-Dynamic</a></li>
+ <li><a href="https://github.com/Tencent/AngelSlim/tree/main/configs/qwen3">INT4-GPTQ/AWQ/GPTAQ</a></li>
+ <li><a href="https://github.com/Tencent/AngelSlim/tree/d55b06aeffc53e31f485044c5026e754f4e27b74/configs/qwen3/nvfp4">NVFP4</a></li>
+ <li><a href="https://angelslim.readthedocs.io/zh-cn/latest/features/quantization/fp8_lepto.html">LeptoQuant</a></li>
+ <li><a href="https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant">Tequila</a></li>
+ </ul>
+ </td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li><a href="https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/index.html">Eagle3</a></li>
+ <li><a href="https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/spec_exit.html">SpecExit</a></li>
+ </ul>
+ </td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li>
+ <strong>Sparse Attention</strong>
+ <ul style="padding-left: 1.5rem">
+ <li>Under Development</li>
+ </ul>
+ </li>
+ </ul>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Vision Language Models (VLMs)</strong></td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li><a href="">Hunyuan-VL</a></li>
+ <li><a href="https://huggingface.co/tencent/HunyuanOCR">HunyuanOCR</a></li>
+ <li><a href="https://huggingface.co/collections/Qwen/qwen3-vl">Qwen3-VL</a></li>
+ <li><a href="https://huggingface.co/collections/Qwen/qwen25-vl">Qwen2.5-VL</a></li>
+ </ul>
+ </td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li><a href="https://github.com/Tencent/AngelSlim/tree/main/configs/qwen3_vl">FP8-Static/Dynamic</a></li>
+ <li><a href="https://github.com/Tencent/AngelSlim/tree/main/configs/qwen2_5_vl">INT8-Dynamic</a></li>
+ <li><a href="https://github.com/Tencent/AngelSlim/tree/main/configs/qwen2_5_vl">INT4-GPTQ/AWQ/GPTAQ</a></li>
+ </ul>
+ </td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li><a href="https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/index.html">Eagle3</a></li>
+ </ul>
+ </td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li>
+ <strong>Token Pruning</strong>
+ <ul style="padding-left: 1.5rem">
+ <li>Under Development</li>
+ </ul>
+ </li>
+ </ul>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Diffusion Models</strong></td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li><a href="https://huggingface.co/collections/tencent/hunyuanimage">Hunyuan-Image</a></li>
+ <li><a href="https://huggingface.co/tencent/HunyuanVideo">Hunyuan-Video</a></li>
+ <li><a href="https://huggingface.co/collections/tencent/hunyuan3d">Hunyuan-3D</a></li>
+ <li><a href="https://huggingface.co/collections/Qwen/qwen-image">Qwen-Image</a></li>
+ <li><a href="https://huggingface.co/collections/black-forest-labs/flux1">FLUX</a></li>
+ <li><a href="https://huggingface.co/collections/Wan-AI/wan21">Wan</a></li>
+ <li><a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0">SDXL</a></li>
+ </ul>
+ </td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li><a href="https://angelslim.readthedocs.io/zh-cn/latest/features/diffusion/quantization.html">FP8-Dynamic</a></li>
+ <li><a href="https://angelslim.readthedocs.io/zh-cn/latest/features/diffusion/quantization.html">FP8-Weight-Only</a></li>
+ </ul>
+ </td>
+ <td>-</td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li>
+ <strong>Cache</strong>
+ <ul style="padding-left: 1.5rem">
+ <li><a href="https://angelslim.readthedocs.io/zh-cn/latest/features/diffusion/cache.html">DeepCache</a></li>
+ <li><a href="https://angelslim.readthedocs.io/zh-cn/latest/features/diffusion/cache.html">TeaCache</a></li>
+ <li><a href="https://angelslim.readthedocs.io/zh-cn/latest/features/diffusion/cache.html">TaylorCache</a></li>
+ </ul>
+ </li>
+ <li>
+ <strong>Sparse Attention</strong>
+ <ul style="padding-left: 1.5rem">
+ <li>Under Development</li>
+ </ul>
+ </li>
+ </ul>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Speech Models (TTS/ASR)</strong></td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li><a href="https://huggingface.co/collections/Qwen/qwen3-omni">Qwen3-Omni</a></li>
+ <li><a href="https://huggingface.co/collections/Qwen/qwen2-audio">Qwen2-Audio</a></li>
+ <li><a href="https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512">Fun-CosyVoice3</a></li>
+ </ul>
+ </td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li><a href="https://github.com/Tencent/AngelSlim/blob/main/docs/source/models/qwen3_omni/qwen3_omni_quant.md">FP8-Static/Dynamic</a></li>
+ <li><a href="https://github.com/Tencent/AngelSlim/tree/main/configs/qwen2_audio">INT8-Dynamic</a></li>
+ </ul>
+ </td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li><a href="https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/index.html">Eagle3</a></li>
+ </ul>
+ </td>
+ <td>
+ <ul style="padding-left: 0; list-style-position: inside;">
+ <li>
+ <strong>Token Pruning</strong>
+ <ul style="padding-left: 1.5rem">
+ <li>Under Development</li>
+ </ul>
+ </li>
+ </ul>
+ </td>
+ </tr>
+ </tbody>
+ </table>

  ## 🛎️How to Use

+ ### 1. Install AngelSlim

  We recommend using `pip` to install the latest stable version of `AngelSlim`:


  For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).

+ ### 2. Quick Start

+ #### 2.1 Speculative Decoding

+ After installing `AngelSlim`, you can quickly start Eagle3 training with the following scripts:

+ ```shell
+ # Start the vLLM server
+ bash scripts/speculative/run_vllm_server.sh
+ # Generate training data
+ bash scripts/speculative/generate_data_for_target_model.sh
+ # Perform online training for the Eagle3 model
+ bash scripts/speculative/train_eagle3_online.sh
+ ```
+
+ Training and deployment guides for multimodal Eagle3, covering LLM, VLM, and audio (ASR & TTS) models: [LLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html) | [VLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html) | [Audio (ASR)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_asr_eagle.html) | [Audio (TTS)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html).
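Conceptually, Eagle3 accelerates decoding by letting a small draft head propose several tokens that the target model verifies in a single pass; the accept length τ reported in the benchmarks is the average number of tokens committed per pass. A minimal, model-free sketch of that greedy verification step (the toy `target_next` function is an illustrative stand-in, not an AngelSlim API):

```python
from typing import Callable, List

def verify_draft(prefix: List[int],
                 drafted: List[int],
                 target_next: Callable[[List[int]], int]) -> List[int]:
    """Greedy speculative verification: keep each drafted token only while
    the target model would have produced the same token itself."""
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_next(ctx)  # target's greedy next token
        if tok != expected:
            # First mismatch: discard the rest, emit the target's own token.
            accepted.append(expected)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy "target model": the next token is (last token + 1) mod 10.
target = lambda ctx: (ctx[-1] + 1) % 10

# Draft proposed [4, 5, 9]; target agrees on 4 and 5, rejects 9 in favour of 6.
print(verify_draft([3], [4, 5, 9], target))  # [4, 5, 6]
```

Each verify pass thus commits at least one token, and every extra accepted draft token is decoding work the target model skipped.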

+ #### 2.2 LLM/VLM Model Quantization

+ After installing `AngelSlim`, you can launch static `FP8` quantization for the `Qwen3-1.7B` model with a single command:
+
+ ```shell
+ python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
+ ```
+
+ This example produces quantized model weights by performing PTQ calibration on a model loaded from Hugging Face.
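Static FP8 quantization of this kind reduces, at its core, to choosing one scale per tensor from calibration statistics and clamping values into the FP8-E4M3 range (maximum normal magnitude 448). A rough stdlib-only sketch of that idea, assuming per-tensor abs-max calibration; the function names are illustrative, not AngelSlim APIs:

```python
FP8_E4M3_MAX = 448.0  # largest normal magnitude representable in FP8-E4M3

def calibrate_scale(calib_values, fp8_max=FP8_E4M3_MAX):
    """Per-tensor static scale: map the observed absolute maximum onto fp8_max."""
    amax = max(abs(v) for v in calib_values)
    return amax / fp8_max if amax > 0 else 1.0

def fake_quantize(values, scale, fp8_max=FP8_E4M3_MAX):
    """Divide by the scale, clamp to the representable range, round, rescale."""
    out = []
    for v in values:
        q = max(-fp8_max, min(fp8_max, v / scale))
        out.append(round(q) * scale)  # integer rounding stands in for FP8 rounding
    return out

calib = [0.1, -2.0, 3.5, 896.0]  # pretend activation samples from calibration
scale = calibrate_scale(calib)    # 896 / 448 = 2.0
print(scale)                      # 2.0
print(fake_quantize([100.0, 1000.0], scale))  # [100.0, 896.0]
```

Real FP8 kernels quantize to an 8-bit floating-point format rather than rounding to integers, but the scale selection is the part that static PTQ calibration fixes ahead of time, which is what distinguishes it from dynamic quantization.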
+
+ <details>
+ <summary>Code-based Start</summary>

  To perform dynamic `FP8` quantization on `Qwen3-1.7B`:


  slim_engine.save("./output")
  ```

+ </details>
+
  For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).

+ #### 2.3 Diffusion Model Quantization

+ Use `scripts/diffusion/run_diffusion.py` for quantization and inference:

+ ```shell
+ # Online quantization and inference
+ python scripts/diffusion/run_diffusion.py \
+ --model-name-or-path black-forest-labs/FLUX.1-schnell \
+ --quant-type fp8-per-tensor \
+ --prompt "A cat holding a sign that says hello world" \
+ --height 1024 --width 1024 --steps 4 --guidance 0.0 --seed 0
+ ```
+
+ For more quantization inference methods, please refer to the [Diffusion Model Quantization Documentation](https://angelslim.readthedocs.io/zh-cn/latest/features/diffusion/quantization.html).

+ ### 3. Deployment and Testing

+ #### 3.1 Offline Inference

+ To test offline inference with a quantized model loaded via `transformers`, run the following command:

  ```shell
+ python scripts/deploy/offline.py $MODEL_PATH "Hello, my name is"
  ```

+ Where `MODEL_PATH` is the path to the quantized model output.
+
+ #### 3.2 API Service Deployment

+ After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service with the following LLM inference frameworks:

+ - **vLLM**

+ Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm) server; version `vllm>=0.8.5.post1` is recommended, and `vllm>=0.9.0` is required for MoE INT8 quantized models.
+
+ ```shell
+ bash scripts/deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
+ ```
+ Where `-d` sets the visible devices, `-t` the tensor parallel size, `-p` the pipeline parallel size, and `-g` the GPU memory utilization.
+
+ - **SGLang**
+
+ Use the following script to launch an [SGLang](https://github.com/sgl-project/sglang) server; version `sglang>=0.4.6.post1` is recommended.
+
+ ```shell
+ bash scripts/deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8
+ ```
 
323
+ #### 3.3 Service Invocation
324
 
325
  Invoke requests via [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):
326
 
327
  ```shell
328
+ bash scripts/deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."
329
  ```
330
+ Where `-p` is the input prompt.
331
+
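The same request can also be issued programmatically. A minimal sketch that assembles the JSON body equivalent to the `openai.sh` flags above, assuming the service from 3.2 is listening on port 8080; note that `top_k` and `repetition_penalty` are engine-specific extensions to the OpenAI schema, and `path/to/quantized-model` is a placeholder:

```python
import json

# Endpoint of the service deployed in 3.2 (port assumed from the examples above).
API_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(model: str, prompt: str,
                       system_prompt: str = "You are a helpful assistant.") -> dict:
    """Assemble an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": 4096,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,                 # engine-specific extension
        "repetition_penalty": 1.05,  # engine-specific extension
    }

body = json.dumps(build_chat_request("path/to/quantized-model", "Hello, my name is"))
print(body)
# POST `body` to API_URL with any HTTP client, e.g.:
#   curl $API_URL -H "Content-Type: application/json" -d "$body"
```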
332
+ #### 3.4 Performance Evaluation
333
 
334
+ Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); version `lm-eval>=0.4.8` is recommended.
335
 
336
+ <details>
337
+ <summary>Run script details</summary>
338
 
339
  ```shell
340
+ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH
341
  ```
342
+ Where `RESULT_PATH` is the directory for saving test results, `-b` is the batch size, `--tasks` specifies the evaluation tasks, and `-n` is the number of few-shot examples.
343
 
344
  For more details, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).
345
 
346
+ </details>
347
 
348
  ## πŸ“ˆ Benchmark
349
 
350
+ ### 1. Speculative Decoding
351
+
352
+ We evaluated Eagle3 models trained with AngelSlim on code generation, mathematical reasoning, instruction following, text generation, and multimodal understanding tasks using vLLM. The speedup and accept length of our models with num_speculative_tokens set to 2 or 4 are shown below: accept lengths range from 1.8 to 3.5, yielding maximum speedups of 1.4–1.9× across models.
353
+
354
+ <p align="center">
355
+ <picture>
356
+ <source media="(prefers-color-scheme: dark)" srcset="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png?raw=true">
357
+ <img alt="AngelSlim" src="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png?raw=true" width=70%>
358
+ </picture>
359
+ </p>
360
+
361
+ #### 1.1 Qwen3 Series Models
362
+
363
+ Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**).
364
+
365
+ <table>
366
+ <thead>
367
+ <tr>
368
+ <th>Model</th>
369
+ <th>Method</th>
370
+ <th colspan="2" style="text-align:center;">GSM8K</th>
371
+ <th colspan="2" style="text-align:center;">Alpaca</th>
372
+ <th colspan="2" style="text-align:center;">HumanEval</th>
373
+ <th colspan="2" style="text-align:center;">MT-bench</th>
374
+ <th colspan="2" style="text-align:center;">Mean</th>
375
+ </tr>
376
+ <tr>
377
+ <th></th><th></th>
378
+ <th>throughput (tokens/s)</th><th>accept length</th>
379
+ <th>throughput (tokens/s)</th><th>accept length</th>
380
+ <th>throughput (tokens/s)</th><th>accept length</th>
381
+ <th>throughput (tokens/s)</th><th>accept length</th>
382
+ <th>throughput (tokens/s)</th><th>accept length</th>
383
+ </tr>
384
+ </thead>
385
+
386
+ <tbody>
387
+ <!-- Qwen3-1.7B -->
388
+ <tr>
389
+ <td rowspan="2">Qwen3-1.7B</td>
390
+ <td>Vanilla</td>
391
+ <td>376.42</td><td>1</td>
392
+ <td>378.86</td><td>1</td>
393
+ <td>378.38</td><td>1</td>
394
+ <td>390.53</td><td>1</td>
395
+ <td>381.05</td><td>1</td>
396
+ </tr>
397
+ <tr>
398
+ <td><a href="https://huggingface.co/AngelSlim/Qwen3-1.7B_eagle3">Eagle3</a></td>
399
+ <td>616.9</td><td>2.13</td>
400
+ <td>653.29</td><td>2.19</td>
401
+ <td>680.1</td><td>2.2</td>
402
+ <td>621.44</td><td>2.17</td>
403
+ <td>642.93</td><td>2.17</td>
404
+ </tr>
405
+ <!-- Qwen3-4B -->
406
+ <tr>
407
+ <td rowspan="2">Qwen3-4B</td>
408
+ <td>Vanilla</td>
409
+ <td>229.05</td><td>1</td>
410
+ <td>235.29</td><td>1</td>
411
+ <td>234.66</td><td>1</td>
412
+ <td>234.04</td><td>1</td>
413
+ <td>233.26</td><td>1</td>
414
+ </tr>
415
+ <tr>
416
+ <td><a href="https://huggingface.co/AngelSlim/Qwen3-4B_eagle3">Eagle3</a></td>
417
+ <td>389.35</td><td>2.07</td>
418
+ <td>395.97</td><td>2.1</td>
419
+ <td>377.84</td><td>2.08</td>
420
+ <td>384.6</td><td>2.07</td>
421
+ <td>386.94</td><td>2.08</td>
422
+ </tr>
423
+ <!-- Qwen3-8B -->
424
+ <tr>
425
+ <td rowspan="2">Qwen3-8B</td>
426
+ <td>Vanilla</td>
427
+ <td>149.63</td><td>1</td>
428
+ <td>149.93</td><td>1</td>
429
+ <td>153.85</td><td>1</td>
430
+ <td>153.81</td><td>1</td>
431
+ <td>151.81</td><td>1</td>
432
+ </tr>
433
+ <tr>
434
+ <td><a href="https://huggingface.co/AngelSlim/Qwen3-8B_eagle3">Eagle3</a></td>
435
+ <td>257.32</td><td>2</td>
436
+ <td>266.69</td><td>2.02</td>
437
+ <td>244.89</td><td>1.97</td>
438
+ <td>258.2</td><td>1.97</td>
439
+ <td>257.52</td><td>1.99</td>
440
+ </tr>
441
+ <!-- Qwen3-14B -->
442
+ <tr>
443
+ <td rowspan="2">Qwen3-14B</td>
444
+ <td>Vanilla</td>
445
+ <td>92.97</td><td>1</td>
446
+ <td>92.66</td><td>1</td>
447
+ <td>92.94</td><td>1</td>
448
+ <td>94.46</td><td>1</td>
449
+ <td>93.26</td><td>1</td>
450
+ </tr>
451
+ <tr>
452
+ <td><a href="https://huggingface.co/AngelSlim/Qwen3-14B_eagle3">Eagle3</a></td>
453
+ <td>153.72</td><td>1.87</td>
454
+ <td>140.46</td><td>1.78</td>
455
+ <td>144.68</td><td>1.76</td>
456
+ <td>142.45</td><td>1.74</td>
457
+ <td>145.33</td><td>1.79</td>
458
+ </tr>
459
+ <!-- Qwen3-32B -->
460
+ <tr>
461
+ <td rowspan="2">Qwen3-32B</td>
462
+ <td>Vanilla</td>
463
+ <td>43.49</td><td>1</td>
464
+ <td>43.38</td><td>1</td>
465
+ <td>43.19</td><td>1</td>
466
+ <td>43.3</td><td>1</td>
467
+ <td>43.32</td><td>1</td>
468
+ </tr>
469
+ <tr>
470
+ <td><a href="https://huggingface.co/AngelSlim/Qwen3-32B_eagle3">Eagle3</a></td>
471
+ <td>80.43</td><td>2.01</td>
472
+ <td>72.49</td><td>1.9</td>
473
+ <td>71.57</td><td>1.86</td>
474
+ <td>74.1</td><td>1.86</td>
475
+ <td>74.1</td><td>1.91</td>
476
+ </tr>
477
+ <!-- Qwen3-30B-A3B -->
478
+ <tr>
479
+ <td rowspan="2">Qwen3-30B-A3B</td>
480
+ <td>Vanilla</td>
481
+ <td>311.84</td><td>1</td>
482
+ <td>320.43</td><td>1</td>
483
+ <td>325.77</td><td>1</td>
484
+ <td>325.42</td><td>1</td>
485
+ <td>320.87</td><td>1</td>
486
+ </tr>
487
+ <tr>
488
+ <td><a href="https://huggingface.co/AngelSlim/Qwen3-a3B_eagle3">Eagle3</a></td>
489
+ <td>453.97</td><td>2.1</td>
490
+ <td>432.45</td><td>2.04</td>
491
+ <td>428.81</td><td>2.02</td>
492
+ <td>437.06</td><td>2.01</td>
493
+ <td>438.07</td><td>2.04</td>
494
+ </tr>
495
+
496
+ </tbody>
497
+ </table>
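As a quick cross-check of the table above, the end-to-end speedup implied by the Mean columns is simply the ratio of Eagle3 throughput to Vanilla throughput (values copied from the table):

```python
# Mean throughput (tokens/s) copied from the table above: (Vanilla, Eagle3).
MEAN_THROUGHPUT = {
    "Qwen3-1.7B":    (381.05, 642.93),
    "Qwen3-4B":      (233.26, 386.94),
    "Qwen3-8B":      (151.81, 257.52),
    "Qwen3-14B":     (93.26, 145.33),
    "Qwen3-32B":     (43.32, 74.10),
    "Qwen3-30B-A3B": (320.87, 438.07),
}

def speedup(model: str) -> float:
    """End-to-end speedup = Eagle3 throughput / Vanilla throughput."""
    vanilla, eagle3 = MEAN_THROUGHPUT[model]
    return round(eagle3 / vanilla, 2)

for name in MEAN_THROUGHPUT:
    print(f"{name}: {speedup(name)}x")
```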
498
+
499
+ #### 1.2 VLM Models
500
+
501
+ ##### 1.2.1 Qwen3-VL Series Models
502
+
503
+ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
504
+
505
+ <table><thead>
506
+ <tr>
507
+ <th>Model</th>
508
+ <th>Method</th>
509
+ <th colspan="2" style="text-align:center;">GSM8K</th>
510
+ <th colspan="2" style="text-align:center;">Alpaca</th>
511
+ <th colspan="2" style="text-align:center;">HumanEval</th>
512
+ <th colspan="2" style="text-align:center;">MT-bench</th>
513
+ <th colspan="2" style="text-align:center;">MATH-500</th>
514
+ <th colspan="2" style="text-align:center;">MMMU</th>
515
+ <th colspan="2" style="text-align:center;">MMStar</th>
516
+ <th colspan="2" style="text-align:center;">Mean</th>
517
+ </tr>
+ <tr>
+ <th></th>
+ <th></th>
+ <th>throughput (tokens/s)</th>
+ <th>accept length</th>
+ <th>throughput (tokens/s)</th>
+ <th>accept length</th>
+ <th>throughput (tokens/s)</th>
+ <th>accept length</th>
+ <th>throughput (tokens/s)</th>
+ <th>accept length</th>
+ <th>throughput (tokens/s)</th>
+ <th>accept length</th>
+ <th>throughput (tokens/s)</th>
+ <th>accept length</th>
+ <th>throughput (tokens/s)</th>
+ <th>accept length</th>
+ <th>throughput (tokens/s)</th>
+ <th>accept length</th>
+ </tr>
+ </thead>
538
+ <tbody>
539
+ <tr>
540
+ <td rowspan="2">Qwen3-VL-2B-Instruct</td>
541
+ <td>Vanilla</td>
542
+ <td>348.55</td>
543
+ <td>1</td>
544
+ <td>350.9</td>
545
+ <td>1</td>
546
+ <td>346.07</td>
547
+ <td>1</td>
548
+ <td>346.31</td>
549
+ <td>1</td>
550
+ <td>82.96</td>
551
+ <td>1</td>
552
+ <td>83.27</td>
553
+ <td>1</td>
554
+ <td>81.63</td>
555
+ <td>1</td>
556
+ <td>234.24</td>
557
+ <td>1</td>
558
+ </tr>
559
+ <tr>
560
+ <td><a href="https://huggingface.co/AngelSlim/Qwen3-VL-2B-Instruct_eagle3">Eagle3</a></td>
561
+ <td>511.52</td>
562
+ <td>2.11</td>
563
+ <td>560.55</td>
564
+ <td>2.26</td>
565
+ <td>826.01</td>
566
+ <td>3.39</td>
567
+ <td>555.22</td>
568
+ <td>2.29</td>
569
+ <td>163.09</td>
570
+ <td>2.57</td>
571
+ <td>154.18</td>
572
+ <td>2.55</td>
573
+ <td>139.73</td>
574
+ <td>2.31</td>
575
+ <td>415.76</td>
576
+ <td>2.5</td>
577
+ </tr>
578
+ <tr>
579
+ <td rowspan="2">Qwen3-VL-4B-Instruct</td>
580
+ <td>Vanilla</td>
581
+ <td>212.87</td>
582
+ <td>1</td>
583
+ <td>213.24</td>
584
+ <td>1</td>
585
+ <td>211.69</td>
586
+ <td>1</td>
587
+ <td>212.1</td>
588
+ <td>1</td>
589
+ <td>67.96</td>
590
+ <td>1</td>
591
+ <td>65.88</td>
592
+ <td>1</td>
593
+ <td>67.75</td>
594
+ <td>1</td>
595
+ <td>150.21</td>
596
+ <td>1</td>
597
+ </tr>
598
+ <tr>
599
+ <td><a href="https://huggingface.co/AngelSlim/Qwen3-VL-4B-Instruct_eagle3">Eagle3</a></td>
600
+ <td>415.29</td>
601
+ <td>2.57</td>
602
+ <td>372.89</td>
603
+ <td>2.26</td>
604
+ <td>459.37</td>
605
+ <td>2.82</td>
606
+ <td>382.33</td>
607
+ <td>2.34</td>
608
+ <td>141.87</td>
609
+ <td>2.72</td>
610
+ <td>104.44</td>
611
+ <td>2.05</td>
612
+ <td>107.07</td>
613
+ <td>2.1</td>
614
+ <td>283.32</td>
615
+ <td>2.41</td>
616
+ </tr>
617
+ <tr>
618
+ <td rowspan="2">Qwen3-VL-30B-A3B-Instruct</td>
619
+ <td>Vanilla</td>
620
+ <td>179.94</td>
621
+ <td>1</td>
622
+ <td>184.6</td>
623
+ <td>1</td>
624
+ <td>168.68</td>
625
+ <td>1</td>
626
+ <td>180.57</td>
627
+ <td>1</td>
628
+ <td>31.08</td>
629
+ <td>1</td>
630
+ <td>31.51</td>
631
+ <td>1</td>
632
+ <td>30.93</td>
633
+ <td>1</td>
634
+ <td>115.33</td>
635
+ <td>1</td>
636
+ </tr>
637
+ <tr>
638
+ <td><a href="https://huggingface.co/AngelSlim/Qwen3-VL-30B-A3B-Instruct_eagle3">Eagle3</a></td>
639
+ <td>281.93</td>
640
+ <td>2.82</td>
641
+ <td>241.42</td>
642
+ <td>2.13</td>
643
+ <td>223.05</td>
644
+ <td>2.57</td>
645
+ <td>240.47</td>
646
+ <td>2.19</td>
647
+ <td>75.31</td>
648
+ <td>2.79</td>
649
+ <td>48.47</td>
650
+ <td>1.78</td>
651
+ <td>52.57</td>
652
+ <td>1.94</td>
653
+ <td>166.17</td>
654
+ <td>2.32</td>
655
+ </tr>
656
+ </tbody></table>
657
+
658
+ ##### 1.2.2 HunyuanOCR Model
659
+
660
+ Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) on the **[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
661
+
662
+ <table><thead>
663
+ <tr>
664
+ <th>Model</th>
665
+ <th>Method</th>
666
+ <th colspan="2" style="text-align:center;">OmniDocBench</th>
667
+ </tr>
+ <tr>
+ <th></th>
+ <th></th>
+ <th>throughput (tokens/s)</th>
+ <th>accept length</th>
+ </tr>
+ </thead>
674
+ <tbody>
675
+ <tr>
676
+ <td rowspan="2">Hunyuan-OCR</td>
677
+ <td>Vanilla</td>
678
+ <td>70.12</td>
679
+ <td>1</td>
680
+ </tr>
681
+ <tr>
682
+ <td><a href="https://huggingface.co/AngelSlim/HunyuanOCR_eagle3">Eagle3</a></td>
683
+ <td>108.1</td>
684
+ <td>2.08</td>
685
+ </tr>
686
+ </tbody>
687
+ </table>
688
+
689
+ #### 1.3 Audio Models
690
+
691
+ ##### 1.3.1 Qwen2-Audio Model
692
+
693
+ Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) on the **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
694
+
695
+ <table><thead>
696
+ <tr>
697
+ <th>Model</th>
698
+ <th>Method</th>
699
+ <th colspan="2" style="text-align:center;">LibriSpeech</th>
700
+ </tr>
+ <tr>
+ <th></th>
+ <th></th>
+ <th>throughput (tokens/s)</th>
+ <th>accept length</th>
+ </tr>
+ </thead>
707
+ <tbody>
708
+ <tr>
709
+ <td rowspan="2">Qwen2-Audio</td>
710
+ <td>Vanilla</td>
711
+ <td>78.76</td>
712
+ <td>1</td>
713
+ </tr>
714
+ <tr>
715
+ <td><a href="https://huggingface.co/AngelSlim/Qwen2-Audio-7B-Instruct_eagle3">Eagle3</a></td>
716
+ <td>146.66</td>
717
+ <td>3.51</td>
718
+ </tr>
719
+ </tbody>
720
+ </table>
721
+
722
+ ##### 1.3.2 Fun-CosyVoice3 Model
723
+
724
+ Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding on the **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
725
+
726
+ <table><thead>
727
+ <tr>
728
+ <th>Model</th>
729
+ <th>Method</th>
730
+ <th colspan="2" style="text-align:center;">LibriTTS</th>
731
+ </tr>
+ <tr>
+ <th></th>
+ <th></th>
+ <th>throughput (tokens/s)</th>
+ <th>accept length</th>
+ </tr>
+ </thead>
738
+ <tbody>
739
+ <tr>
740
+ <td rowspan="2">Fun-CosyVoice3</td>
741
+ <td>Vanilla</td>
742
+ <td>-</td>
743
+ <td>1</td>
744
+ </tr>
745
+ <tr>
746
+ <td><a href="https://huggingface.co/AngelSlim/Fun-CosyVoice3-0.5B-2512_eagle3">Eagle3</a></td>
747
+ <td>-</td>
748
+ <td>1.96</td>
749
+ </tr>
750
+ </tbody>
751
+ </table>
752
+
753
+ > Fun-CosyVoice3 is adapted for Transformers-backend inference, so only the accept length is reported. The vLLM speedup is estimated at ~1.6×, extrapolated from the speedup of the baseline LLM.
754
+
755
+ ### 2. Quantization
756
 
757
  The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html)
758
 
759
+ #### 2.1 Hunyuan Series Models
760
 
761
+ Benchmark results for the `Hunyuan-Instruct` models with `FP8`, `INT4-AWQ`, and `INT4-GPTQ` quantization algorithms on datasets including `OlympiadBench`, `AIME 2024`, `DROP`, and `GPQA-Diamond`:
762
 
763
+ <table>
764
+ <thead>
765
+ <tr><th>Model</th><th>Quantization</th><th>OlympiadBench</th><th>AIME 2024</th><th>DROP</th><th>GPQA-Diamond</th></tr>
766
+ </thead>
767
+ <tbody>
768
+ <tr><td rowspan="4">Hunyuan-A13B-Instruct</td>
769
+ <td>BF16</td><td>82.7</td><td>87.30</td><td>91.1</td><td>71.2</td></tr>
770
+ <tr><td>FP8-Static</td><td>83.0</td><td>86.7</td><td>91.1</td><td>-</td></tr>
771
+ <tr><td>Int4-GPTQ</td><td>82.7</td><td>86.7</td><td>91.1</td><td>-</td></tr>
772
+ <tr><td>Int4-AWQ</td><td>82.6</td><td>85.6</td><td>91.0</td><td>-</td></tr>
773
+ </tbody>
774
+ <tbody>
775
+ <tr><td rowspan="4">Hunyuan-7B-Instruct</td>
776
+ <td>BF16</td> <td>76.5</td><td>81.1</td><td>85.9</td><td>60.1</td></tr>
777
+ <tr><td>FP8-Static</td><td>76.6</td><td>80.9</td><td>86.0</td><td>60.1</td></tr>
778
+ <tr><td>Int4-GPTQ</td><td>76.2</td><td>81.0</td><td>85.7</td><td>60.0</td></tr>
779
+ <tr><td>Int4-AWQ</td><td>76.4</td><td>80.9</td><td>85.9</td><td>60.1</td></tr>
780
+ </tbody>
781
+ <tbody>
782
+ <tr><td rowspan="4">Hunyuan-4B-Instruct</td>
783
+ <td>BF16</td> <td>73.1</td><td>78.3</td><td>78.2</td><td>61.1</td></tr>
784
+ <tr><td>FP8-Static</td><td>73.1</td><td>76.6</td><td>78.3</td><td>60.2</td></tr>
785
+ <tr><td>Int4-GPTQ</td><td>72.9</td><td>-</td><td>78.1</td><td>58.1</td></tr>
786
+ <tr><td>Int4-AWQ</td><td>72.8</td><td>-</td><td>78.2</td><td>-</td></tr>
787
+ </tbody>
788
+ <tbody>
789
+ <tr><td rowspan="4">Hunyuan-1.8B-Instruct</td>
790
+ <td>BF16</td> <td>63.4</td><td>56.7</td><td>76.7</td><td>47.2</td></tr>
791
+ <tr><td>FP8-Static</td><td>62.5</td><td>55.2</td><td>75.1</td><td>47.7</td></tr>
792
+ <tr><td>Int4-GPTQ</td><td>60.9</td><td>-</td><td>73.0</td><td>44.4</td></tr>
793
+ <tr><td>Int4-AWQ</td><td>61.7</td><td>-</td><td>71.7</td><td>43.6</td></tr>
794
+ </tbody>
795
+ <tbody>
796
+ <tr><td rowspan="4">Hunyuan-0.5B-Instruct</td>
797
+ <td>BF16</td> <td>29.6</td><td>17.2</td><td>52.8</td><td>23.3</td></tr>
798
+ <tr><td>FP8-Static</td><td>29.6</td><td>17.2</td><td>51.6</td><td>22.5</td></tr>
799
+ <tr><td>Int4-GPTQ</td><td>26.8</td><td>-</td><td>50.9</td><td>23.3</td></tr>
800
+ <tr><td>Int4-AWQ</td><td>26.3</td><td>-</td><td>48.9</td><td>23.3</td></tr>
801
+ </tbody>
802
+ </table>
803
 
804
+ #### 2.2 Qwen3 Series Models
805
 
806
  Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:
807
 
 
839
  <tr><td>FP8-Static</td><td>89.67</td><td>86.19</td><td>86.96</td><td>27.44</td></tr>
840
  <tr><td>FP8-Dynamic</td><td>89.67</td><td>86.18</td><td>85.22</td><td>28.05</td></tr>
841
  <tr><td>INT8-Dynamic</td><td>88.93</td><td>86.20</td><td>86.20</td><td>23.78</td></tr>
 
 
 
 
 
842
  </tbody>
843
  </table>
844
 
845
+ #### 2.3 DeepSeek Series Models
846
+
847
+ Benchmark results for the DeepSeek-R1-0528 model with `FP8-Block-Wise` and `W4A8-FP8` quantization algorithms on datasets including `GPQA Diamond`, `AIME 2024`, `SimpleQA`, and `LiveCodeBench`:
848
+
849
+ <table>
850
+ <thead>
851
+ <tr><th>Model</th><th>Quantization</th><th>GPQA Diamond</th><th>AIME 2024</th><th>SimpleQA</th><th>LiveCodeBench</th></tr>
852
+ </thead>
853
+ <tbody>
854
+ <tr><td rowspan="2">DeepSeek-R1-0528</td><td>FP8-Block-Wise</td><td>78.28</td><td>88.67</td><td>27.8</td><td>77.1</td></tr>
855
+ <tr><td>W4A8-FP8</td><td>77.37</td><td>88.67</td><td>26.83</td><td>78.86</td></tr>
856
+ </tbody>
857
+ </table>
858
+
859
+ <details>
860
+ <summary>Note</summary>
861
+
862
+ > - The above results are based on the average of 5 test runs deployed with TRT-LLM.
863
+ > - The hyperparameters used during evaluation are as follows:
864
+ > ```json
865
+ >{
866
+ > "top_k": 20,
867
+ > "top_p": 0.6,
868
+ > "temperature": 0.7,
869
+ > "output_seq_len": 32768,
870
+ > "max_input_seq_len": 16384
871
+ >}
872
+ >```
873
+
874
+ </details>
875
+
876
+ #### 2.4 Qwen-VL Series Models
877
+
878
+ **Qwen3-VL Benchmark**
879
+
880
+ Benchmark results for Qwen3-VL series models with `BF16`, `FP8-Static`, and `FP8-Dynamic` quantization algorithms on datasets including `MMMU_VAL`, `DocVQA_VAL`, and `ChartQA_TEST`:
881
+
882
+ <table>
883
+ <thead>
884
+ <tr><th>Model</th><th>Quantization</th><th>MMMU_VAL</th><th>DocVQA_VAL</th><th>ChartQA_TEST</th></tr>
885
+ </thead>
886
+ <tbody>
887
+ <tr><td rowspan="3">Qwen3-VL-32B-Instruct</td><td>BF16</td><td>60.11</td><td>96.08</td><td>94.64</td></tr>
888
+ <tr><td>FP8-Static</td><td>61.22</td><td>96.00</td><td>94.64</td></tr>
889
+ <tr><td>FP8-Dynamic</td><td>60.78</td><td>96.19</td><td>94.72</td></tr>
890
+ <tr><td rowspan="2">Qwen3-VL-30B-A3B-Instruct</td><td>BF16</td><td>50.44</td><td>95.28</td><td>95.36</td></tr>
891
+ <tr><td>FP8-Dynamic</td><td>50.67</td><td>95.25</td><td>95.20</td></tr>
892
+ </tbody>
893
+ </table>
894
+
895
+ <details>
896
+ <summary><strong>Qwen2.5VL Benchmark</strong></summary>
897
 
898
+ Benchmark results for Qwen2.5VL series models with `BF16`, `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `MMMU_VAL`, `DocVQA_VAL`, and `ChartQA_TEST`:
899
+
900
+ <table>
901
+ <thead>
902
+ <tr><th>Model</th><th>Quantization</th><th>MMMU_VAL</th><th>DocVQA_VAL</th><th>ChartQA_TEST</th></tr>
903
+ </thead>
904
+ <tbody>
905
+ <tr><td rowspan="5">Qwen2.5VL-3B</td><td>BF16</td><td>47.11</td><td>78.57</td><td>80.32</td></tr>
906
+ <tr><td>FP8-Static</td><td>47.33</td><td>79.34</td><td>79.68</td></tr>
907
+ <tr><td>FP8-Dynamic</td><td>45.99</td><td>46.93</td><td>38.29</td></tr>
908
+ <tr><td>INT4-GPTQ</td><td>46.56</td><td>77.20</td><td>78.96</td></tr>
909
+ <tr><td>INT4-AWQ</td><td>45.78</td><td>-</td><td>79.60</td></tr>
910
+ <tr><td rowspan="5">Qwen2.5VL-7B</td><td>BF16</td><td>45.44</td><td>89.71</td><td>84.64</td></tr>
911
+ <tr><td>FP8-Static</td><td>47.00</td><td>89.83</td><td>85.92</td></tr>
912
+ <tr><td>FP8-Dynamic</td><td>47.22</td><td>89.80</td><td>88.64</td></tr>
913
+ <tr><td>INT4-GPTQ</td><td>46.67</td><td>90.45</td><td>-</td></tr>
914
+ <tr><td>INT4-AWQ</td><td>45.67</td><td>89.28</td><td>-</td></tr>
915
+ <tr><td rowspan="5">Qwen2.5VL-32B</td><td>BF16</td><td>57.00</td><td>90.03</td><td>-</td></tr>
916
+ <tr><td>FP8-Static</td><td>57.00</td><td>89.88</td><td>-</td></tr>
917
+ <tr><td>FP8-Dynamic</td><td>56.44</td><td>89.88</td><td>-</td></tr>
918
+ <tr><td>INT4-GPTQ</td><td>55.22</td><td>89.80 </td><td>-</td></tr>
919
+ <tr><td>INT4-AWQ</td><td>55.22</td><td>90.30</td><td>-</td></tr>
920
+ <tr><td rowspan="5">Qwen2.5VL-72B</td><td>BF16</td><td>58.78</td><td>94.39</td><td>85.60</td></tr>
921
+ <tr><td>FP8-Static</td><td>57.89</td><td>94.41</td><td>85.84</td></tr>
922
+ <tr><td>FP8-Dynamic</td><td>58.67</td><td>94.38</td><td>85.60</td></tr>
923
+ <tr><td>INT4-GPTQ</td><td>57.56</td><td>94.46</td><td>86.48</td></tr>
924
+ <tr><td>INT4-AWQ</td><td>58.78</td><td>94.19</td><td>87.28</td></tr>
925
+ </tbody>
926
+ </table>
927
+
928
+ </details>
929
+
930
+ #### 2.5 Qwen-Omni Series Models
931
+
932
+ **Qwen3-Omni Text to Text Benchmark**
933
+
934
+ Benchmark results for Qwen3-Omni series models with `BF16`, `FP8-Static`, and `FP8-Dynamic` quantization on the `aime25`, `gpqa_diamond`, and `mmlu_redux` datasets are as follows:
935
+
936
+ <table>
937
+ <thead>
938
+ <tr><th>Model</th><th>Quantization</th><th>aime25</th><th>gpqa_diamond</th><th>mmlu_redux</th></tr>
939
+ </thead>
940
+ <tbody>
941
+ <tr><td rowspan="3">Qwen3-Omni-30B-A3B-Instruct</td><td>BF16</td><td>73.32</td><td>56.77</td><td>88.09</td></tr>
942
+ <tr><td>FP8-Static</td><td>71.33</td><td>56.57</td><td>87.91</td></tr>
943
+ <tr><td>FP8-Dynamic</td><td>73.33</td><td>55.15</td><td>88.07</td></tr>
944
+ </tbody>
945
+ </table>
946
+
947
+ <details>
948
+ <summary>Note</summary>
949
+
950
+ > - The above evaluation results were obtained by deploying with the vLLM framework and averaging over 5 runs (vLLM only supports the thinker component).
951
+ > - The hyperparameters used during evaluation are as follows:
952
+ > ```json
953
+ >{
954
+ > "top_p": 0.95,
955
+ > "temperature": 0.6,
956
+ > "do_sample": true,
957
+ > "max_model_len": 65536
958
+ >}
959
+ >```
960
+
961
+ </details>
962
+
963
+ #### 2.6 Other Models
964
+
965
+ Other models such as GLM-4.6, Qwen2.5, and Seed-OSS have been evaluated on benchmarks like `CEVAL`, `MMLU`, and `GSM8K` using quantization strategies including `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ`.
966
+
967
+ <details>
968
+ <summary>Benchmark Experiment Details</summary>
969
 
970
  <table>
971
  <thead>
 
1003
  </tbody>
1004
  </table>
1005
 
1006
+ </details>
1007
 
1008
  ## πŸ“ License
1009
 
 
1023
 
1024
  ## πŸ’¬ Technical Discussion
1025
 
1026
+ * AngelSlim is continuously iterating and new features will be released soon. If you have any questions or suggestions, please open an issue on [GitHub Issues](https://github.com/Tencent/AngelSlim/issues) or join our [WeChat discussion group](https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/angel_slim_wechat.png?raw=true).