## 📣Latest News
- [26/01/13] We have released v0.3, supporting the training and deployment of Eagle3 for LLMs/VLMs/Audio models at all scales, as detailed in the [guidance documentation](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/index.html). We also released **Sherry**, a hardware-efficient 1.25-bit quantization algorithm: [Paper coming soon] | [[Code]](https://github.com/Tencent/AngelSlim/tree/sherry/Sherry)🔥🔥🔥
- [25/11/05] We have released v0.2, adding quantization support for new models such as `GLM-4.6`, `Qwen3-VL`, and `Qwen3-Omni`, open-sourcing the Eagle3 speculative decoding training framework, and updating the Diffusion model quantization tools.
- [25/09/30] We have released **SpecExit**, the reasoning early-exit algorithm: [[Paper]](http://arxiv.org/abs/2509.24248) | [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/spec_exit.html) | [[vLLM Code]](https://github.com/vllm-project/vllm/pull/27192)
- [25/09/26] We have released **TEQUILA**, the ternary quantization algorithm: [[Paper]](https://arxiv.org/abs/2509.23809) | [[Code]](https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant)
- [25/09/24] We now support NVFP4 post-training quantization (PTQ) for the Qwen3 series models. We also open-source the [Qwen3-32B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-32B_nvfp4) and [Qwen3-235B-A22B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-235B-A22B_nvfp4) weights.
**Previous News**
- [25/09/01] We now support FP8 quantization for the [Hunyuan-MT-7B](https://huggingface.co/tencent/Hunyuan-MT-7B-fp8) translation model, Torch inference and benchmark evaluation for Eagle3, quantization and cache support for [FLUX](https://github.com/Tencent/AngelSlim/tree/main/configs/flux), and quantization for [Seed-OSS](https://github.com/Tencent/AngelSlim/tree/main/configs/seed_oss).
- [25/08/06] We now support quantization for `Hunyuan 0.5B/1.8B/4B/7B` and the multimodal `Qwen2.5VL 3B/7B/32B/72B` models with `FP8/INT4` algorithms, as well as quantization for `DeepSeek-R1/V3` and `Kimi-K2` with `FP8-Static` and `W4A8-FP8` algorithms. We also open-source Eagle3 model weights for the `Hunyuan 1.8B/4B/7B` series.
- [25/07/04] We now support quantization for `Hunyuan`, `Qwen2.5`, `Qwen3`, `DeepSeek-R1-Distill-Qwen`, and other models with `INT8/FP8/INT4` algorithms. We also open-source Eagle3 model weights for the `Qwen3` series.
## 🌟Key Features
- **Highly Integrated**: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
- **Continuous Innovation**: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
- **Performance-Driven**: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.
## 💼Technical Overview
## 🛎️How to Use
### 1. Install AngelSlim
We recommend using `pip` to install the latest stable version of `AngelSlim`:
```shell
pip install angelslim
```
Alternatively, you can clone the repository and install from source:
```shell
git clone https://github.com/Tencent/AngelSlim.git
cd AngelSlim && python setup.py install
```
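A quick import check confirms the installation:
```shell
python -c "import angelslim"
```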
For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).
### 2. Quick Start
#### 2.1 Speculative Decoding
After installing AngelSlim, you can quickly start Eagle3 training with the following scripts:
```shell
# Start the vLLM server
bash scripts/speculative/run_vllm_server.sh
# Generate training data
bash scripts/speculative/generate_data_for_target_model.sh
# Perform online training for the Eagle3 model
bash scripts/speculative/train_eagle3_online.sh
```
Training and deployment guides for Eagle3 on multimodal models, covering LLM, VLM, and Audio (ASR & TTS): [LLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html) | [VLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html) | [Audio(ASR)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_asr_eagle.html) | [Audio(TTS)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html).
#### 2.2 LLM/VLM Model Quantization
After installing `AngelSlim`, you can launch static FP8 quantization for the Qwen3-1.7B model with a single command:
```shell
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
```
This example produces quantized model weights by performing PTQ calibration on a model loaded from HuggingFace.
**Code-based Start**
To perform dynamic `FP8` quantization on `Qwen3-1.7B`:
```python
from angelslim.engine import Engine
slim_engine = Engine()
# Prepare model
slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
# Initialize compressor
slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
# Compress model
slim_engine.run()
# Save compressed model
slim_engine.save("./output")
```
For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).
#### 2.3 Diffusion Model Quantization
Use the `scripts/diffusion/run_diffusion.py` script for quantization and inference:
```shell
# Online quantization and inference
python scripts/diffusion/run_diffusion.py \
--model-name-or-path black-forest-labs/FLUX.1-schnell \
--quant-type fp8-per-tensor \
--prompt "A cat holding a sign that says hello world" \
--height 1024 --width 1024 --steps 4 --guidance 0.0 --seed 0
```
For more quantization inference methods, please refer to [the Diffusion Model Quantization Documentation](https://angelslim.readthedocs.io/zh-cn/latest/features/diffusion/quantization.html).
### 3. Deployment and Testing
#### 3.1 Offline Inference
To test offline inference with a quantized model loaded via `transformers`, run the following command:
```shell
python scripts/deploy/offline.py $MODEL_PATH "Hello, my name is"
```
Where `MODEL_PATH` is the path to the quantized model output.
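For reference, a minimal `transformers`-only sketch of the same test (assuming the quantized checkpoint loads directly through `AutoModelForCausalLM`; the bundled `offline.py` may differ in details):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./output"  # path to the quantized model output

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Generate a short continuation for the test prompt
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```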
#### 3.2 API Service Deployment
After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service using the following LLM inference frameworks:
- **vLLM**
Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm) server; recommended version `vllm>=0.8.5.post1`. For MoE INT8 quantized models, `vllm>=0.9.0` is required.
```shell
bash scripts/deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
```
Where `-d` is the visible devices, `-t` is tensor parallel size, `-p` is pipeline parallel size, and `-g` is the GPU memory utilization.
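For reference, the wrapper roughly corresponds to a plain `vllm serve` invocation like the following (a sketch; the actual script may set additional options):
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve $MODEL_PATH \
    --port 8080 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 4096
```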
- **SGLang**
Use the following script to launch an [SGLang](https://github.com/sgl-project/sglang) server; recommended version `sglang>=0.4.6.post1`.
```shell
bash scripts/deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8
```
#### 3.3 Service Invocation
Send requests in [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):
```shell
bash scripts/deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."
```
where `-p` is the input prompt.
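Equivalently, you can call the service from Python with the official `openai` client (a sketch assuming the server runs locally on port 8080; vLLM and SGLang accept a placeholder API key by default):
```python
from openai import OpenAI

MODEL_PATH = "./output"  # the served model name defaults to the deployed model path

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model=MODEL_PATH,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, my name is"},
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.8,
)
print(response.choices[0].message.content)
```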
#### 3.4 Performance Evaluation
Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); recommended version `lm-eval>=0.4.8`.
**Run script details**
```shell
bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH
```
where `RESULT_PATH` is the directory for saving test results, `-b` is batch size, `--tasks` specifies the evaluation tasks, and `-n` is the number of few-shot examples.
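For reference, the wrapper roughly maps to a direct `lm-eval` invocation like the following (a sketch using the vLLM backend; flag spellings follow `lm-eval>=0.4`):
```shell
CUDA_VISIBLE_DEVICES=0,1 lm_eval \
    --model vllm \
    --model_args pretrained=$MODEL_PATH,tensor_parallel_size=2,gpu_memory_utilization=0.8 \
    --tasks ceval-valid,mmlu,gsm8k,humaneval \
    --batch_size auto \
    --num_fewshot 0 \
    --output_path $RESULT_PATH
```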
For more details, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).
## 📈 Benchmark
### 1. Speculative Decoding
We evaluated Eagle3 models trained with AngelSlim on vLLM across code generation, mathematical reasoning, instruction following, text generation, and multimodal understanding tasks. Under `num_speculative_tokens = 2` or `4`, the trained models achieve accept lengths of 1.8–3.5 and speedups of up to 1.4–1.9×, as detailed below.
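As a rough sanity check on how accept length relates to speedup: each verify cycle costs one target forward pass plus `num_speculative_tokens` cheap draft passes and yields `accept_len` tokens on average, versus one token per target pass for plain autoregressive decoding. A back-of-the-envelope sketch (the draft-to-target cost ratio `c` below is an illustrative assumption, not a measured value):
```python
def estimated_speedup(accept_len: float, num_speculative_tokens: int, c: float = 0.15) -> float:
    """Estimated tokens per unit compute relative to plain autoregressive decoding.

    Assumes each cycle costs one target pass plus num_speculative_tokens draft
    passes, each costing a fraction c of a target pass (c is an assumption).
    """
    return accept_len / (1.0 + num_speculative_tokens * c)

# e.g. an accept length of 2.5 with num_speculative_tokens=2 gives roughly 1.9x
print(round(estimated_speedup(2.5, num_speculative_tokens=2), 2))  # 1.92
```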
#### 1.1 Qwen3 Series Models
Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**).
#### 1.2 VLM Models
##### 1.2.1 Qwen3-VL Series Models
Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
##### 1.2.2 HunyuanOCR Model
Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across **[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
#### 1.3 Audio Models
##### 1.3.1 Qwen2-Audio Model
Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) across **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
##### 1.3.2 Fun-CosyVoice3 Model
Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding across **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
> Fun-CosyVoice3 is adapted for Transformers backend inference, so only the accept length is reported. The vLLM speedup of ~1.6× is estimated from the baseline LLM speedup.
### 2. Quantization
The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html).
#### 2.1 Hunyuan Series Models
Benchmark results for the `Hunyuan-Instruct` models with `FP8`, `INT4-AWQ`, and `INT4-GPTQ` quantization algorithms on datasets including `OlympiadBench`, `AIME 2024`, `DROP`, and `GPQA-Diamond`:
| Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
|-------|--------------|---------------|-----------|------|--------------|
| Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
| | FP8-Static | 83.0 | 86.7 | 91.1 | - |
| | Int4-GPTQ | 82.7 | 86.7 | 91.1 | - |
| | Int4-AWQ | 82.6 | 85.6 | 91.0 | - |
| Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
| | FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 |
| | Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 |
| | Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 |
| Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
| | FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 |
| | Int4-GPTQ | 72.9 | - | 78.1 | 58.1 |
| | Int4-AWQ | 72.8 | - | 78.2 | - |
| Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
| | FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 |
| | Int4-GPTQ | 60.9 | - | 73.0 | 44.4 |
| | Int4-AWQ | 61.7 | - | 71.7 | 43.6 |
| Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
| | FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 |
| | Int4-GPTQ | 26.8 | - | 50.9 | 23.3 |
| | Int4-AWQ | 26.3 | - | 48.9 | 23.3 |
#### 2.2 Qwen3 Series Models
Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:
| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
|-------|--------------|-------|------|-------|-----------|
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
#### 2.3 DeepSeek Series Models
Benchmark results for the DeepSeek-R1-0528 model with `FP8-Block-Wise` and `W4A8-FP8` quantization algorithms on datasets including `GPQA Diamond`, `AIME 2024`, `SimpleQA`, and `LiveCodeBench`:
| Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
|-------|--------------|--------------|-----------|----------|---------------|
| DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
| | W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |
**Note**
> - The above results are based on the average of 5 test runs deployed with TRT-LLM.
> - The hyperparameters used during evaluation are as follows:
> ```json
> {
>   "top_k": 20,
>   "top_p": 0.6,
>   "temperature": 0.7,
>   "output_seq_len": 32768,
>   "max_input_seq_len": 16384
> }
> ```
#### 2.4 Qwen-VL Series Models
**Qwen3-VL Benchmark**
Benchmark results for Qwen3-VL series models with `BF16`, `FP8-Static`, and `FP8-Dynamic` quantization algorithms on datasets including `MMMU_VAL`, `DocVQA_VAL`, and `ChartQA_TEST`:
| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
|-------|--------------|----------|------------|--------------|
| Qwen3-VL-32B-Instruct | BF16 | 60.11 | 96.08 | 94.64 |
| | FP8-Static | 61.22 | 96.00 | 94.64 |
| | FP8-Dynamic | 60.78 | 96.19 | 94.72 |
| Qwen3-VL-30B-A3B-Instruct | BF16 | 50.44 | 95.28 | 95.36 |
| | FP8-Dynamic | 50.67 | 95.25 | 95.20 |
**Qwen2.5VL Benchmark**
Benchmark results for Qwen2.5VL series models with `BF16`, `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `MMMU_VAL`, `DocVQA_VAL`, and `ChartQA_TEST`:
| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
|-------|--------------|----------|------------|--------------|
| Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
| | FP8-Static | 47.33 | 79.34 | 79.68 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 |
| | INT4-GPTQ | 46.56 | 77.20 | 78.96 |
| | INT4-AWQ | 45.78 | - | 79.60 |
| Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
| | FP8-Static | 47.00 | 89.83 | 85.92 |
| | FP8-Dynamic | 47.22 | 89.80 | 88.64 |
| | INT4-GPTQ | 46.67 | 90.45 | - |
| | INT4-AWQ | 45.67 | 89.28 | - |
| Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
| | FP8-Static | 57.00 | 89.88 | - |
| | FP8-Dynamic | 56.44 | 89.88 | - |
| | INT4-GPTQ | 55.22 | 89.80 | - |
| | INT4-AWQ | 55.22 | 90.30 | - |
| Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
| | FP8-Static | 57.89 | 94.41 | 85.84 |
| | FP8-Dynamic | 58.67 | 94.38 | 85.60 |
| | INT4-GPTQ | 57.56 | 94.46 | 86.48 |
| | INT4-AWQ | 58.78 | 94.19 | 87.28 |
#### 2.5 Qwen-Omni Series Models
**Qwen3-Omni Text to Text Benchmark**
Benchmark results for Qwen3-Omni series models with `BF16`, `FP8-Static`, and `FP8-Dynamic` quantization on `aime25`, `gpqa_diamond`, and `mmlu_redux` are as follows:
| Model | Quantization | aime25 | gpqa_diamond | mmlu_redux |
|-------|--------------|--------|--------------|------------|
| Qwen3-Omni-30B-A3B-Instruct | BF16 | 73.32 | 56.77 | 88.09 |
| | FP8-Static | 71.33 | 56.57 | 87.91 |
| | FP8-Dynamic | 73.33 | 55.15 | 88.07 |
**Note**
> - The above evaluation results were obtained by deploying with the vLLM framework and averaging over 5 runs (vLLM only supports the thinker component).
> - The hyperparameters used during evaluation are as follows:
> ```json
> {
>   "top_p": 0.95,
>   "temperature": 0.6,
>   "do_sample": true,
>   "max_model_len": 65536
> }
> ```
#### 2.6 Other Models
Other models such as GLM-4.6, Qwen2.5, and Seed-OSS have been evaluated on benchmarks like `CEVAL`, `MMLU`, and `GSM8K` using quantization strategies including `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ`.
**Benchmark Experiment Details**
| Model | Quantization | CEVAL | MMLU | GSM8K |
|-------|--------------|-------|------|-------|
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| | FP8-Static | 66.27 | 60.23 | - |
| | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| | FP8-Static | 81.13 | 74.03 | 79.30 |
| | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| | INT4-AWQ | 79.35 | 73.22 | 79.38 |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| | FP8-Static | 87.59 | 83.08 | 81.58 |
| | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| | INT4-AWQ | 87.00 | 82.64 | - |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| | FP8-Static | 53.57 | 54.17 | 76.19 |
| | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| | INT4-AWQ | 53.49 | 53.70 | - |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| | FP8-Static | 77.56 | 74.66 | 86.73 |
| | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| | INT4-AWQ | 74.81 | 73.00 | 86.05 |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| | FP8-Static | 83.43 | 80.90 | 87.57 |
| | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| | INT4-AWQ | 82.84 | 80.15 | 87.19 |
## 📝 License
The code for this project is open-sourced under the [License for AngelSlim](LICENSE).
## 🔗 Citation
```bibtex
@software{AngelSlim2025,
  title={{AngelSlim}},
  author={Tencent AngelSlim Project Contributors},
  year={2025},
  month={6},
  url={https://github.com/Tencent/AngelSlim},
}
```
## 💬 Technical Discussion
* AngelSlim is continuously iterating and new features will be released soon. If you have any questions or suggestions, please open an issue on [GitHub Issues](https://github.com/Tencent/AngelSlim/issues) or join our [WeChat discussion group](https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/angel_slim_wechat.png?raw=true).