## 📣Latest News
- [26/01/13] We have released v0.3, supporting the training and deployment of Eagle3 for LLMs/VLMs/Audio models at all scales, as detailed in the [guidance documentation](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/index.html). We also released **Sherry**, a hardware-efficient 1.25-bit quantization algorithm: [Paper coming soon] | [[Code]](https://github.com/Tencent/AngelSlim/tree/sherry/Sherry)🔥🔥🔥
- [25/11/05] We have released v0.2, adding quantization support for new models such as `GLM-4.6`, `Qwen3-VL`, and `Qwen3-Omni`, open-sourcing the Eagle3 speculative decoding training framework, and updating the Diffusion model quantization tools.
- [25/09/30] We have released **SpecExit**, the reasoning early-exit algorithm: [[Paper]](http://arxiv.org/abs/2509.24248) | [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/spec_exit.html) | [[vLLM Code]](https://github.com/vllm-project/vllm/pull/27192)
- [25/09/26] We have released **TEQUILA**, the ternary quantization algorithm: [[Paper]](https://arxiv.org/abs/2509.23809) | [[Code]](https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant)
- [25/09/24] We now support NVFP4 post-training quantization (PTQ) for the Qwen3 series models. We also open-source the [Qwen3-32B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-32B_nvfp4) and [Qwen3-235B-A22B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-235B-A22B_nvfp4) weights.
**Previous News**
- [25/09/01] We now support FP8 quantization for the [Hunyuan-MT-7B](https://huggingface.co/tencent/Hunyuan-MT-7B-fp8) translation model, Torch inference and benchmark evaluation for Eagle3, quantization and cache support for [FLUX](https://github.com/Tencent/AngelSlim/tree/main/configs/flux), and quantization for [Seed-OSS](https://github.com/Tencent/AngelSlim/tree/main/configs/seed_oss).
- [25/08/06] We now support quantization for `Hunyuan 0.5B/1.8B/4B/7B` and the multimodal `Qwen2.5VL 3B/7B/32B/72B` models with `FP8/INT4` algorithms, as well as quantization for `DeepSeek-R1/V3` and `Kimi-K2` with `FP8-Static` and `W4A8-FP8` algorithms. We also open-source Eagle3 model weights for the `Hunyuan 1.8B/4B/7B` series.
- [25/07/04] We now support quantization for `Hunyuan`, `Qwen2.5`, `Qwen3`, `DeepSeek-R1-Distill-Qwen`, and other models with `INT8/FP8/INT4` algorithms. We also open-source Eagle3 model weights for the `Qwen3` series.
## 🌟Key Features
- **Highly Integrated**: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
- **Continuous Innovation**: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
- **Performance-Driven**: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.
## 💼Technical Overview
## 🛎️How to Use
### 1. Install AngelSlim
We recommend using `pip` to install the latest stable version of `AngelSlim`:
```shell
pip install angelslim
```
Alternatively, you can clone the repository and install from source:
```shell
git clone https://github.com/Tencent/AngelSlim.git
cd AngelSlim && python setup.py install
```
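A quick import check confirms the installation:
```shell
python -c "import angelslim"
```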
For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).
### 2. Quick Start
#### 2.1 Speculative Decoding
After installing AngelSlim, you can quickly start Eagle3 training with the following scripts:
```shell
# Start the vLLM server
bash scripts/speculative/run_vllm_server.sh
# Generate training data
bash scripts/speculative/generate_data_for_target_model.sh
# Perform online training for the Eagle3 model
bash scripts/speculative/train_eagle3_online.sh
```
Training and deployment guides for Eagle3 on multimodal models, covering LLM, VLM, and Audio (ASR & TTS): [LLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html) | [VLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html) | [Audio(ASR)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_asr_eagle.html) | [Audio(TTS)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html).
#### 2.2 LLM/VLM Model Quantization
After installing `AngelSlim`, you can launch static FP8 quantization for the Qwen3-1.7B model with a single command:
```shell
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
```
This example produces quantized model weights by performing PTQ calibration on a model loaded from HuggingFace.
**Code-based Start**
To perform dynamic `FP8` quantization on `Qwen3-1.7B`:
```python
from angelslim.engine import Engine
slim_engine = Engine()
# Prepare model
slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
# Initialize compressor
slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
# Compress model
slim_engine.run()
# Save compressed model
slim_engine.save("./output")
```
For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).
#### 2.3 Diffusion Model Quantization
Use the `scripts/diffusion/run_diffusion.py` script for quantization and inference:
```shell
# Online quantization and inference
python scripts/diffusion/run_diffusion.py \
--model-name-or-path black-forest-labs/FLUX.1-schnell \
--quant-type fp8-per-tensor \
--prompt "A cat holding a sign that says hello world" \
--height 1024 --width 1024 --steps 4 --guidance 0.0 --seed 0
```
For more quantization inference methods, please refer to [the Diffusion Model Quantization Documentation](https://angelslim.readthedocs.io/zh-cn/latest/features/diffusion/quantization.html).
### 3. Deployment and Testing
#### 3.1 Offline Inference
To test offline inference with a quantized model loaded via `transformers`, run the following command:
```shell
python scripts/deploy/offline.py $MODEL_PATH "Hello, my name is"
```
Where `MODEL_PATH` is the path to the quantized model output.
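For reference, a minimal `transformers`-only sketch of the same test (assuming the quantized checkpoint loads directly through `AutoModelForCausalLM`; the bundled `offline.py` may differ in details):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./output"  # path to the quantized model output

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Generate a short continuation for the test prompt
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```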
#### 3.2 API Service Deployment
After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service using the following LLM inference frameworks:
- **vLLM**
Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm) server; recommended version `vllm>=0.8.5.post1`. For MoE INT8 quantized models, `vllm>=0.9.0` is required.
```shell
bash scripts/deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
```
Where `-d` is the visible devices, `-t` is tensor parallel size, `-p` is pipeline parallel size, and `-g` is the GPU memory utilization.
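For reference, the wrapper roughly corresponds to a plain `vllm serve` invocation like the following (a sketch; the actual script may set additional options):
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve $MODEL_PATH \
    --port 8080 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 4096
```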
- **SGLang**
Use the following script to launch an [SGLang](https://github.com/sgl-project/sglang) server; recommended version `sglang>=0.4.6.post1`.
```shell
bash scripts/deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8
```
#### 3.3 Service Invocation
Send requests in [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):
```shell
bash scripts/deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."
```
where `-p` is the input prompt.
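Equivalently, you can call the service from Python with the official `openai` client (a sketch assuming the server runs locally on port 8080; vLLM and SGLang accept a placeholder API key by default):
```python
from openai import OpenAI

MODEL_PATH = "./output"  # the served model name defaults to the deployed model path

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model=MODEL_PATH,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, my name is"},
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.8,
)
print(response.choices[0].message.content)
```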
#### 3.4 Performance Evaluation
Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); recommended version `lm-eval>=0.4.8`.
**Run script details**
```shell
bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH
```
where `RESULT_PATH` is the directory for saving test results, `-b` is batch size, `--tasks` specifies the evaluation tasks, and `-n` is the number of few-shot examples.
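For reference, the wrapper roughly maps to a direct `lm-eval` invocation like the following (a sketch using the vLLM backend; flag spellings follow `lm-eval>=0.4`):
```shell
CUDA_VISIBLE_DEVICES=0,1 lm_eval \
    --model vllm \
    --model_args pretrained=$MODEL_PATH,tensor_parallel_size=2,gpu_memory_utilization=0.8 \
    --tasks ceval-valid,mmlu,gsm8k,humaneval \
    --batch_size auto \
    --num_fewshot 0 \
    --output_path $RESULT_PATH
```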
For more details, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).
## 📈 Benchmark
### 1. Speculative Decoding
We evaluated Eagle3 models trained with AngelSlim on vLLM across code generation, mathematical reasoning, instruction following, text generation, and multimodal understanding tasks. Under `num_speculative_tokens = 2` or `4`, the trained models achieve accept lengths of 1.8–3.5 and speedups of up to 1.4–1.9×, as detailed below.
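As a rough sanity check on how accept length relates to speedup: each verify cycle costs one target forward pass plus `num_speculative_tokens` cheap draft passes and yields `accept_len` tokens on average, versus one token per target pass for plain autoregressive decoding. A back-of-the-envelope sketch (the draft-to-target cost ratio `c` below is an illustrative assumption, not a measured value):
```python
def estimated_speedup(accept_len: float, num_speculative_tokens: int, c: float = 0.15) -> float:
    """Estimated tokens per unit compute relative to plain autoregressive decoding.

    Assumes each cycle costs one target pass plus num_speculative_tokens draft
    passes, each costing a fraction c of a target pass (c is an assumption).
    """
    return accept_len / (1.0 + num_speculative_tokens * c)

# e.g. an accept length of 2.5 with num_speculative_tokens=2 gives roughly 1.9x
print(round(estimated_speedup(2.5, num_speculative_tokens=2), 2))  # 1.92
```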
#### 1.1 Qwen3 Series Models
Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**).
#### 1.2 VLM Models
##### 1.2.1 Qwen3-VL Series Models
Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
##### 1.2.2 HunyuanOCR Model
Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across **[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
#### 1.3 Audio Models
##### 1.3.1 Qwen2-Audio Model
Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) across **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
##### 1.3.2 Fun-CosyVoice3 Model
Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding across **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
> Fun-CosyVoice3 is adapted for Transformers backend inference, so only the accept length is reported. The vLLM speedup of ~1.6× is estimated from the baseline LLM speedup.
### 2. Quantization
The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html).
#### 2.1 Hunyuan Series Models
Benchmark results for the `Hunyuan-Instruct` models with `FP8`, `INT4-AWQ`, and `INT4-GPTQ` quantization algorithms on datasets including `OlympiadBench`, `AIME 2024`, `DROP`, and `GPQA-Diamond`:
| Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
|-------|--------------|---------------|-----------|------|--------------|
| Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
| | FP8-Static | 83.0 | 86.7 | 91.1 | - |
| | Int4-GPTQ | 82.7 | 86.7 | 91.1 | - |
| | Int4-AWQ | 82.6 | 85.6 | 91.0 | - |
| Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
| | FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 |
| | Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 |
| | Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 |
| Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
| | FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 |
| | Int4-GPTQ | 72.9 | - | 78.1 | 58.1 |
| | Int4-AWQ | 72.8 | - | 78.2 | - |
| Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
| | FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 |
| | Int4-GPTQ | 60.9 | - | 73.0 | 44.4 |
| | Int4-AWQ | 61.7 | - | 71.7 | 43.6 |
| Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
| | FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 |
| | Int4-GPTQ | 26.8 | - | 50.9 | 23.3 |
| | Int4-AWQ | 26.3 | - | 48.9 | 23.3 |
#### 2.2 Qwen3 Series Models
Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:
| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
|-------|--------------|-------|------|-------|-----------|
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
#### 2.3 DeepSeek Series Models
Benchmark results for the DeepSeek-R1-0528 model with `FP8-Block-Wise` and `W4A8-FP8` quantization algorithms on datasets including `GPQA Diamond`, `AIME 2024`, `SimpleQA`, and `LiveCodeBench`:
| Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
|-------|--------------|--------------|-----------|----------|---------------|
| DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
| | W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |
**Note**
> - The above results are based on the average of 5 test runs deployed with TRT-LLM.
> - The hyperparameters used during evaluation are as follows:
> ```json
> {
>   "top_k": 20,
>   "top_p": 0.6,
>   "temperature": 0.7,
>   "output_seq_len": 32768,
>   "max_input_seq_len": 16384
> }
> ```
#### 2.4 Qwen-VL Series Models
**Qwen3-VL Benchmark**
Benchmark results for Qwen3-VL series models with `BF16`, `FP8-Static`, and `FP8-Dynamic` quantization algorithms on datasets including `MMMU_VAL`, `DocVQA_VAL`, and `ChartQA_TEST`:
| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
|-------|--------------|----------|------------|--------------|
| Qwen3-VL-32B-Instruct | BF16 | 60.11 | 96.08 | 94.64 |
| | FP8-Static | 61.22 | 96.00 | 94.64 |
| | FP8-Dynamic | 60.78 | 96.19 | 94.72 |
| Qwen3-VL-30B-A3B-Instruct | BF16 | 50.44 | 95.28 | 95.36 |
| | FP8-Dynamic | 50.67 | 95.25 | 95.20 |
**Qwen2.5VL Benchmark**
Benchmark results for Qwen2.5VL series models with `BF16`, `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `MMMU_VAL`, `DocVQA_VAL`, and `ChartQA_TEST`:
| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
|-------|--------------|----------|------------|--------------|
| Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
| | FP8-Static | 47.33 | 79.34 | 79.68 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 |
| | INT4-GPTQ | 46.56 | 77.20 | 78.96 |
| | INT4-AWQ | 45.78 | - | 79.60 |
| Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
| | FP8-Static | 47.00 | 89.83 | 85.92 |
| | FP8-Dynamic | 47.22 | 89.80 | 88.64 |
| | INT4-GPTQ | 46.67 | 90.45 | - |
| | INT4-AWQ | 45.67 | 89.28 | - |
| Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
| | FP8-Static | 57.00 | 89.88 | - |
| | FP8-Dynamic | 56.44 | 89.88 | - |
| | INT4-GPTQ | 55.22 | 89.80 | - |
| | INT4-AWQ | 55.22 | 90.30 | - |
| Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
| | FP8-Static | 57.89 | 94.41 | 85.84 |
| | FP8-Dynamic | 58.67 | 94.38 | 85.60 |
| | INT4-GPTQ | 57.56 | 94.46 | 86.48 |
| | INT4-AWQ | 58.78 | 94.19 | 87.28 |
#### 2.5 Qwen-Omni Series Models
**Qwen3-Omni Text to Text Benchmark**
Benchmark results for Qwen3-Omni series models with `BF16`, `FP8-Static`, and `FP8-Dynamic` quantization on `aime25`, `gpqa_diamond`, and `mmlu_redux` are as follows:
| Model | Quantization | aime25 | gpqa_diamond | mmlu_redux |
|-------|--------------|--------|--------------|------------|
| Qwen3-Omni-30B-A3B-Instruct | BF16 | 73.32 | 56.77 | 88.09 |
| | FP8-Static | 71.33 | 56.57 | 87.91 |
| | FP8-Dynamic | 73.33 | 55.15 | 88.07 |
**Note**
> - The above evaluation results were obtained by deploying with the vLLM framework and averaging over 5 runs (vLLM only supports the thinker component).
> - The hyperparameters used during evaluation are as follows:
> ```json
> {
>   "top_p": 0.95,
>   "temperature": 0.6,
>   "do_sample": true,
>   "max_model_len": 65536
> }
> ```
#### 2.6 Other Models
Other models such as GLM-4.6, Qwen2.5, and Seed-OSS have been evaluated on benchmarks like `CEVAL`, `MMLU`, and `GSM8K` using quantization strategies including `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ`.
**Benchmark Experiment Details**
| Model | Quantization | CEVAL | MMLU | GSM8K |
|-------|--------------|-------|------|-------|
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| | FP8-Static | 66.27 | 60.23 | - |
| | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| | FP8-Static | 81.13 | 74.03 | 79.30 |
| | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| | INT4-AWQ | 79.35 | 73.22 | 79.38 |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| | FP8-Static | 87.59 | 83.08 | 81.58 |
| | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| | INT4-AWQ | 87.00 | 82.64 | - |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| | FP8-Static | 53.57 | 54.17 | 76.19 |
| | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| | INT4-AWQ | 53.49 | 53.70 | - |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| | FP8-Static | 77.56 | 74.66 | 86.73 |
| | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| | INT4-AWQ | 74.81 | 73.00 | 86.05 |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| | FP8-Static | 83.43 | 80.90 | 87.57 |
| | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| | INT4-AWQ | 82.84 | 80.15 | 87.19 |
## 📝 License
The code for this project is open-sourced under the [License for AngelSlim](LICENSE).
## 🔗 Citation
```bibtex
@software{AngelSlim2025,
  title={{AngelSlim}},
  author={Tencent AngelSlim Project Contributors},
  year={2025},
  month={6},
  url={https://github.com/Tencent/AngelSlim},
}
```
## 💬 Technical Discussion
* AngelSlim is continuously iterating and new features will be released soon. If you have any questions or suggestions, please open an issue on [GitHub Issues](https://github.com/Tencent/AngelSlim/issues) or join our [WeChat discussion group](https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/angel_slim_wechat.png?raw=true).