Qwen3-4B-Int8-Dynamic

Model Overview

This is an INT8 dynamic-quantized (W8A8) version of Qwen3-4B, produced by the Baidu AICAPX Team and optimized for Kunlun XPU.

πŸ”— Original Model: Qwen/Qwen3-4B

Quantization Details

This model is quantized with a SmoothQuant + GPTQ pipeline using the INT8 dynamic quantization (W8A8) method: INT8 weights with per-channel scales, and INT8 activations with per-token scales computed at runtime.

Quantization Configuration

| | Weight | Activation |
|---|---|---|
| Symmetric Quant | ✅ | ✅ |
| Asymmetric Quant | ❌ | ❌ |
| Granularity | per-channel | per-token |

| Model Type | Supported |
|---|---|
| Dense | ✅ |
| MoE | ❌ |
| Component | Configuration |
|---|---|
| Quantization Method | SmoothQuant + GPTQ |
| Smoothing Strength | 0.8 |
| Smoothing Targets | q_proj, k_proj, v_proj ← input_layernorm; gate_proj, up_proj ← post_attention_layernorm |
| Block Size | 128 |
| Format | int-quantized (compressed-tensors) |
| Ignored Layers | lm_head, mlp.gate, mlp.shared_expert_gate |
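To make the smoothing step above concrete, here is an illustrative sketch (not the actual pipeline code) of how SmoothQuant computes per-channel smoothing scales with strength α = 0.8: each input channel j gets a scale s_j = max|X_j|^α / max|W_j|^(1−α); activations are divided by s_j and the corresponding weight columns multiplied by s_j, so the matmul result is unchanged while activation outliers shrink.

```python
# Illustrative SmoothQuant smoothing sketch (toy values; not the model's
# actual quantization code). alpha is the "Smoothing Strength" above.

def smooth_scales(act_absmax, weight_absmax, alpha=0.8):
    """Per-channel smoothing scales s_j = a_j**alpha / w_j**(1 - alpha)."""
    return [a**alpha / w**(1 - alpha) for a, w in zip(act_absmax, weight_absmax)]

# Toy per-channel absolute maxima: channel 0 has an activation outlier.
act_absmax = [40.0, 2.0, 1.0]
weight_absmax = [0.5, 0.5, 0.5]

scales = smooth_scales(act_absmax, weight_absmax)
# Dividing activations by s_j flattens the outlier; multiplying weights
# by s_j absorbs it, keeping X @ W mathematically unchanged.
smoothed_act = [a / s for a, s in zip(act_absmax, scales)]
smoothed_w = [w * s for w, s in zip(weight_absmax, scales)]

print(scales)
print(smoothed_act)  # outlier channel is far less extreme after smoothing
```

After smoothing, the activation range across channels is much flatter, which is what makes the subsequent per-token INT8 activation quantization accurate.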

Detailed Configuration:

```json
{
  "quant_method": "compressed-tensors",
  "format": "int-quantized",
  "weights": {
    "num_bits": 8,
    "symmetric": true,
    "strategy": "channel",
    "observer": "minmax"
  },
  "input_activations": {
    "num_bits": 8,
    "symmetric": true,
    "strategy": "token",
    "dynamic": true
  }
}
```
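The scheme in this config can be sketched in a few lines of plain Python (illustrative only, not the compressed-tensors implementation): weights get one symmetric INT8 scale per output channel from a min/max observer, while activations get one symmetric INT8 scale per token, computed dynamically at runtime.

```python
# Minimal W8A8 sketch matching the config above (illustrative only):
# weights: symmetric INT8, per-channel ("strategy": "channel", minmax observer)
# activations: symmetric INT8, per-token, scales found at runtime ("dynamic": true)

def quantize_symmetric(values, num_bits=8):
    """Symmetric INT quantization of one channel/token: returns (ints, scale)."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

# Per-channel weight quantization: one scale per output channel (row).
weight = [[0.02, -0.5, 0.31], [1.2, -0.7, 0.05]]
w_q = [quantize_symmetric(row) for row in weight]

# Per-token dynamic activation quantization: one scale per token, at runtime.
activations = [[0.1, 8.0, -3.2], [0.01, 0.02, -0.03]]
a_q = [quantize_symmetric(tok) for tok in activations]

for q, scale in w_q:
    print(q, scale)
```

Because activation scales are computed per token at inference time, no activation calibration statistics need to be stored in the checkpoint; only the weight scales are serialized.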

Requirements

This model requires vLLM >= 0.11.0 for inference. For Kunlun XPU, please use vllm-kunlun >= 0.11.0.

Software Dependencies

  • transformers >= 4.51.0
  • vllm >= 0.11.0 (for NVIDIA GPU)
  • vllm-kunlun >= 0.11.0 (for Kunlun XPU)

With transformers<4.51.0, you will encounter the following error:

```
KeyError: 'qwen3'
```

Deployment with SGLang

```shell
python -m sglang.launch_server --model-path Qwen3-4B-Int8-Dynamic --reasoning-parser qwen3
```
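Since the model requires vLLM >= 0.11.0, a vLLM OpenAI-compatible server launch should look roughly like the following; the exact flag set is an assumption based on recent vLLM releases, so check your installed version's CLI help.

```shell
# Sketch of a vLLM OpenAI-compatible server launch (flags assumed from
# recent vLLM releases; verify with `vllm serve --help`).
vllm serve Qwen3-4B-Int8-Dynamic --reasoning-parser qwen3
```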

Best Practices

  • Thinking mode: Temperature=0.6, TopP=0.95, TopK=20, MinP=0. DO NOT use greedy decoding.
  • Non-thinking mode: Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
  • Set presence_penalty between 0 and 2 to reduce endless repetition.
  • For long texts (>32K tokens), enable YaRN for up to 131K context.
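The recommendations above can be bundled into per-mode sampling presets. The key names below mirror common OpenAI-compatible / vLLM sampling fields; how a given client passes them through (e.g. via `extra_body`) is an assumption, so adapt to your client library.

```python
# Sampling presets from the Best Practices list above. Field names follow
# common OpenAI-compatible / vLLM conventions (an assumption, not this
# model card's API).

SAMPLING_PRESETS = {
    "thinking": {       # do NOT use greedy decoding in thinking mode
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
    },
    "non_thinking": {
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "min_p": 0.0,
    },
}

def preset(mode, presence_penalty=0.0):
    """Return sampling kwargs; presence_penalty in [0, 2] curbs repetition."""
    if not 0.0 <= presence_penalty <= 2.0:
        raise ValueError("presence_penalty should be between 0 and 2")
    return {**SAMPLING_PRESETS[mode], "presence_penalty": presence_penalty}

print(preset("thinking", presence_penalty=1.5))
```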

For more details on thinking/non-thinking modes, refer to Qwen3 Documentation.

Accuracy

| Model | IFEval | SuperGPQA | MMMU |
|---|---|---|---|
| Qwen3-4B-Int8-Dynamic | 87.0 | 35.0 | - |

Credits

  • Original Model: Qwen/Qwen3-4B by Qwen Team
  • Quantization: Baidu AICAPX Team
  • Optimized for: Kunlun XPU with vllm-kunlun >= 0.11.0