# Qwen3-4B-Int8-Dynamic

## Model Overview

An INT8 dynamic-quantized version of Qwen3-4B, quantized by the Baidu AICAPX Team and optimized for Kunlun XPU.

🔗 Original Model: Qwen/Qwen3-4B
## Quantization Details

This model is quantized with a SmoothQuant + GPTQ pipeline using the INT8 dynamic quantization (W8A8) method.

### Quantization Configuration
| | Weight | Activation |
|---|---|---|
| Symmetric Quant | ✅ | ✅ |
| Asymmetric Quant | ❌ | ❌ |
| Granularity | per-channel | per-token |

Model type: Dense ✅ / MoE ❌ (Qwen3-4B is a dense model).
| Component | Configuration |
|---|---|
| Quantization Method | SmoothQuant + GPTQ |
| Smoothing Strength | 0.8 |
| Smoothing Targets | q_proj, k_proj, v_proj ← input_layernorm; gate_proj, up_proj ← post_attention_layernorm |
| Block Size | 128 |
| Format | int-quantized (compressed-tensors) |
| Ignored Layers | lm_head, mlp.gate, mlp.shared_expert_gate |
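The SmoothQuant + GPTQ pipeline with the settings above corresponds roughly to an llm-compressor-style recipe. The sketch below is an assumption reconstructed from this table, not the team's published recipe; modifier names and field layouts should be checked against the llm-compressor documentation:

```yaml
# Hypothetical recipe sketch mirroring the configuration table above.
# The actual recipe used for this checkpoint is not published here.
quant_stage:
  quant_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.8
      mappings:
        - [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"]
        - [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
    GPTQModifier:
      ignore: ["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights: {num_bits: 8, symmetric: true, strategy: "channel"}
          input_activations: {num_bits: 8, symmetric: true, strategy: "token", dynamic: true}
```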
Detailed Configuration:

```json
{
  "quant_method": "compressed-tensors",
  "format": "int-quantized",
  "weights": {
    "num_bits": 8,
    "symmetric": true,
    "strategy": "channel",
    "observer": "minmax"
  },
  "input_activations": {
    "num_bits": 8,
    "symmetric": true,
    "strategy": "token",
    "dynamic": true
  }
}
```
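To make "dynamic per-token" concrete: each token's activation row is quantized at runtime with its own scale, derived from that row's absolute maximum, so no calibration-time activation scales are stored. A minimal pure-Python sketch of the symmetric INT8 scheme (the real kernels are fused and vectorized, and rounding details may differ):

```python
def quantize_per_token(row):
    """Symmetric INT8 quantization of one activation row (one token).

    scale = max(|x|) / 127, computed per row at runtime ("dynamic");
    an all-zero row falls back to scale 1.0 to avoid division by zero.
    """
    scale = max(abs(v) for v in row) / 127.0 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in row]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float activations from INT8 values."""
    return [v * scale for v in q]


# Two tokens with very different ranges: each gets its own scale,
# which is the point of per-token (rather than per-tensor) quantization.
activations = [
    [0.5, -1.27, 0.02],    # token 0: small range
    [12.7, -6.35, 3.175],  # token 1: 10x larger range
]
for row in activations:
    q, scale = quantize_per_token(row)
    approx = dequantize(q, scale)
```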
## Requirements
This model requires vLLM >= 0.11.0 for inference. For Kunlun XPU, please use vllm-kunlun >= 0.11.0.
Software Dependencies
transformers >= 4.51.0vllm >= 0.11.0(for NVIDIA GPU)vllm-kunlun >= 0.11.0(for Kunlun XPU)
With transformers<4.51.0, you will encounter the following error:
KeyError: 'qwen3'
## Deployment with SGLang

```shell
python -m sglang.launch_server --model-path Qwen3-4B-Int8-Dynamic --reasoning-parser qwen3
```
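Once launched, the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint. A request body can be built with the standard library alone; the model name below matches the launch command, while the port and endpoint path are assumptions to verify against your SGLang server's startup log:

```python
import json

# Chat-completions request body for the OpenAI-compatible endpoint.
# Sampling parameters should follow the Best Practices section below.
payload = {
    "model": "Qwen3-4B-Int8-Dynamic",
    "messages": [
        {"role": "user", "content": "Explain INT8 dynamic quantization briefly."}
    ],
    "max_tokens": 512,
}
body = json.dumps(payload)
# e.g. POST http://localhost:30000/v1/chat/completions with this JSON body
# (30000 is SGLang's usual default port -- an assumption; check your server log)
```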
## Best Practices

- Thinking mode: Temperature=0.6, TopP=0.95, TopK=20, MinP=0. DO NOT use greedy decoding.
- Non-thinking mode: Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
- Set presence_penalty between 0 and 2 to reduce endless repetitions.
- For long texts (>32K tokens), enable YaRN for up to 131K context.

For more details on thinking/non-thinking modes, refer to the Qwen3 Documentation.
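The two sampling presets can be captured in code. This is a sketch: parameter names follow the common vLLM/OpenAI-style convention and should be adapted to the exact field names of your client:

```python
# Recommended sampling presets from Best Practices, keyed by mode.
PRESETS = {
    "thinking":     {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "min_p": 0.0},
}


def sampling_params(mode, presence_penalty=1.0):
    """Return a sampling config dict for the given mode.

    Greedy decoding (temperature=0) is deliberately not offered, since the
    model card warns against it for thinking mode. presence_penalty in
    [0, 2] helps reduce endless repetition.
    """
    if not 0.0 <= presence_penalty <= 2.0:
        raise ValueError("presence_penalty should be between 0 and 2")
    return {**PRESETS[mode], "presence_penalty": presence_penalty}
```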
## Accuracy
| Model | IFEval | SuperGPQA | MMMU |
|---|---|---|---|
| Qwen3-4B-Int8-Dynamic | 87.0 | 35.0 | - |
## Credits
- Original Model: Qwen/Qwen3-4B by Qwen Team
- Quantization: Baidu AICAPX Team
- Optimized for: Kunlun XPU with vllm-kunlun >= 0.11.0