Qwen3-4B-Int8-Dynamic

Model Overview

This is an INT8 dynamic-quantized (W8A8) version of Qwen3-4B, produced by the Baidu AICAPX Team and optimized for Kunlun XPU.

πŸ”— Original Model: Qwen/Qwen3-4B

Quantization Details

This model is quantized with a SmoothQuant + GPTQ pipeline using the INT8 dynamic quantization (W8A8) method: INT8 weights with per-channel scales, and INT8 activations with per-token scales computed at runtime.

Quantization Configuration

| | Weight | Activation |
|---|---|---|
| Symmetric Quant | ✅ | ✅ |
| Asymmetric Quant | ❌ | ❌ |
| Granularity | per-channel | per-token |

| Model Type | Supported |
|---|---|
| Dense | ✅ |
| MoE | ❌ |
| Component | Configuration |
|---|---|
| Quantization Method | SmoothQuant + GPTQ |
| Smoothing Strength | 0.8 |
| Smoothing Targets | q_proj, k_proj, v_proj ← input_layernorm; gate_proj, up_proj ← post_attention_layernorm |
| Block Size | 128 |
| Format | int-quantized (compressed-tensors) |
| Ignored Layers | lm_head, mlp.gate, mlp.shared_expert_gate |
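To make the smoothing step above concrete, here is an illustrative sketch (not the actual pipeline code) of how SmoothQuant computes per-channel smoothing scales with strength α = 0.8: each input channel j gets a scale s_j = max|X_j|^α / max|W_j|^(1−α); activations are divided by s_j and the corresponding weight columns multiplied by s_j, so the matmul result is unchanged while activation outliers shrink.

```python
# Illustrative SmoothQuant smoothing sketch (toy values; not the model's
# actual quantization code). alpha is the "Smoothing Strength" above.

def smooth_scales(act_absmax, weight_absmax, alpha=0.8):
    """Per-channel smoothing scales s_j = a_j**alpha / w_j**(1 - alpha)."""
    return [a**alpha / w**(1 - alpha) for a, w in zip(act_absmax, weight_absmax)]

# Toy per-channel absolute maxima: channel 0 has an activation outlier.
act_absmax = [40.0, 2.0, 1.0]
weight_absmax = [0.5, 0.5, 0.5]

scales = smooth_scales(act_absmax, weight_absmax)
# Dividing activations by s_j flattens the outlier; multiplying weights
# by s_j absorbs it, keeping X @ W mathematically unchanged.
smoothed_act = [a / s for a, s in zip(act_absmax, scales)]
smoothed_w = [w * s for w, s in zip(weight_absmax, scales)]

print(scales)
print(smoothed_act)  # outlier channel is far less extreme after smoothing
```

After smoothing, the activation range across channels is much flatter, which is what makes the subsequent per-token INT8 activation quantization accurate.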

Detailed Configuration:

```json
{
  "quant_method": "compressed-tensors",
  "format": "int-quantized",
  "weights": {
    "num_bits": 8,
    "symmetric": true,
    "strategy": "channel",
    "observer": "minmax"
  },
  "input_activations": {
    "num_bits": 8,
    "symmetric": true,
    "strategy": "token",
    "dynamic": true
  }
}
```
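The scheme in this config can be sketched in a few lines of plain Python (illustrative only, not the compressed-tensors implementation): weights get one symmetric INT8 scale per output channel from a min/max observer, while activations get one symmetric INT8 scale per token, computed dynamically at runtime.

```python
# Minimal W8A8 sketch matching the config above (illustrative only):
# weights: symmetric INT8, per-channel ("strategy": "channel", minmax observer)
# activations: symmetric INT8, per-token, scales found at runtime ("dynamic": true)

def quantize_symmetric(values, num_bits=8):
    """Symmetric INT quantization of one channel/token: returns (ints, scale)."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

# Per-channel weight quantization: one scale per output channel (row).
weight = [[0.02, -0.5, 0.31], [1.2, -0.7, 0.05]]
w_q = [quantize_symmetric(row) for row in weight]

# Per-token dynamic activation quantization: one scale per token, at runtime.
activations = [[0.1, 8.0, -3.2], [0.01, 0.02, -0.03]]
a_q = [quantize_symmetric(tok) for tok in activations]

for q, scale in w_q:
    print(q, scale)
```

Because activation scales are computed per token at inference time, no activation calibration statistics need to be stored in the checkpoint; only the weight scales are serialized.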

Requirements

This model requires vLLM >= 0.11.0 for inference. For Kunlun XPU, please use vllm-kunlun >= 0.11.0.

Software Dependencies

  • transformers >= 4.51.0
  • vllm >= 0.11.0 (for NVIDIA GPU)
  • vllm-kunlun >= 0.11.0 (for Kunlun XPU)

With transformers<4.51.0, you will encounter the following error:

```
KeyError: 'qwen3'
```

Deployment with SGLang

```shell
python -m sglang.launch_server --model-path Qwen3-4B-Int8-Dynamic --reasoning-parser qwen3
```
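Since the model requires vLLM >= 0.11.0, a vLLM OpenAI-compatible server launch should look roughly like the following; the exact flag set is an assumption based on recent vLLM releases, so check your installed version's CLI help.

```shell
# Sketch of a vLLM OpenAI-compatible server launch (flags assumed from
# recent vLLM releases; verify with `vllm serve --help`).
vllm serve Qwen3-4B-Int8-Dynamic --reasoning-parser qwen3
```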

Best Practices

  • Thinking mode: Temperature=0.6, TopP=0.95, TopK=20, MinP=0. DO NOT use greedy decoding.
  • Non-thinking mode: Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
  • Set presence_penalty between 0 and 2 to reduce endless repetition.
  • For long texts (>32K tokens), enable YaRN for up to 131K context.
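The recommendations above can be bundled into per-mode sampling presets. The key names below mirror common OpenAI-compatible / vLLM sampling fields; how a given client passes them through (e.g. via `extra_body`) is an assumption, so adapt to your client library.

```python
# Sampling presets from the Best Practices list above. Field names follow
# common OpenAI-compatible / vLLM conventions (an assumption, not this
# model card's API).

SAMPLING_PRESETS = {
    "thinking": {       # do NOT use greedy decoding in thinking mode
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
    },
    "non_thinking": {
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "min_p": 0.0,
    },
}

def preset(mode, presence_penalty=0.0):
    """Return sampling kwargs; presence_penalty in [0, 2] curbs repetition."""
    if not 0.0 <= presence_penalty <= 2.0:
        raise ValueError("presence_penalty should be between 0 and 2")
    return {**SAMPLING_PRESETS[mode], "presence_penalty": presence_penalty}

print(preset("thinking", presence_penalty=1.5))
```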

For more details on thinking/non-thinking modes, refer to Qwen3 Documentation.

Accuracy

| Model | IFEval | SuperGPQA | MMMU |
|---|---|---|---|
| Qwen3-4B-Int8-Dynamic | 87.0 | 35.0 | - |

Credits

  • Original Model: Qwen/Qwen3-4B by Qwen Team
  • Quantization: Baidu AICAPX Team
  • Optimized for: Kunlun XPU with vllm-kunlun >= 0.11.0