# Qwen3-Coder-Next GPTQ 4-bit
GPTQ 4-bit quantization of Qwen/Qwen3-Coder-Next, an 80B-parameter Mixture-of-Experts (MoE) coding model with 3B activated parameters per token.
## Model Overview
- Architecture: Qwen3NextForCausalLM (hybrid linear + full attention with DeltaNet)
- Total parameters: ~80B
- Activated parameters: ~3B per token (10 of 512 experts selected per token)
- Layers: 48 (36 linear attention + 12 full attention, repeating 3:1 pattern)
- Experts: 512 per layer + 1 shared expert per layer
- Context length: 262,144 tokens
- Supports: Tool calling, code generation, general chat
## Quantization Details
All 73,728 MoE expert modules (512 experts x 3 projections x 48 layers) are quantized to INT4 using GPTQ. Non-expert modules remain in FP16 to preserve quality.
| Component | Precision | Notes |
|---|---|---|
| MoE experts (gate_proj, up_proj, down_proj) | INT4 (GPTQ) | 73,728 modules quantized |
| Attention (q_proj, k_proj, v_proj, o_proj) | FP16 | Full precision |
| Linear attention (in_proj_qkvz, out_proj, in_proj_ba) | FP16 | Full precision |
| Shared experts | FP16 | Full precision |
| Embeddings, LM head, norms | FP16 | Full precision |
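The 73,728 figure above follows directly from the architecture numbers; a quick sanity check:

```python
# Module-count arithmetic from the card: 512 experts per layer, each with
# three quantized projections (gate_proj, up_proj, down_proj), over 48 layers.
experts_per_layer = 512
projections_per_expert = 3
num_layers = 48

quantized_modules = experts_per_layer * projections_per_expert * num_layers
print(quantized_modules)  # 73728
```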
GPTQ configuration:
- Bits: 4
- Group size: 32
- Symmetric: Yes
- desc_act: No
- true_sequential: Yes
- Failsafe: RTN for poorly-calibrated rare experts (7,650 of 73,728 modules, ~10.4%)
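The RTN failsafe is plain round-to-nearest quantization with no calibration data, which is why it is safe for experts that the calibration set rarely activates. A minimal pure-Python sketch of symmetric 4-bit RTN with group size 32, matching the settings above (illustrative only; GPTQModel's actual kernels differ):

```python
# Symmetric group-wise round-to-nearest (RTN) quantization sketch.
# Each group of 32 weights shares one scale; values map to [-8, 7] for INT4.

def rtn_quantize(weights, bits=4, group_size=32):
    qmax = 2 ** (bits - 1) - 1  # 7 for symmetric INT4
    qs, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # avoid zero scale
        scales.append(scale)
        qs.extend(max(-qmax - 1, min(qmax, round(w / scale))) for w in group)
    return qs, scales

def rtn_dequantize(qs, scales, group_size=32):
    return [q * scales[i // group_size] for i, q in enumerate(qs)]

weights = [0.5, -1.25, 0.031, 2.0] * 8  # one 32-value group
qs, scales = rtn_quantize(weights)
recon = rtn_dequantize(qs, scales)
max_err = max(abs(a - b) for a, b in zip(recon, weights))
```

The reconstruction error per weight is bounded by half a quantization step (scale / 2), which is what makes smaller group sizes like 32 more accurate than the common default of 128.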
## Calibration
- Dataset: mixed, evol-codealpaca-v1 (code) plus C4 (general text)
- Samples: 2,048 with context length binning (uniform distribution across 256-2048 token bins)
- Quantizer: GPTQModel v5.7.0
See `quantize.py` for the full quantization script.
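Context-length binning can be sketched as follows. This is a hypothetical illustration of spreading 2,048 calibration samples evenly across token-length bins between 256 and 2,048; the bin width and selection logic are assumptions, not the card's actual `quantize.py`:

```python
import random

def make_bins(lo=256, hi=2048, width=256):
    # Seven bins: [256, 512), [512, 768), ..., [1792, 2048)
    return [(start, start + width) for start in range(lo, hi, width)]

def bin_samples(sample_lengths, n_total=2048, lo=256, hi=2048, width=256):
    # Draw an equal number of samples from each length bin so short and
    # long contexts are represented uniformly in the calibration set.
    bins = make_bins(lo, hi, width)
    per_bin = n_total // len(bins)
    chosen = []
    for b_lo, b_hi in bins:
        pool = [l for l in sample_lengths if b_lo <= l < b_hi]
        chosen.extend(random.sample(pool, min(per_bin, len(pool))))
    return chosen, per_bin

random.seed(0)
lengths = [random.randint(256, 2047) for _ in range(20000)]
chosen, per_bin = bin_samples(lengths)
```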
## Model Size
| Version | Size | Compression |
|---|---|---|
| BF16 (original) | ~160 GB | - |
| GPTQ 4-bit | 47 GB | 3.4x |
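A quick check of the ratio in the table. The result lands below the naive 4x of a 16-bit to 4-bit conversion because attention, shared experts, embeddings, and per-group scales stay at higher precision:

```python
# Compression-ratio arithmetic from the sizes in the table above.
bf16_gb = 160
gptq_gb = 47
ratio = bf16_gb / gptq_gb
print(f"{ratio:.1f}x")  # 3.4x
```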
## Perplexity

Evaluated on wikitext-2-raw-v1 (test set) with `seq_len=2048`, `stride=512`:
| Model | Perplexity | Degradation |
|---|---|---|
| BF16 (original) | 6.9401 | - |
| GPTQ 4-bit | 6.9956 | +0.8% |
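The degradation column recomputes as below. The `window_starts` helper sketches the sliding-window positions that `seq_len=2048` with `stride=512` imply; it is illustrative, not the actual evaluation script:

```python
# Relative perplexity degradation from the table above.
ppl_bf16, ppl_gptq = 6.9401, 6.9956
degradation_pct = (ppl_gptq - ppl_bf16) / ppl_bf16 * 100
print(f"+{degradation_pct:.1f}%")  # +0.8%

def window_starts(n_tokens, seq_len=2048, stride=512):
    # Overlapping evaluation windows: each window covers seq_len tokens
    # and advances by stride, so only the new tokens are scored each step.
    return list(range(0, max(1, n_tokens - seq_len + 1), stride))
```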
## Usage

### vLLM (Recommended)

```shell
vllm serve btbtyler09/Qwen3-Coder-Next-GPTQ-4bit \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --quantization gptq \
  --max-model-len 32768
```
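Once the server is up, it exposes an OpenAI-compatible API. A minimal chat-completions payload (fields follow the OpenAI schema; nothing is sent here, and the localhost URL assumes vLLM's default port):

```python
# Example request body for the vLLM OpenAI-compatible endpoint.
payload = {
    "model": "btbtyler09/Qwen3-Coder-Next-GPTQ-4bit",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}
# POST this as JSON to http://localhost:8000/v1/chat/completions
```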
### Tool Calling
This model supports tool calling via the Qwen3-Coder chat template. The quantized model includes:
- `chat_template.jinja` - Chat template with tool support
- `qwen3coder_tool_parser_vllm.py` - vLLM tool parser plugin
- `qwen3_coder_detector_sgl.py` - SGLang tool detector
For vLLM tool calling:

```shell
vllm serve btbtyler09/Qwen3-Coder-Next-GPTQ-4bit \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --dtype float16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
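With the parser enabled, tools are declared in the OpenAI function-calling format. The `get_weather` tool below is a made-up illustration, not part of the model card:

```python
# Example tool definition plus a request body that exercises it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "btbtyler09/Qwen3-Coder-Next-GPTQ-4bit",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}
```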
## Credits
- Base Model: Qwen - Qwen3-Coder-Next
- Quantization: GPTQ via GPTQModel v5.7.0
- Quantized by: btbtyler09
## License
This model inherits the Apache 2.0 license from the base model.