---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
base_model: Qwen/Qwen3-Coder-Next
---
# Model Overview
- **Model Architecture:** qwen3_next
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.1.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11)
- **MoE layers:**
  - **Weight quantization:** OCP MXFP4, Static
  - **Activation quantization:** OCP MXFP4, Dynamic
- **Attention layers** (`linear_attn.out_proj`, `self_attn.o_proj`):
  - **Weight quantization:** OCP MXFP4, Static
  - **Activation quantization:** OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)
This model was built from the Qwen3-Coder-Next model by applying AMD-Quark MXFP4 quantization.
# Model Quantization
The model was quantized from Qwen/Qwen3-Coder-Next using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations are quantized to MXFP4.
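For intuition, MXFP4 stores each 32-element block as FP4 (E2M1) values sharing a single power-of-two (E8M0) scale. A minimal NumPy fake-quantization sketch, illustrative only and not the Quark implementation:

```python
import numpy as np

# FP4 E2M1 magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(x):
    """Fake-quantize one 32-element block to MXFP4 (illustrative sketch).

    Per the OCP MX format, the block shares one power-of-two scale
    (E8M0) and each element is stored as FP4 E2M1.
    """
    assert x.size == 32
    amax = np.abs(x).max()
    if amax == 0:
        return np.zeros_like(x)
    # Shared exponent: align the block max with FP4's largest exponent (2)
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = np.clip(np.abs(x) / scale, 0.0, 6.0)
    # Snap each scaled magnitude to the nearest representable FP4 value
    idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

block = np.random.randn(32)
dequantized = quantize_mxfp4_block(block)
```

Quark applies this scheme statically to the weights (scales fixed at quantization time, aided by the calibration set) and dynamically to activations (scales computed per block at runtime).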
**Quantization scripts:**
Note that `qwen3_next` is not in the built-in model template list of Quark V0.11, so it must be registered before quantization.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from quark.torch import LLMTemplate, ModelQuantizer, export_safetensors
from quark.contrib.llm_eval import ppl_eval
# Register qwen3_next template
qwen3_next_template = LLMTemplate(
model_type="qwen3_next",
kv_layers_name=["*qkvz"],
q_layer_name="*qkvz",
exclude_layers_name=["lm_head", "*linear_attn.in_proj_ba", "*linear_attn.in_proj_qkvz", "*mlp.gate", "*mlp.shared_expert_gate", "*self_attn.k_proj", "*self_attn.q_proj", "*self_attn.v_proj"],
)
LLMTemplate.register_template(qwen3_next_template)
# Configuration
ckpt_path = "Qwen/Qwen3-Coder-Next"
output_dir = "amd/Qwen3-Coder-Next-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = ["lm_head", "*linear_attn.in_proj_ba", "*linear_attn.in_proj_qkvz", "*mlp.gate", "*mlp.shared_expert_gate", "*self_attn.k_proj", "*self_attn.q_proj", "*self_attn.v_proj"]
# Load model
tokenizer = AutoTokenizer.from_pretrained(ckpt_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype="auto", device_map="auto")
model.eval()
# Get quant config from template
template = LLMTemplate.get(model.config.model_type)
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)
# Quantize
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)
model = quantizer.freeze(model)
# Export hf_format
export_safetensors(model, output_dir, custom_mode="quark")
tokenizer.save_pretrained(output_dir)
# Evaluate PPL (optional)
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
testenc = tokenizer("\n\n".join(testdata["text"]), return_tensors="pt")
ppl = ppl_eval(model, testenc, model.device)
print(f"Perplexity: {ppl.item()}")
```
# Deployment
## Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
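Once a vLLM server is serving the model (see the launch command under Reproduction below), it exposes an OpenAI-compatible API. A minimal request sketch using the `openai` Python client, assuming the default local endpoint and port:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the api_key value is unused locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="amd/Qwen3-Coder-Next-MXFP4",
    messages=[
        {"role": "user", "content": "Write a Python function that checks if a number is prime."}
    ],
    max_tokens=512,
)
print(completion.choices[0].message.content)
```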
# Evaluation
The model was evaluated on the GSM8K benchmark.
### Accuracy
<table>
<tr>
<td><strong>Benchmark</strong>
</td>
<td><strong>Qwen3-Coder-Next </strong>
</td>
<td><strong>Qwen3-Coder-Next-MXFP4 (this model)</strong>
</td>
<td><strong>Recovery</strong>
</td>
</tr>
<tr>
<td>GSM8K (flexible-extract)
</td>
<td>94.54
</td>
<td>93.25
</td>
<td>98.6%
</td>
</tr>
</table>
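Recovery here is the quantized score expressed as a percentage of the baseline score:

```python
baseline = 94.54   # Qwen3-Coder-Next, GSM8K flexible-extract
quantized = 93.25  # this MXFP4 model

recovery = quantized / baseline * 100
print(f"Recovery: {recovery:.1f}%")  # Recovery: 98.6%
```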
### Reproduction
The GSM8K results were obtained using the `lm-evaluation-harness` framework, based on the Docker image `vllm/vllm-openai-rocm:v0.14.0`.
Inside the container, first install vLLM (commit `ecb4f822091a64b5084b3a4aff326906487a363f`) and lm-eval (version 0.4.10):
```
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout ecb4f822091a64b5084b3a4aff326906487a363f
python3 setup.py develop
pip install lm-eval==0.4.10
```
#### Launching server
```
MODEL=amd/Qwen3-Coder-Next-MXFP4
SAFETENSORS_FAST_GPU=1 \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
vllm serve $MODEL \
--tensor-parallel-size 4 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--trust-remote-code
```
#### Evaluating the model in a new terminal
```
lm_eval \
--model local-completions \
--model_args "model=amd/Qwen3-Coder-Next-MXFP4,base_url=http://localhost:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048,tokenized_requests=False,tokenizer_backend=None" \
--tasks gsm8k \
--num_fewshot 5 \
--batch_size auto
```
# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.