---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
base_model: Qwen/Qwen3-Coder-Next
---

# Model Overview

- **Model Architecture:** qwen3_next
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.1.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11)
  - **moe**
    - **Weight quantization:** OCP MXFP4, Static
    - **Activation quantization:** OCP MXFP4, Dynamic
  - **attn:** `linear_attn.out_proj`, `self_attn.o_proj`
    - **Weight quantization:** OCP MXFP4, Static
    - **Activation quantization:** OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the Qwen3-Coder-Next model by applying AMD-Quark for MXFP4 quantization.

# Model Quantization

The model was quantized from Qwen/Qwen3-Coder-Next using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations are quantized to MXFP4.
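MXFP4, from the OCP Microscaling Formats (MX) specification, stores values in 32-element blocks that share one power-of-two (E8M0) scale, with each element held as a 4-bit E2M1 float. The toy NumPy round-trip below illustrates the number format only; the helper name and rounding details are illustrative and do not reflect Quark's internal implementation:

```python
import numpy as np

# FP4 (E2M1) representable magnitudes, per the OCP Microscaling (MX) spec
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_round_trip(x, block_size=32):
    """Quantize-dequantize x with per-block power-of-two scales (toy sketch)."""
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    for start in range(0, x.size, block_size):
        block = x[start:start + block_size]
        amax = np.abs(block).max()
        if amax == 0.0:
            continue  # all-zero block stays zero
        # Shared E8M0 scale: smallest power of two with amax / scale <= 6.0
        scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1]))
        # Snap each magnitude to the nearest FP4 grid point
        mags = np.abs(block) / scale
        idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[start:start + block_size] = np.sign(block) * FP4_GRID[idx] * scale
    return out

vals = np.array([0.01, -0.2, 0.5, -1.3, 2.7, 0.0, 0.06, -0.9])
deq = mxfp4_round_trip(vals, block_size=8)  # e.g. 2.7 -> 3.0, -1.3 -> -1.5
```

Each block thus costs 4 bits per value plus one shared 8-bit exponent, which is where the memory savings over BF16 come from.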

**Quantization scripts:**

Note that `qwen3_next` is not in the built-in model template list in Quark V0.11, so it must be registered before quantization.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from quark.torch import LLMTemplate, ModelQuantizer, export_safetensors
from quark.contrib.llm_eval import ppl_eval

# Register qwen3_next template
qwen3_next_template = LLMTemplate(
    model_type="qwen3_next",
    kv_layers_name=["*qkvz"],
    q_layer_name="*qkvz",
    exclude_layers_name=["lm_head", "*linear_attn.in_proj_ba", "*linear_attn.in_proj_qkvz","*mlp.gate", "*mlp.shared_expert_gate", "*self_attn.k_proj", "*self_attn.q_proj", "*self_attn.v_proj"],
)
LLMTemplate.register_template(qwen3_next_template)

# Configuration
ckpt_path = "Qwen/Qwen3-Coder-Next"
output_dir = "amd/Qwen3-Coder-Next-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = ["lm_head", "*linear_attn.in_proj_ba", "*linear_attn.in_proj_qkvz","*mlp.gate", "*mlp.shared_expert_gate", "*self_attn.k_proj", "*self_attn.q_proj", "*self_attn.v_proj"]

# Load model
tokenizer = AutoTokenizer.from_pretrained(ckpt_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype="auto", device_map="auto")
model.eval()

# Get quant config from template
template = LLMTemplate.get(model.config.model_type)
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)

# Quantize
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)
model = quantizer.freeze(model)

# Export hf_format
export_safetensors(model, output_dir, custom_mode="quark")
tokenizer.save_pretrained(output_dir)

# Evaluate PPL (optional)
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
testenc = tokenizer("\n\n".join(testdata["text"]), return_tensors="pt")
ppl = ppl_eval(model, testenc, model.device)
print(f"Perplexity: {ppl.item()}")
```


# Deployment
## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
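As a minimal sketch, the model can be served with a single command (the tensor-parallel size here is illustrative; the full flag set used for evaluation appears in the Reproduction section below):

```
vllm serve amd/Qwen3-Coder-Next-MXFP4 \
  --tensor-parallel-size 4 \
  --trust-remote-code
```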

## Evaluation
The model was evaluated on the GSM8K benchmark.

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>Qwen3-Coder-Next</strong>
   </td>
   <td><strong>Qwen3-Coder-Next-MXFP4 (this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>GSM8K (flexible-extract)
   </td>
   <td>94.54
   </td>
   <td>93.25
   </td>
   <td>98.6%
   </td>
  </tr>
</table>

### Reproduction

The GSM8K results were obtained using the `lm-evaluation-harness` framework, based on the Docker image `vllm/vllm-openai-rocm:v0.14.0`.

First install vLLM (at commit `ecb4f822091a64b5084b3a4aff326906487a363f`) and lm-eval (version 0.4.10) inside the container:
```
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout ecb4f822091a64b5084b3a4aff326906487a363f
python3 setup.py develop

pip install lm-eval==0.4.10
```

#### Launching server
```
MODEL=amd/Qwen3-Coder-Next-MXFP4
SAFETENSORS_FAST_GPU=1 \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
vllm serve $MODEL \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
```
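Before running the harness, you can smoke-test the endpoint with a minimal standard-library client. The prompt is illustrative, and the actual request is left commented out so the snippet runs even without a live server:

```python
import json
from urllib import request

# Endpoint exposed by the `vllm serve` command above
url = "http://localhost:8000/v1/completions"
payload = {
    "model": "amd/Qwen3-Coder-Next-MXFP4",
    "prompt": "Write a Python function that reverses a string.\n",
    "max_tokens": 128,
}
req = request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```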

#### Evaluating model in a new terminal
```
lm_eval \
  --model local-completions \
  --model_args "model=amd/Qwen3-Coder-Next-MXFP4,base_url=http://localhost:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048,tokenized_requests=False,tokenizer_backend=None" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto
```

# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.