---
license: apache-2.0
---

# Model Overview

- **Model Architecture:** qwen3_next
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.1.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11)
- **MoE Quantization:**
  - **Weight quantization:** MoE-only, OCP MXFP4, static
  - **Activation quantization:** MoE-only, OCP MXFP4, dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the Qwen3-Coder-Next model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) MXFP4 quantization.

# Model Quantization

The model was quantized from [Qwen/Qwen3-Coder-Next]() using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations of the MoE layers are quantized to MXFP4.

**Quantization scripts:**

Note that `qwen3_next` is not in the built-in model template list of Quark V0.11, so it must be registered before quantization.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from datasets import load_dataset
from quark.torch import LLMTemplate, ModelQuantizer, export_safetensors
from quark.contrib.llm_eval import ppl_eval

# Register the qwen3_next template (not built into Quark V0.11)
qwen3_next_template = LLMTemplate(
    model_type="qwen3_next",
    kv_layers_name=["*qkvz"],
    q_layer_name="*qkvz",
    exclude_layers_name=[
        "lm_head", "*linear_attn.in_proj_ba", "*linear_attn.in_proj_qkvz",
        "*mlp.gate", "*mlp.shared_expert_gate", "*self_attn.k_proj",
        "*self_attn.q_proj", "*self_attn.v_proj",
    ],
)
LLMTemplate.register_template(qwen3_next_template)

# Configuration
ckpt_path = "Qwen/Qwen3-Coder-Next"
output_dir = "amd/Qwen3-Coder-Next-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = [
    "lm_head", "*linear_attn.in_proj_ba", "*linear_attn.in_proj_qkvz",
    "*mlp.gate", "*mlp.shared_expert_gate", "*self_attn.k_proj",
    "*self_attn.q_proj", "*self_attn.v_proj",
]

# Load model, tokenizer, and processor
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype="auto", device_map="auto")
model.eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(ckpt_path, trust_remote_code=True)

# Get the quantization config from the registered template
template = LLMTemplate.get(model.config.model_type)
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)

# Quantize
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)
model = quantizer.freeze(model)

# Export in Hugging Face safetensors format
export_safetensors(model, output_dir, custom_mode="quark")
tokenizer.save_pretrained(output_dir)
processor.save_pretrained(output_dir)

# Evaluate perplexity on WikiText-2
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
testenc = tokenizer("\n\n".join(testdata["text"]), return_tensors="pt")
ppl = ppl_eval(model, testenc, model.device)
print(f"Perplexity: {ppl.item()}")
```
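
For intuition, and independently of the Quark API above: OCP MXFP4 stores each block of 32 values as FP4 (E2M1) elements that share one power-of-two (E8M0) scale. The NumPy sketch below illustrates that rounding under one common scale-selection convention; `mxfp4_quantize_block` is an illustrative fake-quantizer, not Quark's implementation.

```python
import numpy as np

# FP4 (E2M1) representable magnitudes, per the OCP Microscaling (MX) spec
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Fake-quantize one block of 32 values to MXFP4: a shared power-of-two
    (E8M0) scale plus FP4 (E2M1) elements. Returns the dequantized values."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    # One common convention: choose the power-of-two scale that maps the
    # block's max magnitude near FP4's largest representable value (6.0)
    scale = 2.0 ** np.floor(np.log2(amax / FP4_GRID[-1]))
    scaled = block / scale
    # Round each element to the nearest FP4 magnitude, keeping the sign
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

x = np.random.randn(32)
xq = mxfp4_quantize_block(x)  # every entry has the form ±g * scale, g in FP4_GRID
```

Because the scale is a single power of two per 32-element block, storage is roughly 4.25 bits per value, which is where MXFP4's memory savings come from.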

# Deployment

## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

# Evaluation

The model was evaluated on the GSM8K benchmark.

## Accuracy

| Benchmark | Qwen3-Coder-Next | Qwen3-Coder-Next-MXFP4 (this model) | Recovery |
|---|---|---|---|
| GSM8K (strict-match) | 94.69 | 93.18 | 98.41% |

## Reproduction

The GSM8K results were obtained with the `lm-evaluation-harness` framework, based on the Docker image `vllm/vllm-openai-rocm:v0.14.0`.

First install vLLM (commit `ecb4f822091a64b5084b3a4aff326906487a363f`) and lm-eval (version 0.4.10) inside the container:

```shell
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout ecb4f822091a64b5084b3a4aff326906487a363f
python3 setup.py develop

pip install lm-eval==0.4.10
```

### Launching the server

```shell
MODEL=amd/Qwen3-Coder-Next-MXFP4
SAFETENSORS_FAST_GPU=1 \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
vllm serve $MODEL \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
```
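
Once the server is up, any OpenAI-compatible client can query it. A minimal standard-library sketch is shown below; the prompt and sampling values are arbitrary examples, and the actual request is commented out so it only runs against a live server:

```python
import json
import urllib.request

# Build an OpenAI-style completions request for the server launched above;
# the prompt, max_tokens, and temperature are illustrative values only
payload = {
    "model": "amd/Qwen3-Coder-Next-MXFP4",
    "prompt": "Write a Python function that reverses a string.",
    "max_tokens": 256,
    "temperature": 0.0,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server from the previous step is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```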

### Evaluating the model in a new terminal

```shell
lm_eval \
  --model local-completions \
  --model_args "model=amd/Qwen3-Coder-Next-MXFP4,base_url=http://localhost:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048,tokenized_requests=False,tokenizer_backend=None" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto
```

# License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.