---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
base_model: Qwen/Qwen3-Coder-Next
---

# Model Overview

- **Model Architecture:** qwen3_next
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.1.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11)
- **moe**
  - **Weight quantization:** OCP MXFP4, Static
  - **Activation quantization:** OCP MXFP4, Dynamic
- **attn:** `linear_attn.out_proj`, `self_attn.o_proj`
  - **Weight quantization:** OCP MXFP4, Static
  - **Activation quantization:** OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the Qwen3-Coder-Next model by applying AMD-Quark for MXFP4 quantization.

# Model Quantization

The model was quantized from Qwen/Qwen3-Coder-Next using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations are quantized to MXFP4.

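MXFP4 is the OCP microscaling 4-bit floating-point format: each element is an FP4 E2M1 value (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and each block of 32 elements shares one power-of-two (E8M0) scale. The sketch below is an illustrative fake-quantizer showing this arithmetic on a single block; it is not Quark's implementation, which is applied via the script further down.

```python
import torch

# Non-negative FP4 E2M1 magnitudes; the sign is handled separately.
FP4_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_fake_quantize_block(x: torch.Tensor) -> torch.Tensor:
    """Fake-quantize one 32-element block to MXFP4, returning dequantized values."""
    assert x.numel() == 32, "MXFP4 uses blocks of 32 elements"
    amax = x.abs().max()
    if amax == 0:
        return torch.zeros_like(x)
    # Shared E8M0 scale: a power of two chosen so the block's largest
    # magnitude lands near the top of the E2M1 range (max value 6.0).
    scale = 2.0 ** (torch.floor(torch.log2(amax)) - 2)
    scaled = (x / scale).clamp(-6.0, 6.0)
    # Round every element to the nearest representable E2M1 magnitude.
    idx = (scaled.abs().unsqueeze(-1) - FP4_E2M1_GRID).abs().argmin(dim=-1)
    return scaled.sign() * FP4_E2M1_GRID[idx] * scale

block = torch.randn(32)
print((mxfp4_fake_quantize_block(block) - block).abs().max())  # worst-case error in this block
```
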
**Quantization scripts:**

Note that `qwen3_next` is not in the built-in model template list of Quark V0.11, so the template must be registered before quantization.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from quark.torch import LLMTemplate, ModelQuantizer, export_safetensors
from quark.contrib.llm_eval import ppl_eval

# Register the qwen3_next template (not built into Quark V0.11)
qwen3_next_template = LLMTemplate(
    model_type="qwen3_next",
    kv_layers_name=["*qkvz"],
    q_layer_name="*qkvz",
    exclude_layers_name=[
        "lm_head",
        "*linear_attn.in_proj_ba",
        "*linear_attn.in_proj_qkvz",
        "*mlp.gate",
        "*mlp.shared_expert_gate",
        "*self_attn.k_proj",
        "*self_attn.q_proj",
        "*self_attn.v_proj",
    ],
)
LLMTemplate.register_template(qwen3_next_template)

# Configuration
ckpt_path = "Qwen/Qwen3-Coder-Next"
output_dir = "amd/Qwen3-Coder-Next-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = [
    "lm_head",
    "*linear_attn.in_proj_ba",
    "*linear_attn.in_proj_qkvz",
    "*mlp.gate",
    "*mlp.shared_expert_gate",
    "*self_attn.k_proj",
    "*self_attn.q_proj",
    "*self_attn.v_proj",
]

# Load model
tokenizer = AutoTokenizer.from_pretrained(ckpt_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype="auto", device_map="auto")
model.eval()

# Get quant config from the registered template
template = LLMTemplate.get(model.config.model_type)
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)

# Quantize
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)
model = quantizer.freeze(model)

# Export in HF safetensors format
export_safetensors(model, output_dir, custom_mode="quark")
tokenizer.save_pretrained(output_dir)

# Evaluate perplexity (optional)
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
testenc = tokenizer("\n\n".join(testdata["text"]), return_tensors="pt")
ppl = ppl_eval(model, testenc, model.device)
print(f"Perplexity: {ppl.item()}")
```

# Deployment

## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

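Once the server is running (see the launch command under Reproduction below), the model can be queried through vLLM's OpenAI-compatible API. A minimal sketch, assuming the default port 8000 and the serve command shown below:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="amd/Qwen3-Coder-Next-MXFP4",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
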
## Evaluation

The model was evaluated on the GSM8K benchmark.

### Accuracy

| Benchmark | Qwen3-Coder-Next | Qwen3-Coder-Next-MXFP4 (this model) | Recovery |
| --- | --- | --- | --- |
| GSM8K (flexible-extract) | 94.54 | 93.25 | 98.6% |

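Recovery is the quantized model's score expressed as a fraction of the Qwen3-Coder-Next baseline: 93.25 / 94.54 ≈ 98.6%.
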
### Reproduction

The GSM8K results were obtained with the `lm-evaluation-harness` framework, running inside the Docker image `vllm/vllm-openai-rocm:v0.14.0`.

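One plausible way to start the container is shown below; the image name comes from above, while the device flags and entrypoint override are typical for ROCm containers and may need adjusting for your setup:

```
docker run -it --rm \
  --network host \
  --ipc host \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  --shm-size 16G \
  --entrypoint /bin/bash \
  vllm/vllm-openai-rocm:v0.14.0
```
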
Install vLLM (commit `ecb4f822091a64b5084b3a4aff326906487a363f`) and lm-eval (version 0.4.10) in the container first:
```
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout ecb4f822091a64b5084b3a4aff326906487a363f
python3 setup.py develop

pip install lm-eval==0.4.10
```

#### Launching server
```
MODEL=amd/Qwen3-Coder-Next-MXFP4
SAFETENSORS_FAST_GPU=1 \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
vllm serve $MODEL \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
```

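Before starting the evaluation, you can confirm the server is ready; listing the served models is part of vLLM's OpenAI-compatible API:
```
curl http://localhost:8000/v1/models
```
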
#### Evaluating the model in a new terminal
```
lm_eval \
  --model local-completions \
  --model_args "model=amd/Qwen3-Coder-Next-MXFP4,base_url=http://localhost:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048,tokenized_requests=False,tokenizer_backend=None" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto
```

# License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.