---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
base_model: Qwen/Qwen3-Coder-Next
---

# Model Overview

- **Model Architecture:** qwen3_next
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.1.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11)
  - **moe:**
    - **Weight quantization:** OCP MXFP4, Static
    - **Activation quantization:** OCP MXFP4, Dynamic
  - **attn** (`linear_attn.out_proj`, `self_attn.o_proj`):
    - **Weight quantization:** OCP MXFP4, Static
    - **Activation quantization:** OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the Qwen3-Coder-Next model by applying AMD-Quark for MXFP4 quantization.

# Model Quantization

The model was quantized from Qwen/Qwen3-Coder-Next using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations are quantized to MXFP4.

**Quantization script:**

Note that `qwen3_next` is not in the built-in model template list of Quark V0.11, so the template has to be registered before quantization.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from quark.torch import LLMTemplate, ModelQuantizer, export_safetensors
from quark.contrib.llm_eval import ppl_eval

# Configuration
ckpt_path = "Qwen/Qwen3-Coder-Next"
output_dir = "amd/Qwen3-Coder-Next-MXFP4"
quant_scheme = "mxfp4"
# Layers excluded from quantization (kept in the original precision)
exclude_layers = [
    "lm_head",
    "*linear_attn.in_proj_ba",
    "*linear_attn.in_proj_qkvz",
    "*mlp.gate",
    "*mlp.shared_expert_gate",
    "*self_attn.k_proj",
    "*self_attn.q_proj",
    "*self_attn.v_proj",
]

# Register the qwen3_next template (not built into Quark V0.11)
qwen3_next_template = LLMTemplate(
    model_type="qwen3_next",
    kv_layers_name=["*qkvz"],
    q_layer_name="*qkvz",
    exclude_layers_name=exclude_layers,
)
LLMTemplate.register_template(qwen3_next_template)

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(ckpt_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype="auto", device_map="auto")
model.eval()

# Build the quantization config from the registered template
template = LLMTemplate.get(model.config.model_type)
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)

# Quantize and freeze the model
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)
model = quantizer.freeze(model)

# Export the quantized checkpoint in safetensors (Quark custom mode)
export_safetensors(model, output_dir, custom_mode="quark")
tokenizer.save_pretrained(output_dir)

# Evaluate perplexity on WikiText-2 (optional)
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
testenc = tokenizer("\n\n".join(testdata["text"]), return_tensors="pt")
ppl = ppl_eval(model, testenc, model.device)
print(f"Perplexity: {ppl.item()}")
```

# Deployment

## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
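The following is a minimal offline-inference sketch using vLLM's Python API. The repository ID is taken from the quantization script above, while the tensor-parallel degree and the sampling settings are illustrative assumptions rather than part of this card.

```python
# Minimal vLLM offline-inference sketch (illustrative settings).
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Qwen3-Coder-Next-MXFP4",  # repository ID assumed from the quantization script above
    tensor_parallel_size=8,              # assumption: adjust to the number of GPUs available
    trust_remote_code=True,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Write a Python function that checks whether a string is a palindrome."]

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```

For online serving, `vllm serve amd/Qwen3-Coder-Next-MXFP4` exposes an OpenAI-compatible API endpoint instead.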
## Evaluation

The model was evaluated on the GSM8K benchmark.

### Accuracy

| Benchmark | Qwen3-Coder-Next | Qwen3-Coder-Next-MXFP4 (this model) | Recovery |
|---|---|---|---|
| GSM8K (flexible-extract) | 94.54 | 93.25 | 98.6% |
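The metric name "flexible-extract" matches the GSM8K task in [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), so a reproduction might look roughly like the sketch below; the harness, its vLLM backend, and the few-shot setting are assumptions, not details stated in this card.

```python
# Hedged reproduction sketch with lm-evaluation-harness (assumed tooling;
# the few-shot count and parallelism settings are illustrative).
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=amd/Qwen3-Coder-Next-MXFP4,tensor_parallel_size=8,trust_remote_code=True",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```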