---
license: apache-2.0
---

# Model Overview

- **Model Architecture:** qwen3_next
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.1.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11)
- **MoE Quantization:**
  - **Weight quantization:** MoE-only, OCP MXFP4, static
  - **Activation quantization:** MoE-only, OCP MXFP4, dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the Qwen3-Coder-Next model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) MXFP4 quantization.

# Model Quantization

The model was quantized from [Qwen/Qwen3-Coder-Next]() using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations of the MoE layers are quantized to MXFP4.

**Quantization scripts:**

Note that `qwen3_next` is not in the built-in model template list of Quark V0.11, so it must be registered before quantization.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from datasets import load_dataset
from quark.torch import LLMTemplate, ModelQuantizer, export_safetensors
from quark.contrib.llm_eval import ppl_eval

# Register the qwen3_next template (not built into Quark V0.11)
qwen3_next_template = LLMTemplate(
    model_type="qwen3_next",
    kv_layers_name=["*qkvz"],
    q_layer_name="*qkvz",
    exclude_layers_name=[
        "lm_head", "*linear_attn.in_proj_ba", "*linear_attn.in_proj_qkvz",
        "*mlp.gate", "*mlp.shared_expert_gate", "*self_attn.k_proj",
        "*self_attn.q_proj", "*self_attn.v_proj",
    ],
)
LLMTemplate.register_template(qwen3_next_template)

# Configuration
ckpt_path = "Qwen/Qwen3-Coder-Next"
output_dir = "amd/Qwen3-Coder-Next-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = [
    "lm_head", "*linear_attn.in_proj_ba", "*linear_attn.in_proj_qkvz",
    "*mlp.gate", "*mlp.shared_expert_gate", "*self_attn.k_proj",
    "*self_attn.q_proj", "*self_attn.v_proj",
]

# Load model, tokenizer, and processor
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype="auto", device_map="auto")
model.eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(ckpt_path, trust_remote_code=True)

# Get the quantization config from the registered template
template = LLMTemplate.get(model.config.model_type)
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)

# Quantize
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)
model = quantizer.freeze(model)

# Export in Hugging Face safetensors format
export_safetensors(model, output_dir, custom_mode="quark")
tokenizer.save_pretrained(output_dir)
processor.save_pretrained(output_dir)

# Evaluate perplexity on WikiText-2
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
testenc = tokenizer("\n\n".join(testdata["text"]), return_tensors="pt")
ppl = ppl_eval(model, testenc, model.device)
print(f"Perplexity: {ppl.item()}")
```
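
For intuition, and independently of the Quark API above: OCP MXFP4 stores each block of 32 values as FP4 (E2M1) elements that share one power-of-two (E8M0) scale. The NumPy sketch below illustrates that rounding under one common scale-selection convention; `mxfp4_quantize_block` is an illustrative fake-quantizer, not Quark's implementation.

```python
import numpy as np

# FP4 (E2M1) representable magnitudes, per the OCP Microscaling (MX) spec
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Fake-quantize one block of 32 values to MXFP4: a shared power-of-two
    (E8M0) scale plus FP4 (E2M1) elements. Returns the dequantized values."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    # One common convention: choose the power-of-two scale that maps the
    # block's max magnitude near FP4's largest representable value (6.0)
    scale = 2.0 ** np.floor(np.log2(amax / FP4_GRID[-1]))
    scaled = block / scale
    # Round each element to the nearest FP4 magnitude, keeping the sign
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

x = np.random.randn(32)
xq = mxfp4_quantize_block(x)  # every entry has the form ±g * scale, g in FP4_GRID
```

Because the scale is a single power of two per 32-element block, storage is roughly 4.25 bits per value, which is where MXFP4's memory savings come from.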

# Deployment

## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

# Evaluation

The model was evaluated on the GSM8K benchmark.

## Accuracy

| Benchmark | Qwen3-Coder-Next | Qwen3-Coder-Next-MXFP4 (this model) | Recovery |
|---|---|---|---|
| GSM8K (strict-match) | 94.69 | 93.18 | 98.41% |

## Reproduction

The GSM8K results were obtained with the `lm-evaluation-harness` framework, based on the Docker image `vllm/vllm-openai-rocm:v0.14.0`.

First install vLLM (commit `ecb4f822091a64b5084b3a4aff326906487a363f`) and lm-eval (version 0.4.10) inside the container:

```shell
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout ecb4f822091a64b5084b3a4aff326906487a363f
python3 setup.py develop

pip install lm-eval==0.4.10
```

### Launching the server

```shell
MODEL=amd/Qwen3-Coder-Next-MXFP4
SAFETENSORS_FAST_GPU=1 \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
vllm serve $MODEL \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
```
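
Once the server is up, any OpenAI-compatible client can query it. A minimal standard-library sketch is shown below; the prompt and sampling values are arbitrary examples, and the actual request is commented out so it only runs against a live server:

```python
import json
import urllib.request

# Build an OpenAI-style completions request for the server launched above;
# the prompt, max_tokens, and temperature are illustrative values only
payload = {
    "model": "amd/Qwen3-Coder-Next-MXFP4",
    "prompt": "Write a Python function that reverses a string.",
    "max_tokens": 256,
    "temperature": 0.0,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server from the previous step is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```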

### Evaluating the model in a new terminal

```shell
lm_eval \
  --model local-completions \
  --model_args "model=amd/Qwen3-Coder-Next-MXFP4,base_url=http://localhost:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048,tokenized_requests=False,tokenizer_backend=None" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto
```

# License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.