Commit be71dc0 by raydelossantos (verified) · 1 parent: a87cd2c

Add model card

---
license: apache-2.0
base_model: Qwen/Qwen3-Coder-Next
tags:
- qwen3_next
- 4-bit precision
- auto-round
- code
- transformers
- safetensors
- conversational
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-Coder-Next INT4 Mixed-Bits (AutoRound)

## Model Details

This is a **mixed-bits INT4 quantized** version of [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) (80B MoE, 14B active parameters), generated using [Intel AutoRound](https://github.com/intel/auto-round).

### Quantization Strategy (Intel MoE Recipe)

| Layer Type | Bits | Notes |
|------------|------|-------|
| Expert layers (512 experts) | 4-bit | MoE expert MLPs |
| Non-expert layers (attention, gate) | 8-bit | Higher precision for quality |
| shared_expert_gate | 16-bit | Skipped (shape not divisible by 32) |
| lm_head | Original | Excluded by AutoRound |

- **Group size**: 128
- **Symmetric**: Yes
- **Tuning**: iters=50, GPU-accelerated with SignRound optimization

### Model Size

- **Original BF16**: ~160GB
- **Quantized**: ~41GB
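The ~41GB figure is consistent with back-of-the-envelope arithmetic for this recipe. A rough sketch, assuming most of the 80B parameters sit in the 4-bit expert layers (the exact expert/non-expert split below is an illustrative assumption, not a measured value) and counting one fp16 scale per group of 128 weights:

```python
# Back-of-the-envelope size estimate for the mixed-bits recipe.
# Assumption: roughly 76B of the 80B parameters live in 4-bit expert
# layers; the remainder is quantized to 8-bit. group_size=128 adds one
# fp16 scale (16 bits) amortized over every 128 weights.
total_params = 80e9
expert_params = 76e9                     # assumed split, illustration only
other_params = total_params - expert_params

bits_per_expert_weight = 4 + 16 / 128    # 4-bit weight + amortized scale
bits_per_other_weight = 8 + 16 / 128     # 8-bit weight + amortized scale

size_bytes = (expert_params * bits_per_expert_weight
              + other_params * bits_per_other_weight) / 8
print(f"~{size_bytes / 1e9:.0f} GB")     # lands in the low 40s, near ~41GB
```

Per-group metadata is a small overhead here: at group_size=128 the fp16 scales add only about 0.125 bits per weight.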

### Hardware Requirements

> **Important**: This mixed-bits quantization requires GPUs with **SM 8.9+** (Ada Lovelace or Hopper) for full kernel support. The RTX 3090 (SM 8.6) may hit kernel-compatibility issues because the 8-bit non-expert layers require ConchLinearKernel.

- **Minimum VRAM**: ~48GB total (2x RTX 4090 recommended)
- **Tensor Parallel**: TP=2 (the 16 attention heads divide evenly by 2)

For RTX 3090 users, consider using the [uniform 4-bit quantization](https://huggingface.co/raydelossantos/Qwen3-Coder-Next-int4-uniform-AutoRound) instead.
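Since kernel support is gated on compute capability, it can be worth checking your GPUs up front. A minimal sketch (the `meets_sm_requirement` helper is hypothetical, not part of any library; the commented-out loop uses the real `torch.cuda.get_device_capability` API and needs a CUDA build of PyTorch):

```python
# Check whether a GPU's compute capability meets the SM floor for the
# 8-bit kernels. Capabilities compare naturally as (major, minor) tuples.
def meets_sm_requirement(capability, required=(8, 9)):
    """capability: (major, minor), e.g. torch.cuda.get_device_capability(i)."""
    return capability >= required

# With a CUDA build of PyTorch, check every visible device:
# import torch
# for i in range(torch.cuda.device_count()):
#     cap = torch.cuda.get_device_capability(i)
#     print(i, cap, meets_sm_requirement(cap))

print(meets_sm_requirement((8, 6)))  # RTX 3090 -> False
print(meets_sm_requirement((8, 9)))  # RTX 4090 -> True
```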

## How To Use

### vLLM (Recommended)

Requires vLLM >= 0.15.0 with Qwen3-Next support:

```python
from vllm import LLM, SamplingParams

model = LLM(
    model="raydelossantos/Qwen3-Coder-Next-int4-mixed-AutoRound",
    tensor_parallel_size=2,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
)

prompts = ["Write a Python function to calculate fibonacci numbers"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=2048)
outputs = model.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "raydelossantos/Qwen3-Coder-Next-int4-mixed-AutoRound"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

prompt = "Write a Python function to calculate fibonacci numbers"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Quantization Code

This model was quantized using the following approach:

```python
from auto_round import AutoRound

model_name = "Qwen/Qwen3-Coder-Next"

# Build layer config for mixed-bits (Intel recipe)
layer_config = {}
for i in range(48):  # 48 decoder layers
    prefix = f"model.layers.{i}"

    # Attention layers -> 8-bit
    if i in [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]:  # self_attn layers
        for proj in ["q_proj", "k_proj", "v_proj", "o_proj"]:
            layer_config[f"{prefix}.self_attn.{proj}"] = {"bits": 8}
    else:  # linear_attn layers
        for proj in ["in_proj_qkvz", "in_proj_ba", "out_proj"]:
            layer_config[f"{prefix}.linear_attn.{proj}"] = {"bits": 8}

    # MLP gate -> 8-bit
    layer_config[f"{prefix}.mlp.gate"] = {"bits": 8}

    # shared_expert_gate -> 16-bit (skipped)
    layer_config[f"{prefix}.mlp.shared_expert_gate"] = {"bits": 16}

autoround = AutoRound(
    model_name,
    bits=4,  # default for experts
    group_size=128,
    sym=True,
    iters=50,
    lr=5e-3,
    layer_config=layer_config,
    device_map="0,1,2",
    low_gpu_mem_usage=True,
)

autoround.quantize_and_save(format="auto_round", output_dir="./output")
```
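As a sanity check, the `layer_config` construction runs standalone (no model load needed), and the entry counts follow directly from the recipe: 12 full-attention layers x 4 projections, 36 linear-attention layers x 3 projections, plus 48 router gates at 8-bit, and 48 `shared_expert_gate` entries at 16-bit:

```python
# Rebuild the layer_config dict from the recipe and count entries per width.
layer_config = {}
full_attn = [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]
for i in range(48):
    prefix = f"model.layers.{i}"
    if i in full_attn:
        for proj in ["q_proj", "k_proj", "v_proj", "o_proj"]:
            layer_config[f"{prefix}.self_attn.{proj}"] = {"bits": 8}
    else:
        for proj in ["in_proj_qkvz", "in_proj_ba", "out_proj"]:
            layer_config[f"{prefix}.linear_attn.{proj}"] = {"bits": 8}
    layer_config[f"{prefix}.mlp.gate"] = {"bits": 8}
    layer_config[f"{prefix}.mlp.shared_expert_gate"] = {"bits": 16}

eight_bit = sum(v["bits"] == 8 for v in layer_config.values())
sixteen_bit = sum(v["bits"] == 16 for v in layer_config.values())
print(eight_bit, sixteen_bit)  # 204 eight-bit entries (48 + 108 + 48), 48 sixteen-bit
```

Everything not named in `layer_config` (i.e. the expert MLPs) falls through to the `bits=4` default passed to `AutoRound`.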

## Acknowledgments

- **Base Model**: [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) by the Qwen Team
- **Quantization**: [Intel AutoRound](https://github.com/intel/auto-round)
- **Reference**: [Intel/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound](https://huggingface.co/Intel/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound)

## Citation

```bibtex
@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```

## License

Apache 2.0 (follows the base model's license)