This is a mixed-bits INT4 quantized version of Qwen/Qwen3-Coder-Next (80B MoE, 14B active parameters), generated with Intel's AutoRound.
| Layer Type | Bits | Notes |
|---|---|---|
| Expert layers (512 experts) | 4-bit | MoE expert MLPs |
| Non-expert layers (attention, gate) | 8-bit | Higher precision for quality |
| shared_expert_gate | 16-bit | Skipped (shape not divisible by 32) |
| lm_head | Original | Excluded by AutoRound |
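As a rough back-of-the-envelope check of the resulting weight footprint (the 95% expert parameter share below is an illustrative assumption, not a measured number):

```python
total_params = 80e9           # Qwen3-Coder-Next total parameter count
expert_share = 0.95           # assumption: expert MLPs dominate an MoE of this size
bytes_4bit = total_params * expert_share * 0.5        # 4-bit -> 0.5 byte/param
bytes_8bit = total_params * (1 - expert_share) * 1.0  # 8-bit -> 1 byte/param
print(f"~{(bytes_4bit + bytes_8bit) / 1e9:.0f} GB of weights")  # ~42 GB, before KV cache
```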
Important: This mixed-bits quantization requires GPUs with SM 8.9+ (Ada Lovelace or Hopper) for full kernel support. The RTX 3090 (SM 8.6) may hit kernel compatibility issues because the 8-bit non-expert layers require ConchLinearKernel.
For RTX 3090 users, consider a uniform 4-bit quantization instead.
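To check whether your card qualifies, query its compute capability with PyTorch:

```python
import torch

# Compute capability of GPU 0, e.g. 8.6 on RTX 3090, 8.9 on RTX 4090, 9.0 on H100
major, minor = torch.cuda.get_device_capability(0)
print(f"SM {major}.{minor}")
if (major, minor) < (8, 9):
    print("This mixed-bits build may lack kernel support here; "
          "consider a uniform 4-bit quantization.")
```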
Requires vLLM >= 0.15.0 with Qwen3-Next support:
```python
from vllm import LLM, SamplingParams

# Two-GPU tensor parallelism; adjust to your hardware
model = LLM(
    model="raydelossantos/Qwen3-Coder-Next-int4-mixed-AutoRound",
    tensor_parallel_size=2,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
)

prompts = ["Write a Python function to calculate fibonacci numbers"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=2048)

outputs = model.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
Alternatively, load the model with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "raydelossantos/Qwen3-Coder-Next-int4-mixed-AutoRound"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

prompt = "Write a Python function to calculate fibonacci numbers"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
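For interactive use, the same model and inputs can stream tokens as they are generated with Transformers' TextStreamer:

```python
from transformers import TextStreamer

# Reuses model, tokenizer, and inputs from the snippet above
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=512, streamer=streamer)
```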
This model was quantized using the following approach:
```python
from auto_round import AutoRound

model_name = "Qwen/Qwen3-Coder-Next"

# Build the per-layer config for mixed bits (Intel recipe)
layer_config = {}
for i in range(48):  # 48 transformer layers
    prefix = f"model.layers.{i}"
    # Attention layers -> 8-bit
    if i in [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]:  # self_attn layers
        for proj in ["q_proj", "k_proj", "v_proj", "o_proj"]:
            layer_config[f"{prefix}.self_attn.{proj}"] = {"bits": 8}
    else:  # linear_attn layers
        for proj in ["in_proj_qkvz", "in_proj_ba", "out_proj"]:
            layer_config[f"{prefix}.linear_attn.{proj}"] = {"bits": 8}
    # MoE router gate -> 8-bit
    layer_config[f"{prefix}.mlp.gate"] = {"bits": 8}
    # shared_expert_gate -> 16-bit (left unquantized)
    layer_config[f"{prefix}.mlp.shared_expert_gate"] = {"bits": 16}

autoround = AutoRound(
    model_name,
    bits=4,          # default for expert layers
    group_size=128,
    sym=True,
    iters=50,
    lr=5e-3,
    layer_config=layer_config,
    device_map="0,1,2",
    low_gpu_mem_usage=True,
)
autoround.quantize_and_save(format="auto_round", output_dir="./output")
```
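For intuition, the core idea of the cited paper (SignRound, which AutoRound builds on) is to treat each weight's rounding decision as a learnable offset in [-0.5, 0.5] updated by signed gradient descent. The sketch below is illustrative only: it uses a toy weight-reconstruction loss, whereas AutoRound actually minimizes the output error of each transformer block on calibration data.

```python
import torch

def signround_step(w, scale, v, lr=5e-3, qmin=-8, qmax=7):
    """One SignRound-style update of the rounding offset v (illustrative sketch)."""
    v = v.clone().requires_grad_(True)
    x = w / scale + v
    # Straight-through estimator: round() acts as identity for gradients
    q = torch.clamp(x + (torch.round(x) - x).detach(), qmin, qmax)
    w_q = q * scale
    # Toy objective: weight reconstruction error (the real method minimizes
    # per-block output error on calibration data instead)
    loss = ((w_q - w) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        v = v - lr * v.grad.sign()  # signed gradient descent step
    return v.clamp(-0.5, 0.5).detach()

w = torch.randn(256, 256)
scale = w.abs().max() / 7       # symmetric 4-bit scale, as in the recipe above
v = torch.zeros_like(w)
for _ in range(50):             # iters=50, matching the recipe
    v = signround_step(w, scale, v)
```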
If you find this model useful, please cite the AutoRound paper:

```bibtex
@article{cheng2023optimize,
  title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of {LLMs}},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```
License: Apache 2.0 (follows the base model's license)
Base model: Qwen/Qwen3-Coder-Next