Commit be71dc0 by raydelossantos (verified) · 1 parent: a87cd2c

Add model card

---
license: apache-2.0
base_model: Qwen/Qwen3-Coder-Next
tags:
- qwen3_next
- 4-bit precision
- auto-round
- code
- transformers
- safetensors
- conversational
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-Coder-Next INT4 Mixed-Bits (AutoRound)

## Model Details

This is a **mixed-bits INT4 quantized** version of [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) (80B MoE, 14B active parameters), generated using [Intel AutoRound](https://github.com/intel/auto-round).

### Quantization Strategy (Intel MoE Recipe)

| Layer Type | Bits | Notes |
|------------|------|-------|
| Expert layers (512 experts) | 4-bit | MoE expert MLPs |
| Non-expert layers (attention, gate) | 8-bit | Higher precision for quality |
| shared_expert_gate | 16-bit | Skipped (shape not divisible by 32) |
| lm_head | Original | Excluded by AutoRound |

- **Group size**: 128
- **Symmetric**: Yes
- **Tuning**: iters=50, GPU-accelerated with SignRound optimization

### Model Size

- **Original BF16**: ~160GB
- **Quantized**: ~41GB
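The ~41GB figure is consistent with back-of-the-envelope arithmetic for this recipe. A rough sketch, assuming most of the 80B parameters sit in the 4-bit expert layers (the exact expert/non-expert split below is an illustrative assumption, not a measured value) and counting one fp16 scale per group of 128 weights:

```python
# Back-of-the-envelope size estimate for the mixed-bits recipe.
# Assumption: roughly 76B of the 80B parameters live in 4-bit expert
# layers; the remainder is quantized to 8-bit. group_size=128 adds one
# fp16 scale (16 bits) amortized over every 128 weights.
total_params = 80e9
expert_params = 76e9                     # assumed split, illustration only
other_params = total_params - expert_params

bits_per_expert_weight = 4 + 16 / 128    # 4-bit weight + amortized scale
bits_per_other_weight = 8 + 16 / 128     # 8-bit weight + amortized scale

size_bytes = (expert_params * bits_per_expert_weight
              + other_params * bits_per_other_weight) / 8
print(f"~{size_bytes / 1e9:.0f} GB")     # lands in the low 40s, near ~41GB
```

Per-group metadata is a small overhead here: at group_size=128 the fp16 scales add only about 0.125 bits per weight.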

### Hardware Requirements

> **Important**: This mixed-bits quantization requires GPUs with **SM 8.9+** (Ada Lovelace or Hopper) for full kernel support. The RTX 3090 (SM 8.6) may hit kernel-compatibility issues because the 8-bit non-expert layers require ConchLinearKernel.

- **Minimum VRAM**: ~48GB total (2x RTX 4090 recommended)
- **Tensor Parallel**: TP=2 (the 16 attention heads divide evenly by 2)

For RTX 3090 users, consider using the [uniform 4-bit quantization](https://huggingface.co/raydelossantos/Qwen3-Coder-Next-int4-uniform-AutoRound) instead.
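Since kernel support is gated on compute capability, it can be worth checking your GPUs up front. A minimal sketch (the `meets_sm_requirement` helper is hypothetical, not part of any library; the commented-out loop uses the real `torch.cuda.get_device_capability` API and needs a CUDA build of PyTorch):

```python
# Check whether a GPU's compute capability meets the SM floor for the
# 8-bit kernels. Capabilities compare naturally as (major, minor) tuples.
def meets_sm_requirement(capability, required=(8, 9)):
    """capability: (major, minor), e.g. torch.cuda.get_device_capability(i)."""
    return capability >= required

# With a CUDA build of PyTorch, check every visible device:
# import torch
# for i in range(torch.cuda.device_count()):
#     cap = torch.cuda.get_device_capability(i)
#     print(i, cap, meets_sm_requirement(cap))

print(meets_sm_requirement((8, 6)))  # RTX 3090 -> False
print(meets_sm_requirement((8, 9)))  # RTX 4090 -> True
```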

## How To Use

### vLLM (Recommended)

Requires vLLM >= 0.15.0 with Qwen3-Next support:

```python
from vllm import LLM, SamplingParams

model = LLM(
    model="raydelossantos/Qwen3-Coder-Next-int4-mixed-AutoRound",
    tensor_parallel_size=2,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
)

prompts = ["Write a Python function to calculate fibonacci numbers"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=2048)
outputs = model.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "raydelossantos/Qwen3-Coder-Next-int4-mixed-AutoRound"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

prompt = "Write a Python function to calculate fibonacci numbers"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Quantization Code

This model was quantized using the following approach:

```python
from auto_round import AutoRound

model_name = "Qwen/Qwen3-Coder-Next"

# Build layer config for mixed-bits (Intel recipe)
layer_config = {}
for i in range(48):  # 48 decoder layers
    prefix = f"model.layers.{i}"

    # Attention layers -> 8-bit
    if i in [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]:  # self_attn layers
        for proj in ["q_proj", "k_proj", "v_proj", "o_proj"]:
            layer_config[f"{prefix}.self_attn.{proj}"] = {"bits": 8}
    else:  # linear_attn layers
        for proj in ["in_proj_qkvz", "in_proj_ba", "out_proj"]:
            layer_config[f"{prefix}.linear_attn.{proj}"] = {"bits": 8}

    # MLP gate -> 8-bit
    layer_config[f"{prefix}.mlp.gate"] = {"bits": 8}

    # shared_expert_gate -> 16-bit (skipped)
    layer_config[f"{prefix}.mlp.shared_expert_gate"] = {"bits": 16}

autoround = AutoRound(
    model_name,
    bits=4,  # default for experts
    group_size=128,
    sym=True,
    iters=50,
    lr=5e-3,
    layer_config=layer_config,
    device_map="0,1,2",
    low_gpu_mem_usage=True,
)

autoround.quantize_and_save(format="auto_round", output_dir="./output")
```
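As a sanity check, the `layer_config` construction runs standalone (no model load needed), and the entry counts follow directly from the recipe: 12 full-attention layers x 4 projections, 36 linear-attention layers x 3 projections, plus 48 router gates at 8-bit, and 48 `shared_expert_gate` entries at 16-bit:

```python
# Rebuild the layer_config dict from the recipe and count entries per width.
layer_config = {}
full_attn = [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]
for i in range(48):
    prefix = f"model.layers.{i}"
    if i in full_attn:
        for proj in ["q_proj", "k_proj", "v_proj", "o_proj"]:
            layer_config[f"{prefix}.self_attn.{proj}"] = {"bits": 8}
    else:
        for proj in ["in_proj_qkvz", "in_proj_ba", "out_proj"]:
            layer_config[f"{prefix}.linear_attn.{proj}"] = {"bits": 8}
    layer_config[f"{prefix}.mlp.gate"] = {"bits": 8}
    layer_config[f"{prefix}.mlp.shared_expert_gate"] = {"bits": 16}

eight_bit = sum(v["bits"] == 8 for v in layer_config.values())
sixteen_bit = sum(v["bits"] == 16 for v in layer_config.values())
print(eight_bit, sixteen_bit)  # 204 eight-bit entries (48 + 108 + 48), 48 sixteen-bit
```

Everything not named in `layer_config` (i.e. the expert MLPs) falls through to the `bits=4` default passed to `AutoRound`.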

## Acknowledgments

- **Base Model**: [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) by the Qwen Team
- **Quantization**: [Intel AutoRound](https://github.com/intel/auto-round)
- **Reference**: [Intel/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound](https://huggingface.co/Intel/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound)

## Citation

```bibtex
@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```

## License

Apache 2.0 (follows the base model's license)