Qwen3-Coder-Next INT4 Mixed-Bits (AutoRound)
Model Details
This is a mixed-bits INT4 quantization of Qwen/Qwen3-Coder-Next (80B-parameter MoE, 3B active parameters per token), produced with Intel AutoRound.
Quantization Strategy (Intel MoE Recipe)
| Layer Type | Bits | Notes |
|---|---|---|
| Expert layers (512 experts) | 4-bit | MoE expert MLPs |
| Non-expert layers (attention, gate) | 8-bit | Higher precision for quality |
| shared_expert_gate | 16-bit | Skipped (shape not divisible by 32) |
| lm_head | Original | Excluded by AutoRound |
- Group size: 128
- Symmetric: Yes
- Tuning: iters=50, GPU-accelerated with SignRound optimization
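To make "group size 128, symmetric" concrete, here is a minimal pure-Python sketch of symmetric group-wise quantization using plain round-to-nearest. It is an illustration only: AutoRound tunes the rounding with SignRound rather than rounding to nearest, and its exact scale convention may differ. The function names are hypothetical and are not part of the AutoRound API.

```python
def quantize_group_sym(weights, bits=4):
    """Round-to-nearest symmetric quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    # Symmetric: no zero point, one scale maps the largest magnitude to qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def quantize_row(row, bits=4, group_size=128):
    """Quantize a weight row group by group; each group gets its own scale."""
    return [
        quantize_group_sym(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


# Toy data: 256 deterministic pseudo-weights in [-0.5, 0.5].
row = [((i * 37) % 101 - 50) / 100 for i in range(256)]
groups = quantize_row(row)
print(len(groups))  # 2 groups of 128 weights each
```

Each group stores 128 small integers plus one scale, which is where the per-group overhead in the quantized checkpoint comes from.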
Model Size
- Original BF16: ~160GB
- Quantized: ~41GB
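The two sizes above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below rests on simplifying assumptions that are not from the checkpoint itself: all 80B parameters are treated as 4-bit, each group of 128 weights carries one 16-bit scale, and the 8-bit non-expert layers, the 16-bit `shared_expert_gate`, the unquantized `lm_head`, and any packed zero points are ignored. The helper name is hypothetical.

```python
def estimate_size_gb(n_params, bits, group_size=None, scale_bits=16):
    """Rough checkpoint size in GB (1 GB = 1e9 bytes).

    If group_size is given, add the amortized cost of one scale per group.
    """
    bits_per_weight = bits
    if group_size is not None:
        bits_per_weight += scale_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 80e9  # total parameter count stated in Model Details

bf16_gb = estimate_size_gb(N_PARAMS, bits=16)
int4_gb = estimate_size_gb(N_PARAMS, bits=4, group_size=128)

print(f"BF16 estimate: {bf16_gb:.0f} GB")
print(f"INT4 (group size 128) estimate: {int4_gb:.2f} GB")
```

Under these assumptions the estimates come out at 160 GB and roughly 41 GB, in line with the figures listed above; the real file sizes depend on the exact layer mix and packing format.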
Hardware Requirements
Important: This mixed-bits quantization targets GPUs with compute capability SM 8.9 or higher (Ada Lovelace, Hopper) for full kernel support. An RTX 3090 (SM 8.6) may hit kernel compatibility issues, because the 8-bit non-expert layers require ConchLinearKernel.
- Minimum VRAM: ~48GB (2x RTX 4090 recommended)
- Tensor Parallel: TP=2 (16 attention heads divisible by 2)
For RTX 3090 users, consider using uniform 4-bit quantization instead.
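The requirements above can be turned into a small pre-flight check. This is a sketch with hypothetical names, not part of vLLM. The capability threshold is passed in by the caller; the example uses (8, 9), the Ada Lovelace capability of the recommended RTX 4090, which is an assumption about where the cutoff lies. GPU facts are supplied as plain tuples so the snippet runs without a GPU; with PyTorch installed, `torch.cuda.get_device_capability(i)` and `torch.cuda.get_device_properties(i).total_memory` could provide real values.

```python
def check_setup(gpus, min_capability, min_total_vram_gb, num_heads, tp_size):
    """Return a list of problems; an empty list means the setup looks fine.

    gpus: list of (name, (major, minor), vram_gb) tuples.
    """
    problems = []
    if len(gpus) < tp_size:
        problems.append(f"need {tp_size} GPUs for TP={tp_size}, found {len(gpus)}")
    for name, capability, _vram in gpus:
        # Tuples compare element-wise, so (8, 6) < (8, 9) < (9, 0).
        if tuple(capability) < tuple(min_capability):
            problems.append(
                f"{name}: SM {capability[0]}.{capability[1]} is below "
                f"SM {min_capability[0]}.{min_capability[1]}"
            )
    total_vram = sum(vram for _name, _cap, vram in gpus)
    if total_vram < min_total_vram_gb:
        problems.append(f"total VRAM {total_vram} GB is below {min_total_vram_gb} GB")
    if num_heads % tp_size != 0:
        problems.append(f"{num_heads} attention heads are not divisible by TP={tp_size}")
    return problems


ada = [("RTX 4090", (8, 9), 24), ("RTX 4090", (8, 9), 24)]
ampere = [("RTX 3090", (8, 6), 24), ("RTX 3090", (8, 6), 24)]

print(check_setup(ada, (8, 9), 48, num_heads=16, tp_size=2))  # []
for problem in check_setup(ampere, (8, 9), 48, num_heads=16, tp_size=2):
    print(problem)
```

A passing check only covers the constraints listed in this section; it does not guarantee that a given vLLM build ships the required kernels.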
How To Use
vLLM (Recommended)
Requires vLLM >= 0.15.0 with Qwen3-Next support:
from vllm import LLM, SamplingParams

model = LLM(
    model="raydelossantos/Qwen3-Coder-Next-int4-mixed-AutoRound",
    tensor_parallel_size=2,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
)

prompts = ["Write a Python function to calculate fibonacci numbers"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=2048)
outputs = model.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
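The model can also be served over vLLM's OpenAI-compatible HTTP API instead of being loaded in-process. The sketch below assumes a server was started separately with `vllm serve "raydelossantos/Qwen3-Coder-Next-int4-mixed-AutoRound" --tensor-parallel-size 2` and is listening on the default `http://localhost:8000`. The helper names `build_chat_request` and `chat` are hypothetical, and only the Python standard library is used.

```python
import json
import urllib.request

MODEL = "raydelossantos/Qwen3-Coder-Next-int4-mixed-AutoRound"


def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=512):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Write a Python function to calculate fibonacci numbers")` returns the generated reply as a string. Unlike `LLM.generate` on a raw prompt, the chat endpoint applies the model's chat template on the server side.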
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "raydelossantos/Qwen3-Coder-Next-int4-mixed-AutoRound"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

prompt = "Write a Python function to calculate fibonacci numbers"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Quantization Code
This model was quantized using the following approach:
from auto_round import AutoRound

model_name = "Qwen/Qwen3-Coder-Next"

# Build layer config for mixed-bits (Intel recipe)
layer_config = {}
for i in range(48):  # 48 layers
    prefix = f"model.layers.{i}"

    # Attention layers -> 8-bit
    if i in [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]:  # self_attn layers
        for proj in ["q_proj", "k_proj", "v_proj", "o_proj"]:
            layer_config[f"{prefix}.self_attn.{proj}"] = {"bits": 8}
    else:  # linear_attn layers
        for proj in ["in_proj_qkvz", "in_proj_ba", "out_proj"]:
            layer_config[f"{prefix}.linear_attn.{proj}"] = {"bits": 8}

    # MLP gate -> 8-bit
    layer_config[f"{prefix}.mlp.gate"] = {"bits": 8}

    # shared_expert_gate -> 16-bit (skipped)
    layer_config[f"{prefix}.mlp.shared_expert_gate"] = {"bits": 16}

autoround = AutoRound(
    model_name,
    bits=4,  # Default for experts
    group_size=128,
    sym=True,
    iters=50,
    lr=5e-3,
    layer_config=layer_config,
    device_map="0,1,2",
    low_gpu_mem_usage=True,
)
autoround.quantize_and_save(format="auto_round", output_dir="./output")
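The hard-coded list of self_attn layer indices follows a regular pattern: every fourth layer (3, 7, ..., 47). As a sketch, the same `layer_config` can be built from that pattern and checked without loading the model. The names `build_layer_config` and `FULL_ATTN_INTERVAL` are hypothetical, and the interval of 4 is inferred from the list above rather than read from the model config.

```python
NUM_LAYERS = 48
FULL_ATTN_INTERVAL = 4  # assumption: inferred from the index list above


def build_layer_config(num_layers=NUM_LAYERS, interval=FULL_ATTN_INTERVAL):
    """Build the mixed-bits layer_config without hard-coding layer indices."""
    layer_config = {}
    for i in range(num_layers):
        prefix = f"model.layers.{i}"
        if (i + 1) % interval == 0:  # layers 3, 7, ..., 47 use self_attn
            attn, projs = "self_attn", ["q_proj", "k_proj", "v_proj", "o_proj"]
        else:  # all other layers use linear_attn
            attn, projs = "linear_attn", ["in_proj_qkvz", "in_proj_ba", "out_proj"]
        for proj in projs:
            layer_config[f"{prefix}.{attn}.{proj}"] = {"bits": 8}
        layer_config[f"{prefix}.mlp.gate"] = {"bits": 8}
        layer_config[f"{prefix}.mlp.shared_expert_gate"] = {"bits": 16}
    return layer_config


config = build_layer_config()
full_attn_layers = [i for i in range(NUM_LAYERS) if (i + 1) % FULL_ATTN_INTERVAL == 0]

print(full_attn_layers)  # [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]
print(len(config))       # 252 = 12*4 + 36*3 + 48 + 48
```

Expert MLPs do not appear in the dictionary at all; they fall back to the default `bits=4` passed to `AutoRound`.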
Acknowledgments
- Base Model: Qwen/Qwen3-Coder-Next by Qwen Team
- Quantization: Intel AutoRound
- Reference: Intel/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound
Citation
@article{cheng2023optimize,
title={Optimize weight rounding via signed gradient descent for the quantization of llms},
author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
journal={arXiv preprint arXiv:2309.05516},
year={2023}
}
License
Apache 2.0 (follows base model license)