Instructions to use btbtyler09/Qwen3-Coder-Next-GPTQ-4bit with libraries, inference providers, and local apps.
- Libraries
- Transformers
How to use btbtyler09/Qwen3-Coder-Next-GPTQ-4bit with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="btbtyler09/Qwen3-Coder-Next-GPTQ-4bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("btbtyler09/Qwen3-Coder-Next-GPTQ-4bit")
model = AutoModelForCausalLM.from_pretrained("btbtyler09/Qwen3-Coder-Next-GPTQ-4bit")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Local Apps
- vLLM
How to use btbtyler09/Qwen3-Coder-Next-GPTQ-4bit with vLLM:
Install from pip and serve the model:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "btbtyler09/Qwen3-Coder-Next-GPTQ-4bit"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "btbtyler09/Qwen3-Coder-Next-GPTQ-4bit",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
- SGLang
How to use btbtyler09/Qwen3-Coder-Next-GPTQ-4bit with SGLang:
Install from pip and serve the model:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "btbtyler09/Qwen3-Coder-Next-GPTQ-4bit" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "btbtyler09/Qwen3-Coder-Next-GPTQ-4bit",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "btbtyler09/Qwen3-Coder-Next-GPTQ-4bit" \
  --host 0.0.0.0 \
  --port 30000
```

- Docker Model Runner
How to use btbtyler09/Qwen3-Coder-Next-GPTQ-4bit with Docker Model Runner:
```shell
docker model run hf.co/btbtyler09/Qwen3-Coder-Next-GPTQ-4bit
```
Qwen3-Coder-Next GPTQ 4-bit
GPTQ 4-bit quantization of Qwen/Qwen3-Coder-Next, an 80B-parameter Mixture-of-Experts (MoE) coding model with 3B activated parameters per token.
Model Overview
- Architecture: Qwen3NextForCausalLM (hybrid linear + full attention with DeltaNet)
- Total parameters: ~80B
- Activated parameters: ~3B per token (10 of 512 experts selected per token)
- Layers: 48 (36 linear attention + 12 full attention, repeating 3:1 pattern)
- Experts: 512 per layer + 1 shared expert per layer
- Context length: 262,144 tokens
- Supports: Tool calling, code generation, general chat
Quantization Details
All 73,728 MoE expert modules (512 experts x 3 projections x 48 layers) are quantized to INT4 using GPTQ. Non-expert modules remain at FP16 for quality preservation.
| Component | Precision | Notes |
|---|---|---|
| MoE experts (gate_proj, up_proj, down_proj) | INT4 (GPTQ) | 73,728 modules quantized |
| Attention (q_proj, k_proj, v_proj, o_proj) | FP16 | Full precision |
| Linear attention (in_proj_qkvz, out_proj, in_proj_ba) | FP16 | Full precision |
| Shared experts | FP16 | Full precision |
| Embeddings, LM head, norms | FP16 | Full precision |
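The quantized-module count is pure architecture arithmetic and can be verified directly:

```python
# Sanity-check the 73,728 figure: 512 routed experts per layer,
# 3 projections per expert (gate_proj, up_proj, down_proj), 48 layers.
experts_per_layer = 512
projections_per_expert = 3
layers = 48

quantized_modules = experts_per_layer * projections_per_expert * layers
print(quantized_modules)  # 73728
```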
GPTQ configuration:
- Bits: 4
- Group size: 32
- Symmetric: Yes
- desc_act: No
- true_sequential: Yes
- Failsafe: RTN fallback for poorly calibrated rare experts (7,650 of 73,728 modules, ~10.4%)
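The RTN (round-to-nearest) failsafe trades GPTQ's error compensation for robustness: each weight group is simply scaled and rounded. A minimal NumPy sketch of symmetric, group-wise INT4 RTN at group size 32, for illustration only (not the GPTQModel implementation):

```python
import numpy as np

def rtn_quantize(w: np.ndarray, group_size: int = 32, bits: int = 4):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for symmetric INT4
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero groups
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def rtn_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct FP weights from INT4 codes and per-group scales."""
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = rtn_quantize(w)
err = np.abs(w - rtn_dequantize(q, s)).max()
print(f"max abs reconstruction error: {err:.4f}")
```

Per group, the worst-case error is half a quantization step (scale / 2), which is why RTN is an acceptable fallback for experts whose calibration statistics are too sparse for GPTQ.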
Calibration
- Dataset: Mixed - evol-codealpaca-v1 (code) + C4 (general text)
- Samples: 2,048 with context length binning (uniform distribution across 256-2048 token bins)
- Quantizer: GPTQModel v5.7.0
See quantize.py for the full quantization script.
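Context-length binning keeps the calibration set from skewing toward short samples. A hedged sketch of the idea follows; the bin count and sampling details are illustrative assumptions, not the exact logic of quantize.py:

```python
import random

def bin_samples(samples, n_bins=7, lo=256, hi=2048):
    """Group samples by token count into equal-width bins over [lo, hi]."""
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for s in samples:
        n = len(s)  # token count of this sample
        if lo <= n <= hi:
            idx = min(int((n - lo) / width), n_bins - 1)
            bins[idx].append(s)
    return bins

def draw_uniform(bins, total):
    """Draw roughly `total` samples, spread evenly across the bins."""
    per_bin = total // len(bins)
    out = []
    for b in bins:
        out.extend(random.sample(b, min(per_bin, len(b))))
    return out

# Toy corpus: 500 "tokenized" samples of random length.
random.seed(0)
corpus = [[0] * random.randint(256, 2048) for _ in range(500)]
picked = draw_uniform(bin_samples(corpus), 70)
```

Without the binning step, a length-skewed corpus would over-represent short contexts and leave the long-context behavior of the quantized experts poorly calibrated.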
Model Size
| Version | Size | Compression |
|---|---|---|
| BF16 (original) | ~160 GB | - |
| GPTQ 4-bit | 47 GB | 3.4x |
Perplexity
Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, stride=512:
| Model | Perplexity | Degradation |
|---|---|---|
| BF16 (original) | 6.9401 | - |
| GPTQ 4-bit | 6.9956 | +0.8% |
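The strided evaluation is the standard sliding-window scheme: each window may see up to seq_len tokens of context, but only tokens not scored by a previous window contribute to the loss. A toy sketch with a hypothetical uniform "model" (a uniform distribution over V tokens has perplexity exactly V, which makes the windowing easy to check):

```python
import math

def strided_perplexity(n_tokens, seq_len, stride, token_logprob):
    """Sliding-window perplexity: each window scores only its new tokens."""
    nll, counted, prev_end = 0.0, 0, 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        new = end - prev_end                       # tokens not yet scored
        nll += sum(-token_logprob(i) for i in range(prev_end, end))
        counted += new
        prev_end = end
        if end == n_tokens:
            break
    return math.exp(nll / counted)

# Uniform toy model over a 50-token vocabulary: log p = -log 50 everywhere,
# so the result must be 50 regardless of seq_len/stride choices.
ppl = strided_perplexity(10_000, seq_len=2048, stride=512,
                         token_logprob=lambda i: -math.log(50))
print(round(ppl, 2))  # 50.0
```

In the real evaluation the log-probabilities come from the model's forward pass over each window; the toy callable here just makes the bookkeeping verifiable.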
Usage
vLLM (Recommended)
```shell
vllm serve btbtyler09/Qwen3-Coder-Next-GPTQ-4bit \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --quantization gptq \
  --max-model-len 32768
```
Tool Calling
This model supports tool calling via the Qwen3-Coder chat template. The quantized model includes:
- `chat_template.jinja` - Chat template with tool support
- `qwen3coder_tool_parser_vllm.py` - vLLM tool parser plugin
- `qwen3_coder_detector_sgl.py` - SGLang tool detector
For vLLM tool calling:
```shell
vllm serve btbtyler09/Qwen3-Coder-Next-GPTQ-4bit \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --dtype float16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
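With the server running, tool calls use the standard OpenAI request shape. This sketch only constructs the JSON payload; the `get_weather` tool is an illustrative assumption, and the payload would be POSTed to the server's `/v1/chat/completions` endpoint:

```python
import json

# Illustrative OpenAI-compatible tool-calling payload.
# "get_weather" is a made-up example tool, not part of the model.
payload = {
    "model": "btbtyler09/Qwen3-Coder-Next-GPTQ-4bit",
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
print(json.dumps(payload, indent=2))
```

When the model decides to call the tool, the parser named by `--tool-call-parser` converts its output into the structured `tool_calls` field of the response.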
Credits
- Base Model: Qwen - Qwen3-Coder-Next
- Quantization: GPTQ via GPTQModel v5.7.0
- Quantized by: btbtyler09
License
This model inherits the Apache 2.0 license from the base model.