# Qwen3-Coder-Next 80B – APEX I-Quality GGUF
First APEX I-Quality quantization of Qwen3-Coder-Next 80B, calibrated on a code corpus.
This is an APEX I-Quality quantization of Qwen/Qwen3-Coder-Next, an 80B-parameter Mixture-of-Experts model with only ~3B active parameters per token, designed specifically for coding agents and local development.
## What Makes This Different

- APEX I-Quality profile – the highest quality tier in the APEX quantization framework, using per-tensor type optimization for MoE architectures
- Code-calibrated imatrix – importance matrix generated from 50,575 code samples (not Wikipedia). The imatrix tells the quantizer which weights matter most for code generation, syntax, tool calling, and agent workloads
- Production tested – this exact model runs in production powering PicoClaw coding agents on AMD Ryzen AI Max+ 395 hardware
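To make the imatrix bullet concrete, here is a toy sketch of importance-weighted quantization. This is NOT llama.cpp's actual imatrix algorithm, just an illustration of the idea: when picking a quantization scale, rounding error on heavily-used weights is penalized more than error on rarely-used ones.

```python
# Toy illustration of importance-weighted quantization (NOT llama.cpp's
# actual imatrix math): weight each parameter's rounding error by how
# strongly its input activations fire on the calibration data.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=256)                  # one quantization group
importance = rng.uniform(0.1, 10.0, size=256)   # stand-in for imatrix stats

def quantize(w, scale, bits=4):
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def weighted_error(w, imp, scale):
    return float(np.sum(imp * (w - quantize(w, scale)) ** 2))

# Grid-search the scale twice: once treating all weights equally,
# once weighting by importance.
scales = np.linspace(0.01, 1.0, 200)
plain_scale = min(scales, key=lambda s: weighted_error(weights, np.ones_like(weights), s))
imat_scale = min(scales, key=lambda s: weighted_error(weights, importance, s))

# By construction, the importance-aware scale has lower error on the
# weights the calibration corpus (here: code) actually exercises.
print(weighted_error(weights, importance, imat_scale)
      <= weighted_error(weights, importance, plain_scale))
```

Calibrating on code rather than Wikipedia shifts which weights get that protection, which is the whole point of the included imatrix.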
## Files

| File | Size | Description |
|---|---|---|
| `Qwen3-Coder-Next-APEX-I-Quality.gguf` | 54.1 GB | APEX I-Quality quantized model (5.43 BPW) |
| `imatrix-coder-next.dat` | 457 MB | Code-calibrated importance matrix – use this for your own quantizations |
## Model Details

| Property | Value |
|---|---|
| Architecture | qwen3next (hybrid attention + SSM with MoE) |
| Total Parameters | 79.67B |
| Active Parameters | ~3B per token |
| Expert Count | 512 experts, 10 active per token |
| Context Length | 262,144 tokens (native) |
| Original Type | BF16 (148.5 GB) |
| Quantized Size | 54.1 GB (5.43 BPW) |
| Quantization | APEX I-Quality (Q6_K/Q5_K/IQ4_XS experts, Q8_0 shared, Q6_K attention) |
## Performance
Tested on AMD Ryzen AI Max+ 395 (128GB unified memory, ROCm/Vulkan):
| Metric | Value |
|---|---|
| Output Speed | ~50-60 t/s |
| Prompt Processing | Fast (MoE architecture) |
| Memory Usage | ~54 GB model + KV cache |
| Parallel Sessions | 4 (with --parallel 4) |
The 3B active parameter design means this 80B model runs at speeds comparable to β or faster than β much smaller dense models. On our hardware, it outperforms the 30B variant in both speed and quality.
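The arithmetic behind that claim is simple (a rough sketch only: real decode speed also depends on memory bandwidth, the attention/SSM layers, and shared experts, not just routed-expert parameter counts):

```python
# Rough per-token compute comparison. ASSUMPTION: decode cost scales with
# the parameters touched per token; this ignores bandwidth, attention/SSM
# layers, and shared experts, so treat it as intuition, not a benchmark.
total_params = 79.67e9   # from the model card
active_params = 3e9      # ~10 of 512 experts routed per token

fraction_active = active_params / total_params
print(f"~{fraction_active:.1%} of weights touched per decoded token")
```

Only about 4% of the weights participate in any given token, which is why the model decodes like a small dense model while retaining 80B-scale capacity.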
## How to Run

### llama.cpp (recommended)

```bash
# Download the model
huggingface-cli download stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF \
  Qwen3-Coder-Next-APEX-I-Quality.gguf \
  --local-dir ./models/

# Run with llama-server
./llama-server \
  -m ./models/Qwen3-Coder-Next-APEX-I-Quality.gguf \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 32768 --parallel 4 \
  -ngl 99 --no-mmap
```
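Once the server is up, llama-server exposes an OpenAI-compatible chat endpoint. A minimal client sketch (the `model` value is a placeholder; the host/port match the command above):

```python
# Minimal OpenAI-compatible request to llama-server (sketch; the "model"
# field is a placeholder string, llama-server serves whatever it loaded).
import json
import urllib.request

payload = {
    "model": "qwen3-coder-next",  # placeholder name
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once llama-server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

With `--parallel 4`, up to four such requests are served concurrently.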
### Ollama

Create a Modelfile:

```
FROM ./Qwen3-Coder-Next-APEX-I-Quality.gguf
PARAMETER num_ctx 32768
```

Then:

```bash
ollama create coder-next -f Modelfile
ollama run coder-next
```
## Hardware Requirements
| Setup | RAM/VRAM | Notes |
|---|---|---|
| AMD Ryzen AI Max+ 395 | 128 GB unified | Recommended. Full GPU offload, fast inference |
| Apple M4 Max/Ultra | 128 GB+ unified | Should work well with Metal |
| Dual GPU (48GB each) | 96 GB+ VRAM | Split across GPUs |
| CPU + RAM | 64 GB+ RAM | Slower, but works with mmap |
Minimum ~58 GB free memory for model + KV cache at 32K context.
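The weight footprint can be sanity-checked from the parameter count and bits-per-weight quoted above (decimal GB; actual runtime usage adds KV cache, which varies with context length and the hybrid attention/SSM cache layout):

```python
# Sanity-check the model-card numbers: 79.67B params at 5.43 bits/weight.
params = 79.67e9
bpw = 5.43
model_gb = params * bpw / 8 / 1e9  # bits -> bytes -> decimal GB
print(f"model weights: ~{model_gb:.1f} GB")  # ~54.1 GB, matching the file size
```

The ~58 GB minimum above is this 54.1 GB plus headroom for KV cache at 32K context.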
## Using the Imatrix

The included `imatrix-coder-next.dat` was generated from 50K+ code samples using `llama-imatrix`. You can reuse it for your own quantizations of Qwen3-Coder-Next:
```bash
# Download just the imatrix
huggingface-cli download stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF \
  imatrix-coder-next.dat \
  --local-dir ./

# Use it with llama-quantize for custom quants
./llama-quantize \
  --imatrix ./imatrix-coder-next.dat \
  Qwen3-Coder-Next-BF16.gguf \
  output.gguf Q4_K_M
```
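To budget disk space before quantizing, you can estimate output sizes from approximate average bits-per-weight. The BPW figures below are assumptions (commonly cited llama.cpp averages; real quants mix types per tensor, so actual files will differ somewhat):

```python
# Back-of-the-envelope output sizes for custom quants of this model.
# ASSUMPTION: the bits-per-weight averages below are rough, commonly
# cited figures, not exact per-tensor numbers.
params = 79.67e9
approx_bpw = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5}

for name, bpw in approx_bpw.items():
    print(f"{name}: ~{params * bpw / 8 / 1e9:.1f} GB")
```

A Q4_K_M quant of this model would land somewhere near 48 GB, for example, versus the 54.1 GB APEX I-Quality file shipped here.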
## About

Quantized by STACKS! Container Hosting – a cloud platform built on owned hardware. This model powers our PicoClaw AI coding agents, offering unlimited inference at flat-rate pricing.
We believe in giving back to the open source community. This quantization and the code-calibrated imatrix are provided freely under the same Apache 2.0 license as the original model.