Qwen3-Coder-Next optimized for MLX. Note: Uses MXFP4 for some module paths.

EDIT: v2 fixes some misassigned shared expert gates. Slower, but with roughly 4x the perplexity improvement over the plain 4-bit quant (-2.28% vs v1's -0.53%).

EDIT: v3 bumps edge experts to Q8 for a further perplexity improvement, with minimal effect on speed.

Usage

# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server --host 127.0.0.1 --port 8080 \
  --model spicyneuron/Qwen3-Next-Coder-MLX-mixed-4.5-bit
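The server exposes an OpenAI-compatible endpoint, so any standard chat-completions client or payload works against it. A minimal sketch using only the standard library (the prompt and `max_tokens` value are arbitrary examples):

```python
import json
import urllib.request

# Minimal OpenAI-style chat-completions payload for the local server.
payload = {
    "model": "spicyneuron/Qwen3-Next-Coder-MLX-mixed-4.5-bit",
    "messages": [
        {"role": "user", "content": "Write a Python one-liner that reverses a string."}
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Requires the server from the command above to be running:
# response = json.load(urllib.request.urlopen(req))
# print(response["choices"][0]["message"]["content"])
```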

Methodology

Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm-style mixed-precision GGUFs. MLX's quantization options differ from llama.cpp's, but the principles are the same:

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision

This quant is comparable in size to Unsloth's UD-Q4_K_XL and MOE-MXFP4 GGUFs, but loads and runs noticeably faster thanks to MLX.
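The per-layer assignment can be expressed as a predicate over module paths. The sketch below is illustrative only: the pattern strings and bit-widths are assumptions, not the exact rules used for this model, and how the predicate hooks into mlx-lm's conversion varies by version.

```python
def bits_for(path: str) -> int:
    """Illustrative mixed-precision rule: pick a bit-width by layer sensitivity."""
    # Sensitive modules: MoE routing, attention, embeddings, output head.
    # Note: a real script must distinguish the MoE router's gate from the
    # experts' own gate_proj layers (the kind of misassignment v2 fixed).
    if any(key in path for key in ("router", "gate", "self_attn", "embed", "lm_head")):
        return 8
    # Tolerant modules: the MoE expert weights themselves.
    if "experts" in path:
        return 4
    # Everything else at an intermediate precision.
    return 6

for p in (
    "model.layers.0.mlp.router",
    "model.layers.0.mlp.experts.7.up_proj",
    "model.layers.0.self_attn.q_proj",
):
    print(p, "->", bits_for(p))
```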

Benchmarks

  • unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
  • mlx-community/Qwen3-Coder-Next-4bit
  • Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
  • Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
  • Qwen3-Next-Coder-MLX-mixed-4.5-bit (v3, ~4.9 bit)

Prompt Processing (tokens/sec)

| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|---|---|---|---|---|---|
| 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 | 1868.77 |
| 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 | 2071.76 |
| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 | 2031.04 |
| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 | 1854.24 |

Generation (tokens/sec)

| Gen Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|---|---|---|---|---|---|
| 500 | 49.35 | 76.39 | 75.30 | 66.82 | 67.19 |
| 1000 | 49.12 | 74.67 | 73.16 | 65.86 | 64.82 |
| 2000 | 49.01 | 71.99 | 70.95 | 63.68 | 62.82 |
| 5000 | 48.64 | 67.72 | 66.67 | 61.04 | 60.99 |

Perplexity (MLX Quants)

| Model | Perplexity | Relative | Relative % |
|---|---|---|---|
| MLX 4bit | 4.118 ± 0.021 | | |
| MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 | -0.53% |
| MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 | -2.28% |
| MLX 4.9bit (v3) | 4.016 ± 0.021 | -0.102 | -2.48% |
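The relative columns are plain differences against the MLX 4bit baseline, reproduced here from the table's own values:

```python
# Perplexity deltas vs the MLX 4bit baseline (values from the table above).
baseline = 4.118
for name, ppl in [("v1", 4.096), ("v2", 4.024), ("v3", 4.016)]:
    delta = ppl - baseline
    pct = 100 * delta / baseline
    print(f"{name}: {delta:+.3f} ({pct:+.2f}%)")
```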
Benchmark commands

# llama.cpp 8130
llama-bench -fa 1 --batch-size 2048 --ubatch-size 2048 --repetitions 5

# mlx_lm v0.30.7
mlx_lm.benchmark --num-trials 5
mlx_lm.perplexity --sequence-length 1000 --seed 222