---
library_name: mlx
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
pipeline_tag: text-generation
base_model: Qwen/Qwen3-Coder-Next
tags:
  - mlx
---

Qwen3-Coder-Next quantized for MLX with mixed precision. Note: uses MXFP4 for some module paths.

EDIT: v2 fixes some misassigned shared expert gates. Slower, but with roughly 4x the perplexity improvement of v1 (see the perplexity table below).

EDIT: v3 bumps edge experts to Q8 for a further perplexity improvement with minimal effect on speed.

Usage

```shell
# Start an OpenAI-compatible server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server --host 127.0.0.1 --port 8080 \
  --model spicyneuron/Qwen3-Next-Coder-MLX-4.5bit
```
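Once the server is running, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch (the `chat`/`build_payload` helpers and the payload fields shown are illustrative, not part of mlx-lm):

```python
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    # Standard OpenAI-style chat completion request body.
    return {
        "model": "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt: str) -> str:
    # POST the request and pull the assistant message out of the response.
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With the server up, `chat("Write a binary search in Python")` returns the model's reply as a string.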

Methodology

Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm style mixed-precision GGUFs. MLX's quantization options differ from llama.cpp's, but the principles are the same:

- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision
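The split above can be sketched with mlx-lm's per-layer `quant_predicate` hook. This is an assumed illustration, not the author's actual script: the path substrings, bit widths, and group sizes below are placeholders, and the hook's exact signature may vary across mlx-lm versions.

```python
# Path substrings treated as "sensitive" (illustrative; crude substring
# matching like this would need tuning for real module names).
SENSITIVE = ("mlp.gate", "embed_tokens", "lm_head", "self_attn")

def mixed_precision(path: str, module, config) -> dict:
    # Routing, attention, and embedding layers keep higher precision;
    # the bulk of the MoE expert weights take the lower-bit budget.
    if any(key in path for key in SENSITIVE):
        return {"bits": 8, "group_size": 32}
    return {"bits": 4, "group_size": 64}

def run_conversion():
    # Deferred import so the sketch is readable without mlx installed.
    from mlx_lm import convert
    convert(
        "Qwen/Qwen3-Coder-Next",
        mlx_path="qwen3-coder-next-mixed",
        quantize=True,
        quant_predicate=mixed_precision,
    )
```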

This one is comparable in size to Unsloth's UD-Q4_K_XL and MOE-MXFP4 GGUFs, but loads and runs noticeably faster thanks to MLX.
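As a rough sanity check on "comparable in size": weight storage scales linearly with effective bits per weight. A hypothetical back-of-the-envelope helper (the 80B parameter count is an illustrative assumption, not a stated spec of this model):

```python
def quantized_size_gb(num_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB for a quantized model."""
    return num_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# e.g. an 80B-parameter model at ~4.5 effective bits per weight:
print(quantized_size_gb(80, 4.5))  # -> 45.0 (GB of weights, excluding overhead)
```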

Benchmarks

Models compared:

- unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
- mlx-community/Qwen3-Coder-Next-4bit
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v3, ~4.9 bit)

Prompt Processing (tokens/sec)

| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|------------:|-----:|---------:|----------------:|----------------:|----------------:|
| 1000  | 1440.60 | 1917.29 | 1894.38 | 1871.55 | 1868.77 |
| 5000  | 1511.29 | 2113.98 | 2069.36 | 2079.87 | 2071.76 |
| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 | 2031.04 |
| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 | 1854.24 |

Generation (tokens/sec)

| Gen Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|---------:|-----:|---------:|----------------:|----------------:|----------------:|
| 500  | 49.35 | 76.39 | 75.30 | 66.82 | 67.19 |
| 1000 | 49.12 | 74.67 | 73.16 | 65.86 | 64.82 |
| 2000 | 49.01 | 71.99 | 70.95 | 63.68 | 62.82 |
| 5000 | 48.64 | 67.72 | 66.67 | 61.04 | 60.99 |

Perplexity (MLX Quants)

| Model | Perplexity | Relative | Relative % |
|-------|-----------:|---------:|-----------:|
| MLX 4bit | 4.118 ± 0.021 | — | — |
| MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 | -0.53% |
| MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 | -2.28% |
| MLX 4.9bit (v3) | 4.016 ± 0.021 | -0.102 | -2.48% |
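The "Relative %" column is the change in perplexity versus the plain 4-bit baseline, as a percentage (lower perplexity is better). A quick check of the arithmetic:

```python
def relative_pct(ppl: float, baseline: float = 4.118) -> float:
    # Percentage change in perplexity relative to the MLX 4bit baseline.
    return round((ppl - baseline) / baseline * 100, 2)

print(relative_pct(4.096))  # v1 -> -0.53
print(relative_pct(4.024))  # v2 -> -2.28
print(relative_pct(4.016))  # v3 -> -2.48
```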
Benchmark commands:

```shell
# llama.cpp 8130
llama-bench -fa 1 --batch-size 2048 --ubatch-size 2048 --repetitions 5

# mlx_lm v0.30.7
mlx_lm.benchmark --num-trials 5
mlx_lm.perplexity --sequence-length 1000 --seed 222
```