---
library_name: mlx
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
pipeline_tag: text-generation
base_model: Qwen/Qwen3-Coder-Next
tags:
- mlx
---

[Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: uses MXFP4 for some module paths.

**EDIT:** [v2](https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-4.5bit/tree/v2) fixes some misassigned shared expert gates. Slower, but with roughly 4x the perplexity improvement of v1 (see the Perplexity table below).

**EDIT:** [v3](https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-4.5bit/tree/v3) bumps edge experts to Q8 for a further perplexity improvement with minimal effect on speed.

# Usage

```sh
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server --host 127.0.0.1 --port 8080 \
  --model spicyneuron/Qwen3-Next-Coder-MLX-4.5bit
```

An example client request is shown at the end of this card.

# Methodology

Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm-style mixed-precision GGUFs. MLX's quantization options differ from llama.cpp's, but the principles are the same:

- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision

A sketch of this predicate-based approach is shown at the end of this card.

This one is comparable to ~~[Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf)~~ [Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf) in size, but loads and runs noticeably faster thanks to MLX.

# Benchmarks

Quants compared:

- unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
- mlx-community/Qwen3-Coder-Next-4bit
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v3, ~4.9 bit)

## Prompt Processing (tokens/sec)

| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|------------:|--------:|---------:|----------------:|----------------:|----------------:|
| 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 | 1868.77 |
| 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 | 2071.76 |
| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 | 2031.04 |
| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 | 1854.24 |

## Generation (tokens/sec)

| Gen Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|---------:|------:|---------:|----------------:|----------------:|----------------:|
| 500 | 49.35 | 76.39 | 75.30 | 66.82 | 67.19 |
| 1000 | 49.12 | 74.67 | 73.16 | 65.86 | 64.82 |
| 2000 | 49.01 | 71.99 | 70.95 | 63.68 | 62.82 |
| 5000 | 48.64 | 67.72 | 66.67 | 61.04 | 60.99 |

## Perplexity (MLX Quants)

| Model | Perplexity | Relative | Relative % |
|-----------------|---------------:|---------:|-----------:|
| MLX 4bit | 4.118 ± 0.021 | — | — |
| MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 | -0.53% |
| MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 | -2.28% |
| MLX 4.9bit (v3) | 4.016 ± 0.021 | -0.102 | -2.48% |

```sh
# llama.cpp 8130
llama-bench -fa 1 --batch-size 2048 --ubatch-size 2048 --repetitions 5

# mlx_lm v0.30.7
mlx_lm.benchmark --num-trials 5
mlx_lm.perplexity --sequence-length 1000 --seed 222
```
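
The server from the Usage section speaks the standard OpenAI-style chat completions API, so any compatible client can be pointed at it. A minimal sketch using the `requests` package (the prompt and `max_tokens` value are arbitrary examples):

```python
# Minimal chat completion request against the local mlx_lm.server instance.
# Assumes the server from the Usage section is running on port 8080 and that
# `requests` is installed; any OpenAI-compatible client works the same way.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit",
        "messages": [{"role": "user", "content": "Write a quicksort in Python."}],
        "max_tokens": 256,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```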
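
For the Methodology section above: `mlx_lm.convert` exposes a `quant_predicate` hook that decides, per module path, whether and how to quantize. Below is a minimal sketch of the mixed-precision idea; the `SENSITIVE` path substrings and bit widths are illustrative assumptions, not the exact recipe behind this repo (which also assigns MXFP4 to some paths).

```python
# Sketch of mixed-precision quantization via mlx_lm's quant_predicate hook.
# The SENSITIVE substrings and bit widths are illustrative guesses, not the
# exact recipe used for this model.
from mlx_lm import convert

# Module paths that should keep higher precision (hypothetical examples).
SENSITIVE = ("router", "gate", "self_attn", "embed_tokens", "lm_head")

def mixed_predicate(path, module, config):
    # Returning a dict overrides the default quantization parameters for
    # this module; returning True applies the defaults (4-bit below).
    if any(key in path for key in SENSITIVE):
        return {"bits": 8, "group_size": 64}  # sensitive layers stay ~8-bit
    return True  # tolerant layers (e.g. MoE experts) get the 4-bit default

convert(
    "Qwen/Qwen3-Coder-Next",
    mlx_path="Qwen3-Next-Coder-MLX-mixed",
    quantize=True,
    q_bits=4,
    q_group_size=64,
    quant_predicate=mixed_predicate,
)
```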