File size: 3,668 Bytes
f6423b6
 
 
 
 
 
 
 
 
ffde6d7
88ad441
ffde6d7
c9de3d3
b3bf7d0
c9de3d3
c21c137
01a490c
 
 
 
 
c9de3d3
01a490c
 
ffde6d7
 
7b18195
 
ffde6d7
 
 
c21c137
 
 
02b6414
 
 
 
b8380e2
02b6414
453239a
 
ddd05bf
02b6414
c21c137
 
b3bf7d0
 
 
 
 
 
c21c137
 
02b6414
b3bf7d0
 
 
 
 
 
02b6414
 
 
c21c137
 
 
 
 
b3bf7d0
02b6414
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
---
library_name: mlx
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
pipeline_tag: text-generation
base_model: Qwen/Qwen3-Coder-Next
tags:
- mlx
---

[Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: Uses MXFP4 for some module paths.

**EDIT:** [v2](https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-4.5bit/tree/v2) fixes some misassigned shared expert gates. Slower, but with 4x better perplexity.

**EDIT:** [v3](https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-4.5bit/tree/v3) bumps edge experts to Q8 for further perplexity improvement and minimal effect on speed.

# Usage

```sh
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server --host 127.0.0.1 --port 8080 \
  --model spicyneuron/Qwen3-Next-Coder-MLX-4.5bit
```

# Methodology

Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.
MLX quantization options differ than llama.cpp, but the principles are the same:
- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision

This one is comparable to
~~[Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf)~~
[Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf)
in size, but loads and runs noticeably faster thanks to MLX.

# Benchmarks

- unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
- mlx-community/Qwen3-Coder-Next-4bit
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v3, ~4.9 bit)

## Prompt Processing (tokens/sec)

| Prompt Size | GGUF        | MLX 4bit        | MLX 4.5bit (v1)        | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|------------:|------------:|----------------:|-----------------------:|----------------:|----------------:|
| 1000        | 1440.60     | 1917.29         | 1894.38                | 1871.55         | 1868.77         |
| 5000        | 1511.29     | 2113.98         | 2069.36                | 2079.87         | 2071.76         |
| 10000       | 1491.41     | 2073.89         | 2032.13                | 2039.11         | 2031.04         |
| 20000       | 1387.15     | 1888.56         | 1854.83                | 1860.35         | 1854.24         |

## Generation (tokens/sec)

| Gen Size | GGUF     | MLX 4bit     | MLX 4.5b (v1) | MLX 4.4b (v2) | MLX 4.9b (v3) |
|---------:|---------:|-------------:|--------------:|--------------:|--------------:|
| 500      | 49.35    | 76.39        | 75.30         | 66.82         | 67.19         |
| 1000     | 49.12    | 74.67        | 73.16         | 65.86         | 64.82         |
| 2000     | 49.01    | 71.99        | 70.95         | 63.68         | 62.82         |
| 5000     | 48.64    | 67.72        | 66.67         | 61.04         | 60.99         |

## Perplexity (MLX Quants)

| Model                 | Perplexity      | Relative | Relative % |
|-----------------------|-----------------|----------|------------|
| MLX 4bit              | 4.118 ± 0.021   | —        |  —         |
| MLX 4.5bit (v1)       | 4.096 ± 0.021   | -0.022   | -0.53%     |
| MLX 4.4bit (v2)       | 4.024 ± 0.021   | -0.094   | -2.28%     |
| MLX 4.9bit (v3)       | 4.016 ± 0.021   | -0.102   | -2.48%     |

```
# llama.cpp 8130
llama-bench -fa 1 --batch-size 2048 --ubatch-size 2048 --repetitions 5

# mlx_lm v0.30.7
mlx_lm.benchmark --num-trials 5
mlx_lm.perplexity --sequence-length 1000 --seed 222
```