Update README.md
README.md (CHANGED)
@@ -10,7 +10,9 @@ tags:

[Qwen3-Coder-Next](https://huggingface.co/moonshotai/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: Uses MXFP4 for some module paths.

-**EDIT:** v2 fixes some misassigned shared expert gates.
+**EDIT:** v2 fixes some misassigned shared expert gates. Slower, but with roughly 4x the perplexity improvement of v1 relative to the plain 4bit quant (recommended).
+
+**EDIT:** v3 bumps edge experts to Q8 for a further perplexity improvement with minimal effect on speed.

# Methodology

@@ -33,21 +35,21 @@ in size, but loads and runs noticeably faster thanks to MLX.

## Prompt Processing (tokens/sec)

-| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) |
-|------------:|--------:|---------:|----------------:|----------------:|
-| 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 |
-| 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 |
-| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 |
-| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 |
+| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
+|------------:|--------:|---------:|----------------:|----------------:|----------------:|
+| 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 | 1868.77 |
+| 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 | 2071.76 |
+| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 | 2031.04 |
+| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 | 1854.24 |

## Generation (tokens/sec)

-| Gen Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) |
-|---------:|------:|---------:|----------------:|----------------:|
-| 500 | 49.35 | 76.39 | 75.30 | 66.82 |
-| 1000 | 49.12 | 74.67 | 73.16 | 65.86 |
-| 2000 | 49.01 | 71.99 | 70.95 | 63.68 |
-| 5000 | 48.64 | 67.72 | 66.67 | 61.04 |
+| Gen Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
+|---------:|------:|---------:|----------------:|----------------:|----------------:|
+| 500 | 49.35 | 76.39 | 75.30 | 66.82 | 67.19 |
+| 1000 | 49.12 | 74.67 | 73.16 | 65.86 | 64.82 |
+| 2000 | 49.01 | 71.99 | 70.95 | 63.68 | 62.82 |
+| 5000 | 48.64 | 67.72 | 66.67 | 61.04 | 60.99 |

## Perplexity (MLX Quants)

@@ -56,6 +58,7 @@ in size, but loads and runs noticeably faster thanks to MLX.
| MLX 4bit | 4.118 ± 0.021 | — | — |
| MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 | -0.53% |
| MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 | -2.28% |
+| MLX 4.9bit (v3) | 4.016 ± 0.021 | -0.102 | -2.48% |

```
# llama.cpp 8130
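
The mixed-precision scheme the edited README describes (MXFP4 on some module paths, higher-precision "edge" experts in v3) is the kind of recipe that mlx_lm's `quant_predicate` hook in `convert` can express. The sketch below is illustrative only: the path patterns, bit widths, repo id, and output directory are assumptions, not the actual recipe behind the v1/v2/v3 quants in this repo.

```python
# Illustrative sketch of a mixed-precision MLX conversion.
# NOTE: the path patterns, bit widths, and names below are assumptions,
# not the recipe actually used for the quants in this repo.
from mlx_lm import convert

def quant_predicate(path, module, config):
    """Pick per-module quantization settings; return False to leave a module unquantized."""
    if not hasattr(module, "to_quantized"):
        return False
    # Keep embeddings and the output head at 8 bits (hypothetical choice).
    if "embed_tokens" in path or "lm_head" in path:
        return {"group_size": 64, "bits": 8}
    # Hypothetical "edge" layers kept at 8 bits, in the spirit of the v3 note.
    if path.startswith("model.layers.0.") or path.startswith("model.layers.47."):
        return {"group_size": 64, "bits": 8}
    # Everything else at 4 bits.
    return {"group_size": 64, "bits": 4}

convert(
    hf_path="Qwen/Qwen3-Coder-Next",        # placeholder upstream repo id
    mlx_path="qwen3-coder-next-mlx-mixed",  # placeholder local output directory
    quantize=True,
    quant_predicate=quant_predicate,
)
```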
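The throughput tables come from the harness described in the Methodology section, which this diff does not show. As a rough single-run sanity check, mlx_lm's `generate` with `verbose=True` prints prompt and generation tokens-per-sec; the model path and prompt below are placeholders, not the benchmark setup used for the tables.

```python
# Quick single-run throughput check (not the benchmark harness behind the tables).
from mlx_lm import load, generate

model, tokenizer = load("qwen3-coder-next-mlx-mixed")  # placeholder local quant path
prompt = "Write a function that parses a CSV file."    # placeholder prompt

# verbose=True prints prompt and generation tokens-per-sec after the run.
generate(model, tokenizer, prompt=prompt, max_tokens=500, verbose=True)
```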
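For the perplexity column, a minimal fixed-window loop in MLX looks roughly like the sketch below. The window size, evaluation text, and model path are placeholders, and this is not the exact harness (or the llama.cpp 8130 reference run) behind the numbers above.

```python
# Minimal fixed-window perplexity estimate for an MLX quant (illustrative only;
# window size, eval text, and model path are placeholders).
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("qwen3-coder-next-mlx-mixed")  # placeholder local quant path
tokens = tokenizer.encode(open("eval.txt").read())     # placeholder evaluation text

window = 2048
losses = []
for start in range(0, len(tokens) - window, window):
    chunk = mx.array(tokens[start : start + window])[None]
    logits = model(chunk[:, :-1])  # next-token logits for the window
    loss = nn.losses.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        chunk[:, 1:].reshape(-1),
        reduction="mean",
    )
    losses.append(loss.item())

print(f"perplexity: {mx.exp(mx.array(losses).mean()).item():.3f}")
```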