Update README.md
README.md (CHANGED)
@@ -10,7 +10,9 @@ tags:

[Qwen3-Coder-Next](https://huggingface.co/moonshotai/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: Uses MXFP4 for some module paths.

-**EDIT:** v2 fixes some misassigned shared expert gates.
+**EDIT:** v2 fixes some misassigned shared expert gates. Slower, but with roughly 4x the perplexity improvement of v1 relative to the plain 4bit quant (recommended).
+
+**EDIT:** v3 bumps edge experts to Q8 for a further perplexity improvement with minimal effect on speed.

# Methodology

@@ -33,21 +35,21 @@ in size, but loads and runs noticeably faster thanks to MLX.

## Prompt Processing (tokens/sec)

-| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) |
-|------------:|--------:|---------:|----------------:|----------------:|
-| 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 |
-| 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 |
-| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 |
-| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 |
+| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
+|------------:|--------:|---------:|----------------:|----------------:|----------------:|
+| 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 | 1868.77 |
+| 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 | 2071.76 |
+| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 | 2031.04 |
+| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 | 1854.24 |

## Generation (tokens/sec)

-| Gen Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) |
-|---------:|------:|---------:|----------------:|----------------:|
-| 500 | 49.35 | 76.39 | 75.30 | 66.82 |
-| 1000 | 49.12 | 74.67 | 73.16 | 65.86 |
-| 2000 | 49.01 | 71.99 | 70.95 | 63.68 |
-| 5000 | 48.64 | 67.72 | 66.67 | 61.04 |
+| Gen Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
+|---------:|------:|---------:|----------------:|----------------:|----------------:|
+| 500 | 49.35 | 76.39 | 75.30 | 66.82 | 67.19 |
+| 1000 | 49.12 | 74.67 | 73.16 | 65.86 | 64.82 |
+| 2000 | 49.01 | 71.99 | 70.95 | 63.68 | 62.82 |
+| 5000 | 48.64 | 67.72 | 66.67 | 61.04 | 60.99 |

## Perplexity (MLX Quants)

@@ -56,6 +58,7 @@ in size, but loads and runs noticeably faster thanks to MLX.
| MLX 4bit | 4.118 ± 0.021 | — | — |
| MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 | -0.53% |
| MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 | -2.28% |
+| MLX 4.9bit (v3) | 4.016 ± 0.021 | -0.102 | -2.48% |

```
# llama.cpp 8130
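
The mixed-precision scheme the edited README describes (MXFP4 on some module paths, higher-precision "edge" experts in v3) is the kind of recipe that mlx_lm's `quant_predicate` hook in `convert` can express. The sketch below is illustrative only: the path patterns, bit widths, repo id, and output directory are assumptions, not the actual recipe behind the v1/v2/v3 quants in this repo.

```python
# Illustrative sketch of a mixed-precision MLX conversion.
# NOTE: the path patterns, bit widths, and names below are assumptions,
# not the recipe actually used for the quants in this repo.
from mlx_lm import convert

def quant_predicate(path, module, config):
    """Pick per-module quantization settings; return False to leave a module unquantized."""
    if not hasattr(module, "to_quantized"):
        return False
    # Keep embeddings and the output head at 8 bits (hypothetical choice).
    if "embed_tokens" in path or "lm_head" in path:
        return {"group_size": 64, "bits": 8}
    # Hypothetical "edge" layers kept at 8 bits, in the spirit of the v3 note.
    if path.startswith("model.layers.0.") or path.startswith("model.layers.47."):
        return {"group_size": 64, "bits": 8}
    # Everything else at 4 bits.
    return {"group_size": 64, "bits": 4}

convert(
    hf_path="Qwen/Qwen3-Coder-Next",        # placeholder upstream repo id
    mlx_path="qwen3-coder-next-mlx-mixed",  # placeholder local output directory
    quantize=True,
    quant_predicate=quant_predicate,
)
```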
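The throughput tables come from the harness described in the Methodology section, which this diff does not show. As a rough single-run sanity check, mlx_lm's `generate` with `verbose=True` prints prompt and generation tokens-per-sec; the model path and prompt below are placeholders, not the benchmark setup used for the tables.

```python
# Quick single-run throughput check (not the benchmark harness behind the tables).
from mlx_lm import load, generate

model, tokenizer = load("qwen3-coder-next-mlx-mixed")  # placeholder local quant path
prompt = "Write a function that parses a CSV file."    # placeholder prompt

# verbose=True prints prompt and generation tokens-per-sec after the run.
generate(model, tokenizer, prompt=prompt, max_tokens=500, verbose=True)
```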
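For the perplexity column, a minimal fixed-window loop in MLX looks roughly like the sketch below. The window size, evaluation text, and model path are placeholders, and this is not the exact harness (or the llama.cpp 8130 reference run) behind the numbers above.

```python
# Minimal fixed-window perplexity estimate for an MLX quant (illustrative only;
# window size, eval text, and model path are placeholders).
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("qwen3-coder-next-mlx-mixed")  # placeholder local quant path
tokens = tokenizer.encode(open("eval.txt").read())     # placeholder evaluation text

window = 2048
losses = []
for start in range(0, len(tokens) - window, window):
    chunk = mx.array(tokens[start : start + window])[None]
    logits = model(chunk[:, :-1])  # next-token logits for the window
    loss = nn.losses.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        chunk[:, 1:].reshape(-1),
        reduction="mean",
    )
    losses.append(loss.item())

print(f"perplexity: {mx.exp(mx.array(losses).mean()).item():.3f}")
```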