spicyneuron committed
Commit b3bf7d0 · verified · 1 Parent(s): 7b18195

Update README.md

Files changed (1)
  1. README.md +16 -13
README.md CHANGED
@@ -10,7 +10,9 @@ tags:
 
 [Qwen3-Coder-Next](https://huggingface.co/moonshotai/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: Uses MXFP4 for some module paths.
 
-**EDIT:** v2 fixes some misassigned shared expert gates. Slightly slower, but with 4x better perplexity.
+**EDIT:** v2 fixes some misassigned shared expert gates. Slower, but with 4x better perplexity. (recommended)
+
+**EDIT:** v3 bumps edge experts to Q8 for further perplexity improvement and minimal effect on speed.
 
 # Methodology
 
@@ -33,21 +35,21 @@ in size, but loads and runs noticeably faster thanks to MLX.
 
 ## Prompt Processing (tokens/sec)
 
-| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) |
-|------------:|------------:|----------------:|-----------------------:|----------------:|
-| 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 |
-| 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 |
-| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 |
-| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 |
+| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
+|------------:|------------:|----------------:|-----------------------:|----------------:|----------------:|
+| 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 | 1868.77 |
+| 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 | 2071.76 |
+| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 | 2031.04 |
+| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 | 1854.24 |
 
 ## Generation (tokens/sec)
 
-| Gen Size | GGUF | MLX 4bit | MLX 4.5b (v1) | MLX 4.4b (v2) |
-|---------:|---------:|-------------:|--------------:|--------------:|
-| 500 | 49.35 | 76.39 | 75.30 | 66.82 |
-| 1000 | 49.12 | 74.67 | 73.16 | 65.86 |
-| 2000 | 49.01 | 71.99 | 70.95 | 63.68 |
-| 5000 | 48.64 | 67.72 | 66.67 | 61.04 |
+| Gen Size | GGUF | MLX 4bit | MLX 4.5b (v1) | MLX 4.4b (v2) | MLX 4.9b (v3) |
+|---------:|---------:|-------------:|--------------:|--------------:|--------------:|
+| 500 | 49.35 | 76.39 | 75.30 | 66.82 | 67.19 |
+| 1000 | 49.12 | 74.67 | 73.16 | 65.86 | 64.82 |
+| 2000 | 49.01 | 71.99 | 70.95 | 63.68 | 62.82 |
+| 5000 | 48.64 | 67.72 | 66.67 | 61.04 | 60.99 |
 
 ## Perplexity (MLX Quants)
 
@@ -56,6 +58,7 @@ in size, but loads and runs noticeably faster thanks to MLX.
 | MLX 4bit | 4.118 ± 0.021 | — | — |
 | MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 | -0.53% |
 | MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 | -2.28% |
+| MLX 4.9bit (v3) | 4.016 ± 0.021 | -0.102 | -2.48% |
 
 ```
 # llama.cpp 8130
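
The Δ and Δ% columns in the perplexity table above are differences against the MLX 4bit baseline. A minimal sketch of that arithmetic, using the values straight from the table (the variable names here are illustrative, not from the repo):

```python
# Perplexity deltas relative to the MLX 4bit baseline (values from the table above).
baseline = 4.118  # MLX 4bit

quants = {
    "MLX 4.5bit (v1)": 4.096,
    "MLX 4.4bit (v2)": 4.024,
    "MLX 4.9bit (v3)": 4.016,
}

for name, ppl in quants.items():
    delta = ppl - baseline          # absolute perplexity change
    pct = delta / baseline * 100    # relative change in percent
    print(f"{name}: delta = {delta:+.3f} ({pct:+.2f}%)")
```

Running this reproduces the table's Δ values (e.g. v3: -0.102, -2.48%), confirming the percentages are taken against the 4bit row rather than against the previous version.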