Commit c21c137 by spicyneuron (verified) · 1 Parent(s): 10ea069

Update README.md

Files changed (1)
  1. README.md +26 -16
README.md CHANGED
@@ -10,6 +10,8 @@ tags:
 
  [Qwen3-Coder-Next](https://huggingface.co/moonshotai/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: Uses MXFP4 for some module paths.
 
  # Methodology
 
  Quantized using a custom script inspired by Unsloth-style mixed-precision GGUFs. MLX quantization options differ
@@ -17,12 +19,11 @@ than llama.cpp, but the principles are the same:
  - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  - More tolerant layers like MoE experts get lower precision
 
- This one is comparable to [Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf)
  in size, but loads and runs noticeably faster thanks to MLX.
 
- **EDIT: Re-converted the quant to follow [Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf)
- structure due to errors in UD-Q4_K_XL.** New version is smaller (~4.4 bits) with a big drop in perplexity.
-
  # Benchmarks
 
  - unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
@@ -30,22 +31,31 @@ structure due to errors in UD-Q4_K_XL.** New version is smaller (~4.4 bits) with
  - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
  - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
 
- ## Throughput (tokens/sec)
 
- | Prompt / Gen Size | GGUF Prompt | MLX 4bit Prompt | MLX 4.5bit (v1) Prompt | MLX 4.4bit (v2) Prompt | GGUF Gen | MLX 4bit Gen | MLX 4.5b (v1) Gen | MLX 4.5b (v2) Gen |
- |-------------------|------------:|----------------:|-----------------------:|-----------------------:|---------:|--------------:|-----------------:|------------------:|
- | 1000 / 500 | 1440.60 | 1917.29 | 1894.38 | todo | 49.35 | 76.39 | 75.30 | todo |
- | 5000 / 1000 | 1511.29 | 2113.98 | 2069.36 | todo | 49.12 | 74.67 | 73.16 | todo |
- | 10000 / 2000 | 1491.41 | 2073.89 | 2032.13 | todo | 49.01 | 71.99 | 70.95 | todo |
- | 20000 / 5000 | 1387.15 | 1888.56 | 1854.83 | todo | 48.64 | 67.72 | 66.67 | todo |
 
  ## Perplexity (MLX Quants)
 
- | Model | Perplexity | Relative vs 4bit |
- |-----------------------|-----------------|------------------|
- | MLX 4bit | 4.118 ± 0.021 | baseline |
- | MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 (≈ -0.53%)|
- | MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 (≈ -2.28%)|
 
  ```
  # llama.cpp 8130
 
  [Qwen3-Coder-Next](https://huggingface.co/moonshotai/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: Uses MXFP4 for some module paths.
 
+ **EDIT:** v2 is slightly smaller (~4.4 bits) and slower, but with better perplexity.
+
  # Methodology
 
  Quantized using a custom script inspired by Unsloth-style mixed-precision GGUFs. MLX quantization options differ
 
  - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  - More tolerant layers like MoE experts get lower precision
 
+ This one is comparable to
+ ~~[Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf)~~
+ [Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf)
  in size, but loads and runs noticeably faster thanks to MLX.
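
The layer-sensitivity rule described in the Methodology section can be sketched as a small path-based lookup. This is only an illustration: the README does not show the author's script, so the function name `choose_quant`, the path fragments (`experts`, `router`, `attention`, `lm_head`), and the bit widths below are all hypothetical placeholders rather than the settings actually used.

```python
from typing import Dict

# Assumed bit widths, chosen only for illustration (not taken from the README).
HIGH_BITS = 8      # sensitive layers: routing, attention, output embeddings
LOW_BITS = 4       # tolerant layers: MoE experts
DEFAULT_BITS = 6   # everything else (an assumption)

# Hypothetical path segments; real module names depend on the model implementation.
SENSITIVE_SEGMENTS = {"router", "attention", "lm_head"}
TOLERANT_SEGMENTS = {"experts"}


def choose_quant(path: str, group_size: int = 64) -> Dict[str, int]:
    """Pick quantization settings for a module from its dotted path."""
    segments = set(path.split("."))
    # Experts are checked first so an expert sub-module is never
    # promoted to high precision by accident.
    if segments & TOLERANT_SEGMENTS:
        bits = LOW_BITS
    elif segments & SENSITIVE_SEGMENTS:
        bits = HIGH_BITS
    else:
        bits = DEFAULT_BITS
    return {"bits": bits, "group_size": group_size}


if __name__ == "__main__":
    for p in ("layers.0.moe.router", "layers.0.moe.experts.3.up_proj", "lm_head"):
        print(p, choose_quant(p))
```

Matching on whole path segments rather than substrings avoids accidental hits such as a routing keyword appearing inside an expert projection name.
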
 
  # Benchmarks
 
  - unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
 
  - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
  - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
 
+ ## Prompt Processing (tokens/sec)
+
+ | Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) |
+ |------------:|------------:|----------------:|-----------------------:|----------------:|
+ | 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 |
+ | 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 |
+ | 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 |
+ | 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 |
+
+ ## Generation (tokens/sec)
 
+ | Gen Size | GGUF | MLX 4bit | MLX 4.5b (v1) | MLX 4.4b (v2) |
+ |---------:|---------:|-------------:|--------------:|--------------:|
+ | 500 | 49.35 | 76.39 | 75.30 | 66.82 |
+ | 1000 | 49.12 | 74.67 | 73.16 | 65.86 |
+ | 2000 | 49.01 | 71.99 | 70.95 | 63.68 |
+ | 5000 | 48.64 | 67.72 | 66.67 | 61.04 |
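
To make the generation numbers easier to compare, the ratio of each quant's tokens/sec to a chosen baseline can be computed directly from the table. The snippet below is a minimal sketch using only the figures listed above; the helper name `speedup_vs` is an illustrative choice, not part of any benchmark tool.

```python
# Generation throughput (tokens/sec) copied from the table above, keyed by generation size.
GEN_TPS = {
    500:  {"GGUF": 49.35, "MLX 4bit": 76.39, "MLX 4.5b (v1)": 75.30, "MLX 4.4b (v2)": 66.82},
    1000: {"GGUF": 49.12, "MLX 4bit": 74.67, "MLX 4.5b (v1)": 73.16, "MLX 4.4b (v2)": 65.86},
    2000: {"GGUF": 49.01, "MLX 4bit": 71.99, "MLX 4.5b (v1)": 70.95, "MLX 4.4b (v2)": 63.68},
    5000: {"GGUF": 48.64, "MLX 4bit": 67.72, "MLX 4.5b (v1)": 66.67, "MLX 4.4b (v2)": 61.04},
}


def speedup_vs(baseline: str, table: dict) -> dict:
    """Return tokens/sec of every other column divided by the baseline column."""
    return {
        size: {name: tps / row[baseline] for name, tps in row.items() if name != baseline}
        for size, row in table.items()
    }


if __name__ == "__main__":
    for size, ratios in speedup_vs("GGUF", GEN_TPS).items():
        summary = ", ".join(f"{name}: {ratio:.2f}x" for name, ratio in ratios.items())
        print(f"gen {size}: {summary}")
```

Passing `"MLX 4.5b (v1)"` as the baseline instead shows how much generation speed v2 gives up relative to v1.
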
 
  ## Perplexity (MLX Quants)
 
+ | Model | Perplexity | Relative | Relative % |
+ |-----------------------|-----------------|----------|------------|
+ | MLX 4bit | 4.118 ± 0.021 | — | — |
+ | MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 | -0.53% |
+ | MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 | -2.28% |
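
The Relative and Relative % columns follow from the perplexity values, taking MLX 4bit as the baseline. A short sketch of that arithmetic, assuming the relative change is simply the difference divided by the baseline (the helper name `relative_to` is illustrative):

```python
# Perplexity values copied from the table above (the ± 0.021 uncertainty is not used here).
PERPLEXITY = {
    "MLX 4bit": 4.118,
    "MLX 4.5bit (v1)": 4.096,
    "MLX 4.4bit (v2)": 4.024,
}


def relative_to(baseline: str, scores: dict) -> dict:
    """Return (absolute delta, percent delta) of each model against the baseline."""
    base = scores[baseline]
    return {
        name: (value - base, 100.0 * (value - base) / base)
        for name, value in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (delta, pct) in relative_to("MLX 4bit", PERPLEXITY).items():
        print(f"{name}: {delta:+.3f} ({pct:+.2f}%)")
```

Note that the v1 delta (0.022) is about the same size as the reported ± 0.021 uncertainty, while the v2 delta (0.094) is several times larger.
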
 
  ```
  # llama.cpp 8130