spicyneuron committed · Commit 453239a · verified · 1 Parent(s): 3c67fe6

Update README.md

Files changed (1)
  1. README.md +13 -8
README.md CHANGED
@@ -20,27 +20,32 @@ than llama.cpp, but the principles are the same:
 This one is comparable to [Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf)
 in size, but loads and runs noticeably faster thanks to MLX.
 
+**EDIT: Re-converted the quant to follow [Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf)
+structure due to errors in UD-Q4_K_XL.** The new version is smaller (~4.4 bits) with a big drop in perplexity.
+
 # Benchmarks
 
 - unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
 - mlx-community/Qwen3-Coder-Next-4bit
-- Qwen3-Next-Coder-MLX-mixed-4.5-bit
+- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
+- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
 
 ## Throughput (tokens/sec)
 
-| Prompt / Gen Size | GGUF Prompt | MLX 4bit Prompt | MLX 4.5bit Prompt | GGUF Gen | MLX 4bit Gen | MLX 4.5bit Gen |
-|-------------------|------------:|----------------:|------------------:|---------:|-------------:|---------------:|
-| 1000 / 500        |     1440.60 |         1917.29 |           1894.38 |    49.35 |        76.39 |          75.30 |
-| 5000 / 1000       |     1511.29 |         2113.98 |           2069.36 |    49.12 |        74.67 |          73.16 |
-| 10000 / 2000      |     1491.41 |         2073.89 |           2032.13 |    49.01 |        71.99 |          70.95 |
-| 20000 / 5000      |     1387.15 |         1888.56 |           1854.83 |    48.64 |        67.72 |          66.67 |
+| Prompt / Gen Size | GGUF Prompt | MLX 4bit Prompt | MLX 4.5bit (v1) Prompt | MLX 4.4bit (v2) Prompt | GGUF Gen | MLX 4bit Gen | MLX 4.5bit (v1) Gen | MLX 4.4bit (v2) Gen |
+|-------------------|------------:|----------------:|-----------------------:|-----------------------:|---------:|-------------:|--------------------:|--------------------:|
+| 1000 / 500        |     1440.60 |         1917.29 |                1894.38 |                   todo |    49.35 |        76.39 |               75.30 |                todo |
+| 5000 / 1000       |     1511.29 |         2113.98 |                2069.36 |                   todo |    49.12 |        74.67 |               73.16 |                todo |
+| 10000 / 2000      |     1491.41 |         2073.89 |                2032.13 |                   todo |    49.01 |        71.99 |               70.95 |                todo |
+| 20000 / 5000      |     1387.15 |         1888.56 |                1854.83 |                   todo |    48.64 |        67.72 |               66.67 |                todo |
 
 ## Perplexity (MLX Quants)
 
 | Model                 | Perplexity      | Relative vs 4bit |
 |-----------------------|-----------------|------------------|
 | MLX 4bit              | 4.118 ± 0.021   | baseline         |
-| MLX 4.5bit Mixed      | 4.096 ± 0.021   | -0.022 (≈ -0.53%)|
+| MLX 4.5bit (v1)       | 4.096 ± 0.021   | -0.022 (≈ -0.53%)|
+| MLX 4.4bit (v2)       | 4.024 ± 0.021   | -0.094 (≈ -2.28%)|
 
 ```
 # llama.cpp 8130