Update README.md

than llama.cpp, but the principles are the same:

This one is comparable to [Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf) in size, but loads and runs noticeably faster thanks to MLX.

**EDIT: Re-converted the quant to follow [Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf) structure due to errors in UD-Q4_K_XL.** The new version is smaller (~4.4 bits) and shows a large drop in perplexity.

# Benchmarks

- unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
- mlx-community/Qwen3-Coder-Next-4bit
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
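
The repo doesn't spell out how the mixed quant's effective bit width is computed; as a rough illustration, the average bits-per-weight of a mixed quant is the parameter-weighted mean of each layer group's width. The layer split below is hypothetical, chosen only to show how a figure like ~4.4–4.5 bits can arise:

```python
def average_bits(layers):
    """Parameter-weighted mean bit width of a mixed quant.

    layers: list of (num_params, bits_per_weight) tuples,
    one entry per group of layers quantized at the same width.
    """
    total_bits = sum(n * b for n, b in layers)
    total_params = sum(n for n, _ in layers)
    return total_bits / total_params

# Hypothetical split: bulk of the weights at 4-bit, a sensitive
# slice kept at 6-bit, embeddings/head at 8-bit.
layers = [(9_000_000_000, 4), (1_500_000_000, 6), (500_000_000, 8)]
print(round(average_bits(layers), 2))  # → 4.45
```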

## Throughput (tokens/sec)

| Prompt / Gen Size | GGUF Prompt | MLX 4bit Prompt | MLX 4.5bit (v1) Prompt | MLX 4.4bit (v2) Prompt | GGUF Gen | MLX 4bit Gen | MLX 4.5bit (v1) Gen | MLX 4.4bit (v2) Gen |
|-------------------|------------:|----------------:|-----------------------:|-----------------------:|---------:|-------------:|--------------------:|--------------------:|
| 1000 / 500        | 1440.60     | 1917.29         | 1894.38                | todo                   | 49.35    | 76.39        | 75.30               | todo                |
| 5000 / 1000       | 1511.29     | 2113.98         | 2069.36                | todo                   | 49.12    | 74.67        | 73.16               | todo                |
| 10000 / 2000      | 1491.41     | 2073.89         | 2032.13                | todo                   | 49.01    | 71.99        | 70.95               | todo                |
| 20000 / 5000      | 1387.15     | 1888.56         | 1854.83                | todo                   | 48.64    | 67.72        | 66.67               | todo                |
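
The table reports prompt processing and generation rates separately; the end-to-end latency of a request combines both, and generation speed dominates at long outputs. A minimal sketch in plain Python, using the 20000 / 5000 row above:

```python
def end_to_end_seconds(prompt_tokens, gen_tokens, prompt_tps, gen_tps):
    """Estimated wall-clock time for one request from the two table rates."""
    return prompt_tokens / prompt_tps + gen_tokens / gen_tps

# 20000 / 5000 row: GGUF vs MLX 4bit
gguf = end_to_end_seconds(20000, 5000, 1387.15, 48.64)
mlx4 = end_to_end_seconds(20000, 5000, 1888.56, 67.72)
print(round(gguf, 1), round(mlx4, 1))  # → 117.2 84.4
```

Most of the gap comes from the generation columns, since 5000 output tokens at ~49 t/s take far longer than prompt processing at either rate.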

## Perplexity (MLX Quants)

| Model           | Perplexity    | Relative vs 4bit  |
|-----------------|---------------|-------------------|
| MLX 4bit        | 4.118 ± 0.021 | baseline          |
| MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 (≈ -0.53%) |
| MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 (≈ -2.28%) |
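
For readers unfamiliar with the metric: perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. A toy sketch with hypothetical per-token log-probabilities (natural log, not this model's actual values):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Four hypothetical token log-probs; mean NLL = 1.4, so ppl = e^1.4.
print(round(perplexity([-1.2, -0.7, -2.1, -1.6]), 3))  # → 4.055
```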

```
# llama.cpp 8130