Update README.md

than llama.cpp, but the principles are the same:

This one is comparable to [Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf) in size, but loads and runs noticeably faster thanks to MLX.

**EDIT: Re-converted the quant to follow [Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf) structure due to errors in UD-Q4_K_XL.** The new version is smaller (~4.4 bits) and shows a large drop in perplexity.

# Benchmarks

- unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
- mlx-community/Qwen3-Coder-Next-4bit
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
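
The repo doesn't spell out how the mixed quant's effective bit width is computed; as a rough illustration, the average bits-per-weight of a mixed quant is the parameter-weighted mean of each layer group's width. The layer split below is hypothetical, chosen only to show how a figure like ~4.4–4.5 bits can arise:

```python
def average_bits(layers):
    """Parameter-weighted mean bit width of a mixed quant.

    layers: list of (num_params, bits_per_weight) tuples,
    one entry per group of layers quantized at the same width.
    """
    total_bits = sum(n * b for n, b in layers)
    total_params = sum(n for n, _ in layers)
    return total_bits / total_params

# Hypothetical split: bulk of the weights at 4-bit, a sensitive
# slice kept at 6-bit, embeddings/head at 8-bit.
layers = [(9_000_000_000, 4), (1_500_000_000, 6), (500_000_000, 8)]
print(round(average_bits(layers), 2))  # → 4.45
```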

## Throughput (tokens/sec)

| Prompt / Gen Size | GGUF Prompt | MLX 4bit Prompt | MLX 4.5bit (v1) Prompt | MLX 4.4bit (v2) Prompt | GGUF Gen | MLX 4bit Gen | MLX 4.5bit (v1) Gen | MLX 4.4bit (v2) Gen |
|-------------------|------------:|----------------:|-----------------------:|-----------------------:|---------:|-------------:|--------------------:|--------------------:|
| 1000 / 500        | 1440.60     | 1917.29         | 1894.38                | todo                   | 49.35    | 76.39        | 75.30               | todo                |
| 5000 / 1000       | 1511.29     | 2113.98         | 2069.36                | todo                   | 49.12    | 74.67        | 73.16               | todo                |
| 10000 / 2000      | 1491.41     | 2073.89         | 2032.13                | todo                   | 49.01    | 71.99        | 70.95               | todo                |
| 20000 / 5000      | 1387.15     | 1888.56         | 1854.83                | todo                   | 48.64    | 67.72        | 66.67               | todo                |
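
The table reports prompt processing and generation rates separately; the end-to-end latency of a request combines both, and generation speed dominates at long outputs. A minimal sketch in plain Python, using the 20000 / 5000 row above:

```python
def end_to_end_seconds(prompt_tokens, gen_tokens, prompt_tps, gen_tps):
    """Estimated wall-clock time for one request from the two table rates."""
    return prompt_tokens / prompt_tps + gen_tokens / gen_tps

# 20000 / 5000 row: GGUF vs MLX 4bit
gguf = end_to_end_seconds(20000, 5000, 1387.15, 48.64)
mlx4 = end_to_end_seconds(20000, 5000, 1888.56, 67.72)
print(round(gguf, 1), round(mlx4, 1))  # → 117.2 84.4
```

Most of the gap comes from the generation columns, since 5000 output tokens at ~49 t/s take far longer than prompt processing at either rate.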

## Perplexity (MLX Quants)

| Model           | Perplexity    | Relative vs 4bit  |
|-----------------|---------------|-------------------|
| MLX 4bit        | 4.118 ± 0.021 | baseline          |
| MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 (≈ -0.53%) |
| MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 (≈ -2.28%) |
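
For readers unfamiliar with the metric: perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. A toy sketch with hypothetical per-token log-probabilities (natural log, not this model's actual values):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Four hypothetical token log-probs; mean NLL = 1.4, so ppl = e^1.4.
print(round(perplexity([-1.2, -0.7, -2.1, -1.6]), 3))  # → 4.055
```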

```
# llama.cpp 8130