Update README.md
README.md
CHANGED
@@ -10,6 +10,8 @@ tags:

 [Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: Uses MXFP4 for some module paths.

 # Methodology

 Quantized using a custom script inspired by Unsloth-style mixed-precision GGUFs. MLX quantization options differ
@@ -17,12 +19,11 @@ than llama.cpp, but the principles are the same:
 - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
 - More tolerant layers like MoE experts get lower precision
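The two rules above can be sketched as a per-layer bit-width selector. This is only an illustration, not the custom script the README refers to: the path substrings (`router`, `self_attn`, `lm_head`) and the 8-bit and 4-bit widths are assumptions chosen for the example, and the real module names and precisions depend on the model and the quantization script.

```python
# Hypothetical substrings marking "sensitive" module paths.
# Real names depend on the model definition; these are assumptions.
SENSITIVE_MARKERS = ("router", "self_attn", "lm_head")


def bits_for_layer(path: str, sensitive_bits: int = 8, tolerant_bits: int = 4) -> int:
    """Pick a bit width for one layer from its module path (illustrative only)."""
    # Routing, attention, and output-embedding paths keep more precision.
    if any(marker in path for marker in SENSITIVE_MARKERS):
        return sensitive_bits
    # Everything else (for example MoE expert weights) gets the lower width.
    return tolerant_bits


if __name__ == "__main__":
    for path in (
        "model.layers.0.self_attn.q_proj",
        "model.layers.0.mlp.router",
        "model.layers.0.mlp.experts.down_proj",
        "lm_head",
    ):
        print(f"{path}: {bits_for_layer(path)} bits")
```

A selector of this shape could back whatever per-layer hook a quantization script exposes; the average bits per weight then depends on how many parameters fall on each side of the rule.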

-This one is comparable to
 in size, but loads and runs noticeably faster thanks to MLX.

-**EDIT: Re-converted the quant to follow [Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf)
-structure due to errors in UD-Q4_K_XL.** New version is smaller (~4.4 bits) with a big drop in perplexity.
-
 # Benchmarks

 - unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
@@ -30,22 +31,31 @@ structure due to errors in UD-Q4_K_XL.** New version is smaller (~4.4 bits) with
 - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
 - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)

-##

-|
-|---------
-|
-|
-|
-|

 ## Perplexity (MLX Quants)

-| Model | Perplexity | Relative
-|-----------------------|-----------------|------------------|
-| MLX 4bit | 4.118 ± 0.021 |
-| MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022
-| MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094

 ```
 # llama.cpp 8130
@@ -10,6 +10,8 @@ tags:

 [Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: Uses MXFP4 for some module paths.

+**EDIT:** v2 is slightly smaller (~4.4 bits) and slower, but with better perplexity.
+
 # Methodology

 Quantized using a custom script inspired by Unsloth-style mixed-precision GGUFs. MLX quantization options differ
@@ -17,12 +19,11 @@ than llama.cpp, but the principles are the same:
 - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
 - More tolerant layers like MoE experts get lower precision

+This one is comparable to
+~~[Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf)~~
+[Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf)
 in size, but loads and runs noticeably faster thanks to MLX.

 # Benchmarks

 - unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
@@ -30,22 +31,31 @@ structure due to errors in UD-Q4_K_XL.** New version is smaller (~4.4 bits) with
 - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
 - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)

+## Prompt Processing (tokens/sec)
+
+| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) |
+|------------:|------------:|----------------:|-----------------------:|----------------:|
+| 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 |
+| 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 |
+| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 |
+| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 |
+
+## Generation (tokens/sec)

+| Gen Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) |
+|---------:|---------:|-------------:|--------------:|--------------:|
+| 500 | 49.35 | 76.39 | 75.30 | 66.82 |
+| 1000 | 49.12 | 74.67 | 73.16 | 65.86 |
+| 2000 | 49.01 | 71.99 | 70.95 | 63.68 |
+| 5000 | 48.64 | 67.72 | 66.67 | 61.04 |

 ## Perplexity (MLX Quants)

+| Model | Perplexity | Relative | Relative % |
+|-----------------------|-----------------|----------|------------|
+| MLX 4bit | 4.118 ± 0.021 | — | — |
+| MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 | -0.53% |
+| MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 | -2.28% |

 ```
 # llama.cpp 8130
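The Relative and Relative % columns in the new perplexity table are consistent with a plain difference and percent difference against the MLX 4bit row. The sketch below reproduces them under that assumption; the helper name `relative_change` is made up for this example and is not part of the benchmark tooling.

```python
def relative_change(baseline: float, value: float) -> tuple[float, float]:
    """Return (absolute delta, percent delta) of value measured against baseline."""
    delta = value - baseline
    return delta, 100.0 * delta / baseline


# Perplexity means copied from the table; the MLX 4bit row is the baseline.
BASELINE_PPL = 4.118
PERPLEXITY = {
    "MLX 4.5bit (v1)": 4.096,
    "MLX 4.4bit (v2)": 4.024,
}

if __name__ == "__main__":
    for name, ppl in PERPLEXITY.items():
        delta, pct = relative_change(BASELINE_PPL, ppl)
        print(f"{name}: {delta:+.3f} ({pct:+.2f}%)")
```

The same helper applies to the throughput tables. At a generation size of 500, for instance, `relative_change(75.30, 66.82)` puts v2 at roughly 11% fewer tokens per second than v1, which matches the note that v2 is slower.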