---
library_name: mlx
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
pipeline_tag: text-generation
base_model: Qwen/Qwen3-Coder-Next
tags:
- mlx
---
[Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: Uses MXFP4 for some module paths.
**EDIT:** [v2](https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-4.5bit/tree/v2) fixes some misassigned shared expert gates. Generation is roughly 10% slower than v1, but the perplexity improvement over the plain 4-bit quant is about 4x larger (-0.094 vs. -0.022; see Benchmarks).
**EDIT:** [v3](https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-4.5bit/tree/v3) bumps edge experts to Q8 for a further perplexity improvement with minimal effect on speed.
# Usage
```sh
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server --host 127.0.0.1 --port 8080 \
--model spicyneuron/Qwen3-Next-Coder-MLX-4.5bit
```
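Once the server is running, any OpenAI-compatible client should be able to talk to it. The following is a minimal sketch using only the Python standard library. It assumes the host and port from the command above; the helper names `build_chat_request` and `chat` and the `max_tokens` default are illustrative choices, not part of this repo.

```python
import json
import urllib.request

# Assumed to match the --host/--port flags used when starting the server.
BASE_URL = "http://127.0.0.1:8080/v1"
MODEL = "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # OpenAI-style response shape: first choice, assistant message content.
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("Write a function that reverses a string.")` should return the model's reply as a string.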
# Methodology
Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm-style mixed-precision GGUFs.
MLX quantization options differ from llama.cpp's, but the principles are the same:
- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision
This one is comparable to
~~[Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf)~~
[Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf)
in size, but loads and runs noticeably faster thanks to MLX.
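The precision-assignment principle above can be sketched as a simple path-to-bits lookup. This is not the actual quantization script: the path fragments, the bit widths, and the `bits_for` helper are all hypothetical, picked only to illustrate "sensitive layers high, tolerant layers low".

```python
# Hypothetical rules: (path fragment, bits). First match wins, so the
# routed-expert rule is listed before the broader "mlp.gate" router rule.
RULES = [
    ("switch_mlp", 4),          # routed MoE experts: more tolerant
    ("shared_expert_gate", 8),  # shared expert gate: sensitive
    ("mlp.gate", 8),            # MoE router: sensitive
    ("attn", 8),                # attention projections: sensitive
    ("lm_head", 8),             # output embeddings: sensitive
]


def bits_for(path, default_bits=6):
    """Return an illustrative bit width for a module path."""
    for fragment, bits in RULES:
        if fragment in path:
            return bits
    return default_bits


for path in (
    "model.layers.3.mlp.switch_mlp.up_proj",
    "model.layers.3.mlp.gate",
    "model.layers.3.self_attn.q_proj",
    "lm_head",
):
    print(f"{path}: {bits_for(path)} bits")
```

A function like this could feed whatever per-layer hook the quantization tool exposes. The exact hook signature depends on the mlx-lm version, so it is left out here.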
# Benchmarks
- unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
- mlx-community/Qwen3-Coder-Next-4bit
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v3, ~4.9 bit)
## Prompt Processing (tokens/sec)
| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|------------:|------------:|----------------:|-----------------------:|----------------:|----------------:|
| 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 | 1868.77 |
| 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 | 2071.76 |
| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 | 2031.04 |
| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 | 1854.24 |
## Generation (tokens/sec)
| Gen Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|---------:|---------:|-------------:|--------------:|--------------:|--------------:|
| 500 | 49.35 | 76.39 | 75.30 | 66.82 | 67.19 |
| 1000 | 49.12 | 74.67 | 73.16 | 65.86 | 64.82 |
| 2000 | 49.01 | 71.99 | 70.95 | 63.68 | 62.82 |
| 5000 | 48.64 | 67.72 | 66.67 | 61.04 | 60.99 |
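To put a number on the generation speed gap, the v3 and GGUF columns of the table above can be divided row by row. This is a small sketch; the dictionary layout and the `speedup` helper are illustrative, and the values are copied from the table.

```python
# Generation throughput (tokens/sec), copied from the table above.
GEN_TPS = {
    500: {"gguf": 49.35, "v3": 67.19},
    1000: {"gguf": 49.12, "v3": 64.82},
    2000: {"gguf": 49.01, "v3": 62.82},
    5000: {"gguf": 48.64, "v3": 60.99},
}


def speedup(gen_size):
    """Ratio of v3 MLX throughput to GGUF throughput at one gen size."""
    row = GEN_TPS[gen_size]
    return row["v3"] / row["gguf"]


for size in GEN_TPS:
    print(f"{size:>5} tokens: {speedup(size):.2f}x")
```

On these figures v3 generates roughly 1.25x to 1.36x as fast as the GGUF, with the gap narrowing as the generation gets longer.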
## Perplexity (MLX Quants)
| Model | Perplexity | Relative | Relative % |
|-----------------------|-----------------|----------|------------|
| MLX 4bit | 4.118 ± 0.021 | — | — |
| MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 | -0.53% |
| MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 | -2.28% |
| MLX 4.9bit (v3) | 4.016 ± 0.021 | -0.102 | -2.48% |
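The Relative and Relative % columns are the difference from the MLX 4bit baseline, in absolute terms and as a share of the baseline. A short sketch that reproduces them from the Perplexity column (the `relative` helper is illustrative):

```python
BASELINE = 4.118  # MLX 4bit perplexity from the table above

QUANTS = {
    "MLX 4.5bit (v1)": 4.096,
    "MLX 4.4bit (v2)": 4.024,
    "MLX 4.9bit (v3)": 4.016,
}


def relative(ppl, baseline=BASELINE):
    """Return (absolute delta, percent delta) against the baseline."""
    delta = ppl - baseline
    return round(delta, 3), round(100 * delta / baseline, 2)


for name, ppl in QUANTS.items():
    delta, pct = relative(ppl)
    print(f"{name}: {delta:+.3f} ({pct:+.2f}%)")
```

Note that the v1 delta (-0.022) is about the same size as the reported ± 0.021 uncertainty, while the v2 and v3 deltas are several times larger.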
```sh
# llama.cpp 8130
llama-bench -fa 1 --batch-size 2048 --ubatch-size 2048 --repetitions 5
# mlx_lm v0.30.7
mlx_lm.benchmark --num-trials 5
mlx_lm.perplexity --sequence-length 1000 --seed 222
```