Instructions for using spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with libraries, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- MLX
How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("spicyneuron/Qwen3-Next-Coder-MLX-4.5bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)
text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with Pi:
Start the MLX server
```sh
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"
```
Configure the model in Pi
```sh
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add to `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit" }
      ]
    }
  }
}
```

Run Pi
```sh
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with Hermes Agent:
Start the MLX server
```sh
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"
```
Configure Hermes
```sh
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default spicyneuron/Qwen3-Next-Coder-MLX-4.5bit
```
Run Hermes
```sh
hermes
```
- MLX LM
How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with MLX LM:
Generate or start a chat session
```sh
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"
```
Run an OpenAI-compatible server
```sh
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"

# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
---
library_name: mlx
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
pipeline_tag: text-generation
base_model: Qwen/Qwen3-Coder-Next
tags:
- mlx
---
[Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: uses MXFP4 for some module paths.
**EDIT:** [v2](https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-4.5bit/tree/v2) fixes some misassigned shared expert gates. Slower, but with 4x better perplexity.
**EDIT:** [v3](https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-4.5bit/tree/v3) bumps edge experts to Q8 for a further perplexity improvement with minimal effect on speed.
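The v2 and v3 revisions live on branches of this repo, so they have to be fetched by revision. A minimal sketch, assuming `huggingface_hub` and `mlx-lm` are installed (the branch name follows the links above; the prompt is just an example and not part of the original card):

```python
from huggingface_hub import snapshot_download
from mlx_lm import load, generate

# Download a specific branch (e.g. the v3 revision) and get its local path
path = snapshot_download(
    repo_id="spicyneuron/Qwen3-Next-Coder-MLX-4.5bit",
    revision="v3",  # assumption: branch names match the links above ("v2", "v3")
)

# Load the snapshot and generate as in the MLX example above
model, tokenizer = load(path)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a quicksort in Python"}],
    add_generation_prompt=True,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```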
# Usage

```sh
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server --host 127.0.0.1 --port 8080 \
  --model spicyneuron/Qwen3-Next-Coder-MLX-4.5bit
```
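Once the server is running, any OpenAI-compatible client can call it. A minimal sketch using the `openai` Python package against the endpoint above (the API key is a placeholder for clients that require one):

```python
from openai import OpenAI

# Point the client at the local mlx_lm server started above
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="spicyneuron/Qwen3-Next-Coder-MLX-4.5bit",
    messages=[{"role": "user", "content": "Write a binary search in Python"}],
)
print(response.choices[0].message.content)
```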
# Methodology

Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.
MLX's quantization options differ from llama.cpp's, but the principles are the same (a rough sketch follows the list):
- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision
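For illustration only, here is a rough sketch of that per-layer split as a Python precision map. It is not the author's script; the path patterns and bit widths are assumptions, roughly in the shape that mlx_lm's mixed-precision quant-predicate hook expects (named recipes on the CLI, a callable in the Python convert API, depending on version):

```python
# Hypothetical precision map illustrating the sensitivity split described above.
# Layer-name patterns and bit widths are assumptions, not the actual recipe.

def assign_precision(layer_path: str) -> dict:
    """Pick quantization settings for a weight from its dotted module path."""
    # Bulk MoE expert weights are the most tolerant, so they get the lowest precision
    if ".experts." in layer_path or "switch_mlp" in layer_path:
        return {"bits": 4, "group_size": 64}
    # MoE routing gates, attention projections, and embeddings/output head are
    # sensitive, so they keep higher precision
    if any(k in layer_path for k in ("mlp.gate", "self_attn", "embed_tokens", "lm_head")):
        return {"bits": 8, "group_size": 64}
    # Everything else lands in between
    return {"bits": 6, "group_size": 64}

# Example assignments
print(assign_precision("model.layers.3.mlp.gate"))                 # router -> 8-bit
print(assign_precision("model.layers.3.mlp.experts.0.down_proj"))  # expert -> 4-bit
```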
This one is comparable to
~~[Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf)~~
[Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf)
in size, but loads and runs noticeably faster thanks to MLX.
# Benchmarks

Quants compared:

- unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
- mlx-community/Qwen3-Coder-Next-4bit
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v3, ~4.9 bit)
## Prompt Processing (tokens/sec)

| Prompt Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|------------:|--------:|---------:|----------------:|----------------:|----------------:|
| 1000 | 1440.60 | 1917.29 | 1894.38 | 1871.55 | 1868.77 |
| 5000 | 1511.29 | 2113.98 | 2069.36 | 2079.87 | 2071.76 |
| 10000 | 1491.41 | 2073.89 | 2032.13 | 2039.11 | 2031.04 |
| 20000 | 1387.15 | 1888.56 | 1854.83 | 1860.35 | 1854.24 |
## Generation (tokens/sec)

| Gen Size | GGUF | MLX 4bit | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|---------:|------:|---------:|----------------:|----------------:|----------------:|
| 500 | 49.35 | 76.39 | 75.30 | 66.82 | 67.19 |
| 1000 | 49.12 | 74.67 | 73.16 | 65.86 | 64.82 |
| 2000 | 49.01 | 71.99 | 70.95 | 63.68 | 62.82 |
| 5000 | 48.64 | 67.72 | 66.67 | 61.04 | 60.99 |
## Perplexity (MLX Quants)

| Model | Perplexity | Δ vs MLX 4bit | Δ % |
|-----------------|-----------------|---------------|--------|
| MLX 4bit | 4.118 ± 0.021 | — | — |
| MLX 4.5bit (v1) | 4.096 ± 0.021 | -0.022 | -0.53% |
| MLX 4.4bit (v2) | 4.024 ± 0.021 | -0.094 | -2.28% |
| MLX 4.9bit (v3) | 4.016 ± 0.021 | -0.102 | -2.48% |
```sh
# llama.cpp 8130
llama-bench -fa 1 --batch-size 2048 --ubatch-size 2048 --repetitions 5

# mlx_lm v0.30.7
mlx_lm.benchmark --num-trials 5
mlx_lm.perplexity --sequence-length 1000 --seed 222
```