Instructions to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("spicyneuron/Qwen3-Next-Coder-MLX-4.5bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default spicyneuron/Qwen3-Next-Coder-MLX-4.5bit

Run Hermes

hermes

MLX LM

How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'

Qwen3-Next-Coder-MLX-4.5bit


---
library_name: mlx
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
pipeline_tag: text-generation
base_model: Qwen/Qwen3-Coder-Next
tags:
- mlx
---

[Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: some module paths are quantized with MXFP4.

**EDIT:** [v2](https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-4.5bit/tree/v2) fixes some misassigned shared expert gates. Generation is slower, but the perplexity gain over the plain MLX 4bit quant is roughly 4x that of v1 (-0.094 vs. -0.022; see Benchmarks).

**EDIT:** [v3](https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-4.5bit/tree/v3) bumps edge experts to Q8 for a further perplexity improvement with minimal effect on speed.

# Usage

```sh
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server --host 127.0.0.1 --port 8080 \
  --model spicyneuron/Qwen3-Next-Coder-MLX-4.5bit
```
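
Once the server is running, any OpenAI-style client can talk to it. The following is a minimal sketch using only the Python standard library; the helper names `build_chat_request` and `chat`, and the `max_tokens` value, are illustrative assumptions rather than part of this repository.

```python
import json
import urllib.request

# Matches the --host/--port used in the server command above.
BASE_URL = "http://127.0.0.1:8080/v1"
MODEL = "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL, max_tokens=256):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # illustrative cap, adjust as needed
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Standard OpenAI chat-completions response shape.
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Write a binary search in Python"))` should print the model's reply; the call raises `urllib.error.URLError` if nothing is listening on the port.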

# Methodology

Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm-style mixed-precision GGUFs.
MLX's quantization options differ from llama.cpp's, but the principles are the same:
- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision
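
The two principles above can be pictured as a per-module rule table. The sketch below is not the author's script: the module path patterns, bit widths, and group size are all assumptions chosen for illustration, and the function only returns settings without touching any model.

```python
import re

# Hypothetical rules: (regex on module path, quantization settings).
# First match wins. Patterns and bit widths are illustrative assumptions.
RULES = [
    (r"\.mlp\.gate$", {"bits": 8, "group_size": 64}),             # MoE router
    (r"shared_expert_gate$", {"bits": 8, "group_size": 64}),      # shared expert gate
    (r"\.(self_attn|linear_attn)\.", {"bits": 8, "group_size": 64}),  # attention
    (r"^lm_head$", {"bits": 8, "group_size": 64}),                # output embeddings
    (r"\.switch_mlp\.", {"bits": 4, "group_size": 64}),           # routed experts
]
DEFAULT = {"bits": 6, "group_size": 64}  # assumed fallback for everything else


def pick_precision(path):
    """Return quantization settings for a module path such as 'model.layers.3.mlp.gate'."""
    for pattern, settings in RULES:
        if re.search(pattern, path):
            return dict(settings)  # copy so callers cannot mutate the rule table
    return dict(DEFAULT)


for name in ("model.layers.3.mlp.gate", "model.layers.3.mlp.switch_mlp.gate_proj", "lm_head"):
    print(name, pick_precision(name))
```

A function of this shape could be adapted to whatever per-layer hook a quantization tool exposes; the exact hook signature depends on the tool and version, so check its documentation before wiring it in.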

This one is comparable to
~~[Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf)~~
[Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf)
in size, but loads and runs noticeably faster thanks to MLX (the GGUF benchmarked below is UD-Q4_K_XL).

# Benchmarks

Models compared (table column name in parentheses where it differs):

- unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL (GGUF)
- mlx-community/Qwen3-Coder-Next-4bit (MLX 4bit)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
- Qwen3-Next-Coder-MLX-mixed-4.5-bit (v3, ~4.9 bit)

## Prompt Processing (tokens/sec)

| Prompt Size | GGUF        | MLX 4bit        | MLX 4.5bit (v1)        | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|------------:|------------:|----------------:|-----------------------:|----------------:|----------------:|
| 1000        | 1440.60     | 1917.29         | 1894.38                | 1871.55         | 1868.77         |
| 5000        | 1511.29     | 2113.98         | 2069.36                | 2079.87         | 2071.76         |
| 10000       | 1491.41     | 2073.89         | 2032.13                | 2039.11         | 2031.04         |
| 20000       | 1387.15     | 1888.56         | 1854.83                | 1860.35         | 1854.24         |

## Generation (tokens/sec)

| Gen Size | GGUF     | MLX 4bit     | MLX 4.5bit (v1) | MLX 4.4bit (v2) | MLX 4.9bit (v3) |
|---------:|---------:|-------------:|----------------:|----------------:|----------------:|
| 500      | 49.35    | 76.39        | 75.30           | 66.82           | 67.19           |
| 1000     | 49.12    | 74.67        | 73.16           | 65.86           | 64.82           |
| 2000     | 49.01    | 71.99        | 70.95           | 63.68           | 62.82           |
| 5000     | 48.64    | 67.72        | 66.67           | 61.04           | 60.99           |
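
To read the generation numbers as ratios rather than raw throughput, a small helper like the one below can be used. The figures are copied from the generation table; the dictionary layout and the `speedup` function name are illustrative choices.

```python
# Generation throughput in tokens/sec, copied from the generation table.
gen_tps = {
    500:  {"GGUF": 49.35, "MLX 4bit": 76.39, "v1": 75.30, "v2": 66.82, "v3": 67.19},
    1000: {"GGUF": 49.12, "MLX 4bit": 74.67, "v1": 73.16, "v2": 65.86, "v3": 64.82},
    2000: {"GGUF": 49.01, "MLX 4bit": 71.99, "v1": 70.95, "v2": 63.68, "v3": 62.82},
    5000: {"GGUF": 48.64, "MLX 4bit": 67.72, "v1": 66.67, "v2": 61.04, "v3": 60.99},
}


def speedup(gen_size, model, baseline="GGUF"):
    """Throughput of `model` divided by throughput of `baseline` at one generation size."""
    row = gen_tps[gen_size]
    return row[model] / row[baseline]


for size in gen_tps:
    print(
        f"{size:>5} tokens: v3 vs GGUF x{speedup(size, 'v3'):.2f}, "
        f"v2 vs v1 x{speedup(size, 'v2', 'v1'):.2f}"
    )
```

Ratios above 1 mean the first model is faster; comparing v2 against v1 shows the generation cost of the v2 fix.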

## Perplexity (MLX Quants)

| Model                 | Perplexity      | Δ vs. MLX 4bit | Δ %        |
|-----------------------|-----------------|----------------|------------|
| MLX 4bit              | 4.118 ± 0.021   | —              | —          |
| MLX 4.5bit (v1)       | 4.096 ± 0.021   | -0.022         | -0.53%     |
| MLX 4.4bit (v2)       | 4.024 ± 0.021   | -0.094         | -2.28%     |
| MLX 4.9bit (v3)       | 4.016 ± 0.021   | -0.102         | -2.48%     |
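
The last two columns of the perplexity table are differences against the MLX 4bit row. As a quick check, they can be recomputed from the perplexity values; the variable and function names here are illustrative.

```python
# Perplexity values copied from the table; MLX 4bit is the baseline row.
BASELINE = 4.118
ppl = {
    "MLX 4.5bit (v1)": 4.096,
    "MLX 4.4bit (v2)": 4.024,
    "MLX 4.9bit (v3)": 4.016,
}


def relative(value, base=BASELINE):
    """Return (absolute difference, percent difference) against the baseline."""
    delta = value - base
    return delta, 100.0 * delta / base


for name, value in ppl.items():
    delta, pct = relative(value)
    print(f"{name}: {delta:+.3f} ({pct:+.2f}%)")
```

Lower perplexity is better, so negative values are improvements. Note that the v1 difference (-0.022) is about the size of the reported ± 0.021 uncertainty, while the v2 and v3 differences are several times larger.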

```sh
# llama.cpp 8130
llama-bench -fa 1 --batch-size 2048 --ubatch-size 2048 --repetitions 5

# mlx_lm v0.30.7
mlx_lm.benchmark --num-trials 5
mlx_lm.perplexity --sequence-length 1000 --seed 222
```