Instructions to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("spicyneuron/Qwen3-Next-Coder-MLX-4.5bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Notebooks
Google Colab
Kaggle
Local Apps
LM Studio

Pi

How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"
        }
      ]
    }
  }
}
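
If `~/.pi/agent/models.json` already lists other providers, pasting the JSON above over the file would drop them. The following is a minimal sketch, not part of Pi itself, that merges the `mlx-lm` entry into an existing file. The helper name `add_mlx_provider` is hypothetical, and the schema is assumed to be exactly what the snippet above shows.

```python
import json
from pathlib import Path

# Provider entry copied from the JSON above (schema assumed from that snippet).
MLX_PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"}],
}


def add_mlx_provider(config_path: Path) -> dict:
    """Hypothetical helper: merge the mlx-lm provider, keeping other providers."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text())
    # Create the "providers" object if missing, then add or overwrite "mlx-lm".
    config.setdefault("providers", {})["mlx-lm"] = MLX_PROVIDER
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_mlx_provider(Path.home() / ".pi" / "agent" / "models.json")` would apply it to the path named above.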

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default spicyneuron/Qwen3-Next-Coder-MLX-4.5bit

Run Hermes

hermes

MLX LM

How to use spicyneuron/Qwen3-Next-Coder-MLX-4.5bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
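
The same request can be sent from Python using only the standard library. This is a sketch under two assumptions: the server listens on port 8080 (the port used in the Pi and Hermes configurations above; change `BASE_URL` if you pass a different `--port`), and the response follows the usual OpenAI chat-completions shape. The function names are illustrative.

```python
import json
import urllib.request

# Assumed address: port 8080, as in the Pi and Hermes configs above.
BASE_URL = "http://localhost:8080/v1"
MODEL = "spicyneuron/Qwen3-Next-Coder-MLX-4.5bit"


def build_chat_request(content: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request that mirrors the curl call above."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(content: str) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(content)) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply.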

spicyneuron committed on Feb 27

Commit

c21c137

verified

1 Parent(s): 10ea069

Update README.md

Files changed (1)

README.md +26 -16

@@ -10,6 +10,8 @@ tags:
 [Qwen3-Coder-Next](https://huggingface.co/moonshotai/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: Uses MXFP4 for some module paths.
 # Methodology
 Quantized using a custom script inspired by Unsloth-style mixed-precision GGUFs. MLX quantization options differ
@@ -17,12 +19,11 @@ than llama.cpp, but the principles are the same:
 - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
 - More tolerant layers like MoE experts get lower precision
-This one is comparable to [Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf)
 in size, but loads and runs noticeably faster thanks to MLX.
-**EDIT: Re-converted the quant to follow [Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf)
-structure due to errors in UD-Q4_K_XL.** New version is smaller (~4.4 bits) with a big drop in perplexity.
 # Benchmarks
 - unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
@@ -30,22 +31,31 @@ structure due to errors in UD-Q4_K_XL.** New version is smaller (~4.4 bits) with
 - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
 - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
-## Throughput (tokens/sec)
-| Prompt / Gen Size | GGUF Prompt | MLX 4bit Prompt | MLX 4.5bit (v1) Prompt | MLX 4.4bit (v2) Prompt | GGUF Gen | MLX 4bit Gen | MLX 4.5b (v1) Gen | MLX 4.5b (v2) Gen |
-|-------------------|------------:|----------------:|-----------------------:|-----------------------:|---------:|--------------:|-----------------:|------------------:|
-| 1000 / 500        | 1440.60     | 1917.29         | 1894.38                | todo                   | 49.35    | 76.39         | 75.30            | todo |
-| 5000 / 1000       | 1511.29     | 2113.98         | 2069.36                | todo                | 49.12    | 74.67        | 73.16          | todo |
-| 10000 / 2000      | 1491.41     | 2073.89         | 2032.13                | todo                | 49.01    | 71.99        | 70.95          | todo |
-| 20000 / 5000      | 1387.15     | 1888.56         | 1854.83                | todo                | 48.64    | 67.72        | 66.67          | todo |
 ## Perplexity (MLX Quants)
-| Model                 | Perplexity      | Relative vs 4bit |
-|-----------------------|-----------------|------------------|
-| MLX 4bit              | 4.118 ± 0.021   | baseline         |
-| MLX 4.5bit (v1)       | 4.096 ± 0.021   | -0.022 (≈ -0.53%)|
-| MLX 4.4bit (v2)       | 4.024 ± 0.021   | -0.094 (≈ -2.28%)|
 ```
 # llama.cpp 8130

 [Qwen3-Coder-Next](https://huggingface.co/moonshotai/Qwen/Qwen3-Coder-Next) optimized for MLX. Note: Uses MXFP4 for some module paths.
+**EDIT:** v2 is slightly smaller (~4.4 bits) and slower, but with better perplexity.
 # Methodology
 Quantized using a custom script inspired by Unsloth-style mixed-precision GGUFs. MLX quantization options differ
 - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
 - More tolerant layers like MoE experts get lower precision
+This one is comparable to
+~~[Unsloth's UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_XL.gguf)~~
+[Unsloth's MOE-MXFP4](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-MXFP4_MOE.gguf)
 in size, but loads and runs noticeably faster thanks to MLX.
 # Benchmarks
 - unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
 - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v1)
 - Qwen3-Next-Coder-MLX-mixed-4.5-bit (v2, ~4.4 bit)
+## Prompt Processing (tokens/sec)
+| Prompt Size | GGUF        | MLX 4bit        | MLX 4.5bit (v1)        | MLX 4.4bit (v2) |
+|------------:|------------:|----------------:|-----------------------:|----------------:|
+| 1000        | 1440.60     | 1917.29         | 1894.38                | 1871.55         |
+| 5000        | 1511.29     | 2113.98         | 2069.36                | 2079.87         |
+| 10000       | 1491.41     | 2073.89         | 2032.13                | 2039.11         |
+| 20000       | 1387.15     | 1888.56         | 1854.83                | 1860.35         |
+## Generation (tokens/sec)
+| Gen Size | GGUF     | MLX 4bit     | MLX 4.5b (v1) | MLX 4.4b (v2) |
+|---------:|---------:|-------------:|--------------:|--------------:|
+| 500      | 49.35    | 76.39        | 75.30         | 66.82         |
+| 1000     | 49.12    | 74.67        | 73.16         | 65.86         |
+| 2000     | 49.01    | 71.99        | 70.95         | 63.68         |
+| 5000     | 48.64    | 67.72        | 66.67         | 61.04         |
 ## Perplexity (MLX Quants)
+| Model                 | Perplexity      | Relative | Relative % |
+|-----------------------|-----------------|----------|------------|
+| MLX 4bit              | 4.118 ± 0.021   | —        |  —         |
+| MLX 4.5bit (v1)       | 4.096 ± 0.021   | -0.022   | -0.53%     |
+| MLX 4.4bit (v2)       | 4.024 ± 0.021   | -0.094   | -2.28%     |
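
The two "Relative" columns can be reproduced from the perplexity values with a short calculation. This sketch assumes they are the absolute and percentage differences against the MLX 4bit baseline, which matches the figures in the table; the function name is illustrative.

```python
# Perplexity values taken from the table above.
BASELINE = 4.118  # MLX 4bit
QUANTS = {"MLX 4.5bit (v1)": 4.096, "MLX 4.4bit (v2)": 4.024}


def relative_change(ppl: float, base: float) -> tuple[float, float]:
    """Return (absolute difference, percentage difference) against the baseline."""
    delta = ppl - base
    return round(delta, 3), round(100 * delta / base, 2)


for name, ppl in QUANTS.items():
    delta, pct = relative_change(ppl, BASELINE)
    print(f"{name}: {delta:+.3f} ({pct:+.2f}%)")
```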
 ```
 # llama.cpp 8130