good quant
It did well in my tests, really a good model. If possible, kindly make a quant for https://huggingface.co/stepfun-ai/Step-3.5-Flash; people will really like it, imo.
@gopi87 Thanks. This one was pretty hard; it took me a day of iterating on the layers to finally get it tuned right, a very finicky model about its quants. Agree, it does seem quite good. I'll take a look at the flash model too. I haven't done it for myself because it's going to be around 5 tps on my rig, right on the threshold of usable, and due to the slow tps it will take a very long time to optimize. It will also need 128 GB of CPU RAM, which most don't have. I think this 80B with 3B active is close to the perfect layout: I get 20 tps on fairly old consumer hardware.
@gopi87 I did a Step-3.5-Flash quant today, but performance is not acceptable to upload. It went 0-3 on the first part of my eval prompts, and I stopped testing there since at that point it's a beyond-repair dumpster fire. It was a strong Q4_K_H sized at 117.6 GB with a minimum quant of Q4_K_S across layers, and it ran at about 7.5 tps on a 9900K / 4070. I am not going to upload models to my page that will not even remotely make it through my acceptance qual tests (sorry).
Thanks for the effort, mate, and try the no-thinking version too.
CUDA_VISIBLE_DEVICES="0" ./bin/llama-server \
  --model "/home/gopi/deepresearch-ui/model/stepfun-ai_Step-3.5-Flash-IQ4_XS-00001-of-00003.gguf" \
  --n-cpu-moe 49 \
  -ngl 99 \
  --ctx-size 40000 \
  --threads 28 \
  --threads-batch 28 \
  --reasoning-budget 0 \
  --host 0.0.0.0 \
  --jinja \
  --port 8080 \
  --temp 0.6 \
  --top-p 0.95
I was testing this version and getting around 12 t/s.
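For reference, once that server command is up, a quick way to sanity-check it is to POST to llama-server's OpenAI-compatible /v1/chat/completions endpoint on the --port given (8080). A minimal sketch below just builds the request body; the sampling fields mirror the flags above (--temp 0.6, --top-p 0.95), while the model name and prompt are placeholders.

```python
import json

# Request body for llama-server's OpenAI-compatible chat endpoint.
# temperature / top_p mirror the --temp 0.6 / --top-p 0.95 flags;
# the "model" field is largely cosmetic for a single-model llama-server.
payload = {
    "model": "Step-3.5-Flash",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 64,
}
print(json.dumps(payload))
# POST this body to http://localhost:8080/v1/chat/completions,
# e.g. curl -H 'Content-Type: application/json' -d @payload.json
```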
Tried the no-think template, no joy. The model cannot handle any kind of ambiguity in prompts and just talks to itself until it runs out of tokens on almost all prompts; it's completely unusable. I'll keep it around on my disk for a couple of weeks in case somebody over in llama.cpp land finds and fixes a bug for it. There is a chance it's overly sensitive to being quantized. I found the Qwen3 code next 80B MoE to be very sensitive to quantization degradation, and maybe this one is even more sensitive at the bigger 196B MoE scale.