Instructions to use steampunque/Qwen3-Coder-Next-MP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use steampunque/Qwen3-Coder-Next-MP-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="steampunque/Qwen3-Coder-Next-MP-GGUF", filename="Qwen3-Coder-Next.Q4_E_H.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use steampunque/Qwen3-Coder-Next-MP-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf steampunque/Qwen3-Coder-Next-MP-GGUF # Run inference directly in the terminal: llama-cli -hf steampunque/Qwen3-Coder-Next-MP-GGUF
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf steampunque/Qwen3-Coder-Next-MP-GGUF # Run inference directly in the terminal: llama-cli -hf steampunque/Qwen3-Coder-Next-MP-GGUF
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf steampunque/Qwen3-Coder-Next-MP-GGUF # Run inference directly in the terminal: ./llama-cli -hf steampunque/Qwen3-Coder-Next-MP-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf steampunque/Qwen3-Coder-Next-MP-GGUF # Run inference directly in the terminal: ./build/bin/llama-cli -hf steampunque/Qwen3-Coder-Next-MP-GGUF
Use Docker
docker model run hf.co/steampunque/Qwen3-Coder-Next-MP-GGUF
- LM Studio
- Jan
- Ollama
How to use steampunque/Qwen3-Coder-Next-MP-GGUF with Ollama:
ollama run hf.co/steampunque/Qwen3-Coder-Next-MP-GGUF
- Unsloth Studio new
How to use steampunque/Qwen3-Coder-Next-MP-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for steampunque/Qwen3-Coder-Next-MP-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for steampunque/Qwen3-Coder-Next-MP-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for steampunque/Qwen3-Coder-Next-MP-GGUF to start chatting
- Pi new
How to use steampunque/Qwen3-Coder-Next-MP-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf steampunque/Qwen3-Coder-Next-MP-GGUF
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "steampunque/Qwen3-Coder-Next-MP-GGUF" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use steampunque/Qwen3-Coder-Next-MP-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf steampunque/Qwen3-Coder-Next-MP-GGUF
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default steampunque/Qwen3-Coder-Next-MP-GGUF
Run Hermes
hermes
- Docker Model Runner
How to use steampunque/Qwen3-Coder-Next-MP-GGUF with Docker Model Runner:
docker model run hf.co/steampunque/Qwen3-Coder-Next-MP-GGUF
- Lemonade
How to use steampunque/Qwen3-Coder-Next-MP-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull steampunque/Qwen3-Coder-Next-MP-GGUF
Run and chat with the model
lemonade run user.Qwen3-Coder-Next-MP-GGUF-{{QUANT_TAG}}List all available models
lemonade list
good quant
it did well in my text really a good model if possiple kindly make quant for https://huggingface.co/stepfun-ai/Step-3.5-Flash people will really like it imo
it did well in my text really a good model if possiple kindly make quant for https://huggingface.co/stepfun-ai/Step-3.5-Flash people will really like it imo
@gopi87 Thanks. This one was pretty hard, took me a day of iterating the layers to finally get it tuned right, very finicky model about its quants. Agree it does seem quite good. I'll take a look
at the flash model too. I havent done it for myself because its going to be 5 tps on my rig right on the threshold of useable and due to slow tps will take a very long time to
optimize, and it will also need 128G RAM CPU which most dont have. I think this 80 3active is close to perfect layout I get 20tps on fairly old consumer hardware.
@gopi87 I did a Step-3.5-Flash quant today but performance is not acceptable to upload, it went 0-3 on the first part of my eval prompts and then I stopped testing as at that point its a beyond repair dumpster fire. It was a strong Q4_K_H sized at 117.6G with minimum quant of Q4_K_S across layers and it ran about 7.5tps on a 9900k / 4070. I am not going to upload models on my page that will not even remotely make it through my acceptance qual tests (sorry).
@gopi87 I did a Step-3.5-Flash quant today but performance is not acceptable to upload, it went 0-3 on the first part of my eval prompts and then I stopped testing as at that point its a beyond repair dumpster fire. It was a strong Q4_K_H sized at 117.6G with minimum quant of Q4_K_S across layers and it ran about 7.5tps on a 9900k / 4070. I am not going to upload models on my page that will not even remotely make it through my acceptance qual tests (sorry).
thanks for effort mate and try the no thinking version too
CUDA_VISIBLE_DEVICES="0" ./bin/llama-server
--model "/home/gopi/deepresearch-ui/model/stepfun-ai_Step-3.5-Flash-IQ4_XS-00001-of-00003.gguf"
--n-cpu-moe 49
-ngl 99
--ctx-size 40000
--threads 28
--threads-batch 28
--reasoning-budget 0
--host 0.0.0.0
--jinja
--port 8080
--temp 0.6
--top-p 0.95
i wast testing this version getting around 12t/sec
@gopi87 I did a Step-3.5-Flash quant today but performance is not acceptable to upload, it went 0-3 on the first part of my eval prompts and then I stopped testing as at that point its a beyond repair dumpster fire. It was a strong Q4_K_H sized at 117.6G with minimum quant of Q4_K_S across layers and it ran about 7.5tps on a 9900k / 4070. I am not going to upload models on my page that will not even remotely make it through my acceptance qual tests (sorry).
thanks for effort mate and try the no thinking version too
Tried no think template, no joy. The model cannot handle any kind of ambiguity in prompts and just talks to itself until it runs out of tokens on almost all prompts, its completely unusable. I'll keep it
around on my disk for a couple weeks in case somebody over in llama.cpp land finds and fixes a bug for it. There is a chance its overly sensitive to being quantized, I found Qwen3 code next 80B moe to be
very sensitive to quanitization degradation and maybe its way more sensitive at the bigger 196B moe scale.