Small report (IQ4_XS) & question: IQ4_XS or smol-IQ4_KSS
First, thank you for the wonderful quants. Using them all the time. ✨
I've been using MiniMax-M2.1 as my daily coding driver for quite some time (unsloth UD-Q3_K_XL quant) and was excited to see the M2.5 upgrade drop... Yesterday I pulled IQ4_XS immediately after you pushed them. May have been the first downloader...
I'm using a system with 4xNVIDIA A40, AMD EPYC 9334 32-Core, 1.5TB RAM:
Command line
~/ik_llama.cpp/build/bin/llama-server \
--alias MiniMax-M2.5 \
--model ~/models/ubergarm-MiniMax-M2.5-GGUF/IQ4_XS/MiniMax-M2.5-IQ4_XS-00001-of-00004.gguf \
--ctx-size 106496 \
--threads 28 \
--threads-batch 32 \
--grouped-expert-routing \
--split-mode-graph-scheduling \
--split-mode graph \
--max-extra-alloc 256 \
--n-gpu-layers 99 \
--n-cpu-moe 34 \
--tensor-split 10,12,12,12 \
--ubatch-size 4096 \
--batch-size 4096 \
--parallel 1 \
--host 127.0.0.1 \
--port 15647 \
--no-mmap \
--cache-type-k f16 \
--cache-type-v f16 \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--no-display-prompt \
--jinja
Always trying to juggle bpw, context size, KV quant, etc. to accommodate at least 100k context. But yeah, as I only have 100 GB of free VRAM I sadly need to offload... You may wonder about the f16 cache type. I'm under the impression that low KV quant is bad, especially for longer context runs. So if I can, I try not to quantize the cache... 🤷
Getting around PP 450, TG 28 T/sec, which is ok-ish for coding tasks. Using OpenCode as harness + OpenSpec with some customizations. And MiniMax-M2.5-IQ4_XS has been very solid for me after throwing long-context real coding tasks at it. Good tool calling, pretty good quality code.
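To put those PP/TG figures in perspective, here is a minimal back-of-the-envelope sketch. It assumes both speeds stay constant, which is optimistic since they usually drop as context grows, and it ignores prompt caching. The function name and the example request sizes are made up for illustration.

```python
def request_seconds(prompt_tokens: int, gen_tokens: int,
                    pp_tps: float = 450.0, tg_tps: float = 28.0) -> float:
    """Estimate wall-clock seconds: prompt processing time plus generation time.

    Defaults are the PP/TG numbers reported above; constant speed is an assumption.
    """
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


if __name__ == "__main__":
    # Hypothetical request sizes, not measurements.
    for prompt, gen in [(2_000, 500), (50_000, 1_000), (100_000, 1_000)]:
        secs = request_seconds(prompt, gen)
        print(f"{prompt:>7} prompt + {gen:>5} generated tokens: ~{secs:.0f} s")
```

With an uncached long prompt, prompt processing dominates, which is why prompt caching matters so much for agentic use.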
So, all in all, it has been a decent upgrade from M2.1 and I'm currently using it as my primary coding model.
IQ4_XS vs. smol-IQ4_KSS
I've seen you recently added an ik_llama-specific quant in the same size category: smol-IQ4_KSS
I wonder if I should use that instead. I can see the technical differences but am too stupid with quants to make an educated decision...
| Layer | IQ4_XS | smol-IQ4_KSS |
|---|---|---|
| token_embd.weight | q4_K | iq4_k |
| output.weight | q6_K | iq6_k |
| ffn_down_exps | iq4_xs | iq4_kss |
| ffn_(gate\|up)_exps | iq4_xs | iq4_kss |
Info on that stuff is kind of sparse and often very technical... Can you shed some light? Thanks!
Wow thanks for the very thoughtful and detailed report, I'm impressed!
So you have 4x A40s which are sm86 arch and 48GB VRAM each but you only have 100GB VRAM free? (guessing you keep other models loaded as well or something?)
You have plenty of DRAM and a solid CPU so a very good rig for local ai you have there!
Command Line
Honestly, that is a very clean and solid looking command! And since you can create llama-sweep-bench plots you can definitely dial in as much as you'd like and observe clearly how it is performing.
A few possible things to try:
- If you can free up some VRAM, you could use all 4x GPUs given `-sm graph` works quite well for that (and also works for hybrid CPU+GPUs as you noticed)
- You could try to quantize cache using `-khad -ctk q6_0 -ctv q8_0` which should hold up okay without too much loss, but I wouldn't go lower than that. Details on khad here: https://github.com/ikawrakow/ik_llama.cpp/pull/1033 (it works on GPU too in more recent PRs). The tl;dr is khad can improve quality of quantized k cache, so you can go a bit smaller on it. Avoid quantizing v cache as much though.
- Try the new speculative decoding stuff from here: https://github.com/ikawrakow/ik_llama.cpp/pull/1261
- Experiment with different slimmer coding harness e.g. oh-my-pi (i still haven't tried it myself, haha)
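To get a feel for how much VRAM the cache-quantization suggestion could free up, here is a rough sketch of the usual KV cache size arithmetic. Everything model-specific in it is an assumption: the layer count, KV head count, and head dimension are hypothetical placeholders (not MiniMax-M2.5's real config), and the bits-per-element figures assume 32-element blocks with one fp16 scale each for `q8_0` and `q6_0`. Only the context size is taken from the command earlier in the thread.

```python
# Bits per cached element, including per-block scales.
# f16 is exact; the q8_0 / q6_0 values are assumptions (32-element blocks + fp16 scale).
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q6_0": 6.5}


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int, ctx_size: int,
                 type_k: str = "f16", type_v: str = "f16") -> float:
    """Approximate KV cache size in GiB for a standard attention layout."""
    # Number of elements stored for K (and the same again for V).
    elems = n_layers * n_kv_heads * head_dim * ctx_size
    bits = elems * (BITS_PER_ELEMENT[type_k] + BITS_PER_ELEMENT[type_v])
    return bits / 8 / 2**30


if __name__ == "__main__":
    # HYPOTHETICAL model shape, for illustration only.
    shape = dict(n_layers=60, n_kv_heads=8, head_dim=128, ctx_size=106496)
    print(f"f16 / f16  : {kv_cache_gib(**shape):.2f} GiB")
    print(f"q6_0 / q8_0: {kv_cache_gib(**shape, type_k='q6_0', type_v='q8_0'):.2f} GiB")
```

Plug in the real values from the model's GGUF metadata to get a usable number; the ratio between the two configurations does not depend on the model shape.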
> I wonder if I should use that instead.
Generally, try to get the lowest perplexity quant that fits in your RAM+VRAM and runs fast enough for your workload at the desired context depth. Given you have CUDA you can use the smol-IQ4_KSS and as you see it performs pretty well. To confuse things more, there is a new IQ4_NL which has surprisingly low perplexity. I have not checked the KLD to confirm it is actually "better" in terms of less deviation from the full size model, but it is probably worth more research.
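For readers unfamiliar with the two metrics, a minimal sketch of what they measure: perplexity scores a model against reference text, while KLD compares the quant's next-token distribution against the full-size model's. The function names and the toy numbers below are illustrative only and are not output from any real quant.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) of the reference tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def kl_divergence(p, q):
    """KL(p || q) in nats; p = full-size model's distribution, q = quant's."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Toy data: log-probabilities the model assigned to four reference tokens.
    print(perplexity([math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]))

    # Toy next-token distributions over a 3-token vocabulary.
    full_model = [0.7, 0.2, 0.1]
    quantized = [0.6, 0.3, 0.1]
    print(kl_divergence(full_model, quantized))
```

A quant can land on a low perplexity while still shifting the distribution away from the original model, which is why KLD is the more direct check of "deviation from the full size model".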
Anyway, you're doing great! Keep us posted on what you continue to discover especially with actual agentic coding results! Cheers!
Oh there may be some wins tuning prompt caching stuff like --cache-ram XXXX or other things, I need to learn more about how this could help out with agentic use as well.
> - Try the new speculative decoding stuff

Yeah, saw that! So much stuff to try... and more knobs to adjust. Love it...
I'm unsure about good parameters for coding, currently using:
--spec-type ngram-map-k4v \
--spec-ngram-size-n 6 \
--spec-ngram-size-m 4 \
--draft-min 1 \
--draft-max 8 \
--draft-p-min 0.2 \
Not sure though if it's really faster, doesn't seem to slow it down at least...
> - Experiment with different slimmer coding harness e.g. oh-my-pi (i still haven't tried it myself, haha)

Very interesting stuff indeed. I actually tried omp, but I guess I'm too locked in with OpenCode already, which works best for me all things considered...
omp stops execution in the middle of tasks, and file editing also fails often despite their hashline thingy. And I'm too lazy/busy to start debugging omp code...
OpenCode issues regarding more robust file editing:
- [Tracking] Edit tool reliability: "modified since last read" errors (Undo/Redo & Persistence)
- [FEATURE]: Add a new experimental "hashline" edit mode (this one explicitly references omp)
> there is a new `IQ4_NL`
Yep, already glancing at it. Just decided to wait a bit though as it's another hefty 130 GB to add to the download queue...
> Not sure though if it's really faster, doesn't seem to slow it down at least...
That was my experience, haha... I'm not sure exactly what kinds of workloads it would benefit e.g. repetitive editing JSON data or something? I wish opencode had a way to track PP and TG speeds live and over time as context grows or something.
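A toy sketch of the general n-gram lookup drafting idea may help show why repetitive output is where it could pay off. This is not ik_llama.cpp's implementation: the function name is hypothetical, and the `n` and `m` parameters here are illustrative and are not claimed to match the semantics of the `--spec-ngram-size-n` / `--spec-ngram-size-m` flags.

```python
def draft_from_history(tokens, n=3, m=4):
    """Propose up to m draft tokens by reusing what followed the most recent
    earlier occurrence of the last n tokens. Returns [] if there is no match."""
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Scan backwards over earlier positions, skipping the trailing n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + m]
    return []


if __name__ == "__main__":
    # Repetitive sequence: the tail [1, 2, 3] was seen before, so a draft exists.
    print(draft_from_history([1, 2, 3, 4, 5, 1, 2, 3], n=3, m=2))
    # Novel sequence: nothing to reuse, so no draft and no speedup.
    print(draft_from_history([1, 2, 3, 4, 5, 6], n=3, m=2))
```

The main model still verifies every drafted token, so a miss costs little, but a gain only appears when the output keeps repeating earlier context, as in repetitive edits of structured data.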
> Yep, already glancing at it. Just decided to wait a bit though as it's another hefty 130 GB to add to the download queue...
If you want even more to download, I'm working on a big boi: https://huggingface.co/ubergarm/GLM-5-GGUF
It would be slower, but might be good as a first pass model to initialize the project, then let the smaller models do refactors after context gets big?
I haven't uploaded everything yet, still fishing:


