Small report (IQ4_XS) & question: IQ4_XS or smol-IQ4_KSS

by tschunschi - opened Feb 14

Discussion

tschunschi

Feb 14

First, thank you for the wonderful quants. Using them all the time. ✨

I've been using MiniMax-M2.1 as my daily coding driver for quite some time (unsloth UD-Q3_K_XL quant) and was excited to see the M2.5 upgrade drop... Yesterday I pulled IQ4_XS immediately after you pushed it. May have been the first downloader... 😅

I'm using a system with 4xNVIDIA A40, AMD EPYC 9334 32-Core, 1.5TB RAM:

Command line

~/ik_llama.cpp/build/bin/llama-server \
  --alias MiniMax-M2.5 \
  --model ~/models/ubergarm-MiniMax-M2.5-GGUF/IQ4_XS/MiniMax-M2.5-IQ4_XS-00001-of-00004.gguf \
  --ctx-size 106496 \
  --threads 28 \
  --threads-batch 32 \
  --grouped-expert-routing \
  --split-mode-graph-scheduling \
  --split-mode graph \
  --max-extra-alloc 256 \
  --n-gpu-layers 99 \
  --n-cpu-moe 34 \
  --tensor-split 10,12,12,12 \
  --ubatch-size 4096 \
  --batch-size 4096 \
  --parallel 1 \
  --host 127.0.0.1 \
  --port 15647 \
  --no-mmap \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 40 \
  --no-display-prompt \
  --jinja

Always trying to juggle bpw, context size, KV quant, etc. to accommodate at least 100k context. But yeah, as I only have 100 GB of free VRAM I sadly need to offload... You may wonder about the f16 cache type. I'm under the impression that low KV quant is bad, especially for longer context runs. So if I can, I try not to quantize the cache... 🤷
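To put a rough number on the context-size versus KV-quant trade-off, here is a minimal sketch of the usual KV cache size estimate. The layer count, KV head count, and head dimension below are placeholder values chosen for illustration, not MiniMax-M2.5's actual configuration; the real values would come from the GGUF metadata.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bits_k=16.0, bits_v=16.0):
    """Bytes needed to hold K and V for n_ctx tokens across all layers."""
    # Elements stored per token for K (V stores the same number).
    elems_per_token = n_layers * n_kv_heads * head_dim
    return n_ctx * elems_per_token * (bits_k + bits_v) / 8

# Placeholder architecture values (assumption, NOT the real model config).
arch = dict(n_layers=60, n_kv_heads=8, head_dim=128)

for n_ctx in (32768, 65536, 106496):
    gib = kv_cache_bytes(n_ctx, **arch) / 2**30
    print(f"ctx {n_ctx:>6}: {gib:.2f} GiB of f16 KV cache")
```

The cache grows linearly with context, so every GiB it takes comes out of the VRAM left for model weights, which is what forces more expert layers onto the CPU.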

Getting around PP 450, TG 28 T/sec, which is ok-ish for coding tasks. Using OpenCode as harness + OpenSpec with some customizations. And MiniMax-M2.5-IQ4_XS has been very solid for me after throwing long-context real coding tasks at it. Good tool calling, pretty good quality code.
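As a back-of-the-envelope check on what those speeds mean per turn, the sketch below converts them into wall-clock time. It assumes PP and TG stay constant at 450 and 28 T/sec, which is optimistic because both usually drop as context deepens, and the "warm" case assumes most of the prompt is reused from the prompt cache; the token counts are illustrative.

```python
def turn_seconds(prompt_tokens, new_tokens, pp_tps=450.0, tg_tps=28.0):
    """Time to process the prompt plus time to generate the reply."""
    return prompt_tokens / pp_tps + new_tokens / tg_tps

# Cold start: the full 106496-token context has to be processed.
cold = turn_seconds(prompt_tokens=106496, new_tokens=1000)
# Warm turn: assume only 2000 new prompt tokens, the rest is cached.
warm = turn_seconds(prompt_tokens=2000, new_tokens=1000)

print(f"cold: {cold / 60:.1f} min, warm: {warm:.0f} s")
```

Under these assumptions a cold full-context turn is dominated by prompt processing, while a warm turn is dominated by generation.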

So, all in all, it has been a decent upgrade from M2.1 and I'm currently using it as my primary coding model. 🚀

IQ4_XS vs. smol-IQ4_KSS

I've seen you recently added an ik_llama-specific quant in the same size category: smol-IQ4_KSS
I wonder if I should use that instead. I can see the technical differences but am too stupid with quants to make an educated decision... 🫠

| Layer | IQ4_XS | smol-IQ4_KSS |
| --- | --- | --- |
| token_embd.weight | q4_K | iq4_k |
| output.weight | q6_K | iq6_k |
| ffn_down_exps | iq4_xs | iq4_kss |
| ffn_(gate\|up)_exps | iq4_xs | iq4_kss |

Info on that stuff is kind of sparse and often very technical... Can you shed some light? Thanks!

tschunschi

Feb 14

ubergarm

Owner Feb 14

@tschunschi

Wow thanks for the very thoughtful and detailed report, I'm impressed!

So you have 4x A40s which are sm86 arch and 48GB VRAM each but you only have 100GB VRAM free? (guessing you keep other models loaded as well or something?)

You have plenty of DRAM and a solid CPU so a very good rig for local ai you have there!

Command Line

Honestly, that is a very clean and solid looking command! And since you can create llama-sweep-bench plots you can definitely dial in as much as you'd like and observe clearly how it is performing.

A few possible things to try:

- If you can free up some VRAM, you could use all 4x GPUs given -sm graph works quite well for that (and also works for hybrid CPU+GPUs as you noticed)
- You could try to quantize cache using -khad -ctk q6_0 -ctv q8_0 which should hold up okay without too much loss, but I wouldn't go lower than that. Details on khad here: https://github.com/ikawrakow/ik_llama.cpp/pull/1033 (it works on GPU too in more recent PRs). The tl;dr is that khad can improve the quality of a quantized K cache, so you can go a bit smaller on it. Avoid quantizing the V cache as much, though.
- Try the new speculative decoding stuff from here: https://github.com/ikawrakow/ik_llama.cpp/pull/1261
- Experiment with different slimmer coding harness e.g. oh-my-pi (i still haven't tried it myself, haha)
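For a sense of how much the suggested cache settings would save relative to f16, here is a small sketch. The bits-per-element figures are assumptions based on the usual block layouts (32 elements plus a 16-bit scale per block, giving 8.5 bits for q8_0 and 6.5 bits for q6_0); they say nothing about quality, only size.

```python
# Assumed storage cost per cached element, in bits.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q6_0": 6.5}

def relative_cache_size(ctk, ctv, base_k="f16", base_v="f16"):
    """Size of a K/V cache with types (ctk, ctv) relative to the baseline types."""
    new_bits = BITS_PER_ELEMENT[ctk] + BITS_PER_ELEMENT[ctv]
    base_bits = BITS_PER_ELEMENT[base_k] + BITS_PER_ELEMENT[base_v]
    return new_bits / base_bits

ratio = relative_cache_size("q6_0", "q8_0")
print(f"q6_0 K + q8_0 V cache is {ratio:.1%} the size of an f16 cache")
```

With these assumed figures the cache shrinks to a bit under half, which could either free VRAM for more layers on GPU or allow a deeper context at the same budget.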

> I wonder if I should use that instead.

Generally, try to get the lowest perplexity quant that fits in your RAM+VRAM that runs fast enough for your workload with the desired context depth. Given you have CUDA you can use the smol-IQ4_KSS and as you see it performs pretty well. To confuse things more, there is a new IQ4_NL which has surprisingly low perplexity. I have not checked the KLD to confirm it is actually "better" in terms of less deviation from the full size model, but worth more research probably.
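The distinction between perplexity and KLD can be illustrated with a toy example. The sketch below uses made-up three-token probability distributions (not measurements of any quant) to show how a quant can score a lower perplexity on the reference text while still deviating more from the full-size model's distribution.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood of the reference tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def kl_divergence(p_full, q_quant):
    """KL(P || Q) at one position: P from the full model, Q from the quant."""
    return sum(p * math.log(p / q) for p, q in zip(p_full, q_quant) if p > 0)

# Toy distributions over a 3-token vocabulary; the reference token is index 0.
full = [0.60, 0.30, 0.10]
quant_a = [0.70, 0.10, 0.20]   # more confident on the reference token
quant_b = [0.58, 0.31, 0.11]   # stays close to the full model

for name, dist in (("full", full), ("quant_a", quant_a), ("quant_b", quant_b)):
    ppl = perplexity([math.log(dist[0])])
    kld = kl_divergence(full, dist)
    print(f"{name:8s} perplexity={ppl:.3f}  KLD vs full={kld:.4f}")
```

In this toy case quant_a has the lowest perplexity but the largest divergence from the full model, which is why a surprisingly low perplexity alone does not settle which quant is more faithful.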

Anyway, you're doing great! Keep us posted on what you continue to discover especially with actual agentic coding results! Cheers!

ubergarm

Owner Feb 14

Oh, there may be some wins from tuning prompt caching with options like --cache-ram XXXX. I need to learn more about how this could help with agentic use as well.

tschunschi

Feb 15

•

edited Feb 15

> Try the new speculative decoding stuff

Yeah, saw that! So much stuff to try... and more knobs to adjust. Love it... 😅

I'm unsure about good parameters for coding, currently using:

  --spec-type ngram-map-k4v \
  --spec-ngram-size-n 6 \
  --spec-ngram-size-m 4 \
  --draft-min 1 \
  --draft-max 8 \
  --draft-p-min 0.2 \

Not sure though if it's really faster, doesn't seem to slow it down at least...
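For intuition about what n-gram drafting does, here is a generic lookup sketch: match the last n tokens against earlier history and propose the m tokens that followed. This is not the ik_llama.cpp ngram-map-k4v implementation; the mapping of n and m onto --spec-ngram-size-n and --spec-ngram-size-m is an assumption based on the flag names.

```python
def ngram_draft(tokens, n=6, m=4):
    """Propose up to m draft tokens by finding the last n tokens earlier in the history."""
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Walk backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + m]
    return []

# Repetitive history: the tail "a b c d e f" already appeared at the start.
history = ["a", "b", "c", "d", "e", "f", "g", "X", "a", "b", "c", "d", "e", "f"]
print(ngram_draft(history))          # draft continues with what followed last time
print(ngram_draft(["p", "q", "r"]))  # too short to match anything
```

A draft only helps when the model accepts it, so the gain depends on how often the output repeats earlier context, such as re-emitting file contents during edits. Novel code gives few matches, which would be consistent with seeing no clear speedup.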

> Experiment with different slimmer coding harness e.g. oh-my-pi (i still haven't tried it myself, haha)

Very interesting stuff indeed. I actually tried omp, but I guess I'm too locked in with OpenCode already which works best for me all things considered...
omp is stopping execution in the middle of tasks, and file editing also fails often despite their hashline thingy. And I'm too lazy/busy to start debugging omp code, though...

OpenCode issues regarding more robust file editing:

- [Tracking] Edit tool reliability: "modified since last read" errors (Undo/Redo & Persistence)
- [FEATURE]: Add a new experimental "hashline" edit mode 👈️ this one explicitly references omp

> there is a new IQ4_NL

Yep, already glancing at it. Just decided to wait a bit though as it's another hefty 130 GB to add to the download queue...

ubergarm

Owner Feb 15

•

edited Feb 15

> Not sure though if it's really faster, doesn't seem to slow it down at least...

That was my experience, haha... I'm not sure exactly what kinds of workloads it would benefit, e.g. repetitive editing of JSON data or something? I wish opencode had a way to track PP and TG speeds live and over time as context grows or something.
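One possible way to track PP and TG speeds outside the harness is to log the per-request timing data the server returns. The sketch below assumes each response carries a `timings` object with `prompt_per_second` and `predicted_per_second`, as mainline llama.cpp's llama-server does. Whether ik_llama.cpp's server exposes the same fields, and whether they are reachable when OpenCode sits in between, is an assumption. The sample responses are made up purely to exercise the code.

```python
import csv
import io

def speed_row(response, depth):
    """Pull PP/TG speeds out of one server response; missing fields become None."""
    timings = response.get("timings") or {}
    return {
        "depth": depth,
        "pp_tps": timings.get("prompt_per_second"),
        "tg_tps": timings.get("predicted_per_second"),
    }

def to_csv(rows):
    """Render the collected rows as CSV text for plotting later."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["depth", "pp_tps", "tg_tps"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Made-up responses (context depth, response body); not measurements.
fake_responses = [
    (8192, {"timings": {"prompt_per_second": 100.0, "predicted_per_second": 10.0}}),
    (65536, {"timings": {"prompt_per_second": 80.0, "predicted_per_second": 8.0}}),
]
rows = [speed_row(resp, depth) for depth, resp in fake_responses]
print(to_csv(rows))
```

Logging one row per request with the current context depth would give a speed-versus-depth curve from real agentic sessions rather than synthetic benchmarks.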

> Yep, already glancing at it. Just decided to wait a bit though as it's another hefty 130 GB to add to the download queue...

If you want even more to download, I'm working on a big boi: https://huggingface.co/ubergarm/GLM-5-GGUF

It would be slower, but might be good as a first pass model to initialize the project, then let the smaller models do refactors after context gets big?

I haven't uploaded everything yet, still fishing:
