Looking forward to IQ4_XS!

#1
by tarruda - opened

Your IQ4_XS quants for Step-3.5-Flash have been the best!

@tarruda

Thanks! I've seen you around reddit and such, thanks for sharing the good word!

I might be able to do a similar quant at iq4_xs here for the mainline folks. I generally don't do mainline and stick to ik, but I don't think anyone else does the same iq4_xs recipe that I have. I'm guessing AesSedai will likely have some solid mainline quants out tonight too.

Oh hey, have you tried pi.dev or the newer oh-my-pi agentic coding harness instead of opencode etc.? Someone sent me this blog and I've not tried it yet: http://blog.can.ac/2026/02/12/the-harness-problem/

reminds me of this great clip:
https://www.tiktok.com/@startupcode.net/video/7605360360727547150

> Oh hey, have you tried pi.dev

Funny that you ask, I just installed and tested pi.dev today for the first time.

Didn't do much with it yet, but it seems pretty good, with small initial context usage, which is great for speeding up initial responses from LLMs like Step 3.5 Flash.

Gonna have a look at the oh-my-pi fork, thanks for sharing!

@tarruda

Currently uploading IQ4_XS 114.842 GiB (4.314 BPW), compatible with both ik and mainline llama.cpp.

Can you make an IQ4_NL?

@LagOps

Heya! In general I only use ik's IQ4_NL if tensor dimensions don't work with the newer quantization types and require it.

Just curious why you would even want it? I assume most backend implementations for IQ4_NL are less optimized than similarly sized q4_0 and q4_K, even for Vulkan etc.

In my testing IQ4_NL ran a bit faster than Q4_K quants, and perplexity-wise it's a bit better than Q4_K_S with the same memory footprint. Q4_K_S/M would work for me as well.

Hey John, a million thanks for the effort, man! 🙏✌️

@LagOps

Huh, what is your inference rig setup? Full GPU? Hybrid CPU+GPU? Are you using mainline, ik, or something downstream?

> ...a bit better than Q4_K_S with the same memory footprint. Q4_K_S/M would work for me as well.

Oh, I don't use any of the "normal recipes" and only do custom. My attn.* tensors are all full q8_0, which is actually much better than the default recipes like Q4_K_S or Q4_K_M. So it isn't exactly comparable unless you benchmark it yourself to find out.
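To give a rough idea of what I mean by "custom", here's a sketch of the kind of `--custom-q` recipe I feed to ik_llama.cpp's llama-quantize; the tensor regexes, types, and file names below are just illustrative placeholders, not my exact recipe for this model:

```bash
#!/usr/bin/env bash
# Rough sketch only: the regex=type pairs are illustrative, not the exact recipe.
custom="
# Keep all attention tensors at full q8_0 (small size cost, noticeable quality win)
blk\..*\.attn_.*=q8_0
# Quantize the routed experts harder, they hold most of the weights
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_xs
blk\..*\.ffn_down_exps\.weight=iq4_xs
"
# Strip the comment lines and join the pairs with commas for --custom-q
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::')

./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    --custom-q "$custom" \
    Model-BF16.gguf Model-IQ4_XS.gguf IQ4_XS
```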

I use llama-sweep-bench and make some graphs for all my comparisons, and keep a branch here if you want to test against mainline: https://github.com/ubergarm/llama.cpp/tree/ug/port-sweep-bench
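For reference, a minimal sweep-bench run looks roughly like this; the model path, context size, offload pattern, and thread count are just placeholders for whatever rig you're testing:

```bash
# Minimal llama-sweep-bench sketch: sweeps pp/tg speed across context depth so you
# can graph and compare quants or forks. Paths and sizes are placeholders.
./build/bin/llama-sweep-bench \
    -m Model-IQ4_XS.gguf \
    -c 32768 \
    -fa \
    -ngl 99 \
    -ot exps=CPU \
    --threads 16
```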

I'll try to get some speed benchmarks up eventually, looking for a good ~75ish GiB size that could run on 96GB VRAM with ~128k context.

In limited testing the smol-IQ3_KS is working well with opencode right now so that is a good sign!

I'm happy with any custom quants as well, of course! And yes, keeping attention a bit higher is usually worth it.

I'm running mainline llama.cpp with a 7900 XTX (24 GB) and 128 GB DDR5 RAM using Vulkan (ik_llama.cpp didn't work well last time I tried it and also didn't support the new quants on Vulkan). As long as it fits my setup and has CPU-friendly quants for the routed experts, I'll gladly take it!

Just out of curiosity, is there a reason not to use IQ4_NL, at least in comparison to mainline K quants? In my own testing it was consistently the best option, and when it comes to standard recipes it was close to Q4_K_M while being the size of Q4_K_S.

@LagOps

Ahh okay, you are using an AMD GPU with the Vulkan backend. Correct, ik_llama.cpp does support Vulkan, but only for older quantization types (not the newest SOTA quants). If you're doing hybrid CPU + GPU you can make a custom quant using Vulkan-optimized types for the tensors you offload and CPU-optimized types for the CPU side. I don't release anything that specific, but you could adapt my recipes and use my imatrix if you like.
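The hybrid offload itself is the same pattern on mainline and ik as far as I know: load all layers on the GPU and then override the routed experts back onto the CPU. Roughly like this, with the path, context size, and thread count as placeholders:

```bash
# Hybrid CPU+GPU sketch: attention/dense tensors on GPU, routed experts on CPU.
# Paths, context size, and thread count are placeholders for your own rig.
./build/bin/llama-server \
    -m Model-IQ4_XS.gguf \
    -c 65536 \
    -ngl 99 \
    --override-tensor "exps=CPU" \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```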

> Just out of curiosity, is there a reason not to use IQ4_NL, at least in comparison to mainline K quants? In my own testing it was consistently the best option, and when it comes to standard recipes it was close to Q4_K_M while being the size of Q4_K_S.

If you're finding iq4_nl quanted tensors beating q4_K quanted tensors, giving you better perplexity and speed for your specific offload strategy, then that is totally fine! iq4_nl is probably kind of a precursor to the later ik_llama.cpp-specific types. I'm just surprised that it has better speed optimizations for routed exps (mostly CPU inferencing), is all.

Hi UBG, before the avalanche of people requesting quants, do you think you'll make one for people with your and my setup of 48 GB of VRAM and *waves hands* RAM? I really appreciated your stepfun quant IQ4_KSS; I'm getting about 16 tps / 150 pp with it.

I know compute isn't free; if you want to drop a recipe I'll see if I can do it with my 64 GB DDR4 / 2x 3090 rig if you're tied up. Thanks again, boss!

@jpbwin

Let's see, you have 48 GB VRAM + 64 GB DDR4 = 112 GB total.

The smol-IQ3_KS 87.237 GiB (3.277 BPW) would work for you and give you plenty of context for agentic programming. I've tested it with opencode and it seems to be running pretty well.

I may consider a ~4.0 BPW smol-IQ4_KSS, but whether I release it depends on the resulting perplexity. I'll at least give it a try soon.

@ubergarm The speed increase isn't very high, about 5%, maybe a bit higher. But since KLD was slightly better as well in my testing when comparing same-size standard quants, I have come to prefer that kind of quant. I tried IQ4_XS quants as well, but those were slow for me on CPU, roughly a 20-25% speed drop. I would be happy with any kind of Q4 K-quant recipe if you feel like cooking it up. No worries if it's not a priority; I can wait for someone to do that kind of quant down the line as well.

Making my own quant is an option too, but my internet isn't the best and I'm also short on storage, so I think I'll pass this time around. It's usually worthwhile for smaller quants where some customizing can get significant gains, but in the Q4 range I'm not so picky (as long as it's a CPU-friendly quant for the experts).

@LagOps

> I tried IQ4_XS quants as well, but those were slow for me on CPU, roughly a 20-25% speed drop.

Yeah, unfortunately iq4_xs hasn't been as optimized in mainline llama.cpp for non-CUDA folks. I've heard some Mac folks complaining too, hah..

> I would be happy with any kind of Q4 K-quant recipe if you feel like cooking it up.

I'll leave that to the usual mainline suspects, bartowski and AesSedai, plus a newcomer, @ox-ox , who has some fine-looking vanilla recipes from mainline llama.cpp as well over at https://huggingface.co/ox-ox/MiniMax-M2.5-GGUF

I'll noodle more on IQ4_NL; given your description, it might interestingly be one of the most solid options for mainline Vulkan users...

Will do, thanks for the recommendation of quants from @ox-ox - I didn't know them by name and held off from downloading.

@LagOps

Yeah, I tested @ox-ox 's Q3_K_L and it seems to be working fine. You'll see slightly more TG, given the smaller attn.* reduces the active weights pulled from memory for each token generated, but at a slight cost to quality versus full q8_0 attn. It's all trade-offs in this fun game!

@ubergarm Thanks for the shoutout! I appreciate the trust.

@LagOps Welcome! I focus on standard mainline recipes to ensure maximum compatibility across backends (Metal, Vulkan, etc).

Since you have a beastly setup (128GB RAM + 24GB VRAM), you have way more headroom than my 128GB Unified Memory. My Q3_K_L will fly on your rig, but if you really want to saturate your memory with a standard Q4_K_M (~132GB), let me know and I can fire up the stove and cook one for you tonight.

@ox-ox Thanks a lot for the offer! I am a bit confused, however; a Q4_K_M is already available from you.

@LagOps

Well, who would have thunk it?! The IQ4_NL has the lowest perplexity lol... Not 100% sure why, or whether it is technically better or not (didn't check the KLD stats or anything). It is a bit bigger than the IQ4_XS too, and at 121.386 GiB (4.559 BPW) the IQ4_NL is probably too big even for a 128GB rig.

Should be done uploading any moment now!

Oh wait, if you're on Vulkan I think you'd need the mainline-compat version just because of those darn token_embd and output tensors... well, I guess I can upload the mainline-compat version too lol, why not, I already made it... one sec..

Okay, both versions are available. You should probably pick the mainline-IQ4_NL, which uses token_embd@q4_K and output@q6_K, so it's gucci for Vulkan!
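If anyone wants to roll their own mainline-compatible version, those two overrides map to mainline llama-quantize's embedding and output type flags; a rough sketch, with the file names as placeholders:

```bash
# Sketch of a mainline-compatible IQ4_NL where token_embd and output get
# Vulkan-friendly types. File names here are placeholders.
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    --token-embedding-type q4_K \
    --output-tensor-type q6_K \
    Model-BF16.gguf Model-IQ4_NL.gguf IQ4_NL
```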

@jpbwin

Well, the smol-IQ4_KSS 108.671 GiB (4.082 BPW) looks like pretty good perplexity for the size, so I decided to ship it. Probably good for folks with 128GB total and CUDA, but likely too tight for enough context on your rig.

The smaller sizes will probably have to do then, and quantizing the kv-cache with -khad -ctk q6_0 -ctv q8_0 is probably a decent trade-off.
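For the kv-cache part, that just means passing the cache type flags to llama-server. A rough sketch (q6_0 is an ik_llama.cpp-specific cache type, mainline would use q8_0/q4_0 instead; the path, context size, and thread count are placeholders):

```bash
# Sketch: quantized kv-cache on ik_llama.cpp; -fa (flash attention) is needed
# for a quantized V cache. Paths and sizes are placeholders for your rig.
./build/bin/llama-server \
    -m Model-smol-IQ3_KS.gguf \
    -c 65536 \
    -fa \
    -ctk q6_0 -ctv q8_0 \
    -ngl 99 -ot exps=CPU \
    --threads 16
```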

@ubergarm

Thanks for looking into it and for providing the quant! The fit is no issue, as there's the GPU as well; I could even push 5-10 GB more with some squeezing at 32k context. And yeah, good to see you were able to reproduce those results (at least on PPL, but KLD was slightly better for me as well). It's true that IQ4_XS is a bit smaller, but as I mentioned, it's not a CPU-friendly quant, and out of the CPU-friendly mainline quants (well, just Q4_K, really), IQ4_NL was clearly the best option in my testing.

Edit: the IQ4_NL you cooked up has some crazy good PPL. I don't think the KLD would be quite that amazing (I don't think it beats Q5_K), but still...

It might be worth increasing the quantization level for the input/output tensors in future quants, as it seems like there could be greater gains there (the ik quant version is better by a significant margin). For such a large model, spending a bit more on those tensors isn't overly costly.
