What is your text for ppl?

#3
opened by ox-ox

Just ran PPL on my Q3_K_L (110.22 GiB). Got a Final PPL of 8.2213 (+/- 0.09) on WikiText-2. It seems that going the FP8 -> F16 Master -> Q3_K_L route really paid off compared to standard quants. It beats the IQ4_XS efficiency curve while fitting perfectly in 128GB RAM at 28.7 t/s.

Heya, be very careful with the exact command and corpus you use when attempting to compare perplexity across various versions.

I just updated my logs/perplexity* files with the actual command if you want to check, e.g.:

  model=/mnt/raid/hf/MiniMax-M2.5-GGUF/IQ5_K/MiniMax-M2.5-IQ5_K-00001-of-00005.gguf

  numactl -N "$SOCKET" -m "$SOCKET" \
  ./build/bin/llama-perplexity \
      -m "$model" \
      -f wiki.test.raw \
      --seed 1337 \
      --ctx-size 512 \
      -ub 4096 -b 4096 \
      --numa numactl \
      --threads 96 \
      --threads-batch 128 \
      --validate-quants \
      --no-mmap

The seed doesn't actually matter as sampling isn't used here for perplexity.

You can get the wiki.test.raw file like so, as described in the referenced quant cooker's guide (now somewhat outdated): https://github.com/ikawrakow/ik_llama.cpp/discussions/434

$ wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ sha1sum wiki.test.raw
6f1fe2054a940eebfc76b284b09680763b37f5ea  wiki.test.raw

It seems that going the FP8 -> F16 Master -> Q3_K_L route really paid off compared to standard quants.

Be careful again here: what exactly do you mean by F16? There is a difference between bf16, fp16, and fp8e4m3 in terms of dynamic range and precision. I used the mainline llama.cpp convert_hf_to_gguf.py as designed to convert the original safetensors to a bf16 GGUF, then quantized from that, which is the best way to guard against clipping (clipping would be unlikely here anyway, since only a few original tensors are bf16 and most are fp8e4m3).
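
In case it helps, that two-step route looks roughly like this (paths and output names are placeholders, not my exact invocation):

$ # 1) upcast the original safetensors to a full bf16 GGUF master
$ python convert_hf_to_gguf.py /path/to/MiniMax-M2.5 --outtype bf16 --outfile MiniMax-M2.5-BF16.gguf
$ # 2) quantize from that bf16 master
$ ./build/bin/llama-quantize MiniMax-M2.5-BF16.gguf MiniMax-M2.5-IQ4_XS.gguf IQ4_XS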

It beats the IQ4_XS efficiency curve while fitting perfectly in 128GB RAM at 28.7 t/s.

Point me to your repo and I could likely test the perplexity of your quant on my same rig (there are small differences depending on backend, e.g. CPU vs GPU). Also feel free to share your workflow and steps, and if you're into cooking quants I'd suggest checking out the Beaver AI Discord where many quant cookers hang out: https://huggingface.co/BeaverAI

Cheers!

Thanks for the detailed breakdown! This is super helpful.

  1. The Repo:
    You can grab the quant here: https://huggingface.co/ox-ox/MiniMax-M2.5-GGUF/blob/main/minimax-m2.5-Q3_K_L.gguf

  2. Context mismatch:
    That explains the delta! I ran my PPL test with -c 4096 (chunks 32) on Metal (M3 Max), whereas your log shows --ctx-size 512. My lower PPL (8.22) is likely benefiting from the larger context window compared to your baseline. I will re-run with --ctx-size 512 to align with your methodology and update my numbers (rough command sketched after this list).

  3. Workflow (FP8 -> F16):
    I used llama.cpp (b8022) to convert the safetensors. I didn't explicitly force BF16, so it likely defaulted to an FP16 intermediate GGUF before quantizing to Q3_K_L. My goal was primarily to avoid the direct FP8->Quant artifacts I've seen in previous builds, and to fit it into 128GB unified memory without swap.

  4. Cross-Verification:
    I would absolutely love if you could test the Q3_K_L on your rig to see how it stacks up against the IQ4_XS efficiency curve on your backend!
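
For reference (re: point 2), the aligned re-run I have in mind looks roughly like this; the thread count and GPU offload value are placeholders for my M3 Max, not final numbers:

  ./build/bin/llama-perplexity \
      -m minimax-m2.5-Q3_K_L.gguf \
      -f wiki.test.raw \
      --ctx-size 512 \
      -ub 4096 -b 4096 \
      -ngl 99 \
      --threads 8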

I'll definitely check out the Beaver AI discord. I'm an undergrad student working on low-bandwidth LLM interactions (SNEE project), so learning from the "cookers" would be gold. Thanks for the invite!

Thanks for looking more closely into the details so we can make the comparison as fair as possible!

I'm testing your Q3_K_L right now, and I assume it isn't quite as good given the default recipe knocks down attn.*, but I'll report the numbers as soon as I have them!

Yours:

llama_model_loader: - type  f32:  373 tensors
llama_model_loader: - type q3_K:  249 tensors
llama_model_loader: - type q5_K:  186 tensors
llama_model_loader: - type q6_K:    1 tensors

Mine:

llama_model_loader: - type q8_0:  248 tensors <--- attn.*
llama_model_loader: - type q4_K:    1 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  186 tensors

I didn't explicitly force BF16, so it likely defaulted to an FP16 intermediate GGUF before quantizing to Q3_K_L. My goal was primarily to avoid the direct FP8->Quant artifacts I've seen in previous builds, and to fit it into 128GB unified memory without swap.

It defaults to bf16, which is what I did. You can tell by looking at the output files you created, e.g. mine look like: MiniMax-M2.5-256x4.9B-BF16-00001-of-00010.gguf etc...

I've never done direct FP8->Quant; I always go from the original, upcast to a full bf16 master (or fp16 only if the original model is explicitly fp16), to prevent any clipping.

I'm on it!

Cheers!

Aha! I see your strategy: keeping attn.* at Q8_0 while squeezing the experts into IQ4_XS. That's a clever tradeoff.

My Q3_K_L is indeed the "vanilla" recipe from llama.cpp. I suspect your mix might yield better PPL thanks to the high-precision attention heads, but I'm curious if the IQ4_XS on the experts hurts the knowledge retrieval compared to the Q3_K experts in my build.

Really appreciate the deep dive into the tensor breakdown. It confirms my "Master" was indeed BF16 (good to know convert_hf_to_gguf.py handles that safely by default). Standing by for your numbers!

@ox-ox

I'm curious if the IQ4_XS on the experts hurts the knowledge retrieval compared to the Q3_K experts in my build.

No, IQ4_XS is better than the q3_K your default mix is using, probably for your ffn_(gate|up)_exps. I didn't look at the exact recipe, but you are likely using q5_K for ffn_down_exps. Also, the perplexity of my IQ4_XS is "better" specifically for this model [caveat: perplexity is not everything, especially on instruct-tuned models; look into KLD for more details].
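
If you do want to try KLD, the rough two-pass flow with llama-perplexity looks something like this (file names are placeholders; double-check llama-perplexity --help on your build for the exact flags):

$ # 1) save baseline logits from the highest-precision model you have (e.g. the bf16 master)
$ ./build/bin/llama-perplexity -m MiniMax-M2.5-BF16.gguf -f wiki.test.raw --ctx-size 512 --kl-divergence-base logits.bin
$ # 2) score the quant against that baseline
$ ./build/bin/llama-perplexity -m minimax-m2.5-Q3_K_L.gguf -f wiki.test.raw --ctx-size 512 --kl-divergence-base logits.bin --kl-divergence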

My advice is to ignore the recipe names like Q3_K_L and instead look at the exact quantization type used for each individual tensor.

Also, watch my recent talk for more information about using ./build/bin/llama-quantize --help, as well as grepping old closed PRs/discussions on mainline and ik_llama.cpp for exact details. Here is the talk: https://blog.aifoundry.org/p/adventures-in-model-quantization
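
For example, one quick way to see the exact type of each tensor (assuming you have the gguf Python package installed, which ships a gguf-dump script) is something like:

$ pip install gguf
$ gguf-dump minimax-m2.5-Q3_K_L.gguf | grep -E 'attn|ffn'

Loading the model with llama.cpp also prints the per-type tensor counts, as in the listings above.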

Cheers and good job again. I hope I'm not coming off too rough; it's really great you cooked these quants and are in the game now, especially given you're an undergrad. Welcome, take your time, and enjoy the ride!

Thanks a lot for the encouragement and the technical pointers!

You're right, I was sticking to the standard llama.cpp recipe names without digging into the specific mix for ffn_gate vs ffn_down. That explains why the IQ4_XS pulls ahead on specific tasks despite the size difference. I’ll definitely stop treating these recipes as black boxes and start looking at the per-tensor quantization types.

I've bookmarked your talk ('Adventures in Model Quantization') and I'm watching it tonight to get up to speed before jumping into the Discord. Thanks for the warm welcome; it means a lot coming from an expert in the field.
