Quick bench for smol-IQ3_KS on 2 GPUs

#4
by curiouspp8 - opened

Uses ~54GB on each GPU @ 120k context, with ~45.52GB of that for the base weights. Graph split mode on. ~60GB on each of the 2 GPUs with the full 204k context. KV cache q8/q6 for all runs.

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  6158.3 | 108.60 |
|        4K |  5306.1 | 100.41 |
|       16K |  5065.6 |  83.55 |
|       32K |  4304.8 |  73.98 |
|       64K |  3410.7 |  51.13 |
+-----------+---------+--------+
|    TTFR 0 |     544 |      - |
|   TTFR 4K |    1355 |      - |
|  TTFR 16K |    3425 |      - |
|  TTFR 32K |    7471 |      - |
|  TTFR 64K |   17904 |      - |
+-----------+---------+--------+

  TG Peak (burst): 112.00 104.00 88.00 78.00 56.00
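For what it's worth, subtracting the empty-prompt TTFR (~544 ms of fixed overhead) from the other TTFR rows gives a rough effective average prefill throughput. This assumes the 4K/16K/32K/64K labels mean 4096/16384/32768/65536 tokens:

```python
# Rough sanity check on the TTFR rows above: net of the ~544 ms measured
# at 0 prefilled, each TTFR implies an average prefill speed over the
# whole prompt, which lands in the same ballpark as the PP@4096 readings.
ttfr_ms = {4096: 1355, 16384: 3425, 32768: 7471, 65536: 17904}
overhead_ms = 544  # TTFR at 0 prefilled

for tokens, ms in ttfr_ms.items():
    tps = tokens / ((ms - overhead_ms) / 1000)
    print(f"{tokens:6d} tok prompt -> ~{tps:6.0f} tok/s average prefill")
```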

Updated with speculative decoding on. Nice improvement in PP, but less so once you get to larger prompts.

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  7117.8 | 110.04 |
|        4K |  6017.1 |  98.01 |
|       16K |  5547.8 |  85.27 |
|       32K |  4751.0 |  74.88 |
|       64K |  3690.3 |  52.15 |
+-----------+---------+--------+
|    TTFR 0 |     500 |      - |
|   TTFR 4K |    1215 |      - |
|  TTFR 16K |    3198 |      - |
|  TTFR 32K |    6736 |      - |
|  TTFR 64K |   16488 |      - |
+-----------+---------+--------+

  TG Peak (burst): 112.00 103.00 91.00 78.00 55.00

Are you using both -khad and -vhad on these tests? Oh yes, I think so, I see it now: https://github.com/ikawrakow/ik_llama.cpp/pull/1625

I haven't tried removing those or using an unquantized kv cache to see the effects. Generally for full GPU offload, unquantized f16 kv-cache can be faster for PP, I believe. But then you might not have enough VRAM for the full context size... tradeoffs!

so many knobs to tweak and benchmark haha...

Yep, used -khad without your patch. Then tried the patch yesterday: the model loaded, and that fixed inference on minimax graph 2 + -vhad + -mudge, but VRAM usage was identical to running without them. Not sure what I was doing wrong. I am rebuilding from main now to compare against the officially merged version and see if that actually compresses the KV.

So -khad -vhad won't change the size of the kv-cache; they just rotate the tensors before quantizing, which gives some quality boost. I don't know what -mudge is... oh, -muge should be used; it will give maybe a 10% boost in PP and a few percent in TG, probably. It is the equivalent of using a mainline pre-fused ffn_(up|gate)_exps quant.
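A toy illustration of why rotating before quantizing helps (pure Python sketch of the general idea, not ik_llama.cpp's actual kernels): an outlier-heavy vector forces a large absmax scale, so the small values get crushed to zero. A Hadamard rotation spreads the outlier across all components first, and since the transform is orthonormal you just rotate back after dequantizing.

```python
import math

def hadamard(n):
    """Sylvester-construction Hadamard matrix, orthonormal rows (n a power of 2)."""
    h = [[1.0]]
    while len(h) < n:
        h = [r + r for r in h] + [r + [-v for v in r] for r in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def quant_roundtrip(x, bits=4):
    """Symmetric absmax quantization to signed `bits`, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    return [round(v / scale) * scale for v in x]

def rmse(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))

x = [10.0, 0.6, -0.7, 0.5, -0.55, 0.65, -0.6, 0.7]  # one big outlier

direct = quant_roundtrip(x)
H = hadamard(len(x))
# this orthonormal Sylvester H is symmetric, so it is its own inverse
rotated = matvec(H, quant_roundtrip(matvec(H, x)))

print(f"direct  q4 RMSE: {rmse(x, direct):.3f}")
print(f"rotated q4 RMSE: {rmse(x, rotated):.3f}")
```

With the outlier present, direct 4-bit quantization rounds every small component to zero, while the rotated version recovers them with much lower error.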

Right, the question to answer with some speed benchmarks is how these setups compare:

  • -khad -ctk q8_0 -vhad -ctv q6_0
  • -khad -vhad
  • nothing, just leave it at the default f16, without Hadamard transforms on either the k or v cache.
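Something like this can enumerate those runs (just a sketch; llama-sweep-bench is ik_llama.cpp's sweep tool, and the model path and context size here are placeholders):

```python
# Sketch: generate the three A/B bench invocations listed above so each
# setup gets the same base flags. $MODEL and the context size are
# placeholders, not the exact values used in this thread.
SETUPS = {
    "q8/q6 + hadamard": "-khad -ctk q8_0 -vhad -ctv q6_0",
    "hadamard only":    "-khad -vhad",
    "default f16":      "",
}

def bench_cmd(flags, model="$MODEL", ctx=65536):
    base = f"./llama-sweep-bench -m {model} -c {ctx} -ngl 99 -sm graph"
    return f"{base} {flags}".rstrip()

for name, flags in SETUPS.items():
    print(f"# {name}\n{bench_cmd(flags)}")
```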

Ahh I see ik already explained it more here: https://github.com/ikawrakow/ik_llama.cpp/pull/1625#issuecomment-4237769371

Since you've been so helpful, I paused some workloads and just ran the full set of combinations. Note that those cards had some other stuff loaded into VRAM, but it was fully idle during the test. Just in case that impacts anything.

Defaults (GPUs at 90.6/94.9GB, just as a relative reference point)

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  6577.2 | 116.56 |
|        4K |  4853.0 | 108.18 |
|       16K |  4786.6 |  91.15 |
|       32K |  3881.8 |  82.27 |
|       64K |  3179.0 |  69.71 |
+-----------+---------+--------+
|    TTFR 0 |     531 |      - |
|   TTFR 4K |    1481 |      - |
|  TTFR 16K |    3727 |      - |
|  TTFR 32K |    8241 |      - |
|  TTFR 64K |   19177 |      - |
+-----------+---------+--------+

  TG Peak (burst): 120.00 111.00 94.00 86.00 72.00

With -ctk q8_0 -ctv q6_0 (GPUs at 85.3/89.7GB)

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  6716.3 | 110.31 |
|        4K |  5565.6 | 100.74 |
|       16K |  5207.9 |  85.93 |
|       32K |  4465.5 |  73.30 |
|       64K |  3486.1 |  55.18 |
+-----------+---------+--------+
|    TTFR 0 |     536 |      - |
|   TTFR 4K |    1286 |      - |
|  TTFR 16K |    3338 |      - |
|  TTFR 32K |    7280 |      - |
|  TTFR 64K |   17459 |      - |
+-----------+---------+--------+

  TG Peak (burst): 120.00 107.00 92.00 76.00 71.00

With -ctk q8_0 -ctv q6_0 -khad -vhad

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  6417.0 | 111.95 |
|        4K |  5477.4 |  83.21 |
|       16K |  5257.7 |  78.56 |
|       32K |  4424.8 |  65.65 |
|       64K |  3504.7 |  48.56 |
+-----------+---------+--------+
|    TTFR 0 |     567 |      - |
|   TTFR 4K |    1283 |      - |
|  TTFR 16K |    3418 |      - |
|  TTFR 32K |    7262 |      - |
|  TTFR 64K |   17226 |      - |
+-----------+---------+--------+

  TG Peak (burst): 142.00 88.00 93.00 68.00 51.00

With -ctk q8_0 -ctv q6_0 -khad -vhad (GPUs at 83.8/88.2GB)

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  6487.2 | 108.75 |
|        4K |  5869.8 |  93.81 |
|       16K |  5457.9 |  80.12 |
|       32K |  4625.9 |  66.09 |
|       64K |  3606.3 |  49.35 |
+-----------+---------+--------+
|    TTFR 0 |     549 |      - |
|   TTFR 4K |    1257 |      - |
|  TTFR 16K |    3310 |      - |
|  TTFR 32K |    7043 |      - |
|  TTFR 64K |   16950 |      - |
+-----------+---------+--------+

  TG Peak (burst): 130.00 97.00 89.00 70.00 59.00

Full final config

  "minimax-m2.7-q3":
    proxy: "http://127.0.0.1:8088"
    env:
      - "CUDA_VISIBLE_DEVICES=0,1"
    cmd: >
      /app/run-server.sh
      --model /models/models--ubergarm--MiniMax-M2.7-GGUF/snapshots/b39e25f035f93fbb15d52bcc5cc8081b717efe65/smol-IQ3_KS/MiniMax-M2.7-smol-IQ3_KS-00001-of-00003.gguf
      --port 8088
      --alias minimax-m2.7-q3
      --jinja
      -c 80000
      -sm graph
      --threads 1
      --n-gpu-layers 99
      -muge
      -ctk q4_0 -ctv q4_0 -khad -vhad
      --batch-size 4096
      --ubatch-size 2048
      --no-mmap
      --host 0.0.0.0
      -cram 20000
      --spec-type ngram-map-k4v --spec-ngram-size-n 8 --draft-min 1 --draft-max 16 --draft-p-min 0.4
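
The --spec-type ngram-map flags in that config draft tokens by looking the trailing n-gram up in the existing context. A toy sketch of the general idea, prompt-lookup style (illustrative only, not ik_llama.cpp's actual algorithm; the function name is made up):

```python
def ngram_draft(ctx, k=4, n=8):
    """Propose up to n draft tokens by matching the trailing k-gram
    of the context against its most recent earlier occurrence.
    The target model then verifies the draft in one batch, accepting
    the longest matching prefix."""
    if len(ctx) < k + 1:
        return []
    tail = ctx[-k:]
    # scan backwards from the most recent possible earlier match
    for i in range(len(ctx) - k - 1, -1, -1):
        if ctx[i:i + k] == tail:
            return ctx[i + k:i + k + n]
    return []  # no repeat found; fall back to normal decoding
```

This is why the speedup shrinks on some workloads: the draft only hits when the output repeats spans already present in the context.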

After a few tasks done using "-ctk q4_0 -ctv q4_0 -khad -vhad", I am surprised by the coherence; it generally doesn't feel like the typical degradation due to quantization. It mostly works but feels a bit lazier. Like, I asked it to do something for me, it went and researched how to do that in my project, then said "here is how you do this if you want to", whereas less quantized versions generally would also just do it.

I noticed that with weight quantization before. E.g. Kimi Q1 was 100% like that: very smart but very lazy. Q2 was less lazy. Q3 was about what you would expect. But that was weight quantization; with just ctk/ctv at Q8 I found it degraded on longer sessions in very obvious ways, becoming just dumb, or failing tool calls / looping etc.

This is not a very scientific comparison and it mixes a few things, but hopefully it is still useful, since this is not a bench but a real person doing real work with these models. With the very latest version of ik and those -khad -vhad optimizations I'd have to experiment more, and I am now more inclined to make at least Q8 KV the default for all models. q4 on minimax looks very promising so far, but it's too early to say. I desperately need VRAM, and a tiny tradeoff might be 100% worth it. Need to see how a long session with opencode holds up.

Yeah, it's all trade-offs! Thanks for the benchmarks; interesting that the price to pay for -khad -vhad shows up mostly in TG speeds.

Right, for MLA models (e.g. GLM-5.1, DeepSeek, Kimi-K2.5 etc., which use latent attention compression already by design) I try not to add extra kv-cache compression and go no lower than q8_0. On stuff like MiniMax it's fine to play around going lower, but personally I try not to go below -khad -ctk q6_0 -vhad -ctv q4_0, and instead stay above that and just control my prompts / restart the client frequently. You kind of have to look at the dimensions of the GQA and the attention style to know how much the architecture is already skimping to save on kv-cache memory. Qwen3.5 is already very efficient given the gated delta net attention stuff, so I usually leave it at full f16 since it's already "cheap".
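To put numbers on that tradeoff, here's a back-of-envelope kv-cache sizing sketch. The model dimensions below are a hypothetical GQA config (not MiniMax's or Qwen's real numbers), and the bits-per-weight figures assume llama.cpp-style block quants where each 32-value block carries a scale (8.5 / 6.5 / 4.5 bpw for q8_0 / q6_0 / q4_0):

```python
# Back-of-envelope KV-cache sizing. Dims are hypothetical, NOT any
# specific model; bpw values include per-block scale overhead.
def kv_bytes(ctx, n_layers, n_kv_heads, head_dim, bpw):
    # K and V each store n_kv_heads * head_dim values per layer per token
    return ctx * n_layers * n_kv_heads * head_dim * 2 * bpw / 8

GIB = 1024 ** 3
for name, bpw in [("f16", 16), ("q8_0", 8.5), ("q6_0", 6.5), ("q4_0", 4.5)]:
    gib = kv_bytes(131072, 60, 8, 128, bpw) / GIB
    print(f"{name:5s} kv-cache @ 128k ctx: {gib:5.2f} GiB")
```

The same formula also shows why MLA or aggressive GQA models have less to gain: they have already shrunk the n_kv_heads * head_dim factor architecturally.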

The idea of what is now known as the "Ralph Wiggum" loop is interesting. It is just a for loop that keeps restarting the client, with state persisted in local files / the git repo. This lets you keep max context lower, which is generally better than pushing past 200k imo.

Also check out this guy's speculative decoding settings; I'm still fooling around to dial those in on MiniMax and GLM-5.1: https://huggingface.co/zai-org/GLM-5.1/discussions/5#69dce5bc7d17d64f187cfa5f

Cheers!
