Quick bench for IQ2_KS on 1 GPU

#3
by curiouspp8 - opened

Thank you once again for being on top of the new model!

Basic opencode stuff works. Haven't done any real work with it yet. Nice option to have since it fits on one RTX 6000 Pro (the IQ3_XXS almost does). Base VRAM use was 88 GB, but context seems pretty VRAM-heavy: with 40k context loaded it hit 95.3 GB.

IQ2_KS used 86.2 GB of VRAM during the test with a 120k context size. Overall it's in line with vLLM without concurrency, but it's weird to see such a big TG drop as context increases. Not sure if that's expected in ik_llama.
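For anyone budgeting context VRAM: with standard attention the KV cache grows linearly with context, roughly 2 × n_layers × n_kv_heads × head_dim × bytes_per_element per token (K plus V). A toy estimator; the model dimensions below are placeholders for illustration, not MiniMax's actual config, and MiniMax's attention layout may differ:

```python
# Rough KV-cache VRAM estimator. The dimensions used in the example are
# PLACEHOLDER values, not MiniMax-M2's real architecture.
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elt=2.0):
    """Bytes for the K + V caches at a given context length.

    bytes_per_elt: 2.0 for f16; quantized cache types (q8_0, q6_0, ...)
    are smaller per element but carry per-block scale overhead.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt
    return n_ctx * per_token

# Example: hypothetical 60-layer model, 8 KV heads, head_dim 128, f16 cache
gib = kv_cache_bytes(40_960, 60, 8, 128) / (1 << 30)
print(f"{gib:.1f} GiB")  # 9.4 GiB at 40k context for these made-up dims
```

That linear growth is why the jump from 88 GB base to 95.3 GB at 40k context is plausible, and why quantizing the cache (-ctk/-ctv) buys so much headroom.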

+---------------+---------+--------+
| Prefilled     | PP@4096 | TG@512 |
+---------------+---------+--------+
|             0 |  4558.0 | 103.50 |
|            4K |  3946.5 |  90.62 |
|           16K |  3371.0 |  70.50 |
|           32K |  2700.1 |  48.32 |
|           64K |  1974.8 |  28.96 |
+---------------+---------+--------+
|   TTFR (ms) 0 |     881 |      - |
|  TTFR (ms) 4K |    2032 |      - |
| TTFR (ms) 16K |    5975 |      - |
| TTFR (ms) 32K |   13456 |      - |
| TTFR (ms) 64K |   34677 |      - |
+---------------+---------+--------+

  TG Peak (burst): 107.00 94.00 74.00 51.00 31.00
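To quantify that drop-off, here's a quick sketch computing the TG slowdown relative to empty context, using the numbers from the sweep above:

```python
# TG (tokens/s) from the sweep above, keyed by prefilled context depth.
tg = {0: 103.50, 4_096: 90.62, 16_384: 70.50, 32_768: 48.32, 65_536: 28.96}

base = tg[0]
for depth, rate in tg.items():
    drop = 100 * (1 - rate / base)
    print(f"{depth:>6} prefilled: {rate:6.2f} t/s ({drop:4.1f}% below empty-context TG)")
```

That works out to roughly a 72% TG drop by 64k prefilled, falling off nearly linearly with depth, which is about what you'd expect if per-token attention over the full KV cache dominates generation cost.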

@curiouspp8

Great, thanks for the quick test. I'm currently trying to figure out how to run some kind of HumanEval test to decide which will be my daily driver for 96GB VRAM:

  • MiniMax-M2.7-GGUF IQ2_KS 69.800 GiB (2.622 BPW) fits ~160k quantized kv-cache
  • Qwen3.5-122B-A10B-GGUF IQ5_KS 77.341 GiB (5.441 BPW) fits ~256k unquantized kv-cache + mmproj support

If you're following along here, I just figured out one more small thing to get -vhad working now too: https://github.com/ikawrakow/ik_llama.cpp/pull/1625#issuecomment-4232579356

On a single GPU you don't need -sm graph, so after applying the above branch+patch and rebuilding ik_llama.cpp you can run with:

./build/bin/llama-server \
  --model "$model" \
  --alias ubergarm/MiniMax-M2.7 \
  -c 163840 \
  -khad -ctk q8_0 -vhad -ctv q6_0 \
  --merge-qkv \
  -muge \
  -ngl 999 \
  -ub 1024 -b 2048 \
  --threads 1 \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  --no-mmap \
  --spec-type ngram-map-k4v --spec-ngram-size-n 8 --draft-min 1 --draft-max 16 --draft-p-min 0.4 \
  --cache-ram 32768 \
  --prompt-cache-all

If you don't have 32GB of RAM free, drop --cache-ram to whatever you want, e.g. 8192 for 8GiB, etc...

you can probably squeeze out some more PP speed by increasing to -ub 2048 -b 2048, but you might need to reduce context length... fiddle with it and find what you like.

i'll eventually run some llama-sweep-bench and look into that drop-off issue...

I would highly recommend spending more time with minimax. 2.7 especially seems to be a very solid update (based on my personal usage so far).
What kind of impact does -vhad have? Haven't encountered this one before.

I would highly recommend spending more time with minimax.

yeah, initial vibes are that it seems pretty good for some tasks, works well in opencode so far...

but, for now i ran all 164 humaneval questions against both models:

  • MiniMax-M2.7-GGUF IQ2_KS 69.800 GiB (2.622 BPW) fits ~160k quantized kv-cache
    • humaneval pass@1 (base) 0.220 taking 32m48s
  • Qwen3.5-122B-A10B-GGUF IQ5_KS 77.341 GiB (5.441 BPW) fits ~256k unquantized kv-cache + mmproj support
    • humaneval pass@1 (base) 0.494 taking 31m20s

assuming my vibecoded EvalPlus client was actually doing the right thing then Qwen3.5-122B is looking better so far...
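For anyone wanting to sanity-check the scoring: with a single sample per task, pass@1 is just the fraction of tasks solved, but the unbiased pass@k estimator from the HumanEval paper generalizes to n > 1 samples per task. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per task, c correct.

    Probability that at least one of k randomly chosen samples
    (out of the n drawn) is correct: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per task, pass@1 reduces to the solved fraction:
# e.g. 81 of 164 tasks solved would give the 0.494 reported above.
print(round(81 / 164, 3))  # 0.494
```

Averaging pass_at_k over all 164 tasks gives the benchmark score; with n=1 it's exactly the solved-task fraction.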

What kinda impact does -vhad have? Haven't encountered this one before.

it's new; ik added it after all the "turboquant" hype... it can help if you're quantizing the V cache.

details here: https://github.com/ikawrakow/ik_llama.cpp/pull/1527
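For context, my understanding is that the "had" in -khad/-vhad stands for Hadamard: the cache values get rotated by a Hadamard transform before quantization, which spreads outlier values across dimensions so blockwise quantization loses less precision (see the PR for details). A toy numpy sketch of the transform itself, not ik_llama.cpp's actual kernel:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform (length must be a power of 2)."""
    x = x.astype(np.float64).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # butterfly: sums
            x[i + h:i + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return x

# The normalized transform is its own inverse: H(H(x)) / n == x,
# so the rotation is losslessly undone after dequantization.
v = np.array([3.0, -1.0, 0.5, 7.0])
assert np.allclose(fwht(fwht(v)) / len(v), v)

# An outlier concentrated in one dimension gets spread across all of them,
# shrinking the dynamic range a blockwise quantizer has to cover:
spike = np.array([100.0, 0.1, 0.2, -0.1, 0.0, 0.3, -0.2, 0.1])
rotated = fwht(spike) / np.sqrt(len(spike))
print(spike.max() - spike.min(), rotated.max() - rotated.min())
```

This is why it matters more the harder you quantize the cache (e.g. -ctv q6_0): the rotation flattens outliers that would otherwise blow up the per-block quantization error.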

Another reason I'll likely stick with Qwen3.5-122B for now on 96GB VRAM:

(chart: sweep-bench-Qwen3.5-vs-MiniMax-2.7)
