4bpw request =)

#2
opened by BahamutRU

16 GB VRAM + 128 GB RAM

or

128 GB shared RAM

Maybe 125-126 GB max size? 120-121?
Or, maybe, IQ4_NL?

A mainline quant would be great, ah-ah. ^_^'

At the moment I use the Q4_K_S from unsloth, but the quality… I wish it were better.

Yes, this is probably a good size. That extra 16GB of VRAM can make a difference with this model, as my previous mainline quant was mainline-IQ4_NL at 121.234 GiB (4.554 BPW), just too big for folks with a 128GB Strix Halo or DGX Spark.

That UD-Q4_K_S is probably around 122GiB...
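(A back-of-the-envelope sketch for translating between BPW and file size, assuming size scales linearly with BPW and ignoring GGUF metadata overhead. From the 121.234 GiB at 4.554 BPW figure above:)

$$
P \approx \frac{121.234 \times 2^{30} \times 8}{4.554} \approx 2.29\times 10^{11}
\qquad\Rightarrow\qquad
\mathrm{size}(b) \approx \frac{P\,b}{8 \times 2^{30}}\ \mathrm{GiB},\quad
\mathrm{size}(4.40) \approx 117\ \mathrm{GiB}
$$

So at this parameter count, every 0.1 BPW shaved off a recipe is roughly 2.7 GiB of file size.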

It will be difficult to shave off a little more to fit under 128GB total, but I think you'll be fine with the extra 16GB of VRAM. You probably can't pump -ub though, and might have to use stuff like -khad -ctk q8_0 -ctv q6_0 to fit enough kv-cache... -vhad is still being fixed here: https://github.com/ikawrakow/ik_llama.cpp/pull/1625
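For illustration, a hypothetical launch along those lines (model path, context size, layer split, and the -ot pattern are placeholders I'm assuming, not a tested recipe):

```bash
# Hypothetical sketch for 16GB VRAM + 128GB RAM with ik_llama.cpp.
#   -ctk q8_0 -ctv q6_0  quantized kv-cache (needs -fa), as suggested above
#   -ub 512              modest ubatch, since VRAM is tight
#   -ot exps=CPU         keep routed experts in system RAM
./build/bin/llama-server \
  -m /models/MiniMax-M2.7-smol-IQ4_KSS.gguf \
  -c 32768 -fa \
  -ctk q8_0 -ctv q6_0 \
  -ub 512 \
  -ngl 99 -ot exps=CPU
```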

Okay, I'll finish up getting KLD data this time as the PPL data was wonky on 2.5. This will help me fiddle with something in this ballpark!

This one at 117GB was my favorite M2.5 quant for 96 GB RAM + 32 GB VRAM, hoping to see it again with M2.7:

https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/tree/main/smol-IQ4_KSS

(Not much room left for context, but the model is kinda slow with this setup anyway, so long context isn't really feasible.)

Thanks for all those great quants! 🙏

I used MiniMax-M2.5-GGUF IQ4_NL 121.386 GiB (4.559 BPW) as my daily coding driver for the last few months.
So I'd be happy to see a nice ik quant in that range 😊

I would second a mainline quant too, if it's not too much trouble.

@ndroidph

Keep your eyes peeled for AesSedai's mainline quants, which use recipes similar to mine with mainline-compatible types: https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/tree/main

I'll likely upload 2 more then: something that will fit in under 128GB and something that will need a little more than 128GB.

Very nice, thanks!

> AesSedai's mainline quants

I've been waiting for 12 hours!.. ='D

FWIW, I have been using your m2.5 IQ4_XS (115G) on a Strix Halo / 128GB. I am using it headless, though, with a bunch of other optimizations (TurboQuant, etc.) Would love to see m2.7 in the same :)

What's the speed like on that machine? pp/tg
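(For context: pp = prompt-processing and tg = token-generation throughput. A minimal sketch of how these are typically measured, assuming mainline-style llama-bench flags and a placeholder model path:)

```bash
# Reports pp and tg rates in tokens/s; -p and -n set the
# prompt and generation lengths for the benchmark.
./build/bin/llama-bench -m /models/MiniMax-M2.7-smol-IQ4_KSS.gguf -p 512 -n 128
```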

@BahamutRU they're uploading now and should be available within an hour!

Okay, I added the smol-IQ4_KSS, which looks good on the KLD graph but seems wonky on the PPL graph. The perplexity of this and the previous M2.5 model was kinda wonky, with some quants scoring "better" than baseline, hence I'm showing KLD as well this time.
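KLD is the more trustworthy signal here: a quant's perplexity can dip below the baseline's by chance, but KL divergence against the baseline's own logits is nonnegative by construction, so it can't "beat" baseline. The usual two-pass workflow looks roughly like this (a sketch, assuming ik_llama.cpp mirrors mainline's llama-perplexity KLD flags; file names are placeholders):

```bash
# Pass 1: save baseline logits from the full-precision (or q8_0) model.
./build/bin/llama-perplexity -m baseline-bf16.gguf -f wiki.test.raw \
  --kl-divergence-base logits.bin
# Pass 2: score the quant against those logits instead of raw PPL alone.
./build/bin/llama-perplexity -m smol-IQ4_KSS.gguf -f wiki.test.raw \
  --kl-divergence-base logits.bin --kl-divergence
```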

I gotta take a break, not gonna release another one just yet and hopefully folks can find the mainline quants they want from Aes for now.

Also be careful, some 4ish BPW UD quants are throwing nan in the perplexity test: https://huggingface.co/ubergarm/MiniMax-M2.7-GGUF/discussions/1#69dbf578f8841cf541647480

I've tested that all of mine here look good with ik_llama.cpp's llama-server --validate-quants, and I ran a full clean perplexity/KLD on them with no nans!
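If you want to sanity-check a download yourself, that check is cheap to run before committing to a long perplexity pass (a sketch; the model path is a placeholder):

```bash
# --validate-quants verifies tensor data at load time, so corrupt or
# nan-producing blocks fail fast instead of surfacing mid-benchmark.
./build/bin/llama-server -m MiniMax-M2.7-smol-IQ4_KSS.gguf --validate-quants
```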

Holler at me later if you still can't find the exact iq4_nl quant etc.; I might do one more, but some of my own early testing suggests Qwen3.5 is still pretty strong at smaller sizes: https://huggingface.co/ubergarm/MiniMax-M2.7-GGUF/discussions/3#69dc0e13a36186081f3e4b4a

Did a little write-up on r/LocalLLaMA comparing MiniMax-M2.7 vs Qwen3.5-122B for 96GB VRAM situation: https://www.reddit.com/r/LocalLLaMA/comments/1sjsokz/minimaxm27_vs_qwen35122ba10b_for_96gb_vram_full/

> FWIW, I have been using your m2.5 IQ4_XS (115G) on a Strix Halo / 128GB. I am using it headless, though, with a bunch of other optimizations (TurboQuant, etc.) Would love to see m2.7 in the same :)

> What's the speed like on that machine? pp/tg

I posted some stuff about MiniMax M2.7 - @AesSedai's IQ4_XS: https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/discussions/1#69dc57e4bc88172c7dbbc256
