Testing Q5 flavors (ubergarm / aessedai / unsloth) for "speed" on 8x RTX 3090

#10

by dehnhaide - opened Apr 16


dehnhaide

Apr 16

In scope:

MODEL: ubergarm/MiniMax-M2.7-IQ5_K
model type = 230B.A10B
model ftype = IQ5_K - 5.5 bpw
model params = 228.690 B
model size = 157.771 GiB (5.926 BPW)
repeating layers = 156.555 GiB (5.912 BPW, 227.461 B parameters)
general.name = MiniMax M2.7

MODEL: aessedai/MiniMax-M2.7-Q5_K_M
model type = 230B.A10B
model ftype = Q8_0
model params = 228.690 B
model size = 157.226 GiB (5.906 BPW)
repeating layers = 156.010 GiB (5.892 BPW, 227.461 B parameters)
general.name = MiniMax M2.7

MODEL: unsloth/MiniMax-M2.7-UD-Q5_K_XL
model type = 230B.A10B
model ftype = Q5_K - Medium
model params = 228.690 B
model size = 157.797 GiB (5.927 BPW)
repeating layers = 156.581 GiB (5.913 BPW, 227.461 B parameters)
general.name = Minimax-M2.7
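As a sanity check on the figures above, the reported BPW values can be reproduced from the file size and the parameter count. This is a minimal sketch that assumes the loader reports size in GiB (2^30 bytes) and parameters in billions (10^9); the helper name `bits_per_weight` is made up for illustration and is not part of any tool.

```python
def bits_per_weight(size_gib: float, params_billion: float) -> float:
    """Average bits per weight, assuming GiB = 2**30 bytes and B = 1e9 parameters."""
    total_bits = size_gib * 2**30 * 8
    return total_bits / (params_billion * 1e9)


# Model sizes (GiB) as listed in the post; all three share 228.690 B parameters.
MODEL_SIZES_GIB = {
    "ubergarm IQ5_K": 157.771,
    "aessedai Q5_K_M": 157.226,
    "unsloth UD-Q5_K_XL": 157.797,
}

bpw = {name: bits_per_weight(size, 228.690) for name, size in MODEL_SIZES_GIB.items()}

for name, value in bpw.items():
    print(f"{name}: {value:.3f} BPW")
```

Under those assumptions the three results land on the 5.926, 5.906 and 5.927 BPW shown in the listings, so the three files are within about 0.4% of each other in size.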

Common launch command (version: 4416 (4945d3b7)):
/ik_llama.cpp/build/bin/llama-sweep-bench --model "model_name" -c 16384 -fa 1 --no-mmap -ngl 999 --jinja --seed 1976 -ctk q8_0 -khad -ctv q8_0 -vhad -muge -b 2048 -ub 512 -sm graph --threads 24 --parallel 1 --fit --fit-margin 3072 -ts 0.875,0.885,0.875,0.885,0.875,0.885,0.875,0.8

MODEL: ubergarm/MiniMax-M2.7-IQ5_K
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 24, n_threads_batch = 24

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	0.796	643.32	2.486	51.49
512	128	512	0.369	1388.78	2.629	48.68
512	128	1024	0.370	1384.23	2.663	48.07
512	128	1536	0.372	1377.91	2.642	48.45
512	128	2048	0.373	1372.47	2.642	48.45
512	128	2560	0.393	1304.17	2.650	48.30
512	128	3072	0.375	1363.74	2.625	48.76
512	128	3584	0.376	1363.38	2.611	49.03
512	128	4096	0.411	1246.88	2.744	46.65
512	128	4608	0.378	1353.56	2.779	46.06
512	128	5120	0.380	1346.52	2.773	46.16
512	128	5632	0.409	1252.25	2.771	46.19
512	128	6144	0.382	1340.22	2.767	46.26
512	128	6656	0.382	1339.58	2.845	44.99
512	128	7168	0.384	1334.34	2.817	45.44
512	128	7680	0.383	1336.05	2.897	44.19
512	128	8192	0.385	1329.24	2.826	45.30
512	128	8704	0.387	1324.24	2.804	45.65
512	128	9216	0.388	1321.28	2.875	44.52
512	128	9728	0.388	1320.50	2.843	45.03
512	128	10240	0.390	1312.12	2.910	43.98
512	128	10752	0.391	1310.44	2.888	44.32
512	128	11264	0.391	1307.79	2.936	43.59
512	128	11776	0.444	1152.42	2.939	43.56
512	128	12288	0.465	1101.35	2.860	44.76
512	128	12800	0.394	1299.20	2.930	43.68
512	128	13312	0.397	1290.64	2.980	42.95
512	128	13824	0.399	1283.33	2.978	42.98
512	128	14336	0.397	1290.10	2.938	43.57
512	128	14848	0.399	1282.03	2.945	43.47
512	128	15360	0.426	1201.08	2.970	43.10
512	128	15872	0.402	1274.47	2.955	43.31
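For anyone reading these tables for the first time, the speed columns appear to be the token count divided by the elapsed time (S_PP = PP / T_PP, S_TG = TG / T_TG). The sketch below checks that reading against the first row of the table above; the helper `throughput` is hypothetical, and small mismatches are expected because the printed times are rounded to three decimals.

```python
def throughput(tokens: int, seconds: float) -> float:
    """Tokens per second for one benchmark step."""
    return tokens / seconds


# First row of the IQ5_K table: PP=512, TG=128, T_PP=0.796 s, T_TG=2.486 s
s_pp = throughput(512, 0.796)
s_tg = throughput(128, 2.486)

print(f"S_PP ~ {s_pp:.1f} t/s (table reports 643.32)")
print(f"S_TG ~ {s_tg:.2f} t/s (table reports 51.49)")
```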

MODEL: aessedai/MiniMax-M2.7-Q5_K_M
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 24, n_threads_batch = 24

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	0.569	900.06	2.131	60.06
512	128	512	0.353	1451.87	2.224	57.55
512	128	1024	0.369	1386.36	2.197	58.25
512	128	1536	0.356	1439.61	2.200	58.19
512	128	2048	0.377	1358.53	2.218	57.71
512	128	2560	0.370	1384.66	2.270	56.39
512	128	3072	0.361	1417.56	2.291	55.87
512	128	3584	0.363	1411.63	2.315	55.30
512	128	4096	0.361	1419.19	2.387	53.62
512	128	4608	0.362	1414.17	2.363	54.17
512	128	5120	0.366	1397.51	2.360	54.24
512	128	5632	0.366	1399.66	2.410	53.11
512	128	6144	0.370	1382.89	2.461	52.00
512	128	6656	0.367	1396.45	2.418	52.95
512	128	7168	0.369	1386.97	2.368	54.06
512	128	7680	0.369	1386.16	2.408	53.17
512	128	8192	0.370	1383.54	2.432	52.64
512	128	8704	0.370	1382.38	2.519	50.82
512	128	9216	0.372	1375.36	2.521	50.77
512	128	9728	0.373	1372.62	2.498	51.24
512	128	10240	0.376	1363.38	2.525	50.70
512	128	10752	0.375	1365.42	2.489	51.43
512	128	11264	0.377	1358.84	2.558	50.05
512	128	11776	0.376	1360.63	2.537	50.45
512	128	12288	0.378	1353.63	2.603	49.16
512	128	12800	0.378	1353.04	2.587	49.48
512	128	13312	0.381	1342.72	2.629	48.69
512	128	13824	0.383	1338.30	2.582	49.57
512	128	14336	0.413	1240.04	2.648	48.34
512	128	14848	0.389	1317.81	2.575	49.71
512	128	15360	0.388	1321.22	2.554	50.11
512	128	15872	0.386	1328.07	2.552	50.16

MODEL: unsloth/MiniMax-M2.7-UD-Q5_K_XL

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 24, n_threads_batch = 24

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	0.650	787.93	2.094	61.12
512	128	512	0.353	1451.45	2.243	57.07
512	128	1024	0.353	1449.03	2.212	57.88
512	128	1536	0.356	1440.19	2.228	57.44
512	128	2048	0.358	1432.16	2.272	56.33
512	128	2560	0.358	1428.59	2.272	56.34
512	128	3072	0.360	1420.65	2.237	57.23
512	128	3584	0.360	1423.10	2.283	56.07
512	128	4096	0.361	1419.62	2.356	54.33
512	128	4608	0.362	1415.34	2.350	54.48
512	128	5120	0.363	1411.33	2.323	55.09
512	128	5632	0.365	1401.12	2.381	53.75
512	128	6144	0.369	1388.57	2.404	53.23
512	128	6656	0.367	1394.91	2.368	54.06
512	128	7168	0.368	1392.53	2.353	54.40
512	128	7680	0.368	1390.67	2.433	52.60
512	128	8192	0.370	1385.52	2.463	51.97
512	128	8704	0.372	1377.16	2.446	52.33
512	128	9216	0.371	1378.97	2.533	50.54
512	128	9728	0.372	1374.74	2.514	50.92
512	128	10240	0.374	1368.52	2.565	49.91
512	128	10752	0.374	1369.33	2.475	51.73
512	128	11264	0.376	1363.03	2.505	51.09
512	128	11776	0.376	1363.24	2.499	51.22
512	128	12288	0.378	1355.07	2.534	50.51
512	128	12800	0.378	1353.47	2.540	50.38
512	128	13312	0.380	1349.10	2.542	50.36
512	128	13824	0.381	1344.73	2.554	50.12
512	128	14336	0.449	1139.19	2.498	51.25
512	128	14848	0.383	1337.16	2.597	49.29
512	128	15360	0.386	1326.97	2.554	50.11
512	128	15872	0.385	1329.90	2.545	50.30

Useless (non-belligerent) conclusion: for me the IQ quants continue to be (for reasons / explanations I must admit I don't master) still the slowest on 8X RTX3090 (with or without CPU / RAM offloading)
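To put a number on that conclusion, here is a small sketch that compares token generation speed (S_TG) for the three quants at three context depths. The values are copied from the tables above (rows N_KV = 0, 8192 and 15872); the dictionary layout and the `slowdown_pct` helper are illustrative only.

```python
# S_TG (t/s) copied from the sweep-bench tables at N_KV = 0, 8192, 15872.
S_TG = {
    "ubergarm IQ5_K": {0: 51.49, 8192: 45.30, 15872: 43.31},
    "aessedai Q5_K_M": {0: 60.06, 8192: 52.64, 15872: 50.16},
    "unsloth UD-Q5_K_XL": {0: 61.12, 8192: 51.97, 15872: 50.30},
}


def slowdown_pct(slow: float, fast: float) -> float:
    """How much slower `slow` is than `fast`, as a percentage of `fast`."""
    return (1.0 - slow / fast) * 100.0


baseline = "ubergarm IQ5_K"
gaps = {}
for other in ("aessedai Q5_K_M", "unsloth UD-Q5_K_XL"):
    for depth, speed in S_TG[other].items():
        gaps[(other, depth)] = slowdown_pct(S_TG[baseline][depth], speed)
        print(f"N_KV={depth:>5}: IQ5_K is {gaps[(other, depth)]:.1f}% slower than {other}")
```

On these sampled rows the IQ5_K quant generates roughly 13 to 16 percent fewer tokens per second than the two Q5_K-based quants, while the prompt processing columns in the tables stay much closer together.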

ubergarm

Owner Apr 17

@dehnhaide

Thanks for doing some llama-sweep-bench speed benchmarks. Sounds like you're fully offloading on 8x 3090s?

/ik_llama.cpp/build/bin/llama-sweep-bench \
  --model "model_name" \
  -c 16384 \
  -fa 1 \
  --no-mmap \
  -ngl 999 \
  --jinja \
  --seed 1976 \
  -ctk q8_0 -khad -ctv q8_0 -vhad \
  -muge \
  -b 2048 -ub 512 \
  -sm graph \
  --threads 24 \
  --parallel 1 \
  --fit \
  --fit-margin 3072 -ts 0.875,0.885,0.875,0.885,0.875,0.885,0.875,0.8

A few tips:

- Use --threads 1 when doing full GPU offload to avoid extra synchronization latency from unused CPU threads (might give a 1-3% boost).
- For more PP speed, pump up the batch sizes if VRAM allows, e.g. -ub 2048 -b 2048 or even -ub 4096 -b 4096.
- -khad -vhad can add some overhead and are not really needed for q8_0 (likely best for q6_0 and under), so try omitting them.
- Add --warmup-batch to help fix the odd dip in the first batch.
- Add -n 64 to speed things up: it generates fewer tokens per batch, so your tests run much faster.
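Applied together, the tips above turn the original launch command into something like the output of the sketch below. It only rearranges flags that already appear in this thread; the helper `apply_tips` is hypothetical, the -ts value is copied as posted, and whether the combination actually helps on a given rig is untested here.

```python
import shlex

# Original launch arguments from the thread ("model_name" is the thread's placeholder).
BASE_ARGS = [
    "/ik_llama.cpp/build/bin/llama-sweep-bench",
    "--model", "model_name",
    "-c", "16384", "-fa", "1", "--no-mmap", "-ngl", "999", "--jinja",
    "--seed", "1976",
    "-ctk", "q8_0", "-khad", "-ctv", "q8_0", "-vhad",
    "-muge", "-b", "2048", "-ub", "512", "-sm", "graph",
    "--threads", "24", "--parallel", "1",
    "--fit", "--fit-margin", "3072",
    "-ts", "0.875,0.885,0.875,0.885,0.875,0.885,0.875,0.8",
]

OVERRIDES = {"--threads": "1", "-b": "2048", "-ub": "2048"}  # flags that take a value
DROPPED = {"-khad", "-vhad"}                                  # value-less flags to omit
EXTRA = ["--warmup-batch", "-n", "64"]                        # appended at the end


def apply_tips(args: list[str]) -> list[str]:
    """Return a copy of args with the suggested overrides, removals and additions."""
    out, i = [], 0
    while i < len(args):
        flag = args[i]
        if flag in DROPPED:
            i += 1                      # skip the flag itself
        elif flag in OVERRIDES:
            out += [flag, OVERRIDES[flag]]
            i += 2                      # skip the flag and its old value
        else:
            out.append(flag)
            i += 1
    return out + EXTRA


tuned = apply_tips(BASE_ARGS)
print(shlex.join(tuned))
```

Changing one setting at a time and re-running the sweep makes it easier to see which tip is responsible for any difference.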

> the IQ quants continue to be (for reasons / explanations I must admit I don't master) still the slowest on 8X RTX3090 (with or without CPU / RAM offloading)

Yes, it is a trade-off: a more complex IQ quant can give better perplexity per BPW, but at the cost of extra compute, depending on the backend implementation.

FBykov

Apr 23

The flags --merge-qkv and -grt bf16 can boost TPS (tokens per second) by ~20%. The -grt flag is particularly beneficial if your GPU is not running at PCIe x16 speed.
