Testing Q5 flavors (ubergarm / aessedai / unsloth) for "speed" on 8x RTX 3090
In scope:
MODEL: ubergarm/MiniMax-M2.7-IQ5_K
model type = 230B.A10B
model ftype = IQ5_K - 5.5 bpw
model params = 228.690 B
model size = 157.771 GiB (5.926 BPW)
repeating layers = 156.555 GiB (5.912 BPW, 227.461 B parameters)
general.name = MiniMax M2.7
MODEL: aessedai/MiniMax-M2.7-Q5_K_M
model type = 230B.A10B
model ftype = Q8_0
model params = 228.690 B
model size = 157.226 GiB (5.906 BPW)
repeating layers = 156.010 GiB (5.892 BPW, 227.461 B parameters)
general.name = MiniMax M2.7
MODEL: unsloth/MiniMax-M2.7-UD-Q5_K_XL
model type = 230B.A10B
model ftype = Q5_K - Medium
model params = 228.690 B
model size = 157.797 GiB (5.927 BPW)
repeating layers = 156.581 GiB (5.913 BPW, 227.461 B parameters)
general.name = Minimax-M2.7
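The BPW figures in the metadata above can be cross-checked from the reported sizes and parameter counts. This is a minimal sketch, assuming the loader reports size in GiB (1024^3 bytes) and that BPW is simply total bits divided by total parameters; the function name `bits_per_weight` is made up for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB (assumption: the loader prints binary gigabytes)


def bits_per_weight(size_gib, params_billion):
    """Return average bits per weight for a model of the given size."""
    total_bits = size_gib * GIB * 8
    total_params = params_billion * 1e9
    return total_bits / total_params


# (model size in GiB, params in billions, BPW as printed in the post)
models = {
    "ubergarm IQ5_K": (157.771, 228.690, 5.926),
    "aessedai Q5_K_M": (157.226, 228.690, 5.906),
    "unsloth UD-Q5_K_XL": (157.797, 228.690, 5.927),
}

for name, (size_gib, params_b, reported) in models.items():
    computed = bits_per_weight(size_gib, params_b)
    print(f"{name}: computed {computed:.3f} BPW, reported {reported}")
```

Under these assumptions the three files land within about 0.02 BPW of each other, so the speed comparison is between quants of nearly identical size.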
Common launch command (version: 4416 (4945d3b7)):
/ik_llama.cpp/build/bin/llama-sweep-bench --model "model_name" -c 16384 -fa 1 --no-mmap -ngl 999 --jinja --seed 1976 -ctk q8_0 -khad -ctv q8_0 -vhad -muge -b 2048 -ub 512 -sm graph --threads 24 --parallel 1 --fit --fit-margin 3072 -ts 0.875,0.885,0.875,0.885,0.875,0.885,0.875,0.8
MODEL: ubergarm/MiniMax-M2.7-IQ5_K
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 24, n_threads_batch = 24
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 0.796 | 643.32 | 2.486 | 51.49 |
| 512 | 128 | 512 | 0.369 | 1388.78 | 2.629 | 48.68 |
| 512 | 128 | 1024 | 0.370 | 1384.23 | 2.663 | 48.07 |
| 512 | 128 | 1536 | 0.372 | 1377.91 | 2.642 | 48.45 |
| 512 | 128 | 2048 | 0.373 | 1372.47 | 2.642 | 48.45 |
| 512 | 128 | 2560 | 0.393 | 1304.17 | 2.650 | 48.30 |
| 512 | 128 | 3072 | 0.375 | 1363.74 | 2.625 | 48.76 |
| 512 | 128 | 3584 | 0.376 | 1363.38 | 2.611 | 49.03 |
| 512 | 128 | 4096 | 0.411 | 1246.88 | 2.744 | 46.65 |
| 512 | 128 | 4608 | 0.378 | 1353.56 | 2.779 | 46.06 |
| 512 | 128 | 5120 | 0.380 | 1346.52 | 2.773 | 46.16 |
| 512 | 128 | 5632 | 0.409 | 1252.25 | 2.771 | 46.19 |
| 512 | 128 | 6144 | 0.382 | 1340.22 | 2.767 | 46.26 |
| 512 | 128 | 6656 | 0.382 | 1339.58 | 2.845 | 44.99 |
| 512 | 128 | 7168 | 0.384 | 1334.34 | 2.817 | 45.44 |
| 512 | 128 | 7680 | 0.383 | 1336.05 | 2.897 | 44.19 |
| 512 | 128 | 8192 | 0.385 | 1329.24 | 2.826 | 45.30 |
| 512 | 128 | 8704 | 0.387 | 1324.24 | 2.804 | 45.65 |
| 512 | 128 | 9216 | 0.388 | 1321.28 | 2.875 | 44.52 |
| 512 | 128 | 9728 | 0.388 | 1320.50 | 2.843 | 45.03 |
| 512 | 128 | 10240 | 0.390 | 1312.12 | 2.910 | 43.98 |
| 512 | 128 | 10752 | 0.391 | 1310.44 | 2.888 | 44.32 |
| 512 | 128 | 11264 | 0.391 | 1307.79 | 2.936 | 43.59 |
| 512 | 128 | 11776 | 0.444 | 1152.42 | 2.939 | 43.56 |
| 512 | 128 | 12288 | 0.465 | 1101.35 | 2.860 | 44.76 |
| 512 | 128 | 12800 | 0.394 | 1299.20 | 2.930 | 43.68 |
| 512 | 128 | 13312 | 0.397 | 1290.64 | 2.980 | 42.95 |
| 512 | 128 | 13824 | 0.399 | 1283.33 | 2.978 | 42.98 |
| 512 | 128 | 14336 | 0.397 | 1290.10 | 2.938 | 43.57 |
| 512 | 128 | 14848 | 0.399 | 1282.03 | 2.945 | 43.47 |
| 512 | 128 | 15360 | 0.426 | 1201.08 | 2.970 | 43.10 |
| 512 | 128 | 15872 | 0.402 | 1274.47 | 2.955 | 43.31 |
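For reading these tables: the speed columns appear to be token count divided by elapsed time (S_PP = PP / T_PP, S_TG = TG / T_TG). That relationship is an assumption inferred from the numbers, not something stated in the post. The sketch below checks whether a reported speed is consistent with a time that was rounded to three decimals; the helper names are hypothetical.

```python
def speed_bounds(tokens, seconds, decimals=3):
    """Range of tokens/s implied by a time value rounded to `decimals` places."""
    half_step = 0.5 * 10 ** -decimals
    return tokens / (seconds + half_step), tokens / (seconds - half_step)


def is_consistent(tokens, seconds, reported_speed):
    """True if the reported speed fits tokens / seconds within rounding."""
    low, high = speed_bounds(tokens, seconds)
    return low <= reported_speed <= high


# First row of the IQ5_K table: PP=512, T_PP=0.796 s, S_PP=643.32 t/s
print(is_consistent(512, 0.796, 643.32))
# Same row, generation side: TG=128, T_TG=2.486 s, S_TG=51.49 t/s
print(is_consistent(128, 2.486, 51.49))
```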
MODEL: aessedai/MiniMax-M2.7-Q5_K_M
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 24, n_threads_batch = 24
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 0.569 | 900.06 | 2.131 | 60.06 |
| 512 | 128 | 512 | 0.353 | 1451.87 | 2.224 | 57.55 |
| 512 | 128 | 1024 | 0.369 | 1386.36 | 2.197 | 58.25 |
| 512 | 128 | 1536 | 0.356 | 1439.61 | 2.200 | 58.19 |
| 512 | 128 | 2048 | 0.377 | 1358.53 | 2.218 | 57.71 |
| 512 | 128 | 2560 | 0.370 | 1384.66 | 2.270 | 56.39 |
| 512 | 128 | 3072 | 0.361 | 1417.56 | 2.291 | 55.87 |
| 512 | 128 | 3584 | 0.363 | 1411.63 | 2.315 | 55.30 |
| 512 | 128 | 4096 | 0.361 | 1419.19 | 2.387 | 53.62 |
| 512 | 128 | 4608 | 0.362 | 1414.17 | 2.363 | 54.17 |
| 512 | 128 | 5120 | 0.366 | 1397.51 | 2.360 | 54.24 |
| 512 | 128 | 5632 | 0.366 | 1399.66 | 2.410 | 53.11 |
| 512 | 128 | 6144 | 0.370 | 1382.89 | 2.461 | 52.00 |
| 512 | 128 | 6656 | 0.367 | 1396.45 | 2.418 | 52.95 |
| 512 | 128 | 7168 | 0.369 | 1386.97 | 2.368 | 54.06 |
| 512 | 128 | 7680 | 0.369 | 1386.16 | 2.408 | 53.17 |
| 512 | 128 | 8192 | 0.370 | 1383.54 | 2.432 | 52.64 |
| 512 | 128 | 8704 | 0.370 | 1382.38 | 2.519 | 50.82 |
| 512 | 128 | 9216 | 0.372 | 1375.36 | 2.521 | 50.77 |
| 512 | 128 | 9728 | 0.373 | 1372.62 | 2.498 | 51.24 |
| 512 | 128 | 10240 | 0.376 | 1363.38 | 2.525 | 50.70 |
| 512 | 128 | 10752 | 0.375 | 1365.42 | 2.489 | 51.43 |
| 512 | 128 | 11264 | 0.377 | 1358.84 | 2.558 | 50.05 |
| 512 | 128 | 11776 | 0.376 | 1360.63 | 2.537 | 50.45 |
| 512 | 128 | 12288 | 0.378 | 1353.63 | 2.603 | 49.16 |
| 512 | 128 | 12800 | 0.378 | 1353.04 | 2.587 | 49.48 |
| 512 | 128 | 13312 | 0.381 | 1342.72 | 2.629 | 48.69 |
| 512 | 128 | 13824 | 0.383 | 1338.30 | 2.582 | 49.57 |
| 512 | 128 | 14336 | 0.413 | 1240.04 | 2.648 | 48.34 |
| 512 | 128 | 14848 | 0.389 | 1317.81 | 2.575 | 49.71 |
| 512 | 128 | 15360 | 0.388 | 1321.22 | 2.554 | 50.11 |
| 512 | 128 | 15872 | 0.386 | 1328.07 | 2.552 | 50.16 |
MODEL: unsloth/MiniMax-M2.7-UD-Q5_K_XL
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 24, n_threads_batch = 24
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 0.650 | 787.93 | 2.094 | 61.12 |
| 512 | 128 | 512 | 0.353 | 1451.45 | 2.243 | 57.07 |
| 512 | 128 | 1024 | 0.353 | 1449.03 | 2.212 | 57.88 |
| 512 | 128 | 1536 | 0.356 | 1440.19 | 2.228 | 57.44 |
| 512 | 128 | 2048 | 0.358 | 1432.16 | 2.272 | 56.33 |
| 512 | 128 | 2560 | 0.358 | 1428.59 | 2.272 | 56.34 |
| 512 | 128 | 3072 | 0.360 | 1420.65 | 2.237 | 57.23 |
| 512 | 128 | 3584 | 0.360 | 1423.10 | 2.283 | 56.07 |
| 512 | 128 | 4096 | 0.361 | 1419.62 | 2.356 | 54.33 |
| 512 | 128 | 4608 | 0.362 | 1415.34 | 2.350 | 54.48 |
| 512 | 128 | 5120 | 0.363 | 1411.33 | 2.323 | 55.09 |
| 512 | 128 | 5632 | 0.365 | 1401.12 | 2.381 | 53.75 |
| 512 | 128 | 6144 | 0.369 | 1388.57 | 2.404 | 53.23 |
| 512 | 128 | 6656 | 0.367 | 1394.91 | 2.368 | 54.06 |
| 512 | 128 | 7168 | 0.368 | 1392.53 | 2.353 | 54.40 |
| 512 | 128 | 7680 | 0.368 | 1390.67 | 2.433 | 52.60 |
| 512 | 128 | 8192 | 0.370 | 1385.52 | 2.463 | 51.97 |
| 512 | 128 | 8704 | 0.372 | 1377.16 | 2.446 | 52.33 |
| 512 | 128 | 9216 | 0.371 | 1378.97 | 2.533 | 50.54 |
| 512 | 128 | 9728 | 0.372 | 1374.74 | 2.514 | 50.92 |
| 512 | 128 | 10240 | 0.374 | 1368.52 | 2.565 | 49.91 |
| 512 | 128 | 10752 | 0.374 | 1369.33 | 2.475 | 51.73 |
| 512 | 128 | 11264 | 0.376 | 1363.03 | 2.505 | 51.09 |
| 512 | 128 | 11776 | 0.376 | 1363.24 | 2.499 | 51.22 |
| 512 | 128 | 12288 | 0.378 | 1355.07 | 2.534 | 50.51 |
| 512 | 128 | 12800 | 0.378 | 1353.47 | 2.540 | 50.38 |
| 512 | 128 | 13312 | 0.380 | 1349.10 | 2.542 | 50.36 |
| 512 | 128 | 13824 | 0.381 | 1344.73 | 2.554 | 50.12 |
| 512 | 128 | 14336 | 0.449 | 1139.19 | 2.498 | 51.25 |
| 512 | 128 | 14848 | 0.383 | 1337.16 | 2.597 | 49.29 |
| 512 | 128 | 15360 | 0.386 | 1326.97 | 2.554 | 50.11 |
| 512 | 128 | 15872 | 0.385 | 1329.90 | 2.545 | 50.30 |
Useless (non-belligerent) conclusion: for me the IQ quants continue to be (for reasons / explanations I must admit I don't master) still the slowest on 8X RTX3090 (with or without CPU / RAM offloading)
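To put a number on that conclusion, the tables can be parsed and compared directly. The following is a rough sketch, not part of the original benchmark: it only includes the first and last row of each table (copied from the results above), and the parser and labels are illustrative.

```python
def parse_sweep_table(text):
    """Parse a llama-sweep-bench markdown table into a list of row dicts."""
    rows, header = [], None
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if header is None:
            header = cells  # first line holds the column names
            continue
        if all(set(c) <= set("-:") for c in cells):
            continue  # skip the |---|---| separator row
        rows.append({k: float(v) for k, v in zip(header, cells)})
    return rows


HEADER = "| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |\n|---|---|---|---|---|---|---|\n"

# First and last rows of each table, copied from the post.
tables = {
    "ubergarm IQ5_K": HEADER
    + "| 512 | 128 | 0 | 0.796 | 643.32 | 2.486 | 51.49 |\n"
    + "| 512 | 128 | 15872 | 0.402 | 1274.47 | 2.955 | 43.31 |",
    "aessedai Q5_K_M": HEADER
    + "| 512 | 128 | 0 | 0.569 | 900.06 | 2.131 | 60.06 |\n"
    + "| 512 | 128 | 15872 | 0.386 | 1328.07 | 2.552 | 50.16 |",
    "unsloth UD-Q5_K_XL": HEADER
    + "| 512 | 128 | 0 | 0.650 | 787.93 | 2.094 | 61.12 |\n"
    + "| 512 | 128 | 15872 | 0.385 | 1329.90 | 2.545 | 50.30 |",
}

parsed = {name: parse_sweep_table(text) for name, text in tables.items()}


def tg_speed(rows, n_kv):
    """Token-generation speed (t/s) at a given KV-cache depth."""
    return next(r["S_TG t/s"] for r in rows if r["N_KV"] == n_kv)


slowest = {}
for n_kv in (0, 15872):
    speeds = {name: tg_speed(rows, n_kv) for name, rows in parsed.items()}
    slowest[n_kv] = min(speeds, key=speeds.get)
    fastest = max(speeds.values())
    for name, s in speeds.items():
        print(f"N_KV={n_kv:>5} {name}: {s:.2f} t/s ({s / fastest:.0%} of fastest)")
```

Pasting the full tables into `tables` gives the same comparison at every context depth.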
Thanks for doing some llama-sweep-bench speed benchmarks. Sounds like you're fully offloading on 8x 3090s?
/ik_llama.cpp/build/bin/llama-sweep-bench \
--model "model_name" \
-c 16384 \
-fa 1 \
--no-mmap \
-ngl 999 \
--jinja \
--seed 1976 \
-ctk q8_0 -khad -ctv q8_0 -vhad \
-muge \
-b 2048 -ub 512 \
-sm graph \
--threads 24 \
--parallel 1 \
--fit \
--fit-margin 3072 -ts 0.875,0.885,0.875,0.885,0.875,0.885,0.875,0.8
A few tips:
- Use `--threads 1` when doing full GPU offload to avoid extra synchronization latency due to unused CPU threads (might give a 1-3% boost)
- For more PP, pump up the batch sizes if VRAM allows, e.g. `-ub 2048 -b 2048` or even `-ub 4096 -b 4096`
- `-khad -vhad` can have some overhead and are not really needed for q8_0 (likely best for q6_0 and under), so try omitting them
- Add `--warmup-batch` to help fix the odd dip in the first batch
- Add `-n 64` to speed things up, as it will generate fewer tokens per batch and run your tests much faster
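As a rough illustration of those tips applied together, the sketch below rewrites the original argument list and prints the resulting command. Every flag is taken from this thread as written; I have not checked them against a particular ik_llama.cpp build. The `-ts` split is left out because it is truncated in the post, and the helper `set_flag` is hypothetical.

```python
import shlex

# Original launch arguments from the post (without the truncated -ts list).
base = [
    "/ik_llama.cpp/build/bin/llama-sweep-bench",
    "--model", "model_name",
    "-c", "16384", "-fa", "1", "--no-mmap", "-ngl", "999",
    "--jinja", "--seed", "1976",
    "-ctk", "q8_0", "-khad", "-ctv", "q8_0", "-vhad",
    "-muge", "-b", "2048", "-ub", "512", "-sm", "graph",
    "--threads", "24", "--parallel", "1",
    "--fit", "--fit-margin", "3072",
]


def set_flag(args, flag, value):
    """Replace the value following `flag` in an argument list."""
    args[args.index(flag) + 1] = value


tuned = [a for a in base if a not in ("-khad", "-vhad")]  # skip Hadamard flags for q8_0 cache
set_flag(tuned, "--threads", "1")   # full GPU offload: avoid idle CPU threads
set_flag(tuned, "-ub", "2048")      # larger micro-batch for more PP, if VRAM allows
tuned += ["--warmup-batch"]         # smooth out the slow first batch
tuned += ["-n", "64"]               # fewer generated tokens per step, faster sweep

print(shlex.join(tuned))
```

Changing one flag at a time and re-running the sweep makes it easier to see which tip actually helps on a given machine.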
> the IQ quants continue to be (for reasons / explanations I must admit I don't master) still the slowest on 8X RTX3090 (with or without CPU / RAM offloading)
Yes, it is a trade-off: a more complex IQ quant can give better perplexity per BPW, but at the cost of extra compute, depending on the backend implementation.
The flags --merge-qkv and -grt bf16 can boost TPS (tokens per second) by ~20%. The -grt flag is particularly beneficial if your GPU is not running at PCIe x16 speed.