Report: getting 20 t/s with UD-Q4_K_XL and 72GB VRAM

#2
by SlavikF - opened

Thank you for publishing quants!

My system:

  • Intel Xeon W5-3425 (12c/24t)
  • 256GB DDR5-4800 (8 channels)
  • RTX 4090D 48GB
  • RTX 3090 24GB

llama.cpp:

  Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7542 (af3be131c) with GNU 11.4.0 for Linux x86_64
system info: n_threads = 12, n_threads_batch = 12, total_threads = 12

Getting speed:

prompt eval time =   29609.83 ms /  1987 tokens (  ...   67.11 tokens per second)
       eval time =   91223.57 ms /  1829 tokens (   ...  20.05 tokens per second)

Obviously the speed is lower on larger context.
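
If anyone wants to reproduce this more systematically, llama.cpp ships a llama-bench tool that measures PP and TG separately. A rough sketch (the model path is a placeholder and the token counts are arbitrary - check ./llama-bench --help for the exact flags on your build):

  # -p = prompt tokens processed (PP), -n = tokens generated (TG), -ngl = layers offloaded to GPU
  ./llama-bench -m /path/to/MiniMax-M2-UD-Q4_K_XL.gguf -p 2048 -n 128 -ngl 99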

I'm running it in router mode and using model.ini settings:

[local-minimax230b]
temp=1.0
top-p=0.95
top-k=40
ctx-size=65536

Works great!
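
For reference, requests go through llama-server's standard OpenAI-compatible API; a minimal example would look roughly like this (host/port and the prompt are just placeholders, the model name is the alias from model.ini above):

  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "local-minimax230b", "messages": [{"role": "user", "content": "Hello"}]}'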

Unsloth AI org

Oh very nice!

Would love to see what kind of speed you get with 100k of context. With the previous M2 I got 80 tokens per second on a simple prompt, but it slows down considerably over 100k - on my rig at least.

That's normal for transformer models :)
Linear (hybrid) attention models like qwen3-next or kimi-linear slow down less with long context.
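
Rough numbers behind that: with full attention every new token attends over the entire KV cache, so per-token decode cost scales roughly as O(n·d) in context length n (and the cache itself grows with n), while a linear/hybrid layer keeps a fixed-size state, so its per-token cost stays around O(d²) no matter how long the context gets.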

Tested with 16k context:

  • PP stays over 60 t/s
  • TG goes down to 10 t/s

Tried higher context, but requests with larger context fail:

srv operator(): http client error: Failed to read connection

Looks like that's because of the new llama.cpp "--fit" flag. Need to sort it out...

124k of context in opencode and I'm getting 22 t/s. Not bad. 3x RTX PRO 6000 Blackwell. Maybe I can improve it with a 4th GPU.

Hello.
My numbers from a homelab with two nodes built from scrap parts - maybe this info is useful for someone looking at secondhand components.
MiniMax-M2.1-UD-Q4_K_XL

Running params:
./llama-server -m /mnt/nvme1/llms/MiniMax-M2.1-UD-Q4_K_XL-00001-of-00003.gguf -c 16000 -fa on --host 0.0.0.0 --port 8080 --jinja --threads 36 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0 --presence-penalty 0.25 --no-context-shift -b 1000 -ub 1000 --no-mmap -fit off -ngl 99 -ctk q8_0 -ctv q8_0 --rpc 192.168.10.102:50052 -ts 8,8,8,15,15,8,0,0,8,15,15,15,17
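
For the --rpc part: the second node runs llama.cpp's rpc-server so its GPUs show up as extra devices in the -ts list. On that box it's started roughly like this (flag names can differ between builds, check ./rpc-server --help):

  ./rpc-server --host 0.0.0.0 --port 50052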

PP:
115 t/s at 2K
108.83 t/s at 6.2K

TG:
18 t/s at near-zero context
15 t/s at 2.5K
13.5 t/s at 6.8K
13.2 t/s at 8K
13.0 t/s at 9K
12.8 t/s at 10K

Latest llama.cpp + llama.cpp-rpc
2.5G LAN

Server1
i7-7800X (LGA2066) + 64GB DDR4 3200 (4 channels)
7800XT 16G
7800XT 16G
Mi50 16G
Mi50 16G

Server2
2x E5-2697V4 on Supermicro X10-DRG + 256GB DDR4 2400 (4 channels)
Nvidia-P100 16G
Nvidia-P100 16G
CMP 90HX 10G
CMP 50HX 10G
CMP 50HX 10G
P102-100 10G

PCIe switch, 1x x1 -> 4x x1 (this part ruins PP. DO NOT USE THAT)
P102-100 10G
P102-100 10G
P102-100 10G (disabled by -ts)
P102-100 10G (disabled by -ts)
