Anyone running this with an M4 Max 128 GB? How does it compare to 4-bit quantization?
Thanks for pushing this model. I've seen that the 4-bit quantization, which would be my standard go-to version, of MiniMax M2.1 is too large to fit in 128 GB, and that it still seems to have issues with the thinking templates. I'm wondering if anyone has successfully tried this with 128 GB of shared memory and whether there were any issues. Also, what tokens/s are you getting, and with which infrastructure?
Unsloth says they made some fixes to the chat template. Their Jinja template can be found at https://huggingface.co/unsloth/MiniMax-M2.1/blob/main/chat_template.jinja
Would you be able to test it with their template to see if that template solves your issue please?
This model is 100 GB in size.
You might have to run something like `sudo sysctl iogpu.wired_limit_mb=117760` in the terminal to tell macOS you want to allow 115 GB of wired GPU memory. You could even try `122880` for 120 GB.
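To make the arithmetic explicit, here's a small sketch of where those numbers come from: `iogpu.wired_limit_mb` takes a value in megabytes, so you multiply the desired GB by 1024. The variable names below are just for illustration; the setting resets on reboot, so you'd re-run it after restarting.

```shell
# iogpu.wired_limit_mb expects megabytes, so convert GB -> MB
GPU_GB=115
GPU_MB=$((GPU_GB * 1024))
echo "$GPU_MB"   # prints 117760

# Apply it (needs sudo; does not persist across reboots):
# sudo sysctl iogpu.wired_limit_mb=$GPU_MB
```

Leaving ~8-13 GB for macOS itself is the safety margin here; pushing to 120 GB (122880 MB) works for some people but risks the system swapping or killing the process under memory pressure.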