# MiniMax-M2.7 — 116 GB (MLX)
Mixed-precision MLX build of MiniMaxAI/MiniMax-M2.7, prepared by baa.ai.
## Metrics
| Metric | Value |
|---|---|
| Size on disk | 116 GB (25 shards) |
| Average bits per weight | 3.78 |
| Group size | 64 |
| Framework | MLX (Apple Silicon) |
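As a back-of-envelope sanity check on the numbers above (assuming decimal gigabytes and that the 3.78 bpw average already includes group-wise quantization metadata), disk size and bits per weight imply the total weight count:

```python
size_bits = 116.0e9 * 8   # 116 GB on disk, decimal gigabytes assumed
avg_bpw = 3.78            # average bits per weight, metadata included
implied_params = size_bits / avg_bpw
print(f"~{implied_params / 1e9:.0f}B parameters")  # → ~246B parameters
```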
## Benchmarks
| Benchmark | Score | Notes |
|---|---|---|
| HumanEval pass@1 (single-shot) | 93.3% (153/164) | 164/164 completed, 0 skipped |
Settings match the Recommended inference settings below. A second-pass retry of the 11 single-shot failures and decode-throughput measurement are pending; this card will be updated when those runs complete.
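Because each problem gets exactly one sample, pass@1 here is simply solved/total. The general unbiased pass@k estimator (standard HumanEval methodology, not specific to this card) reduces to that in the single-shot case:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem (n = k = 1), pass@1 is just the solve rate:
per_problem = [pass_at_k(1, 1, 1)] * 153 + [pass_at_k(1, 0, 1)] * 11
print(f"{sum(per_problem) / len(per_problem):.1%}")  # → 93.3%
```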
## Recommended inference settings
These are the sampler parameters used for the HumanEval run above and are a good default for reasoning / code tasks:
```python
sampler_params = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 8192,
}
```
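To illustrate what `top_k` and `top_p` actually do to the next-token distribution, here is a toy sketch in plain Python (not the MLX implementation): keep the `top_k` highest logits, then keep the smallest prefix whose renormalized probability mass reaches `top_p`.

```python
import math

def filter_logits(logits, top_k=40, top_p=0.95):
    # 1) keep only the top_k highest-scoring token indices
    indexed = sorted(enumerate(logits), key=lambda kv: kv[1], reverse=True)[:top_k]
    # 2) softmax over the survivors
    total = sum(math.exp(v) for _, v in indexed)
    probs = [(i, math.exp(v) / total) for i, v in indexed]
    # 3) nucleus: smallest prefix whose cumulative probability >= top_p
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append(i)
        cum += p
        if cum >= top_p:
            break
    return kept
```

With a sharply peaked distribution the nucleus collapses to a single token, which is why a high `temperature` of 1.0 still yields mostly deterministic code completions here.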
## Chat template — thinking mode

MiniMax-M2.7 uses a `<think>…</think>` reasoning block. **Important:** the base chat template injects `<think>\n` at the end of the prompt before generation, so the model's output begins inside the reasoning block with no opening tag. To recover the final answer, strip everything up to and including the first `</think>`:
```python
def strip_thinking(text: str) -> str:
    if "</think>" in text:
        return text.split("</think>", 1)[1].strip()
    return text.strip()
```
Give the model enough token budget to finish reasoning and emit the closing `</think>` tag: we recommend a `max_tokens` of at least 4096, and 8192 for harder problems.
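If you also want to inspect the reasoning, a hypothetical helper (not part of this card's API) can return both halves, treating a missing closing tag as reasoning that was truncated by the token budget:

```python
def split_completion(text: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer).

    The chat template opens the think block in the prompt, so the raw
    output starts mid-reasoning. Everything before the first </think>
    is reasoning; the rest is the answer. A missing tag usually means
    the model ran out of tokens before it finished thinking.
    """
    if "</think>" in text:
        reasoning, answer = text.split("</think>", 1)
        return reasoning.strip(), answer.strip()
    return text.strip(), ""
```

An empty answer string is therefore a useful signal to retry with a larger `max_tokens`.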
## Usage
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("baa-ai/MiniMax-M2.7-RAM-116GB-MLX")

sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)
logits_processors = make_logits_processors(repetition_penalty=1.1)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a Python function that reverses a string."}],
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=8192,
    sampler=sampler,
    logits_processors=logits_processors,
)

# Strip the reasoning block to recover the final answer
if "</think>" in response:
    response = response.split("</think>", 1)[1].strip()
print(response)
```
## Hardware
- Apple Silicon Mac with ~128 GB unified memory recommended for comfortable inference.
- Runs on less with swap, at substantially reduced throughput.
## Variants
| Variant | Size | Link |
|---|---|---|
| 100 GB | 100.1 GB | baa-ai/MiniMax-M2.7-RAM-100GB-MLX |
| 111 GB | 110.9 GB | baa-ai/MiniMax-M2.7-RAM-111GB-MLX |
| 116 GB | 116.0 GB | baa-ai/MiniMax-M2.7-RAM-116GB-MLX |
| 120 GB | 120.1 GB | baa-ai/MiniMax-M2.7-RAM-120GB-MLX |
## License
Inherited from the upstream MiniMax-M2.7 license: non-commercial use permitted; commercial use requires written authorization from MiniMax.
Quantized by baa.ai
## Model tree

Base model: MiniMaxAI/MiniMax-M2.7