MiniMax-M2.7 — 116 GB (MLX)

Mixed-precision MLX build of MiniMaxAI/MiniMax-M2.7, prepared by baa.ai.

Metrics

| Metric | Value |
|---|---|
| Size on disk | 116 GB (25 shards) |
| Parameters | 229 B |
| Average bits per weight | 3.78 |
| Group size | 64 |
| Framework | MLX (Apple Silicon) |
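As a sanity check, the average bits per weight roughly reconciles with the size on disk. A back-of-the-envelope sketch (the 229B parameter count comes from the upstream model; the remaining few GB are shard metadata and tensors kept at higher precision):

```python
# Rough weight-payload estimate from the average bits per weight.
params = 229e9          # upstream MiniMax-M2.7 parameter count
bits_per_weight = 3.78  # average across quantized and unquantized tensors

est_gb = params * bits_per_weight / 8 / 1e9
print(f"estimated weight payload: {est_gb:.1f} GB")  # ~108 GB of the 116 GB on disk
```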

Benchmarks

| Benchmark | Score | Notes |
|---|---|---|
| HumanEval pass@1 (single-shot) | 93.3% (153/164) | 164/164 completed, 0 skipped |

Settings match the Recommended inference settings below. A second-pass retry of the 11 single-shot failures and decode-throughput measurement are pending; this card will be updated when those runs complete.

Recommended inference settings

These are the sampler parameters used for the HumanEval run above and are a good default for reasoning / code tasks:

```python
sampler_params = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 8192,
}
```

Chat template — thinking mode

MiniMax-M2.7 uses a `<think>…</think>` reasoning block. Important: the base chat template appends `<think>\n` to the end of the prompt before generation, so the model's output begins inside the reasoning block with no opening tag. To recover the final answer, strip everything up to and including the first `</think>`:

```python
def strip_thinking(text: str) -> str:
    # Keep only the text after the first closing tag; the opening
    # <think> tag is injected by the chat template, not generated.
    if "</think>" in text:
        return text.split("</think>", 1)[1].strip()
    return text.strip()
```

Give the model enough token budget that it can finish reasoning and emit the closing `</think>` tag — we recommend at least 4096 tokens, and 8192 for harder problems.
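A cheap way to detect a budget that was too small is to check for the closing tag before trusting the output (a minimal sketch; the helper name is our own, not part of mlx_lm):

```python
def finished_thinking(text: str) -> bool:
    # If the closing tag never appeared, generation was cut off
    # mid-reasoning and the visible text is not a final answer.
    return "</think>" in text

print(finished_thinking("some reasoning</think>def reverse(s): ..."))  # True
print(finished_thinking("some reasoning that ran out of tokens"))      # False
```

If this returns False, re-run with a larger `max_tokens` rather than stripping the partial reasoning.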

Usage

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("baa-ai/MiniMax-M2.7-RAM-116GB-MLX")

sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)
logits_processors = make_logits_processors(repetition_penalty=1.1)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a Python function that reverses a string."}],
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=8192,
    sampler=sampler,
    logits_processors=logits_processors,
)

# Strip the reasoning block to recover the final answer.
if "</think>" in response:
    response = response.split("</think>", 1)[1].strip()
print(response)
```

Hardware

  • Apple Silicon Mac with ~128 GB unified memory recommended for comfortable inference.
  • Runs on less with swap, at substantially reduced throughput.
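Before loading, you can check whether the machine has enough unified memory. A hedged sketch using only the standard library (the 116 GB figure is this build's size on disk; the headroom allowance for KV cache and runtime overhead is an assumption):

```python
import os

# Total physical memory via POSIX sysconf (works on macOS and Linux).
total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9

model_gb = 116    # size of this build on disk
headroom_gb = 8   # rough allowance for KV cache and runtime overhead (assumption)

if total_gb < model_gb + headroom_gb:
    print(f"only {total_gb:.0f} GB RAM; expect heavy swapping")
else:
    print(f"{total_gb:.0f} GB RAM; should fit comfortably")
```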

Variants

License

Inherited from the upstream MiniMax-M2.7 license: non-commercial use permitted; commercial use requires written authorization from MiniMax.


