Instructions to use OsaurusAI/MiniMax-M2.7-JANGTQ_K with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use OsaurusAI/MiniMax-M2.7-JANGTQ_K with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/MiniMax-M2.7-JANGTQ_K")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
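The README diff further down this page states that the default chat template leaves thinking ON and opens `<think>\n` after the assistant prefix, so `text` will likely contain the model's reasoning before the actual reply. The following is a minimal sketch for separating the two. It assumes the reasoning is terminated by a `</think>` tag; the helper name `split_thinking` is hypothetical and not part of `mlx_lm`.

```python
def split_thinking(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split generated text into (reasoning, answer).

    Assumes the prompt already ended with the opening <think> tag, so the
    completion holds the reasoning, then close_tag, then the final answer.
    If close_tag never appears, the whole text is returned as the answer.
    """
    reasoning, sep, answer = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    # Tolerate an opening tag that was echoed into the completion.
    reasoning = reasoning.replace("<think>", "", 1)
    return reasoning.strip(), answer.strip()


if __name__ == "__main__":
    sample = "Start with his childhood in Ulm.\n</think>\n\nAlbert grew up in Ulm."
    reasoning, answer = split_thinking(sample)
    print(answer)
```

If generation stops at the token limit before the closing tag is produced, this sketch returns the unfinished reasoning as the answer, so it may be worth passing a larger `max_tokens` to `generate` when thinking is on.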

Local Apps

Pi

How to use OsaurusAI/MiniMax-M2.7-JANGTQ_K with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/MiniMax-M2.7-JANGTQ_K"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "OsaurusAI/MiniMax-M2.7-JANGTQ_K"
        }
      ]
    }
  }
}
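If `~/.pi/agent/models.json` already lists other providers, pasting the JSON above over it would drop them. The sketch below merges the `mlx-lm` entry into an existing configuration instead. The helper names (`merge_provider`, `update_models_file`) are hypothetical, and the sketch assumes the file uses only the top-level `providers` key shown above.

```python
import json
from pathlib import Path

MODEL_ID = "OsaurusAI/MiniMax-M2.7-JANGTQ_K"

# Provider entry copied from the configuration shown above.
MLX_PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": MODEL_ID}],
}


def merge_provider(config: dict, name: str, provider: dict) -> dict:
    """Return a copy of config with providers[name] set to provider."""
    merged = dict(config)
    providers = dict(merged.get("providers", {}))
    providers[name] = provider
    merged["providers"] = providers
    return merged


def update_models_file(path: Path) -> None:
    """Read path if it exists, add the mlx-lm provider, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    merged = merge_provider(existing, "mlx-lm", MLX_PROVIDER)
    path.write_text(json.dumps(merged, indent=2) + "\n")


if __name__ == "__main__":
    # Dry run on an in-memory config; nothing is written to disk here.
    print(json.dumps(merge_provider({}, "mlx-lm", MLX_PROVIDER), indent=2))
```

To apply it for real, call `update_models_file(Path.home() / ".pi" / "agent" / "models.json")`.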

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use OsaurusAI/MiniMax-M2.7-JANGTQ_K with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/MiniMax-M2.7-JANGTQ_K"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default OsaurusAI/MiniMax-M2.7-JANGTQ_K

Run Hermes

hermes

MLX LM

How to use OsaurusAI/MiniMax-M2.7-JANGTQ_K with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "OsaurusAI/MiniMax-M2.7-JANGTQ_K"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "OsaurusAI/MiniMax-M2.7-JANGTQ_K"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "OsaurusAI/MiniMax-M2.7-JANGTQ_K",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
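The same request can be sent from Python using only the standard library. This is a sketch, not an official client: `build_chat_request` and `chat` are hypothetical helper names, the response is assumed to follow the usual OpenAI chat-completions shape (`choices[0].message.content`), and the URL targets port 8080, which is the `mlx_lm.server` default and the port used in the Pi and Hermes configurations on this page.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumes the server runs on its default port
MODEL_ID = "OsaurusAI/MiniMax-M2.7-JANGTQ_K"


def build_chat_request(content: str, base_url: str = BASE_URL,
                       model: str = MODEL_ID) -> urllib.request.Request:
    """Build a POST request for a single-turn chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(content: str, timeout: float = 600.0) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(content), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Show what would be sent; call chat("Hello") once the server is running.
    req = build_chat_request("Hello")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

With thinking enabled the model may generate for a long time before answering, which is why the sketch uses a generous timeout.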

Add MMLU-200 (93.5%), speed/memory benchmarks, fix variants table sizes (47→56 GB), expand topic tags

by dealignai - opened 18 days ago

base: refs/heads/main ← from: refs/pr/1


README.md +22 -4


@@ -14,6 +14,12 @@ tags:
   - minimax-m2
   - moe
   - apple-silicon
+  - text-generation
+  - conversational
+  - reasoning
+  - chain-of-thought
+  - quantization
+  - 230b
 pipeline_tag: text-generation
 base_model: MiniMaxAI/MiniMax-M2.7
 base_model_relation: quantized
@@ -39,6 +45,18 @@ JANGTQ_K** quantization in JANGTQ-PRESTACK layout.
 - **Bundle size:** **~74 GB on-disk** (~3-bit avg routed)
 - **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
 
+## Benchmarks
+
+| Metric | Value | Setup |
+|---|---|---|
+| **MMLU-200** | **93.5%** (187/200) | thinking ON, `q_per_subject=20`, 10 subjects |
+| Median speed | ~37 tok/s | M4 Max 128 GB, MLX 0.31 |
+| GPU memory at load | ~75 GB | warm |
+
+MMLU eval used the standard `mmlu_jangtq_resume.py` runner with the model's
+default chat template (`enable_thinking` undefined → thinking ON, which the
+M2.7 template auto-opens with `<think>\n` after the assistant prefix).
+
 ## Why mixed-bit?
 
 `down_proj`'s output enters the residual stream and accumulates across
@@ -49,10 +67,10 @@ quality close to full-4-bit (~115 GB) at **64% the size**.
 
 ## Variants in the MiniMax-M2.7 line
 
-| Variant | Routed bits (avg) | Bundle size | Use case |
-|---|---|---|---|
-| `MiniMax-M2.7-JANGTQ` | 2-bit | 47 GB | smallest, best for tight RAM |
-| **`MiniMax-M2.7-JANGTQ_K` (this)** | **~3-bit (mixed 2/4)** | **74 GB** | **quality close to 4-bit at 2-bit-ish size** |
+| Variant | Routed bits (avg) | Size | MMLU-200 | Use case |
+|---|---|---|---|---|
+| [`MiniMax-M2.7-JANGTQ`](https://huggingface.co/OsaurusAI/MiniMax-M2.7-JANGTQ) | 2-bit | 56 GB | 91.5% | smallest, best for tight RAM |
+| **`MiniMax-M2.7-JANGTQ_K` (this)** | **~3-bit (mixed 2/4)** | **74 GB** | **93.5%** | **+2.0pp MMLU vs JANGTQ for +18 GB** |
 
 ## Loading
 
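For review purposes, the figures added in this diff can be cross-checked with a few lines of arithmetic. The sketch below only recomputes numbers stated in the README (187/200, 91.5%, 56/74/115 GB). The count of 183 correct answers for JANGTQ is derived from 91.5% of 200 and is not stated in the source, and the binomial standard error is an added rough indication of sampling noise under an independence assumption, not something the PR reports.

```python
import math


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage."""
    return 100.0 * correct / total


def binomial_se_pct(correct: int, total: int) -> float:
    """Binomial standard error of the accuracy, in percentage points."""
    p = correct / total
    return 100.0 * math.sqrt(p * (1.0 - p) / total)


TOTAL = 20 * 10                       # q_per_subject * subjects
K_CORRECT = 187                       # JANGTQ_K, from the Benchmarks table
BASE_CORRECT = round(0.915 * TOTAL)   # JANGTQ: derived from 91.5%, not stated directly
SIZE_GB = {"JANGTQ": 56, "JANGTQ_K": 74, "full-4-bit": 115}

if __name__ == "__main__":
    print(f"JANGTQ_K accuracy: {accuracy_pct(K_CORRECT, TOTAL):.1f}%")
    print(f"Gap vs JANGTQ: {K_CORRECT - BASE_CORRECT} questions, "
          f"{accuracy_pct(K_CORRECT - BASE_CORRECT, TOTAL):.1f} pp")
    print(f"Size delta: +{SIZE_GB['JANGTQ_K'] - SIZE_GB['JANGTQ']} GB")
    print(f"Size vs full 4-bit: {100 * SIZE_GB['JANGTQ_K'] / SIZE_GB['full-4-bit']:.0f}%")
    print(f"Standard error at n={TOTAL}: {binomial_se_pct(K_CORRECT, TOTAL):.2f} pp")
```

The stated values are internally consistent: 187/200 is 93.5%, the gap to JANGTQ is 2.0 pp (4 questions), the size difference is 18 GB, and 74 GB is about 64% of 115 GB. With only 200 questions, a 4-question gap is small relative to the sampling noise, so the MMLU column is best read as indicative.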