MiniMax-M2.7-JANG_K

MiniMax M2.7 quantized to 86 GB on disk (down from the ~230 GB FP8 source) with mixed-bit JANG_K affine quantization (mx.quantize) and prestacked switch_mlp experts.

  • Source: MiniMaxAI/MiniMax-M2.7 (62 layers, 256 routed experts with top-8 routing, 196K context)
  • Quantization: mixed-bit affine (mx.quantize, group_size=128):
    • down_proj: 4-bit (output enters residual stream — more sensitive)
    • gate_proj: 2-bit + AWQ pre-scaling (gated activation)
    • up_proj: 2-bit + AWQ pre-scaling (gated activation)
    • attention q/k/v/o_proj: 8-bit affine
    • embed: 6-bit / lm_head: 8-bit
    • norms / router gate / expert_bias: fp16 passthrough
  • Routed-expert layout: prestacked along axis 0 as block_sparse_moe.switch_mlp.{gate,up,down}_proj of shape (n_experts, out, in_packed) — instant cold load, no runtime sidecar.
  • Bundle size: ~86 GB on disk (~3.0-bit average across routed experts, including AWQ scales)
  • Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio 256 GB
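
As a sketch, the recipe above maps onto parameter paths roughly like this (module names such as self_attn and embed_tokens are assumptions based on mlx-lm's usual layout; the helper is illustrative, not the actual toolchain):

import mlx.core as mx

def bits_for(path: str):
    # Hypothetical mapping of the bit-width recipe above onto parameter
    # paths; module names assumed from mlx-lm's MiniMax layout.
    if "switch_mlp.down_proj" in path:
        return 4    # output enters the residual stream: more sensitive
    if "switch_mlp.gate_proj" in path or "switch_mlp.up_proj" in path:
        return 2    # gated activations: 2-bit + AWQ pre-scaling
    if "self_attn" in path:
        return 8    # q/k/v/o_proj
    if "embed_tokens" in path:
        return 6
    if "lm_head" in path:
        return 8
    return None     # norms / router gate / expert_bias stay fp16

# Routed experts are prestacked along axis 0, so each switch_mlp weight is
# (n_experts, out, in) before packing; mx.quantize packs the last axis and
# returns (w_q, scales, biases):
# w_q, scales, biases = mx.quantize(w, group_size=128, bits=bits_for(path))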

Why JANG_K?

down_proj's output enters the residual stream and accumulates across all 62 layers, so quantization noise compounds. gate_proj and up_proj enter through SwiGLU's multiplicative gate (silu(gate) × up), which dampens noise. Spending 4 bits on down_proj and 2 bits on gate/up gives quality close to a full 4-bit model at a considerably smaller size, as the sketch below illustrates.
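
A toy illustration of that argument (arbitrary sizes and noise level, not a benchmark): inject the same perturbation once through the gate and once directly on the output, and compare how much error survives relative to the clean activation.

import mlx.core as mx

mx.random.seed(0)
gate = mx.random.normal((4096,))
up = mx.random.normal((4096,))
noise = 0.05 * mx.random.normal((4096,))

silu = lambda x: x * mx.sigmoid(x)
clean = silu(gate) * up                  # SwiGLU: silu(gate) * up

via_gate = silu(gate + noise) * up       # noise entering through the gate
on_output = clean + noise                # noise landing on the output (down_proj case)

rel = lambda y: float(mx.linalg.norm(y - clean) / mx.linalg.norm(clean))
print(f"via gate:  {rel(via_gate):.3f}")     # dampened by the gate
print(f"on output: {rel(on_output):.3f}")    # nothing dampens it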

AWQ

Activation-aware scaling on the 2-bit projections (gate_proj, up_proj):

  • Per-layer scale of shape (hidden,), one entry per input channel: s = clip((max(|x|) + eps)^0.5, min=1.0), estimated from 16 calibration prompts of ≤256 tokens each; the floor of 1.0 keeps the inverse fold from amplifying dead channels
  • Pre-scale weights along input axis: W' = W * s[None, None, :]
  • Inverse fold into preceding norm: post_attention_layernorm.weight /= s
  • Forward math is preserved exactly; the quantization grid is reallocated toward high-importance input channels.

down_proj does not need AWQ — it stays at 4-bit.
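
A minimal sketch of the fold, assuming stacked expert weights of shape (n_experts, out, hidden) and a calibration-derived per-channel max |x|; the helper name and the eps value are hypothetical:

import mlx.core as mx

def awq_prescale(w, norm_w, act_absmax, eps=1e-5):
    # w:          (n_experts, out, hidden) stacked gate_proj or up_proj
    # norm_w:     (hidden,) preceding post_attention_layernorm weight
    # act_absmax: (hidden,) per-channel max |x| from calibration
    s = mx.maximum((act_absmax + eps) ** 0.5, 1.0)  # floor=1.0: don't amplify dead channels
    w_scaled = w * s[None, None, :]                 # pre-scale along the input axis
    norm_folded = norm_w / s                        # inverse fold: forward math unchanged
    return w_scaled, norm_folded

# The pre-scaled weight is then quantized at 2-bit:
# w_q, scales, biases = mx.quantize(w_scaled, group_size=128, bits=2)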

Loading

Loadable via stock mlx-lm (no JANG runtime required):

from mlx_lm import load, generate
model, tok = load("JANGQ-AI/MiniMax-M2.7-JANG_K")

messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 tokenize=False)
print(generate(model, tok, prompt=prompt, max_tokens=128))

Reasoning + tools

  • Default: thinking ON (the chat template inserts <think>\n after the assistant prefix)
  • Reasoning parser: qwen3 (extracts <think>...</think> blocks)
  • Tool parser: minimax
  • Disable reasoning:
    prompt = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     tokenize=False, enable_thinking=False)
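
If you consume raw completions without a serving stack, a qwen3-style reasoning split is a simple tag scan. A rough sketch (this helper is not part of mlx-lm):

import re

def split_think(text):
    # Separate the <think>...</think> block from the final answer,
    # mirroring what a qwen3-style reasoning parser does.
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return None, text.strip()
    return m.group(1).strip(), text[m.end():].strip()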
    

Variants in the MiniMax-M2.7 line on JANGQ-AI

Variant                       Routed bits             Bundle size  Loader
MiniMax-M2.7-JANGTQ           2-bit codebook          47 GB        jang_tools.load_jangtq
MiniMax-M2.7-JANGTQ_K         mixed 2/4 codebook      74 GB        jang_tools.load_jangtq
MiniMax-M2.7-JANG_K (this)    mixed 2/4 affine + AWQ  86 GB        stock mlx_lm

Credits

  • Quantization toolchain: JANG by Jinho Jang <eric@jangq.ai>
  • Base model: MiniMax-M2.7 by MiniMaxAI
  • Pipeline: MiniMax M2.7 → JANG affine quantization (per-projection 2/4/2 bits, AWQ on the 2-bit gate/up projections) → release on JANGQ-AI