Instructions to use JANGQ-AI/MiniMax-M2.7-JANG_K with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use JANGQ-AI/MiniMax-M2.7-JANG_K with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("JANGQ-AI/MiniMax-M2.7-JANG_K")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)
text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use JANGQ-AI/MiniMax-M2.7-JANG_K with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "JANGQ-AI/MiniMax-M2.7-JANG_K"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to `~/.pi/agent/models.json`:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "JANGQ-AI/MiniMax-M2.7-JANG_K" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use JANGQ-AI/MiniMax-M2.7-JANG_K with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "JANGQ-AI/MiniMax-M2.7-JANG_K"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default JANGQ-AI/MiniMax-M2.7-JANG_K
```
Run Hermes
```bash
hermes
```
- MLX LM
How to use JANGQ-AI/MiniMax-M2.7-JANG_K with MLX LM:
Generate or start a chat session
```bash
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "JANGQ-AI/MiniMax-M2.7-JANG_K"
```
Run an OpenAI-compatible server
```bash
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "JANGQ-AI/MiniMax-M2.7-JANG_K"

# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "JANGQ-AI/MiniMax-M2.7-JANG_K",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
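The same endpoint can also be called from Python with any OpenAI-compatible client. A minimal sketch using the `openai` package (the package install and the port are assumptions — match the port your `mlx_lm.server` actually listens on):

```python
# pip install openai
from openai import OpenAI

# Point the client at the local mlx_lm server; the API key is unused but
# required by the client constructor.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="JANGQ-AI/MiniMax-M2.7-JANG_K",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```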

MiniMax-M2.7-JANG_K
MiniMax M2.7 — 86 GB on disk (down from ~230 GB FP8 source) — mixed-bit JANG_K quantization using `mx.quantize` affine, with prestacked `switch_mlp` experts.
- Source: MiniMaxAI/MiniMax-M2.7 (62 layers, 256 routed experts top-8, 196K context)
- Quantization: mixed-bit affine (`mx.quantize`, `group_size=128`; a sketch follows this list):
  - `down_proj`: 4-bit (output enters residual stream — more sensitive)
  - `gate_proj`: 2-bit + AWQ pre-scaling (gated activation)
  - `up_proj`: 2-bit + AWQ pre-scaling (gated activation)
  - attention `q/k/v/o_proj`: 8-bit affine
  - embed: 6-bit / lm_head: 8-bit
  - norms / router gate / expert_bias: fp16 passthrough
- Routed-expert layout: prestacked along axis 0 as `block_sparse_moe.switch_mlp.{gate,up,down}_proj` of shape `(n_experts, out, in_packed)` — instant cold load, no runtime sidecar.
- Bundle size: ~86 GB on disk (~3.0-bit average for routed experts, including AWQ scales)
- Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio 256 GB
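A minimal sketch of how such a per-projection bit map could be expressed with stock MLX primitives (the bit map dictionary and helper names are illustrative, not the release toolchain). The headline size is consistent with this split: ~230 GB of FP8 weights dominated by routed experts at a ~3.0-bit average works out to roughly 230 × 3.0 / 8 ≈ 86 GB.

```python
import mlx.core as mx

# Illustrative per-projection bit allocation (not the release toolchain):
# down_proj gets 4 bits, gate/up get 2 bits, attention projections get 8 bits.
BITS = {"down_proj": 4, "gate_proj": 2, "up_proj": 2,
        "q_proj": 8, "k_proj": 8, "v_proj": 8, "o_proj": 8}
GROUP_SIZE = 128

def quantize_projection(name: str, w: mx.array):
    """Affine-quantize one weight matrix at its assigned bit width."""
    bits = BITS[name]
    w_q, scales, biases = mx.quantize(w, group_size=GROUP_SIZE, bits=bits)
    return w_q, scales, biases, bits

def dequantize_projection(w_q, scales, biases, bits):
    """Recover an approximate fp weight, e.g. to inspect quantization error."""
    return mx.dequantize(w_q, scales, biases, group_size=GROUP_SIZE, bits=bits)
```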
Why JANG_K?
`down_proj`'s output enters the residual stream and accumulates across all 62 layers — quantization noise compounds. `gate_proj` and `up_proj` enter through SwiGLU's multiplicative gate (silu(gate) × up), which dampens noise. Spending 4 bits on down and 2 bits on gate/up gives quality close to full 4-bit at a considerably smaller size.
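A schematic of where each projection's error lands, assuming nothing beyond the standard SwiGLU formula (plain matmuls, names illustrative):

```python
import mlx.core as mx

def silu(t):
    return t * mx.sigmoid(t)

def swiglu_mlp(x, w_gate, w_up, w_down):
    """One SwiGLU MLP, written as plain matmuls for clarity (shapes schematic).

    Noise in w_gate / w_up is filtered through the multiplicative gate
    silu(gate) * up, while w_down's output is added straight onto the
    residual stream, so its quantization error compounds across layers.
    """
    hidden = silu(x @ w_gate.T) * (x @ w_up.T)   # 2-bit + AWQ is tolerable here
    return hidden @ w_down.T                     # 4-bit: lands on the residual stream

# Residual update per layer (schematic):
#   h = h + swiglu_mlp(rmsnorm(h), w_gate, w_up, w_down)
```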
AWQ
Activation-aware scaling on the 2-bit projections (gate_proj, up_proj):
- Per-layer `(hidden,)` scale: `s = clip((max(|x|) + eps)^0.5, min=1.0)` (16 calibration prompts × ≤256 tokens; the floor of 1.0 prevents the inverse fold from amplifying dead channels)
- Pre-scale weights along the input axis: `W' = W * s[None, None, :]`
- Inverse fold into the preceding norm: `post_attention_layernorm.weight /= s`
- Forward math is preserved exactly; the quantization grid is reallocated toward high-importance input channels.

`down_proj` does not need AWQ — it stays at 4-bit.
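A minimal sketch of the fold described above. It operates on per-expert 2-D weights (the release prestacks experts along axis 0, hence `s[None, None, :]` above); the function name and variable names are illustrative:

```python
import mlx.core as mx

def awq_prescale(xs, w_gate, w_up, ln_weight, eps=1e-6):
    """Activation-aware pre-scaling for the 2-bit gate/up projections.

    xs: calibration activations entering the MLP, shape (tokens, hidden).
    Returns pre-scaled weights plus the norm weight with the inverse fold
    applied, so the forward math is unchanged: (x / s) @ (W * s).T == x @ W.T.
    """
    # Per-channel importance from calibration activations, floored at 1.0
    # so the inverse fold never amplifies dead channels.
    s = mx.maximum((mx.max(mx.abs(xs), axis=0) + eps) ** 0.5, 1.0)  # (hidden,)

    # Pre-scale along the input axis: the 2-bit grid now spends more of its
    # range on high-importance channels.
    w_gate_scaled = w_gate * s[None, :]
    w_up_scaled = w_up * s[None, :]

    # Fold the inverse scale into the preceding norm so activations arrive
    # pre-divided by s.
    ln_weight_folded = ln_weight / s
    return w_gate_scaled, w_up_scaled, ln_weight_folded
```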
Loading
Loadable via stock mlx-lm (no JANG runtime required):
```python
from mlx_lm import load, generate

model, tok = load("JANGQ-AI/MiniMax-M2.7-JANG_K")

messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 tokenize=False)

print(generate(model, tok, prompt=prompt, max_tokens=128))
```
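For interactive use, mlx-lm also provides `stream_generate`, which yields tokens as they are produced. A short sketch continuing from the snippet above (the yielded object's `.text` field is an assumption that holds for recent mlx-lm versions; older releases expose a different shape):

```python
from mlx_lm import stream_generate

# Stream the response token by token instead of waiting for the full string.
for response in stream_generate(model, tok, prompt=prompt, max_tokens=128):
    print(response.text, end="", flush=True)
print()
```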
Reasoning + tools
- Default: thinking ON (the chat template inserts `<think>\n` after the assistant prefix)
- Reasoning parser: `qwen3` (extracts `<think>...</think>` blocks)
- Tool parser: `minimax`
- Disable reasoning: `prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False, enable_thinking=False)`
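If you consume raw generations directly (rather than through a server with the reasoning parser configured), the think block can be split off with ordinary string handling. A minimal sketch, assuming the default template that has already opened `<think>` in the prompt:

```python
import re

def split_reasoning(text: str):
    """Split raw output into (reasoning, answer).

    The chat template already opens <think>, so the generation itself
    usually contains only the closing </think> tag.
    """
    m = re.search(r"(?:<think>)?(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()
```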
Variants in the MiniMax-M2.7 line on JANGQ-AI
| Variant | Routed bits | Bundle size | Loader |
|---|---|---|---|
| MiniMax-M2.7-JANGTQ | 2-bit codebook | 47 GB | `jang_tools.load_jangtq` |
| MiniMax-M2.7-JANGTQ_K | mixed 2/4 codebook | 74 GB | `jang_tools.load_jangtq` |
| MiniMax-M2.7-JANG_K (this) | mixed 2/4 affine + AWQ | 86 GB | stock `mlx_lm` |
Credits
- Quantization toolchain: JANG by Jinho Jang <eric@jangq.ai>
- Base model: MiniMax-M2.7 by MiniMaxAI
- Pipeline: MiniMax M2 → JANG affine quantization (per-projection 2/4/2 + AWQ on 2-bit gates) → release on JANGQ-AI