MiniMax-M2.7-JANG_K

MiniMax M2.7 quantized to 86 GB on disk (down from the ~230 GB FP8 source) with mixed-bit JANG_K affine quantization (mx.quantize) and prestacked switch_mlp experts.

  • Source: MiniMaxAI/MiniMax-M2.7 (62 layers, 256 routed experts with top-8 routing, 196K context)
  • Quantization: mixed-bit affine (mx.quantize, group_size=128):
    • down_proj: 4-bit (output enters residual stream — more sensitive)
    • gate_proj: 2-bit + AWQ pre-scaling (gated activation)
    • up_proj: 2-bit + AWQ pre-scaling (gated activation)
    • attention q/k/v/o_proj: 8-bit affine
    • embed: 6-bit / lm_head: 8-bit
    • norms / router gate / expert_bias: fp16 passthrough
  • Routed-expert layout: prestacked along axis 0 as block_sparse_moe.switch_mlp.{gate,up,down}_proj of shape (n_experts, out, in_packed) — instant cold load, no runtime sidecar.
  • Bundle size: ~86 GB on disk (~3.0-bit average across routed experts, including AWQ scales)
  • Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio 256 GB
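
As a sketch, the recipe above maps onto parameter paths roughly like this (module names such as self_attn and embed_tokens are assumptions based on mlx-lm's usual layout; the helper is illustrative, not the actual toolchain):

import mlx.core as mx

def bits_for(path: str):
    # Hypothetical mapping of the bit-width recipe above onto parameter
    # paths; module names assumed from mlx-lm's MiniMax layout.
    if "switch_mlp.down_proj" in path:
        return 4    # output enters the residual stream: more sensitive
    if "switch_mlp.gate_proj" in path or "switch_mlp.up_proj" in path:
        return 2    # gated activations: 2-bit + AWQ pre-scaling
    if "self_attn" in path:
        return 8    # q/k/v/o_proj
    if "embed_tokens" in path:
        return 6
    if "lm_head" in path:
        return 8
    return None     # norms / router gate / expert_bias stay fp16

# Routed experts are prestacked along axis 0, so each switch_mlp weight is
# (n_experts, out, in) before packing; mx.quantize packs the last axis and
# returns (w_q, scales, biases):
# w_q, scales, biases = mx.quantize(w, group_size=128, bits=bits_for(path))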

Why JANG_K?

down_proj's output enters the residual stream and accumulates across all 62 layers, so quantization noise compounds. gate_proj and up_proj enter through SwiGLU's multiplicative gate (silu(gate) × up), which dampens noise. Spending 4 bits on down_proj and 2 bits on gate/up gives quality close to a full 4-bit model at a considerably smaller size, as the sketch below illustrates.
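
A toy illustration of that argument (arbitrary sizes and noise level, not a benchmark): inject the same perturbation once through the gate and once directly on the output, and compare how much error survives relative to the clean activation.

import mlx.core as mx

mx.random.seed(0)
gate = mx.random.normal((4096,))
up = mx.random.normal((4096,))
noise = 0.05 * mx.random.normal((4096,))

silu = lambda x: x * mx.sigmoid(x)
clean = silu(gate) * up                  # SwiGLU: silu(gate) * up

via_gate = silu(gate + noise) * up       # noise entering through the gate
on_output = clean + noise                # noise landing on the output (down_proj case)

rel = lambda y: float(mx.linalg.norm(y - clean) / mx.linalg.norm(clean))
print(f"via gate:  {rel(via_gate):.3f}")     # dampened by the gate
print(f"on output: {rel(on_output):.3f}")    # nothing dampens it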

AWQ

Activation-aware scaling on the 2-bit projections (gate_proj, up_proj):

  • Per-layer scale of shape (hidden,), one entry per input channel: s = clip((max(|x|) + eps)^0.5, min=1.0), estimated from 16 calibration prompts of ≤256 tokens each; the floor of 1.0 keeps the inverse fold from amplifying dead channels
  • Pre-scale weights along input axis: W' = W * s[None, None, :]
  • Inverse fold into preceding norm: post_attention_layernorm.weight /= s
  • Forward math is preserved exactly; the quantization grid is reallocated toward high-importance input channels.

down_proj does not need AWQ — it stays at 4-bit.
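
A minimal sketch of the fold, assuming stacked expert weights of shape (n_experts, out, hidden) and a calibration-derived per-channel max |x|; the helper name and the eps value are hypothetical:

import mlx.core as mx

def awq_prescale(w, norm_w, act_absmax, eps=1e-5):
    # w:          (n_experts, out, hidden) stacked gate_proj or up_proj
    # norm_w:     (hidden,) preceding post_attention_layernorm weight
    # act_absmax: (hidden,) per-channel max |x| from calibration
    s = mx.maximum((act_absmax + eps) ** 0.5, 1.0)  # floor=1.0: don't amplify dead channels
    w_scaled = w * s[None, None, :]                 # pre-scale along the input axis
    norm_folded = norm_w / s                        # inverse fold: forward math unchanged
    return w_scaled, norm_folded

# The pre-scaled weight is then quantized at 2-bit:
# w_q, scales, biases = mx.quantize(w_scaled, group_size=128, bits=2)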

Loading

Loadable via stock mlx-lm (no JANG runtime required):

from mlx_lm import load, generate
model, tok = load("JANGQ-AI/MiniMax-M2.7-JANG_K")

messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 tokenize=False)
print(generate(model, tok, prompt=prompt, max_tokens=128))

Reasoning + tools

  • Default: thinking ON (the chat template inserts <think>\n after the assistant prefix)
  • Reasoning parser: qwen3 (extracts <think>...</think> blocks)
  • Tool parser: minimax
  • Disable reasoning:
    prompt = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     tokenize=False, enable_thinking=False)
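
If you consume raw completions without a serving stack, a qwen3-style reasoning split is a simple tag scan. A rough sketch (this helper is not part of mlx-lm):

import re

def split_think(text):
    # Separate the <think>...</think> block from the final answer,
    # mirroring what a qwen3-style reasoning parser does.
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return None, text.strip()
    return m.group(1).strip(), text[m.end():].strip()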
    

Variants in the MiniMax-M2.7 line on JANGQ-AI

Variant                       Routed bits             Bundle size  Loader
MiniMax-M2.7-JANGTQ           2-bit codebook          47 GB        jang_tools.load_jangtq
MiniMax-M2.7-JANGTQ_K         mixed 2/4 codebook      74 GB        jang_tools.load_jangtq
MiniMax-M2.7-JANG_K (this)    mixed 2/4 affine + AWQ  86 GB        stock mlx_lm

Credits

  • Quantization toolchain: JANG by Jinho Jang <eric@jangq.ai>
  • Base model: MiniMax-M2.7 by MiniMaxAI
  • Pipeline: MiniMax M2.7 → JANG affine quantization (per-projection 2/4/2 bits, AWQ on the 2-bit gate/up projections) → release on JANGQ-AI