🔧 2026-04-14 · chat_template.jinja fix — re-download if cached

Earlier versions of this repo shipped a chat_template that unconditionally forced <think> reasoning mode, ignoring enable_thinking=False. Synced with JANG_2L: the new template respects the flag, so callers can now skip reasoning for fast direct answers.

If you downloaded this model before 2026-04-14, please re-download chat_template.jinja:

huggingface-cli download JANGQ-AI/MiniMax-M2.7-JANGTQ chat_template.jinja --local-dir /path/to/your/local/copy

Or pass tools-only prompts (tool calling works regardless). Model weights are unchanged.
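The behavioral difference can be sketched with a toy stand-in for the template. This is illustrative only — the real template is Jinja (`chat_template.jinja`), and the role markers below are hypothetical, not the model's actual tokens:

```python
# Toy stand-in for the fixed chat template's behavior.
# Role markers (<|user|>, <|assistant|>) are illustrative, not the real tokens.

def render_assistant_turn(messages, enable_thinking=True):
    """Open <think> at the assistant turn only when thinking is enabled."""
    prompt = ""
    for m in messages:
        prompt += f"<|{m['role']}|>\n{m['content']}\n"
    prompt += "<|assistant|>\n"
    if enable_thinking:  # the old template ignored this flag entirely
        prompt += "<think>\n"
    return prompt

msgs = [{"role": "user", "content": "Capital of France?"}]
assert render_assistant_turn(msgs, enable_thinking=True).endswith("<think>\n")
assert not render_assistant_turn(msgs, enable_thinking=False).endswith("<think>\n")
```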


MiniMax-M2.7 JANGTQ

MiniMax M2.7 228B MoE — 2.15-bit codebook + Hadamard, 56.5 GB

The smallest, highest-quality MiniMax M2.7 on Apple Silicon.

⚠️ Recommended: Run in MLX Studio for the best experience. MLX Studio bundles the JANGTQ runtime, handles thinking mode, and uses the custom Metal kernels this model needs. Stock mlx_lm.load() will NOT load this model — see usage instructions below.

Follow development on Twitter: @jangq_ai


What is JANGTQ?

JANGTQ (JANG TurboQuant) is the most-compressed, highest-quality JANG quantization format. Routed expert weights stay in a compact codebook + Hadamard-rotated form at runtime — no decompression to affine — and the matmul path uses custom Metal kernels that read packed uint32 weights, look up centroids in a 4-entry codebook, and accumulate dot products against a Hadamard-rotated input (QuIP# "rotate-input-once" math).

Result: smaller than affine 2-bit, higher quality, and ~89% of affine 2-bit decode speed on Apple Silicon.

| Metric | JANG_2L (affine) | JANGTQ | Δ |
|---|---|---|---|
| Disk size | ~63 GB | 56.5 GB | −10% |
| GPU memory | ~62.6 GB | 56.5 GB | −10% |
| Avg bits/param | 2.10 | ~2.15 | +0.05 |
| MMLU (200q) | 88% | 91.5% | +3.5 pp |
| Decode speed (M3 Ultra) | 48-50 tok/s | 44.3 tok/s | ~89% of affine |

JANGTQ trades ~10% speed for ~10% disk savings AND a quality improvement. The 2-bit codebook learned via Lloyd-Max is strictly more expressive than uniform 2-bit affine for the Gaussian-ish distribution of Hadamard-rotated weights, so the same bit budget reproduces the original weight matrix more faithfully.
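The Lloyd-Max claim can be illustrated with a toy 1-D experiment: fit a 4-level (2-bit) codebook to Gaussian samples and compare reconstruction error against a naive range-based 4-level affine grid. This is only qualitative — real affine quantizers pick scales per group — but the gap it shows is the one the format exploits:

```python
import random

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(20000)]

# Naive 2-bit affine: 4 evenly spaced levels spanning the observed range.
lo, hi = min(samples), max(samples)
step = (hi - lo) / 3
affine_levels = [lo + i * step for i in range(4)]

def quantize(x, levels):
    return min(levels, key=lambda c: abs(x - c))

def mse(levels):
    return sum((x - quantize(x, levels)) ** 2 for x in samples) / len(samples)

# Lloyd-Max: alternate nearest-level assignment and centroid update.
levels = [-1.5, -0.5, 0.5, 1.5]
for _ in range(30):
    buckets = [[] for _ in levels]
    for x in samples:
        buckets[min(range(4), key=lambda i: abs(x - levels[i]))].append(x)
    levels = [sum(b) / len(b) if b else levels[i] for i, b in enumerate(buckets)]

# The learned codebook fits the Gaussian far better at the same bit budget.
assert mse(levels) < mse(affine_levels)
```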


MMLU Benchmark (200 questions, 10 subjects, reasoning ON)

Overall: 183/200 = 91.5%

Tested 2026-04-13 on Mac Studio M3 Ultra. Reasoning enabled (MiniMax M2.7 is an always-reasoning model); <think>…</think> stripped before scoring.

| Subject | JANGTQ | JANG_2L (affine) | JANG_3L/4M |
|---|---|---|---|
| astronomy | 20/20 (100%) | | |
| high_school_biology | 20/20 (100%) | | |
| abstract_algebra | 19/20 (95%) | | |
| college_computer_science | 19/20 (95%) | | |
| high_school_mathematics | 19/20 (95%) | | |
| college_physics | 18/20 (90%) | | |
| high_school_chemistry | 18/20 (90%) | | |
| anatomy | 17/20 (85%) | | |
| world_religions | 17/20 (85%) | | |
| logical_fallacies | 16/20 (80%) | | |
| Total | 183/200 = 91.5% | ~88% | ~95.5% |

JANGTQ sits cleanly between affine JANG_2L (88%) and the larger JANG_3L/4M (95.5%) — capturing most of the quality of the 3L/4M profiles at ~55-60% of their disk footprint.

Speed Benchmarks (Mac Studio M3 Ultra)

| Prompt / max_tok | Observed tok | tok/s |
|---|---|---|
| "Capital of France?" / 50 | 50 / 50 | 35.6 |
| "Capital of France?" / 150 | 66 / 150 | 37.5 |
| "Count 1-30" / 150 | 150 / 150 | 42.2 |
| "Photosynthesis 5 sent" / 300 | 300 / 300 | 44.5 |
| "Poem + 17×23" / 300 | 296 / 300 | 44.0 |
| MMLU average (200q, reasoning on) | | 41.9 |

Steady-state (300 tokens and longer): ~44.3 tok/s. Short runs report lower tok/s because the fixed prefill cost is amortized over fewer generated tokens.
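A back-of-envelope model makes the amortization effect concrete. The prefill time below is an assumption, not a measurement; only the steady-state decode rate comes from the table:

```python
# Why short generations report lower effective tok/s.
prefill_s = 0.5    # fixed prompt-processing time (assumed, not measured)
decode_tps = 44.3  # steady-state decode rate from the table above

def effective_tps(n_tokens):
    """Overall tok/s including the fixed prefill cost."""
    return n_tokens / (prefill_s + n_tokens / decode_tps)

# Effective throughput rises toward the decode rate as runs get longer.
assert effective_tps(50) < effective_tps(300) < decode_tps
```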


Important Settings

MiniMax M2.7 is an always-reasoning model. The chat template unconditionally opens <think>\n at each assistant turn.

| Setting | Value | Notes |
|---|---|---|
| Temperature | 1.0 | REQUIRED — temp=0 can cause thinking loops |
| Top P | 0.95 | |
| Top K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| max_tokens | ≥ 8192 | Give reasoning room to converge |

Strip <think>…</think> from the response before using the final answer.
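A minimal sketch of these settings as a plain dict, plus the strip step. The keyword names are illustrative — adapt them to whatever your runtime's sampler expects:

```python
# Recommended settings from the table above; key names are illustrative.
SAMPLING = {
    "temperature": 1.0,         # required: temp=0 can cause thinking loops
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.1,  # optional, helps prevent loops
    "max_tokens": 8192,         # give reasoning room to converge
}

def final_answer(raw: str) -> str:
    """Strip the <think>…</think> block before using the answer."""
    return raw.split("</think>")[-1].strip() if "</think>" in raw else raw.strip()

assert final_answer("<think>reasoning steps</think>\nParis.") == "Paris."
```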


Model Details

| Metric | Value |
|---|---|
| Source | MiniMaxAI/MiniMax-M2.7 (FP8 E4M3) |
| Architecture | MoE (256 experts, top-8 active), standard Q/K/V attention, partial RoPE |
| Total parameters | 228.7 B |
| Active per token | ~1.4 B |
| Profile | JANGTQ |
| Format | JANGTQ (codebook + Hadamard); weight_format: mxtq in jang_config.json |
| Avg bits/param | ~2.15 |
| Disk | 56.55 GB |
| GPU active (loaded) | 56.50 GB |
| GPU peak (decoding) | 57-58 GB |
| Load time | ~10 s |
| Context | 192 K tokens |
| Chat template | Always-reasoning (<think>\n opened at assistant start) |

JANGTQ Bit Allocation

| Component | Bits | Format | Why |
|---|---|---|---|
| Routed expert MLP (gate/up/down) — 98% of params | 2 | JANGTQ codebook + Hadamard | Sparsely activated (8 of 256 per token); the learned codebook on Hadamard-rotated rows reproduces the distribution better than uniform 2-bit affine |
| Attention (Q/K/V/O) | 8 | affine (nn.QuantizedLinear, group_size=64) | Runs on every token; quality-critical |
| Shared expert | 8 | affine | Runs on every token |
| Embed tokens / LM head | 8 | affine | Quality-critical input/output projections |
| Router gate | fp16 | unquantized nn.Linear | Routing precision matters; ~0.8M params, negligible size |
| RMSNorms / RoPE / biases | fp16 | unquantized | Already tiny |

The routed experts account for 98% of parameters and are the natural compression target. JANGTQ pushes them to 2-bit with a codebook-learned quantizer and a random Hadamard rotation. Everything else stays at 8-bit affine or fp16, so the quality-critical hot path (attention + embed + shared expert) keeps near-full precision.
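A quick sanity check of the headline "~2.15 bits/param" from this allocation. The parameter shares are approximate, and small per-row overheads (norms, codebook storage) push the effective average slightly above the raw weighted mean:

```python
# Weighted average bits/param implied by the allocation table (approximate).
shares = {
    "routed_experts": (0.98, 2),   # ~98% of params at 2-bit
    "everything_else": (0.02, 8),  # attention, shared expert, embed/head
}
avg_bits = sum(frac * bits for frac, bits in shares.values())

# ~2.12 before overheads, consistent with the stated ~2.15 bits/param.
assert abs(avg_bits - 2.12) < 0.01
```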


Usage

This model requires the jang-tools loader — stock mlx_lm.load() does NOT recognize weight_format: mxtq and will reject the model. The loader applies Metal kernel monkey-patches at load time (fused gate+up+SwiGLU, gather TQ, multi-block Hadamard, router compile, QKV fusion, thread-tiling OPT=10/20).

```shell
pip install jang-tools
# Or from source: git clone https://github.com/JANGQ-AI/jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model_path = snapshot_download("JANGQ-AI/MiniMax-M2.7-JANGTQ")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in 5 sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, tokenizer, prompt, max_tokens=600, verbose=True)

# Strip reasoning to get the final answer
if "</think>" in out:
    out = out.split("</think>")[-1].strip()
print(out)
```

On first load you'll see log lines like:

```text
Loading JANGTQ: MiniMax-M2.7-JANGTQ
  seed=42, bits_map={'attention': 8, ..., 'routed_expert': 2, ...}
  61 shards
  TQ groups: 47616, regular: 1123
  Replaced 186 modules with TurboQuantLinear
  Patched SwitchGLU class for fused gate+up (62 TQ instances)
  P15 mx.compile(router) applied to 1 MoE class(es)
  P18 QKV fusion: 1 class(es), 62 instances
  Done
```

That's all four classes of optimizations (P3/P15/P17/P18) engaging. Expected decode: ~44 tok/s on M3 Ultra, ~35-40 tok/s on M4 Max, ~25-30 tok/s on M4 Pro.

Minimum Hardware

| GPU | Min RAM | Notes |
|---|---|---|
| M3 Ultra / M2 Ultra | 96 GB | Tested on 256 GB, 44 tok/s |
| M4 Max | 96 GB | Expected ~35-40 tok/s |
| M4 Pro | 64 GB | Very tight; expect ~25-30 tok/s |
| M3 Max / M2 Max | 96 GB | Expected ~30-35 tok/s |

56.5 GB of GPU memory is needed just for the weights; add 2-5 GB for KV cache and intermediate activations, plus enough system memory for the OS + other processes.
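The budget arithmetic, using only the figures quoted in this section:

```python
# Rough GPU memory budget from the numbers above.
weights_gb = 56.5
kv_and_activations_gb = (2, 5)  # typical range stated above

lo = weights_gb + kv_and_activations_gb[0]  # 58.5 GB
hi = weights_gb + kv_and_activations_gb[1]  # 61.5 GB

# Consistent with the 60-65 GB budget suggested later, before OS overhead.
assert 58 < lo < hi < 62
```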

Why JANG for MiniMax

Standard MLX uniform quantization on MiniMax produces completely broken output at every bit level — MMLU drops to ~25% (random guessing) because the MoE router becomes unreliable. JANG's mixed-precision approach (attention + router at full precision, routed experts at 2-bit) is the only working quantized MiniMax on Apple Silicon.

JANGTQ takes this one step further by using a learned codebook for the 2-bit expert weights. For MiniMax M2.5, JANG_2L (affine) scored 74% MMLU vs MLX's 25%. For MiniMax M2.7, JANGTQ scores 91.5% — the highest-quality sub-60-GB MiniMax quant on any runtime.


Compression Math

Quantization (offline, per weight matrix):
  w_rot[r, i]  = (H @ (signs ⊙ w^T))[r, i]          # randomized Hadamard rotation
  norms[r]     = ||w_rot[r, :]||₂
  packed[r, i] = argmin_c |w_rot[r, i]/norms[r] − codebook[c]|   # Lloyd-Max 2-bit

Inference (runtime):
  x_rot   = H @ (signs ⊙ x)                         # O(d log d) rotation
  y[b, r] = norms[r] · Σᵢ x_rot[b, i] · codebook[unpack(packed[r, i])]

The Hadamard rotation flattens the heavy tail of the weight distribution, so a 4-entry codebook (2 bits) captures it with minimal error. The normalized Hadamard matrix is symmetric and orthogonal (H @ H = I), so rotating the input once at runtime is mathematically equivalent to rotating every weight row once at quantization time.

Credit: QuIP# for the rotate-input-once insight.
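The rotate-input-once identity is easy to verify numerically: for an orthogonal Q = (1/√n)·H·diag(signs), dot products are preserved, so (Qw)·(Qx) = w·x. A pure-Python sketch with a small Sylvester-constructed Hadamard matrix:

```python
import math
import random

def hadamard(n):
    """Sylvester construction; n must be a power of 2."""
    H = [[1]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-v for v in r] for r in H]
    return H

n = 8
random.seed(1)
signs = [random.choice((-1, 1)) for _ in range(n)]
H = hadamard(n)

def rotate(v):
    """v_rot = (1/sqrt(n)) * H @ (signs ⊙ v) — an orthogonal transform."""
    s = [si * vi for si, vi in zip(signs, v)]
    return [sum(H[r][i] * s[i] for i in range(n)) / math.sqrt(n) for r in range(n)]

w = [random.uniform(-1, 1) for _ in range(n)]
x = [random.uniform(-1, 1) for _ in range(n)]
dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))

# Rotating both sides preserves the dot product exactly (up to float error),
# so rotating x once at runtime matches rotating every weight row offline.
assert abs(dot(rotate(w), rotate(x)) - dot(w, x)) < 1e-9
```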


Known Behaviors / Settings

  • Always-reasoning: chat template opens <think>\n at assistant start. Give it max_tokens ≥ 8192 in benchmarks.
  • Stop token: single EOS (id 200020). mlx_lm reads this correctly from generation_config.json.
  • Temperature 1.0 required: greedy/temp=0 can cause the reasoning to get stuck in a loop. Top-p 0.95 + top-k 40 recommended.
  • GPU RAM: 56.5 GB base + KV cache grows with conversation length. Budget 60-65 GB for typical use, more for very long contexts.

Created by Jinho Jang (eric@jangq.ai) — part of the JANG collection.

Base model: MiniMaxAI/MiniMax-M2.7. Quantization method: JANGTQ (codebook + randomized Hadamard, see math above). License: follows the upstream MiniMax open license.
