🔧 2026-04-14 · chat_template.jinja fix — re-download if cached
Earlier versions of this repo shipped a chat template that unconditionally forced `<think>` reasoning mode, ignoring `enable_thinking=False`. Synced with JANG_2L: the new template respects the flag, so callers can now skip reasoning for fast direct answers.

If you downloaded this model before 2026-04-14, please re-download `chat_template.jinja`:

```
huggingface-cli download JANGQ-AI/MiniMax-M2.7-JANGTQ chat_template.jinja --local-dir /path/to/your/local/copy
```

Or pass tools-only prompts (tool calling works regardless). Model weights are unchanged.
MiniMax-M2.7 JANGTQ
MiniMax M2.7 228B MoE — 2.15-bit codebook + Hadamard, 56.5 GB
The smallest, highest-quality MiniMax M2.7 on Apple Silicon.
⚠️ Recommended: Run in MLX Studio for the best experience. MLX Studio bundles the JANGTQ runtime, handles thinking mode, and uses the custom Metal kernels this model needs. Stock `mlx_lm.load()` will NOT load this model — see usage instructions below.
Follow development on Twitter: @jangq_ai
What is JANGTQ?
JANGTQ (JANG TurboQuant) is the most-compressed, highest-quality JANG quantization format. Routed expert weights stay in a compact codebook + Hadamard-rotated form at runtime — no decompression to affine — and the matmul path uses custom Metal kernels that read packed uint32 weights, look up centroids in a 4-entry codebook, and accumulate dot products against a Hadamard-rotated input (QuIP# "rotate-input-once" math).
Result: smaller than affine 2-bit, higher quality, and ~89% of its decode speed on Apple Silicon.
| Metric | JANG_2L (affine) | JANGTQ | Δ |
|---|---|---|---|
| Disk size | ~63 GB | 56.5 GB | −10% |
| GPU memory | ~62.6 GB | 56.5 GB | −10% |
| Avg bits/param | 2.10 | ~2.15 | +0.05 |
| MMLU (200q) | 88% | 91.5% | +3.5 pp |
| Decode speed (M3 Ultra) | 48-50 tok/s | 44.3 tok/s | ~89% of affine |
JANGTQ trades ~10% speed for ~10% disk savings AND a quality improvement. The 2-bit codebook learned via Lloyd-Max is strictly more expressive than uniform 2-bit affine for the Gaussian-ish distribution of Hadamard-rotated weights, so the same bit budget reproduces the original weight matrix more faithfully.
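The codebook-beats-affine claim can be sanity-checked numerically. Below is a minimal NumPy sketch (illustrative only, not the actual JANGTQ quantizer): it compares a uniform 4-level affine grid against a Lloyd-Max-style codebook (1-D k-means) on Gaussian data, the distribution Hadamard-rotated weights approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100_000)  # stand-in for Hadamard-rotated weight entries

# Uniform 2-bit affine: 4 evenly spaced levels spanning the data range
lo, hi = w.min(), w.max()
affine_levels = lo + (hi - lo) * (np.arange(4) + 0.5) / 4
affine_err = ((w - affine_levels[np.abs(w[:, None] - affine_levels).argmin(1)]) ** 2).mean()

# Lloyd-Max: a few k-means iterations learn a 4-entry codebook
levels = np.quantile(w, [0.125, 0.375, 0.625, 0.875])
for _ in range(20):
    assign = np.abs(w[:, None] - levels).argmin(1)
    levels = np.array([w[assign == c].mean() for c in range(4)])
lloyd_err = ((w - levels[assign]) ** 2).mean()

print(f"affine MSE {affine_err:.4f}  vs  Lloyd-Max MSE {lloyd_err:.4f}")
```

On Gaussian data the learned levels cluster near the mass of the distribution instead of spreading evenly across the (outlier-driven) range, so the same 2-bit budget yields lower reconstruction error.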
MMLU Benchmark (200 questions, 10 subjects, reasoning ON)
Overall: 183/200 = 91.5%
Tested 2026-04-13 on Mac Studio M3 Ultra. Reasoning enabled (MiniMax M2.7 is
an always-reasoning model); <think>…</think> stripped before scoring.
| Subject | JANGTQ | JANG_2L (affine) | JANG_3L/4M |
|---|---|---|---|
| astronomy | 20/20 (100%) | — | — |
| high_school_biology | 20/20 (100%) | — | — |
| abstract_algebra | 19/20 (95%) | — | — |
| college_computer_science | 19/20 (95%) | — | — |
| high_school_mathematics | 19/20 (95%) | — | — |
| college_physics | 18/20 (90%) | — | — |
| high_school_chemistry | 18/20 (90%) | — | — |
| anatomy | 17/20 (85%) | — | — |
| world_religions | 17/20 (85%) | — | — |
| logical_fallacies | 16/20 (80%) | — | — |
| Total | 183/200 = 91.5% | ~88% | ~95.5% |
JANGTQ sits cleanly between affine JANG_2L (88%) and the larger JANG_3L/4M (95.5%) — capturing most of the quality of the 3L/4M profiles at ~55-60% of their disk footprint.
Speed Benchmarks (Mac Studio M3 Ultra)
| Prompt / max_tok | observed tok | tok/s |
|---|---|---|
| "Capital of France?" / 50 | 50 / 50 | 35.6 |
| "Capital of France?" / 150 | 66 / 150 | 37.5 |
| "Count 1-30" / 150 | 150 / 150 | 42.2 |
| "Photosynthesis 5 sent" / 300 | 300 / 300 | 44.5 |
| "Poem + 17×23" / 300 | 296 / 300 | 44.0 |
| MMLU average (200q, reasoning on) | — | 41.9 |
Steady-state (300 tokens and longer): ~44.3 tok/s. Short prompts appear slower because the fixed prefill cost is amortized over fewer decoded tokens.
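The short-prompt numbers are consistent with a fixed prefill cost amortized over few decoded tokens. A toy model of that effect (the decode rate and prefill latency below are illustrative assumptions, not measurements):

```python
def observed_tps(gen_tokens: int, decode_tps: float = 44.3, prefill_s: float = 0.4) -> float:
    """Effective tokens/s when a fixed prefill cost precedes steady-state decode."""
    return gen_tokens / (prefill_s + gen_tokens / decode_tps)

# Short generations pay the same prefill over fewer tokens, so they look slower
for n in (50, 150, 300):
    print(n, round(observed_tps(n), 1))
```

As `gen_tokens` grows, `observed_tps` approaches the steady-state decode rate from below, matching the shape of the table above.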
Important Settings
MiniMax M2.7 is an always-reasoning model. The chat template
unconditionally opens <think>\n at each assistant turn.
| Setting | Value | Notes |
|---|---|---|
| Temperature | 1.0 | REQUIRED — temp=0 can cause thinking loops |
| Top P | 0.95 | |
| Top K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| max_tokens | ≥ 8192 | Give reasoning room to converge |
Strip <think>…</think> from the response before using the final answer.
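A minimal helper for that stripping step (a regex sketch; adjust if your runtime already separates reasoning from the final answer):

```python
import re

def strip_reasoning(text: str) -> str:
    """Drop everything up to and including the closing </think> tag, if present."""
    return re.sub(r"^.*?</think>", "", text, flags=re.DOTALL).strip()
```

If the model hit `max_tokens` mid-reasoning there is no `</think>` to find, which is another reason to budget `max_tokens ≥ 8192`.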
Model Details
| Metric | Value |
|---|---|
| Source | MiniMaxAI/MiniMax-M2.7 (FP8 E4M3) |
| Architecture | MoE (256 experts, top-8 active), standard Q/K/V attention, partial RoPE |
| Total parameters | 228.7 B |
| Active per token | ~1.4 B |
| Profile | JANGTQ |
| Format | JANGTQ (codebook+Hadamard) — weight_format: mxtq in jang_config.json |
| Avg bits/param | ~2.15 |
| Disk | 56.55 GB |
| GPU active (loaded) | 56.50 GB |
| GPU peak (decoding) | 57-58 GB |
| Load time | ~10 s |
| Context | 192 K tokens |
| Chat template | Always-reasoning (<think>\n opened at assistant start) |
JANGTQ Bit Allocation
| Component | Bits | Format | Why |
|---|---|---|---|
| Routed expert MLP (gate/up/down) — 98% of params | 2 | JANGTQ codebook + Hadamard | Sparsely activated (8 of 256 per token); the learned codebook on Hadamard-rotated rows reproduces the distribution better than uniform 2-bit affine |
| Attention (Q/K/V/O) | 8 | affine (nn.QuantizedLinear, group_size=64) | Runs on every token; quality-critical |
| Shared expert | 8 | affine | Runs on every token |
| Embed tokens / LM head | 8 | affine | Quality-critical input/output projections |
| Router gate | fp16 | unquantized nn.Linear | Routing precision matters; ~0.8M params, negligible size |
| RMSNorms / RoPE / biases | fp16 | unquantized | Already tiny |
The routed experts account for ~98% of parameters and are the natural compression target. JANGTQ pushes them to 2-bit with a codebook-learned quantizer and a random Hadamard rotation. Everything else stays at 8-bit affine so the quality-critical hot path (attention + embed + shared expert) runs at high precision.
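A quick back-of-the-envelope check on the average bits/param (the 98% expert fraction comes from the allocation table above; the metadata overhead is an assumption):

```python
# Back-of-envelope: ~98% of params (routed experts) at 2-bit,
# the remaining ~2% (attention, embeddings, shared expert) at 8-bit.
expert_frac = 0.98          # fraction from the allocation table above
core_bits = expert_frac * 2 + (1 - expert_frac) * 8
print(core_bits)            # ~2.12; per-row norms + codebook metadata push it to ~2.15
```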
Usage
This model requires the jang-tools loader — stock mlx_lm.load() does
NOT recognize weight_format: mxtq and will reject the model. The loader
applies Metal kernel monkey-patches at load time (fused gate+up+SwiGLU, gather
TQ, multi-block Hadamard, router compile, QKV fusion, thread-tiling OPT=10/20).
```
pip install jang-tools
# Or from source: git clone https://github.com/JANGQ-AI/jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model_path = snapshot_download("JANGQ-AI/MiniMax-M2.7-JANGTQ")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in 5 sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, tokenizer, prompt, max_tokens=600, verbose=True)

# Strip reasoning to get the final answer
if "</think>" in out:
    out = out.split("</think>")[-1].strip()
print(out)
```
On first load you'll see log lines like:
```
Loading JANGTQ: MiniMax-M2.7-JANGTQ
seed=42, bits_map={'attention': 8, ..., 'routed_expert': 2, ...}
61 shards
TQ groups: 47616, regular: 1123
Replaced 186 modules with TurboQuantLinear
Patched SwitchGLU class for fused gate+up (62 TQ instances)
P15 mx.compile(router) applied to 1 MoE class(es)
P18 QKV fusion: 1 class(es), 62 instances
Done
```
That's all four classes of optimizations (P3/P15/P17/P18) engaging. Expected decode: ~44 tok/s on M3 Ultra, ~35-40 tok/s on M4 Max, ~25-30 tok/s on M4 Pro.
Minimum Hardware
| GPU | Min RAM | Notes |
|---|---|---|
| M3 Ultra / M2 Ultra | 96 GB | Tested on 256 GB, 44 tok/s |
| M4 Max | 96 GB | Expected ~35-40 tok/s |
| M4 Pro | 64 GB | Very tight; expect ~25-30 tok/s |
| M3 Max / M2 Max | 96 GB | Expected ~30-35 tok/s |
56.5 GB of GPU memory is needed just for the weights; add 2-5 GB for KV cache and intermediate activations, plus enough system memory for the OS + other processes.
Why JANG for MiniMax
Standard MLX uniform quantization on MiniMax produces completely broken output at every bit level — MMLU drops to ~25% (random guessing) because the MoE router becomes unreliable. JANG's mixed-precision approach (attention + router at full precision, routed experts at 2-bit) is the only working quantized MiniMax on Apple Silicon.
JANGTQ takes this one step further by using a learned codebook for the 2-bit expert weights. For MiniMax M2.5, JANG_2L (affine) scored 74% MMLU vs MLX's 25%. For MiniMax M2.7, JANGTQ scores 91.5% — the highest-quality sub-60-GB MiniMax quant on any runtime.
Compression Math
Quantization (offline, per weight matrix):

```
w_rot[r, :]  = H @ (signs * w[r, :])                            # randomized Hadamard rotation
norms[r]     = ‖w_rot[r, :]‖₂
packed[r, i] = argmin_c |w_rot[r, i] / norms[r] − codebook[c]|  # Lloyd-Max 2-bit
```

Inference (runtime):

```
x_rot   = H @ (signs * x)                                       # O(d log d) rotation
y[b, r] = norms[r] · Σᵢ x_rot[b, i] · codebook[unpack(packed[r, i])]
```
The Hadamard rotation flattens the heavy tail of the weight distribution, so a 4-entry codebook (2-bit) captures it with minimal error. The normalized Hadamard matrix is symmetric and orthogonal (H @ H = I), so rotating the input once at runtime is mathematically equivalent to rotating every weight row once at quantization time.
Credit: QuIP# for the rotate-input-once insight.
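The rotate-input-once identity is easy to verify numerically. Below is a NumPy sketch using a Sylvester-construction Hadamard matrix; the 4-entry codebook is illustrative, not the one shipped with the model:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester construction; n a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d, rows = 64, 16
W = rng.normal(size=(rows, d))            # one weight matrix
x = rng.normal(size=d)                    # one input vector
signs = rng.choice([-1.0, 1.0], size=d)   # random sign flips
H = hadamard(d)

# Offline: rotate every weight row once.  Runtime: rotate the input once.
W_rot = (H @ (signs * W).T).T             # row r = H @ (signs * W[r, :])
x_rot = H @ (signs * x)

# (S·H)(H·S) = I because H is symmetric and orthonormal, so the matmul is exact
print(np.allclose(W @ x, W_rot @ x_rot))  # True

# 2-bit step with per-row norms and an illustrative 4-entry codebook
norms = np.linalg.norm(W_rot, axis=1, keepdims=True)
codebook = np.array([-1.5, -0.45, 0.45, 1.5]) / np.sqrt(d)
codes = np.abs((W_rot / norms)[..., None] - codebook).argmin(-1)  # 2-bit indices
W_hat = norms * codebook[codes]           # dequantized approximation
rel_err = np.linalg.norm(W_hat @ x_rot - W @ x) / np.linalg.norm(W @ x)
print(f"2-bit relative error: {rel_err:.3f}")
```

The first check is exact (rotation alone loses nothing); the residual error comes entirely from the 2-bit codebook step, which the rotation makes small by Gaussianizing the weight entries.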
Known Behaviors / Settings
- Always-reasoning: the chat template opens `<think>\n` at assistant start. Give it `max_tokens ≥ 8192` in benchmarks.
- Stop token: single EOS `[e~[` = id 200020. `mlx_lm` reads this correctly from `generation_config.json`.
- Temperature 1.0 required: greedy/temp=0 can cause the reasoning to get stuck in a loop. Top-p 0.95 + top-k 40 recommended.
- GPU RAM: 56.5 GB base + KV cache grows with conversation length. Budget 60-65 GB for typical use, more for very long contexts.
Created by Jinho Jang (eric@jangq.ai) — part of the JANG collection.
Base model: MiniMaxAI/MiniMax-M2.7. Quantization method: JANGTQ (codebook + randomized Hadamard, see math above). License: follows the upstream MiniMax open license.