CRITICAL FIX (2026-03-19): Fixed chat template for thinking toggle. Re-download if you experience infinite loops.
Update (2026-03-18): Models have been updated to v2.1.0 with proper tokenizer and fixed configs. If you downloaded before this date, please re-download for full MLX Studio compatibility.
MLX Studio — the only app that natively supports JANG models.
Early adoption: LM Studio, Ollama, oMLX, and Inferencer do not support JANG yet. Use MLX Studio or `pip install "jang[mlx]"`, and ask your favorite app's creators to add JANG support!
# MiniMax-M2.5 — JANG_3M (3.07-bit, 8-bit attention)
JANG (Jang Adaptive N-bit Grading): mixed-precision quantization for Apple Silicon
JANG is fully open-source. Quantization engine, research, and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.
## Results (200-question MMLU)
| Model | MMLU | Size | Speed |
|---|---|---|---|
| JANG_3M (3.07-bit) | 74.5% | 82 GB | 48.3 tok/s |
| JANG_2L (2.10-bit) | 74.0% | 63 GB | 50.9 tok/s |
| MLX 4-bit | 26.5% | 91 GB | ~50 tok/s |
| MLX 3-bit | 24.5% | — | — |
| MLX 2-bit | 25.0% | — | — |
JANG is currently the only working quantization for MiniMax: MLX uniform quantization is broken at every bit width (~25%, i.e. random chance on four-option MMLU).
## Per-Subject Scores (200 questions, temp=0.0, thinking ON)
| Subject | Score |
|---|---|
| Abstract Algebra | 10/20 |
| Anatomy | 15/20 |
| Astronomy | 18/20 |
| College CS | 10/20 |
| College Physics | 17/20 |
| HS Biology | 18/20 |
| HS Chemistry | 16/20 |
| HS Mathematics | 12/20 |
| Logical Fallacies | 16/20 |
| World Religions | 17/20 |
| Total | 149/200 = 74.5% |
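As a quick sanity check, the per-subject scores above sum to the reported total. A trivial Python verification:

```python
# Per-subject correct counts out of 20, copied from the table above.
scores = {
    "Abstract Algebra": 10, "Anatomy": 15, "Astronomy": 18,
    "College CS": 10, "College Physics": 17, "HS Biology": 18,
    "HS Chemistry": 16, "HS Mathematics": 12,
    "Logical Fallacies": 16, "World Religions": 17,
}
total = sum(scores.values())
accuracy = total / (20 * len(scores))
print(f"{total}/200 = {accuracy:.1%}")  # 149/200 = 74.5%
```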
## Specs
| Metric | Value |
|---|---|
| Source | MiniMax-M2.5 |
| Architecture | MoE (256 experts, 8 active), standard attention, 62 layers |
| Profile | JANG_3M (CRITICAL=8, IMPORTANT=3, COMPRESS=3) |
| Average bits | 3.07 |
| GPU Memory | 88.5 GB |
| Disk Size | 82 GB |
| Speed | 48.3 tok/s (M4 Ultra 256 GB) |
| group_size | 128 (experts) / 64 (router) |
| Temperature | 1.0 required (greedy decoding loops unless thinking is disabled) |
| Format | v2 (MLX-native, instant load) |
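The 3.07-bit average in the profile is a parameter-weighted mean of per-tensor bit widths (CRITICAL tensors kept at 8-bit, the rest at 3-bit). The exact split is internal to the JANG profile; the fractions below are hypothetical, chosen only to illustrate the arithmetic:

```python
def average_bits(groups):
    """Parameter-weighted mean bit width over quantization groups.

    groups: list of (param_share, bits) pairs; shares need not sum to 1.
    """
    total_share = sum(share for share, _ in groups)
    return sum(share * bits for share, bits in groups) / total_share

# Hypothetical split: ~1.4% of params (e.g. attention/router) at 8-bit,
# the remaining ~98.6% (expert MLPs) at 3-bit -> 3.07-bit average.
print(round(average_bits([(1.4, 8), (98.6, 3)]), 2))  # 3.07
```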
## Important Notes
- Temperature must be 1.0: greedy decoding (temp=0) causes infinite thinking loops
- top_p=0.95, top_k=40 recommended
- Thinking can be toggled via `enable_thinking=True/False` in the chat template
- `enable_thinking=False` + temp=0.0 works for MMLU benchmarks
Note: MiniMax-M2.5 is a text-only model (no vision encoder). Use `load_jang_model()` for inference.
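The thinking toggle can be wired into prompt construction. This is a minimal sketch assuming the tokenizer's chat template forwards an `enable_thinking` keyword, as the notes above state; the `build_prompt` helper name is ours, not part of the jang API:

```python
def build_prompt(tokenizer, question, thinking=True):
    """Render the chat template, toggling the model's reasoning mode.

    `enable_thinking` is forwarded to the Jinja chat template; per the
    notes above, thinking=False (with temp=0.0) suits MMLU-style runs.
    """
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,
    )
```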
## Install

```shell
pip install "jang[mlx]"
```
## Quick Start

```python
from jang_tools.loader import load_jang_model
from mlx_lm.sample_utils import make_sampler
from mlx_lm.generate import generate_step
import mlx.core as mx

model, tokenizer = load_jang_model("JANGQ-AI/MiniMax-M2.5-JANG_3M")
sampler = make_sampler(temp=1.0, top_p=0.95)  # temp=1.0 required (see notes)

tokens = tokenizer.encode("What is photosynthesis?")
for tok, _ in generate_step(prompt=mx.array(tokens), model=model, max_tokens=200, sampler=sampler):
    t = tok.item() if hasattr(tok, "item") else int(tok)
    print(tokenizer.decode([t]), end="", flush=True)
    if t == tokenizer.eos_token_id:
        break
```
## Links
- GitHub | HuggingFace | MLX Studio | PyPI | Format Spec
## Korean Summary (한국어)

MiniMax-M2.5 — JANG_3M

JANG is the only working quantization format for MiniMax. MLX uniform quantization is broken at all bit levels (~25% = random chance).

| Model | MMLU | Size | Speed |
|---|---|---|---|
| JANG_3M | 74.5% | 82 GB | 48.3 tok/s |
| JANG_2L | 74% | 63 GB | 50.9 tok/s |

JANG protects the MoE router at 8-bit while compressing the expert MLPs to 3-bit.

GitHub · HuggingFace · MLX Studio · PyPI

Created by Jinho Jang · jangq.ai · @dealignai
## Model tree for JANGQ-AI/MiniMax-M2.5-JANG_3M
Base model: MiniMaxAI/MiniMax-M2.5
