Qwen3.5-397B-A17B-TurboQuant-MLX-2bit

2-bit MLX weight-quantized build of Qwen/Qwen3.5-397B-A17B (397B total / 17B active Sparse MoE, multimodal) — re-quantized from the 4-bit TurboQuant MLX checkpoint for maximum compression. Optimized for Apple Silicon via MLX.

This is an experimental extreme-compression variant intended for running a ~400B MoE model on high-end consumer Apple Silicon. Expect noticeable quality degradation vs 4-bit — test on your workload before relying on it.

Quickstart

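from mlx_lm import load, generate

# Fetches the checkpoint from the Hugging Face Hub on first use (~135 GB on disk).
model, tokenizer = load("majentik/Qwen3.5-397B-A17B-TurboQuant-MLX-2bit")

# Wrap the user message in the model's chat template and open the assistant turn.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Give me a one-sentence description of MoE routing."}],
    add_generation_prompt=True,
)
# verbose=True already prints the text as it is generated, so the result is
# kept in a variable instead of being printed a second time.
response = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)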

Model Specs

Property	Value
Base model	Qwen/Qwen3.5-397B-A17B
Architecture	Sparse Mixture-of-Experts (MoE)
Total parameters	397B
Active per token	17B
Modalities	Image + Text → Text (`image-text-to-text`)
Context window	256K tokens
Weight quantization	2-bit MLX (re-quantized from 4-bit TurboQuant)
Approx. disk footprint	~135 GB
License	Apache 2.0
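
As a rough sanity check on the disk footprint, the sketch below works out what 397B parameters would occupy under group-wise 2-bit quantization. The group size of 64 and the 16-bit scale and bias per group are assumptions based on MLX's default affine quantization layout; this card does not state the settings actually used, and `quantized_size_gb` is a hypothetical helper.

```python
def quantized_size_gb(n_params, bits, group_size=64, meta_bits=16):
    """Approximate size in GB (1e9 bytes) of group-wise quantized weights.

    Assumption: every group of `group_size` weights stores one scale and
    one bias, each `meta_bits` wide, next to the packed weights.
    """
    bits_per_weight = bits + 2 * meta_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 397e9       # rounded figure from the specs table
STATED_FOOTPRINT_GB = 135  # approximate figure from the specs table

# 2 bits + 32/64 bits of per-group metadata = 2.5 bits per weight.
uniform_2bit_gb = quantized_size_gb(TOTAL_PARAMS, bits=2)

# Average bits per weight implied by the stated footprint.
implied_bits_per_weight = STATED_FOOTPRINT_GB * 1e9 * 8 / TOTAL_PARAMS

print(f"uniform 2-bit estimate: {uniform_2bit_gb:.1f} GB")
print(f"implied average: {implied_bits_per_weight:.2f} bits/weight")
```

Under these assumptions a uniform 2-bit layout comes to about 124 GB, while the stated ~135 GB implies an average of roughly 2.7 bits per weight. The difference would be consistent with some tensors being kept at higher precision, but that is a guess rather than something this card confirms.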

RotorQuant vs TurboQuant

Aspect	TurboQuant (this repo)	RotorQuant
Rotation	Randomized Hadamard (static)	Learned orthogonal rotors (data-calibrated)
Calibration	Zero-shot	~512 sample calibration pass
Accuracy @ 2-bit	~93–95% of FP16 baseline (task-dependent)	~95–97% of FP16 baseline (task-dependent)
Best for	Squeezing the model into limited unified memory	Squeezing the model in with the best quality
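
To make the "Randomized Hadamard" row concrete, here is a minimal pure-Python sketch of the general technique: flip signs at random, then apply an orthonormal Walsh-Hadamard transform. It is a generic illustration, not the code used to build this checkpoint, and every function name in it is hypothetical. The point is that the rotation is exactly invertible and preserves the vector norm while spreading a single outlier across all coordinates, which tends to make low-bit quantization less lossy.

```python
import math
import random


def fwht(x):
    """Fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so it is orthonormal."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def make_signs(n, seed=0):
    """Fixed random +/-1 diagonal; 'static' here means it is drawn once and reused."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rotate(x, signs):
    """y = H D x, with D the sign diagonal and H the orthonormal Hadamard matrix."""
    return fwht([s * v for s, v in zip(signs, x)])


def unrotate(y, signs):
    """Inverse rotation: H is symmetric and orthonormal, and D is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(y))]


signs = make_signs(64, seed=0)
x = [0.1] * 64
x[3] = 8.0  # one large outlier among small values

y = rotate(x, signs)
x_back = unrotate(y, signs)

# After rotation no entry can exceed (8 + 63 * 0.1) / sqrt(64), about 1.79.
print("max |x| =", max(abs(v) for v in x))
print("max |y| =", max(abs(v) for v in y))
```

Because the rotation needs no data, it matches the "Zero-shot" calibration entry in the table; a learned rotation would replace the fixed `signs` and Hadamard matrix with parameters fitted on calibration samples.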

Memory Estimates (2-bit MLX)

Context	Active memory (approx.)
8K	~143 GB
32K	~153 GB
128K	~183 GB
256K	~213 GB
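
For context lengths between the rows above, a simple piecewise-linear interpolation gives a ballpark figure. The sketch below only restates the table; it assumes "K" means 1024 tokens, and both helper names are hypothetical. The numbers remain approximations and leave out whatever memory the operating system and other applications need.

```python
# (context tokens, approx. active memory in GB), copied from the table above.
# Assumption: "K" is read as 1024 tokens.
MEMORY_TABLE_GB = [
    (8 * 1024, 143.0),
    (32 * 1024, 153.0),
    (128 * 1024, 183.0),
    (256 * 1024, 213.0),
]


def estimate_memory_gb(context_tokens):
    """Interpolate linearly between neighbouring table rows."""
    points = MEMORY_TABLE_GB
    if context_tokens <= points[0][0]:
        # Below the first row, report the first row rather than extrapolate.
        return points[0][1]
    if context_tokens > points[-1][0]:
        raise ValueError("context exceeds the 256K window listed in the specs")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if context_tokens <= x1:
            frac = (context_tokens - x0) / (x1 - x0)
            return y0 + frac * (y1 - y0)


def max_table_context_for(memory_gb):
    """Largest context in the table whose estimate fits, or None if none does."""
    fitting = [ctx for ctx, gb in MEMORY_TABLE_GB if gb <= memory_gb]
    return max(fitting) if fitting else None


print(estimate_memory_gb(64 * 1024))  # a third of the way from the 32K row to the 128K row
print(max_table_context_for(192))     # largest table context under 192 GB
```

On a 192 GB machine the 128K row (~183 GB) leaves only about 9 GB of headroom, so it is better treated as a tight fit than a comfortable one.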

Hardware Requirements

Minimum: Apple Silicon with 192 GB unified memory for short/medium contexts
Recommended: 256 GB+ unified memory for full 256K context
Fits on top-end Mac Studio M-series configurations; does not fit on 96 GB or 128 GB Macs
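
The tiers above can be checked programmatically before starting a ~135 GB download. This is a minimal sketch: `classify_memory` simply encodes the thresholds stated in this section, and `detect_unified_memory_gb` reads `hw.memsize` through the macOS `sysctl` command, so it only works on macOS. Both function names are hypothetical.

```python
import subprocess


def classify_memory(total_gb):
    """Map a unified-memory size in GB onto the tiers listed above."""
    if total_gb >= 256:
        return "recommended: enough for the full 256K context"
    if total_gb >= 192:
        return "minimum: short/medium contexts only"
    return "below the stated minimum: not expected to fit"


def detect_unified_memory_gb():
    """Total physical memory on macOS, via `sysctl -n hw.memsize` (bytes).

    Dividing by 1024**3 makes a machine sold as "192 GB" report 192.
    """
    result = subprocess.run(
        ["sysctl", "-n", "hw.memsize"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip()) / 1024**3


# Example with explicit sizes; on a Mac, pass detect_unified_memory_gb() instead.
for size_gb in (128, 192, 512):
    print(size_gb, "GB ->", classify_memory(size_gb))
```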

Caveats

Re-quantized from the 4-bit TurboQuant MLX checkpoint (not directly from FP16)
Expect visible regressions on multi-step reasoning, code generation, and multilingual tasks vs 4-bit
For production use, prefer the 4-bit or higher variants when your hardware allows
