# Qwen3.5-397B-A17B-RotorQuant-MLX-2bit

A 2-bit MLX weight-quantized build of Qwen/Qwen3.5-397B-A17B (397B total / 17B active parameters, sparse MoE, multimodal), re-quantized from the 4-bit RotorQuant MLX checkpoint for maximum compression. Optimized for Apple Silicon via MLX.

An experimental extreme-compression variant: the learned rotors from RotorQuant's calibration pass preserve weight structure significantly better than static rotations at this bit-width, making this the highest-quality 2-bit build of Qwen3.5-397B-A17B in the Majentik suite.
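The intuition behind rotation-based quantization can be sketched in a few lines: multiply the weights by an orthogonal matrix before quantizing, so outlier channels are spread across many coordinates and the coarse 2-bit grid fits the rotated values much better. The toy below is illustrative only — a random orthogonal matrix stands in for a learned rotor, and `quantize_2bit` is a minimal symmetric quantizer, not RotorQuant's actual scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    # Stand-in for a learned rotor: any orthogonal matrix illustrates the effect.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def quantize_2bit(w):
    # Minimal symmetric 2-bit quantizer: 4 levels at {-1.5, -0.5, 0.5, 1.5} * scale.
    scale = np.abs(w).max() / 1.5
    return (np.clip(np.round(w / scale - 0.5), -2, 1) + 0.5) * scale

# Toy weight row with one outlier channel -- the classic failure mode at 2-bit.
w = rng.standard_normal(64)
w[7] = 12.0

R = random_orthogonal(64)
direct = np.linalg.norm(w - quantize_2bit(w))                 # quantize as-is
rotated = np.linalg.norm(w - quantize_2bit(w @ R) @ R.T)      # rotate, quantize, undo
print(f"direct error {direct:.1f}, rotated error {rotated:.1f}")
```

Quantizing directly, the outlier inflates the scale and crushes every other weight onto a few coarse levels; after rotation the values are closer to Gaussian and the reconstruction error drops sharply.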

## Quickstart

```python
from mlx_lm import load, generate

model, tokenizer = load("majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-2bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Describe what a 2-bit weight means in one sentence."}],
    add_generation_prompt=True,
)
# verbose=True already streams the completion, so no extra print is needed.
response = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
```

## Model Specs

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-397B-A17B |
| Architecture | Sparse Mixture-of-Experts (MoE) |
| Total parameters | 397B |
| Active per token | 17B |
| Modalities | Image + Text → Text (image-text-to-text) |
| Context window | 256K tokens |
| Weight quantization | 2-bit MLX (re-quantized from 4-bit RotorQuant) |
| Approx. disk footprint | ~135 GB |
| License | Apache 2.0 |
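The ~135 GB figure is consistent with a back-of-envelope calculation. The group size and per-group metadata below are assumptions (typical MLX affine-quantization defaults), not values published by this repo:

```python
# Back-of-envelope for the disk footprint. Assumed, not stated by the repo:
# group size 64 with one fp16 scale and one fp16 bias per group of weights.
params = 397e9
overhead_bits = (16 + 16) / 64       # per-group scale + bias, amortized per weight
bits_per_weight = 2 + overhead_bits  # 2.5 effective bits per weight
gib = params * bits_per_weight / 8 / 1024**3
print(f"{gib:.0f} GiB")  # ~116 GiB
```

The remaining gap to ~135 GB is plausibly the tensors typically kept at 16-bit (embeddings, norms, router weights), plus metadata.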

## RotorQuant vs TurboQuant

| Aspect | RotorQuant (this repo) | TurboQuant |
|---|---|---|
| Rotation | Learned orthogonal rotors (data-calibrated) | Randomized Hadamard (static) |
| Calibration | ~512-sample calibration pass | Zero-shot (none) |
| Accuracy @ 2-bit | ~95–97% of FP16 baseline (task-dependent) | ~93–95% of FP16 baseline (task-dependent) |
| Best for | Best quality at this footprint | Calibration-free squeezing into tight memory |
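The "randomized Hadamard" in the TurboQuant column is the standard static trick: flip signs randomly, then apply a fixed Hadamard transform. Because the transform is orthogonal it is exactly invertible and norm-preserving, so it can be undone at inference time for free. A minimal sketch using the Sylvester construction (illustrative, not either tool's implementation):

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # orthonormal rows/columns

rng = np.random.default_rng(0)
n = 8
signs = rng.choice([-1.0, 1.0], size=n)  # the "randomized" part
x = rng.standard_normal(n)

y = (x * signs) @ hadamard(n)            # forward: sign flip, then transform
x_back = (y @ hadamard(n).T) * signs     # inverse: transpose, then same signs
print(np.allclose(x, x_back))  # True
```

No calibration data is involved, which is exactly the trade-off in the table: zero-shot convenience versus the extra accuracy a learned rotor buys.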

## Memory Estimates (2-bit MLX)

| Context | Active memory (approx.) |
|---|---|
| 8K | ~143 GB |
| 32K | ~153 GB |
| 128K | ~183 GB |
| 256K | ~213 GB |
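The table is roughly linear in context length, which is the expected shape: a fixed cost for resident weights plus a per-token KV-cache cost. Fitting the two endpoints (figures inferred from the table above, not published specs):

```python
# Rough linear model of the table: memory ≈ base + kv_per_token * context.
ctx = [8_000, 32_000, 128_000, 256_000]   # tokens (approx.)
mem = [143, 153, 183, 213]                # GB, from the table above
kv_per_tok = (mem[-1] - mem[0]) / (ctx[-1] - ctx[0])
base = mem[0] - kv_per_tok * ctx[0]
print(f"base ≈ {base:.0f} GB, KV ≈ {kv_per_tok * 1000:.2f} GB per 1K tokens")
# base ≈ 141 GB, KV ≈ 0.28 GB per 1K tokens
```

The ~141 GB intercept lines up with the ~135 GB weight footprint plus runtime overhead; the middle rows sit slightly above the line, so treat the fit as a planning estimate only.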

## Hardware Requirements

- **Minimum:** Apple Silicon with 192 GB unified memory for short/medium contexts
- **Recommended:** 256 GB+ unified memory for the full 256K context
- Fits on top-end Mac Studio M-series configurations; does not fit on 96 GB or 128 GB Macs

## Caveats

- Re-quantized from the 4-bit RotorQuant MLX checkpoint, not directly from FP16
- Still the preferred 2-bit option: learned rotors meaningfully outperform Hadamard rotations at extreme bit-widths
- For production use, prefer the 4-bit or higher variants when your hardware allows
