Qwen3.5-27B-RotorQuant-MLX-8bit

MLX 8-bit weight-quantized variant of Qwen/Qwen3.5-27B with RotorQuant KV cache compression for efficient inference on Apple Silicon.

Overview

This model combines two complementary compression techniques:

  • MLX 8-bit weight quantization — reduces model size from ~54GB to ~27GB
  • RotorQuant KV cache compression — compresses key-value caches during inference using Clifford algebra block-diagonal rotations, enabling longer contexts with less VRAM
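For reference, MLX 8-bit weight quantization of this kind is typically produced with mlx-lm's convert tool. The flags below are mlx-lm's standard quantization options; the exact command and output path used for this repo are assumptions, shown only as a sketch:

```shell
# Sketch: quantize the base model to MLX 8-bit weights
# (output path is illustrative, not this repo's actual build command)
python -m mlx_lm.convert \
    --hf-path Qwen/Qwen3.5-27B \
    --mlx-path ./Qwen3.5-27B-MLX-8bit \
    -q --q-bits 8
```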

Quickstart

from mlx_lm import load, generate

model, tokenizer = load("majentik/Qwen3.5-27B-RotorQuant-MLX-8bit")

# Standard generation
prompt = "Explain the theory of relativity"
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)

Specifications

| Property | Value |
| --- | --- |
| Base Model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Weight Quantization | MLX 8-bit affine |
| KV Cache Method | RotorQuant (Clifford algebra block-diagonal rotations) |
| Model Size | ~27 GB |
| Context Length | 262K (native), 1M+ (extended) |
| Platform | Apple Silicon (M1/M2/M3/M4/M5) |

What is RotorQuant?

RotorQuant uses Clifford algebra block-diagonal rotations for KV cache quantization, achieving superior efficiency compared to vector-quantization-based approaches like TurboQuant:

| Metric | RotorQuant | TurboQuant |
| --- | --- | --- |
| Prefill speed | 5.3x faster | baseline |
| Decode speed | 28% faster | baseline |
| Quantizer parameters | 44x fewer | baseline |
| Perplexity | 6.91 | 7.07 |

RotorQuant compresses the KV cache to ~3-bit effective precision while maintaining lower perplexity than TurboQuant's 4-bit approach, thanks to the mathematical efficiency of geometric algebra rotations.
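The rotate-then-quantize idea can be illustrated with a toy NumPy sketch: build a block-diagonal orthogonal matrix from 2x2 plane rotations (the simplest rotors), rotate a vector, quantize it to 3 bits, then undo the rotation on dequantization. This is only an illustration of the general scheme with random angles and a random vector; the actual RotorQuant rotations, quantizer, and parameters are not reproduced here.

```python
import numpy as np

def block_diag_rotation(angles):
    # One 2x2 plane rotation ("rotor") per pair of dimensions,
    # assembled into a block-diagonal orthogonal matrix.
    d = 2 * len(angles)
    R = np.zeros((d, d))
    for i, t in enumerate(angles):
        c, s = np.cos(t), np.sin(t)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

def quantize_uniform(x, bits=3):
    # Symmetric uniform quantizer with 2**(bits-1) - 1 signed levels.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    q = np.clip(np.round(x / scale), -levels, levels)
    return q, scale

rng = np.random.default_rng(0)
kv = rng.normal(size=128)                        # toy key/value vector
R = block_diag_rotation(rng.uniform(0, 2 * np.pi, size=64))

q, scale = quantize_uniform(R @ kv, bits=3)      # rotate, then quantize
kv_hat = R.T @ (q * scale)                       # dequantize, rotate back

rel_err = np.linalg.norm(kv - kv_hat) / np.linalg.norm(kv)
print(f"3-bit relative reconstruction error: {rel_err:.2f}")
```

Because the block-diagonal matrix is orthogonal, the rotation itself is lossless; in rotation-based methods the rotations are chosen to spread outliers across dimensions so the low-bit quantizer loses less information.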

Thinking Mode

Qwen3.5-27B generates extended reasoning before responses by default. The two compression techniques are especially valuable here: long reasoning traces inflate the KV cache, which RotorQuant compresses, while 8-bit weights leave more memory free to hold it.

Memory Estimate

| Configuration | Model Weights | KV Cache (128K ctx) | Total |
| --- | --- | --- | --- |
| FP16 (baseline) | ~54 GB | ~13 GB | ~67 GB |
| MLX 8-bit + RotorQuant | ~27 GB | ~1.3 GB | ~28.3 GB |
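The KV cache row follows from the standard sizing formula: two tensors (K and V) per layer, each of shape (KV heads x head dim x context length). The sketch below uses a hypothetical GQA configuration (48 layers, 4 KV heads, head dim 128) chosen only to land near the ~13 GB figure; it is not the published Qwen3.5-27B architecture.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bits):
    # K and V caches: 2 tensors x layers x KV heads x head_dim x context
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len
    return elems * bits // 8

# Hypothetical GQA configuration, illustrative only
fp16 = kv_cache_bytes(48, 4, 128, 128 * 1024, bits=16)
print(f"FP16 KV cache at 128K context: {fp16 / 1e9:.1f} GB")  # ~12.9 GB
```

Swapping `bits=16` for an effective ~3-bit representation shows why compressed caches free up most of that memory for longer contexts.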
