# Qwen3.5-27B-RotorQuant-MLX-4bit
MLX 4-bit weight-quantized variant of Qwen/Qwen3.5-27B with RotorQuant KV cache compression for efficient inference on Apple Silicon.
## Overview
This model combines two complementary compression techniques:
- MLX 4-bit weight quantization (affine, group size 64) — reduces model size from ~54 GB to ~15 GB
- RotorQuant KV cache compression — compresses key-value caches during inference using Clifford algebra block-diagonal rotations, enabling longer contexts with less VRAM
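Affine group quantization stores a scale and minimum (zero point) per group of 64 weights and rounds each weight to a 4-bit integer. This is not MLX's actual kernel, just a minimal NumPy sketch of the scheme:

```python
import numpy as np

def quantize_affine_4bit(w, group_size=64):
    """Per-group affine quantization to 4-bit codes (0..15)."""
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    # 4 bits -> 16 levels; epsilon guards constant groups
    scale = (groups.max(axis=1, keepdims=True) - w_min) / 15.0 + 1e-8
    q = np.clip(np.round((groups - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize_affine_4bit(q, scale, w_min):
    """Reconstruct approximate weights from codes and group metadata."""
    return q.astype(np.float32) * scale + w_min

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale, zp = quantize_affine_4bit(w)
w_hat = dequantize_affine_4bit(q, scale, zp).reshape(-1)
print(f"max reconstruction error: {np.abs(w - w_hat).max():.4f}")
```

With a 16-bit scale and zero point per 64-weight group, storage works out to about 4.5 bits per weight, consistent with the roughly 3.6× size reduction quoted above (~54 GB → ~15 GB).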
## Quickstart

```python
from mlx_lm import load, generate

model, tokenizer = load("majentik/Qwen3.5-27B-RotorQuant-MLX-4bit")

# Standard generation
prompt = "Explain the theory of relativity"
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)
```
## Specifications
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Weight Quantization | MLX 4-bit affine (group size 64) |
| KV Cache Method | RotorQuant (Clifford algebra block-diagonal rotations) |
| Model Size | ~15 GB |
| Context Length | 262K (native), 1M+ (extended) |
| Platform | Apple Silicon (M1/M2/M3/M4/M5) |
## What is RotorQuant?

RotorQuant applies block-diagonal rotations derived from Clifford algebra to the KV cache before quantization. Compared to vector-quantization-based approaches such as TurboQuant, it reports:
| Metric | RotorQuant | TurboQuant |
|---|---|---|
| Prefill speed | 5.3x faster | baseline |
| Decode speed | 28% faster | baseline |
| Quantizer parameters | 44x fewer | baseline |
| Perplexity | 6.91 | 7.07 |
RotorQuant compresses the KV cache to ~3-bit effective precision while achieving lower perplexity than TurboQuant's 4-bit approach, because the rotations are cheap orthogonal transforms that condition the cache for quantization.
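RotorQuant's exact algorithm is not reproduced here; the following is only a toy sketch of the general rotate-then-quantize idea it builds on. An orthogonal block-diagonal matrix assembled from 2×2 rotors transforms each KV vector, quantization happens in the rotated basis (a symmetric 3-bit grid here, as an assumption), and dequantization inverts the rotation. All function names and parameters are illustrative:

```python
import numpy as np

def make_block_rotation(angles):
    """Orthogonal block-diagonal rotation: one 2x2 rotor per dimension pair."""
    d = 2 * len(angles)
    R = np.zeros((d, d), dtype=np.float32)
    for i, t in enumerate(angles):
        c, s = np.cos(t), np.sin(t)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

def quantize_3bit(x):
    """Symmetric quantization to a small integer grid (-3..3), per vector."""
    scale = np.abs(x).max() / 3.0 + 1e-8
    q = np.clip(np.round(x / scale), -3, 3).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
head_dim = 8
kv = rng.standard_normal(head_dim).astype(np.float32)

# rotate -> quantize -> dequantize -> rotate back
R = make_block_rotation(rng.uniform(0, 2 * np.pi, size=head_dim // 2))
q, scale = quantize_3bit(R @ kv)
recon = R.T @ (q.astype(np.float32) * scale)
print(f"reconstruction error (L2): {np.linalg.norm(kv - recon):.4f}")
```

Because the rotation is orthogonal, quantization error introduced in the rotated basis is not amplified when rotating back; the block-diagonal structure is what keeps the transform cheap relative to a dense rotation.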
## Thinking Mode

Qwen3.5-27B generates extended reasoning before responses by default. The combination of compression techniques is especially valuable here: long reasoning traces inflate the KV cache, which RotorQuant compresses directly, while 4-bit weights free up the memory headroom those traces need.
## Memory Estimate
| Configuration | Model Weights | KV Cache (128K ctx) | Total |
|---|---|---|---|
| FP16 (baseline) | ~54 GB | ~13 GB | ~67 GB |
| MLX 4-bit + RotorQuant | ~15 GB | ~1.3 GB | ~16.3 GB |
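The FP16 baseline in the table can be sanity-checked with the standard KV cache formula. The architecture numbers below (layer count, KV heads, head dimension) are illustrative assumptions chosen to match the table, not published specs for this model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, bytes_per_elem):
    # One K and one V entry per layer, per KV head, per position
    return layers * 2 * kv_heads * head_dim * ctx_len * bytes_per_elem

# Illustrative assumptions (not published specs): 48 layers, 4 KV heads (GQA), head_dim 128
layers, kv_heads, head_dim = 48, 4, 128
ctx = 128 * 1024

fp16_gb = kv_cache_bytes(layers, kv_heads, head_dim, ctx, 2) / 1e9
print(f"FP16 KV cache @ 128K context: ~{fp16_gb:.1f} GB")
```

Note that a ~3-bit effective precision would put the compressed cache near 3/16 of the FP16 figure; the table's ~1.3 GB implies a lower effective bit-width once quantizer metadata is amortized, so treat both columns as estimates.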