---
library_name: mlx
license: apache-2.0
base_model: Qwen/Qwen3.5-27B
tags:
  - mlx
  - mlx-lm
  - rotorquant
  - 2bit
  - quantized
  - kv-cache
  - qwen3.5
  - thinking
language:
  - en
pipeline_tag: text-generation
---

# Qwen3.5-27B-RotorQuant-MLX-2bit

MLX 2-bit weight quantization + RotorQuant 2-bit KV cache compression for Qwen/Qwen3.5-27B.

Dual compression for Apple Silicon: both the model weights and the KV cache are quantized to 2-bit, enabling long-context inference on memory-constrained Macs.

## Overview

Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.

This variant applies two layers of compression:

  1. **MLX 2-bit weight quantization** – reduces the 27B model from 54 GB (BF16) to approximately **8 GB**, making it loadable on Apple Silicon devices with limited unified memory.
  2. **RotorQuant 2-bit KV cache** – rotation-based isotropic quantization compresses the key-value cache with better quality and speed than standard approaches.
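The ~8 GB figure can be sanity-checked with back-of-envelope math, assuming MLX-style group quantization. The group size of 64 and the fp16 scale/bias per group below are assumptions for illustration, not this repo's confirmed settings:

```python
# Back-of-envelope size of 2-bit group-quantized weights.
# Group size and per-group fp16 scale/bias are assumptions.

params = 27e9       # 27B parameters
bits = 2            # 2-bit weights
group_size = 64     # hypothetical quantization group size
# each group adds one fp16 scale and one fp16 bias -> 32 extra bits per group
overhead_bits = 2 * 16 / group_size

gb = params * (bits + overhead_bits) / 8 / 1e9
print(f"~{gb:.1f} GB")  # ~8.4 GB, consistent with the ~8 GB estimate
```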

## RotorQuant Advantages

| Metric | RotorQuant 2-bit | Standard 2-bit |
|---|---|---|
| Prefill speed | 5.3x faster | Baseline |
| Decode speed | 28% faster | Baseline |
| Perplexity | 6.91 | 7.07 |

RotorQuant achieves lower perplexity (better quality) while also being faster, making it the preferred 2-bit KV cache method when quality matters.
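RotorQuant's internals aren't documented here, but the general rotate-then-quantize idea behind rotation-based KV compression can be sketched: a random orthogonal rotation spreads outlier channels across all dimensions, so uniform low-bit quantization wastes fewer levels on a single extreme channel. Everything below is an illustrative sketch with assumed dimensions, not RotorQuant's actual implementation:

```python
import numpy as np

# Illustrative rotate-then-quantize sketch; not RotorQuant's real code.
rng = np.random.default_rng(0)

def random_rotation(d):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def quantize_2bit(x):
    # Per-row uniform 2-bit quantization: snap to 4 levels spanning [min, max]
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 3
    codes = np.clip(np.round((x - lo) / scale), 0, 3)
    return codes * scale + lo  # dequantized reconstruction

d = 128
R = random_rotation(d)
kv = rng.standard_normal((16, d))
kv[:, 0] *= 10  # inject an outlier channel, common in real KV activations

direct = quantize_2bit(kv)             # quantize in the original basis
rotated = quantize_2bit(kv @ R) @ R.T  # rotate, quantize, rotate back

err_direct = float(np.mean((kv - direct) ** 2))
err_rotated = float(np.mean((kv - rotated) ** 2))
print(err_direct, err_rotated)
```

With the injected outlier, the rotated path reconstructs with markedly lower mean-squared error, which is the intuition behind the quality gap in the table above.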

## Specifications

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Architecture | Hybrid Transformer |
| Native context | 262,144 tokens |
| Thinking mode | Yes |
| Weight quantization | MLX 2-bit |
| KV cache method | RotorQuant 2-bit (IsoQuant) |
| KV cache compression | ~10x vs FP16 |
| Runtime | MLX (Apple Silicon) |

## Memory Estimates

| Component | Estimate |
|---|---|
| Model weights (MLX 2-bit) | ~8 GB |
| KV cache at 128K context (2-bit RotorQuant) | ~1.3 GB |
| Total at 128K context | ~9.3 GB |
| Comparison: BF16 weights + FP16 KV at 128K | ~66.8 GB |
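These estimates can be roughly reproduced from plausible model dimensions. The layer, KV-head, and head-dim values below are assumptions chosen to match the stated FP16 KV footprint, not published specs:

```python
# Rough reproduction of the memory estimates; dims are hypothetical.
layers, kv_heads, head_dim = 48, 4, 128   # assumed model dimensions
seq = 131_072                             # 128K-token context
bytes_fp16 = 2

# FP16 KV cache: 2 tensors (K and V) per layer, per token
kv_fp16_gb = 2 * layers * kv_heads * head_dim * seq * bytes_fp16 / 1e9
kv_2bit_gb = kv_fp16_gb / 10              # ~10x compression, per the specs

weights_bf16_gb = 54.0                    # BF16 size from the Overview
weights_2bit_gb = 8.0

print(f"FP16 KV @ 128K:     ~{kv_fp16_gb:.1f} GB")                    # ~12.9 GB
print(f"2-bit KV @ 128K:    ~{kv_2bit_gb:.1f} GB")                    # ~1.3 GB
print(f"Compressed total:   ~{weights_2bit_gb + kv_2bit_gb:.1f} GB")  # ~9.3 GB
print(f"Uncompressed total: ~{weights_bf16_gb + kv_fp16_gb:.1f} GB")  # ~66.9 GB
```

The uncompressed total lands within rounding distance of the table's ~66.8 GB.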

## Quickstart

```python
from mlx_lm import load, generate
from turboquant import IsoQuantCache

model_id = "majentik/Qwen3.5-27B-RotorQuant-MLX-2bit"

model, tokenizer = load(model_id)

# Apply 2-bit RotorQuant KV cache compression
cache = IsoQuantCache(bits=2)

prompt = "Explain the Riemann hypothesis in simple terms."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(
    model,
    tokenizer,
    prompt=text,
    max_tokens=2048,
    kv_cache=cache,
)
print(response)
```

## Quality Notes

- 2-bit weights + 2-bit KV cache is the most aggressive quantization combination, but RotorQuant's rotation-based approach preserves more quality than standard methods (perplexity 6.91 vs 7.07).
- For higher quality on Apple Silicon, consider 4-bit weight variants with 4-bit KV cache.
- Thinking-mode reasoning quality may be more sensitive to quantization, since the model relies on both weight precision and cached reasoning tokens for its final answer.
- Best suited for: prototyping, development, long-context exploration, and scenarios where running the model at all matters more than peak quality.
