# Qwen3.5-27B-TurboQuant-MLX-2bit
MLX 2-bit weight quantization + TurboQuant 2-bit KV cache compression for Qwen/Qwen3.5-27B.
Dual compression for Apple Silicon: both the model weights and the KV cache are quantized to 2-bit, enabling long-context inference on memory-constrained Macs.
## Overview
Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.
This variant applies two layers of compression:
- **MLX 2-bit weight quantization** — reduces the 27B model from 54 GB (BF16) to approximately **8 GB**, making it loadable on Apple Silicon devices with limited unified memory.
- **TurboQuant 2-bit KV cache** — compresses the key-value cache by approximately 8x compared to FP16, enabling long-context inference without running out of memory.
At 2-bit precision, both weight and cache quantization are aggressive — expect some quality degradation compared to 4-bit variants, but this combination enables running a 27B thinking model with long context on hardware that would otherwise be unable to fit it.
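To make the weight side of this concrete, here is a minimal, illustrative sketch of group-wise 2-bit affine quantization in NumPy. This is an assumption-laden toy, not MLX's actual kernel (MLX uses its own packing, group sizes, and scale/bias layout), but it shows the core idea: each group of weights is mapped to integers in {0, 1, 2, 3} plus a per-group scale and minimum.

```python
import numpy as np

def quantize_2bit(w, group_size=64):
    """Toy group-wise 2-bit affine quantization (not MLX's real kernel).

    Each group of `group_size` weights is mapped to 4 levels {0..3}
    with a per-group scale and minimum, so storage is ~2 bits/weight
    plus a small per-group overhead for scale and min.
    """
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 3.0  # 2 bits -> 4 quantization levels
    q = np.clip(np.round((w - w_min) / scale), 0, 3).astype(np.uint8)
    return q, scale, w_min

def dequantize_2bit(q, scale, w_min):
    # Reconstruct approximate weights from codes + per-group scale/min
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
q, scale, w_min = quantize_2bit(w)
w_hat = dequantize_2bit(q, scale, w_min)
print("max code:", q.max(), "max abs error:", float(np.abs(w - w_hat).max()))
```

The per-element reconstruction error is bounded by half a quantization step (scale / 2), which is why 2-bit is so much lossier than 4-bit: the same dynamic range is covered by 4 levels instead of 16.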
## Specifications
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Architecture | Hybrid Transformer |
| Native context | 262,144 tokens |
| Thinking mode | Yes |
| Weight quantization | MLX 2-bit |
| KV cache method | TurboQuant 2-bit |
| KV cache compression | ~8x vs FP16 |
| Runtime | MLX (Apple Silicon) |
## Memory Estimates
| Component | Estimate |
|---|---|
| Model weights (MLX 2-bit) | ~8 GB |
| KV cache at 128K context (2-bit TurboQuant) | ~1.6 GB |
| Total at 128K context | ~9.6 GB |
| Comparison: BF16 weights + FP16 KV at 128K | ~66.8 GB |
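The table above can be reproduced with back-of-envelope arithmetic. The KV-cache dimensions below (48 layers, 4 KV heads, head dim 128) are assumptions chosen to be consistent with the table's numbers; they are not published specs for this model, and real totals also include quantization scales and runtime overhead.

```python
GB = 1e9

# Weights: 27B parameters at 2 bytes (BF16) vs 0.25 bytes (2-bit)
params = 27e9
bf16_weights = params * 2 / GB    # ~54 GB
w_2bit = params * 0.25 / GB       # ~6.75 GB raw; ~8 GB with scales/overhead

# KV cache: 2 tensors (K and V) x layers x kv_heads x head_dim x tokens x bytes
# NOTE: layers/kv_heads/head_dim are hypothetical, picked to match the table.
layers, kv_heads, head_dim = 48, 4, 128
tokens = 128 * 1024
fp16_kv = 2 * layers * kv_heads * head_dim * tokens * 2 / GB  # ~12.9 GB
kv_2bit = fp16_kv / 8                                         # ~1.6 GB (8x compression)

print(f"BF16 weights: {bf16_weights:.1f} GB, FP16 KV @128K: {fp16_kv:.1f} GB")
print(f"2-bit total @128K: ~{w_2bit + 1.25 + kv_2bit:.1f} GB vs "
      f"{bf16_weights + fp16_kv:.1f} GB uncompressed")
```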
## Quickstart
```python
from mlx_lm import load, generate
from turboquant import TurboQuantCache

model_id = "majentik/Qwen3.5-27B-TurboQuant-MLX-2bit"
model, tokenizer = load(model_id)

# Apply 2-bit KV cache compression
cache = TurboQuantCache(bits=2)

prompt = "Explain the Riemann hypothesis in simple terms."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(
    model,
    tokenizer,
    prompt=text,
    max_tokens=2048,
    kv_cache=cache,
)
print(response)
```
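Because the model runs in thinking mode, the raw output may interleave internal reasoning with the final answer. A common Qwen convention is to wrap reasoning in `<think>...</think>` tags; this is an assumption about the output format (verify against the actual model output), but if it holds, a small helper can separate the two:

```python
import re

def split_thinking(response: str):
    """Split <think>...</think> reasoning from the final answer.

    ASSUMPTION: the model wraps internal reasoning in <think> tags,
    as other Qwen thinking models do. Adjust the pattern if this
    model uses a different delimiter.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        # No thinking block found: treat everything as the answer
        return "", response.strip()
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_thinking(
    "<think>Check small cases first.</think>\nThe answer is 42."
)
print(answer)  # -> The answer is 42.
```

Discarding the reasoning text after generation does not reduce KV cache usage (the reasoning tokens were already cached during generation), which is exactly why 2-bit cache compression helps thinking models.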
## Quality Notes
- 2-bit weights + 2-bit KV cache is the most aggressive quantization combination. Use this when memory is the primary constraint and some quality loss is acceptable.
- For higher quality on Apple Silicon, consider 4-bit weight variants with 4-bit KV cache.
- Thinking-mode reasoning quality may be especially sensitive to quantization, since the model depends on both weight precision and the fidelity of cached reasoning tokens when producing its final answer.
- Best suited for: prototyping, development, long-context exploration, and scenarios where running the model at all matters more than peak quality.
## References
- TurboQuant: Efficient KV Cache Quantization
- MLX — Apple's machine learning framework
- mlx-lm — LLM inference with MLX
- Qwen3.5-27B base model
## See Also
- majentik/Qwen3.5-27B-RotorQuant-MLX-2bit — MLX 2-bit weights + RotorQuant KV cache
- majentik/Qwen3.5-27B-TurboQuant-2bit — TurboQuant 2-bit KV cache only (transformers)
- majentik/Qwen3.5-27B-RotorQuant-2bit — RotorQuant 2-bit KV cache only (transformers)