# Qwen3.5-27B-TurboQuant -- TurboQuant KV Cache Compression
Qwen3.5-27B with TurboQuant KV cache compression applied. TurboQuant is an online vector quantization method that compresses KV caches to 2/3/4-bit precision with no training or calibration data required. At 4-bit, compression is lossless in practice: prefill logits are bit-identical to those of the unquantized model.
The base model is Qwen/Qwen3.5-27B, a 27B parameter hybrid transformer combining gated delta networks with sparse mixture-of-experts. It supports 262K native context with extension to 1M+ tokens and operates in thinking mode by default.
## What is TurboQuant?
TurboQuant is an online vector quantization algorithm introduced in "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate". It compresses KV cache entries on-the-fly during inference using two key operations:
- Random rotation -- applies a randomized orthogonal transformation to decorrelate KV cache dimensions before quantization.
- Lloyd-Max scalar quantization -- applies optimal non-uniform scalar quantization to each rotated dimension, minimizing mean squared error for the observed distribution.
Because TurboQuant operates entirely online (no pre-training, calibration set, or fine-tuning), it can be applied to any pretrained model at inference time. The method achieves near-optimal distortion-rate performance, meaning it approaches the theoretical minimum quantization error for a given bit budget.
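The two operations above can be sketched in a few lines of NumPy. This is an illustrative toy, not the shipped implementation: the `random_rotation` and `lloyd_max` helpers are hypothetical names, and the real TurboQuant kernels quantize per-token on the GPU during decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR of a Gaussian matrix;
    # the sign fix makes the distribution uniform (Haar).
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def lloyd_max(x, bits, iters=30):
    # Lloyd-Max scalar quantization: alternate between nearest-level
    # assignment and moving each level to the mean of its cell.
    levels = np.quantile(x, np.linspace(0, 1, 2 ** bits + 2)[1:-1])
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()
    return levels[idx], idx

# Toy demo: rotate a batch of "KV vectors" to decorrelate dimensions,
# then quantize each rotated dimension independently at 4 bits.
d = 64
kv = rng.standard_normal((1024, d)) * np.linspace(0.1, 2.0, d)
Q = random_rotation(d)
rotated = kv @ Q.T
deq = np.stack([lloyd_max(rotated[:, j], bits=4)[0] for j in range(d)], axis=1)
recon = deq @ Q  # invert the rotation
err = np.mean((recon - kv) ** 2) / np.mean(kv ** 2)
```

On this toy data the relative reconstruction error at 4 bits lands well under a percent or two, which is the intuition behind the "effectively lossless" claim above.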
## Quickstart
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model = AutoModelForCausalLM.from_pretrained(
    "majentik/Qwen3.5-27B-TurboQuant",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("majentik/Qwen3.5-27B-TurboQuant")

# Apply chat template (Qwen3.5 supports thinking mode)
messages = [{"role": "user", "content": "Explain quantum computing"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# 4-bit TurboQuant cache -- lossless compression
cache = TurboQuantCache(bits=4)
output = model.generate(**inputs, max_new_tokens=2048, past_key_values=cache, use_cache=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Configuration
| Bit Width | Quality | Compression | Recommended Use |
|---|---|---|---|
| 4-bit | Lossless (bit-identical prefill logits) | ~4x KV cache | Default choice -- no quality loss |
| 3-bit | Near-lossless | ~5.3x KV cache | Memory-constrained deployments |
| 2-bit | Slight degradation | ~8x KV cache | Extreme memory constraints |
Select the bit width when constructing the cache:
```python
cache = TurboQuantCache(bits=4)  # Lossless
cache = TurboQuantCache(bits=3)  # Near-lossless
cache = TurboQuantCache(bits=2)  # Maximum compression
```
## Memory Savings
Qwen3.5-27B's KV cache grows linearly with context length and comes to dominate VRAM usage at long contexts. TurboQuant's savings are largest in exactly that regime.
| Context Length | FP16 KV Cache | 4-bit TurboQuant | 3-bit TurboQuant | 2-bit TurboQuant |
|---|---|---|---|---|
| 8K | ~3.4 GB | ~0.85 GB | ~0.64 GB | ~0.43 GB |
| 32K | ~13.5 GB | ~3.4 GB | ~2.5 GB | ~1.7 GB |
| 128K | ~54 GB | ~13.5 GB | ~10.2 GB | ~6.8 GB |
| 262K (native) | ~110 GB | ~27.5 GB | ~20.8 GB | ~13.8 GB |
Estimates based on Qwen3.5-27B KV cache dimensions. Actual savings depend on model configuration and batch size.
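KV-cache size is linear in every dimension, so the compression ratio depends only on the bit width. The estimator below illustrates the arithmetic behind the table; the layer/head/dim defaults are illustrative assumptions, not the published Qwen3.5-27B configuration.

```python
def kv_cache_bytes(seq_len, bits=16, n_layers=48, n_kv_heads=8,
                   head_dim=128, batch=1):
    """Rough KV-cache footprint in bytes.

    n_layers / n_kv_heads / head_dim are illustrative assumptions,
    not Qwen3.5-27B's actual configuration.
    """
    # 2x for keys and values; bits / 8 bytes per stored element.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bits // 8

ctx = 32_768
fp16 = kv_cache_bytes(ctx, bits=16)
q4 = kv_cache_bytes(ctx, bits=4)
print(f"fp16: {fp16 / 2**30:.1f} GiB, 4-bit: {q4 / 2**30:.1f} GiB "
      f"({fp16 / q4:.0f}x smaller)")
```

Whatever the true dimensions, the fp16-to-4-bit ratio is always 16/4 = 4x, matching the table.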
## Thinking Mode
Qwen3.5-27B generates extended chain-of-thought reasoning before producing its final response. These thinking tokens can consume substantial KV cache memory -- often thousands of tokens of internal reasoning before a single output token is emitted. TurboQuant is especially valuable here because:
- Thinking tokens are generated autoregressively and cached, so KV cache grows rapidly during the reasoning phase.
- At 4-bit with no quality loss, you get 4x more reasoning capacity within the same VRAM budget.
- This enables longer, more thorough reasoning chains without hitting out-of-memory errors.
## See Also

- Base model: Qwen/Qwen3.5-27B