Qwen3.5-27B-TurboQuant -- TurboQuant KV Cache Compression

Qwen3.5-27B with TurboQuant KV cache compression applied. TurboQuant is an online vector quantization method that compresses KV caches to 2/3/4-bit precision with no training or calibration data required. At 4-bit, compression is effectively lossless (bit-identical prefill logits to the unquantized model).

The base model is Qwen/Qwen3.5-27B, a 27B-parameter hybrid transformer combining gated delta networks with a sparse mixture-of-experts design. It supports a 262K native context with extension to 1M+ tokens and operates in thinking mode by default.

What is TurboQuant?

TurboQuant is an online vector quantization algorithm introduced in "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate". It compresses KV cache entries on-the-fly during inference using two key operations:

  1. Random rotation -- applies a randomized orthogonal transformation to decorrelate KV cache dimensions before quantization.
  2. Lloyd-Max scalar quantization -- applies optimal non-uniform scalar quantization to each rotated dimension, minimizing mean squared error for the observed distribution.

Because TurboQuant operates entirely online (no pre-training, calibration set, or fine-tuning), it can be applied to any pretrained model at inference time. The method achieves near-optimal distortion-rate performance, meaning it approaches the theoretical minimum quantization error for a given bit budget.
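The two operations above can be sketched in NumPy. This is an illustrative reconstruction under stated assumptions (a Haar-random orthogonal rotation and a k-means-style 1-D Lloyd-Max quantizer), not the library's actual implementation:

```python
# Sketch of TurboQuant's two core steps: random rotation + Lloyd-Max
# scalar quantization. Dimensions and iteration counts are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix for uniform distribution

def lloyd_max(x, bits, iters=25):
    """1-D Lloyd-Max quantizer: alternate nearest-level assignment and
    centroid updates, minimizing MSE for the observed distribution."""
    levels = np.quantile(x, np.linspace(0, 1, 2 ** bits))  # init codebook
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

d, n = 16, 2000
R = random_rotation(d)
kv = rng.standard_normal((n, d))       # stand-in for KV cache entries
rotated = kv @ R                       # decorrelate dimensions
quantized = np.stack([lloyd_max(rotated[:, j], bits=4) for j in range(d)], axis=1)
recovered = quantized @ R.T            # undo the rotation at read time
mse = np.mean((recovered - kv) ** 2)
print(f"4-bit reconstruction MSE: {mse:.5f}")
```

Because the rotation is orthogonal, it preserves mean squared error, so the per-dimension quantization error in the rotated space equals the end-to-end reconstruction error.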

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model = AutoModelForCausalLM.from_pretrained(
    "majentik/Qwen3.5-27B-TurboQuant",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("majentik/Qwen3.5-27B-TurboQuant")

# Apply chat template (Qwen3.5 supports thinking mode)
messages = [{"role": "user", "content": "Explain quantum computing"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# 4-bit TurboQuant cache -- lossless compression
cache = TurboQuantCache(bits=4)
output = model.generate(**inputs, max_new_tokens=2048, past_key_values=cache, use_cache=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Configuration

Bit Width   Quality                                    Compression      Recommended Use
4-bit       Lossless (bit-identical prefill logits)    ~4x KV cache     Default choice -- no quality loss
3-bit       Near-lossless                              ~5.3x KV cache   Memory-constrained deployments
2-bit       Slight degradation                         ~8x KV cache     Extreme memory constraints
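The compression factors in the table follow directly from the 16-bit (FP16/BF16) baseline; a minimal sketch, ignoring the small per-block scale/codebook metadata that slightly reduces real-world ratios:

```python
# Compression ratio of an n-bit KV cache relative to a 16-bit baseline.
# Ignores quantization metadata overhead, so real ratios are a bit lower.
def compression_ratio(bits, baseline_bits=16):
    return baseline_bits / bits

for bits in (4, 3, 2):
    print(f"{bits}-bit: ~{compression_ratio(bits):.1f}x")  # 4.0x, 5.3x, 8.0x
```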

Select the bit width when constructing the cache:

cache = TurboQuantCache(bits=4)  # Lossless
cache = TurboQuantCache(bits=3)  # Near-lossless
cache = TurboQuantCache(bits=2)  # Maximum compression

Memory Savings

Qwen3.5-27B's deep, many-layered architecture produces a substantial KV cache. TurboQuant provides significant VRAM savings, especially at long context lengths, where the KV cache dominates memory usage.

Context Length   FP16 KV Cache   4-bit TurboQuant   3-bit TurboQuant   2-bit TurboQuant
8K               ~3.4 GB         ~0.85 GB           ~0.64 GB           ~0.43 GB
32K              ~13.5 GB        ~3.4 GB            ~2.5 GB            ~1.7 GB
128K             ~54 GB          ~13.5 GB           ~10.2 GB           ~6.8 GB
262K (native)    ~110 GB         ~27.5 GB           ~20.8 GB           ~13.8 GB

Estimates based on Qwen3.5-27B KV cache dimensions. Actual savings depend on model configuration and batch size.
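For estimates at other context lengths, the standard KV cache sizing formula can be sketched as below. The layer, head, and dimension defaults are placeholder assumptions for illustration, not verified Qwen3.5-27B config values; substitute the real config to reproduce the table:

```python
# Estimate KV cache size for a given context length and bit width.
# n_layers / n_kv_heads / head_dim defaults are ASSUMED placeholders,
# not actual Qwen3.5-27B configuration values.
def kv_cache_bytes(seq_len, bits, n_layers=48, n_kv_heads=8, head_dim=128):
    # 2 tensors (K and V) per layer, bits/8 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits / 8

for ctx in (8_192, 32_768, 131_072):
    fp16 = kv_cache_bytes(ctx, 16) / 2**30
    q4 = kv_cache_bytes(ctx, 4) / 2**30
    print(f"{ctx:>7} tokens: FP16 ~{fp16:.1f} GiB, 4-bit ~{q4:.1f} GiB")
```

Whatever the true dimensions, the ratio between FP16 and n-bit sizes is fixed at 16/n, which is where the ~4x/~5.3x/~8x factors come from.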

Thinking Mode

Qwen3.5-27B generates extended chain-of-thought reasoning before producing its final response. These thinking tokens can consume substantial KV cache memory -- often thousands of tokens of internal reasoning before a single output token is emitted. TurboQuant is especially valuable here because:

  • Thinking tokens are generated autoregressively and cached, so KV cache grows rapidly during the reasoning phase.
  • At 4-bit with no quality loss, you get 4x more reasoning capacity within the same VRAM budget.
  • This enables longer, more thorough reasoning chains without hitting out-of-memory errors.
