Qwen3.5-27B-RotorQuant-2bit

2-bit KV cache compression for Qwen/Qwen3.5-27B using RotorQuant.

This is a KV-cache-only repository. It contains no model weight files, only the configuration and model card for applying RotorQuant 2-bit KV cache quantization at runtime on the original Qwen3.5-27B weights.

Overview

Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.

RotorQuant applies rotation-based isotropic quantization (IsoQuant) to the KV cache. At the same bit width, it is reported to reach lower perplexity and faster prefill and decode than standard quantization; the figures are listed under RotorQuant Advantages.
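To make the idea of rotating before quantizing concrete, here is a minimal pure-Python sketch. It is not RotorQuant's actual algorithm: the rotation used here (a normalized Walsh-Hadamard transform with random sign flips) and the min/max uniform quantizer are assumptions chosen for illustration, and the names `hadamard`, `quantize`, `dequantize`, `roundtrip`, and `mse` are hypothetical. The toy vector has two opposite-sign outlier channels, a case where quantizing in the rotated basis spreads the outlier energy across all coordinates.

```python
import math
import random

def hadamard(vec):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two).
    The normalized transform is orthogonal and is its own inverse."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def quantize(vec, bits=2):
    """Uniform min/max quantization to 2**bits levels; returns codes, offset, step."""
    levels = (1 << bits) - 1
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / levels or 1.0  # avoid a zero step for constant vectors
    codes = [round((x - lo) / step) for x in vec]
    return codes, lo, step

def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]

def roundtrip(vec, signs=None, bits=2):
    """Quantize and reconstruct vec, optionally inside a rotated basis."""
    if signs is None:
        return dequantize(*quantize(vec, bits))
    rotated = hadamard([s * x for s, x in zip(signs, vec)])
    restored = hadamard(dequantize(*quantize(rotated, bits)))
    return [s * x for s, x in zip(signs, restored)]  # undo the sign flips

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
dim = 64
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
vec = [rng.gauss(0.0, 0.1) for _ in range(dim)]
vec[5], vec[40] = 8.0, -8.0  # two outlier channels of opposite sign

plain_err = mse(vec, roundtrip(vec))
rotated_err = mse(vec, roundtrip(vec, signs))
print(f"plain 2-bit MSE:   {plain_err:.4f}")
print(f"rotated 2-bit MSE: {rotated_err:.4f}")
```

In this toy case the rotated round trip reconstructs the vector with lower error, because the outliers no longer force the small values onto levels far from zero. It is an illustration of the general principle only, not a benchmark of RotorQuant.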

RotorQuant Advantages

Metric | RotorQuant 2-bit | Standard 2-bit
Prefill speed | 5.3x faster | Baseline
Decode speed | 28% faster | Baseline
Perplexity | 6.91 | 7.07

RotorQuant achieves lower perplexity (better quality) while also being faster, a rare combination at aggressive quantization levels.
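For readers less familiar with the metric, the sketch below shows how perplexity is conventionally computed from per-token log-probabilities, and how large the gap between the two figures in the table above is in relative terms. The function names `perplexity` and `relative_gap` are hypothetical helpers; the card does not say which dataset or evaluation setup produced its numbers.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_gap(ppl_a, ppl_b):
    """Fractional reduction of ppl_a relative to ppl_b."""
    return (ppl_b - ppl_a) / ppl_b

# Figures quoted in the table above.
gap = relative_gap(6.91, 7.07)
print(f"RotorQuant perplexity is {gap:.1%} lower than standard 2-bit")
```

Lower perplexity means the model assigns higher probability to the reference text, so a smaller number indicates less quality loss from cache quantization.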

Specifications

Property | Value
Base model | Qwen/Qwen3.5-27B
Parameters | 27B
Architecture | Hybrid Transformer
Native context | 262,144 tokens
Thinking mode | Yes
KV cache method | RotorQuant 2-bit (IsoQuant)
KV cache compression | ~10x vs FP16
Weights | Original (FP16/BF16, loaded separately)

Memory Estimates

Component | Estimate
Model weights (BF16) | ~54 GB
KV cache at 128K context (2-bit RotorQuant) | ~1.3 GB
KV cache at 128K context (FP16, baseline) | ~12.8 GB
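The arithmetic behind these estimates can be sketched as follows. The first part only combines the figures quoted in the table above. The second part is a generic KV cache size formula with a hypothetical attention shape: the layer count, KV head count, and head dimension shown are placeholders, not the real Qwen3.5-27B configuration, and `kv_cache_bytes` is an illustrative helper that ignores quantization metadata such as scales.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Keys plus values for every cached token, ignoring scales and other metadata."""
    elements = 2 * tokens * layers * kv_heads * head_dim  # 2 = keys and values
    return elements * bits / 8

# Figures quoted in the table above (GB).
weights_gb = 54.0
kv_fp16_gb = 12.8
kv_2bit_gb = 1.3

ratio = kv_fp16_gb / kv_2bit_gb
total_fp16_gb = weights_gb + kv_fp16_gb
total_2bit_gb = weights_gb + kv_2bit_gb
print(f"KV cache ratio from the table: {ratio:.1f}x")
print(f"Weights + FP16 cache at 128K:  {total_fp16_gb:.1f} GB")
print(f"Weights + 2-bit cache at 128K: {total_2bit_gb:.1f} GB")

# Hypothetical shape for illustration only (NOT the actual model config).
shape = dict(tokens=131072, layers=16, kv_heads=4, head_dim=128)
demo_fp16 = kv_cache_bytes(bits=16, **shape)
demo_2bit = kv_cache_bytes(bits=2, **shape)
print(f"Demo shape: {demo_fp16 / 2**30:.2f} GiB at FP16, {demo_2bit / 2**30:.2f} GiB at 2-bit")
```

With the table's figures, the cache shrinks by roughly 9.8x and the combined footprint at 128K context drops from about 66.8 GB to about 55.3 GB. In the generic formula, bit width alone accounts for a factor of 16/2 = 8; the table's estimates come from the card itself and are not derived here.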

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import IsoQuantCache

model_id = "Qwen/Qwen3.5-27B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Apply 2-bit RotorQuant KV cache compression
cache = IsoQuantCache(bits=2)

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
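# add_generation_prompt=True appends the assistant turn header so the model answers
# rather than continuing the user message; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    past_key_values=cache,  # the compressed cache instance created above
)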
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quality Notes

  • 2-bit is aggressive quantization, but RotorQuant's rotation-based approach preserves more quality than standard methods (perplexity 6.91 vs 7.07).
  • Best suited for memory-constrained scenarios where fitting long-context inference on limited hardware is essential.
  • For higher quality with moderate compression, consider 4-bit KV cache variants.
  • Thinking mode reasoning quality may be more sensitive to cache quantization since the model relies on cached reasoning tokens for its final answer.

Model tree: majentik/Qwen3.5-27B-RotorQuant-2bit is derived from the base model Qwen/Qwen3.5-27B.