Add model card
README.md
ADDED
@@ -0,0 +1,101 @@
---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3.5-27B
tags:
- rotorquant
- 2bit
- kv-cache
- quantized
- qwen3.5
- thinking
language:
- en
pipeline_tag: text-generation
---

# Qwen3.5-27B-RotorQuant-2bit

**2-bit KV cache compression** for [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) using [RotorQuant](https://github.com/scrya-com/rotorquant).

> This is a **KV-cache-only** repository. It contains no model weight files — only the configuration and model card for applying RotorQuant 2-bit KV cache quantization at runtime on the original Qwen3.5-27B weights.

## Overview

Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.

RotorQuant applies rotation-based isotropic quantization to the KV cache, achieving better quality and speed than standard quantization approaches at the same bit width.
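
The idea behind the rotation step can be sketched in a few lines of PyTorch: multiply each key/value vector by a fixed orthogonal matrix so that outlier channels get spread across all dimensions, then apply plain uniform 2-bit quantization per group. The sketch below is illustrative only: the function names, group size, and QR-derived rotation are assumptions made for this example, and the actual RotorQuant kernels use their own rotation construction and bit packing.

```python
import torch

def random_rotation(dim: int, seed: int = 0) -> torch.Tensor:
    # Fixed orthogonal matrix shared by quantize/dequantize (illustrative;
    # production kernels typically use fast Hadamard-style transforms).
    gen = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(dim, dim, generator=gen))
    return q

def quantize_2bit(x: torch.Tensor, rot: torch.Tensor, group: int = 64):
    # Rotate channels, then uniform 2-bit quantization per group of channels.
    xr = (x @ rot).reshape(-1, group)
    lo = xr.min(dim=-1, keepdim=True).values
    hi = xr.max(dim=-1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / 3.0           # 2 bits -> levels 0..3
    codes = ((xr - lo) / scale).round().clamp(0, 3).to(torch.uint8)
    return codes, scale, lo

def dequantize_2bit(codes, scale, lo, rot, shape):
    xr = codes.to(scale.dtype) * scale + lo
    return xr.reshape(shape) @ rot.T                  # undo the rotation

# Toy example: keys for one head, head_dim = 128 (assumed for illustration)
k = torch.randn(1024, 128)
rot = random_rotation(128)
codes, scale, lo = quantize_2bit(k, rot)
k_hat = dequantize_2bit(codes, scale, lo, rot, k.shape)
print("mean abs error:", (k - k_hat).abs().mean().item())
```

Rotating first matters because per-group min/max quantization is dominated by outliers; an isotropic rotation makes the per-group ranges far more uniform, which is what lets 2 bits retain usable precision.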

### RotorQuant Advantages

| Metric | RotorQuant 2-bit | Standard 2-bit |
|---|---|---|
| Prefill speed | **5.3x faster** | Baseline |
| Decode speed | **28% faster** | Baseline |
| Perplexity | **6.91** | 7.07 |

RotorQuant achieves lower perplexity (better quality) while also being faster — a rare combination at aggressive quantization levels.

## Specifications

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Architecture | Hybrid Transformer |
| Native context | 262,144 tokens |
| Thinking mode | Yes |
| KV cache method | RotorQuant 2-bit (IsoQuant) |
| KV cache compression | ~10x vs FP16 |
| Weights | Original (FP16/BF16, loaded separately) |

## Memory Estimates

| Component | Estimate |
|---|---|
| Model weights (BF16) | ~54 GB |
| KV cache at 128K context (2-bit RotorQuant) | ~1.3 GB |
| KV cache at 128K context (FP16, baseline) | ~12.8 GB |
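
As a sanity check, the FP16 baseline follows from the standard formula: 2 (keys and values) x layers x KV heads x head dim x tokens x bytes per element. The snippet below uses illustrative config values (48 layers, 4 KV heads, head dim 128 are assumptions, not the published Qwen3.5-27B configuration) and derives the 2-bit figure from the ~10x ratio quoted above rather than from RotorQuant's actual packing.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # Per layer the cache stores K and V, each of shape [tokens, kv_heads, head_dim].
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# Illustrative values only -- read the real ones from the model's config.json.
layers, kv_heads, head_dim = 48, 4, 128
tokens = 128 * 1024

fp16_bytes = kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2)
print(f"FP16 KV cache @ 128K: {fp16_bytes / 1e9:.1f} GB")       # ~12.9 GB
print(f"RotorQuant 2-bit    : {fp16_bytes / 10 / 1e9:.1f} GB")  # applying the ~10x ratio
```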

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import IsoQuantCache

model_id = "Qwen/Qwen3.5-27B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Apply 2-bit RotorQuant KV cache compression
cache = IsoQuantCache(bits=2)

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=2048,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Quality Notes

- **2-bit is aggressive quantization**, but RotorQuant's rotation-based approach preserves more quality than standard methods (perplexity 6.91 vs 7.07).
- Best suited for memory-constrained scenarios where fitting long-context inference on limited hardware is essential.
- For higher quality with moderate compression, consider 4-bit KV cache variants (see the snippet after this list).
- Thinking mode reasoning quality may be more sensitive to cache quantization since the model relies on cached reasoning tokens for its final answer.
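
If the `IsoQuantCache` constructor accepts the same `bits` argument used in the Quickstart (an assumption based on that snippet, not on documented API), switching to a milder 4-bit cache is a one-line change:

```python
# Roughly half the compression of the 2-bit cache, but gentler on quality.
cache = IsoQuantCache(bits=4)

outputs = model.generate(inputs, max_new_tokens=2048, past_key_values=cache)
```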

## References

- [RotorQuant](https://github.com/scrya-com/rotorquant) — Rotation-based isotropic KV cache quantization
- [Qwen3.5-27B base model](https://huggingface.co/Qwen/Qwen3.5-27B)

## See Also

- [majentik/Qwen3.5-27B-TurboQuant-2bit](https://huggingface.co/majentik/Qwen3.5-27B-TurboQuant-2bit) — TurboQuant 2-bit KV cache variant
- [majentik/Qwen3.5-27B-TurboQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.5-27B-TurboQuant-MLX-2bit) — MLX 2-bit weights + TurboQuant KV cache
- [majentik/Qwen3.5-27B-RotorQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.5-27B-RotorQuant-MLX-2bit) — MLX 2-bit weights + RotorQuant KV cache