| --- |
| library_name: transformers |
| license: apache-2.0 |
| base_model: Qwen/Qwen3.5-27B |
| tags: |
| - rotorquant |
| - 2bit |
| - kv-cache |
| - quantized |
| - qwen3.5 |
| - thinking |
| language: |
| - en |
| pipeline_tag: text-generation |
| --- |
| |
| # Qwen3.5-27B-RotorQuant-2bit |
|
|
| **2-bit KV cache compression** for [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) using [RotorQuant](https://github.com/scrya-com/rotorquant). |
|
|
| > This is a **KV-cache-only** repository. It contains no model weight files, only the configuration and model card for applying RotorQuant 2-bit KV cache quantization at runtime on the original Qwen3.5-27B weights. |
|
|
| ## Overview |
|
|
| Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory. |
|
|
| RotorQuant applies rotation-based isotropic quantization to the KV cache. In the comparison below, it reaches lower perplexity and faster prefill and decode than a standard 2-bit baseline. |
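
The rotate-then-quantize idea can be illustrated with a small NumPy sketch. This is not RotorQuant's implementation: the random orthogonal matrix, the per-vector min-max 2-bit quantizer, and the helper names `random_rotation`, `quantize_2bit`, and `dequantize` are assumptions chosen only to show the data flow (rotate, quantize to four levels, dequantize, rotate back).

```python
import numpy as np

def random_rotation(dim, rng):
    """Random orthogonal matrix (hypothetical stand-in for RotorQuant's rotation)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def quantize_2bit(x):
    """Uniform min-max quantization of one vector to 4 levels (codes 0..3)."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 3 or 1.0  # guard against a constant vector
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Map 2-bit codes back to approximate float values."""
    return codes.astype(np.float64) * scale + lo

rng = np.random.default_rng(0)
dim = 128
rot = random_rotation(dim, rng)
key = rng.standard_normal(dim)  # toy stand-in for one cached key vector

rotated = rot @ key
codes, lo, scale = quantize_2bit(rotated)
restored = rot.T @ dequantize(codes, lo, scale)  # rot.T inverts an orthogonal rotation

rel_error = np.linalg.norm(restored - key) / np.linalg.norm(key)
print(f"relative reconstruction error: {rel_error:.3f}")
```

Because the rotation is orthogonal, it preserves vector norms and dot products, so the only loss in this sketch comes from the 2-bit rounding step.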
|
|
| ### RotorQuant Advantages |
|
|
| | Metric | RotorQuant 2-bit | Standard 2-bit | |
| |---|---|---| |
| | Prefill speed | **5.3x faster** | Baseline | |
| | Decode speed | **28% faster** | Baseline | |
| Perplexity (lower is better) | **6.91** | 7.07 | |
|
|
| RotorQuant achieves lower perplexity (better quality) while also being faster, a rare combination at aggressive quantization levels. |
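
For readers less familiar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition and the relative gap between the two figures in the table above (about 2.3%). The `perplexity` helper and the toy log-probabilities are illustrative assumptions, and the card does not state which evaluation set produced the reported numbers.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), using natural-log probabilities."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy check: if every token is predicted with probability 0.5, perplexity is 2.
toy = perplexity([math.log(0.5)] * 4)

# Relative gap between the two perplexities reported in the table above.
rotor_ppl, standard_ppl = 6.91, 7.07
relative_gap = (standard_ppl - rotor_ppl) / standard_ppl
print(f"toy perplexity: {toy:.2f}, relative gap: {relative_gap:.1%}")
```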
|
|
| ## Specifications |
|
|
| | Property | Value | |
| |---|---| |
| | Base model | Qwen/Qwen3.5-27B | |
| | Parameters | 27B | |
| | Architecture | Hybrid Transformer | |
| | Native context | 262,144 tokens | |
| | Thinking mode | Yes | |
| | KV cache method | RotorQuant 2-bit (IsoQuant) | |
| | KV cache compression | ~10x vs FP16 | |
| | Weights | Original (FP16/BF16, loaded separately) | |
|
|
| ## Memory Estimates |
|
|
| | Component | Estimate | |
| |---|---| |
| | Model weights (BF16) | ~54 GB | |
| | KV cache at 128K context (2-bit RotorQuant) | ~1.3 GB | |
| | KV cache at 128K context (FP16, baseline) | ~12.8 GB | |
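
Estimates like these follow from simple arithmetic over the cache shape. The sketch below counts raw key and value payload only. The configuration values are hypothetical placeholders, not Qwen3.5-27B's real settings, so it will not reproduce the table's figures. It also ignores quantization metadata such as scales, and in a hybrid architecture only the layers that actually keep a KV cache should be counted.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits):
    """Raw KV payload: keys and values for every caching layer, head, and position."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = K and V
    return elements * bits // 8

# Hypothetical configuration, NOT Qwen3.5-27B's real values.
cfg = dict(num_layers=16, num_kv_heads=4, head_dim=256, seq_len=128 * 1024)

fp16 = kv_cache_bytes(**cfg, bits=16)
two_bit = kv_cache_bytes(**cfg, bits=2)
print(f"FP16: {fp16 / 2**30:.1f} GiB, 2-bit: {two_bit / 2**30:.1f} GiB")
```

To estimate a real deployment, replace the placeholders with the values from the base model's configuration.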
|
|
| ## Quickstart |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| from turboquant import IsoQuantCache |
| |
| model_id = "Qwen/Qwen3.5-27B" |
| |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto") |
| |
| # Apply 2-bit RotorQuant KV cache compression |
| cache = IsoQuantCache(bits=2) |
| |
| messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}] |
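| # add_generation_prompt=True appends the assistant turn header so the model answers the user message |
| inputs = tokenizer.apply_chat_template( |
|     messages, add_generation_prompt=True, return_tensors="pt" |
| ).to(model.device) |
| |
| outputs = model.generate( |
|     inputs, |
|     max_new_tokens=2048, |
|     past_key_values=cache,  # route the KV cache through the 2-bit quantized cache |
| ) |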
| print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
| ``` |
|
|
| ## Quality Notes |
|
|
| - **2-bit is aggressive quantization**, but RotorQuant's rotation-based approach preserves more quality than standard methods (perplexity 6.91 vs 7.07). |
| - Best suited for memory-constrained scenarios where fitting long-context inference on limited hardware is essential. |
| - For higher quality with moderate compression, consider 4-bit KV cache variants. |
| - Thinking mode reasoning quality may be more sensitive to cache quantization since the model relies on cached reasoning tokens for its final answer. |
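
When comparing outputs with and without cache quantization, it can help to inspect the reasoning and the final answer separately. The sketch below assumes the reasoning is wrapped in `<think>` and `</think>` tags, as in earlier Qwen thinking models. This card does not confirm that, so check the base model's chat template. `split_thinking` is a hypothetical helper, not part of any library.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Split decoded output into (reasoning, answer); the tag names are an assumption."""
    before, sep, after = text.partition(close_tag)
    if not sep:  # no closing tag found: treat everything as the answer
        return "", text.strip()
    reasoning = before.replace(open_tag, "", 1).strip()
    return reasoning, after.strip()

sample = (
    "<think>Primes thin out; zeta zeros encode how.</think>"
    "The hypothesis is about where the zeros of the zeta function lie."
)
reasoning, answer = split_thinking(sample)
print(len(reasoning.split()), "reasoning words;", len(answer.split()), "answer words")
```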
|
|
| ## References |
|
|
| - [RotorQuant](https://github.com/scrya-com/rotorquant): Rotation-based isotropic KV cache quantization |
| - [Qwen3.5-27B base model](https://huggingface.co/Qwen/Qwen3.5-27B) |
|
|
| ## See Also |
|
|
| - [majentik/Qwen3.5-27B-TurboQuant-2bit](https://huggingface.co/majentik/Qwen3.5-27B-TurboQuant-2bit): TurboQuant 2-bit KV cache variant |
| - [majentik/Qwen3.5-27B-TurboQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.5-27B-TurboQuant-MLX-2bit): MLX 2-bit weights + TurboQuant KV cache |
| - [majentik/Qwen3.5-27B-RotorQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.5-27B-RotorQuant-MLX-2bit): MLX 2-bit weights + RotorQuant KV cache |
|
|