---
library_name: transformers
base_model: Qwen/Qwen3.5-27B
tags:
- rotorquant
- kv-cache-quantization
- efficient-inference
- qwen3.5
- thinking-model
license: apache-2.0
---

# Qwen3.5-27B-RotorQuant -- RotorQuant KV Cache Compression

Qwen3.5-27B with **RotorQuant** KV cache compression applied. RotorQuant uses block-diagonal rotations derived from Clifford algebra to compress KV caches with substantially better speed and efficiency than prior methods. At 3-bit precision, it achieves approximately 10x KV cache compression while maintaining strong output quality.

The base model is [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B), a 27B parameter hybrid transformer combining gated delta networks with sparse mixture-of-experts. It supports 262K native context with extension to 1M+ tokens and operates in thinking mode by default.

## What is RotorQuant?

[RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression framework that replaces the dense random rotation used in methods like TurboQuant with **block-diagonal rotations** grounded in Clifford algebra. This architectural choice yields major practical advantages:

- **28% faster decode** and **5.3x faster prefill** compared to TurboQuant
- **128x fewer parameters** (128 vs 16,384) for the rotation matrices
- **O(d) complexity** vs O(d log d) for the rotation step
- **Lower perplexity**: 6.91 vs 7.07 (TurboQuant) on standard benchmarks

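The parameter-count gap follows directly from the shapes involved. A quick check, using the rotation dimension d = 128 implied by the figures above (illustrative arithmetic, not library code):

```python
d = 128  # rotation dimension implied by the figures above

# A dense random rotation is a full d x d matrix.
dense_params = d * d          # 16,384 entries

# A block-diagonal rotation stores O(d) numbers in total, e.g. one
# 4-parameter quaternion per 4-dimensional block: (d // 4) * 4 = d.
block_params = (d // 4) * 4   # 128 entries

print(dense_params, block_params, dense_params // block_params)  # 16384 128 128
```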
RotorQuant ships three backend implementations, each offering a different speed/quality tradeoff:

| Backend | Algebra | Best For |
|---------|---------|----------|
| **PlanarQuant** | 2D Givens rotations | Fastest inference -- production deployments |
| **IsoQuant** | 4D quaternion rotations | Balanced speed and quality |
| **RotorQuant** | 3D Clifford rotors | Research and maximum quality |

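To make the block-diagonal idea concrete, here is a minimal NumPy sketch (not the library's implementation) of a PlanarQuant-style step: one 2x2 Givens rotation per pair of adjacent dimensions, applied in O(d):

```python
import numpy as np

def block_givens_rotate(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Rotate each consecutive pair of dimensions by its own angle.

    Equivalent to multiplying by a block-diagonal orthogonal matrix,
    but costs O(d) instead of the O(d^2) of a dense rotation.
    """
    c, s = np.cos(angles), np.sin(angles)
    y = np.empty_like(x)
    y[0::2] = c * x[0::2] - s * x[1::2]
    y[1::2] = s * x[0::2] + c * x[1::2]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
angles = rng.standard_normal(64)  # one angle per 2D block
y = block_givens_rotate(x, angles)
# Rotations are norm-preserving, which keeps quantization error well behaved.
print(np.allclose(np.linalg.norm(x), np.linalg.norm(y)))  # True
```

Because each block is orthogonal, the transform is exactly invertible by negating the angles, so dequantization can undo the rotation.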
## Quickstart

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from rotorquant import IsoQuantCache

model = AutoModelForCausalLM.from_pretrained(
    "majentik/Qwen3.5-27B-RotorQuant",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("majentik/Qwen3.5-27B-RotorQuant")

# Apply chat template (Qwen3.5 supports thinking mode)
messages = [{"role": "user", "content": "Explain quantum computing"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# 3-bit IsoQuant cache -- recommended setting (~10x KV compression)
cache = IsoQuantCache(bits=3)
output = model.generate(**inputs, max_new_tokens=2048, past_key_values=cache, use_cache=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Switching Backends

```python
from rotorquant import PlanarQuantCache, IsoQuantCache, RotorQuantCache

# Fastest -- 2D Givens rotations (production)
cache = PlanarQuantCache(bits=3)

# Balanced -- 4D quaternion rotations
cache = IsoQuantCache(bits=3)

# Research -- 3D Clifford rotors (highest quality)
cache = RotorQuantCache(bits=3)
```
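
Whichever backend rotates the vectors, the cache then quantizes them to a few bits per value. A toy absmax 3-bit quantizer (purely illustrative; RotorQuant's actual storage format is not shown here) captures the round trip:

```python
import numpy as np

def quantize_3bit(x: np.ndarray):
    """Map floats to signed 3-bit levels [-4, 3] with a shared absmax scale.

    Scaling by absmax/3 keeps values in [-3, 3], so nothing clips.
    """
    scale = np.abs(x).max() / 3.0
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)
q, scale = quantize_3bit(x)
x_hat = dequantize(q, scale)
# Worst-case rounding error is half a quantization step.
print(np.abs(x - x_hat).max() <= scale / 2 + 1e-6)  # True
```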

## Configuration

| Bit Width | Quality | Compression | Recommended Use |
|-----------|---------|-------------|-----------------|
| **4-bit** | Near-lossless | ~4x KV cache | Quality-sensitive applications |
| **3-bit** | Strong (ppl 6.91) | ~10x KV cache | **Recommended default** -- best quality/compression tradeoff |
| **2-bit** | Moderate degradation | ~16x KV cache | Extreme memory constraints |

The 3-bit setting is recommended as the default. It provides approximately 10x KV cache compression with a perplexity of 6.91, which is lower (better) than TurboQuant's 7.07 at the same bit width.

## Memory Savings

Qwen3.5-27B accumulates large KV caches at long context lengths. RotorQuant's 3-bit mode provides approximately 10x compression, making long-context inference practical on fewer GPUs.

| Context Length | FP16 KV Cache | 4-bit RotorQuant | 3-bit RotorQuant | 2-bit RotorQuant |
|---------------|---------------|-------------------|-------------------|-------------------|
| 8K | ~3.4 GB | ~0.85 GB | ~0.34 GB | ~0.21 GB |
| 32K | ~13.5 GB | ~3.4 GB | ~1.35 GB | ~0.84 GB |
| 128K | ~54 GB | ~13.5 GB | ~5.4 GB | ~3.4 GB |
| 262K (native) | ~110 GB | ~27.5 GB | ~11 GB | ~6.9 GB |

*Estimates based on Qwen3.5-27B KV cache dimensions. Actual savings depend on model configuration and batch size.*

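The table follows directly from the documented compression ratios. A small helper (the ratios are the approximate figures stated above, not measured here):

```python
# Approximate compression ratios documented for each bit width.
DOCUMENTED_RATIOS = {4: 4.0, 3: 10.0, 2: 16.0}

def compressed_cache_gb(fp16_cache_gb: float, bits: int) -> float:
    """Estimate the compressed KV cache size from its FP16 size."""
    return fp16_cache_gb / DOCUMENTED_RATIOS[bits]

# 128K context: ~54 GB in FP16 -> ~5.4 GB at 3-bit, matching the table.
print(round(compressed_cache_gb(54, 3), 1))  # 5.4
```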
## Performance vs TurboQuant

| Metric | RotorQuant | TurboQuant |
|--------|------------|------------|
| Decode speed | **28% faster** | Baseline |
| Prefill speed | **5.3x faster** | Baseline |
| Rotation parameters | **128** | 16,384 |
| Rotation complexity | **O(d)** | O(d log d) |
| Perplexity (3-bit) | **6.91** | 7.07 |

## Thinking Mode

Qwen3.5-27B generates extended chain-of-thought reasoning before producing its final response. These thinking tokens can consume substantial KV cache memory -- often thousands of tokens of internal reasoning before a single output token is emitted. RotorQuant is especially valuable here because:

- Thinking tokens are generated autoregressively and cached, so KV cache grows rapidly during the reasoning phase.
- At 3-bit with ~10x compression, you can sustain much longer reasoning chains within the same VRAM budget.
- The 5.3x faster prefill directly accelerates the initial prompt processing, which matters for long system prompts and multi-turn conversations.
- The 28% faster decode speeds up the token-by-token generation during both thinking and response phases.

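As a rough budget calculation (assuming the per-token FP16 cost implied by the memory table, ~3.4 GB per 8K tokens), the context a given amount of VRAM can hold -- prompt, thinking tokens, and response combined -- scales directly with the compression ratio:

```python
FP16_GB_PER_8K_TOKENS = 3.4  # from the memory table above

def max_context_tokens(kv_budget_gb: float, compression: float = 10.0) -> int:
    """Tokens of context a given KV cache budget can hold at a compression ratio."""
    per_token_gb = FP16_GB_PER_8K_TOKENS / 8192
    return int(kv_budget_gb / per_token_gb * compression)

# ~11 GB at 3-bit (10x) sustains roughly the 262K native context,
# consistent with the memory table.
print(max_context_tokens(11.0))
```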
## See Also

- [Base model: Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B)
- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant paper (arXiv:2504.19874)](https://arxiv.org/abs/2504.19874)