majentik committed · verified
Commit d39850a · 1 Parent(s): 87f9f46

Add model card

Files changed (1): README.md (+101, −0)
---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3.5-27B
tags:
- rotorquant
- 2bit
- kv-cache
- quantized
- qwen3.5
- thinking
language:
- en
pipeline_tag: text-generation
---

# Qwen3.5-27B-RotorQuant-2bit

**2-bit KV cache compression** for [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) using [RotorQuant](https://github.com/scrya-com/rotorquant).

> This is a **KV-cache-only** repository. It contains no model weight files — only the configuration and model card for applying RotorQuant 2-bit KV cache quantization at runtime on the original Qwen3.5-27B weights.

## Overview

Qwen3.5-27B is a 27B-parameter hybrid transformer with a 262K-token native context window and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.

RotorQuant applies rotation-based isotropic quantization to the KV cache, achieving better quality and higher speed than standard quantization approaches at the same bit width.

### RotorQuant Advantages

| Metric | RotorQuant 2-bit | Standard 2-bit |
|---|---|---|
| Prefill speed | **5.3x faster** | Baseline |
| Decode speed | **28% faster** | Baseline |
| Perplexity | **6.91** | 7.07 |

RotorQuant achieves lower perplexity (better quality) while also being faster — a rare combination at aggressive quantization levels.

## Specifications

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Architecture | Hybrid transformer |
| Native context | 262,144 tokens |
| Thinking mode | Yes |
| KV cache method | RotorQuant 2-bit (IsoQuant) |
| KV cache compression | ~10x vs FP16 |
| Weights | Original (FP16/BF16, loaded separately) |

## Memory Estimates

| Component | Estimate |
|---|---|
| Model weights (BF16) | ~54 GB |
| KV cache at 128K context (2-bit RotorQuant) | ~1.3 GB |
| KV cache at 128K context (FP16, baseline) | ~12.8 GB |
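
As a sanity check, the FP16 baseline above follows from the standard formula: 2 (K and V) × layers × KV heads × head dim × sequence length × bytes per element. The layer and head counts below are assumptions chosen for illustration (they are not published figures for Qwen3.5-27B), and the table's 2-bit figure additionally depends on RotorQuant's own packing and scale metadata, which this naive calculation does not capture.

```python
# Hypothetical architecture numbers, assumed for illustration only:
layers, kv_heads, head_dim = 48, 8, 64
seq_len = 128 * 1024  # 128K-token context

def kv_cache_gb(bits_per_element: float) -> float:
    """KV cache size in GB: K and V tensors across all layers and positions."""
    total_bits = 2 * layers * kv_heads * head_dim * seq_len * bits_per_element
    return total_bits / 8 / 1e9

print(f"FP16 KV cache @ 128K:  ~{kv_cache_gb(16):.1f} GB")
print(f"2-bit KV cache @ 128K: ~{kv_cache_gb(2):.1f} GB (before packing overhead)")
```

With these assumed shapes the FP16 row is reproduced (~12.9 GB vs the table's ~12.8 GB); the quantized row differs from the naive 2-bit figure because it reflects RotorQuant's actual storage layout.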

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from rotorquant import IsoQuantCache

model_id = "Qwen/Qwen3.5-27B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Apply 2-bit RotorQuant KV cache compression
cache = IsoQuantCache(bits=2)

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=2048,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Quality Notes

- **2-bit is aggressive quantization**, but RotorQuant's rotation-based approach preserves more quality than standard methods (perplexity 6.91 vs 7.07).
- Best suited for memory-constrained scenarios where fitting long-context inference on limited hardware is essential.
- For higher quality at moderate compression, consider 4-bit KV cache variants.
- Thinking-mode reasoning quality may be more sensitive to cache quantization, since the model relies on cached reasoning tokens for its final answer.

## References

- [RotorQuant](https://github.com/scrya-com/rotorquant) — rotation-based isotropic KV cache quantization
- [Qwen3.5-27B base model](https://huggingface.co/Qwen/Qwen3.5-27B)

## See Also

- [majentik/Qwen3.5-27B-TurboQuant-2bit](https://huggingface.co/majentik/Qwen3.5-27B-TurboQuant-2bit) — TurboQuant 2-bit KV cache variant
- [majentik/Qwen3.5-27B-TurboQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.5-27B-TurboQuant-MLX-2bit) — MLX 2-bit weights + TurboQuant KV cache
- [majentik/Qwen3.5-27B-RotorQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.5-27B-RotorQuant-MLX-2bit) — MLX 2-bit weights + RotorQuant KV cache