Commit 1b31b2f by majentik (verified) · 1 parent: ade072c

Update KV-cache card with accurate template and fork requirements

Files changed (1): README.md (+90 -56)
@@ -1,101 +1,135 @@
  ---
  base_model: deepseek-ai/DeepSeek-V3.2
- library_name: transformers
  tags:
  - rotorquant
  - kv-cache-quantization
  - deepseek
  - moe
  - quantized
- - text-generation
- - mixture-of-experts
- license: mit
  pipeline_tag: text-generation
  ---

  # DeepSeek-V3.2-RotorQuant

- **KV cache quantization for DeepSeek-V3.2 using RotorQuant compression.**

- This repository provides RotorQuant-compressed KV cache configurations for [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2), one of the most capable open-weight large language models available. RotorQuant achieves 5.3x faster prefill and 28% faster decode while maintaining near-lossless quality.

- ## Overview

- | Attribute | Value |
- |-----------|-------|
- | Base model | [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) |
- | Architecture | Mixture of Experts (MoE) |
- | Total parameters | ~671B |
- | Compression | RotorQuant KV cache |
- | Perplexity | 6.91 (vs 7.07 baseline) |
- | Prefill speedup | 5.3x |
- | Decode speedup | 1.28x (28% faster) |
- | License | MIT |
- | Task | Text generation |

  ## Quickstart
 
 
 
 
 
 
 
 
 
 
 
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
- from turboquant import IsoQuantCache
-
- model_id = "deepseek-ai/DeepSeek-V3.2"

- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
  model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     trust_remote_code=True,
      device_map="auto",
-     torch_dtype="auto",
  )

- # Apply RotorQuant KV cache compression
- cache = IsoQuantCache(
-     model,
-     residual_length=128,
- )

- inputs = tokenizer("Explain mixture of experts architectures.", return_tensors="pt").to(model.device)
  outputs = model.generate(
      **inputs,
      past_key_values=cache,
-     max_new_tokens=512,
  )
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  ## What is RotorQuant?

- [RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache quantization method that applies rotation-based transformations to compress the key-value cache during autoregressive generation. It achieves substantial speedups in both prefill and decode stages while actually improving perplexity in some configurations.

- ### Performance Comparison

- | Metric | Baseline (FP16 KV) | RotorQuant |
- |--------|-------------------|------------|
- | Perplexity | 7.07 | 6.91 |
- | Prefill speed | 1.0x | 5.3x |
- | Decode speed | 1.0x | 1.28x |
- | KV cache memory | 100% | Substantially reduced |

- ### Why KV Cache Compression Matters for DeepSeek-V3.2

- DeepSeek-V3.2 is a massive 671B-parameter MoE model. At long context lengths, the KV cache can consume hundreds of gigabytes of memory. RotorQuant makes it feasible to serve this model with substantially reduced hardware requirements and faster throughput, especially for long-context workloads.

- ## Memory Estimates

- | Configuration | Approximate Size |
- |---------------|-----------------|
- | FP16 weights | ~1.3 TB |
- | FP8 weights (base) | ~671 GB |
- | KV cache (FP16, 128K context) | Very large -- scales with sequence length |
- | KV cache (RotorQuant) | Substantial reduction vs FP16 cache |

- Note: RotorQuant compresses the KV cache only. Model weights remain in their original precision. For weight quantization, see the MLX variants below.

  ## See Also

- - [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) -- Base model
- - [RotorQuant GitHub](https://github.com/scrya-com/rotorquant) -- Source code and documentation
- - [majentik/DeepSeek-V3.2-TurboQuant](https://huggingface.co/majentik/DeepSeek-V3.2-TurboQuant) -- Alternative KV cache compression
- - [majentik/DeepSeek-V3.2-RotorQuant-MLX-2bit](https://huggingface.co/majentik/DeepSeek-V3.2-RotorQuant-MLX-2bit) -- MLX 2-bit weight quant + RotorQuant
- - [majentik/DeepSeek-V3.2-RotorQuant-MLX-1bit](https://huggingface.co/majentik/DeepSeek-V3.2-RotorQuant-MLX-1bit) -- MLX 1-bit weight quant + RotorQuant
 
  ---
+ license: mit
  base_model: deepseek-ai/DeepSeek-V3.2
  tags:
  - rotorquant
  - kv-cache-quantization
  - deepseek
  - moe
  - quantized
+ library_name: transformers
  pipeline_tag: text-generation
  ---

  # DeepSeek-V3.2-RotorQuant

+ **RotorQuant KV cache compression** for [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2).

+ This is a **documentation repository** that explains how to combine DeepSeek-V3.2's weights with RotorQuant inference-time KV cache compression. No weights are stored here; use the base model directly and apply RotorQuant via the Python package or llama.cpp fork.

+ ## What is this?

+ KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime, so the same base weights can be used with or without compression.
+
+ | Technique | Where it's applied | Savings |
+ |-----------|-------------------|---------|
+ | Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
+ | **RotorQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |
+
+ Both can be combined for maximum efficiency.
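To make the "attention memory" row concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with context length and bit width. The helper name `kv_cache_bytes` is hypothetical. The shapes are those of Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128), the model used in the RotorQuant benchmarks. The formula assumes a plain grouped-query attention cache and ignores quantization metadata such as scales. DeepSeek-V3.2's attention design differs from this layout, so the numbers do not transfer to it directly.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Approximate KV cache size in bytes for a plain GQA/MHA cache (assumption)."""
    # The factor 2 covers keys and values; one entry per layer, head, dim and token.
    elements = 2 * layers * kv_heads * head_dim * seq_len
    return elements * bits // 8


# Llama 3.1 8B shapes, used here only as an illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
SEQ_LEN = 128 * 1024  # 128K tokens

for bits in (16, 4, 2):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, bits) / 2**30
    print(f"{bits:>2}-bit KV cache at {SEQ_LEN} tokens: {gib:.1f} GiB")
```

Under these assumptions, a 16-bit cache at 128K tokens takes 16 GiB per sequence, and a 4-bit cache takes about a quarter of that before metadata overhead.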
 
 
 
  ## Quickstart

+ ### Option A: Python / transformers
+
+ Install the `rotorquant` package:
+
+ ```bash
+ pip install rotorquant
+ ```
+
+ Then use it with the base model:
+
  ```python
+ import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer
+ from rotorquant import IsoQuantCache

+ tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.2", trust_remote_code=True)
  model = AutoModelForCausalLM.from_pretrained(
+     "deepseek-ai/DeepSeek-V3.2",
+     torch_dtype=torch.bfloat16,
      device_map="auto",
+     trust_remote_code=True,
  )

+ # Apply RotorQuant to the KV cache
+ cache = IsoQuantCache(bits=4)  # or bits=2 for more aggressive compression

+ inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
  outputs = model.generate(
      **inputs,
+     max_new_tokens=128,
      past_key_values=cache,
+     use_cache=True,
  )
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
+ ```
+
+
+ ### Option B: llama.cpp / LM Studio / Ollama (with fork)
+
+ RotorQuant KV cache types (`iso3`) are **not** in upstream llama.cpp. They require:
+ - [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
+
+ Once built:
+
+ ```bash
+ llama-cli -m DeepSeek-V3.2.gguf \
+   --cache-type-k iso3 --cache-type-v iso3 \
+   -ngl 99 -fa \
+   -p "Hello"
  ```

+ For standard runtimes (LM Studio, Ollama, upstream llama.cpp), use conventional KV cache types (`q8_0`, `q4_0`). You lose the RotorQuant-specific benefits but keep GGUF weight quantization.
+
+ ## Model Specifications
+
+ | Property | Value |
+ |----------|-------|
+ | Base Model | [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) |
+ | Architecture | Sparse MoE |
+ | Parameters | 671B total (MoE) |
+ | Context Length | 128K |
+ | BF16 Size | ~1340 GB |
+ | Modalities | Text |
+ | License | mit |
+
  ## What is RotorQuant?

+ [RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors, a faster, more parameter-efficient alternative to Google's TurboQuant. It uses lightweight block-diagonal rotations (independent 2D/4D rotations per pair/quartet of elements), achieving O(d) complexity instead of O(d log d), and is fully parallelisable with no inter-element dependencies.
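The block-diagonal idea can be illustrated with a minimal pure-Python sketch. This is not RotorQuant's actual kernel or API: the function `rotate_pairs` and the angle values are hypothetical, and only the 2D (pair) case is shown. Each pair of elements is rotated independently, so every element is touched once (linear cost in the vector length `d`) and the pairs could be processed in parallel.

```python
import math


def rotate_pairs(x, angles):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by independent angles."""
    if len(x) != 2 * len(angles):
        raise ValueError("need exactly one angle per pair of elements")
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        # Standard 2D rotation applied to this pair only; no other element is read.
        out.extend((c * a - s * b, s * a + c * b))
    return out


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(t * t for t in v))


x = [1.0, 2.0, -3.0, 0.5]
angles = [0.3, -1.2]  # illustrative values, one per pair

y = rotate_pairs(x, angles)
x_back = rotate_pairs(y, [-t for t in angles])  # inverse: rotate by the negated angles

print(norm(x), norm(y))  # rotations preserve the vector norm
print(x_back)            # recovers x up to floating-point error
```

Because each block is an orthogonal rotation, the transform preserves norms and is undone exactly by rotating with the negated angles, which is the property a rotation-based quantizer relies on when it rotates, quantizes, and later rotates back.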
+
+ **Benchmarks** (from the RotorQuant repository, Llama 3.1 8B on RTX 5090; results vary by model and hardware):

+ - Prefill: 3,822 tok/s (vs TurboQuant 722 tok/s)
+ - Decode: 119 tok/s (vs TurboQuant 93 tok/s)
+ - Perplexity: 6.91 (vs TurboQuant 7.07)
+ - Parameters: 4 per rotor (vs TurboQuant 16,384)

+ > Benchmarks are from the RotorQuant repository using Llama 3.1 8B. Performance on DeepSeek-V3.2 will differ. Please open a discussion if you have independent results.
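As a quick sanity check on the benchmark figures above, the relative speedups can be computed directly from the quoted throughput numbers. The variable and function names are illustrative only. The resulting ratios line up with the "5.3x" prefill and "28% faster" decode figures in the removed lines of this diff, which suggests those figures were measured against TurboQuant rather than an FP16 baseline.

```python
# Throughput figures quoted above: (RotorQuant, TurboQuant), in tokens per second.
benchmarks = {
    "prefill": (3822, 722),
    "decode": (119, 93),
}


def speedup(new, old):
    """Ratio of new throughput to old throughput."""
    return new / old


for name, (rotor, turbo) in benchmarks.items():
    print(f"{name}: {speedup(rotor, turbo):.2f}x")

# Perplexity: lower is better, so a negative change is an improvement.
ppl_rotor, ppl_turbo = 6.91, 7.07
ppl_change_pct = (ppl_rotor - ppl_turbo) / ppl_turbo * 100
print(f"perplexity change: {ppl_change_pct:.1f}%")
```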
 
 
 
 
 
+ ## Current Ecosystem Support

+ | Runtime | RotorQuant Support | Notes |
+ |---------|----------------------|-------|
+ | Python transformers + `rotorquant` | ✅ Full | Drop-in cache class |
+ | llama.cpp upstream | ❌ Not merged | Use fork below |
+ | llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
+ | LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
+ | Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
+ | vLLM | ❌ Not supported | - |
+ | koboldcpp | ❌ Not supported | - |

+ ## Pre-quantized weight variants

+ If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:

+ - [MLX (Apple Silicon)](https://huggingface.co/majentik?search=DeepSeek-V3.2+MLX)
+ - [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=DeepSeek-V3.2+GGUF)

  ## See Also

+ - [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
+ - [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
+ - [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
+ - [Base model: deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2)