majentik committed on
Commit aa9c604 · verified · 1 Parent(s): c94b986

Update KV-cache card with accurate template and fork requirements

Files changed (1)
  1. README.md +98 -48
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
  base_model: openai/gpt-oss-20b
- library_name: transformers
  tags:
  - rotorquant
  - kv-cache-quantization
@@ -8,80 +8,130 @@ tags:
  - openai
  - moe
  - quantized
- license: apache-2.0
  pipeline_tag: text-generation
  ---

- # GPT-OSS-20B - RotorQuant KV Cache

- **RotorQuant KV-cache quantization** applied to [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b). RotorQuant uses block-diagonal rotations (Clifford algebra) to compress the KV cache, delivering 5.3x faster prefill and 28% faster decode compared to TurboQuant with equivalent memory savings.

- This repository provides the RotorQuant KV-cache configuration for GPT-OSS-20B, OpenAI's first open-weights release in years (Apache 2.0). The model weights remain at their original precision; only the key-value cache is quantized at runtime. GPT-OSS-20B is a Mixture-of-Experts model that rivals o3-mini on reasoning benchmarks and is ideal for local and edge deployment.

- ## Model Specifications

- | Property | Value |
- |---|---|
- | **Base Model** | [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) |
- | **Parameters** | 20 billion (MoE) |
- | **Architecture** | Mixture-of-Experts (MoE) Transformer |
- | **License** | Apache 2.0 (commercial use OK) |
- | **Quantization** | RotorQuant KV-cache only (weights unchanged) |
- | **Downloads** | 6M+ on HuggingFace |

  ## Quickstart

  ```python
- from rotorquant import IsoQuantCache
  from transformers import AutoModelForCausalLM, AutoTokenizer

- model_id = "openai/gpt-oss-20b"

- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

- # Apply RotorQuant KV-cache quantization
- cache = IsoQuantCache(model)

- inputs = tokenizer("Explain the theory of relativity.", return_tensors="pt").to(model.device)
- outputs = model.generate(**inputs, past_key_values=cache)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

  ## What is RotorQuant?

- [RotorQuant](https://github.com/scrya-com/rotorquant) applies block-diagonal rotations (Clifford algebra) for KV cache compression. It provides equivalent memory savings to TurboQuant while dramatically improving throughput.

- Key advantages over TurboQuant:
- - **5.3x faster prefill**
- - **28% faster decode**
- - Equivalent memory savings
- - Slightly better perplexity

- ## KV-Cache Quantization Comparison

- | Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
- |---|---|---|---|---|
- | **TurboQuant** | 1x (baseline) | 1x (baseline) | High | [arXiv:2504.19874](https://arxiv.org/abs/2504.19874) |
- | **RotorQuant** | **5.3x faster** | **28% faster** | High | [GitHub](https://github.com/scrya-com/rotorquant) |

- ## Memory Estimates (GPT-OSS-20B)

- | Precision | Approximate Size |
- |---|---|
- | BF16 (original) | ~40 GB |
- | 8-bit quantized | ~20 GB |
- | 4-bit quantized | ~12 GB |
- | 2-bit quantized | ~6 GB |

- Note: These estimates are for weight quantization. This repository applies KV-cache quantization only, so model weight memory remains at the precision you load the model in. The KV-cache memory savings are realized during generation.

  ## See Also

- - [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) -- Base model
- - [majentik/gpt-oss-20b-TurboQuant](https://huggingface.co/majentik/gpt-oss-20b-TurboQuant) -- TurboQuant KV-cache variant
- - [majentik/gpt-oss-20b-RotorQuant-MLX-8bit](https://huggingface.co/majentik/gpt-oss-20b-RotorQuant-MLX-8bit) -- MLX 8-bit variant
- - [majentik/gpt-oss-20b-RotorQuant-MLX-4bit](https://huggingface.co/majentik/gpt-oss-20b-RotorQuant-MLX-4bit) -- MLX 4-bit variant
- - [majentik/gpt-oss-20b-RotorQuant-MLX-2bit](https://huggingface.co/majentik/gpt-oss-20b-RotorQuant-MLX-2bit) -- MLX 2-bit variant
- - [majentik/gpt-oss-20b-RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/gpt-oss-20b-RotorQuant-GGUF-Q4_K_M) -- GGUF Q4_K_M variant
  - [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)

  ---
+ license: apache-2.0
  base_model: openai/gpt-oss-20b
  tags:
  - rotorquant
  - kv-cache-quantization
  - openai
  - moe
  - quantized
+ library_name: transformers
  pipeline_tag: text-generation
  ---

+ # gpt-oss-20b-RotorQuant

+ **RotorQuant KV cache compression** for [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b).

+ This is a **documentation repository** that explains how to combine gpt-oss-20b's weights with RotorQuant inference-time KV cache compression. No weights are stored here: use the base model directly and apply RotorQuant via the Python package or the llama.cpp fork.

+ ## What is this?

+ KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime, so the same base weights can be used with or without compression.
+
+ | Technique | Where it's applied | Savings |
+ |-----------|--------------------|---------|
+ | Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
+ | **RotorQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |
+
+ Both can be combined for maximum efficiency.
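Why the attention cache dominates at long context can be seen with a quick back-of-envelope calculation. The sketch below uses illustrative layer/head counts chosen for the example, not gpt-oss-20b's actual hyperparameters:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits: int) -> int:
    """Bytes held by the attention cache: keys + values (factor 2),
    one (kv_heads x head_dim) slice per layer per cached token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits // 8

# Illustrative transformer config -- NOT gpt-oss-20b's real numbers.
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 64
CTX = 128_000  # tokens of context

for bits, label in [(16, "BF16"), (8, "8-bit"), (4, "4-bit")]:
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, bits) / 2**30
    print(f"{label:>5} KV cache at {CTX:,} tokens: {gib:.2f} GiB")
```

At full context the cache alone runs into gigabytes, which is why runtime cache compression pays off even when the weights are already quantized.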
 
  ## Quickstart

+ ### Option A: Python / transformers
+
+ Install the `rotorquant` package:
+
+ ```bash
+ pip install rotorquant
+ ```
+
+ Then use it with the base model:
+
  ```python
+ import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer
+ from rotorquant import IsoQuantCache
+
+ tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     "openai/gpt-oss-20b",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Apply RotorQuant to the KV cache
+ cache = IsoQuantCache(bits=4)  # or bits=2 for more aggressive compression
+
+ inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=128,
+     past_key_values=cache,
+     use_cache=True,
+ )
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
+ ```
+
 
+ ### Option B: llama.cpp / LM Studio / Ollama (with fork)

+ RotorQuant KV cache types (`iso3`) are **not** in upstream llama.cpp. They require:
+ - the [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)

+ Once built:

+ ```bash
+ llama-cli -m gpt-oss-20b.gguf \
+   --cache-type-k iso3 --cache-type-v iso3 \
+   -ngl 99 -fa \
+   -p "Hello"
  ```

+ For standard runtimes (LM Studio, Ollama, upstream llama.cpp), use conventional KV cache types (`q8_0`, `q4_0`). You lose the RotorQuant-specific benefits but keep GGUF weight quantization.
+
+ ## Model Specifications
+
+ | Property | Value |
+ |----------|-------|
+ | Base Model | [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) |
+ | Architecture | Sparse MoE |
+ | Parameters | 20B total (MoE) |
+ | Context Length | 128K |
+ | BF16 Size | ~40 GB |
+ | Modalities | Text |
+ | License | apache-2.0 |
+
  ## What is RotorQuant?

+ [RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors: a faster, more parameter-efficient alternative to Google's TurboQuant. It uses lightweight block-diagonal rotations (independent 2D/4D rotations per pair/quartet of channels), achieving O(d) complexity instead of O(d log d) and parallelising fully, with no inter-element dependencies.
+
+ **Benchmarks** (from the RotorQuant repository, Llama 3.1 8B on an RTX 5090; results vary by model and hardware):
+
+ - Prefill: 3,822 tok/s (vs. TurboQuant 722 tok/s)
+ - Decode: 119 tok/s (vs. TurboQuant 93 tok/s)
+ - Perplexity: 6.91 (vs. TurboQuant 7.07)
+ - Parameters: 4 per rotor (vs. TurboQuant 16,384)

+ > Performance on gpt-oss-20b will differ from the Llama 3.1 8B numbers above. Please open a discussion if you have independent results.
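To make "independent 2D rotations per pair" concrete, here is a toy rotate-quantize-unrotate round trip using plain Givens rotations in NumPy. This illustrates only the block-diagonal idea; RotorQuant's actual Cl(3,0) rotor parameterisation and calibration live in the repository linked above, and every name below is made up for the sketch:

```python
import numpy as np

def rotate_pairs(x, thetas):
    """Independent 2D rotation per consecutive channel pair: a
    block-diagonal orthogonal transform, O(d) work, no
    inter-element dependencies."""
    y = x.copy()
    c, s = np.cos(thetas), np.sin(thetas)
    y[..., 0::2] = c * x[..., 0::2] - s * x[..., 1::2]
    y[..., 1::2] = s * x[..., 0::2] + c * x[..., 1::2]
    return y

def quant_dequant(x, bits=4):
    """Symmetric round-trip quantization with a per-vector scale."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    q = np.round(x / scale).clip(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale

rng = np.random.default_rng(0)
k = rng.normal(size=64)                       # one cached key vector
thetas = rng.uniform(0, 2 * np.pi, size=32)   # random here; calibrated in practice

# Compress: rotate then quantize. Decompress: dequantize then rotate back.
k_hat = rotate_pairs(quant_dequant(rotate_pairs(k, thetas), bits=4), -thetas)
print("max reconstruction error:", np.abs(k - k_hat).max())
```

Because each 2x2 rotation is orthogonal, the inverse pass with `-thetas` recovers the vector exactly when no quantization is applied; the rotation's job is to reshape the distribution so the quantizer loses less.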
 
+ ## Current Ecosystem Support
+
+ | Runtime | RotorQuant Support | Notes |
+ |---------|--------------------|-------|
+ | Python transformers + `rotorquant` | ✅ Full | Drop-in cache class |
+ | llama.cpp upstream | ❌ Not merged | Use the fork above |
+ | llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
+ | LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
+ | Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
+ | vLLM | ❌ Not supported | - |
+ | koboldcpp | ❌ Not supported | - |
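The Ollama fallback from the table is configured through environment variables; this gives llama.cpp's standard KV cache quantization, not RotorQuant, and quantized caches require flash attention to be enabled:

```shell
# Fallback: Ollama's built-in KV cache quantization (NOT RotorQuant).
export OLLAMA_FLASH_ATTENTION=1   # quantized KV caches require flash attention
export OLLAMA_KV_CACHE_TYPE=q8_0  # or q4_0 for more savings at lower fidelity
ollama serve                      # restart the server so the settings take effect
```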
 
+ ## Pre-quantized weight variants

+ If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:

+ - [MLX (Apple Silicon)](https://huggingface.co/majentik?search=gpt-oss-20b+MLX)
+ - [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=gpt-oss-20b+GGUF)
  ## See Also

  - [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
+ - [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
+ - [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
+ - [Base model: openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
+ - [gpt-oss-20b announcement](https://openai.com/blog/gpt-oss)