majentik committed on
Commit
b329651
verified
1 Parent(s): 4d7089a

Update KV-cache card with accurate template and fork requirements

Files changed (1)
  1. README.md +100 -50
README.md CHANGED
@@ -1,87 +1,137 @@
1
  ---
 
2
  base_model: google/gemma-4-E2B
3
- library_name: transformers
4
  tags:
5
  - rotorquant
6
  - kv-cache-quantization
7
  - gemma
8
  - gemma4
9
- - multimodal
10
  - quantized
11
- license: apache-2.0
12
  pipeline_tag: image-text-to-text
13
  ---
14
 
15
- # Gemma 4 E2B - RotorQuant KV Cache
16
 
17
- **RotorQuant KV-cache quantization** applied to [google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B), enabling dramatically reduced memory usage during inference without modifying model weights. RotorQuant delivers 5.3x faster prefill and 28% faster decode compared to TurboQuant.
18
 
19
- This repository provides the RotorQuant KV-cache configuration for Gemma 4 E2B. The model weights remain at their original precision; only the key-value cache is quantized at runtime.
20
 
21
- ## Model Specifications
22
 
23
- | Property | Value |
24
- |---|---|
25
- | **Base Model** | [google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B) |
26
- | **Parameters** | ~2 billion |
27
- | **Architecture** | Dense transformer |
28
- | **Modality** | Multimodal: image + text input, text output |
29
- | **License** | Apache 2.0 |
30
- | **Quantization** | RotorQuant KV-cache only (weights unchanged) |
31
 
32
  ## Quickstart
33
 
 
 
 
 
 
 
 
 
 
 
34
  ```python
35
- from rotorquant import RotorQuantCache
36
- from transformers import AutoModelForImageTextToText, AutoProcessor
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
37
 
38
- model_id = "google/gemma-4-E2B"
39
 
40
- processor = AutoProcessor.from_pretrained(model_id)
41
- model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
42
 
43
- # Apply RotorQuant KV-cache quantization
44
- cache = RotorQuantCache(model)
45
 
46
- inputs = processor("Once upon a time", return_tensors="pt").to(model.device)
47
- outputs = model.generate(**inputs, past_key_values=cache)
48
- print(processor.decode(outputs[0], skip_special_tokens=True))
 
 
 
 
49
  ```
50
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51
  ## What is RotorQuant?
52
 
53
- [RotorQuant](https://github.com/scrya-com/rotorquant) is a high-performance KV-cache quantization method that achieves significantly better throughput than TurboQuant. Instead of quantizing the model weights, RotorQuant targets the memory bottleneck of the KV cache, which grows linearly with sequence length and batch size.
 
 
 
 
 
 
 
54
 
55
- Key advantages over TurboQuant:
56
- - **5.3x faster prefill**
57
- - **28% faster decode**
58
- - **No weight modification** -- model weights stay at original precision
59
- - **Reduced inference memory** -- KV cache is compressed significantly
60
- - **Longer context windows** -- fit more tokens in the same GPU memory
61
 
62
- ## KV-Cache Quantization Comparison
63
 
64
- | Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
65
- |---|---|---|---|---|
66
- | **TurboQuant** | Baseline | Baseline | High | [arXiv: 2504.19874](https://arxiv.org/abs/2504.19874) |
67
- | **RotorQuant** | 5.3x faster | 28% faster | High | [GitHub](https://github.com/scrya-com/rotorquant) |
 
 
 
 
 
68
 
69
- ## Memory Estimates (Gemma 4 E2B)
70
 
71
- | Precision | Approximate Size |
72
- |---|---|
73
- | FP16 (original) | ~4 GB |
74
- | 8-bit quantized | ~2 GB |
75
- | 4-bit quantized | ~1.2 GB |
76
- | 2-bit quantized | ~0.6 GB |
77
 
78
- Note: These estimates are for weight quantization. This repository applies KV-cache quantization only, so model weight memory remains at the precision you load the model in. The KV-cache memory savings are realized during generation.
 
79
 
80
  ## See Also
81
 
82
- - [google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B) -- Base model
83
- - [majentik/gemma-4-E2B-TurboQuant](https://huggingface.co/majentik/gemma-4-E2B-TurboQuant) -- TurboQuant KV-cache variant
84
- - [majentik/gemma-4-E2B-RotorQuant-MLX-8bit](https://huggingface.co/majentik/gemma-4-E2B-RotorQuant-MLX-8bit) -- MLX 8-bit weight-quantized variant
85
- - [majentik/gemma-4-E2B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma-4-E2B-RotorQuant-MLX-4bit) -- MLX 4-bit weight-quantized variant
86
- - [majentik/gemma-4-E2B-RotorQuant-MLX-2bit](https://huggingface.co/majentik/gemma-4-E2B-RotorQuant-MLX-2bit) -- MLX 2-bit weight-quantized variant
87
  - [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
 
 
 
 
 
1
  ---
2
+ license: apache-2.0
3
  base_model: google/gemma-4-E2B
 
4
  tags:
5
  - rotorquant
6
  - kv-cache-quantization
7
  - gemma
8
  - gemma4
9
+ - edge
10
  - quantized
11
+ library_name: transformers
12
  pipeline_tag: image-text-to-text
13
  ---
14
 
15
+ # gemma-4-E2B-RotorQuant
16
 
17
+ **RotorQuant KV cache compression** for [google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B).
18
 
19
+ This is a **documentation repository** that explains how to combine gemma-4-E2B's weights with RotorQuant inference-time KV cache compression. No weights are stored here; use the base model directly and apply RotorQuant via the Python package or the llama.cpp fork.
20
 
21
+ ## What is this?
22
 
23
+ KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime, so the same base weights can be used with or without compression.
24
+
25
+ | Technique | Where it's applied | Savings |
+ |-----------|-------------------|---------|
+ | Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
+ | **RotorQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |
29
+
30
+ Both can be combined for maximum efficiency.
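
To make the "attention memory" row concrete, here is a minimal sketch of the standard KV cache size estimate: keys and values are stored separately for every layer, each holding `num_kv_heads * head_dim` values per token. The layer count, head count, and head dimension below are hypothetical placeholders, not gemma-4-E2B's actual configuration, and the estimate ignores any scale or metadata overhead a real quantizer adds.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits, batch_size=1):
    """Estimate KV cache size in bytes for a given element width in bits."""
    # Keys and values are stored separately, hence the factor of 2.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (4_096, 131_072):
    for bits in (16, 4, 2):
        gib = kv_cache_bytes(seq_len=seq_len, bits=bits, **shape) / 2**30
        print(f"seq_len={seq_len:>7}  bits={bits:>2}  ~{gib:.2f} GiB")
```

The cache grows linearly with sequence length and batch size, so the saving from a lower bit width matters most at long context.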
31
 
32
  ## Quickstart
33
 
34
+ ### Option A: Python / transformers
35
+
36
+ Install the `rotorquant` package:
37
+
38
+ ```bash
39
+ pip install rotorquant
40
+ ```
41
+
42
+ Then use it with the base model:
43
+
44
  ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from rotorquant import IsoQuantCache
+
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-E2B", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     "google/gemma-4-E2B",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Apply RotorQuant to the KV cache
+ cache = IsoQuantCache(bits=4)  # or bits=2 for more aggressive compression
+
+ inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=128,
+     past_key_values=cache,
+     use_cache=True,
+ )
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
+ ```
69
 
 
70
 
71
+ ### Option B: llama.cpp / LM Studio / Ollama (with fork)
 
72
 
73
+ RotorQuant KV cache types (`iso3`) are **not** in upstream llama.cpp. They require:
74
+ - [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
75
 
76
+ Once built:
77
+
78
+ ```bash
+ llama-cli -m gemma-4-E2B.gguf \
+   --cache-type-k iso3 --cache-type-v iso3 \
+   -ngl 99 -fa \
+   -p "Hello"
  ```
84
 
85
+ For standard runtimes (LM Studio, Ollama, upstream llama.cpp), use conventional KV cache types (`q8_0`, `q4_0`). You lose the RotorQuant-specific benefits but keep GGUF weight quantization.
86
+
87
+ ## Model Specifications
88
+
89
+ | Property | Value |
+ |----------|-------|
+ | Base Model | [google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B) |
+ | Architecture | Dense transformer (Edge optimised) |
+ | Parameters | ~2B |
+ | Context Length | 128K |
+ | BF16 Size | ~4 GB |
+ | Modalities | Text + Image |
+ | License | apache-2.0 |
98
+
99
  ## What is RotorQuant?
100
 
101
+ [RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors, positioned as a faster, more parameter-efficient alternative to Google's TurboQuant. It uses lightweight block-diagonal rotations (independent 2D/4D rotations per pair/quartet), achieving O(d) complexity instead of O(d log d), and is fully parallelisable with no inter-element dependencies.
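
As a rough illustration of why block-diagonal rotations cost O(d), the sketch below rotates a vector two elements at a time, each pair with its own angle. It is not the RotorQuant implementation (which works with Cl(3,0) rotors); the function name and the angles are hypothetical, and only the Python standard library is used.

```python
import math


def rotate_pairs(vec, angles):
    """Apply an independent 2D rotation to each consecutive pair of elements.

    Equivalent to multiplying by a block-diagonal matrix of 2x2 rotation
    blocks, but done in a single pass over the elements: O(d) work.
    """
    if len(vec) != 2 * len(angles):
        raise ValueError("need one angle per pair of elements")
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        # No pair depends on any other, so the loop is trivially parallelisable.
        out.extend((c * x - s * y, s * x + c * y))
    return out


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


vec = [1.0, 0.0, 3.0, 4.0]
angles = [math.pi / 2, math.pi]  # hypothetical per-pair angles
rotated = rotate_pairs(vec, angles)

# Rotations preserve the vector's norm, and negating the angles undoes them.
restored = rotate_pairs(rotated, [-a for a in angles])
print(norm(vec), norm(rotated))
print([round(x, 6) for x in restored])
```

Because each block is orthogonal, the rotation can be inverted exactly after dequantization; a quantizer would sit between the forward and inverse rotation.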
102
+
103
+ **Benchmarks** (from the RotorQuant repository, Llama 3.1 8B on RTX 5090; results vary by model and hardware):
104
+
105
+ - Prefill: 3,822 tok/s (vs TurboQuant 722 tok/s)
106
+ - Decode: 119 tok/s (vs TurboQuant 93 tok/s)
107
+ - Perplexity: 6.91 (vs TurboQuant 7.07)
108
+ - Parameters: 4 per rotor (vs TurboQuant 16,384)
109
 
110
+ > Benchmarks are from the RotorQuant repository using Llama 3.1 8B. Performance on gemma-4-E2B will differ. Please open a discussion if you have independent results.
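
The relative differences implied by these figures are simple to derive. The short calculation below uses only the numbers quoted above (Llama 3.1 8B on an RTX 5090) and says nothing about gemma-4-E2B.

```python
# Throughput and perplexity figures quoted from the RotorQuant repository.
rotorquant = {"prefill_tok_s": 3822, "decode_tok_s": 119, "perplexity": 6.91}
turboquant = {"prefill_tok_s": 722, "decode_tok_s": 93, "perplexity": 7.07}

prefill_speedup = rotorquant["prefill_tok_s"] / turboquant["prefill_tok_s"]
decode_gain = rotorquant["decode_tok_s"] / turboquant["decode_tok_s"] - 1
ppl_delta = rotorquant["perplexity"] - turboquant["perplexity"]

print(f"Prefill speedup: {prefill_speedup:.1f}x")  # 5.3x
print(f"Decode gain:     {decode_gain:.0%}")       # 28%
print(f"Perplexity diff: {ppl_delta:+.2f}")        # -0.16
```

These ratios match the "5.3x faster prefill" and "28% faster decode" claims in the previous version of this card.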
 
 
 
 
 
111
 
112
+ ## Current Ecosystem Support
113
 
114
+ | Runtime | RotorQuant Support | Notes |
+ |---------|----------------------|-------|
+ | Python transformers + `rotorquant` | ✅ Full | Drop-in cache class |
+ | llama.cpp upstream | ❌ Not merged | Use fork below |
+ | llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
+ | LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
+ | Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
+ | vLLM | ❌ Not supported | - |
+ | koboldcpp | ❌ Not supported | - |
123
 
124
+ ## Pre-quantized weight variants
125
 
126
+ If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:
 
 
 
 
 
127
 
128
+ - [MLX (Apple Silicon)](https://huggingface.co/majentik?search=gemma-4-E2B+MLX)
129
+ - [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=gemma-4-E2B+GGUF)
130
 
131
  ## See Also
132
 
 
 
 
 
 
133
  - [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
134
+ - [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
135
+ - [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
136
+ - [Base model: google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B)
137
+ - [gemma-4-E2B announcement](https://blog.google/technology/developers/gemma-4/)