---
base_model: google/gemma-4-31B
library_name: transformers
tags:
- rotorquant
- kv-cache-quantization
- gemma
- gemma4
- multimodal
- quantized
license: apache-2.0
pipeline_tag: image-text-to-text
---

# Gemma 4 31B - RotorQuant KV Cache

**RotorQuant KV-cache quantization** applied to [google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B), delivering 5.3x faster prefill and 28% faster decode compared to TurboQuant while maintaining equivalent memory savings.

This repository provides the RotorQuant KV-cache configuration for Gemma 4 31B. The model weights remain at their original precision; only the key-value cache is quantized at runtime.

## Model Specifications

| Property | Value |
|---|---|
| **Base Model** | [google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B) |
| **Parameters** | 31 billion |
| **Architecture** | Dense transformer (not MoE) |
| **Modality** | Multimodal: image + text input, text output |
| **License** | Apache 2.0 |
| **Quantization** | RotorQuant KV-cache only (weights unchanged) |

## Quickstart

```python
from PIL import Image
from rotorquant import RotorQuantCache
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-4-31B"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# Apply RotorQuant KV-cache quantization
cache = RotorQuantCache(model)

image = Image.open("example.jpg")  # any local image file
inputs = processor(text="Describe this image.", images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, past_key_values=cache)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

## What is RotorQuant?

[RotorQuant](https://github.com/scrya-com/rotorquant) is a high-performance KV-cache quantization method: it compresses the key-value cache used during autoregressive generation without modifying model weights, and achieves significantly higher throughput than prior cache-compression approaches.

Key benefits:
- **5.3x faster prefill** compared to TurboQuant
- **28% faster decode** compared to TurboQuant
- **No weight modification** -- model weights stay at original precision
- **Reduced inference memory** -- the KV cache is stored in compressed form
- **Longer context windows** -- fit more tokens in the same GPU memory

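RotorQuant's actual transform is documented in its repository; as background, the basic idea behind KV-cache quantization can be sketched with plain symmetric int8 rounding. This is an illustrative NumPy toy (not RotorQuant's algorithm), using one scale per (head, token) vector:

```python
import numpy as np

def quantize_kv(x: np.ndarray):
    """Symmetric int8 quantization with one scale per (head, token) vector."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)          # guard against all-zero vectors
    q = np.round(x / scale).astype(np.int8)  # values land in [-127, 127]
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Toy key cache: (num_heads, seq_len, head_dim)
rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 128, 64)).astype(np.float32)

q, scale = quantize_kv(keys)
recon = dequantize_kv(q, scale)

assert q.nbytes * 4 == keys.nbytes  # int8 cache is 4x smaller than fp32
err = np.abs(keys - recon).max()    # rounding error is bounded by scale / 2
```

Real methods like RotorQuant add transforms and kernels on top of this basic quantize/dequantize loop to push below 8 bits while keeping generation quality and speed.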
## KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| **TurboQuant** | 1x (baseline) | 1x (baseline) | High | [arXiv:2504.19874](https://arxiv.org/abs/2504.19874) |
| **RotorQuant** | **5.3x faster** | **28% faster** | High | [GitHub](https://github.com/scrya-com/rotorquant) |

## Memory Estimates (Gemma 4 31B)

| Precision | Approximate Size |
|---|---|
| FP16 (original) | ~62 GB |
| 8-bit quantized | ~31 GB |
| 4-bit quantized | ~17 GB |
| 2-bit quantized | ~9 GB |

Note: These estimates are for weight quantization. This repository applies KV-cache quantization only, so model weight memory remains at the precision you load the model in. The KV-cache memory savings are realized during generation.

## See Also

- [google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B) -- base model
- [majentik/gemma-4-31B-TurboQuant](https://huggingface.co/majentik/gemma-4-31B-TurboQuant) -- TurboQuant KV-cache variant
- [majentik/gemma-4-31B-RotorQuant-MLX-8bit](https://huggingface.co/majentik/gemma-4-31B-RotorQuant-MLX-8bit) -- MLX 8-bit weight-quantized variant
- [majentik/gemma-4-31B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma-4-31B-RotorQuant-MLX-4bit) -- MLX 4-bit weight-quantized variant
- [majentik/gemma-4-31B-RotorQuant-MLX-2bit](https://huggingface.co/majentik/gemma-4-31B-RotorQuant-MLX-2bit) -- MLX 2-bit weight-quantized variant
- [RotorQuant on GitHub](https://github.com/scrya-com/rotorquant)