majentik committed 7ae9b34 (verified, parent 9e39f84): Add model card

Files changed (1): README.md (+87, −0)
---
base_model: google/gemma-4-31B-it
library_name: transformers
tags:
- rotorquant
- kv-cache-quantization
- gemma
- gemma4
- multimodal
- quantized
license: apache-2.0
pipeline_tag: image-text-to-text
---

# Gemma 4 31B-it - RotorQuant KV Cache

**RotorQuant KV-cache quantization** applied to [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it), dramatically reducing inference memory without modifying the model weights. RotorQuant delivers 5.3x faster prefill and 28% faster decode than TurboQuant.

This repository provides the RotorQuant KV-cache configuration for Gemma 4 31B-it. The model weights remain at their original precision; only the key-value cache is quantized at runtime.

## Model Specifications

| Property | Value |
|---|---|
| **Base Model** | [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
| **Parameters** | 31 billion |
| **Architecture** | Dense transformer |
| **Modality** | Multimodal: image + text input, text output |
| **License** | Apache 2.0 |
| **Quantization** | RotorQuant KV-cache only (weights unchanged) |

## Quickstart

```python
from rotorquant import RotorQuantCache
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# Apply RotorQuant KV-cache quantization (weights stay at original precision)
cache = RotorQuantCache(model)

image = Image.open("example.jpg")  # any input image
inputs = processor(text="Describe this image.", images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, past_key_values=cache)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

## What is RotorQuant?

[RotorQuant](https://github.com/scrya-com/rotorquant) is a high-performance KV-cache quantization method that achieves significantly better throughput than TurboQuant. Instead of quantizing the model weights, RotorQuant targets the memory bottleneck of the KV cache, which grows linearly with sequence length and batch size.

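To make that bottleneck concrete, here is a back-of-the-envelope sketch of per-sequence KV-cache size. The layer count, KV-head count, and head dimension below are illustrative placeholders, not the published Gemma 4 31B configuration:

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem, batch_size=1):
    """KV-cache footprint: two tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative dimensions (NOT the real Gemma 4 31B config)
layers, kv_heads, head_dim = 48, 8, 128

fp16 = kv_cache_bytes(8192, layers, kv_heads, head_dim, bytes_per_elem=2)
int4 = kv_cache_bytes(8192, layers, kv_heads, head_dim, bytes_per_elem=0.5)

print(f"fp16 KV cache @ 8k tokens:  {fp16 / 2**30:.2f} GiB")  # 1.50 GiB
print(f"4-bit KV cache @ 8k tokens: {int4 / 2**30:.2f} GiB")  # 0.38 GiB
# At a fixed memory budget, a 4-bit cache fits ~4x more tokens than fp16.
```

The same arithmetic explains the "longer context windows" advantage: cache size scales linearly with `seq_len`, so cutting bytes per element by 4x raises the token count that fits in a given budget by the same factor.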
Key advantages over TurboQuant:
- **5.3x faster prefill**
- **28% faster decode**
- **No weight modification** -- model weights stay at original precision
- **Reduced inference memory** -- the KV cache is compressed significantly
- **Longer context windows** -- fit more tokens in the same GPU memory

## KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| **TurboQuant** | Baseline | Baseline | High | [arXiv:2504.19874](https://arxiv.org/abs/2504.19874) |
| **RotorQuant** | 5.3x faster | 28% faster | High | [GitHub](https://github.com/scrya-com/rotorquant) |

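To read the table's relative numbers as wall-clock time, interpret "Nx faster" as a throughput multiplier: a prefill that takes T seconds under TurboQuant takes roughly T/5.3 under RotorQuant, and "28% faster decode" (taken here as 1.28x throughput) cuts per-token latency to about 0.78x. A tiny projection helper, with made-up baseline timings for illustration:

```python
def rotorquant_time(turboquant_prefill_s, turboquant_decode_s):
    """Project RotorQuant timings from TurboQuant baselines, using the
    headline ratios above (5.3x prefill, 1.28x decode throughput)."""
    return turboquant_prefill_s / 5.3, turboquant_decode_s / 1.28

# Hypothetical baselines: 10.6 s prefill, 6.4 s decode under TurboQuant
prefill_s, decode_s = rotorquant_time(10.6, 6.4)
print(prefill_s, decode_s)  # 2.0 5.0
```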
## Memory Estimates (Gemma 4 31B-it)

| Precision | Approximate Size |
|---|---|
| FP16 (original) | ~62 GB |
| 8-bit quantized | ~31 GB |
| 4-bit quantized | ~17 GB |
| 2-bit quantized | ~9 GB |

Note: these estimates are for weight quantization. This repository applies KV-cache quantization only, so weight memory stays at whatever precision you load the model in; the KV-cache savings are realized during generation.

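As a sanity check on the table above, weight memory is roughly parameter count times bytes per parameter. The sketch below ignores quantization metadata (per-group scales, zero points), which is why the 4-bit and 2-bit table rows come in somewhat above the naive figures:

```python
PARAMS = 31e9  # 31 billion parameters

def weight_gb(bits_per_param):
    """Naive weight footprint in decimal GB, excluding quantization metadata."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{name}: ~{weight_gb(bits):.1f} GB")
# FP16 ~62.0, 8-bit ~31.0, 4-bit ~15.5, 2-bit ~7.8 GB; the table's
# ~17 GB / ~9 GB rows include scale/zero-point overhead on top of these.
```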
## See Also

- [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) -- base model
- [majentik/gemma-4-31B-it-TurboQuant](https://huggingface.co/majentik/gemma-4-31B-it-TurboQuant) -- TurboQuant KV-cache variant
- [majentik/gemma-4-31B-it-RotorQuant-MLX-8bit](https://huggingface.co/majentik/gemma-4-31B-it-RotorQuant-MLX-8bit) -- MLX 8-bit weight-quantized variant
- [majentik/gemma-4-31B-it-RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma-4-31B-it-RotorQuant-MLX-4bit) -- MLX 4-bit weight-quantized variant
- [majentik/gemma-4-31B-it-RotorQuant-MLX-2bit](https://huggingface.co/majentik/gemma-4-31B-it-RotorQuant-MLX-2bit) -- MLX 2-bit weight-quantized variant
- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)