majentik committed (verified) · cbbc5a4 · parent: c7073c3

Add model card

Browse files
Files changed (1) hide show
  1. README.md +116 -0
README.md ADDED
@@ -0,0 +1,116 @@
---
library_name: transformers
base_model: aisingapore/MERaLiON-3-10B
tags:
- rotorquant
- kv-cache-quantization
- efficient-inference
- meralion
- speech-to-text
- multimodal
- audio
- gemma-2
license: other
---

# MERaLiON-3-10B-RotorQuant — RotorQuant KV Cache Compression

A KV-cache-quantized variant of [aisingapore/MERaLiON-3-10B](https://huggingface.co/aisingapore/MERaLiON-3-10B) using [RotorQuant](https://github.com/scrya-com/rotorquant) block-diagonal rotations. MERaLiON-3-10B is a multimodal audio-language model that uses google/gemma-2-9b as its decoder backbone.

This is not weight quantization — the model weights are unchanged. RotorQuant compresses the KV cache at inference time using learned Clifford algebra rotations, enabling longer audio contexts and lower VRAM usage with no training or calibration required.

## What is RotorQuant?

RotorQuant applies block-diagonal rotations (Clifford algebra) for online KV cache quantization during inference — no training or calibration required. It achieves **5.3x faster prefill** and **28% faster decode** than TurboQuant while using far fewer parameters (128 vs 16,384):

| Metric | RotorQuant | TurboQuant |
|--------|-----------|-----------|
| Perplexity | 6.91 | 7.07 |
| Decode speed | 119 tok/s | 93 tok/s |
| Prefill speed | 3,822 tok/s | 722 tok/s |
| Parameters | 128 | 16,384 |
| Complexity | O(d) | O(d log d) |

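To build intuition for why quantizing in a rotated basis can help, here is a minimal NumPy sketch — illustrative only, not the rotorquant implementation. It uses a single fixed-angle 2x2 (Givens) block rotation of the kind the PlanarQuant backend is described as using; in the real library the rotations are learned and fused into the KV path. Channel pairs are rotated, uniformly quantized to 2^bits levels, dequantized, and rotated back:

```python
import numpy as np

def rotate_quantize(x, theta, bits=3):
    """Quantize a vector after a block-diagonal 2x2 (Givens) rotation.

    Illustrative sketch: the same rotation block is applied to every
    channel pair, then values are uniformly quantized in that basis.
    """
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])       # one 2x2 rotation block
    rotated = x.reshape(-1, 2) @ R.T      # apply it to every channel pair
    # Uniform b-bit quantization in the rotated basis
    lo, hi = rotated.min(), rotated.max()
    levels = 2 ** bits - 1
    codes = np.round((rotated - lo) / (hi - lo) * levels)
    dequant = codes / levels * (hi - lo) + lo
    # Rotate back to the original basis (R is orthogonal, so R^-1 = R^T)
    return (dequant @ R).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)
x_hat = rotate_quantize(x, theta=0.4, bits=3)
max_err = float(np.abs(x - x_hat).max())
```

Because each block is only 2x2, applying all blocks costs O(d) per vector, which matches the complexity row in the table above.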
## Quickstart

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

from rotorquant import IsoQuantCache

model_id = "majentik/MERaLiON-3-10B-RotorQuant"

# MERaLiON ships custom model code on the Hub, hence trust_remote_code
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load an audio sample
dataset = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True
)
sample = next(iter(dataset))
audio = sample["audio"]

# Process the audio input
inputs = processor(
    audio=audio["array"],
    sampling_rate=audio["sampling_rate"],
    return_tensors="pt",
).to(model.device)

# Generate with a RotorQuant KV cache (3-bit recommended)
cache = IsoQuantCache(bits=3)
output = model.generate(
    **inputs,
    past_key_values=cache,
    use_cache=True,
    max_new_tokens=256,
)
transcription = processor.batch_decode(output, skip_special_tokens=True)[0]
print(transcription)
```

## Backends

- **PlanarQuant** (2D Givens rotations) — fastest, recommended for production
- **IsoQuant** (4D quaternion rotations) — balanced quality/speed
- **RotorQuant** (3D Clifford algebra) — research

```python
from rotorquant import PlanarQuantCache, IsoQuantCache, RotorQuantCache

# Production (fastest)
cache = PlanarQuantCache(bits=3)

# Balanced (recommended default)
cache = IsoQuantCache(bits=3)

# Research
cache = RotorQuantCache(bits=3)
```

## Configuration

| Bits | KV Cache Compression | Quality | Recommended For |
|------|---------------------|---------|-----------------|
| 3-bit | ~10x | Excellent | Production — best speed/quality tradeoff |
| 4-bit | ~5x | Near-lossless | Quality-critical applications |

## Memory Savings

VRAM usage for the Gemma-2-9B decoder's KV cache at different audio context lengths:

| Context Length | FP16 KV Cache | 3-bit RotorQuant | 4-bit RotorQuant |
|---------------|---------------|-------------------|-------------------|
| 8K | 0.9 GB | 0.09 GB | 0.18 GB |
| 32K | 3.6 GB | 0.36 GB | 0.72 GB |
| 64K | 7.2 GB | 0.72 GB | 1.44 GB |
| 128K | 14.4 GB | 1.44 GB | 2.88 GB |

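The KV cache grows linearly with context, so every row above follows from one per-token constant and the ~10x (3-bit) / ~5x (4-bit) factors in the Configuration table. A quick sanity-check sketch — the per-token figure is read off the table's 8K row, not derived from model internals:

```python
FP16_GB_PER_TOKEN = 0.9 / 8192  # from the 8K FP16 row above

def kv_cache_gb(context_len, compression=1.0):
    # Cache size scales linearly with context length;
    # quantization divides it by the compression factor.
    return context_len * FP16_GB_PER_TOKEN / compression

fp16_128k = kv_cache_gb(128 * 1024)        # matches the 14.4 GB table entry
rotor3_128k = kv_cache_gb(128 * 1024, 10)  # matches the 1.44 GB table entry
```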
## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant variant](https://huggingface.co/majentik/MERaLiON-3-10B-TurboQuant)
- [Base model](https://huggingface.co/aisingapore/MERaLiON-3-10B)