---
library_name: transformers
base_model: aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B
tags:
- rotorquant
- kv-cache-quantization
- efficient-inference
- meralion
- speech-to-text
- transcription
- translation
- multimodal
- audio
- whisper
- gemma-2
license: other
---

# MERaLiON-2-10B-RotorQuant — RotorQuant KV Cache Compression

A KV-cache-quantized variant of [aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B](https://huggingface.co/aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B) using [RotorQuant](https://github.com/scrya-com/rotorquant) block-diagonal rotations. MERaLiON-2-10B pairs a Whisper encoder with a Gemma-2-9B-IT decoder for transcription, translation, and spoken language understanding.

This is not weight quantization: the model weights are unchanged. RotorQuant compresses the KV cache at inference time using learned Clifford algebra rotations, enabling longer audio contexts and lower VRAM usage with no training or calibration required.

## What is RotorQuant?

RotorQuant applies block-diagonal (Clifford algebra) rotations for online KV cache quantization during inference, with no training or calibration required. Compared to TurboQuant it achieves **5.3x faster prefill** and **28% faster decode** while using **128x fewer rotation parameters**:

| Metric | RotorQuant | TurboQuant |
|--------|------------|------------|
| Perplexity | 6.91 | 7.07 |
| Decode speed | 119 tok/s | 93 tok/s |
| Prefill speed | 3,822 tok/s | 722 tok/s |
| Rotation parameters | 128 | 16,384 |
| Complexity | O(d) | O(d log d) |

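The rotate-then-quantize idea can be sketched in a few lines. The snippet below is an illustrative toy, not the rotorquant library's internals: random Givens angles stand in for the learned rotation parameters, and a single-scale uniform quantizer stands in for the library's quantizer.

```python
import numpy as np

# Toy sketch: block-diagonal rotate -> quantize -> dequantize -> un-rotate.
# Random angles stand in for rotorquant's learned rotation parameters.

def block_rotate(x, angles):
    """Apply an independent 2D (Givens) rotation to each consecutive pair of dims."""
    y = x.copy()
    for i, a in enumerate(angles):
        c, s = np.cos(a), np.sin(a)
        u, v = x[2 * i], x[2 * i + 1]
        y[2 * i] = c * u - s * v
        y[2 * i + 1] = s * u + c * v
    return y

def quantize(x, bits):
    """Uniform quantization to 2**bits levels with a single scale and offset."""
    levels = 2 ** bits - 1
    lo, scale = x.min(), (x.max() - x.min()) / levels
    q = np.round((x - lo) / scale)
    return q, scale, lo

rng = np.random.default_rng(0)
head_dim = 128                       # one attention head's K or V vector
x = rng.standard_normal(head_dim)
angles = rng.uniform(0, 2 * np.pi, size=head_dim // 2)

rotated = block_rotate(x, angles)
q, scale, lo = quantize(rotated, bits=3)       # 3-bit codes stored in the cache
recon = block_rotate(q * scale + lo, -angles)  # inverse rotation on read

print("3-bit reconstruction RMSE:", np.sqrt(np.mean((recon - x) ** 2)))
```

Because each Givens rotation touches only two dimensions, the transform costs O(d) per vector, matching the complexity row in the table above.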
## Quickstart

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from rotorquant import IsoQuantCache
from datasets import load_dataset

model_id = "majentik/MERaLiON-2-10B-RotorQuant"

# MERaLiON ships custom processing code, so trust_remote_code is required
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load an audio sample
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True)
sample = next(iter(dataset))
audio = sample["audio"]

# Process the audio input
inputs = processor(
    audio=audio["array"],
    sampling_rate=audio["sampling_rate"],
    return_tensors="pt",
).to(model.device)

# Use a RotorQuant KV cache (3-bit recommended)
cache = IsoQuantCache(bits=3)
output = model.generate(
    **inputs,
    past_key_values=cache,
    use_cache=True,
    max_new_tokens=256,
)
transcription = processor.batch_decode(output, skip_special_tokens=True)[0]
print(transcription)
```

### Translation Example

```python
# Translate spoken Mandarin to English text.
# mandarin_audio is assumed to be a dict with "array" and "sampling_rate",
# loaded the same way as the Quickstart sample above.
inputs = processor(
    audio=mandarin_audio["array"],
    sampling_rate=mandarin_audio["sampling_rate"],
    return_tensors="pt",
    task="translate",
).to(model.device)

cache = IsoQuantCache(bits=3)
output = model.generate(
    **inputs,
    past_key_values=cache,
    use_cache=True,
    max_new_tokens=256,
)
translation = processor.batch_decode(output, skip_special_tokens=True)[0]
print(translation)
```

## Backends

- **PlanarQuant** (2D Givens rotations) — fastest, recommended for production
- **IsoQuant** (4D quaternion rotations) — balanced quality/speed
- **RotorQuant** (3D Clifford algebra) — research

```python
from rotorquant import PlanarQuantCache, IsoQuantCache, RotorQuantCache

# Production (fastest)
cache = PlanarQuantCache(bits=3)

# Balanced (recommended default)
cache = IsoQuantCache(bits=3)

# Research
cache = RotorQuantCache(bits=3)
```

## Configuration

| Bits | KV Cache Compression | Quality | Recommended For |
|------|----------------------|---------|-----------------|
| 3-bit | ~10x | Excellent | Production — best speed/quality tradeoff |
| 4-bit | ~5x | Near-lossless | Quality-critical applications |

## Memory Savings

KV cache VRAM usage for the Gemma-2-9B-IT decoder at different audio context lengths:

| Context Length | FP16 KV Cache | 3-bit RotorQuant | 4-bit RotorQuant |
|----------------|---------------|------------------|------------------|
| 8K | 0.9 GB | 0.09 GB | 0.18 GB |
| 32K | 3.6 GB | 0.36 GB | 0.72 GB |
| 64K | 7.2 GB | 0.72 GB | 1.44 GB |
| 128K | 14.4 GB | 1.44 GB | 2.88 GB |

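The 3-bit and 4-bit columns are simply the FP16 column divided by the approximate compression factors from the Configuration table. A throwaway helper (not part of rotorquant) makes the arithmetic explicit:

```python
# Approximate compression factors from the Configuration table above.
COMPRESSION = {3: 10.0, 4: 5.0}

def quantized_kv_gb(fp16_gb: float, bits: int) -> float:
    """Estimate the quantized KV cache size from its FP16 size."""
    return fp16_gb / COMPRESSION[bits]

for ctx, fp16_gb in [("8K", 0.9), ("32K", 3.6), ("64K", 7.2), ("128K", 14.4)]:
    print(f"{ctx}: 3-bit {quantized_kv_gb(fp16_gb, 3):.2f} GB, "
          f"4-bit {quantized_kv_gb(fp16_gb, 4):.2f} GB")
```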
## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant variant](https://huggingface.co/majentik/MERaLiON-2-10B-TurboQuant)
- [Base model](https://huggingface.co/aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B)