---
library_name: transformers
base_model: aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-3B
tags:
- rotorquant
- kv-cache-quantization
- efficient-inference
- meralion
- speech-to-text
- transcription
- translation
- multimodal
- audio
- whisper
- gemma-2
license: other
---
# MERaLiON-2-3B-RotorQuant — RotorQuant KV Cache Compression
KV cache quantized variant of [aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-3B](https://huggingface.co/aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-3B) using [RotorQuant](https://github.com/scrya-com/rotorquant) block-diagonal rotations. MERaLiON-2-3B uses a Whisper-large-v3 encoder paired with a Gemma-2-2B-IT decoder for transcription, translation, and spoken language understanding.
This is not weight quantization: the model weights remain unchanged. RotorQuant compresses the KV cache at inference time using Clifford algebra rotations, which allows longer audio contexts and lower VRAM usage with no training or calibration required.
## What is RotorQuant?
RotorQuant applies block-diagonal rotations (Clifford algebra) for online KV cache quantization during inference — no training or calibration required. It achieves **5.3x faster prefill** and **28% faster decode** compared to TurboQuant while using **44x fewer parameters**.
| Metric | RotorQuant | TurboQuant |
|--------|-----------|-----------|
| Perplexity | 6.91 | 7.07 |
| Decode Speed | 119 tok/s | 93 tok/s |
| Prefill Speed | 3,822 tok/s | 722 tok/s |
| Parameters | 128 | 16,384 |
| Complexity | O(d) | O(d log d) |
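The prefill and decode speedups quoted above can be reproduced from the throughput rows of the table. A quick check, using only the figures listed there (the variable names are illustrative):

```python
# Throughput figures copied from the comparison table (tokens per second).
rotor = {"decode": 119, "prefill": 3822}
turbo = {"decode": 93, "prefill": 722}

# Prefill is reported as a multiplier, decode as a percentage gain.
prefill_speedup = rotor["prefill"] / turbo["prefill"]
decode_gain_pct = (rotor["decode"] / turbo["decode"] - 1) * 100

print(f"prefill: {prefill_speedup:.1f}x faster")
print(f"decode: {decode_gain_pct:.0f}% faster")
```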
## Quickstart
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from rotorquant import IsoQuantCache
from datasets import load_dataset

model_id = "majentik/MERaLiON-2-3B-RotorQuant"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load audio sample
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True)
sample = next(iter(dataset))
audio = sample["audio"]

# Process audio input
inputs = processor(
    audio=audio["array"],
    sampling_rate=audio["sampling_rate"],
    return_tensors="pt",
).to(model.device)

# Use RotorQuant KV cache (3-bit recommended)
cache = IsoQuantCache(bits=3)
output = model.generate(
    **inputs,
    past_key_values=cache,
    use_cache=True,
    max_new_tokens=256,
)

transcription = processor.batch_decode(output, skip_special_tokens=True)[0]
print(transcription)
```
### Translation Example
```python
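# Translate spoken Mandarin to English text.
# Reuses `processor`, `model`, and `IsoQuantCache` from the Quickstart.
# `mandarin_audio` is assumed to be a decoded audio dict with "array" and
# "sampling_rate" keys, like `sample["audio"]` in the Quickstart.
inputs = processor(
    audio=mandarin_audio["array"],
    sampling_rate=mandarin_audio["sampling_rate"],
    return_tensors="pt",
    task="translate",
).to(model.device)

cache = IsoQuantCache(bits=3)
output = model.generate(
    **inputs,
    past_key_values=cache,
    use_cache=True,
    max_new_tokens=256,
)

translation = processor.batch_decode(output, skip_special_tokens=True)[0]
print(translation)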
```
## Backends
- **PlanarQuant** (2D Givens rotations) — fastest, recommended for production
- **IsoQuant** (4D quaternion rotations) — balanced quality/speed
- **RotorQuant** (3D Clifford algebra) — research
```python
from rotorquant import PlanarQuantCache, IsoQuantCache, RotorQuantCache
# Production (fastest)
cache = PlanarQuantCache(bits=3)
# Balanced (recommended default)
cache = IsoQuantCache(bits=3)
# Research
cache = RotorQuantCache(bits=3)
```
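To give a feel for what "2D Givens rotations" means in the PlanarQuant description, here is a minimal pure-Python sketch of the general idea: rotate consecutive pairs of dimensions, quantize uniformly, then invert both steps. It is an illustration only and not the `rotorquant` implementation; the helper names (`givens_rotate`, `quantize`, `dequantize`), the random angles, and the min/max uniform quantizer are all assumptions made for this example.

```python
import math
import random

def givens_rotate(vec, angles, inverse=False):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        if inverse:
            theta = -theta
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend((c * x - s * y, s * x + c * y))
    return out

def quantize(vec, bits):
    """Uniform quantization to 2**bits levels over [min(vec), max(vec)]."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant input
    codes = [round((v - lo) / scale) for v in vec]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]

rng = random.Random(0)
d = 8  # toy head dimension (hypothetical)
key = [rng.gauss(0, 1) for _ in range(d)]
angles = [rng.uniform(0, 2 * math.pi) for _ in range(d // 2)]  # one angle per pair

rotated = givens_rotate(key, angles)
codes, lo, scale = quantize(rotated, bits=3)
restored = givens_rotate(dequantize(codes, lo, scale), angles, inverse=True)
max_err = max(abs(a - b) for a, b in zip(key, restored))
print(f"codes: {codes}")
print(f"max reconstruction error: {max_err:.4f}")
```

Because the rotation is block-diagonal, each pair is handled independently and the loop touches every dimension once, which is the kind of linear-in-`d` cost the comparison table refers to.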
## Configuration
| Bits | KV Cache Compression | Quality | Recommended For |
|------|---------------------|---------|-----------------|
| 3-bit | ~10x | Excellent | Production — best speed/quality tradeoff |
| 4-bit | ~5x | Near-lossless | Quality-critical applications |
## Memory Savings
KV cache VRAM usage for the Gemma-2-2B-IT decoder at different context lengths (in tokens):
| Context Length | FP16 KV Cache | 3-bit RotorQuant | 4-bit RotorQuant |
|---------------|---------------|-------------------|-------------------|
| 8K | 0.3 GB | 0.03 GB | 0.06 GB |
| 32K | 1.2 GB | 0.12 GB | 0.24 GB |
| 64K | 2.4 GB | 0.24 GB | 0.48 GB |
| 128K | 4.8 GB | 0.48 GB | 0.96 GB |
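The rows above follow from the FP16 baseline and the approximate compression ratios in the Configuration table. A small sketch that reproduces them, assuming linear scaling with context length and that "8K" means 8,192 tokens (the function name `kv_cache_gb` is illustrative):

```python
# Approximate ratios from the Configuration table ("~10x" and "~5x").
COMPRESSION = {3: 10.0, 4: 5.0}

# FP16 baseline from the table: 0.3 GB at 8K tokens.
FP16_GB_PER_8K = 0.3

def kv_cache_gb(context_tokens, bits=None):
    """Estimated KV cache size in GB; bits=None means uncompressed FP16."""
    fp16 = FP16_GB_PER_8K * context_tokens / 8192
    return fp16 if bits is None else fp16 / COMPRESSION[bits]

for ctx_k in (8, 32, 64, 128):
    tokens = ctx_k * 1024
    print(
        f"{ctx_k}K: FP16 {kv_cache_gb(tokens):.2f} GB, "
        f"3-bit {kv_cache_gb(tokens, 3):.2f} GB, "
        f"4-bit {kv_cache_gb(tokens, 4):.2f} GB"
    )
```

These are estimates derived from the table itself, not measurements; actual usage depends on batch size and on how the cache is allocated.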
## See Also
- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant variant](https://huggingface.co/majentik/MERaLiON-2-3B-TurboQuant)
- [Base model](https://huggingface.co/aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-3B)