---
library_name: transformers
base_model: aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B
tags:
  - rotorquant
  - kv-cache-quantization
  - efficient-inference
  - meralion
  - speech-to-text
  - transcription
  - translation
  - multimodal
  - audio
  - whisper
  - gemma-2
license: other
---

# MERaLiON-2-10B-RotorQuant — RotorQuant KV Cache Compression

A KV-cache-quantized variant of aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B using RotorQuant block-diagonal rotations. MERaLiON-2-10B pairs a Whisper encoder with a Gemma-2-9B-IT decoder for transcription, translation, and spoken language understanding.

This is not weight quantization — the model weights remain unchanged. RotorQuant compresses the KV cache at inference time using learned Clifford algebra rotations, enabling longer audio contexts and lower VRAM usage with no training or calibration required.

## What is RotorQuant?

RotorQuant applies block-diagonal (Clifford algebra) rotations for online KV cache quantization during inference — no training or calibration required. It achieves 5.3x faster prefill and 28% faster decode than TurboQuant while using two orders of magnitude fewer parameters (128 vs 16,384).

| Metric        | RotorQuant  | TurboQuant   |
|---------------|-------------|--------------|
| Perplexity    | 6.91        | 7.07         |
| Decode Speed  | 119 tok/s   | 93 tok/s     |
| Prefill Speed | 3,822 tok/s | 722 tok/s    |
| Parameters    | 128         | 16,384       |
| Complexity    | O(d)        | O(d log d)   |
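The mechanics can be sketched in a few lines of NumPy. This is an illustrative toy, not the library's actual implementation: `block_rotate` and `quantize` are hypothetical names, and the angles here are random rather than learned — the real rotations are chosen to make the rotated coordinates easier to quantize. The sketch shows why a block-diagonal rotation is O(d): a d-dimensional vector needs only d/2 angles, one per 2x2 block, and the transform is exactly invertible.

```python
import numpy as np

def block_rotate(x, angles):
    """Block-diagonal rotation: each consecutive pair of dimensions
    gets its own 2x2 Givens rotation (d/2 angles for a d-dim vector)."""
    out = x.copy()
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = np.cos(theta), np.sin(theta)
        out[2 * i] = c * a - s * b
        out[2 * i + 1] = s * a + c * b
    return out

def quantize(x, bits=3):
    """Uniform symmetric fake-quantization to 2^(bits-1)-1 levels."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
head_dim = 8
x = rng.standard_normal(head_dim)                  # stand-in for one K/V head vector
angles = rng.uniform(0, 2 * np.pi, head_dim // 2)  # d/2 parameters: O(d)

# Rotate, quantize in the rotated basis, then invert the rotation.
x_hat = block_rotate(quantize(block_rotate(x, angles), bits=3), -angles)
print(np.abs(x - x_hat).max())                     # small reconstruction error
```

Because each 2x2 rotation is orthogonal, rotating by `-angles` undoes it exactly, so the only error left after the round trip is the quantization error itself.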

## Quickstart

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from rotorquant import IsoQuantCache
from datasets import load_dataset

model_id = "majentik/MERaLiON-2-10B-RotorQuant"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load an audio sample
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True)
sample = next(iter(dataset))
audio = sample["audio"]

# Process the audio input
inputs = processor(
    audio=audio["array"],
    sampling_rate=audio["sampling_rate"],
    return_tensors="pt",
).to(model.device)

# Use a RotorQuant KV cache (3-bit recommended)
cache = IsoQuantCache(bits=3)
output = model.generate(
    **inputs,
    past_key_values=cache,
    use_cache=True,
    max_new_tokens=256,
)
transcription = processor.batch_decode(output, skip_special_tokens=True)[0]
print(transcription)
```

## Translation Example

```python
# Translate spoken Mandarin to English text.
# mandarin_audio: an audio dict with "array" and "sampling_rate",
# loaded the same way as the sample in the Quickstart above.
inputs = processor(
    audio=mandarin_audio["array"],
    sampling_rate=mandarin_audio["sampling_rate"],
    return_tensors="pt",
    task="translate",
).to(model.device)

cache = IsoQuantCache(bits=3)
output = model.generate(
    **inputs,
    past_key_values=cache,
    use_cache=True,
    max_new_tokens=256,
)
translation = processor.batch_decode(output, skip_special_tokens=True)[0]
print(translation)
```

## Backends

- PlanarQuant (2D Givens rotations) — fastest, recommended for production
- IsoQuant (4D quaternion rotations) — balanced quality/speed, used in the examples above
- RotorQuant (3D Clifford algebra) — research

```python
from rotorquant import PlanarQuantCache, IsoQuantCache, RotorQuantCache

# Fastest (production)
cache = PlanarQuantCache(bits=3)

# Balanced quality/speed (used in the examples above)
cache = IsoQuantCache(bits=3)

# Research
cache = RotorQuantCache(bits=3)
```

## Configuration

| Bits  | KV Cache Compression | Quality       | Recommended For                          |
|-------|----------------------|---------------|------------------------------------------|
| 3-bit | ~10x                 | Excellent     | Production — best speed/quality tradeoff |
| 4-bit | ~5x                  | Near-lossless | Quality-critical applications            |

## Memory Savings

VRAM usage for the Gemma-2-9B-IT decoder at different audio context lengths:

| Context Length | FP16 KV Cache | 3-bit RotorQuant | 4-bit RotorQuant |
|----------------|---------------|------------------|------------------|
| 8K             | 0.9 GB        | 0.09 GB          | 0.18 GB          |
| 32K            | 3.6 GB        | 0.36 GB          | 0.72 GB          |
| 64K            | 7.2 GB        | 0.72 GB          | 1.44 GB          |
| 128K           | 14.4 GB       | 1.44 GB          | 2.88 GB          |
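KV cache memory grows linearly with context length, so the table extrapolates to other lengths by simple scaling. A small helper makes that explicit (illustrative only; the GB-per-8K figures are read directly off the table above, and `kv_cache_gb` is not part of the library):

```python
# GB of decoder KV cache per 8K tokens, taken from the table above.
GB_PER_8K = {"fp16": 0.9, "rotorquant_3bit": 0.09, "rotorquant_4bit": 0.18}

def kv_cache_gb(context_tokens, mode="rotorquant_3bit"):
    """Estimate decoder KV cache VRAM (GB) at a given context length,
    assuming linear scaling from the 8K figures in the table."""
    return GB_PER_8K[mode] * context_tokens / 8192

print(f"{kv_cache_gb(131072, 'fp16'):.2f} GB")             # 128K context, FP16
print(f"{kv_cache_gb(131072, 'rotorquant_3bit'):.2f} GB")  # 128K context, 3-bit
```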

## See Also