---
library_name: transformers
base_model: aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B
tags:
- rotorquant
- kv-cache-quantization
- efficient-inference
- meralion
- speech-to-text
- transcription
- translation
- multimodal
- audio
- whisper
- gemma-2
license: other
---

# MERaLiON-2-10B-RotorQuant: RotorQuant KV Cache Compression

A KV-cache-quantized variant of [aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B](https://huggingface.co/aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B) using [RotorQuant](https://github.com/scrya-com/rotorquant) block-diagonal rotations. MERaLiON-2-10B pairs a Whisper encoder with a Gemma-2-9B-IT decoder for transcription, translation, and spoken language understanding.

This is not weight quantization: the model weights are unchanged. RotorQuant compresses the KV cache at inference time using learned Clifford-algebra rotations, enabling longer audio contexts and lower VRAM usage with no training or calibration required.

## What is RotorQuant?

RotorQuant applies block-diagonal (Clifford algebra) rotations for online KV cache quantization during inference, with no training or calibration required. Compared to TurboQuant, it achieves **5.3x faster prefill** and **28% faster decode** while using **128x fewer parameters** (128 vs. 16,384).

| Metric | RotorQuant | TurboQuant |
|--------|------------|------------|
| Perplexity | 6.91 | 7.07 |
| Decode speed | 119 tok/s | 93 tok/s |
| Prefill speed | 3,822 tok/s | 722 tok/s |
| Parameters | 128 | 16,384 |
| Complexity | O(d) | O(d log d) |

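The mechanism can be sketched in a few lines: rotate each key/value head vector with small, independent block rotations (shown here with 2D Givens blocks, the PlanarQuant case), quantize the rotated vector to a few bits, and invert the rotation after dequantization. This is an illustrative sketch, not the library's implementation; in practice the angles are learned offline, and the `head_dim` value assumes a Gemma-2-style 256-dim head.

```python
import numpy as np

def block_rotate(x, angles):
    """Apply an independent Givens rotation to each consecutive 2D block (O(d) work)."""
    pairs = x.reshape(-1, 2)
    c, s = np.cos(angles), np.sin(angles)
    return np.stack([c * pairs[:, 0] - s * pairs[:, 1],
                     s * pairs[:, 0] + c * pairs[:, 1]], axis=1).reshape(x.shape)

def quantize(x, bits=3):
    """Uniform min-max quantization to 2**bits levels."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
head_dim = 256                                      # Gemma-2 head dimension
angles = rng.uniform(0, 2 * np.pi, head_dim // 2)   # 128 per-block parameters
k = rng.standard_normal(head_dim).astype(np.float32)

codes, scale, lo = quantize(block_rotate(k, angles), bits=3)
k_hat = block_rotate(dequantize(codes, scale, lo), -angles)  # inverse rotation
# max reconstruction error is bounded by the quantization step `scale`
```

One angle per 2D block gives `head_dim // 2 = 128` rotation parameters for a 256-dim head, which lines up with the parameter count in the table above; the IsoQuant and RotorQuant backends use 4D and 3D blocks instead of 2D.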
## Quickstart

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from rotorquant import IsoQuantCache
from datasets import load_dataset

model_id = "majentik/MERaLiON-2-10B-RotorQuant"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load an audio sample
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True)
sample = next(iter(dataset))
audio = sample["audio"]

# Process the audio input
inputs = processor(
    audio=audio["array"],
    sampling_rate=audio["sampling_rate"],
    return_tensors="pt",
).to(model.device)

# Use a RotorQuant KV cache (3-bit recommended)
cache = IsoQuantCache(bits=3)
output = model.generate(
    **inputs,
    past_key_values=cache,
    use_cache=True,
    max_new_tokens=256,
)
transcription = processor.batch_decode(output, skip_special_tokens=True)[0]
print(transcription)
```

### Translation Example

```python
# Translate spoken Mandarin to English text.
# `mandarin_audio` is an audio dict (array + sampling rate) loaded as in the quickstart.
inputs = processor(
    audio=mandarin_audio["array"],
    sampling_rate=mandarin_audio["sampling_rate"],
    return_tensors="pt",
    task="translate",
).to(model.device)

cache = IsoQuantCache(bits=3)
output = model.generate(
    **inputs,
    past_key_values=cache,
    use_cache=True,
    max_new_tokens=256,
)
translation = processor.batch_decode(output, skip_special_tokens=True)[0]
print(translation)
```

## Backends

- **PlanarQuant** (2D Givens rotations): fastest; recommended for production
- **IsoQuant** (4D quaternion rotations): balanced quality and speed
- **RotorQuant** (3D Clifford algebra rotations): research

```python
from rotorquant import PlanarQuantCache, IsoQuantCache, RotorQuantCache

# Production (fastest)
cache = PlanarQuantCache(bits=3)

# Balanced (recommended default)
cache = IsoQuantCache(bits=3)

# Research
cache = RotorQuantCache(bits=3)
```

## Configuration

| Bits | KV Cache Compression | Quality | Recommended For |
|------|----------------------|---------|-----------------|
| 3-bit | ~10x | Excellent | Production; best speed/quality tradeoff |
| 4-bit | ~5x | Near-lossless | Quality-critical applications |

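Sub-byte codes only deliver ratios like these if they are stored bit-packed rather than one code per byte. A minimal illustration of 3-bit packing (hypothetical, not RotorQuant's actual storage layout, which also needs per-block scales and rotation state):

```python
import numpy as np

def pack3(codes):
    """Pack 3-bit codes (values 0-7) into bytes: 8 codes occupy 3 bytes."""
    bits = np.unpackbits(codes.astype(np.uint8)[:, None], axis=1)[:, -3:]
    return np.packbits(bits.ravel())

def unpack3(packed, n):
    """Recover n 3-bit codes from a packed byte array."""
    bits = np.unpackbits(packed)[: n * 3].reshape(n, 3)
    pad = np.zeros((n, 5), dtype=np.uint8)
    return np.packbits(np.concatenate([pad, bits], axis=1), axis=1).ravel()

codes = np.random.randint(0, 8, size=4096).astype(np.uint8)
packed = pack3(codes)
print(len(packed), "bytes instead of", codes.nbytes)  # 1536 vs 4096
```

Eight codes fit exactly in three bytes, so the raw payload shrinks by 16/3 relative to FP16 elements before metadata overhead.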
## Memory Savings

Approximate KV cache VRAM usage for the Gemma-2-9B-IT decoder at different audio context lengths:

| Context Length | FP16 KV Cache | 3-bit RotorQuant | 4-bit RotorQuant |
|----------------|---------------|------------------|------------------|
| 8K | 0.9 GB | 0.09 GB | 0.18 GB |
| 32K | 3.6 GB | 0.36 GB | 0.72 GB |
| 64K | 7.2 GB | 0.72 GB | 1.44 GB |
| 128K | 14.4 GB | 1.44 GB | 2.88 GB |

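The quantized columns follow directly from the FP16 column and the compression factors in the Configuration table (~10x for 3-bit, ~5x for 4-bit), and every row scales linearly with context length:

```python
# FP16 KV cache sizes (GB) from the table above, keyed by context length
fp16_gb = {"8K": 0.9, "32K": 3.6, "64K": 7.2, "128K": 14.4}

for ctx, gb in fp16_gb.items():
    # Divide by the compression factors from the Configuration table
    print(f"{ctx}: 3-bit ~ {gb / 10:.2f} GB, 4-bit ~ {gb / 5:.2f} GB")
```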
## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant variant](https://huggingface.co/majentik/MERaLiON-2-10B-TurboQuant)
- [Base model](https://huggingface.co/aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B)