# MERaLiON-2-10B + TurboQuant KV Cache Compression
Integration of TurboQuant KV cache compression with MERaLiON-2-10B, a 10B-parameter speech-language model built on a Whisper encoder and Gemma-2-9b-IT decoder.
TurboQuant compresses the Gemma-2 decoder's KV cache at inference time using learned quantization codebooks. No model retraining is required -- it is a drop-in replacement for the past_key_values argument.
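The core idea can be sketched with a toy per-group uniform quantizer. This is illustrative only: TurboQuant itself uses learned codebooks rather than this uniform scheme, and a real implementation would pack two 4-bit values per byte instead of storing them in `int8`.

```python
import numpy as np

def quantize_4bit(x, group=64):
    """Toy per-group symmetric 4-bit quantizer (NOT TurboQuant's actual scheme)."""
    xs = x.reshape(-1, group)
    scale = np.abs(xs).max(axis=1, keepdims=True) / 7.0   # map into int4 range [-8, 7]
    q = np.clip(np.round(xs / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    """Reconstruct an approximate fp32 tensor from codes and per-group scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
k = rng.standard_normal((8, 128, 256)).astype(np.float32)  # [kv_heads, seq, head_dim]
q, s = quantize_4bit(k)
k_hat = dequantize_4bit(q, s, k.shape)
print(np.abs(k - k_hat).mean())  # small reconstruction error
```

A cache built this way stores `q` and `s` instead of the fp16 keys/values and dequantizes on the fly during attention, which is what makes it a drop-in `past_key_values` replacement.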
## Benchmark Results
Tested on Apple M4 Max (128GB unified memory), MPS backend, bfloat16, with a 6.84s audio clip.
### Natural Generation (greedy, model-determined length)
| Configuration | Tokens | Time (s) | Tok/s | KV Memory | Savings |
|---|---|---|---|---|---|
| Baseline (fp16 cache) | 9 | 1.85 | 4.9 | 353 MB | -- |
| TurboQuant 4-bit | 9 | 1.35 | 6.7 | 201 MB | 1.76x |
| TurboQuant 2-bit | 9 | 1.22 | 7.4 | 201 MB | 1.76x |
- 4-bit output matches baseline exactly
- 2-bit shows minor divergence (expected at lower precision)
- 1.37x speedup (4-bit) / 1.52x speedup (2-bit) for short sequences
### Forced Long Generation (128+ tokens)
| Configuration | Tokens | Time (s) | Tok/s | KV Memory | Savings |
|---|---|---|---|---|---|
| Baseline | 129 | 59.6 | 2.2 | 7,805 MB | -- |
| TurboQuant 4-bit | 130 | 170.5 | 0.8 | 3,985 MB | 1.96x |
| TurboQuant 2-bit | 130 | 323.6 | 0.4 | 3,985 MB | 1.96x |
For longer sequences, TurboQuant compression overhead on MPS exceeds the memory-bandwidth savings. The primary benefit at longer context is KV cache memory reduction (1.96x). On CUDA GPUs, speed improvements are expected to scale better.
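A back-of-envelope check makes the memory numbers plausible. The dimensions below are taken from the public Gemma-2-9B config (42 decoder layers, 8 KV heads, head dim 256) and are an assumption here, not something measured in this repo:

```python
def kv_bytes_per_token(layers=42, kv_heads=8, head_dim=256, bits=16):
    # keys + values for every decoder layer
    return layers * 2 * kv_heads * head_dim * bits // 8

fp16 = kv_bytes_per_token(bits=16)   # 344,064 B, roughly 0.33 MB per cached token
int4 = kv_bytes_per_token(bits=4)
print(fp16 // int4)  # 4
```

The ideal 4-bit ratio is 4x; the measured 1.96x is lower, presumably because per-group scales, codebook metadata, and allocator overhead are counted against the compressed cache.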
## Usage
```python
import numpy as np

if not hasattr(np, "trapz"):
    np.trapz = np.trapezoid  # numpy 2.x compat

import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from turboquant import TurboQuantCache

model_id = "MERaLiON/MERaLiON-2-10B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto",
    trust_remote_code=True, attn_implementation="eager",
)

# Prepare audio inputs
import soundfile as sf
import librosa

audio, sr = sf.read("audio.wav")
audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
inputs = processor(
    text="<SpeechHere> Transcribe the audio.",
    audios=[audio], sampling_rate=16000, padding=True,
)
# ... move to device, add batch dim, cast to bf16 ...

# Generate with TurboQuant 4-bit KV cache
cache = TurboQuantCache(bits=4)
output = model.generate(**inputs, max_new_tokens=256, past_key_values=cache)
```
See `inference.py` for a complete working script.
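The elided "move to device, add batch dim, cast to bf16" step in the snippet above could look roughly like the helper below. The key names and the batching behaviour of the real MERaLiON processor may differ, so treat this as a sketch:

```python
import torch

def to_model_inputs(features, device="cpu", dtype=torch.bfloat16):
    """Rough sketch of the elided preprocessing step: batch, cast, move."""
    out = {}
    for name, value in features.items():
        t = torch.as_tensor(value)
        t = t.unsqueeze(0)                # add a leading batch dimension
        if t.is_floating_point():         # float features (e.g. mel frames) -> bf16
            t = t.to(dtype)
        out[name] = t.to(device)
    return out

# Toy example with made-up shapes:
batch = to_model_inputs({
    "input_features": torch.randn(80, 3000),   # mel spectrogram
    "input_ids": torch.tensor([1, 2, 3]),      # tokenized prompt
})
print(batch["input_ids"].shape, batch["input_features"].dtype)
# torch.Size([1, 3]) torch.bfloat16
```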
## CLI

```shell
python inference.py --audio speech.wav                  # 4-bit TurboQuant (default)
python inference.py --audio speech.wav --bits 2         # 2-bit TurboQuant
python inference.py --audio speech.wav --no-turboquant  # baseline (fp16 cache)
```
## Requirements

```shell
pip install torch transformers turboquant soundfile librosa scipy
```
## Compatibility Notes
MERaLiON-2-10B's custom modeling code was written for an older version of transformers. Running on transformers 5.5+ requires several patches to the cached model files:
- `HybridCache` import -- removed in transformers 5.x (replaced with a sentinel class)
- `pad_token_id` access -- not set in `MERaLiON2Config` (use a `getattr` fallback)
- `tie_weights()` signature -- new `recompute_mapping` kwarg (accept `**kwargs`)
- `cache_position` in `prepare_inputs_for_generation` -- can be `None` in the new generate flow
These patches are applied automatically to the HuggingFace-cached model files by the inference script's import mechanism. See `turboquant_config.json` for details.
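For illustration, the `pad_token_id` patch boils down to a `getattr` fallback along these lines. This is a sketch against a stub config, not the exact diff the script applies:

```python
class MERaLiON2ConfigStub:
    """Stand-in for the real config, which does not define pad_token_id."""
    eos_token_id = 1

config = MERaLiON2ConfigStub()

# Unpatched code accessed config.pad_token_id directly and raised AttributeError
# on newer transformers; the patched version falls back gracefully:
pad_token_id = getattr(config, "pad_token_id", None)
if pad_token_id is None:
    pad_token_id = getattr(config, "eos_token_id", 0)  # fallback choice is an assumption
print(pad_token_id)  # 1
```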
## Architecture

```
Audio (WAV) --> Whisper Encoder (large-v3 scale) --> Speech Adapter --> Gemma-2-9b-IT Decoder
                                                                                  ^
                                                                                  |
                                                                        TurboQuant KV Cache
                                                                    (2-bit or 4-bit compression)
```
## Files

- `inference.py` -- complete inference script with CLI
- `turboquant_config.json` -- configuration and benchmark data
- `benchmark_results.json` -- raw benchmark results
## Citation

```bibtex
@article{meralion2,
  title  = {MERaLiON-AudioLLM: Technical Report},
  author = {MERaLiON Team},
  year   = {2024}
}
```