# MERaLiON-2-3B with TurboQuant KV Cache Compression
This repository provides inference scripts and benchmark data for running MERaLiON-2-3B with TurboQuant KV cache compression.
## Model Description
MERaLiON-2-3B is a speech-language model developed by I2R, A*STAR Singapore. It combines a Whisper-large-v3 speech encoder with a Gemma-2-2B-IT text decoder via a learned audio adapter. The model handles speech-to-text tasks including transcription, translation, and spoken language understanding.
TurboQuant compresses the KV cache at inference time using quantization, reducing memory usage during autoregressive generation. This is particularly beneficial for long-context generation where KV cache memory can become a bottleneck.
## Key Results
| Configuration | Avg Time (s) | Relative | Output Quality |
|---|---|---|---|
| Baseline (no compression) | 7.21 | 1.00x | Reference |
| TurboQuant 4-bit | 11.37 | 1.58x slower | Identical to baseline |
| TurboQuant 2-bit | 12.22 | 1.70x slower | Minor variations, coherent |
Test conditions: Apple Silicon MPS backend, 128 GB unified memory, 6.84 s audio input, 256 max new tokens, PyTorch 2.11.0, transformers 5.5.0.
Note: On Apple Silicon MPS, TurboQuant adds latency due to CPU-side quantization overhead. On CUDA GPUs, TurboQuant typically provides speedups by reducing memory bandwidth pressure, especially for long sequences.
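To make the memory argument concrete, here is a rough KV-cache size estimate. The layer, head, and head-dimension values below are illustrative assumptions for a Gemma-2-class decoder, not measured numbers from this repository:

```python
def kv_cache_bytes(seq_len, n_layers=26, n_kv_heads=4, head_dim=256, bits=16):
    """Approximate KV-cache size: K and V tensors (factor of 2) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits // 8

for bits in (16, 4, 2):
    mb = kv_cache_bytes(8192, bits=bits) / 2**20
    print(f"{bits:>2}-bit KV cache @ 8k tokens: {mb:.0f} MiB")
```

Under these assumptions, dropping from 16-bit to 4-bit keys/values cuts the cache to a quarter of its size, which is where the bandwidth savings on CUDA come from.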
## Quick Start

### Installation

```bash
pip install torch transformers librosa turboquant
```
### Inference

```bash
# Baseline (no compression)
python inference.py --audio path/to/audio.wav

# With TurboQuant 4-bit KV cache compression
python inference.py --audio path/to/audio.wav --bits 4

# With TurboQuant 2-bit KV cache compression
python inference.py --audio path/to/audio.wav --bits 2

# Custom prompt
python inference.py --audio path/to/audio.wav --prompt "Summarize this audio:"

# CPU inference
python inference.py --audio path/to/audio.wav --device cpu --bits 4
```
### Python API

```python
from inference import run_inference

# Basic transcription
text = run_inference("audio.wav")

# With 4-bit KV cache compression
text = run_inference("audio.wav", bits=4)

# With 2-bit compression on CPU
text = run_inference("audio.wav", bits=2, device="cpu")
```
## Compatibility Notes

This integration includes patches for running MERaLiON-2-3B with recent library versions:

- **transformers >= 5.x**: The original model code references `HybridCache`, which was removed. The inference script provides a compatibility shim.
- **NumPy >= 2.0**: TurboQuant uses `np.trapz`, which was renamed to `np.trapezoid`. The inference script patches this automatically.
- **Weight tying**: The model uses tied embeddings (`lm_head.weight = embed_tokens.weight`). The patched model code handles this correctly during loading.
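The NumPy patch above can be sketched as follows. This is a minimal illustration of the idea; the actual shim in `inference.py` may differ:

```python
import numpy as np

# NumPy 2.0 removed np.trapz in favor of np.trapezoid.
# Restore the old name so code that still calls np.trapz keeps working.
if not hasattr(np, "trapz"):
    np.trapz = np.trapezoid
```

After this runs, legacy `np.trapz(...)` calls resolve on both old and new NumPy versions.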
## Files

- `inference.py` - Main inference script with CLI interface
- `benchmark.py` - Benchmark script comparing baseline vs TurboQuant
- `benchmark_results.json` - Raw benchmark data
- `turboquant_config.json` - Configuration and benchmark metadata
## Architecture

```
Audio Input
     |
     v
[Whisper-large-v3 Encoder] --> [Audio Adapter (MLP)] --> [Gemma-2-2B-IT Decoder]
                                                                  |
                                                                  v
                                                       [TurboQuant KV Cache]
                                                                  |
                                                                  v
                                                             Text Output
```
TurboQuant operates on the decoder's KV cache during autoregressive generation. It quantizes key and value tensors to the specified bit width (2, 3, or 4 bits) using MSE-optimal quantization, reducing memory footprint while preserving generation quality.
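For intuition, here is a toy per-tensor quantizer in the same spirit. It uses simple uniform min-max quantization for illustration only, not TurboQuant's MSE-optimal scheme, and the tensor shape is an assumed example:

```python
import torch

def quantize_kv(x: torch.Tensor, bits: int = 4):
    """Uniformly quantize x to `bits`; return (codes, scale, zero_point)."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    codes = torch.round((x - lo) / scale).clamp(0, levels).to(torch.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    """Reconstruct an approximation of the original tensor."""
    return codes.float() * scale + lo

# Example: a key tensor shaped (batch, kv_heads, seq_len, head_dim)
k = torch.randn(1, 4, 128, 256)
codes, scale, lo = quantize_kv(k, bits=4)
err = (dequantize_kv(codes, scale, lo) - k).abs().max()
```

The rounding error of such a scheme is bounded by half the quantization step, which is why 4-bit output can remain indistinguishable from the baseline while 2-bit shows minor variations.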
## Citation

```bibtex
@misc{meralion2-turboquant,
  title={MERaLiON-2-3B with TurboQuant KV Cache Compression},
  year={2026},
  url={https://huggingface.co/majentik/MERaLiON-2-3B-TurboQuant}
}
```
## License
This repository contains inference scripts only. The base model (MERaLiON-2-3B) is subject to its own license terms. See the original model card for details.