MERaLiON-2-3B with TurboQuant KV Cache Compression

This repository provides inference scripts and benchmark data for running MERaLiON-2-3B with TurboQuant KV cache compression.

Model Description

MERaLiON-2-3B is a speech-language model developed by I2R, A*STAR Singapore. It combines a Whisper-large-v3 speech encoder with a Gemma-2-2B-IT text decoder via a learned audio adapter. The model handles speech-to-text tasks including transcription, translation, and spoken language understanding.

TurboQuant compresses the KV cache at inference time using quantization, reducing memory usage during autoregressive generation. This is particularly beneficial for long-context generation where KV cache memory can become a bottleneck.
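To make the memory savings concrete, here is a back-of-the-envelope KV cache size calculation. The layer/head/dimension figures below are illustrative assumptions for a Gemma-2-2B-style decoder, not the exact MERaLiON-2-3B configuration:

```python
# Rough KV cache footprint for a small decoder at different bit widths.
# n_layers / n_kv_heads / head_dim are illustrative, not the real config.

def kv_cache_bytes(seq_len, n_layers=26, n_kv_heads=4, head_dim=256, bits=16):
    """Bytes for keys + values across all layers at a given bit width."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len  # K and V tensors
    return elems * bits // 8

fp16 = kv_cache_bytes(8192, bits=16)
q4 = kv_cache_bytes(8192, bits=4)
print(f"fp16: {fp16 / 2**20:.0f} MiB, 4-bit: {q4 / 2**20:.0f} MiB")
# → fp16: 832 MiB, 4-bit: 208 MiB
```

At 8K tokens the 4-bit cache is exactly 4x smaller, which is where the long-context benefit comes from.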

Key Results

Configuration               Avg Time (s)   Relative Time   Output Quality
Baseline (no compression)   7.21           1.00x           Reference
TurboQuant 4-bit            11.37          1.58x           Identical to baseline
TurboQuant 2-bit            12.22          1.70x           Minor variations, coherent

Test conditions: Apple Silicon MPS, 128GB unified memory, 6.84s audio input, 256 max new tokens, PyTorch 2.11.0, transformers 5.5.0.

Note: On Apple Silicon MPS, TurboQuant adds latency due to CPU-side quantization overhead. On CUDA GPUs, TurboQuant typically provides speedups by reducing memory bandwidth pressure, especially for long sequences.

Quick Start

Installation

pip install torch transformers librosa turboquant

Inference

# Baseline (no compression)
python inference.py --audio path/to/audio.wav

# With TurboQuant 4-bit KV cache compression
python inference.py --audio path/to/audio.wav --bits 4

# With TurboQuant 2-bit KV cache compression
python inference.py --audio path/to/audio.wav --bits 2

# Custom prompt
python inference.py --audio path/to/audio.wav --prompt "Summarize this audio:"

# CPU inference
python inference.py --audio path/to/audio.wav --device cpu --bits 4

Python API

from inference import run_inference

# Basic transcription
text = run_inference("audio.wav")

# With 4-bit KV cache compression
text = run_inference("audio.wav", bits=4)

# With 2-bit compression on CPU
text = run_inference("audio.wav", bits=2, device="cpu")

Compatibility Notes

This integration includes patches for running MERaLiON-2-3B with recent library versions:

  • transformers >= 5.x: The original model code references the HybridCache class, which recent transformers releases removed. The inference script provides a compatibility shim.
  • NumPy >= 2.0: TurboQuant calls np.trapz, which NumPy 2.0 renamed to np.trapezoid. The inference script patches this automatically.
  • Weight tying: The model uses tied embeddings (lm_head.weight = embed_tokens.weight). The patched model code handles this correctly during loading.
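As a sketch of the NumPy shim described above (the actual patch in inference.py may differ in detail), restoring the old name as an alias is enough to keep TurboQuant working on both NumPy 1.x and 2.x:

```python
# np.trapz was removed in NumPy 2.0 in favour of np.trapezoid; older code
# that still calls np.trapz can be kept working with a one-line alias.
import numpy as np

if not hasattr(np, "trapz"):
    # NumPy >= 2.0: restore the old name as an alias for the renamed function.
    np.trapz = np.trapezoid

area = np.trapz([0.0, 1.0, 2.0], dx=1.0)  # integrates y = x over [0, 2]
print(area)  # → 2.0
```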

Files

  • inference.py - Main inference script with CLI interface
  • benchmark.py - Benchmark script comparing baseline vs TurboQuant
  • benchmark_results.json - Raw benchmark data
  • turboquant_config.json - Configuration and benchmark metadata

Architecture

Audio Input
    |
    v
[Whisper-large-v3 Encoder] --> [Audio Adapter (MLP)] --> [Gemma-2-2B-IT Decoder]
                                                              |
                                                              v
                                                    [TurboQuant KV Cache]
                                                              |
                                                              v
                                                        Text Output

TurboQuant operates on the decoder's KV cache during autoregressive generation. It quantizes key and value tensors to the specified bit width (2, 3, or 4 bits) using MSE-optimal quantization, reducing memory footprint while preserving generation quality.
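The idea can be illustrated with a minimal uniform symmetric quantizer over the head dimension. This is only a sketch: TurboQuant's MSE-optimal quantization is more sophisticated than the uniform grid below, and the tensor shapes are assumptions for illustration:

```python
# Minimal b-bit symmetric quantization of a KV tensor along the last axis.
# Illustrative only; TurboQuant uses MSE-optimal codebooks, not this grid.
import numpy as np

def quantize_kv(x, bits=4):
    """Quantize along the last axis; returns int8 codes and per-row scales."""
    qmax = 2 ** (bits - 1) - 1                                  # e.g. 7 for 4-bit
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-8) / qmax
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def dequantize_kv(codes, scale):
    return codes.astype(scale.dtype) * scale

rng = np.random.default_rng(0)
k = rng.standard_normal((2, 8, 128))        # (kv_heads, seq_len, head_dim)
codes, scale = quantize_kv(k, bits=4)
err = np.abs(dequantize_kv(codes, scale) - k).mean()
```

Storing int8 codes plus one scale per row is what shrinks the cache; the mean reconstruction error stays small at 4 bits, which matches the benchmark's "identical to baseline" output.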

Citation

@misc{meralion2-turboquant,
  title={MERaLiON-2-3B with TurboQuant KV Cache Compression},
  year={2026},
  url={https://huggingface.co/majentik/MERaLiON-2-3B-TurboQuant}
}

License

This repository contains inference scripts only. The base model (MERaLiON-2-3B) is subject to its own license terms. See the original model card for details.
