MERaLiON-2-10B + TurboQuant KV Cache Compression

Integration of TurboQuant KV cache compression with MERaLiON-2-10B, a 10B-parameter speech-language model built on a Whisper encoder and Gemma-2-9b-IT decoder.

TurboQuant compresses the Gemma-2 decoder's KV cache at inference time using learned quantization codebooks. No model retraining is required; the compressed cache is a drop-in replacement for the past_key_values argument to generate().
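TurboQuant's learned codebooks are not reproduced here, but the round-trip idea underlying any quantized KV cache can be sketched with plain scalar quantization. The function names below are illustrative, not TurboQuant's API:

```python
def quantize(values, bits):
    """Symmetric scalar quantization: map floats to signed integers of `bits` width."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 1 for 2-bit
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate floats from quantized integers."""
    return [x * scale for x in q]
```

Lower bit widths shrink the cache more but widen the reconstruction error, which is why the 2-bit results below diverge slightly from baseline while 4-bit matches it.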

Benchmark Results

Tested on Apple M4 Max (128GB unified memory), MPS backend, bfloat16, with a 6.84s audio clip.

Natural Generation (greedy, model-determined length)

| Configuration          | Tokens | Time (s) | Tok/s | KV Memory | Savings |
|------------------------|--------|----------|-------|-----------|---------|
| Baseline (fp16 cache)  | 9      | 1.85     | 4.9   | 353 MB    | --      |
| TurboQuant 4-bit       | 9      | 1.35     | 6.7   | 201 MB    | 1.76x   |
| TurboQuant 2-bit       | 9      | 1.22     | 7.4   | 201 MB    | 1.76x   |
  • 4-bit output matches baseline exactly
  • 2-bit shows minor divergence (expected at lower precision)
  • 1.38x speedup (4-bit) / 1.52x speedup (2-bit) for short sequences

Forced Long Generation (128+ tokens)

| Configuration          | Tokens | Time (s) | Tok/s | KV Memory | Savings |
|------------------------|--------|----------|-------|-----------|---------|
| Baseline               | 129    | 59.6     | 2.2   | 7,805 MB  | --      |
| TurboQuant 4-bit       | 130    | 170.5    | 0.8   | 3,985 MB  | 1.96x   |
| TurboQuant 2-bit       | 130    | 323.6    | 0.4   | 3,985 MB  | 1.96x   |

For longer sequences, TurboQuant's quantize/dequantize overhead on MPS exceeds the memory-bandwidth savings, so generation is slower than baseline. The primary benefit at longer context is the 1.96x KV cache memory reduction. On CUDA GPUs, speed improvements are expected to scale better.
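As a sanity check on the memory column, per-token KV size can be estimated from the decoder's shape. The sketch below assumes Gemma-2-9B's published configuration (42 layers, 8 KV heads, head dim 256); these numbers are assumptions about the decoder, not values from this repository:

```python
def kv_bytes_per_token(layers=42, kv_heads=8, head_dim=256, bytes_per_elem=2):
    # Per layer, the cache holds one key and one value tensor
    # of shape (kv_heads, head_dim) per token; bf16/fp16 is 2 bytes per element.
    return layers * 2 * kv_heads * head_dim * bytes_per_elem

per_tok_mb = kv_bytes_per_token() / 1024**2   # ~0.33 MB per cached position
```

At roughly 0.33 MB per position, the 353 MB baseline cache implies on the order of a thousand cached positions, consistent with the audio prompt expanding into many speech tokens before the 9 generated text tokens.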

Usage

import numpy as np
if not hasattr(np, "trapz"):
    np.trapz = np.trapezoid  # numpy 2.x compat

import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from turboquant import TurboQuantCache

model_id = "MERaLiON/MERaLiON-2-10B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto",
    trust_remote_code=True, attn_implementation="eager"
)

# Prepare audio inputs
import soundfile as sf, librosa
audio, sr = sf.read("audio.wav")
audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
inputs = processor(
    text="<SpeechHere> Transcribe the audio.",
    audios=[audio], sampling_rate=16000, padding=True
)
# ... move to device, add batch dim, cast to bf16 ...

# Generate with TurboQuant 4-bit KV cache
cache = TurboQuantCache(bits=4)
output = model.generate(**inputs, max_new_tokens=256, past_key_values=cache)

See inference.py for a complete working script.

CLI

python inference.py --audio speech.wav                    # 4-bit TurboQuant (default)
python inference.py --audio speech.wav --bits 2           # 2-bit TurboQuant
python inference.py --audio speech.wav --no-turboquant    # baseline

Requirements

pip install torch transformers turboquant soundfile librosa scipy

Compatibility Notes

MERaLiON-2-10B's custom modeling code was written for an older version of transformers. Running on transformers 5.5+ requires several patches to the cached model files:

  1. HybridCache import -- removed in transformers 5.x (replaced with sentinel class)
  2. pad_token_id access -- not set in MERaLiON2Config (use getattr fallback)
  3. tie_weights() signature -- new recompute_mapping kwarg (accept **kwargs)
  4. cache_position in prepare_inputs_for_generation -- can be None in new generate flow

The inference script's import mechanism applies these patches to the Hugging Face cached model files automatically. See turboquant_config.json for details.
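Patches 2 and 3 reduce to two small defensive-coding patterns. The sketch below uses hypothetical stand-in classes (MERaLiON2ConfigStub and DecoderStub are illustrative names, not the real classes):

```python
class MERaLiON2ConfigStub:
    """Hypothetical stand-in for the real config, which lacks pad_token_id."""
    eos_token_id = 1

cfg = MERaLiON2ConfigStub()
# Patch 2: read pad_token_id with a getattr fallback instead of direct access.
pad_token_id = getattr(cfg, "pad_token_id", None)
if pad_token_id is None:
    pad_token_id = cfg.eos_token_id

class DecoderStub:
    """Hypothetical stand-in for the model class."""
    # Patch 3: accept **kwargs so newer transformers can pass recompute_mapping.
    def tie_weights(self, **kwargs):
        return None
```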

Architecture

Audio (WAV) --> Whisper Encoder (large-v3 scale) --> Speech Adapter --> Gemma-2-9b-IT Decoder
                                                                        ^
                                                                        |
                                                              TurboQuant KV Cache
                                                              (2-bit or 4-bit compression)

Files

  • inference.py -- Complete inference script with CLI
  • turboquant_config.json -- Configuration and benchmark data
  • benchmark_results.json -- Raw benchmark results

Citation

@article{meralion2,
  title={MERaLiON-AudioLLM: Technical Report},
  author={MERaLiON Team},
  year={2024}
}