Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
Paper: arXiv:2603.02252
Use this model when transcribing speech recorded in reverberant or “roomy” conditions (meetings, lectures, far-field microphones). It matches the WER of the base Whisper-medium on clean/reverberant benchmarks while being trained specifically on reverberant data.
This model is a fine-tuned version of openai/whisper-medium on the Whisper-RIR-Mega dataset for ASR robustness to room reverberation. One-line load; no PEFT needed.
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the processor (feature extractor + tokenizer) and the fine-tuned model.
processor = WhisperProcessor.from_pretrained("mandipgoswami/whisper-medium-rirmega")
model = WhisperForConditionalGeneration.from_pretrained("mandipgoswami/whisper-medium-rirmega")

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("path/to/reverberant_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features

# Generate token IDs and decode them to text.
predicted_ids = model.generate(input_features)
transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcript)
```
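Whisper's feature extractor works on windows of up to 30 seconds, so the snippet above is best suited to short clips. For longer recordings such as meetings or lectures, one option is to split the waveform into windows and transcribe each in turn. The helper below is a minimal sketch of that splitting step only; `chunk_bounds` and its parameters are hypothetical names introduced for illustration and are not part of this model or of `transformers`.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_s=30.0, overlap_s=0.0):
    """Return (start, end) sample indices that cover a recording in
    windows of at most chunk_s seconds (hypothetical helper)."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    chunk = int(round(chunk_s * sample_rate))
    step = chunk - int(round(overlap_s * sample_rate))
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the recording
        start += step
    return bounds


# A 70-second recording at 16 kHz splits into three windows.
print(chunk_bounds(70 * 16000))
# [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each `(start, end)` pair can then be used to slice the loaded array (`audio[start:end]`) before passing it to the processor as in the snippet above. The `transformers` automatic-speech-recognition pipeline also accepts a `chunk_length_s` argument, which may be a more convenient route.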
| Dataset | Split | WER |
|---|---|---|
| Whisper-RIR-Mega | test | 0.0430 |
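For reference, word error rate (WER) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words, so 0.0430 read as a fraction corresponds to 4.30%. The sketch below shows the standard computation in plain Python. The text normalization used for the reported number is not stated here, so lowercasing and whitespace splitting are assumptions, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row of the DP table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Corpus-level WER is usually computed by summing edit counts and reference lengths over all utterances before dividing, rather than by averaging per-utterance scores.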
English only. Trained on 400 reverberant samples; best used in conditions similar to the Whisper-RIR-Mega benchmark.
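To probe how close your own recordings are to such conditions, a reverberant counterpart of a clean clip can be approximated by convolving it with a room impulse response (RIR). The sketch below is a toy illustration in plain Python and is not a description of how the benchmark itself was built; `synthetic_rir`, `convolve`, and the decaying-noise RIR model are assumptions made for the example.

```python
import math
import random


def synthetic_rir(sample_rate=16000, rt60_s=0.5, length_s=0.6, seed=0):
    """Toy RIR: a unit direct path followed by exponentially decaying noise."""
    rng = random.Random(seed)
    n = int(sample_rate * length_s)
    # The amplitude envelope falls by 60 dB (a factor of 1000) after rt60_s seconds.
    decay = math.log(1000.0) / (rt60_s * sample_rate)
    rir = [0.1 * rng.gauss(0.0, 1.0) * math.exp(-decay * t) for t in range(n)]
    rir[0] = 1.0  # direct sound
    return rir


def convolve(signal, kernel):
    """Direct (full) linear convolution; fine for short toy signals."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


# Small demo at a low sample rate so the pure-Python loop stays fast.
clean = [1.0, 0.0, 0.0, 0.5]
rir = synthetic_rir(sample_rate=1000, rt60_s=0.1, length_s=0.2)
reverberant = convolve(clean, rir)
print(len(clean), len(rir), len(reverberant))
```

For real audio, an FFT-based routine such as `scipy.signal.fftconvolve` with a measured RIR would be far more practical than this direct loop.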
If you use this model, please cite:
@article{goswami2026whisperrirmega,
  title={Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics},
  author={Goswami, Mandip},
  journal={arXiv preprint arXiv:2603.02252},
  year={2026}
}
Base model: openai/whisper-medium