Whisper-Medium fine-tuned for reverberant speech (Whisper-RIR-Mega)

Use this model when transcribing speech recorded in reverberant or “roomy” conditions (meetings, lectures, far-field microphones). It is trained specifically on reverberant data while matching the WER of base Whisper-medium on clean and reverberant benchmarks.

This model is a fine-tuned version of openai/whisper-medium on the Whisper-RIR-Mega dataset for ASR robustness to room reverberation. One-line load; no PEFT needed.

Quick usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("mandipgoswami/whisper-medium-rirmega")
model = WhisperForConditionalGeneration.from_pretrained("mandipgoswami/whisper-medium-rirmega")

# Load audio resampled to Whisper's expected 16 kHz
audio, sr = librosa.load("path/to/reverberant_audio.wav", sr=16000)
# Convert to log-mel input features, then decode
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features)
transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcript)
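Note that Whisper models operate on 30-second windows, so longer recordings (e.g. full meetings) are typically split into chunks and transcribed piece by piece. A minimal sketch of such a split is below; this simple non-overlapping chunker is an illustration only — production pipelines often use overlapping windows or voice-activity-based segmentation instead.

```python
import numpy as np

def chunk_audio(audio, sr=16000, chunk_seconds=30):
    """Split a waveform into consecutive fixed-length windows.

    Whisper consumes 30-second log-mel windows, so longer recordings
    can be split and each chunk passed to processor/model.generate
    as in the snippet above.
    """
    samples_per_chunk = sr * chunk_seconds
    return [audio[i:i + samples_per_chunk]
            for i in range(0, len(audio), samples_per_chunk)]

# Example: 75 s of audio splits into 30 s + 30 s + 15 s chunks
dummy = np.zeros(16000 * 75, dtype=np.float32)
chunks = chunk_audio(dummy)
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 15.0]
```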

When to use

  • Reverberant or room-recorded speech (meetings, lectures, far-field).
  • You want English ASR with the same ease as base Whisper (single from_pretrained).
  • You care about robustness to room acoustics without losing clean-speech quality.

Training

  • Base model: openai/whisper-medium
  • Dataset: Whisper-RIR-Mega (reverberant speech with clean transcripts)
  • Epochs: 4
  • Learning rate: 8e-06
  • Effective batch size: 16 (2 × 8 gradient accumulation)
  • Precision: BF16/FP16 mixed precision
  • Gradient checkpointing: Enabled
  • Hardware: Single NVIDIA RTX 5080 (16 GB)
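For reference, the hyperparameters above map naturally onto the standard Hugging Face `Seq2SeqTrainingArguments` field names. The mapping below is a hypothetical sketch (the actual training script is not published); the field names are the standard `transformers` argument names, not confirmed from the source.

```python
# Hypothetical mapping of the reported hyperparameters onto
# transformers.Seq2SeqTrainingArguments fields (illustration only;
# the actual training script is not published).
training_config = {
    "num_train_epochs": 4,
    "learning_rate": 8e-6,
    "per_device_train_batch_size": 2,   # 2 x 8 accumulation = 16 effective
    "gradient_accumulation_steps": 8,
    "bf16": True,                       # or fp16=True, per the mixed-precision note
    "gradient_checkpointing": True,     # trades compute for memory on a 16 GB GPU
}

effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"])
print(effective_batch)  # 16
```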

Evaluation

| Dataset          | Split | WER    |
|------------------|-------|--------|
| Whisper-RIR-Mega | test  | 0.0430 |
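A test WER of 0.0430 corresponds to roughly 4.3 errors (substitutions, insertions, and deletions) per 100 reference words. A minimal WER implementation is sketched below for illustration; in practice, libraries such as `jiwer` or `evaluate` are typically used.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance.

    WER = (substitutions + insertions + deletions) / reference word count.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words
```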

Limitations

English only. The model was fine-tuned on 400 reverberant samples, so it performs best in acoustic conditions similar to the Whisper-RIR-Mega benchmark.

Citation

If you use this model, please cite:

@article{goswami2026whisperrirmega,
  title={Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics},
  author={Goswami, Mandip},
  journal={arXiv preprint arXiv:2603.02252},
  year={2026}
}