---
title: Whisper CTC-DRO ASR model - set 4
language: multilingual
tags:
- asr
- whisper
- whisper-dro
- seq2seq
license: apache-2.0
---
# Whisper CTC-DRO ASR model - set 4
This repository contains an automatic speech recognition (ASR) model fine-tuned from `openai/whisper-large-v3` using the principles of CTC-DRO applied to Whisper's seq2seq architecture.
The model was trained on balanced training data from set 4 (slv, snd, spa, urd).
DRO hyperparameters: `eta=1e-3`, `alpha=0.1`, aggregation: `mean`
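The card does not spell out the update rule these hyperparameters control. As a rough sketch, group-DRO-style training re-weights per-language (per-group) losses each step: `eta` is the step size of an exponentiated-gradient update on the group weights, and `alpha` smooths the weights toward uniform so no group is starved. The helper below is an illustrative assumption, not the published CTC-DRO implementation:

```python
import math

def update_group_weights(weights, losses, eta=1e-3, alpha=0.1):
    """One exponentiated-gradient step on per-group weights.

    Groups with higher loss are up-weighted; alpha mixes in the
    uniform distribution so every group keeps nonzero weight.
    (Hypothetical helper -- the exact CTC-DRO update may differ.)
    """
    # Up-weight groups proportionally to exp(eta * loss)
    scaled = [w * math.exp(eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    normalized = [s / total for s in scaled]
    # Smooth toward the uniform distribution with mixing weight alpha
    k = len(weights)
    return [(1 - alpha) * w + alpha / k for w in normalized]
```

Under `aggregation: mean`, the training objective would then be the weighted mean of the per-group losses under these weights.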
## Intended Use
This model is intended for multilingual ASR. Users can run inference using the Hugging Face Transformers library:
```python
import torch
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the fine-tuned model and its processor
model = WhisperForConditionalGeneration.from_pretrained("bartelds/whisper-dro-set4-dro")
processor = WhisperProcessor.from_pretrained("bartelds/whisper-dro-set4-dro")
model.eval()

# Load audio at Whisper's expected 16 kHz sampling rate
audio, sr = librosa.load("input.wav", sr=16000)
inputs = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(input_features=inputs.input_features)

text = processor.tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print("Recognized text:", text)
```
## How to Use
- Install dependencies:

  ```bash
  pip install transformers torch librosa
  ```

- Load the model and processor using `from_pretrained()` as shown above.
- The model supports multilingual transcription -- see the training repository for evaluation details.
## Training
- Base model: `openai/whisper-large-v3`
- Training code: whisper-dro
- Paper: CTC-DRO