Fine-Tuned Whisper-small Model for French ASR

This model is a fine-tuned version of openai/whisper-small, trained on the French portion of the CV17 (Common Voice 17) dataset.

Live demo

Click here (press restart to run the space)

You then have two options: upload a French audio file, or record yourself speaking French by clicking the microphone icon and then the orange dot.
Click Submit and the model will output the transcription.

Performance and Evaluation

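WER (Word Error Rate): The number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of words in the reference.
CER (Character Error Rate): The same ratio computed over characters instead of words.
Scores in the tables below are reported as fractions rather than percentages (for example, 0.1648 means 16.48%).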
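
The two metrics above can be sketched in plain Python as follows. This is only an illustrative implementation, not the evaluation script used for the tables below; the helper names (`edit_distance`, `wer`, `cer`) are hypothetical, and no text normalization (casing, punctuation) is applied because the source does not say which normalization was used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, deletions, and insertions
    needed to turn `hyp` into `ref`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "le chat dort sur le tapis"
hypothesis = "le chat dors sur tapis"
print(f"WER = {wer(reference, hypothesis):.4f}")
print(f"CER = {cer(reference, hypothesis):.4f}")
```

Because insertions count as errors, both metrics can exceed 1.0 when the prediction is much longer than the reference.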

Test Set: CV17 (16k samples)

Model	WER (lower is better)	CER (lower is better)
Whisper Small (baseline)	0.3405	0.1680
Whisper Medium (baseline)	0.2597	0.1264
My Model	0.1648	0.0676

Test Set: MLS (2426 samples)

Model	WER (lower is better)	CER (lower is better)
Whisper Small (baseline)	0.3271	0.1066
Whisper Medium (baseline)	0.2974	0.0919
My Model	0.3269	0.1013
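
To put the two tables side by side, the relative error reduction over the Whisper Small baseline can be computed directly from the reported scores. The snippet below is a small arithmetic sketch; the values are copied from the tables above, and the names `SCORES`, `REDUCTIONS`, and `relative_reduction` are illustrative only.

```python
# Scores copied from the two tables above, as (WER, CER) fractions.
SCORES = {
    "CV17": {"small": (0.3405, 0.1680), "model": (0.1648, 0.0676)},
    "MLS": {"small": (0.3271, 0.1066), "model": (0.3269, 0.1013)},
}


def relative_reduction(baseline, model):
    """Fraction of the baseline error removed by the fine-tuned model."""
    return (baseline - model) / baseline


REDUCTIONS = {}
for test_set, rows in SCORES.items():
    (base_wer, base_cer), (model_wer, model_cer) = rows["small"], rows["model"]
    REDUCTIONS[test_set] = (
        relative_reduction(base_wer, model_wer),
        relative_reduction(base_cer, model_cer),
    )
    wer_red, cer_red = REDUCTIONS[test_set]
    print(f"{test_set}: WER reduced by {wer_red:.1%}, CER reduced by {cer_red:.1%}")
```

On these numbers the relative WER reduction is about 52% on CV17 but well under 1% on MLS, where Whisper Medium still has the lowest scores.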

Usage

import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="nambn0321/ASR_french_3", device=device)


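# Force French transcription instead of language auto-detection or translation.
# Note: newer transformers releases prefer passing
# generate_kwargs={"language": "fr", "task": "transcribe"} to the pipeline call.
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Load data. This is only an example; when you load your own data,
# use torchaudio or librosa to load the audio into the dataset.
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]  # dict with "array" and "sampling_rate" keys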

# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]  # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"]  # beam search
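
For your own recordings, the comment above suggests loading the audio with torchaudio or librosa. A minimal sketch of that step with librosa is shown below. The helper names `to_pipeline_input` and `transcribe_file` and the path `my_recording.wav` are hypothetical, and the 16 kHz target is an assumption based on the sampling rate Whisper models expect.

```python
import numpy as np

TARGET_SR = 16_000  # assumption: Whisper models expect 16 kHz mono audio


def to_pipeline_input(samples, sampling_rate):
    """Convert raw samples into the dict format the ASR pipeline accepts."""
    audio = np.asarray(samples)
    if np.issubdtype(audio.dtype, np.integer):
        # Scale integer PCM (e.g. int16) to floats in [-1.0, 1.0)
        scale = float(np.iinfo(audio.dtype).max) + 1.0
        audio = audio.astype(np.float32) / scale
    else:
        audio = audio.astype(np.float32)
    if audio.ndim == 2:
        # Assumes shape (channels, samples); average the channels to mono
        audio = audio.mean(axis=0)
    return {"array": audio, "sampling_rate": int(sampling_rate)}


def transcribe_file(pipe, path):
    """Load a local audio file at 16 kHz and transcribe it with `pipe`."""
    import librosa  # imported lazily so the helper above works without it

    samples, sr = librosa.load(path, sr=TARGET_SR, mono=True)
    return pipe(to_pipeline_input(samples, sr))["text"]


# Example (hypothetical path), reusing the `pipe` object created above:
# print(transcribe_file(pipe, "my_recording.wav"))
```

Resampling at load time keeps the input consistent whatever the original recording's sampling rate was.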