| --- |
| license: mit |
| datasets: |
| - facebook/multilingual_librispeech |
| language: |
| - fr |
| base_model: |
| - openai/whisper-small |
| pipeline_tag: automatic-speech-recognition |
| --- |
| |
| # Fine-Tuned Whisper-small Model for French ASR |
|
|
| This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small), trained on the French subset of the [Common Voice 17.0 (CV17) dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0). |
|
|
| # Live demo |
| Open the [demo Space](https://huggingface.co/spaces/nambn0321/ASR_french) (press restart to run the Space). |
|
| - Upload a French audio file, or record yourself speaking French by clicking the mic and then the orange dot. |
| - Hit submit and the model will output the transcription. |
|
|
| # Performance and Evaluation |
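| - **WER (Word Error Rate):** The share of words transcribed incorrectly (substitutions, deletions, and insertions, relative to the number of reference words). |
| - **CER (Character Error Rate):** The same measure computed over characters instead of words. |
| - Scores in the tables below are reported as fractions, so 0.1648 corresponds to 16.48%. |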
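
To make the two metrics concrete, here is a minimal sketch that computes them with a plain Levenshtein edit distance. It is an illustration only: the evaluation code and the text normalization (casing, punctuation) behind the scores in this card are not shown here, and the function names and example sentences are made up for this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example pair, not taken from any test set
reference = "le chat dort sur le lit"
hypothesis = "le chat dors sur lit"

print(f"WER = {wer(reference, hypothesis):.4f}")  # 2 word errors / 6 reference words
print(f"CER = {cer(reference, hypothesis):.4f}")  # 4 character edits / 23 reference characters
```

Because insertions count as errors, both metrics can exceed 1.0 when the hypothesis is much longer than the reference.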
|
|
| ### **Test Set: CV17 (16k samples)** |
| | Model | WER (lower is better) | CER (lower is better) | |
| |----------------------------|---------------------------|-------------------------| |
| | [**Whisper Small** (baseline)](https://huggingface.co/openai/whisper-small) | 0.3405 | 0.1680 | |
| | [**Whisper Medium** (baseline)](https://huggingface.co/openai/whisper-medium) | 0.2597 | 0.1264 | |
| | **My Model** | 0.1648 | 0.0676 | |
|
|
| ### **Test Set: MLS (2426 samples)** |
| | Model | WER (lower is better) | CER (lower is better) | |
| |----------------------------|---------------------------|-------------------------| |
| | [**Whisper Small** (baseline)](https://huggingface.co/openai/whisper-small) | 0.3271 | 0.1066 | |
| | [**Whisper Medium** (baseline)](https://huggingface.co/openai/whisper-medium) | 0.2974 | 0.0919 | |
| | **My Model** | 0.3269 | 0.1013 | |
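
To put the tables above in perspective, the short sketch below computes the relative WER reduction of the fine-tuned model against the Whisper Small baseline on each test set. The numbers are copied from the tables; the helper name `relative_reduction` is just an illustrative choice.

```python
def relative_reduction(baseline, model):
    """Fractional drop in error rate relative to a baseline (positive means better)."""
    return (baseline - model) / baseline


# WER values copied from the two tables above
wer_scores = {
    "CV17": {"whisper-small": 0.3405, "fine-tuned": 0.1648},
    "MLS": {"whisper-small": 0.3271, "fine-tuned": 0.3269},
}

for test_set, scores in wer_scores.items():
    r = relative_reduction(scores["whisper-small"], scores["fine-tuned"])
    print(f"{test_set}: {r:.1%} relative WER reduction vs. Whisper Small")
```

On these numbers the gain is large on CV17 (roughly half the baseline's word errors are removed), while on MLS the fine-tuned model is essentially level with the Whisper Small baseline.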
|
|
|
|
| # Usage |
| ```python |
| import torch |
| |
| from datasets import load_dataset |
| from transformers import pipeline |
| |
| device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") |
| |
| # Load pipeline |
| pipe = pipeline("automatic-speech-recognition", model="nambn0321/ASR_french_3", device=device) |
| |
| # Force French transcription (rather than language auto-detection or translation) |
| pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe") |
| |
| # Load data. This streams one example from Common Voice; for your own files, load the audio with torchaudio or librosa first. |
| ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True) |
| test_segment = next(iter(ds_mcv_test)) |
| waveform = test_segment["audio"] |
| |
| # Run |
| generated_sentences = pipe(waveform, max_new_tokens=225)["text"] # greedy |
| # generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"] # beam search |
| ``` |
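
For your own recordings, the pipeline also accepts a dict with `array` and `sampling_rate` keys, which is the same shape as the `audio` field used above. The following is a minimal sketch, assuming a 16-bit PCM WAV file and using only the standard-library `wave` module plus NumPy instead of torchaudio or librosa. The helper name `wav_to_pipeline_input` and the file name `my_recording.wav` are hypothetical.

```python
import wave

import numpy as np


def wav_to_pipeline_input(path):
    """Read a 16-bit PCM WAV file into a dict with 'array' and 'sampling_rate' keys."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        sampling_rate = wf.getframerate()
        n_channels = wf.getnchannels()
        frames = wf.readframes(wf.getnframes())

    # Interpret the raw bytes as little-endian int16 and scale to [-1.0, 1.0)
    samples = np.frombuffer(frames, dtype="<i2").astype(np.float32) / 32768.0

    if n_channels > 1:
        # Samples are interleaved per frame; average the channels down to mono
        samples = samples.reshape(-1, n_channels).mean(axis=1)

    return {"array": samples, "sampling_rate": sampling_rate}


# Usage with the pipeline defined above (file name is a placeholder):
# text = pipe(wav_to_pipeline_input("my_recording.wav"), max_new_tokens=225)["text"]
```

Whisper models expect 16 kHz audio. Passing the file's real `sampling_rate` lets the pipeline resample when the rates differ, which to my knowledge requires torchaudio to be installed. Resampling the file to 16 kHz beforehand avoids that dependency.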