---
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
language: et
model-index:
- name: TalTechNLP/whisper-medium-et
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11
      type: mozilla-foundation/common_voice_11_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 14.66
    - name: Test CER
      type: cer
      value: 3.76
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 13.793
    - name: Test CER
      type: cer
      value: 3.194
---

# Whisper-medium-et

This is a Whisper-medium model ([openai/whisper-medium](https://huggingface.co/openai/whisper-medium)) fine-tuned on around 800 hours of diverse Estonian data.

## Model description
This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

## Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.

## How to use

Use it like any other Whisper model via Hugging Face Transformers, or use a faster decoder such as [faster-whisper](https://github.com/guillaumekln/faster-whisper).
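
A minimal sketch of the Transformers route, assuming the `transformers` library with a PyTorch backend and an audio decoder such as ffmpeg are installed. The helper name `transcribe`, the 30-second chunking, and the `generate_kwargs` values are illustrative assumptions rather than settings documented for this model; if the checkpoint's generation config does not define language mappings, drop the `language` and `task` entries.

```python
def transcribe(audio_path: str) -> str:
    """Transcribe one audio file with TalTechNLP/whisper-medium-et."""
    # Imported lazily so that defining this helper neither requires
    # transformers to be installed nor downloads any model weights.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="TalTechNLP/whisper-medium-et",
        chunk_length_s=30,  # assumption: process long recordings in 30 s windows
    )
    result = asr(
        audio_path,
        # Assumptions: force Estonian transcription and use greedy decoding
        # (beam size 1).
        generate_kwargs={"language": "et", "task": "transcribe", "num_beams": 1},
    )
    return result["text"]
```

Calling `transcribe("sample_et.wav")`, where `sample_et.wav` is a hypothetical local recording, returns the recognized text as a string. The first call downloads the model weights, so it can take a while.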

#### Limitations and bias

Since this model was trained on mostly broadcast speech and texts from the web, it might have problems correctly decoding the following:
* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech

## Training data
Acoustic training data:

| Type                  | Amount (h) |
|-----------------------|:------:|
| Broadcast speech      | 591    |
| Spontaneous speech    | 53     |
| Elderly speech corpus | 53     |
| Talks, lectures       | 49     |
| Parliament speeches   | 31     |
| *Total*               | *761*  |

## Training procedure

Fine-tuned using ESPnet, and then converted to the Transformers format using [this](https://gist.github.com/alumae/2dcf473b667cec9d513b80ea24e94672) script.
The fine-tuning procedure is similar to that of [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model.

## Evaluation results

### WER

WER results below are obtained using greedy decoding (i.e., beam size 1).

| Dataset | WER (%) |
|---|---|
| Common Voice 8.0 | 13.8 |
| Common Voice 11.0 | 14.7 |
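
For readers unfamiliar with the metric, here is a minimal, dependency-free sketch of how word error rate (and the character error rate listed in the model metadata) is commonly computed from an edit distance. It only illustrates the definition. The scoring tool and text normalization used to produce the numbers above are not stated in this card, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("tere tulemast koju", "tere tulemas koju")` counts one substituted word out of three reference words. Corpus-level scores are usually obtained by summing the errors and the reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.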