---
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
language: et
model-index:
- name: TalTechNLP/whisper-medium-et
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11
      type: mozilla-foundation/common_voice_11_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 14.66
    - name: Test CER
      type: cer
      value: 3.76
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 13.793
    - name: Test CER
      type: cer
      value: 3.194
---


	# Whisper-medium-et

	This is a Whisper-medium model ([openai/whisper-medium](https://huggingface.co/openai/whisper-medium)) finetuned on around 800 hours of diverse Estonian data.

	## Model description
	This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.


	## Intended uses & limitations

	This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.

	## How to use

	Use it like any other Whisper model via Hugging Face transformers, or with a faster decoder such as [faster-whisper](https://github.com/guillaumekln/faster-whisper).
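As a minimal sketch of the transformers route, the snippet below loads the model through the generic `pipeline` API. It assumes `transformers`, a PyTorch backend, and `ffmpeg` (for decoding audio files) are installed. The helper names `build_pipeline_kwargs` and `transcribe` and the 30-second chunk length are illustrative choices, not part of this model card.

```python
import sys

MODEL_ID = "TalTechNLP/whisper-medium-et"


def build_pipeline_kwargs(model_id=MODEL_ID, chunk_length_s=30):
    """Collect the arguments passed to transformers.pipeline (hypothetical helper)."""
    return {
        "task": "automatic-speech-recognition",
        "model": model_id,
        # Split long recordings into 30 s windows; an assumed default,
        # matching Whisper's native input length.
        "chunk_length_s": chunk_length_s,
    }


def transcribe(audio_path):
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so the module can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline(**build_pipeline_kwargs())
    return asr(audio_path)["text"]


if __name__ == "__main__" and len(sys.argv) > 1:
    print(transcribe(sys.argv[1]))
```

Saved as a script (for example under the hypothetical name `transcribe.py`), it could be run as `python transcribe.py recording.wav`.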


	## Limitations and bias

	Since this model was trained on mostly broadcast speech and texts from the web, it might have problems correctly decoding the following:
	* Speech containing technical and other domain-specific terms
	* Children's speech
	* Non-native speech
	* Speech recorded under very noisy conditions or with a microphone far from the speaker
	* Very spontaneous and overlapping speech

	## Training data
	Acoustic training data:

| Type                  | Amount (h) |
|-----------------------|:----------:|
| Broadcast speech      | 591        |
| Spontaneous speech    | 53         |
| Elderly speech corpus | 53         |
| Talks, lectures       | 49         |
| Parliament speeches   | 31         |
| Total                 | 761        |



	## Training procedure

	Finetuned using ESPnet and then converted to the transformers format using [this script](https://gist.github.com/alumae/2dcf473b667cec9d513b80ea24e94672).
	The finetuning procedure is similar to that of [this model](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100).

	## Evaluation results

	### WER

	WER results below are obtained using greedy decoding (i.e., beam size 1).

| Dataset           | WER  |
|-------------------|------|
| Common Voice 8.0  | 13.8 |
| Common Voice 11.0 | 14.7 |
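For reference, WER is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. CER is the same quantity over characters. The following is a minimal sketch of that computation. The function names are illustrative, and the text normalization used for the reported scores is not stated here, so this sketch applies none beyond whitespace splitting.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent; unit is "word" (WER) or "char" (CER)."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = ref.split() if unit == "word" else list(ref)
        hyp_tokens = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return 100.0 * errors / total


if __name__ == "__main__":
    refs = ["tere tulemast tallinna"]
    hyps = ["tere tulemast tartu"]
    print(f"WER: {error_rate(refs, hyps):.1f}")
    print(f"CER: {error_rate(refs, hyps, unit='char'):.1f}")
```

Errors and reference lengths are summed over the whole test set before dividing, so long utterances weigh more than short ones. This is the usual convention for corpus-level WER.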