	---
	base_model: facebook/w2v-bert-2.0
	language:
	- uk
	tags:
	- automatic-speech-recognition
	datasets:
	- mozilla-foundation/common_voice_10_0
	metrics:
	- wer
	model-index:
	- name: w2v-bert-2.0-uk
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: common_voice_10_0
	      type: common_voice_10_0
	      config: uk
	      split: test
	      args: uk
	    metrics:
	    - name: WER
	      type: wer
	      value: 6.6
	    - name: CER
	      type: cer
	      value: 1.34
	license: apache-2.0
	---

	🚨🚨🚨 ATTENTION! 🚨🚨🚨

	Use the updated model instead: https://huggingface.co/Yehor/w2v-bert-uk-v2.1

	---

	# w2v-bert-uk `v1`

	## Community

	- Discord: https://bit.ly/discord-uds
	- Speech Recognition: https://t.me/speech_recognition_uk
	- Speech Synthesis: https://t.me/speech_synthesis_uk

	See other Ukrainian models: https://github.com/egorsmkv/speech-recognition-uk

	## Google Colab

	You can run this model using a Google Colab notebook: https://colab.research.google.com/drive/1QoKw2DWo5a5XYw870cfGE3dJf1WjZgrj?usp=sharing

	## Metrics

	- AM (F16):
	  - WER: 0.066 (6.6%)
	  - CER: 0.0134 (1.34%)
	- Word accuracy: 93.4%
	- Character accuracy: 98.7%
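
The card does not say how these numbers were computed. As a rough sketch, the snippet below uses the standard definitions: WER and CER are the edit distance between reference and hypothesis divided by the reference length, counted in words and characters respectively, and accuracy is taken as `1 - error rate` (an assumption, though it matches 93.4% = 100% - 6.6% above). The example sentences are made up, and the author's evaluation script may normalize text or aggregate over the test set differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example pair, not taken from the test set
reference = "добрий день як справи"
hypothesis = "добрий день як справа"

word_error = wer(reference, hypothesis)
char_error = cer(reference, hypothesis)
print(f"WER: {word_error:.3f}, word accuracy: {1 - word_error:.1%}")
print(f"CER: {char_error:.3f}, character accuracy: {1 - char_error:.1%}")
```

A single wrong character costs a whole word in WER but only one character in CER, which is why the CER above is so much lower than the WER.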

	## Hyperparameters

	This model was trained with the following hyperparameters on 2 RTX A4000 GPUs:

	```bash
	torchrun --standalone --nnodes=1 --nproc-per-node=2 ../train_w2v2_bert.py \
	    --custom_set ~/cv10/train.csv \
	    --custom_set_eval ~/cv10/test.csv \
	    --num_train_epochs 15 \
	    --tokenize_config . \
	    --w2v2_bert_model facebook/w2v-bert-2.0 \
	    --batch 4 \
	    --num_proc 5 \
	    --grad_accum 1 \
	    --learning_rate 3e-5 \
	    --logging_steps 20 \
	    --eval_step 500 \
	    --group_by_length \
	    --attention_dropout 0.0 \
	    --activation_dropout 0.05 \
	    --feat_proj_dropout 0.05 \
	    --feat_quantizer_dropout 0.0 \
	    --hidden_dropout 0.05 \
	    --layerdrop 0.0 \
	    --final_dropout 0.0 \
	    --mask_time_prob 0.0 \
	    --mask_time_length 10 \
	    --mask_feature_prob 0.0 \
	    --mask_feature_length 10
	```
	```
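
For anyone reproducing the run on different hardware, it can help to work out the effective batch size implied by the command. The arithmetic below assumes that `--batch` is the per-process batch size, which is the usual convention but is not confirmed here because `train_w2v2_bert.py` is not shown.

```python
# Values copied from the torchrun command above
num_processes = 2     # --nproc-per-node (one process per GPU)
per_device_batch = 4  # --batch (assumed to be per process)
grad_accum = 1        # --grad_accum

# Samples contributing to each optimizer step across all GPUs
effective_batch = num_processes * per_device_batch * grad_accum
print(f"Effective batch size: {effective_batch} samples per optimizer step")
```

Under that assumption, a single-GPU run could keep the same effective batch size by doubling either `--batch` or `--grad_accum`.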

	## Usage

	```python
	# pip install -U torch soundfile transformers

	import torch
	import soundfile as sf
	from transformers import AutoModelForCTC, Wav2Vec2BertProcessor

	# Config
	model_name = 'Yehor/w2v-bert-uk'
	device = 'cuda' if torch.cuda.is_available() else 'cpu'
	sampling_rate = 16_000

	# Load the model and its processor
	asr_model = AutoModelForCTC.from_pretrained(model_name).to(device)
	processor = Wav2Vec2BertProcessor.from_pretrained(model_name)

	paths = [
	    'sample1.wav',
	]

	# Read the audio files (sf.read does not resample, so they must already be 16 kHz)
	audio_inputs = []
	for path in paths:
	    audio_input, _ = sf.read(path)
	    audio_inputs.append(audio_input)

	# Transcribe the audio
	inputs = processor(audio_inputs, sampling_rate=sampling_rate).input_features
	features = torch.tensor(inputs).to(device)

	with torch.no_grad():
	    logits = asr_model(features).logits

	# Greedy decoding: take the most likely token for every frame
	predicted_ids = torch.argmax(logits, dim=-1)
	predictions = processor.batch_decode(predicted_ids)

	# Log results
	print('Predictions:')
	print(predictions)
	```
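
The `argmax` plus `batch_decode` step in the snippet above is greedy CTC decoding: take the best token for each frame, collapse consecutive repeats, then drop the blank token. The sketch below illustrates that rule in plain Python. The toy vocabulary and the choice of id 0 as the blank are hypothetical; the real mapping comes from the model's tokenizer, which also handles special tokens that are omitted here.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse consecutive repeated ids, then drop blanks (greedy CTC rule)."""
    tokens = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank
        if token_id != previous and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous = token_id
    return "".join(tokens)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank
vocab = {0: "<pad>", 1: "т", 2: "а", 3: "к"}

# One id per audio frame, as produced by an argmax over the logits
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 3, 0, 0]
print(ctc_greedy_decode(frame_ids, vocab))
```

A blank between two identical ids keeps them as separate characters, which is how CTC represents doubled letters.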

	## Cite this work

	```
	@misc {smoliakov_2025,
	author = { {Smoliakov} },
	title = { w2v-bert-uk (Revision e5a17ab) },
	year = 2025,
	url = { https://huggingface.co/Yehor/w2v-bert-uk },
	doi = { 10.57967/hf/4560 },
	publisher = { Hugging Face }
	}
	```