---
base_model: facebook/w2v-bert-2.0
language:
- uk
tags:
- automatic-speech-recognition
datasets:
- mozilla-foundation/common_voice_10_0
metrics:
- wer
model-index:
- name: w2v-bert-2.0-uk
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common_voice_10_0
      type: common_voice_10_0
      config: uk
      split: test
      args: uk
    metrics:
    - name: WER
      type: wer
      value: 6.6
    - name: CER
      type: cer
      value: 1.34
license: apache-2.0
---

🚨🚨🚨 **ATTENTION!** 🚨🚨🚨

**Use an updated model**: https://huggingface.co/Yehor/w2v-bert-uk-v2.1

---

# w2v-bert-uk `v1`

## Community

- **Discord**: https://bit.ly/discord-uds
- Speech Recognition: https://t.me/speech_recognition_uk
- Speech Synthesis: https://t.me/speech_synthesis_uk

See other Ukrainian models: https://github.com/egorsmkv/speech-recognition-uk

## Google Colab

You can run this model using a Google Colab notebook: https://colab.research.google.com/drive/1QoKw2DWo5a5XYw870cfGE3dJf1WjZgrj?usp=sharing

## Metrics

- AM (F16):
  - WER: 0.066 (6.6%)
  - CER: 0.0134 (1.34%)
  - Accuracy on words: 93.4%
  - Accuracy on chars: 98.7%
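
For readers unfamiliar with these metrics, the sketch below shows how WER and CER are conventionally defined: the Levenshtein edit distance between reference and hypothesis, divided by the reference length in words or characters. The accuracy figures above then correspond to `1 - WER` and `1 - CER`. This is only an illustration with hypothetical helper names and a made-up sentence pair; it is not the evaluation script used to produce the numbers in this card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example pair (not taken from the test set)
reference = 'привіт світ як справи'
hypothesis = 'привіт світ як справа'

print(f'WER: {wer(reference, hypothesis):.3f}')
print(f'CER: {cer(reference, hypothesis):.3f}')
print(f'Accuracy on words: {1 - wer(reference, hypothesis):.1%}')
```

In practice, a maintained library is usually a better choice than a hand-written distance function; the point here is only to make the definitions concrete.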

## Hyperparameters

This model was trained on 2 RTX A4000 GPUs with the following hyperparameters:

```bash
torchrun --standalone --nnodes=1 --nproc-per-node=2 ../train_w2v2_bert.py \
  --custom_set ~/cv10/train.csv \
  --custom_set_eval ~/cv10/test.csv \
  --num_train_epochs 15 \
  --tokenize_config . \
  --w2v2_bert_model facebook/w2v-bert-2.0 \
  --batch 4 \
  --num_proc 5 \
  --grad_accum 1 \
  --learning_rate 3e-5 \
  --logging_steps 20 \
  --eval_step 500 \
  --group_by_length \
  --attention_dropout 0.0 \
  --activation_dropout 0.05 \
  --feat_proj_dropout 0.05 \
  --feat_quantizer_dropout 0.0 \
  --hidden_dropout 0.05 \
  --layerdrop 0.0 \
  --final_dropout 0.0 \
  --mask_time_prob 0.0 \
  --mask_time_length 10 \
  --mask_feature_prob 0.0 \
  --mask_feature_length 10
```
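
As a rough guide to the command above, the effective batch size per optimizer step can be estimated from the batch size, the number of processes, and the gradient accumulation steps. The sketch assumes that `--batch` is a per-process value, which the training script (not shown here) would need to confirm; the function name is hypothetical.

```python
def effective_batch_size(per_process_batch, num_processes, grad_accum):
    """Samples contributing to one optimizer step (assumes a per-process batch)."""
    return per_process_batch * num_processes * grad_accum


# Values from the command above: --batch 4, --nproc-per-node=2, --grad_accum 1
print(effective_batch_size(per_process_batch=4, num_processes=2, grad_accum=1))  # 8
```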

## Usage

```python
# pip install -U torch soundfile transformers

import torch
import soundfile as sf
from transformers import AutoModelForCTC, Wav2Vec2BertProcessor

# Config
model_name = 'Yehor/w2v-bert-2.0-uk'
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
sampling_rate = 16_000

# Load the model
asr_model = AutoModelForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2BertProcessor.from_pretrained(model_name)

paths = [
  'sample1.wav',
]

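# Read the audio (the model expects 16 kHz mono input)
audio_inputs = []
for path in paths:
  audio_input, file_rate = sf.read(path)
  if file_rate != sampling_rate:
    raise ValueError(f'{path}: expected {sampling_rate} Hz, got {file_rate} Hz')
  audio_inputs.append(audio_input)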

# Transcribe the audio
inputs = processor(audio_inputs, sampling_rate=sampling_rate).input_features
features = torch.tensor(inputs).to(device)

with torch.no_grad():
  logits = asr_model(features).logits

predicted_ids = torch.argmax(logits, dim=-1)
predictions = processor.batch_decode(predicted_ids)

# Log results
print('Predictions:')
print(predictions)
```
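
The `argmax` plus `batch_decode` step above is greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python. The vocabulary, the blank id, and the function name are hypothetical; the real tokenizer additionally handles word delimiters and special tokens.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return ''.join(id_to_token[i] for i in collapsed if i != blank_id)


# Hypothetical vocabulary: id 0 plays the role of the CTC blank
vocab = {0: '<pad>', 1: 'т', 2: 'а', 3: 'к'}

# Per-frame argmax ids, as they might come out of torch.argmax(logits, dim=-1)
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 0]

print(ctc_greedy_decode(frame_ids, vocab))  # так
```

Note that a blank between two identical ids keeps both characters, which is how CTC represents doubled letters.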

## Cite this work

```
@misc {smoliakov_2025,
	author       = { {Smoliakov} },
	title        = { w2v-bert-uk (Revision e5a17ab) },
	year         = 2025,
	url          = { https://huggingface.co/Yehor/w2v-bert-uk },
	doi          = { 10.57967/hf/4560 },
	publisher    = { Hugging Face }
}
```