---
base_model: facebook/w2v-bert-2.0
language:
- uk
tags:
- automatic-speech-recognition
datasets:
- mozilla-foundation/common_voice_10_0
metrics:
- wer
model-index:
- name: w2v-bert-2.0-uk
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common_voice_10_0
      type: common_voice_10_0
      config: uk
      split: test
      args: uk
    metrics:
    - name: WER
      type: wer
      value: 6.6
    - name: CER
      type: cer
      value: 1.34
license: apache-2.0
---

🚨🚨🚨 **ATTENTION!** 🚨🚨🚨

**Use an updated model**: https://huggingface.co/Yehor/w2v-bert-uk-v2.1

---

# w2v-bert-uk `v1`

## Community

- **Discord**: https://bit.ly/discord-uds
- Speech Recognition: https://t.me/speech_recognition_uk
- Speech Synthesis: https://t.me/speech_synthesis_uk

See other Ukrainian models: https://github.com/egorsmkv/speech-recognition-uk

## Google Colab

You can run this model using a Google Colab notebook: https://colab.research.google.com/drive/1QoKw2DWo5a5XYw870cfGE3dJf1WjZgrj?usp=sharing

## Metrics

- AM (F16):
  - WER: 0.066 metric, 6.6%
  - CER: 0.013 metric, 1.34%
  - Accuracy on words: 93.4%
  - Accuracy on chars: 98.7%
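
The accuracy figures above are the complements of the error rates (100% minus WER or CER). As a rough illustration of how such numbers are typically obtained, here is a minimal sketch that computes WER and CER from an edit distance. The sentence pair is made up for the example, and the helper names (`edit_distance`, `wer`, `cer`) are hypothetical; the evaluation script used for this model, including any text normalization it applies, is not shown in this card and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: char-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative pair only (not taken from Common Voice)
reference = "слава україні героям слава"
hypothesis = "слава україни героям слава"

word_error = wer(reference, hypothesis)
char_error = cer(reference, hypothesis)
print(f"WER: {word_error:.3f}, accuracy on words: {1 - word_error:.1%}")
print(f"CER: {char_error:.3f}, accuracy on chars: {1 - char_error:.1%}")
```

In practice a maintained library is usually preferable to hand-written edit distance; the sketch only shows the relationship between the error rates and the accuracy values.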

## Hyperparameters

This model was trained on 2 RTX A4000 GPUs with the following hyperparameters:

```bash
torchrun --standalone --nnodes=1 --nproc-per-node=2 ../train_w2v2_bert.py \
  --custom_set ~/cv10/train.csv \
  --custom_set_eval ~/cv10/test.csv \
  --num_train_epochs 15 \
  --tokenize_config . \
  --w2v2_bert_model facebook/w2v-bert-2.0 \
  --batch 4 \
  --num_proc 5 \
  --grad_accum 1 \
  --learning_rate 3e-5 \
  --logging_steps 20 \
  --eval_step 500 \
  --group_by_length \
  --attention_dropout 0.0 \
  --activation_dropout 0.05 \
  --feat_proj_dropout 0.05 \
  --feat_quantizer_dropout 0.0 \
  --hidden_dropout 0.05 \
  --layerdrop 0.0 \
  --final_dropout 0.0 \
  --mask_time_prob 0.0 \
  --mask_time_length 10 \
  --mask_feature_prob 0.0 \
  --mask_feature_length 10
```
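
When adapting the command to different hardware, it can help to keep the effective batch size in mind. The arithmetic below is a small sketch under one assumption: that `--batch` is the per-process batch size, as is common in distributed training scripts. The training script itself is not included in this card, so this reading of the flag is not confirmed.

```python
# Values taken from the training command
nproc_per_node = 2    # --nproc-per-node (one process per GPU)
per_device_batch = 4  # --batch (ASSUMED to be per process)
grad_accum = 1        # --grad_accum

# Samples contributing to each optimizer step under the assumption above
effective_batch = nproc_per_node * per_device_batch * grad_accum
print(f"Effective batch size: {effective_batch}")
```

Under that assumption, training on a single GPU with `--grad_accum 2` would give the same effective batch size.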

## Usage

```python
# pip install -U torch soundfile transformers

import torch
import soundfile as sf
from transformers import AutoModelForCTC, Wav2Vec2BertProcessor

# Config
model_name = 'Yehor/w2v-bert-2.0-uk'
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
sampling_rate = 16_000

# Load the model
asr_model = AutoModelForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2BertProcessor.from_pretrained(model_name)

paths = [
    'sample1.wav',
]

# Extract audio (files must already be 16 kHz mono; sf.read does not resample)
audio_inputs = []
for path in paths:
    audio_input, _ = sf.read(path)
    audio_inputs.append(audio_input)

# Transcribe the audio
inputs = processor(audio_inputs, sampling_rate=sampling_rate).input_features
features = torch.tensor(inputs).to(device)

with torch.no_grad():
    logits = asr_model(features).logits

# Greedy decoding: pick the most likely token at every frame
predicted_ids = torch.argmax(logits, dim=-1)
predictions = processor.batch_decode(predicted_ids)

# Log results
print('Predictions:')
print(predictions)
```
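
The last two steps of the snippet above (`torch.argmax` followed by `processor.batch_decode`) amount to greedy CTC decoding: consecutive repeated ids are merged and the blank token is dropped. The sketch below illustrates that idea in plain Python. The vocabulary, the blank id, and the function name `ctc_greedy_collapse` are hypothetical and chosen only for the example; the real mapping lives in the model's tokenizer, which is assumed here to follow the usual Wav2Vec2-style CTC conventions (`<pad>` as blank, `|` as word delimiter).

```python
def ctc_greedy_collapse(ids, blank_id):
    """Merge consecutive repeats, then drop blanks (greedy CTC decoding)."""
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out


# Hypothetical vocabulary, for illustration only
vocab = {0: "<pad>", 1: "|", 2: "т", 3: "а", 4: "к"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would yield for one file
frame_ids = [0, 2, 2, 0, 3, 3, 3, 0, 4, 0, 0, 1, 0, 2, 3, 3, 0, 4]

collapsed = ctc_greedy_collapse(frame_ids, blank_id=0)
text = "".join(vocab[i] for i in collapsed).replace("|", " ")
print(text)
```

A blank between two identical ids keeps them as separate characters, which is how CTC represents doubled letters.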

## Cite this work

```
@misc {smoliakov_2025,
    author = { {Smoliakov} },
    title = { w2v-bert-uk (Revision e5a17ab) },
    year = 2025,
    url = { https://huggingface.co/Yehor/w2v-bert-uk },
    doi = { 10.57967/hf/4560 },
    publisher = { Hugging Face }
}
```