---
license: apache-2.0
language:
- sr
base_model:
- openai/whisper-small
datasets:
- google/fleurs
- Sagicc/audio-lmb-ds
- espnet/yodas_owsmv4
- classla/ParlaSpeech-RS
metrics:
- wer
model-index:
- name: Whisper Small
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice 24.0
type: mozilla-foundation/common_voice_24_0
config: sr
split: test
args: sr
metrics:
- name: Wer
type: wer
value: 0.065924219787
library_name: transformers
---

# whisper-small-sr
Fine-tuned from OpenAI Whisper Small for Serbian automatic speech recognition.

Output script: this model is intended to produce Serbian Latin text only.

- WER on the Common Voice 24.0 Serbian test set: 6.59%
## Model description

## Training and evaluation data
This model was fine-tuned on a mixture of publicly available Serbian speech corpora, including:
- Mozilla Common Voice 24.0 (Serbian), with evaluation on its test split
- FLEURS Serbian
- ParlaSpeech-RS (subset of the full dataset)
- Additional Serbian corpora used in the training pipeline
## Training procedure
- Epochs: 9
- Batch size: 32 / 20
- Optimizer: AdamW
- LR: 6e-5 with warmup (50 steps) + cosine decay to min_lr = 1e-7
- Mixed precision: bfloat16 (fp32 in the final epoch)
- SpecAugment: frequency + time masking
- Sampling: weighted sampling across datasets
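The schedule above (linear warmup for 50 steps to 6e-5, then cosine decay to min_lr = 1e-7) can be sketched as follows; `total_steps` is a placeholder, since the card does not state the total step count:

```python
# Sketch of the LR schedule described above: 50-step linear warmup to a peak
# of 6e-5, then cosine decay down to min_lr = 1e-7. total_steps is assumed.
import math

def lr_at_step(step, total_steps, peak_lr=6e-5, min_lr=1e-7, warmup_steps=50):
    if step < warmup_steps:
        # linear warmup from 0 toward peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # cosine decay from peak_lr to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_step(49, 1000))    # end of warmup -> 6e-5
print(lr_at_step(1000, 1000))  # fully decayed -> 1e-7
```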
## Training results
| Epoch | Train loss | CV WER |
|---|---|---|
| 1 | 0.333 | 0.1614 |
| 2 | 0.344 | 0.1278 |
| 3 | 0.251 | 0.1112 |
| 4 | 0.202 | 0.1032 |
| 5 | 0.167 | 0.0934 |
| 6 | 0.138 | 0.0790 |
| 7 | 0.118 | 0.0740 |
| 8 | 0.103 | 0.0709 |
| 9 | 0.096 | 0.0659 |
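The CV WER column above is word error rate. A minimal sketch of the standard word-level edit-distance computation (not necessarily the exact tooling used for this card):

```python
# Word error rate via Levenshtein edit distance over word sequences,
# i.e. (substitutions + insertions + deletions) / reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(1, len(ref))

print(wer("dobar dan svete", "dobar dan"))  # 1 deletion / 3 words ≈ 0.333
```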
## Evaluation metrics
- WER (normalized) on the Common Voice 24.0 Serbian test set: 6.59%
- Text normalization used for WER:
  - punctuation removed
  - lowercased
  - Cyrillic → Latin conversion
  - numbers converted to words
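The first three normalization steps can be sketched as below. The Cyrillic → Latin table follows the standard Serbian transliteration; number-to-word conversion is omitted, since the card does not name the tool used for it:

```python
# Sketch of the WER text normalization described above: lowercase, strip
# punctuation, and transliterate Serbian Cyrillic to Latin.
import string

CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

def normalize(text: str) -> str:
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)

print(normalize("Добар дан, свете!"))  # -> "dobar dan svete"
```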