---
language:
- sk
tags:
- speech
- asr
- whisper
- slovak
- parliament
- legal
- politics
base_model: openai/whisper-medium
datasets:
- erikbozik/slovak-plenary-asr-corpus
metrics:
- wer
model-index:
- name: whisper-medium-sk
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 21 (Slovak test set)
      type: common_voice
    metrics:
    - name: WER
      type: wer
      value: 18
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS (Slovak test set)
      type: fleurs
    metrics:
    - name: WER
      type: wer
      value: 7.6
license: mit
---
# Whisper Medium — Fine-tuned on SloPalSpeech
This model is a fine-tuned version of [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium).
It is adapted for **Slovak ASR** using [SloPalSpeech](https://huggingface.co/datasets/erikbozik/slovak-plenary-asr-corpus): **2,806 hours** of aligned, ≤30 s speech–text pairs from official plenary sessions of the **Slovak National Council**.
- **Language:** Slovak
- **Domain:** Parliamentary / formal speech
- **Training data:** 2,806 h
- **Intended use:** Slovak speech recognition; strongest in formal/public-speaking contexts
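
For the intended use above, a minimal transcription sketch with the Transformers ASR pipeline might look like the following. The repository id `your-namespace/whisper-medium-sk` is a placeholder (only the model name `whisper-medium-sk` comes from this card), and `build_generate_kwargs` and `transcribe` are hypothetical helper names introduced for illustration.

```python
# Placeholder repo id (assumption): replace with the actual Hub path of this model.
MODEL_ID = "your-namespace/whisper-medium-sk"


def build_generate_kwargs() -> dict:
    """Decoding options that pin Whisper to Slovak transcription."""
    return {"language": "slovak", "task": "transcribe"}


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file with the Transformers ASR pipeline."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # Whisper processes audio in 30 s windows
    )
    result = asr(audio_path, generate_kwargs=build_generate_kwargs())
    return result["text"]
```

Calling `transcribe("speech.wav")` would download the checkpoint on first use; setting the language explicitly avoids relying on Whisper's automatic language detection.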
## 🧪 Evaluation
| Dataset | Base WER | Fine-tuned WER | Δ (abs) |
|---|---:|---:|---:|
| Common Voice 21 (sk) | 38.0 | **18.0** | -20.0 |
| FLEURS (sk) | 18.7 | **7.6** | -11.1 |

*Numbers from the paper’s final benchmark runs.*
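
To make the metric concrete, here is a small pure-Python sketch of word error rate (word-level edit distance divided by reference length) and of the relative reduction implied by the table. It is not the paper's evaluation script: any text normalization applied before scoring is not reproduced, and the function names and the Slovak sample sentence are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(base_wer: float, tuned_wer: float) -> float:
    """Relative WER reduction in percent."""
    return 100.0 * (base_wer - tuned_wer) / base_wer


# One dropped word out of five reference words.
sample = word_error_rate(
    "vážený pán predseda národnej rady",
    "vážený pán predseda rady",
)
cv_gain = relative_reduction(38.0, 18.0)     # Common Voice 21 row
fleurs_gain = relative_reduction(18.7, 7.6)  # FLEURS row
```

Read this way, the absolute drops in the table correspond to relative reductions of roughly 53% on Common Voice 21 and 59% on FLEURS.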
## 🔧 Training Details
- **Framework:** Hugging Face Transformers
- **Hardware:** NVIDIA A10 GPUs
- **Epochs:** up to 3 with early stopping on validation WER
- **Learning rate:** ~**40× smaller** than Whisper pretraining LR
## ⚠️ Limitations
- Domain bias toward parliamentary speech (e.g., political vocabulary, formal register).
- As with Whisper models generally, occasional hallucinations may appear; consider temperature fallback / compression-ratio checks at inference time.
- Multilingual performance is not guaranteed, since full-parameter fine-tuning emphasized Slovak.
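
The compression-ratio check mentioned above can be sketched in a few lines: looping hallucinations compress unusually well, so a high ratio of raw to zlib-compressed size is a warning sign. The threshold of 2.4 is the default in the reference `openai-whisper` implementation as far as we know; treat it, and the helper names below, as assumptions to tune for your own data.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    """Flag transcripts whose compression ratio exceeds the threshold."""
    return compression_ratio(text) > threshold


# A looping output compresses far better than an ordinary sentence.
looped = "ďakujem pekne " * 40
normal = "Vážený pán predseda, dovoľte mi predložiť návrh zákona o štátnom rozpočte."
```

A flagged segment could then be re-decoded at a higher temperature or sent for manual review, which is the idea behind temperature fallback.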
## 📝 Citation & Paper
For more details, please see our paper on [arXiv](https://arxiv.org/abs/2509.19270). If you use this model in your work, please cite it as:
```bibtex
@misc{božík2025slopalspeech2800hourslovakspeech,
title={SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data},
author={Erik Božík and Marek Šuppa},
year={2025},
eprint={2509.19270},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.19270},
}
```
## 🙏 Acknowledgements
This work was supported by [**VÚB Banka**](https://www.vub.sk), which provided the GPU resources and backing necessary to accomplish it, enabling progress in Slovak ASR research.