---
base_model: openai/whisper-medium
datasets:
- erikbozik/slovak-plenary-asr-corpus
language:
- sk
license: mit
metrics:
- wer
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- asr
- whisper
- slovak
- parliament
- legal
- politics
model-index:
- name: whisper-medium-sk
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 21 (Slovak test set)
      type: common_voice
    metrics:
    - type: wer
      value: 18
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS (Slovak test set)
      type: fleurs
    metrics:
    - type: wer
      value: 7.6
      name: WER
---

# Whisper Medium — Fine-tuned on SloPalSpeech

This model is a fine-tuned version of [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium), presented in the paper [SloPal: A 60-Million-Word Slovak Parliamentary Corpus with Aligned Speech and Fine-Tuned ASR Models](https://huggingface.co/papers/2509.19270).  

It is adapted for **Slovak ASR** using [SloPalSpeech](https://huggingface.co/datasets/erikbozik/slovak-plenary-asr-corpus): **2,806 hours** of aligned, ≤30 s speech–text pairs from official plenary sessions of the **Slovak National Council**.

- **Language:** Slovak  
- **Domain:** Parliamentary / formal speech  
- **Training data:** 2,806 h
- **Intended use:** Slovak speech recognition; strongest in formal/public-speaking contexts
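
A minimal usage sketch with the Transformers `pipeline` API follows. The repository id `erikbozik/whisper-medium-sk` is an assumption pieced together from the dataset owner and the `model-index` name on this card; replace it with the id shown on the model page. The helper names `pipeline_kwargs` and `transcribe` are illustrative only.

```python
# Assumed repository id (not confirmed by this card); adjust as needed.
MODEL_ID = "erikbozik/whisper-medium-sk"


def pipeline_kwargs(model_id=MODEL_ID):
    """Arguments for transformers.pipeline; 30 s chunks match the training segment length."""
    return {
        "task": "automatic-speech-recognition",
        "model": model_id,
        "chunk_length_s": 30,
    }


def transcribe(audio_path, model_id=MODEL_ID):
    """Transcribe a Slovak audio file and return the recognized text."""
    # Imported lazily so the module loads even where transformers is not installed.
    from transformers import pipeline

    asr = pipeline(**pipeline_kwargs(model_id))
    # Pin language and task so Whisper does not fall back to language detection.
    result = asr(audio_path, generate_kwargs={"language": "slovak", "task": "transcribe"})
    return result["text"]


if __name__ == "__main__":
    # "speech.wav" is a placeholder path.
    print(transcribe("speech.wav"))
```

Calling `transcribe` downloads the model weights on first use; decoding audio files from a path also requires `ffmpeg` to be available.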

## 🧪 Evaluation

| Dataset | Base WER (%) | Fine-tuned WER (%) | Δ (abs) |
|---|---:|---:|---:|
| Common Voice 21 (sk) | 38.0 | **18.0** | -20.0 |
| FLEURS (sk) | 18.7 | **7.6** | -11.1 |

*Numbers from the paper’s final benchmark runs.*
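
The absolute deltas in the table can be complemented by relative reductions. The short sketch below recomputes both from the table values; the helper name `wer_reduction` is illustrative.

```python
# WER values (in %) copied from the evaluation table above: (base, fine-tuned).
RESULTS = {
    "Common Voice 21 (sk)": (38.0, 18.0),
    "FLEURS (sk)": (18.7, 7.6),
}


def wer_reduction(base, finetuned):
    """Return (absolute reduction in points, relative reduction in %)."""
    absolute = base - finetuned
    relative = 100.0 * absolute / base
    return round(absolute, 1), round(relative, 1)


for name, (base, finetuned) in RESULTS.items():
    absolute, relative = wer_reduction(base, finetuned)
    print(f"{name}: -{absolute} points absolute, {relative}% relative")
```

This works out to a relative WER reduction of roughly 53% on Common Voice 21 and 59% on FLEURS.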

## 🔧 Training Details

- **Framework:** Hugging Face Transformers  
- **Hardware:** NVIDIA A10 GPUs  
- **Epochs:** up to 3 with early stopping on validation WER  
- **Learning rate:** roughly **40× smaller** than the learning rate used in Whisper pretraining
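
To give a sense of scale for the learning rate, the sketch below applies the 40× factor to the pretraining learning rate reported for the medium model in the Whisper paper. The value `2.5e-4` is an assumption taken from that paper, not from this card, and the exact learning rate used for this model is not stated here.

```python
# Assumption: pretraining learning rate for Whisper medium as reported in the Whisper paper.
WHISPER_MEDIUM_PRETRAIN_LR = 2.5e-4
# Reduction factor stated in the training details above.
REDUCTION_FACTOR = 40


def finetune_lr(pretrain_lr=WHISPER_MEDIUM_PRETRAIN_LR, factor=REDUCTION_FACTOR):
    """Scale the pretraining learning rate down by the given factor."""
    return pretrain_lr / factor


print(f"approximate fine-tuning LR: {finetune_lr():.2e}")
```

Under that assumption the fine-tuning learning rate lands in the low `1e-6` to `1e-5` range, which is typical for full-parameter Whisper fine-tuning.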

## ⚠️ Limitations

- Domain bias toward parliamentary speech (e.g., political vocabulary, formal register).  
- As with Whisper models generally, occasional hallucinations may appear; consider temperature fallback / compression-ratio checks at inference time.  
- Multilingual performance is not guaranteed: full-parameter fine-tuning focused on Slovak, so accuracy in other languages may have degraded.
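
The hallucination mitigations mentioned above can be sketched as follows for recordings longer than 30 s. This is a sketch under assumptions, not a recommended configuration: the repository id is assumed, the threshold values are illustrative starting points, and how the fallback parameters are handled depends on the installed `transformers` version. The `compression_ratio` helper is a standalone post-hoc check based on `zlib`; the cutoff `2.4` used in the usage lines is the default from the reference `openai-whisper` implementation.

```python
import zlib

# Assumed repository id (not confirmed by this card); adjust as needed.
MODEL_ID = "erikbozik/whisper-medium-sk"

# Illustrative values: greedy decoding first, then progressively higher temperatures.
FALLBACK_GENERATE_KWARGS = {
    "language": "slovak",
    "task": "transcribe",
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "compression_ratio_threshold": 1.35,
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
}


def compression_ratio(text):
    """Raw byte length divided by zlib-compressed length; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def transcribe_with_fallback(waveform, sampling_rate=16000, model_id=MODEL_ID):
    """Long-form transcription with temperature fallback (waveform: 1-D float array)."""
    # Imported lazily so the helper above works without torch/transformers.
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    # Keep the full recording instead of truncating it to 30 s.
    inputs = processor(
        waveform,
        sampling_rate=sampling_rate,
        return_tensors="pt",
        truncation=False,
        padding="longest",
        return_attention_mask=True,
    )
    with torch.no_grad():
        ids = model.generate(**inputs, **FALLBACK_GENERATE_KWARGS)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]


if __name__ == "__main__":
    looped = "ďakujem " * 100  # simulated repetition loop
    print(compression_ratio(looped) > 2.4)
```

A transcript whose `compression_ratio` is unusually high is worth re-decoding or flagging for review, since repetition loops are a common form of Whisper hallucination.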

## 📝 Citation & Paper
For more details, please see our paper on [arXiv](https://arxiv.org/abs/2509.19270) or the [Hugging Face paper page](https://huggingface.co/papers/2509.19270). If you use this model in your work, please cite it as:
```bibtex
@misc{božík2025slopalspeech2800hourslovakspeech,
      title={SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data}, 
      author={Erik Božík and Marek Šuppa},
      year={2025},
      eprint={2509.19270},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.19270}, 
}
```

## 🙏 Acknowledgements

This work was supported by [**VÚB Banka**](https://www.vub.sk), which provided the GPU resources and backing that made it possible, enabling progress in Slovak ASR research.