File size: 2,869 Bytes
788df3f
bea0e72
 
 
 
 
 
 
 
 
 
 
 
d15a5a9
bea0e72
 
 
 
 
 
 
 
 
 
 
 
69e21c3
 
 
bea0e72
 
 
 
 
 
 
69e21c3
 
 
 
788df3f
 
939a05a
788df3f
bea0e72
939a05a
788df3f
bea0e72
 
 
 
788df3f
bea0e72
788df3f
bea0e72
 
 
 
788df3f
bea0e72
788df3f
bea0e72
788df3f
bea0e72
 
 
 
788df3f
bea0e72
788df3f
bea0e72
 
 
788df3f
061cb55
 
 
 
 
 
 
 
 
 
 
 
 
788df3f
bea0e72
788df3f
69e21c3
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
---
language:
- sk
tags:
- speech
- asr
- whisper
- slovak
- parliament
- legal
- politics
base_model: openai/whisper-medium
datasets:
- erikbozik/slovak-plenary-asr-corpus
metrics:
- wer
model-index:
- name: whisper-medium-sk
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 21 (Slovak test set)
      type: common_voice
    metrics:
    - name: WER
      type: wer
      value: 18
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS (Slovak test set)
      type: fleurs
    metrics:
    - name: WER
      type: wer
      value: 7.6
license: mit
---

# Whisper Medium — Fine-tuned on SloPalSpeech

This model is a fine-tuned version of [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium).  
It is adapted for **Slovak ASR** using [SloPalSpeech](https://huggingface.co/datasets/erikbozik/slovak-plenary-asr-corpus): **2,806 hours** of aligned, ≤30 s speech–text pairs from official plenary sessions of the **Slovak National Council**.

- **Language:** Slovak  
- **Domain:** Parliamentary / formal speech  
- **Training data:** 2,806 h
- **Intended use:** Slovak speech recognition; strongest in formal/public-speaking contexts

## 🧪 Evaluation

| Dataset | Base WER | Fine-tuned WER | Δ (abs) |
|---|---:|---:|---:|
| Common Voice 21 (sk) | 38.0 | **18.0** | -20.0 |
| FLEURS (sk) | 18.7 | **7.6** | -11.1 |

*Numbers from the paper’s final benchmark runs.*

## 🔧 Training Details

- **Framework:** Hugging Face Transformers  
- **Hardware:** NVIDIA A10 GPUs  
- **Epochs:** up to 3 with early stopping on validation WER  
- **Learning rate:** ~**40× smaller** than Whisper pretraining LR

## ⚠️ Limitations

- Domain bias toward parliamentary speech (e.g., political vocabulary, formal register).  
- As with Whisper models generally, occasional hallucinations may appear; consider temperature fallback / compression-ratio checks at inference time.  
- Multilingual performance is not guaranteed (full-parameter finetuning emphasized Slovak).

## 📝 Citation & Paper
For more details, please see our paper on [arXiv](https://arxiv.org/abs/2509.19270). If you use this model in your work, please cite it as:
```bibtex
@misc{božík2025slopalspeech2800hourslovakspeech,
      title={SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data}, 
      author={Erik Božík and Marek Šuppa},
      year={2025},
      eprint={2509.19270},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.19270}, 
}
```

## 🙏 Acknowledgements

This work was supported by [**VÚB Banka**](https://www.vub.sk) who provided the GPU resources and backing necessary to accomplish it, enabling progress in Slovak ASR research.