---
language:
- sk
tags:
- speech
- asr
- whisper
- slovak
- parliament
- legal
- politics
base_model: openai/whisper-large-v3
datasets:
- erikbozik/slovak-plenary-asr-corpus
metrics:
- wer
model-index:
- name: whisper-large-v3-sk
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 21 (Slovak test set)
      type: common_voice
    metrics:
    - name: WER
      type: wer
      value: 11.6
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS (Slovak test set)
      type: fleurs
    metrics:
    - name: WER
      type: wer
      value: 5.5
license: mit
---

# Whisper Large-v3 — Fine-tuned on Slovak Plenary ASR Corpus

This model is a fine-tuned version of [`openai/whisper-large-v3`](https://huggingface.co/openai/whisper-large-v3).  
It is adapted for **Slovak ASR** using [SloPalSpeech](https://huggingface.co/datasets/erikbozik/slovak-plenary-asr-corpus): **2,806 hours** of aligned, ≤30 s speech–text pairs from official plenary sessions of the **Slovak National Council**.

- **Language:** Slovak  
- **Domain:** Parliamentary / formal speech  
- **Training data:** 2,806 h
- **Intended use:** Slovak speech recognition; strongest in formal/public-speaking contexts
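
A minimal inference sketch, assuming `transformers` and `torch` are installed and the checkpoint is available on the Hugging Face Hub. This card does not state the repository id, so `model_id` is a value you must supply. `transcribe` is a hypothetical helper name, not part of any library.

```python
def transcribe(audio_path: str, model_id: str, device: str = "cpu") -> str:
    """Transcribe one audio file in Slovak with a fine-tuned Whisper checkpoint.

    model_id is an assumption: pass the Hub id (or local path) of this model.
    """
    # Imported lazily so this module loads even without the heavy dependencies.
    import torch
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        torch_dtype=torch.float16 if device.startswith("cuda") else torch.float32,
        device=device,
        chunk_length_s=30,  # split longer recordings into 30 s windows
    )
    result = asr(
        audio_path,
        # Force Slovak transcription instead of relying on language detection.
        generate_kwargs={"language": "slovak", "task": "transcribe"},
    )
    return result["text"]
```

Call it as `transcribe("speech.wav", model_id="<hub-id-of-this-model>", device="cuda:0")`, where the file name is only an example.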

## 🧪 Evaluation

| Dataset | Base WER (%) | Fine-tuned WER (%) | Δ (abs, pp) |
|---|---:|---:|---:|
| Common Voice 21 (sk) | 20.8 | **11.6** | -9.2 |
| FLEURS (sk) | 9.2 | **5.5** | -3.7 |

*Numbers from the paper’s final benchmark runs.*
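
For readers who want to reproduce this kind of comparison, here is a small sketch of how WER is typically computed: word-level edit distance divided by the number of reference words. The whitespace tokenization and the absence of text normalization are assumptions; the normalization used for the reported numbers is not described in this card, so results from this sketch may differ. `word_error_rate` and `relative_reduction` are illustrative names.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(base_wer: float, tuned_wer: float) -> float:
    """Relative WER reduction in percent."""
    return 100.0 * (base_wer - tuned_wer) / base_wer
```

Applied to the table above, `relative_reduction(20.8, 11.6)` is about 44% and `relative_reduction(9.2, 5.5)` is about 40%.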

## 🔧 Training Details

- **Framework:** Hugging Face Transformers  
- **Hardware:** Multi-GPU setup (NVIDIA A10s) with Fully Sharded Data Parallel (FSDP)  
- **Epochs:** ~2 with early stopping on validation WER  
- **Learning rate:** `1e-5`, with weight decay `0.01` to limit overfitting  
- **Notes:** Training required sharded checkpoints; evaluation was run separately because of runtime compatibility issues
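
Because evaluation ran outside the training loop, early stopping on validation WER amounts to choosing a checkpoint from a list of separately computed scores. The sketch below shows one way to do that. The `patience` rule and the function name `select_checkpoint` are assumptions for illustration, not the procedure used for this model.

```python
def select_checkpoint(val_wers: list[float], patience: int = 2) -> tuple[int, int]:
    """Return (best_index, stop_index) for a sequence of validation WERs.

    Stops once `patience` consecutive evaluations fail to improve on the best WER.
    """
    if not val_wers:
        raise ValueError("val_wers must not be empty")
    best_idx, stale = 0, 0
    for idx in range(1, len(val_wers)):
        if val_wers[idx] < val_wers[best_idx]:
            best_idx, stale = idx, 0  # new best checkpoint, reset the counter
        else:
            stale += 1
            if stale >= patience:
                return best_idx, idx  # stop here, keep the best checkpoint
    return best_idx, len(val_wers) - 1


# Made-up validation WERs, one per saved checkpoint (illustrative only).
best, stopped = select_checkpoint([14.0, 12.5, 12.1, 12.3, 12.4, 11.0])
```

With these made-up scores the helper keeps checkpoint 2 and stops at checkpoint 4, so the later value is never evaluated.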

## ⚠️ Limitations

- Domain bias toward parliamentary speech (e.g., political vocabulary, formal register).  
- As with Whisper models generally, occasional hallucinations may appear; consider temperature fallback / compression-ratio checks at inference time.  
- Multilingual performance is not guaranteed, because full-parameter fine-tuning emphasized Slovak.
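
The compression-ratio check mentioned above can be sketched in a few lines. Highly repetitive output, a common hallucination pattern, compresses unusually well, so a high ratio is a warning sign. The threshold `2.4` matches the default in OpenAI's reference Whisper implementation; whether it is the best value for this model is an assumption, and `looks_repetitive` is an illustrative helper name.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Size of the UTF-8 text divided by its zlib-compressed size."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    """Flag transcripts whose compression ratio exceeds the threshold."""
    return compression_ratio(text) > threshold


looped = "ďakujem " * 50  # a phrase stuck in a loop
normal = "Vážené panie poslankyne, vážení páni poslanci, otváram dnešné rokovanie."
```

Here `looks_repetitive(looped)` returns `True` and `looks_repetitive(normal)` returns `False`. A flagged segment can then be re-decoded at a higher temperature or dropped.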

## 📝 Citation & Paper
For more details, please see our paper on [arXiv](https://arxiv.org/abs/2509.19270). If you use this model in your work, please cite it as:
```bibtex
@misc{božík2025slopalspeech2800hourslovakspeech,
      title={SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data}, 
      author={Erik Božík and Marek Šuppa},
      year={2025},
      eprint={2509.19270},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.19270}, 
}
```

## 🙏 Acknowledgements

This work was supported by [**VÚB Banka**](https://www.vub.sk), which provided the GPU resources and backing that made it possible and helped advance Slovak ASR research.