---
base_model: openai/whisper-medium
datasets:
- erikbozik/slovak-plenary-asr-corpus
language:
- sk
license: mit
metrics:
- wer
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- asr
- whisper
- slovak
- parliament
- legal
- politics
model-index:
- name: whisper-medium-sk
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 21 (Slovak test set)
      type: common_voice
    metrics:
    - type: wer
      value: 18.0
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS (Slovak test set)
      type: fleurs
    metrics:
    - type: wer
      value: 7.6
      name: WER
---
# Whisper Medium — Fine-tuned on SloPalSpeech
This model is a fine-tuned version of [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium), presented in the paper [SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data](https://huggingface.co/papers/2509.19270).
It is adapted for **Slovak ASR** using [SloPalSpeech](https://huggingface.co/datasets/erikbozik/slovak-plenary-asr-corpus): **2,806 hours** of aligned, ≤30 s speech–text pairs from official plenary sessions of the **Slovak National Council**.
- **Language:** Slovak
- **Domain:** Parliamentary / formal speech
- **Training data:** 2,806 h
- **Intended use:** Slovak speech recognition; strongest in formal/public-speaking contexts
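A minimal usage sketch with the `transformers` pipeline. The repository id `erikbozik/whisper-medium-sk` is an assumption inferred from this card's name and the dataset author; replace it with the actual model id if it differs:

```python
from transformers import pipeline

# Assumed repository id (inferred from this card's name) -- adjust if needed.
MODEL_ID = "erikbozik/whisper-medium-sk"


def transcribe(audio_path: str) -> str:
    """Transcribe a Slovak audio file with the fine-tuned Whisper model."""
    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=30,  # match the <=30 s segments used in training
    )
    return asr(audio_path)["text"]


# Usage: print(transcribe("plenary_clip.wav"))
```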
## 🧪 Evaluation
| Dataset | Base WER | Fine-tuned WER | Δ (abs) |
|---|---:|---:|---:|
| Common Voice 21 (sk) | 38.0 | **18.0** | -20.0 |
| FLEURS (sk) | 18.7 | **7.6** | -11.1 |
*Numbers from the paper’s final benchmark runs.*
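WER in the table is word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. The paper's exact text normalization is not specified on this card, so the sketch below computes plain whitespace-token WER:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length.

    Note: no text normalization is applied here; reported benchmark
    numbers may use additional normalization (casing, punctuation).
    """
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over word tokens.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (0 if words match)
            prev = cur
    return d[-1] / len(ref)


# One substituted word out of four -> WER 0.25
print(wer("dobrý deň vážené kolegyne", "dobrý deň vážené kolegovia"))
```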
## 🔧 Training Details
- **Framework:** Hugging Face Transformers
- **Hardware:** NVIDIA A10 GPUs
- **Epochs:** up to 3 with early stopping on validation WER
- **Learning rate:** ~**40× smaller** than Whisper pretraining LR
## ⚠️ Limitations
- Domain bias toward parliamentary speech (e.g., political vocabulary, formal register).
- As with Whisper models generally, occasional hallucinations may appear; consider temperature fallback / compression-ratio checks at inference time.
- Multilingual performance is not guaranteed (full-parameter finetuning emphasized Slovak).
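The temperature-fallback and compression-ratio checks mentioned above can be enabled through Whisper's `generate` arguments in recent `transformers` releases. A hedged sketch (the repository id is assumed from this card's name; the thresholds are the defaults from the original Whisper decoding heuristics):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_ID = "erikbozik/whisper-medium-sk"  # assumed repository id


def transcribe_with_fallback(audio, sampling_rate: int = 16_000) -> str:
    """Decode with temperature fallback to reduce hallucinated output."""
    processor = WhisperProcessor.from_pretrained(MODEL_ID)
    model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    generated = model.generate(
        inputs.input_features,
        language="sk",
        task="transcribe",
        # Retry decoding at progressively higher temperatures whenever
        # the quality checks below fail for the current hypothesis.
        temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
        compression_ratio_threshold=2.4,  # flags repetitive (hallucinated) text
        logprob_threshold=-1.0,           # flags low-confidence decodes
    )
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```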
## 📝 Citation & Paper
For more details, please see our paper on [arXiv](https://arxiv.org/abs/2509.19270) or the [Hugging Face paper page](https://huggingface.co/papers/2509.19270). If you use this model in your work, please cite it as:
```bibtex
@misc{božík2025slopalspeech2800hourslovakspeech,
title={SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data},
author={Erik Božík and Marek Šuppa},
year={2025},
eprint={2509.19270},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.19270},
}
```
## 🙏 Acknowledgements
This work was supported by [**VÚB Banka**](https://www.vub.sk), which provided the GPU resources and backing necessary for this research, enabling progress in Slovak ASR research.