---
base_model: openai/whisper-medium
datasets:
  - erikbozik/slovak-plenary-asr-corpus
language:
  - sk
license: mit
metrics:
  - wer
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
  - speech
  - asr
  - whisper
  - slovak
  - parliament
  - legal
  - politics
model-index:
  - name: whisper-medium-sk
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Common Voice 21 (Slovak test set)
          type: common_voice
        metrics:
          - type: wer
            value: 18
            name: WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: FLEURS (Slovak test set)
          type: fleurs
        metrics:
          - type: wer
            value: 7.6
            name: WER
---

# Whisper Medium — Fine-tuned on SloPalSpeech

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium), presented in the paper [SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data](https://arxiv.org/abs/2509.19270).

It is adapted for Slovak ASR using SloPalSpeech: 2,806 hours of aligned, ≤30 s speech–text pairs from official plenary sessions of the Slovak National Council.

- **Language:** Slovak
- **Domain:** Parliamentary / formal speech
- **Training data:** 2,806 h
- **Intended use:** Slovak speech recognition; strongest in formal/public-speaking contexts
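
A minimal transcription sketch using the 🤗 Transformers `pipeline`. The repo id `erikbozik/whisper-medium-sk` is assumed from the card title and the dataset author; replace it with the actual Hub path. A second of silence stands in for real audio — pass a file path (e.g. `asr("recording.wav")`) in practice.

```python
import numpy as np
from transformers import pipeline

# Repo id assumed from the card title -- adjust to the actual Hub path.
asr = pipeline(
    "automatic-speech-recognition",
    model="erikbozik/whisper-medium-sk",
    chunk_length_s=30,  # training segments were <= 30 s
)

# One second of 16 kHz silence as a stand-in for real audio.
audio = {"raw": np.zeros(16000, dtype=np.float32), "sampling_rate": 16000}
result = asr(audio, generate_kwargs={"language": "slovak", "task": "transcribe"})
print(result["text"])
```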

## 🧪 Evaluation

| Dataset              | Base WER | Fine-tuned WER | Δ (abs) |
|----------------------|----------|----------------|---------|
| Common Voice 21 (sk) | 38.0     | 18.0           | -20.0   |
| FLEURS (sk)          | 18.7     | 7.6            | -11.1   |

Numbers from the paper’s final benchmark runs.
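
WER as reported above is the word-level edit distance divided by the reference word count. A self-contained sketch of the metric (the example strings are illustrative, not drawn from the test sets):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

print(wer("dobrý deň vážení poslanci", "dobrý den vážení poslanci"))  # 0.25
```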

## 🔧 Training Details

- **Framework:** Hugging Face Transformers
- **Hardware:** NVIDIA A10 GPUs
- **Epochs:** up to 3, with early stopping on validation WER
- **Learning rate:** ~40× smaller than the Whisper pretraining LR
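
A hypothetical `Seq2SeqTrainingArguments` sketch matching the setup above. All values are illustrative, not the paper's exact configuration; the 2.5e-4 base LR (Whisper-medium's pretraining peak) is an assumption used only to show the ~40× reduction.

```python
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

# Illustrative values only -- the card states up to 3 epochs, early stopping
# on validation WER, and an LR ~40x below the Whisper pretraining LR.
args = Seq2SeqTrainingArguments(
    output_dir="whisper-medium-sk",
    learning_rate=6.25e-6,          # ~2.5e-4 / 40 (base LR is an assumption)
    num_train_epochs=3,
    eval_strategy="steps",
    save_strategy="steps",
    predict_with_generate=True,     # decode to text so WER can be computed
    metric_for_best_model="wer",
    greater_is_better=False,        # lower WER is better
    load_best_model_at_end=True,    # required by EarlyStoppingCallback
)
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```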

## ⚠️ Limitations

- Domain bias toward parliamentary speech (e.g., political vocabulary, formal register).
- As with Whisper models generally, occasional hallucinations may appear; consider temperature fallback / compression-ratio checks at inference time.
- Multilingual performance is not guaranteed (full-parameter fine-tuning emphasized Slovak).
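
Transformers' long-form Whisper decoding exposes knobs for the temperature-fallback and compression-ratio checks mentioned above. A sketch with the thresholds Whisper commonly uses as defaults (repo id again assumed from the card title; 31 s of silence stands in for a real long recording):

```python
import numpy as np
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="erikbozik/whisper-medium-sk",  # repo id assumed from the card title
)

# Audio longer than 30 s triggers Whisper's sequential long-form decoding.
audio = {"raw": np.zeros(16000 * 31, dtype=np.float32), "sampling_rate": 16000}
out = asr(
    audio,
    return_timestamps=True,
    generate_kwargs={
        # Re-decode a segment at increasing temperatures when it fails the checks below.
        "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
        # Flag degenerate, repetitive text (a common hallucination signature).
        "compression_ratio_threshold": 1.35,
        "logprob_threshold": -1.0,
        "no_speech_threshold": 0.6,
    },
)
print(out["text"])
```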

## 📝 Citation & Paper

For more details, please see the [paper on arXiv](https://arxiv.org/abs/2509.19270) or the [Hugging Face paper page](https://huggingface.co/papers/2509.19270). If you use this model in your work, please cite it as:

@misc{božík2025slopalspeech2800hourslovakspeech,
      title={SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data}, 
      author={Erik Božík and Marek Šuppa},
      year={2025},
      eprint={2509.19270},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.19270}, 
}

## 🙏 Acknowledgements

This work was supported by VÚB Banka, which provided the GPU resources and backing necessary to accomplish it, enabling progress in Slovak ASR research.