---
library_name: transformers
license: mit
language: en
metrics:
  - wer
base_model: openai/whisper-small
datasets:
  - wTIMIT
pipeline_tag: automatic-speech-recognition
tags:
  - Whispered ASR
  - SpecAugment
  - Data Augmentation
  - Whispered Speech
---

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the wTIMIT-US dataset, trained with SpecAugment, a time- and frequency-masking data augmentation method.

The model was fine-tuned jointly on normal and whispered speech, using SpecAugment in its LibriSpeech Double (LD) configuration. It serves as a baseline for comparison against phone-aware masking methods such as F0-Mask, F1-Mask, and LF-Mask.
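SpecAugment applies random frequency-band and time-span masks directly to the log-mel spectrogram. A minimal NumPy sketch of the masking step is below; the mask sizes (F = 27, T = 100, two masks each) follow the LibriSpeech Double policy from the SpecAugment paper and are illustrative, not necessarily this model's exact training settings.

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, F=27, n_time_masks=2, T=100, rng=None):
    """Zero out random frequency bands and time spans of a (n_mels, n_frames) spectrogram."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    for _ in range(n_freq_masks):
        f = int(rng.integers(0, F + 1))               # mask width in mel channels
        f0 = int(rng.integers(0, max(1, n_mels - f + 1)))
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = int(rng.integers(0, min(T, n_frames) + 1))  # mask width in frames
        t0 = int(rng.integers(0, max(1, n_frames - t + 1)))
        spec[:, t0:t0 + t] = 0.0
    return spec

# Example: an 80-mel, 3000-frame Whisper-style log-mel spectrogram
mel = np.ones((80, 3000))
masked = spec_augment(mel, rng=np.random.default_rng(0))
```

In `transformers`, the equivalent behavior is enabled at training time through the Whisper config's SpecAugment options (e.g. `apply_spec_augment` and the `mask_time_*`/`mask_feature_*` parameters).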

## Evaluation Results on wTIMIT-US (Test Set)

| Setup       | Training Data | Augmentation     | WER (Normal) | WER (Whispered) |
|-------------|---------------|------------------|--------------|-----------------|
| Baseline    | Both modes    | None             | 5.8          | 11.7            |
| SpecAugment | Both modes    | SpecAugment (LD) | 5.2          | 12.3            |

SpecAugment significantly reduced WER on normal speech relative to the unaugmented baseline (p = 0.014), while the difference on whispered speech was not statistically significant (p = 0.147).
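WER here is the standard word error rate: (substitutions + deletions + insertions) divided by the number of reference words. The reported numbers were scored with SCTK's `sclite`; a small self-contained equivalent via word-level Levenshtein distance, for illustration:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("she had your dark suit", "she had the dark suit"))  # 1 substitution / 5 words = 0.2
```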

## Cite as

Kokowski, J. (2025). F0-Based Masking Policies for Self-Supervised Whispered Speech Recognition. Master’s Thesis, University of Groningen, Campus Fryslân.
Available at: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/674

If you use this model or build upon this work, please cite the thesis above.

- **Model:** Whisper-small
- **Augmentation:** SpecAugment (LD)
- **Evaluation toolkit:** SCTK (`sclite`)
- **Notes:** For statistical comparisons and MAPSSWE evaluation, see Section 5 of the thesis.

## 🔗 Related Models