---
library_name: transformers
license: mit
language: en
metrics:
  - wer
base_model: openai/whisper-small
datasets: wTIMIT
pipeline_tag: automatic-speech-recognition
tags:
  - Whispered ASR
  - Frequency Masking
  - F0-Mask
  - SpecAugment
  - Whispered Speech
---

This model is a fine-tuned version of `openai/whisper-small` on the wTIMIT-US dataset using the F0-Mask augmentation method. It was evaluated on both normal and whispered speech subsets, with Word Error Rate (WER) as the primary metric.

The results below show a lower whispered-speech WER than both the unaugmented baseline and SpecAugment, supporting the effectiveness of phoneme-aware low-frequency masking (PALF-Mask).
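
To give a rough idea of what low-frequency masking on a spectrogram looks like, here is a minimal sketch of SpecAugment-style frequency masking restricted to the lowest mel bins. It is not the F0-Mask policy from the thesis, which is not described in this card. The function name `mask_low_frequencies` and the parameters `max_bin`, `max_width`, and `fill_value` are hypothetical and chosen only for illustration.

```python
import numpy as np


def mask_low_frequencies(log_mel, max_bin=16, max_width=8, fill_value=0.0, rng=None):
    """Mask one random band of mel bins that lies entirely below `max_bin`.

    log_mel: array of shape (n_mels, n_frames), e.g. (80, 3000) for whisper-small.
    max_bin, max_width, fill_value: illustrative assumptions, not thesis values.
    Returns the masked copy and the (start, width) of the masked band.
    """
    rng = rng if rng is not None else np.random.default_rng()
    limit = min(max_bin, log_mel.shape[0])
    # Band width is drawn uniformly from 0..max_width (inclusive).
    width = int(rng.integers(0, min(max_width, limit) + 1))
    # The band must fit inside the low-frequency region [0, limit).
    start = int(rng.integers(0, limit - width + 1))
    masked = log_mel.copy()
    masked[start:start + width, :] = fill_value
    return masked, (start, width)
```

In a training pipeline, a function like this would typically be applied to the input features of each training example only, never to evaluation data.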


### Evaluation Results on wTIMIT-US (Test Set)

| **Setup**            | **Training Data** | **Augmentation**     | **WER (Normal)** | **WER (Whispered)** |
|----------------------|-------------------|-----------------------|------------------|----------------------|
| No Fine-tuning       | Zero-shot         | None                  | 5.0              | 13.7                 |
| Baseline             | Both modes        | None                  | 5.8              | 11.7                 |
| SpecAugment          | Both modes        | SpecAugment (LD)      | 5.2              | 12.3                 |
| **F0-Mask (Ours)**   | Both modes        | F0-based Masking      | 5.0 (ns, *p*=0.144) | **11.5 (★, *p*=0.002)** |

**★** = Statistically significant improvement over SpecAugment (paired MAPSSWE)  
**ns** = Not statistically significant

> Compared to the SpecAugment baseline, F0-Mask achieved a statistically significant improvement in whispered speech recognition (↓0.8% absolute WER, *p*=0.002), while maintaining comparable performance on normal speech (*p*=0.144).

**Notably, the whispered WER of 11.5% matches the best result previously reported on this dataset by Marchenko (2024).**
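
For readers unfamiliar with the metric, the sketch below computes WER for a single utterance as word-level edit distance divided by reference length, and derives absolute and relative reductions from two WER values. It is a simplified illustration, not the SCTK (sclite) scoring used for the table: it does no text normalization and does not aggregate errors over a corpus. The function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(wer_before, wer_after):
    """Return (absolute, relative %) reduction, both rounded to one decimal."""
    absolute = wer_before - wer_after
    relative = 100.0 * absolute / wer_before
    return round(absolute, 1), round(relative, 1)
```

Applied to the whispered WERs in the table, `wer_reduction(12.3, 11.5)` gives the 0.8 absolute reduction quoted above, which corresponds to roughly 6.5% relative.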

### Cite as

Kokowski, J. (2025). *F0-Based Masking Policies for Self-Supervised Whispered Speech Recognition*. Master’s Thesis, University of Groningen, Campus Fryslân.  
Available at: [https://campus-fryslan.studenttheses.ub.rug.nl/view/degree_programme/voice=5Ftechnology.html](https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/674)

> If you use this model or build upon this work, please cite the thesis above.


**Model:** Whisper-small  
**Augmentation:** F0-Mask  
**Evaluation toolkit:** [SCTK (sclite)](https://github.com/usnistgov/SCTK)  
**Notes:** For complete results, including MAPSSWE and CER scores, refer to Section 5 of the thesis.
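
A minimal sketch of transcribing audio with this checkpoint through the `transformers` ASR pipeline is shown below. It assumes the repository ID `jankoko/PALF-Whisper-small` (taken from the Related Models link marked as current) and that `transformers` and an audio backend such as ffmpeg are installed. The helper name `transcribe` and the file name in the comment are hypothetical.

```python
MODEL_ID = "jankoko/PALF-Whisper-small"  # assumed from the Related Models link


def transcribe(audio_path, model_id=MODEL_ID):
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so that defining this helper does not require
    # transformers to be installed or any model to be downloaded.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]


# Example call (downloads the model on first use):
# print(transcribe("whispered_sample.wav"))
```

For repeated calls, building the pipeline once and reusing it avoids reloading the model for every file.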

### 🔗 Related Models

- [SpecAugment Baseline](https://huggingface.co/jankoko/SpecAugment-Whisper-small)
- [F0-Mask Version](https://huggingface.co/jankoko/PALF-Whisper-small) ← current
- [F1-Mask Version](https://huggingface.co/jankoko/PALF-F1-Whisper-small) 
- [LF-Mask Version](https://huggingface.co/jankoko/PALF-LF-Whisper-small)