---
language: cs
datasets:
- CommonVoice
- CC100
tags:
- automatic-speech-recognition
- whisper
- knowledge-distillation
- czech
license: mit
---

# Whisper Tiny Czech (Knowledge Distillation from MLM)

This model is a fine-tuned version of Whisper Tiny adapted for Czech automatic speech recognition (ASR) using **knowledge distillation (KD)** from a **masked language model (MLM)**.

## Model Description

During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper's fine-tuning.

- **Teacher Model**: BiLSTM-based masked language model (60M parameters) trained on a 210MB subset of the CC100-Czech dataset.
- **Distillation Approach**: At each decoding step, Whisper was trained not only with the standard cross-entropy loss on the next token but was also encouraged to align its token distribution with that predicted by the MLM (via a KL-divergence loss).
- **Tokenizer**: Same byte-pair encoding (BPE) as Whisper.
- **Training Data**: CommonVoice Czech 19.0 dataset for speech; CC100-Czech for language modeling.
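
The card does not specify how the teacher MLM was trained beyond it being a masked language model, so the sketch below assumes a common BERT-style masking scheme (hide a random fraction of tokens, train the model to recover the originals). The `MASK_ID` value and the 15% mask rate are illustrative assumptions, not details from the thesis.

```python
import random

MASK_ID = 0  # hypothetical [MASK] token id, not from the thesis

def mask_for_mlm(token_ids, mask_prob=0.15, seed=0):
    """BERT-style masking: replace a random fraction of tokens with MASK_ID;
    the MLM is trained to recover the originals at the masked positions."""
    rng = random.Random(seed)
    masked = list(token_ids)
    targets, positions = [], []
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            masked[i] = MASK_ID
            targets.append(tok)
            positions.append(i)
    return masked, targets, positions
```

With `mask_prob=1.0` every position is hidden; in practice a rate around 15% is typical for MLM pre-training.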

### Loss Function

The training loss combined the standard ASR loss with the KD loss:

\[
L_{t} = \lambda_{lm} \, \text{CE}(\text{asr}, \text{true token}) + (1 - \lambda_{lm}) \, \text{KLD}(\text{asr distribution}, \text{mlm prediction})
\]

where \(\lambda_{lm}\) balances the two components.
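
As an illustration, the objective can be written out term by term. The NumPy sketch below is a toy re-implementation for clarity, not the training code; the default `lam` value is taken from the hyperparameter table.

```python
import numpy as np

def kd_loss(asr_logits, mlm_probs, true_token, lam=1e-3, eps=1e-12):
    """L_t = lam * CE(asr, true token) + (1 - lam) * KL(asr || mlm)."""
    # Softmax over the vocabulary to get the ASR token distribution.
    shifted = np.exp(asr_logits - asr_logits.max())
    asr_probs = shifted / shifted.sum()
    # Cross-entropy with the ground-truth token.
    ce = -np.log(asr_probs[true_token] + eps)
    # KL divergence between the ASR distribution and the MLM prediction.
    kl = np.sum(asr_probs * np.log((asr_probs + eps) / (mlm_probs + eps)))
    return lam * ce + (1 - lam) * kl
```

When the ASR distribution already matches the teacher, the KL term vanishes and only the (weighted) cross-entropy remains.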

### Hyperparameters

| Model             | Learning Rate | KD Lambda | Batch Size |
|-------------------|---------------|-----------|------------|
| Tiny Baseline     | 5e-4          | -         | 8          |
| Tiny Adapted (KD) | 1e-4          | 1e-3      | 8          |

The learning rates differ because they were optimised separately for each configuration.

### Results on CommonVoice Czech

| Model             | Validation Loss | WER   | CER   |
|-------------------|-----------------|-------|-------|
| Tiny Baseline     | 1.236           | 0.447 | 0.031 |
| Tiny Adapted (KD) | 0.636           | 0.345 | 0.023 |

✅ **CER reduced by ~25% (relative)**
✅ **WER reduced by ~23% (relative)**

This shows that even very light knowledge distillation from a lightweight MLM significantly improves language modelling capabilities in Whisper Tiny for Czech.
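
The percentages quoted above are relative reductions computed directly from the table:

```python
def relative_reduction(baseline, adapted):
    """Relative improvement of the adapted model over the baseline metric."""
    return (baseline - adapted) / baseline

wer_gain = relative_reduction(0.447, 0.345)  # ≈ 0.228, i.e. ~23%
cer_gain = relative_reduction(0.031, 0.023)  # ≈ 0.258, i.e. ~25%
```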

---

## Intended Use

This model is intended for research and applications in Czech ASR where lightweight, efficient models are needed but a stronger grasp of the language is crucial.

## Limitations

- Trained on a relatively small subset (210MB) of CC100-Czech due to computational constraints.
- Optimized for clean, non-code-switched Czech speech (based on CommonVoice data).

## Acknowledgments
- Knowledge distillation ([Hinton et al., 2015](https://arxiv.org/abs/1503.02531))
- Whisper model family ([OpenAI, 2022](https://openai.com/research/whisper))
- CommonVoice dataset ([Mozilla, 2020](https://commonvoice.mozilla.org))
- CC100 dataset ([Conneau et al., 2020](https://arxiv.org/abs/1911.02116))

## Citation

If you use this model, please cite (yes, the main topic of the thesis was indeed assistive ASR):

```bibtex
@misc{nadrchal_2025,
  title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study},
  author={David Nadrchal},
  year={2025},
  note={Bachelor's Thesis},
  url={https://github.com/Hobit2002/TracheoSpeech_ASR}
}
```