---
language: cs
datasets:
- CommonVoice
- CC100
tags:
- automatic-speech-recognition
- whisper
- knowledge-distillation
- czech
license: mit
---

# Whisper Tiny Czech (Knowledge Distillation from MLM)

This model is a fine-tuned version of Whisper Tiny adapted for Czech automatic speech recognition (ASR) using **knowledge distillation (KD)** from a **masked language model (MLM)**.

## Model Description

During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper's fine-tuning.

- **Teacher Model**: BiLSTM-based masked language model (60M parameters) trained on a 210MB subset of the CC100-Czech dataset.
- **Distillation Approach**: At each decoding step, Whisper was trained not only with the standard cross-entropy loss on the next token but was also encouraged to align its token distribution with that predicted by the MLM (via a KL-divergence loss).
- **Tokenizer**: Same byte-pair encoding (BPE) as Whisper.
- **Training Data**: CommonVoice Czech 19.0 dataset for speech; CC100-Czech for language modeling.
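
The card does not specify how the teacher MLM was trained beyond it being a masked language model, so the sketch below assumes a common BERT-style masking scheme (hide a random fraction of tokens, train the model to recover the originals). The `MASK_ID` value and the 15% mask rate are illustrative assumptions, not details from the thesis.

```python
import random

MASK_ID = 0  # hypothetical [MASK] token id, not from the thesis

def mask_for_mlm(token_ids, mask_prob=0.15, seed=0):
    """BERT-style masking: replace a random fraction of tokens with MASK_ID;
    the MLM is trained to recover the originals at the masked positions."""
    rng = random.Random(seed)
    masked = list(token_ids)
    targets, positions = [], []
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            masked[i] = MASK_ID
            targets.append(tok)
            positions.append(i)
    return masked, targets, positions
```

With `mask_prob=1.0` every position is hidden; in practice a rate around 15% is typical for MLM pre-training.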

### Loss Function

The training loss combined the standard ASR loss with the KD loss:

\[
L_{t} = \lambda_{lm} \, \text{CE}(\text{asr}, \text{true token}) + (1 - \lambda_{lm}) \, \text{KLD}(\text{asr distribution}, \text{mlm prediction})
\]

where \(\lambda_{lm}\) balances the two components.
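
As an illustration, the objective can be written out term by term. The NumPy sketch below is a toy re-implementation for clarity, not the training code; the default `lam` value is taken from the hyperparameter table.

```python
import numpy as np

def kd_loss(asr_logits, mlm_probs, true_token, lam=1e-3, eps=1e-12):
    """L_t = lam * CE(asr, true token) + (1 - lam) * KL(asr || mlm)."""
    # Softmax over the vocabulary to get the ASR token distribution.
    shifted = np.exp(asr_logits - asr_logits.max())
    asr_probs = shifted / shifted.sum()
    # Cross-entropy with the ground-truth token.
    ce = -np.log(asr_probs[true_token] + eps)
    # KL divergence between the ASR distribution and the MLM prediction.
    kl = np.sum(asr_probs * np.log((asr_probs + eps) / (mlm_probs + eps)))
    return lam * ce + (1 - lam) * kl
```

When the ASR distribution already matches the teacher, the KL term vanishes and only the (weighted) cross-entropy remains.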

### Hyperparameters

| Model             | Learning Rate | KD Lambda | Batch Size |
|-------------------|---------------|-----------|------------|
| Tiny Baseline     | 5e-4          | -         | 8          |
| Tiny Adapted (KD) | 1e-4          | 1e-3      | 8          |

The learning rates differ because they were optimised separately for each configuration.

### Results on CommonVoice Czech

| Model             | Validation Loss | WER   | CER   |
|-------------------|-----------------|-------|-------|
| Tiny Baseline     | 1.236           | 0.447 | 0.031 |
| Tiny Adapted (KD) | 0.636           | 0.345 | 0.023 |

✅ **CER reduced by ~25% (relative)**
✅ **WER reduced by ~23% (relative)**

This shows that even very light knowledge distillation from a lightweight MLM significantly improves language modelling capabilities in Whisper Tiny for Czech.
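
The percentages quoted above are relative reductions computed directly from the table:

```python
def relative_reduction(baseline, adapted):
    """Relative improvement of the adapted model over the baseline metric."""
    return (baseline - adapted) / baseline

wer_gain = relative_reduction(0.447, 0.345)  # ≈ 0.228, i.e. ~23%
cer_gain = relative_reduction(0.031, 0.023)  # ≈ 0.258, i.e. ~25%
```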

---

## Intended Use

This model is intended for research and applications in Czech ASR where lightweight, efficient models are needed but a stronger grasp of the language is crucial.

## Limitations

- Trained on a relatively small subset (210MB) of CC100-Czech due to computational constraints.
- Optimized for clean, non-code-switched Czech speech (based on CommonVoice data).

## Acknowledgments
- Knowledge distillation ([Hinton et al., 2015](https://arxiv.org/abs/1503.02531))
- Whisper model family ([OpenAI, 2022](https://openai.com/research/whisper))
- CommonVoice dataset ([Mozilla, 2020](https://commonvoice.mozilla.org))
- CC100 dataset ([Conneau et al., 2020](https://arxiv.org/abs/1911.02116))

## Citation

If you use this model, please cite (yes, the main topic of the thesis was indeed assistive ASR):

```bibtex
@misc{nadrchal_2025,
  title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study},
  author={David Nadrchal},
  year={2025},
  note={Bachelor's Thesis},
  url={https://github.com/Hobit2002/TracheoSpeech_ASR}
}
```