---
license: apache-2.0
language:
- de
metrics:
- wer
base_model:
- thu-spmi/whistle-small
new_version: thu-spmi/whistle-small-german
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
---

# Whistle

Whistle is a multilingual and crosslingual ASR model pretrained with weak phonetic supervision, using IPA transcriptions generated by LanguageNet G2P models. Unlike self-supervised or grapheme-based approaches, Whistle leverages phoneme-level representations to enable better data efficiency, crosslingual generalization, and reduced catastrophic forgetting. Trained and evaluated on the CommonVoice-based CV-Lang10 benchmark, Whistle demonstrates superior performance on both seen and unseen languages under limited-data conditions.

Whistle was proposed in the paper [Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision](https://arxiv.org/abs/2406.02166) by Saierdaer Yusuyin et al. from THU-SPMI. The original code repository can be found [here](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10).

## Model details

Whistle is a Conformer-based encoder model trained with the CTC (Connectionist Temporal Classification) objective. It was trained on ~4k hours of labelled speech sourced from the publicly available [CommonVoice_v11](https://commonvoice.mozilla.org/) dataset.
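As a rough illustration of the CTC decoding rule (take the best label per frame, collapse repeats, then drop blanks), a minimal greedy decoder might look like the sketch below. This is illustrative only; the WER results reported further down use a 4-gram language model rather than this greedy rule.

```python
# Minimal greedy CTC decoding sketch: argmax per frame, collapse
# repeated labels, then remove blank symbols.
BLANK = 0  # index of the CTC blank symbol (an assumption for this sketch)

def ctc_greedy_decode(frame_logits):
    """frame_logits: list of per-frame score lists over the label set."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    decoded, prev = [], None
    for label in best:
        if label != prev and label != BLANK:
            decoded.append(label)
        prev = label
    return decoded

# Example: 6 frames over a 3-symbol alphabet {blank, 1, 2}.
frames = [
    [0.1, 0.8, 0.1],    # -> 1
    [0.1, 0.8, 0.1],    # -> 1 (repeat, collapsed)
    [0.9, 0.05, 0.05],  # -> blank
    [0.1, 0.8, 0.1],    # -> 1 (new emission after blank)
    [0.1, 0.1, 0.8],    # -> 2
    [0.8, 0.1, 0.1],    # -> blank
]
print(ctc_greedy_decode(frames))  # [1, 1, 2]
```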

Whistle checkpoints come in three sizes: small (90 MB), medium (218 MB), and large (543 MB). Subword-based and wav2vec-based models of the small size were also trained for comparison. The multilingual ASR models are trained on CV-Lang10 data and then evaluated on the test set of each language without fine-tuning. All of the pre-trained checkpoints are available on the [Hugging Face Hub](https://huggingface.co/models?search=thu-spmi/whistle). The checkpoints are summarised in the following tables with links to the models on the Hub:

### Evaluation

Results are reported in Phoneme Error Rate (PER%) and Word Error Rate (WER%).
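WER is the edit (Levenshtein) distance between the hypothesis and reference word sequences, divided by the number of reference words; PER is computed the same way over phoneme tokens. A minimal sketch, for illustration only:

```python
# Minimal WER sketch: Levenshtein distance over words, divided by the
# number of reference words. PER uses phoneme tokens instead of words.

def edit_distance(ref, hyp):
    """Dynamic-programming Levenshtein distance between token lists."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(d[j] + 1,         # deletion
                      d[j - 1] + 1,     # insertion
                      prev + (r != h))  # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1]

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

# One deletion ("ein") against four reference words -> 25% WER.
print(round(100 * wer("das ist ein test", "das ist test"), 2))  # 25.0
```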

Evaluation on the public [CommonVoice_v11](https://commonvoice.mozilla.org/) test sets:

* %PER

| Model | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 8.02 | 3.37 | 5.68 | 4.04 | 8.29 | 5.77 | 6.05 | 18.07 | 8.32 | 8.53 | 7.61 |
| [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 6.70 | 2.63 | 4.53 | 3.12 | 5.95 | 3.95 | 4.61 | 14.81 | 6.04 | 8.47 | 6.08 |
| [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __5.42__ | __1.96__ | __3.52__ | __2.25__ | __4.06__ | __2.64__ | __2.97__ | __11.33__ | __4.04__ | __5.97__ | __4.41__ |

| Model | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | de | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small-finetune-german](https://huggingface.co/thu-spmi/whistle-small-german) | 90 MB | - | - | - | - | - | - | - | - | - | - | __5.37__ | - |
* %WER with 4-gram LM

| Model | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 10.76 | 8.68 | 16.01 | 9.98 | 1.02 | 7.32 | 1.59 | 6.14 | 7.63 | 7.30 | 7.64 |
| [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 9.83 | 7.82 | 14.94 | 9.04 | __0.91__ | 6.57 | 1.65 | 5.65 | 7.27 | 7.37 | 7.10 |
| [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __8.80__ | __7.02__ | __14.02__ | __8.16__ | 0.94 | __6.22__ | __1.46__ | __5.06__ | __7.05__ | __6.92__ | __6.56__ |

| Model | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | de | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small-finetune-german](https://huggingface.co/thu-spmi/whistle-small-german) | 90 MB | - | - | - | - | - | - | - | - | - | - | __15.73__ | - |

For more results, please refer to the [benchmark](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10).
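The Avg. columns in the tables above are unweighted means over the ten per-language scores. For example, for the small model's PER row:

```python
# The Avg. column is the unweighted mean of the ten per-language values.
# PER values for the small model, in table order (en ... tt).
per_small = [8.02, 3.37, 5.68, 4.04, 8.29, 5.77, 6.05, 18.07, 8.32, 8.53]
avg = sum(per_small) / len(per_small)
print(round(avg, 2))  # 7.61, matching the table's Avg. cell
```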

## Training Data

All of our multilingual ASR models are trained on the 10 languages of CV-Lang10, which were processed as described in [lang-process](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10/exp/Multilingual/Multi._phoneme_S#training-process). For the English wav2vec-base and multilingual wav2vec-base models, however, only the audio (without transcriptions) is used for pre-training. The language IDs and training hours of the ten languages are listed in the following table.

| Language | Language ID | # of phonemes | Train hours | Dev hours | Test hours |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| `English` | `en` | 39 | 2227.3 | 27.2 | 27.0 |
| `Spanish` | `es` | 32 | 382.3 | 26.0 | 26.5 |
| `French` | `fr` | 33 | 823.4 | 25.0 | 25.4 |
| `Italian` | `it` | 30 | 271.5 | 24.7 | 26.0 |
| `Kirghiz` | `ky` | 32 | 32.7 | 2.1 | 2.2 |
| `Dutch` | `nl` | 39 | 70.2 | 13.8 | 13.9 |
| `Russian` | `ru` | 32 | 149.8 | 14.6 | 15.0 |
| `Swedish` | `sv-SE` | 33 | 29.8 | 5.5 | 6.2 |
| `Turkish` | `tr` | 41 | 61.5 | 10.1 | 11.4 |
| `Tatar` | `tt` | 31 | 20.8 | 3.0 | 5.7 |
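The per-language training hours in the table account for the "~4k hours" figure quoted under Model details; a quick sanity check:

```python
# Sum the per-language training hours from the table above; the total
# should match the "~4k hours" figure quoted under "Model details".
train_hours = {
    "en": 2227.3, "es": 382.3, "fr": 823.4, "it": 271.5, "ky": 32.7,
    "nl": 70.2, "ru": 149.8, "sv-SE": 29.8, "tr": 61.5, "tt": 20.8,
}
total = sum(train_hours.values())
print(round(total, 1))  # 4069.3 hours, i.e. roughly 4k
```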

## BibTeX entry and citation info

```bibtex
@article{yusuyin2025whistle,
  title={Whistle: Data-efficient multilingual and crosslingual speech recognition via weakly phonetic supervision},
  author={Yusuyin, Saierdaer and Ma, Te and Huang, Hao and Zhao, Wenbo and Ou, Zhijian},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  year={2025},
  publisher={IEEE}
}
```

### Community

If you encounter problems in use, you can open an issue directly on the [GitHub](https://github.com/thu-spmi/CAT/tree/master) page.