---
license: apache-2.0
language:
- en
- es
- fr
- it
- ky
- nl
- ru
- sv
- tr
- tt
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
---

# Whistle

Whistle is a multilingual and crosslingual ASR model pretrained with weak phonetic supervision, using IPA transcriptions generated by LanguageNet G2P models. Unlike self-supervised or grapheme-based approaches, Whistle leverages phoneme-level representations to enable better data efficiency, crosslingual generalization, and reduced catastrophic forgetting. Trained and evaluated on the CommonVoice-based CV-Lang10 benchmark, Whistle demonstrates superior performance on both seen and unseen languages under limited-data conditions.

Whistle was proposed in the paper [Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision](https://arxiv.org/abs/2406.02166) by Saierdaer Yusuyin et al. from THU-SPMI. The original code repository can be found [here](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10).

## Model details

Whistle is a Conformer-based encoder model trained with the CTC (Connectionist Temporal Classification) objective. It was trained on ~4k hours of labelled speech data sourced from the publicly available [CommonVoice_v11](https://commonvoice.mozilla.org/) corpus.

Whistle checkpoints come in three sizes: small (90 MB), medium (218 MB), and large (543 MB). Subword-based and wav2vec-based models of the small size were also trained for comparison. The multilingual ASR models are trained on CV-Lang10 data and evaluated on the test set of each language without fine-tuning. All of the pre-trained checkpoints are available on the [Hugging Face Hub](https://huggingface.co/models?search=thu-spmi/whistle) and are summarised in the tables below, with links to the models on the Hub.
|

### Evaluation

Results are reported in Phoneme Error Rate (PER%) and Word Error Rate (WER%).
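
Both metrics are the token-level edit distance between reference and hypothesis, normalized by the reference length; the only difference is the token unit (phonemes for PER, words for WER). A minimal sketch:

```python
# PER/WER sketch: Levenshtein distance between reference and hypothesis
# token sequences, divided by the reference length, times 100.
# Tokens are phonemes for PER and words for WER.
def edit_distance(ref, hyp):
    """Levenshtein distance via a single rolling DP row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent, normalized by reference length."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```

For example, `error_rate("the cat sat".split(), "the cat sit".split())` counts one substitution over three reference words.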

Evaluation on the public [CommonVoice_v11](https://commonvoice.mozilla.org/) test sets:

* %PER

| Model | Model size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 8.02 | 3.37 | 5.68 | 4.04 | 8.29 | 5.77 | 6.05 | 18.07 | 8.32 | 8.53 | 7.61 |
| [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 6.70 | 2.63 | 4.53 | 3.12 | 5.95 | 3.95 | 4.61 | 14.81 | 6.04 | 8.47 | 6.08 |
| [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __5.42__ | __1.96__ | __3.52__ | __2.25__ | __4.06__ | __2.64__ | __2.97__ | __11.33__ | __4.04__ | __5.97__ | __4.41__ |

* %WER with a 4-gram LM

| Model | Model size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 10.76 | 8.68 | 16.01 | 9.98 | 1.02 | 7.32 | 1.59 | 6.14 | 7.63 | 7.30 | 7.64 |
| [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 9.83 | 7.82 | 14.94 | 9.04 | __0.91__ | 6.57 | 1.65 | 5.65 | 7.27 | 7.37 | 7.10 |
| [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __8.80__ | __7.02__ | __14.02__ | __8.16__ | 0.94 | __6.22__ | __1.46__ | __5.06__ | __7.05__ | __6.92__ | __6.56__ |

For more results, please refer to the [benchmark](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10).

## Training Data

All of our multilingual ASR models are trained on the 10 languages of cv-lang10, processed as described in [lang-process](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10/exp/Multilingual/Multi._phoneme_S#training-process). For the English wav2vec-base model and the multilingual wav2vec-base model, only the audio is used for training. The language IDs and training hours of the ten languages are listed in the following table.

| Language | Language ID | # of phonemes | Train hours | Dev hours | Test hours |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| `English` | `en` | 39 | 2227.3 | 27.2 | 27.0 |
| `Spanish` | `es` | 32 | 382.3 | 26.0 | 26.5 |
| `French` | `fr` | 33 | 823.4 | 25.0 | 25.4 |
| `Italian` | `it` | 30 | 271.5 | 24.7 | 26.0 |
| `Kirghiz` | `ky` | 32 | 32.7 | 2.1 | 2.2 |
| `Dutch` | `nl` | 39 | 70.2 | 13.8 | 13.9 |
| `Russian` | `ru` | 32 | 149.8 | 14.6 | 15.0 |
| `Swedish` | `sv-SE` | 33 | 29.8 | 5.5 | 6.2 |
| `Turkish` | `tr` | 41 | 61.5 | 10.1 | 11.4 |
| `Tatar` | `tt` | 31 | 20.8 | 3.0 | 5.7 |
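
As a quick sanity check, the per-language train hours in the table sum to roughly the ~4k hours of labelled speech quoted in the model details:

```python
# Sum the per-language training hours from the table above; the total
# should match the ~4k hours of labelled speech stated in "Model details".
train_hours = {
    "en": 2227.3, "es": 382.3, "fr": 823.4, "it": 271.5, "ky": 32.7,
    "nl": 70.2, "ru": 149.8, "sv-SE": 29.8, "tr": 61.5, "tt": 20.8,
}
total = sum(train_hours.values())
print(round(total, 1))  # -> 4069.3
```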

## BibTeX entry and citation info

```bibtex
@article{yusuyin2025whistle,
  title={Whistle: Data-efficient multilingual and crosslingual speech recognition via weakly phonetic supervision},
  author={Yusuyin, Saierdaer and Ma, Te and Huang, Hao and Zhao, Wenbo and Ou, Zhijian},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  year={2025},
  publisher={IEEE}
}
```

### Community

If you run into problems, please open an issue on the [GitHub](https://github.com/thu-spmi/CAT/tree/master) page.