---
license: apache-2.0
language:
- en
- es
- fr
- it
- ky
- nl
- ru
- sv
- tr
- tt
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
---
# Whistle
Whistle is a multilingual and crosslingual ASR model pretrained with weak phonetic supervision using IPA transcriptions generated by LanguageNet G2P models. Unlike self-supervised or grapheme-based approaches, Whistle leverages phoneme-level representations to enable better data efficiency, crosslingual generalization, and reduced catastrophic forgetting. Trained and evaluated on the CommonVoice-based CV-Lang10 benchmark, Whistle demonstrates superior performance on both seen and unseen languages under limited-data conditions.
Whistle was proposed in the paper [Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision](https://arxiv.org/abs/2406.02166) by Saierdaer Yusuyin et al. from THU-SPMI. The original code repository can be found [here](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10).
## Model details
Whistle is a Conformer-based encoder model trained with the CTC (Connectionist Temporal Classification) objective. It was trained on ~4k hours of labelled speech data sourced from the publicly available [CommonVoice_v11](https://commonvoice.mozilla.org/) corpus.
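To illustrate how CTC produces a label sequence from frame-level predictions, here is a minimal greedy-decoding sketch (an illustration of the general CTC rule, not the CAT toolkit's actual decoder): take the argmax label per frame, merge consecutive repeats, and drop blanks.

```python
# Minimal greedy CTC collapse (illustrative only, not the CAT toolkit's decoder).
# Repeated labels merge into one; the blank symbol (id 0 here) is removed.

def ctc_greedy_collapse(frame_labels, blank=0):
    """Collapse a per-frame argmax label sequence into an output sequence."""
    out = []
    prev = None
    for lab in frame_labels:
        # Emit a label only when it differs from the previous frame and is not blank.
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Hypothetical frame sequence: labels 5, 2, 9 separated by blanks/repeats.
print(ctc_greedy_collapse([5, 5, 0, 2, 2, 0, 9]))  # [5, 2, 9]
```

In practice the released models decode with beam search (optionally with an n-gram LM, as in the WER table below), but the collapse rule is the same.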
Whistle checkpoints come in three sizes: small (90 MB), medium (218 MB), and large (543 MB). Subword-based and wav2vec-based models of the small size were also trained for comparison. The multilingual ASR models are trained on CV-Lang10 data and then evaluated on the test set of each training language without fine-tuning. All of the pre-trained checkpoints are available on the [Hugging Face Hub](https://huggingface.co/models?search=thu-spmi/whistle). The checkpoints are summarised in the following tables with links to the models on the Hub:
### Evaluation
Results are reported in Phoneme Error Rate (PER%) and Word Error Rate (WER%).
Evaluation on the public [CommonVoice_v11](https://commonvoice.mozilla.org/) test sets:
* %PER
| Model | Model size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 8.02 | 3.37 | 5.68 | 4.04 | 8.29 | 5.77 | 6.05 | 18.07 | 8.32 | 8.53 | 7.61 |
| [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 6.70 | 2.63 | 4.53 | 3.12 | 5.95 | 3.95 | 4.61 | 14.81 | 6.04 | 8.47 | 6.08 |
| [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __5.42__ | __1.96__ | __3.52__ | __2.25__ | __4.06__ | __2.64__ | __2.97__ | __11.33__ | __4.04__ | __5.97__ | __4.41__ |
* %WER with 4-gram LM
| Model | Model size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 10.76 | 8.68 | 16.01 | 9.98 | 1.02 | 7.32 | 1.59 | 6.14 | 7.63 | 7.30 | 7.64 |
| [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 9.83 | 7.82 | 14.94 | 9.04 | __0.91__ | 6.57 | 1.65 | 5.65 | 7.27 | 7.37 | 7.10 |
| [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __8.80__ | __7.02__ | __14.02__ | __8.16__ | 0.94 | __6.22__ | __1.46__ | __5.06__ | __7.05__ | __6.92__ | __6.56__ |
For more results, please refer to the [benchmark](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10) page.
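Both PER and WER in the tables above are standard edit-distance rates: the token-level Levenshtein distance between hypothesis and reference, divided by the reference length (phonemes for PER, words for WER). A minimal sketch of the computation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all remaining reference tokens
    for j in range(n + 1):
        dp[0][j] = j  # insert all remaining hypothesis tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[m][n]

def wer(ref_text, hyp_text):
    """Word error rate: edit distance over reference word count."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / len(ref)

print(wer("the quick brown fox", "the quick fox"))  # 0.25 (one deletion / 4 words)
```

PER is computed the same way over phoneme sequences instead of word sequences.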
## Training Data
All of our multilingual ASR models are trained on the 10 languages of CV-Lang10, which have been processed as described in [lang-process](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10/exp/Multilingual/Multi._phoneme_S#training-process).
For the English wav2vec-base model and the multilingual wav2vec-base model, only audio (without transcriptions) is used for training. The language IDs and data hours of the ten languages are given in the following table.
| Language | Language ID | # of phonemes | Train hours | Dev hours | Test hours |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| `English` | `en` | 39 | 2227.3 | 27.2 | 27.0 |
| `Spanish` | `es` | 32 | 382.3 | 26.0 | 26.5 |
| `French` | `fr` | 33 | 823.4 | 25.0 | 25.4 |
| `Italian` | `it` | 30 | 271.5 | 24.7 | 26.0 |
| `Kirghiz` | `ky` | 32 | 32.7 | 2.1 | 2.2 |
| `Dutch` | `nl` | 39 | 70.2 | 13.8 | 13.9 |
| `Russian` | `ru` | 32 | 149.8 | 14.6 | 15.0 |
| `Swedish` | `sv-SE` | 33 | 29.8 | 5.5 | 6.2 |
| `Turkish` | `tr` | 41 | 61.5 | 10.1 | 11.4 |
| `Tatar` | `tt` | 31 | 20.8 | 3.0 | 5.7 |
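As a quick sanity check, the per-language train hours in the table sum to roughly the "~4k hours" figure quoted in the model details (a small illustrative script, with the values copied from the table):

```python
# Train hours per language, taken from the table above.
train_hours = {
    "en": 2227.3, "es": 382.3, "fr": 823.4, "it": 271.5, "ky": 32.7,
    "nl": 70.2, "ru": 149.8, "sv-SE": 29.8, "tr": 61.5, "tt": 20.8,
}

total = sum(train_hours.values())
print(round(total, 1))  # 4069.3 hours, i.e. the "~4k hours" in the model details
```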
## BibTeX entry and citation info
```bibtex
@article{yusuyin2025whistle,
title={Whistle: Data-efficient multilingual and crosslingual speech recognition via weakly phonetic supervision},
author={Yusuyin, Saierdaer and Ma, Te and Huang, Hao and Zhao, Wenbo and Ou, Zhijian},
journal={IEEE Transactions on Audio, Speech and Language Processing},
year={2025},
publisher={IEEE}
}
```
### Community
If you encounter problems in use, please open an issue on the [GitHub](https://github.com/thu-spmi/CAT/tree/master) page.