---
license: apache-2.0
language:
- en
- es
- fr
- it
- ky
- nl
- ru
- sv
- tr
- tt
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
---

# Whistle

Whistle is a multilingual and crosslingual ASR model pretrained with weak phonetic supervision, using IPA transcriptions generated by LanguageNet G2P models. Unlike self-supervised or grapheme-based approaches, Whistle leverages phoneme-level representations to enable better data efficiency, crosslingual generalization, and reduced catastrophic forgetting. Trained and evaluated on the CommonVoice-based CV-Lang10 benchmark, Whistle demonstrates superior performance on both seen and unseen languages under limited-data conditions.

Whistle was proposed in the paper [Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision](https://arxiv.org/abs/2406.02166) by Saierdaer Yusuyin et al. from THU-SPMI. The original code repository can be found [here](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10).

## Model details

Whistle is a Conformer-based encoder model trained with the CTC (Connectionist Temporal Classification) objective. It was trained on ~4k hours of labelled speech data sourced from the publicly available [CommonVoice_v11](https://commonvoice.mozilla.org/) corpus.

Whistle checkpoints come in three sizes: small (90 MB), medium (218 MB), and large (543 MB). Subword-based and wav2vec-based models of the small size were also trained for comparison. The multilingual ASR models are trained on CV-Lang10 data and evaluated on the test set of each language without fine-tuning. All of the pre-trained checkpoints are available on the [Hugging Face Hub](https://huggingface.co/models?search=thu-spmi/whistle) and are summarised in the tables below, with links to the models on the Hub.
|

### Evaluation

Results are reported in Phoneme Error Rate (PER%) and Word Error Rate (WER%).
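
Both metrics are the token-level edit distance between reference and hypothesis, normalized by the reference length; the only difference is the token unit (phonemes for PER, words for WER). A minimal sketch:

```python
# PER/WER sketch: Levenshtein distance between reference and hypothesis
# token sequences, divided by the reference length, times 100.
# Tokens are phonemes for PER and words for WER.
def edit_distance(ref, hyp):
    """Levenshtein distance via a single rolling DP row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent, normalized by reference length."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```

For example, `error_rate("the cat sat".split(), "the cat sit".split())` counts one substitution over three reference words.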

Evaluation on the public [CommonVoice_v11](https://commonvoice.mozilla.org/) test sets:

* %PER

| Model | Model size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 8.02 | 3.37 | 5.68 | 4.04 | 8.29 | 5.77 | 6.05 | 18.07 | 8.32 | 8.53 | 7.61 |
| [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 6.70 | 2.63 | 4.53 | 3.12 | 5.95 | 3.95 | 4.61 | 14.81 | 6.04 | 8.47 | 6.08 |
| [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __5.42__ | __1.96__ | __3.52__ | __2.25__ | __4.06__ | __2.64__ | __2.97__ | __11.33__ | __4.04__ | __5.97__ | __4.41__ |

* %WER with a 4-gram LM

| Model | Model size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 10.76 | 8.68 | 16.01 | 9.98 | 1.02 | 7.32 | 1.59 | 6.14 | 7.63 | 7.30 | 7.64 |
| [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 9.83 | 7.82 | 14.94 | 9.04 | __0.91__ | 6.57 | 1.65 | 5.65 | 7.27 | 7.37 | 7.10 |
| [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __8.80__ | __7.02__ | __14.02__ | __8.16__ | 0.94 | __6.22__ | __1.46__ | __5.06__ | __7.05__ | __6.92__ | __6.56__ |

For more results, please refer to the [benchmark](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10).

## Training Data

All of our multilingual ASR models are trained on the 10 languages of cv-lang10, processed as described in [lang-process](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10/exp/Multilingual/Multi._phoneme_S#training-process). For the English wav2vec-base model and the multilingual wav2vec-base model, only the audio is used for training. The language IDs and training hours of the ten languages are listed in the following table.

| Language | Language ID | # of phonemes | Train hours | Dev hours | Test hours |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| `English` | `en` | 39 | 2227.3 | 27.2 | 27.0 |
| `Spanish` | `es` | 32 | 382.3 | 26.0 | 26.5 |
| `French` | `fr` | 33 | 823.4 | 25.0 | 25.4 |
| `Italian` | `it` | 30 | 271.5 | 24.7 | 26.0 |
| `Kirghiz` | `ky` | 32 | 32.7 | 2.1 | 2.2 |
| `Dutch` | `nl` | 39 | 70.2 | 13.8 | 13.9 |
| `Russian` | `ru` | 32 | 149.8 | 14.6 | 15.0 |
| `Swedish` | `sv-SE` | 33 | 29.8 | 5.5 | 6.2 |
| `Turkish` | `tr` | 41 | 61.5 | 10.1 | 11.4 |
| `Tatar` | `tt` | 31 | 20.8 | 3.0 | 5.7 |
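
As a quick sanity check, the per-language train hours in the table sum to roughly the ~4k hours of labelled speech quoted in the model details:

```python
# Sum the per-language training hours from the table above; the total
# should match the ~4k hours of labelled speech stated in "Model details".
train_hours = {
    "en": 2227.3, "es": 382.3, "fr": 823.4, "it": 271.5, "ky": 32.7,
    "nl": 70.2, "ru": 149.8, "sv-SE": 29.8, "tr": 61.5, "tt": 20.8,
}
total = sum(train_hours.values())
print(round(total, 1))  # -> 4069.3
```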

## BibTeX entry and citation info

```bibtex
@article{yusuyin2025whistle,
  title={Whistle: Data-efficient multilingual and crosslingual speech recognition via weakly phonetic supervision},
  author={Yusuyin, Saierdaer and Ma, Te and Huang, Hao and Zhao, Wenbo and Ou, Zhijian},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  year={2025},
  publisher={IEEE}
}
```

### Community

If you run into problems, please open an issue on the [GitHub](https://github.com/thu-spmi/CAT/tree/master) page.