---
license: apache-2.0
language:
- de
metrics:
- wer
base_model:
- thu-spmi/whistle-small
new_version: thu-spmi/whistle-small-german
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
---

# Whistle

Whistle is a multilingual and crosslingual ASR model pretrained with weak phonetic supervision, using IPA transcriptions generated by LanguageNet G2P models. Unlike self-supervised or grapheme-based approaches, Whistle leverages phoneme-level representations to enable better data efficiency, crosslingual generalization, and reduced catastrophic forgetting. Trained and evaluated on the CommonVoice-based CV-Lang10 benchmark, Whistle demonstrates superior performance on both seen and unseen languages under limited-data conditions.

Whistle was proposed in the paper [Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision](https://arxiv.org/abs/2406.02166) by Saierdaer Yusuyin et al. from THU-SPMI. The original code repository can be found [here](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10).

## Model details

Whistle is a Conformer-based encoder model trained with the CTC (Connectionist Temporal Classification) objective. It was trained on ~4k hours of labelled speech sourced from the publicly available [CommonVoice_v11](https://commonvoice.mozilla.org/) dataset.
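As a rough illustration of the CTC decoding rule (take the best label per frame, collapse repeats, then drop blanks), a minimal greedy decoder might look like the sketch below. This is illustrative only; the WER results reported further down use a 4-gram language model rather than this greedy rule.

```python
# Minimal greedy CTC decoding sketch: argmax per frame, collapse
# repeated labels, then remove blank symbols.
BLANK = 0  # index of the CTC blank symbol (an assumption for this sketch)

def ctc_greedy_decode(frame_logits):
    """frame_logits: list of per-frame score lists over the label set."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    decoded, prev = [], None
    for label in best:
        if label != prev and label != BLANK:
            decoded.append(label)
        prev = label
    return decoded

# Example: 6 frames over a 3-symbol alphabet {blank, 1, 2}.
frames = [
    [0.1, 0.8, 0.1],    # -> 1
    [0.1, 0.8, 0.1],    # -> 1 (repeat, collapsed)
    [0.9, 0.05, 0.05],  # -> blank
    [0.1, 0.8, 0.1],    # -> 1 (new emission after blank)
    [0.1, 0.1, 0.8],    # -> 2
    [0.8, 0.1, 0.1],    # -> blank
]
print(ctc_greedy_decode(frames))  # [1, 1, 2]
```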

Whistle checkpoints come in three sizes: small (90 MB), medium (218 MB), and large (543 MB). Subword-based and wav2vec-based models of the small size were also trained for comparison. The multilingual ASR models are trained on CV-Lang10 data and then evaluated on the test set of each language without fine-tuning. All of the pre-trained checkpoints are available on the [Hugging Face Hub](https://huggingface.co/models?search=thu-spmi/whistle). The checkpoints are summarised in the following tables with links to the models on the Hub:

### Evaluation

Results are reported in Phoneme Error Rate (PER%) and Word Error Rate (WER%).
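WER is the edit (Levenshtein) distance between the hypothesis and reference word sequences, divided by the number of reference words; PER is computed the same way over phoneme tokens. A minimal sketch, for illustration only:

```python
# Minimal WER sketch: Levenshtein distance over words, divided by the
# number of reference words. PER uses phoneme tokens instead of words.

def edit_distance(ref, hyp):
    """Dynamic-programming Levenshtein distance between token lists."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(d[j] + 1,         # deletion
                      d[j - 1] + 1,     # insertion
                      prev + (r != h))  # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1]

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

# One deletion ("ein") against four reference words -> 25% WER.
print(round(100 * wer("das ist ein test", "das ist test"), 2))  # 25.0
```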

Evaluation on the public [CommonVoice_v11](https://commonvoice.mozilla.org/) test sets:

* %PER

| Model | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 8.02 | 3.37 | 5.68 | 4.04 | 8.29 | 5.77 | 6.05 | 18.07 | 8.32 | 8.53 | 7.61 |
| [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 6.70 | 2.63 | 4.53 | 3.12 | 5.95 | 3.95 | 4.61 | 14.81 | 6.04 | 8.47 | 6.08 |
| [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __5.42__ | __1.96__ | __3.52__ | __2.25__ | __4.06__ | __2.64__ | __2.97__ | __11.33__ | __4.04__ | __5.97__ | __4.41__ |

| Model | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | de | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small-finetune-german](https://huggingface.co/thu-spmi/whistle-small-german) | 90 MB | - | - | - | - | - | - | - | - | - | - | __5.37__ | - |
* %WER with 4-gram LM

| Model | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 10.76 | 8.68 | 16.01 | 9.98 | 1.02 | 7.32 | 1.59 | 6.14 | 7.63 | 7.30 | 7.64 |
| [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 9.83 | 7.82 | 14.94 | 9.04 | __0.91__ | 6.57 | 1.65 | 5.65 | 7.27 | 7.37 | 7.10 |
| [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __8.80__ | __7.02__ | __14.02__ | __8.16__ | 0.94 | __6.22__ | __1.46__ | __5.06__ | __7.05__ | __6.92__ | __6.56__ |

| Model | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | de | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small-finetune-german](https://huggingface.co/thu-spmi/whistle-small-german) | 90 MB | - | - | - | - | - | - | - | - | - | - | __15.73__ | - |

For more results, please refer to the [benchmark](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10).
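The Avg. columns in the tables above are unweighted means over the ten per-language scores. For example, for the small model's PER row:

```python
# The Avg. column is the unweighted mean of the ten per-language values.
# PER values for the small model, in table order (en ... tt).
per_small = [8.02, 3.37, 5.68, 4.04, 8.29, 5.77, 6.05, 18.07, 8.32, 8.53]
avg = sum(per_small) / len(per_small)
print(round(avg, 2))  # 7.61, matching the table's Avg. cell
```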

## Training Data

All of our multilingual ASR models are trained on the 10 languages of CV-Lang10, which were processed as described in [lang-process](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10/exp/Multilingual/Multi._phoneme_S#training-process). For the English wav2vec-base and multilingual wav2vec-base models, however, only the audio (without transcriptions) is used for pre-training. The language IDs and training hours of the ten languages are listed in the following table.

| Language | Language ID | # of phonemes | Train hours | Dev hours | Test hours |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| `English` | `en` | 39 | 2227.3 | 27.2 | 27.0 |
| `Spanish` | `es` | 32 | 382.3 | 26.0 | 26.5 |
| `French` | `fr` | 33 | 823.4 | 25.0 | 25.4 |
| `Italian` | `it` | 30 | 271.5 | 24.7 | 26.0 |
| `Kirghiz` | `ky` | 32 | 32.7 | 2.1 | 2.2 |
| `Dutch` | `nl` | 39 | 70.2 | 13.8 | 13.9 |
| `Russian` | `ru` | 32 | 149.8 | 14.6 | 15.0 |
| `Swedish` | `sv-SE` | 33 | 29.8 | 5.5 | 6.2 |
| `Turkish` | `tr` | 41 | 61.5 | 10.1 | 11.4 |
| `Tatar` | `tt` | 31 | 20.8 | 3.0 | 5.7 |
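The per-language training hours in the table account for the "~4k hours" figure quoted under Model details; a quick sanity check:

```python
# Sum the per-language training hours from the table above; the total
# should match the "~4k hours" figure quoted under "Model details".
train_hours = {
    "en": 2227.3, "es": 382.3, "fr": 823.4, "it": 271.5, "ky": 32.7,
    "nl": 70.2, "ru": 149.8, "sv-SE": 29.8, "tr": 61.5, "tt": 20.8,
}
total = sum(train_hours.values())
print(round(total, 1))  # 4069.3 hours, i.e. roughly 4k
```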

## BibTeX entry and citation info

```bibtex
@article{yusuyin2025whistle,
  title={Whistle: Data-efficient multilingual and crosslingual speech recognition via weakly phonetic supervision},
  author={Yusuyin, Saierdaer and Ma, Te and Huang, Hao and Zhao, Wenbo and Ou, Zhijian},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  year={2025},
  publisher={IEEE}
}
```

### Community

If you encounter problems in use, you can open an issue directly on the [GitHub](https://github.com/thu-spmi/CAT/tree/master) page.