---
license: apache-2.0
language:
- de
metrics:
- wer
base_model:
- thu-spmi/whistle-small
new_version: thu-spmi/whistle-small-german
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
---

# Whistle

Whistle is a multilingual and crosslingual ASR model pretrained with weak phonetic supervision, using IPA transcriptions generated by LanguageNet G2P models. Unlike self-supervised or grapheme-based approaches, Whistle leverages phoneme-level representations to enable better data efficiency, crosslingual generalization, and reduced catastrophic forgetting. Trained and evaluated on the CommonVoice-based CV-Lang10 benchmark, Whistle demonstrates superior performance on both seen and unseen languages under limited-data conditions.

Whistle was proposed in the paper [Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision](https://arxiv.org/abs/2406.02166) by Saierdaer Yusuyin et al. from THU-SPMI. The original code repository can be found [here](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10).

## Model details

Whistle is a Conformer-based encoder model trained with the CTC (Connectionist Temporal Classification) objective. It was trained on ~4k hours of labelled speech data sourced from the publicly available [CommonVoice_v11](https://commonvoice.mozilla.org/) corpus.

Whistle checkpoints come in three sizes: small (90 MB), medium (218 MB), and large (543 MB). Subword-based and wav2vec-based models of the small size were also trained for comparison. The multilingual ASR models are trained on CV-Lang10 data and then evaluated on the test set of each language without fine-tuning. All of the pre-trained checkpoints are available on the [Hugging Face Hub](https://huggingface.co/models?search=thu-spmi/whistle). The checkpoints are summarised in the following tables with links to the models on the Hub:
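Since Whistle is trained with the CTC objective, its per-frame outputs are collapsed into label sequences at decoding time. The following is a minimal sketch of CTC greedy decoding, for illustration only; the released models are decoded with the CAT toolkit, typically with a lexicon and an n-gram language model, and the blank index here is an assumption.

```python
BLANK = 0  # CTC blank token id (assumed to be index 0 in this sketch)

def ctc_greedy_decode(frame_ids):
    """Collapse per-frame argmax label ids into an output sequence:
    merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out

# Frames "a a - a b b -" (with '-' = blank) decode to "a a b"
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

Note that the blank symbol is what lets CTC emit the same label twice in a row: the blank between the second and third `1` frames prevents them from being merged.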
27
+
28
+ ### Evaluation
29
+
30
+ Results are reported in Phoneme Error Rate (PER%) and Word Error Rate (WER%).
31
+
32
+ Evaluation on Public [CommonVoice_v11](https://commonvoice.mozilla.org/)
33
+
34
+ * %PER
35
+ | Model | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg.
36
+ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
37
+ | [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 8.02 | 3.37 | 5.68 | 4.04 | 8.29 | 5.77 | 6.05 | 18.07 | 8.32 | 8.53 | 7.61 |
38
+ | [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 6.70 | 2.63 | 4.53 | 3.12 | 5.95 | 3.95 | 4.61 | 14.81 | 6.04 | 8.47 | 6.08 |
39
+ | [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __5.42__ | __1.96__ | __3.52__ | __2.25__ | __4.06__ | __2.64__ | __2.97__ | __11.33__ | __4.04__ | __5.97__ | __4.41__ |
40
+
41
+ | Model | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | de | Avg.
42
+ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
43
+ | [small-finetune-german](https://huggingface.co/thu-spmi/whistle-small-german) | 90 MB | - | - | - | - | - | - | - | - | - | - | __5.37__ | - |
44
+

* %WER with 4-gram LM

| Model | Size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small](https://huggingface.co/thu-spmi/whistle-small) | 90 MB | 10.76 | 8.68 | 16.01 | 9.98 | 1.02 | 7.32 | 1.59 | 6.14 | 7.63 | 7.30 | 7.64 |
| [medium](https://huggingface.co/thu-spmi/whistle-medium) | 218 MB | 9.83 | 7.82 | 14.94 | 9.04 | __0.91__ | 6.57 | 1.65 | 5.65 | 7.27 | 7.37 | 7.10 |
| [large](https://huggingface.co/thu-spmi/whistle-large) | 543 MB | __8.80__ | __7.02__ | __14.02__ | __8.16__ | 0.94 | __6.22__ | __1.46__ | __5.06__ | __7.05__ | __6.92__ | __6.56__ |

| Model | Size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | de | Avg. |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| [small-finetune-german](https://huggingface.co/thu-spmi/whistle-small-german) | 90 MB | - | - | - | - | - | - | - | - | - | - | __15.73__ | - |

For more results, please refer to the [benchmark](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10).
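For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the reference length. A minimal sketch of this standard definition is below; the numbers in the tables above come from the CAT toolkit's own scoring pipeline, which may apply its own text normalization.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(r)][len(h)] / len(r)

# One inserted word against a 3-word reference -> WER = 1/3
print(round(wer("the cat sat", "the cat sat down"), 3))  # 0.333
```

PER is computed the same way, with phoneme tokens in place of words.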

## Training Data

All of our multilingual ASR models are trained on the 10 languages of CV-Lang10, which have been processed as described in [lang-process](https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10/exp/Multilingual/Multi._phoneme_S#training-process). For the English wav2vec-base model and the multilingual wav2vec-base model, only the audio is used for training. The language IDs and training hours of the ten languages are given in the following table.

| Language | Language ID | # of phonemes | Train hours | Dev hours | Test hours |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| `English` | `en` | 39 | 2227.3 | 27.2 | 27.0 |
| `Spanish` | `es` | 32 | 382.3 | 26.0 | 26.5 |
| `French` | `fr` | 33 | 823.4 | 25.0 | 25.4 |
| `Italian` | `it` | 30 | 271.5 | 24.7 | 26.0 |
| `Kirghiz` | `ky` | 32 | 32.7 | 2.1 | 2.2 |
| `Dutch` | `nl` | 39 | 70.2 | 13.8 | 13.9 |
| `Russian` | `ru` | 32 | 149.8 | 14.6 | 15.0 |
| `Swedish` | `sv-SE` | 33 | 29.8 | 5.5 | 6.2 |
| `Turkish` | `tr` | 41 | 61.5 | 10.1 | 11.4 |
| `Tatar` | `tt` | 31 | 20.8 | 3.0 | 5.7 |
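As a quick sanity check, the per-language train hours in the table add up to the "~4k hours" quoted in "Model details":

```python
# Train hours per language, copied from the table above.
train_hours = {
    "en": 2227.3, "es": 382.3, "fr": 823.4, "it": 271.5, "ky": 32.7,
    "nl": 70.2, "ru": 149.8, "sv-SE": 29.8, "tr": 61.5, "tt": 20.8,
}
total = sum(train_hours.values())
print(round(total, 1))  # 4069.3 -- i.e. ~4k hours
```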

## BibTeX entry and citation info

```bibtex
@article{yusuyin2025whistle,
  title={Whistle: Data-efficient multilingual and crosslingual speech recognition via weakly phonetic supervision},
  author={Yusuyin, Saierdaer and Ma, Te and Huang, Hao and Zhao, Wenbo and Ou, Zhijian},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  year={2025},
  publisher={IEEE}
}
```

### Community

If you encounter problems in use, you can open an issue on the [GitHub](https://github.com/thu-spmi/CAT/tree/master) page.