LLYYJJ cjli committed on
Commit 8c5c7c2 · 0 Parent(s)

Duplicate from espnet/powsm

Co-authored-by: Chin-Jou Li <cjli@users.noreply.huggingface.co>
.gitattributes ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
datasets:
- anyspeech/ipapack_plus_train_1
- anyspeech/ipapack_plus_train_2
- anyspeech/ipapack_plus_train_3
- anyspeech/ipapack_plus_train_4
language: multilingual
library_name: espnet
license: cc-by-4.0
metrics:
- pfer
- cer
tags:
- espnet
- audio
- phone-recognition
- automatic-speech-recognition
- grapheme-to-phoneme
- phoneme-to-grapheme
pipeline_tag: automatic-speech-recognition
---
### 🐁POWSM

<p align="left">
  <a href="https://arxiv.org/abs/2510.24992"><img src="https://img.shields.io/badge/Paper-2510.24992-red.svg?logo=arxiv&logoColor=red"/></a>
  <a href="https://huggingface.co/espnet/powsm"><img src="https://img.shields.io/badge/Model-powsm-yellow.svg?logo=huggingface&logoColor=yellow"/></a>
  <a href="https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1"><img src="https://img.shields.io/badge/Recipe-powsm-blue.svg?logo=github&logoColor=black"/></a>
</p>

POWSM is the first phonetic foundation model that can perform four phone-related tasks: phone recognition (PR), automatic speech recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme conversion (P2G).

Based on the [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained with [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.

> [!TIP]
> Check out our new model: [🐁POWSM-CTC](https://huggingface.co/espnet/powsm_ctc), an encoder-only variant based on the OWSM-CTC architecture,
> and [💎PRiSM](https://arxiv.org/abs/2601.14046): Benchmarking Phone Realization in Speech Models!


To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
torch
espnet
espnet_model_zoo
```

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

### Example script for PR/ASR/G2P/P2G

Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please ensure the input speech is sampled at 16 kHz, and pad or truncate it to 20 s.
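As a minimal sketch of this preprocessing (assuming a mono NumPy waveform already sampled at 16 kHz; `fit_to_20s` is a hypothetical helper, not part of the ESPnet API):

```python
import numpy as np

TARGET_SR = 16000             # the model expects 16 kHz input
TARGET_LEN = 20 * TARGET_SR   # 20 s -> 320000 samples

def fit_to_20s(speech: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a 1-D waveform to exactly 20 s at 16 kHz."""
    if len(speech) >= TARGET_LEN:
        return speech[:TARGET_LEN]
    return np.pad(speech, (0, TARGET_LEN - len(speech)))
```

If your audio is at a different sampling rate, resample it to 16 kHz first (e.g. with `librosa.resample`) before padding or truncating.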

To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat them as special tokens. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.
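A small illustration of this slash convention (`wrap_phones` and `strip_slashes` are hypothetical helpers for clarity, not part of the ESPnet API):

```python
def wrap_phones(phones):
    """Enclose each phone in slashes, mirroring the tokenizer's convention."""
    return "".join(f"/{p}/" for p in phones)

def strip_slashes(pred: str) -> str:
    """Remove the slash delimiters from a PR/G2P prediction."""
    return pred.replace("/", "")

# wrap_phones(["pʰ", "ɔ", "s", "ə", "m"]) -> "/pʰ//ɔ//s//ə//m/"
```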


> [!NOTE]
> Jan 2026: We release a retrained version with improved ASR text normalization.
> It is located in the subfolder `textnorm_retrained` and has the same structure as the main model.
> Additional details are provided in the updated arXiv appendix.

```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf  # or librosa

task = "<pr>"
s2t = Speech2Text.from_pretrained(
    "espnet/powsm",
    device="cuda",
    lang_sym="<eng>",  # ISO 639-3; set to <unk> for unseen languages
    task_sym=task,     # <pr>, <asr>, <g2p>, <p2g>
)

speech, rate = sf.read("sample.wav")
prompt = "<na>"  # G2P: set to the ASR transcript; P2G: set to the phone transcription with slashes
pred = s2t(speech, text_prev=prompt)[0][0]

# post-process for a cleaner output format
pred = pred.split("<notimestamps>")[1].strip()
if task in ("<pr>", "<g2p>"):
    pred = pred.replace("/", "")  # drop the phone delimiters
print(pred)
```
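The per-task prompt convention noted in the comments above can be summarized in a small helper (`make_prompt` is a hypothetical illustration, not part of the ESPnet API):

```python
def make_prompt(task, transcript=None, phones=None):
    """Build the text_prev prompt for each POWSM task."""
    if task == "<g2p>":
        return transcript  # audio-guided G2P conditions on the text transcript
    if task == "<p2g>":
        return phones      # audio-guided P2G conditions on slash-wrapped phones
    return "<na>"          # PR and ASR need no prompt
```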

#### Other tasks

See `force_align.py` in the [ESPnet recipe](https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1) to try out CTC forced alignment with POWSM's encoder!

Language identification (LID) is learned implicitly during training, and you can run it with the script below:
```python
from espnet2.bin.s2t_inference_language import Speech2Language
import soundfile as sf  # or librosa

s2l = Speech2Language.from_pretrained(
    "espnet/powsm",
    device="cuda",
    nbest=1,  # number of candidate languages to return
    first_lang_sym="<afr>",  # fixed; defined in the vocab list
    last_lang_sym="<zul>",   # fixed; defined in the vocab list
)

speech, rate = sf.read("sample.wav")
pred = s2l(speech)[0]  # a list of (language, probability) pairs
print(pred)
```
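If you request `nbest > 1`, the returned list of (language, probability) pairs can be reduced to a single top language; `top_language` below is a hypothetical post-processing helper, not part of the ESPnet API:

```python
def top_language(pairs):
    """Return the language symbol with the highest probability."""
    lang, prob = max(pairs, key=lambda lp: lp[1])
    return lang
```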

### Citations

```bibtex
@article{powsm,
  title={POWSM: A Phonetic Open Whisper-Style Speech Foundation Model},
  author={Chin-Jou Li and Kalvin Chang and Shikhar Bharadwaj and Eunjung Yeo and Kwanghee Choi and Jian Zhu and David Mortensen and Shinji Watanabe},
  year={2025},
  eprint={2510.24992},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.24992},
}
```
data/token_list/bpe_unigram40000/bpe.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:9b9a8b76353430d41a1b8f7f2ec0f40fa8c4e75567eaef6887bdbb893c55236a
size 967858
exp/s2t_stats_raw_bpe40000/train/feats_stats.npz ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f3ca2ef68be502a75a646c8da36847375964a0d6499fd9ee2d7d620a0f31d746
size 1402
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe40000/config.yaml ADDED
The diff for this file is too large to render.
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe40000/valid.acc.ave_5best.till45epoch.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a91a03bcfd59a956319939891c4098e8f7e8c9ea568d7ec2bcbc1131b32d1197
size 1374692510
meta.yaml ADDED
espnet: '202412'
files:
  s2t_model_file: exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe40000/valid.acc.ave_5best.till45epoch.pth
python: 3.11.9
torch: 2.4.0+cu118
yaml_files:
  s2t_train_config: exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe40000/config.yaml
textnorm_retrained/data/token_list/bpe_unigram40000/bpe.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:1914ae6ed41df02e174ce16a9976cb96b25b4393b14c61435bff702b829f3799
size 972584
textnorm_retrained/exp/s2t_stats_raw_bpe40000/train/feats_stats.npz ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:e5f80694a59a93aab7beeed44621ba82625a90ac838954c581909a8490cd2244
size 1402
textnorm_retrained/exp/s2t_train_ctc3_conv2d_size768_e9_d9_mel128_raw_bpe40000/config.yaml ADDED
The diff for this file is too large to render.
textnorm_retrained/exp/s2t_train_ctc3_conv2d_size768_e9_d9_mel128_raw_bpe40000/valid.acc.ave_5best.till45epoch.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:515c479555a6b3e46aff40b7254932d22b6213b1b36f8cd19e20984f7b3f9dd0
size 1374692891
textnorm_retrained/meta.yaml ADDED
espnet: '202511'  # with unmerged local changes; will update to a suitable version
files:
  s2t_model_file: exp/s2t_train_ctc3_conv2d_size768_e9_d9_mel128_raw_bpe40000/valid.acc.ave_5best.till45epoch.pth
python: 3.12.8
torch: 2.9.1+cu128
yaml_files:
  s2t_train_config: exp/s2t_train_ctc3_conv2d_size768_e9_d9_mel128_raw_bpe40000/config.yaml