Duplicate from espnet/powsm
Co-authored-by: Chin-Jou Li <cjli@users.noreply.huggingface.co>
- .gitattributes +35 -0
- README.md +121 -0
- data/token_list/bpe_unigram40000/bpe.model +3 -0
- exp/s2t_stats_raw_bpe40000/train/feats_stats.npz +3 -0
- exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe40000/config.yaml +0 -0
- exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe40000/valid.acc.ave_5best.till45epoch.pth +3 -0
- meta.yaml +7 -0
- textnorm_retrained/data/token_list/bpe_unigram40000/bpe.model +3 -0
- textnorm_retrained/exp/s2t_stats_raw_bpe40000/train/feats_stats.npz +3 -0
- textnorm_retrained/exp/s2t_train_ctc3_conv2d_size768_e9_d9_mel128_raw_bpe40000/config.yaml +0 -0
- textnorm_retrained/exp/s2t_train_ctc3_conv2d_size768_e9_d9_mel128_raw_bpe40000/valid.acc.ave_5best.till45epoch.pth +3 -0
- textnorm_retrained/meta.yaml +7 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,121 @@
---
datasets:
- anyspeech/ipapack_plus_train_1
- anyspeech/ipapack_plus_train_2
- anyspeech/ipapack_plus_train_3
- anyspeech/ipapack_plus_train_4
language: multilingual
library_name: espnet
license: cc-by-4.0
metrics:
- pfer
- cer
tags:
- espnet
- audio
- phone-recognition
- automatic-speech-recognition
- grapheme-to-phoneme
- phoneme-to-grapheme
pipeline_tag: automatic-speech-recognition
---

### 🐁POWSM

<p align="left">
<a href="https://arxiv.org/abs/2510.24992"><img src="https://img.shields.io/badge/Paper-2510.24992-red.svg?logo=arxiv&logoColor=red"/></a>
<a href="https://huggingface.co/espnet/powsm"><img src="https://img.shields.io/badge/Model-powsm-yellow.svg?logo=huggingface&logoColor=yellow"/></a>
<a href="https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1"><img src="https://img.shields.io/badge/Recipe-powsm-blue.svg?logo=github&logoColor=black"/></a>
</p>

POWSM is the first phonetic foundation model that can perform four phone-related tasks: Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme conversion (P2G).

Based on the [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained on [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.

> [!TIP]
> Check out our new model [🐁POWSM-CTC](https://huggingface.co/espnet/powsm_ctc), an encoder-only variant based on the OWSM-CTC architecture,
> and [💎PRiSM](https://arxiv.org/abs/2601.14046), our benchmark for phone realization in speech models!

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
torch
espnet
espnet_model_zoo
```

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

### Example script for PR/ASR/G2P/P2G
Our models are trained on 16kHz audio with a fixed duration of 20s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 20s.
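As an illustration, a minimal preprocessing sketch for this requirement (assuming a mono NumPy waveform that is already at 16 kHz; otherwise resample first, e.g. with librosa):

```python
import numpy as np

TARGET_SR = 16000            # POWSM expects 16 kHz input
TARGET_LEN = 20 * TARGET_SR  # fixed 20 s window

def pad_or_truncate(speech: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a mono waveform to exactly 20 s at 16 kHz."""
    if len(speech) >= TARGET_LEN:
        return speech[:TARGET_LEN]
    return np.pad(speech, (0, TARGET_LEN - len(speech)))
```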
To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat it as a special token. For example, /pʰɔsəm/ is tokenized as /pʰ//ɔ//s//ə//m/.
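For example, a hypothetical helper (not part of the ESPnet API) that builds such a slash-delimited string, e.g. as the P2G prompt, from a list of phones:

```python
def to_slash_prompt(phones):
    """Wrap each phone in slashes: ["pʰ", "ɔ", "s", "ə", "m"] -> "/pʰ//ɔ//s//ə//m/"."""
    return "".join(f"/{p}/" for p in phones)

print(to_slash_prompt(["pʰ", "ɔ", "s", "ə", "m"]))  # /pʰ//ɔ//s//ə//m/
```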

> [!NOTE]
> Jan 2026: We release a retrained version with improved ASR text normalization.
> It is located in the subfolder `textnorm_retrained` and has the same structure as the main model.
> Additional details are provided in the updated arXiv appendix.

```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf  # or librosa

task = "<pr>"
s2t = Speech2Text.from_pretrained(
    "espnet/powsm",
    device="cuda",
    lang_sym="<eng>",  # ISO 639-3; set to <unk> for unseen languages
    task_sym=task,     # <pr>, <asr>, <g2p>, <p2g>
)

speech, rate = sf.read("sample.wav")
prompt = "<na>"  # G2P: set to the ASR transcript; P2G: set to the phone transcription with slashes
pred = s2t(speech, text_prev=prompt)[0][0]

# post-processing for a cleaner output format
pred = pred.split("<notimestamps>")[1].strip()
if task in ("<pr>", "<g2p>"):
    pred = pred.replace("/", "")
print(pred)
```

#### Other tasks

See `force_align.py` in the [ESPnet recipe](https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1) to try out CTC forced alignment with POWSM's encoder!

LID is learned implicitly during training, and you can run it with the script below:

```python
from espnet2.bin.s2t_inference_language import Speech2Language
import soundfile as sf  # or librosa

s2l = Speech2Language.from_pretrained(
    "espnet/powsm",
    device="cuda",
    nbest=1,  # number of candidate languages to return
    first_lang_sym="<afr>",  # fixed; defined in the vocab list
    last_lang_sym="<zul>",   # fixed; defined in the vocab list
)

speech, rate = sf.read("sample.wav")
pred = s2l(speech)[0]  # a list of (language, probability) pairs
print(pred)
```

### Citations

```bibtex
@article{powsm,
  title={POWSM: A Phonetic Open Whisper-Style Speech Foundation Model},
  author={Chin-Jou Li and Kalvin Chang and Shikhar Bharadwaj and Eunjung Yeo and Kwanghee Choi and Jian Zhu and David Mortensen and Shinji Watanabe},
  year={2025},
  eprint={2510.24992},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.24992},
}
```
data/token_list/bpe_unigram40000/bpe.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9b9a8b76353430d41a1b8f7f2ec0f40fa8c4e75567eaef6887bdbb893c55236a
size 967858
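The three lines above are a Git LFS pointer file, not the model file itself; each line is a "key value" pair. A minimal sketch of reading its fields:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split each "key value" line of a Git LFS pointer into a dict."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:9b9a8b76353430d41a1b8f7f2ec0f40fa8c4e75567eaef6887bdbb893c55236a\n"
    "size 967858\n"
)
fields = parse_lfs_pointer(pointer)
print(fields["size"])  # 967858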
exp/s2t_stats_raw_bpe40000/train/feats_stats.npz
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f3ca2ef68be502a75a646c8da36847375964a0d6499fd9ee2d7d620a0f31d746
size 1402
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe40000/config.yaml
ADDED
The diff for this file is too large to render.
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe40000/valid.acc.ave_5best.till45epoch.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a91a03bcfd59a956319939891c4098e8f7e8c9ea568d7ec2bcbc1131b32d1197
size 1374692510
meta.yaml
ADDED
@@ -0,0 +1,7 @@
espnet: '202412'
files:
  s2t_model_file: exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe40000/valid.acc.ave_5best.till45epoch.pth
python: 3.11.9
torch: 2.4.0+cu118
yaml_files:
  s2t_train_config: exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe40000/config.yaml
textnorm_retrained/data/token_list/bpe_unigram40000/bpe.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1914ae6ed41df02e174ce16a9976cb96b25b4393b14c61435bff702b829f3799
size 972584
textnorm_retrained/exp/s2t_stats_raw_bpe40000/train/feats_stats.npz
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e5f80694a59a93aab7beeed44621ba82625a90ac838954c581909a8490cd2244
size 1402
textnorm_retrained/exp/s2t_train_ctc3_conv2d_size768_e9_d9_mel128_raw_bpe40000/config.yaml
ADDED
The diff for this file is too large to render.
textnorm_retrained/exp/s2t_train_ctc3_conv2d_size768_e9_d9_mel128_raw_bpe40000/valid.acc.ave_5best.till45epoch.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:515c479555a6b3e46aff40b7254932d22b6213b1b36f8cd19e20984f7b3f9dd0
size 1374692891
textnorm_retrained/meta.yaml
ADDED
@@ -0,0 +1,7 @@
espnet: '202511'  # with unmerged local change; will update to suitable version
files:
  s2t_model_file: exp/s2t_train_ctc3_conv2d_size768_e9_d9_mel128_raw_bpe40000/valid.acc.ave_5best.till45epoch.pth
python: 3.12.8
torch: 2.9.1+cu128
yaml_files:
  s2t_train_config: exp/s2t_train_ctc3_conv2d_size768_e9_d9_mel128_raw_bpe40000/config.yaml