---
datasets:
- anyspeech/ipapack_plus_train_1
- anyspeech/ipapack_plus_train_2
- anyspeech/ipapack_plus_train_3
- anyspeech/ipapack_plus_train_4
language: multilingual
library_name: espnet
license: cc-by-4.0
metrics:
  - pfer
  - cer
tags:
  - espnet
  - audio
  - phone-recognition
  - automatic-speech-recognition
  - grapheme-to-phoneme
  - phoneme-to-grapheme
pipeline_tag: automatic-speech-recognition
---

### 🐁POWSM

<p align="left">
  <a href="https://arxiv.org/abs/2510.24992"><img src="https://img.shields.io/badge/Paper-2510.24992-red.svg?logo=arxiv&logoColor=red"/></a>
  <a href="https://huggingface.co/espnet/powsm"><img src="https://img.shields.io/badge/Model-powsm-yellow.svg?logo=huggingface&logoColor=yellow"/></a>
  <a href="https://github.com/espnet/egs2/powsm/s2t1"><img src="https://img.shields.io/badge/Recipe-powsm-blue.svg?logo=github&logoColor=black"/></a>
</p>

POWSM is the first phonetic foundation model that can perform four phone-related tasks:
Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme
conversion (P2G).

Based on [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained with [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.

> [!TIP]
> Check out our new model: [🐁POWSM-CTC](https://huggingface.co/espnet/powsm_ctc), an encoder-only variant based on the OWSM-CTC architecture,
> and [💎PRiSM](https://arxiv.org/abs/2601.14046): Benchmarking Phone Realization in Speech Models!


To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
torch
espnet
espnet_model_zoo
```

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

### Example script for PR/ASR/G2P/P2G

Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please make sure the input speech is sampled at 16 kHz, and pad or truncate it to 20 s.
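A minimal sketch of enforcing these constraints before inference (the target length of 320,000 samples follows from 20 s at 16 kHz; the helper name is ours, not part of the espnet API):

```python
import numpy as np

TARGET_SR = 16000
TARGET_LEN = 20 * TARGET_SR  # 20 s at 16 kHz = 320,000 samples

def pad_or_truncate(speech: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a mono waveform to exactly 20 s."""
    if len(speech) >= TARGET_LEN:
        return speech[:TARGET_LEN]
    return np.pad(speech, (0, TARGET_LEN - len(speech)))
```

If your audio is not already 16 kHz, resample it first (e.g. with `librosa.resample`) before padding or truncating.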

To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat it as a special token. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.
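A helper like the following (hypothetical; not part of the espnet API) builds such a slash-delimited string from a list of phones, e.g. when preparing the phone transcription prompt for P2G:

```python
def wrap_phones(phones: list[str]) -> str:
    """Enclose each phone in slashes, matching POWSM's special-token format."""
    return "".join(f"/{p}/" for p in phones)

print(wrap_phones(["pʰ", "ɔ", "s", "ə", "m"]))  # -> /pʰ//ɔ//s//ə//m/
```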


> [!NOTE]
> Jan 2026: We release a retrained version with improved ASR text normalization. 
> It is located in the subfolder `textnorm_retrained` and has the same structure as the main model.
> Additional details are provided in the updated arXiv appendix.

```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf  # or librosa

task = "<pr>"
s2t = Speech2Text.from_pretrained(
    "espnet/powsm",
    device="cuda",
    lang_sym="<eng>",  # ISO 639-3; set to <unk> for unseen languages
    task_sym=task,     # <pr>, <asr>, <g2p>, <p2g>
)

speech, rate = sf.read("sample.wav")
prompt = "<na>"  # G2P: set to the ASR transcript; P2G: set to the phone transcription with slashes
pred = s2t(speech, text_prev=prompt)[0][0]

# post-processing for a cleaner output format
pred = pred.split("<notimestamps>")[1].strip()
if task in ("<pr>", "<g2p>"):
    pred = pred.replace("/", "")  # strip the phone delimiters
print(pred)
```
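The post-processing step above can be factored into a reusable helper (a sketch; the raw string in the usage example is synthetic, constructed from the token names used in this card):

```python
def postprocess(raw: str, task: str) -> str:
    """Drop everything up to <notimestamps>; strip slashes for phone outputs."""
    text = raw.split("<notimestamps>")[1].strip()
    if task in ("<pr>", "<g2p>"):
        text = text.replace("/", "")
    return text

print(postprocess("<eng><pr><notimestamps> /pʰ//ɔ//s//ə//m/", "<pr>"))  # -> pʰɔsəm
```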

#### Other tasks

See `force_align.py` in [ESPnet recipe](https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1) to try out CTC forced alignment with POWSM's encoder!

Language identification (LID) is learned implicitly during training; you can run it with the script below:

```python
from espnet2.bin.s2t_inference_language import Speech2Language
import soundfile as sf  # or librosa

s2l = Speech2Language.from_pretrained(
    "espnet/powsm",
    device="cuda",
    nbest=1,                 # number of candidate languages to return
    first_lang_sym="<afr>",  # fixed; defined in the vocab list
    last_lang_sym="<zul>",   # fixed; defined in the vocab list
)

speech, rate = sf.read("sample.wav")
pred = s2l(speech)[0]  # a list of (language, probability) pairs
print(pred)
```

### Citations

```bibtex
@article{powsm,
      title={POWSM: A Phonetic Open Whisper-Style Speech Foundation Model}, 
      author={Chin-Jou Li and Kalvin Chang and Shikhar Bharadwaj and Eunjung Yeo and Kwanghee Choi and Jian Zhu and David Mortensen and Shinji Watanabe},
      year={2025},
      eprint={2510.24992},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.24992}, 
}
```