+---
+datasets:
+- anyspeech/ipapack_plus_train_1
+- anyspeech/ipapack_plus_train_2
+- anyspeech/ipapack_plus_train_3
+- anyspeech/ipapack_plus_train_4
+language: multilingual
+library_name: espnet
+license: cc-by-4.0
+metrics:
+  - pfer
+  - cer
+tags:
+  - espnet
+  - audio
+  - phone-recognition
+  - automatic-speech-recognition
+  - grapheme-to-phoneme
+  - phoneme-to-grapheme
+pipeline_tag: automatic-speech-recognition
+---
+🐁 POWSM-CTC is a variant of [POWSM](https://arxiv.org/abs/2510.24992), the first phonetic foundation model that can perform four phone-related tasks: phone recognition (PR), automatic speech recognition (ASR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G).
+Its multi-task encoder-CTC structure is based on [OWSM-CTC](https://aclanthology.org/2024.acl-long.549/), and it is trained on [IPAPack++](https://huggingface.co/anyspeech), the same dataset as POWSM.
+POWSM-CTC was introduced together with our paper [PRiSM](https://arxiv.org/abs/2601.14046), the first open-source benchmark for phone recognition systems.
+Decoding is much faster than with encoder-decoder models, with similar or better phone recognition performance on unseen domains.
+To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
+```
+torch
+espnet
+espnet_model_zoo
+```
+**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm_ctc/s2t1
+### Example script for PR/ASR/G2P/P2G
+Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please ensure the input speech is sampled at 16 kHz, and pad or truncate it to 20 s.
+To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat each one as a special token. For example, /pʰɔsəm/ is tokenized as /pʰ//ɔ//s//ə//m/.
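The input requirement above (16 kHz, exactly 20 s) can be met with a small preprocessing step. The following is a minimal sketch, not part of the ESPnet API: `pad_or_truncate` is a hypothetical helper name, and it assumes the waveform is already mono, already resampled to 16 kHz, and that trailing zero-padding is acceptable.

```python
import numpy as np

SAMPLE_RATE = 16_000                    # Hz, as stated in this model card
DURATION_S = 20                         # seconds, as stated in this model card
TARGET_LEN = SAMPLE_RATE * DURATION_S   # 320000 samples


def pad_or_truncate(wave, target_len: int = TARGET_LEN) -> np.ndarray:
    """Return a 1-D float32 array with exactly `target_len` samples.

    Hypothetical helper: assumes `wave` is mono audio already sampled
    at 16 kHz. Longer inputs are cut at the end; shorter inputs are
    zero-padded at the end (an assumption, not a documented requirement).
    """
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim != 1:
        raise ValueError("expected a mono, 1-D waveform")
    if len(wave) >= target_len:
        return wave[:target_len]
    return np.pad(wave, (0, target_len - len(wave)))
```

Since the decoding call in the script below accepts 1-D arrays as well as file paths, the returned array can be placed directly in the input list.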
+```python
+from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
+s2t = Speech2TextGreedySearch.from_pretrained(
+    "espnet/powsm_ctc",
+    device="cuda",
+    use_flash_attn=True,
+    lang_sym='<unk>',
+    task_sym='<pr>',
+)
+res = s2t.batch_decode(
+    ["audio1.wav", "audio2.wav"], # a list of audios (path or 1-D array/tensor)
+    batch_size=16,
+)   # res is a list of str
+```
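For downstream processing it can be useful to split a slash-delimited phone string back into individual phones. The sketch below is illustrative only: `split_phones` is a hypothetical helper, and it assumes the decoded strings keep the `/phone/` format described above. The exact output format of `batch_decode` should be checked before relying on it.

```python
import re

# Each phone is assumed to be enclosed in a pair of slashes, e.g. "/pʰ//ɔ/".
PHONE_PATTERN = re.compile(r"/([^/]+)/")


def split_phones(text: str) -> list[str]:
    """Return the phones found between slashes; other text is ignored."""
    return PHONE_PATTERN.findall(text)


phones = split_phones("/pʰ//ɔ//s//ə//m/")
print(phones)           # ['pʰ', 'ɔ', 's', 'ə', 'm']
print("".join(phones))  # pʰɔsəm
```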
+### Citations
+```BibTex
+@article{prism,
+      title={PRiSM: Benchmarking Phone Realization in Speech Models},
+      author={Shikhar Bharadwaj and Chin-Jou Li and Yoonjae Kim and Kwanghee Choi and Eunjung Yeo and Ryan Soh-Eun Shim and Hanyu Zhou and Brendon Boldt and Karen Rosero Jacome and Kalvin Chang and Darsh Agrawal and Keer Xu and Chao-Han Huck Yang and Jian Zhu and Shinji Watanabe and David R. Mortensen},
+      year={2026},
+      eprint={2601.14046},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2601.14046},
+}