---
datasets:
- anyspeech/ipapack_plus_train_1
- anyspeech/ipapack_plus_train_2
- anyspeech/ipapack_plus_train_3
- anyspeech/ipapack_plus_train_4
language: multilingual
library_name: espnet
license: cc-by-4.0
metrics:
- pfer
- cer
tags:
- espnet
- audio
- phone-recognition
- automatic-speech-recognition
- grapheme-to-phoneme
- phoneme-to-grapheme
pipeline_tag: automatic-speech-recognition
---
# 🐁POWSM-CTC
POWSM-CTC is a variant of POWSM, the first phonetic foundation model capable of performing four phone-related tasks: phone recognition (PR), ASR, grapheme-to-phoneme (G2P), and phoneme-to-grapheme (P2G). Its multi-task encoder-CTC architecture is based on OWSM-CTC, and it is trained on IPAPack++, the same dataset as POWSM.

This model is proposed together with our paper PRiSM, the first open-source benchmark for phone recognition systems. Its decoding is much faster than that of encoder-decoder models, with similar or better PR performance on unseen domains.
Check out POWSM-CTC's predecessor: 🐁POWSM
To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

```
torch
espnet
espnet_model_zoo
```
The recipe can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/powsm_ctc/s2t1
## Example script for PR/ASR/G2P/P2G
Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please ensure the input speech is 16 kHz and pad or truncate it to 20 s. To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat it as a special token. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.
```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/powsm_ctc",
    device="cuda",
    use_flash_attn=True,
    lang_sym='<unk>',
    task_sym='<pr>',
)

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav"],  # a list of audios (path or 1-D array/tensor)
    batch_size=16,
)  # res is a list of str
```
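The pad-or-truncate preprocessing described above can be sketched as follows. This is a minimal NumPy sketch for preparing your own 1-D waveforms before passing them to `batch_decode`; the helper name `pad_or_trim` is ours and not part of ESPnet.

```python
import numpy as np

TARGET_SR = 16000             # POWSM-CTC expects 16 kHz audio
TARGET_LEN = 20 * TARGET_SR   # fixed 20 s input window (320000 samples)

def pad_or_trim(wav: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a 1-D 16 kHz waveform to exactly 20 s."""
    if len(wav) >= TARGET_LEN:
        return wav[:TARGET_LEN]
    return np.pad(wav, (0, TARGET_LEN - len(wav)))

# Example: a 5 s clip is zero-padded up to the 20 s window.
clip = np.random.randn(5 * TARGET_SR).astype(np.float32)
print(pad_or_trim(clip).shape)  # (320000,)
```

Audio loaded at a different sampling rate must be resampled to 16 kHz first; this sketch assumes the waveform is already at 16 kHz.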
## Citations
```bibtex
@article{prism,
  title={PRiSM: Benchmarking Phone Realization in Speech Models},
  author={Shikhar Bharadwaj and Chin-Jou Li and Yoonjae Kim and Kwanghee Choi and Eunjung Yeo and Ryan Soh-Eun Shim and Hanyu Zhou and Brendon Boldt and Karen Rosero Jacome and Kalvin Chang and Darsh Agrawal and Keer Xu and Chao-Han Huck Yang and Jian Zhu and Shinji Watanabe and David R. Mortensen},
  year={2026},
  eprint={2601.14046},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.14046},
}
```