---
datasets:
- anyspeech/ipapack_plus_train_1
- anyspeech/ipapack_plus_train_2
- anyspeech/ipapack_plus_train_3
- anyspeech/ipapack_plus_train_4
language: multilingual
library_name: espnet
license: cc-by-4.0
metrics:
- pfer
- cer
tags:
- espnet
- audio
- phone-recognition
- automatic-speech-recognition
- grapheme-to-phoneme
- phoneme-to-grapheme
pipeline_tag: automatic-speech-recognition
---
### 🐁POWSM-CTC
<p align="left">
<a href="https://arxiv.org/abs/2601.14046"><img src="https://img.shields.io/badge/Paper-2601.14046-red.svg?logo=arxiv&logoColor=red"/></a>
<a href="https://huggingface.co/espnet/powsm_ctc"><img src="https://img.shields.io/badge/Model-powsm_ctc-yellow.svg?logo=huggingface&logoColor=yellow"/></a>
<a href="https://github.com/changelinglab/prism"><img src="https://img.shields.io/badge/Benchmark-PRiSM-green.svg?logo=github&logoColor=black"/></a>
<a href="https://github.com/espnet/espnet/tree/master/egs2/powsm_ctc/s2t1"><img src="https://img.shields.io/badge/Recipe-powsm_ctc-blue.svg?logo=github&logoColor=black"/></a>
</p>
POWSM-CTC is a variant of [POWSM](https://huggingface.co/espnet/powsm), the first phonetic foundation model capable of performing four phone-related tasks.
Its multi-task encoder-CTC architecture is based on [OWSM-CTC](https://aclanthology.org/2024.acl-long.549/), and it is trained on [IPAPack++](https://huggingface.co/anyspeech), the same dataset as POWSM.
This model accompanies our paper [PRiSM](https://arxiv.org/abs/2601.14046), the first open-source benchmark for phone recognition systems.
Its decoding is much faster than that of encoder-decoder models, with similar or better phone recognition (PR) performance on unseen domains.
> [!TIP]
> Check out POWSM-CTC's predecessor: [🐁POWSM](https://huggingface.co/espnet/powsm)
To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
torch
espnet
espnet_model_zoo
```
**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm_ctc/s2t1
### Example script for PR/ASR/G2P/P2G
Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please make sure the input speech is sampled at 16 kHz, and pad or truncate it to 20 s.
To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat these as special tokens. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.
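As a minimal sketch of the padding/truncation step described above (this helper is not part of the official recipe; it assumes the waveform is already a 1-D array sampled at 16 kHz, e.g. loaded with `librosa` or `torchaudio`):

```python
import numpy as np

SAMPLE_RATE = 16_000                      # model expects 16 kHz audio
MAX_SECONDS = 20                          # fixed 20 s input window
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS   # 320,000 samples

def pad_or_truncate(speech: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a 1-D 16 kHz waveform to exactly 20 s."""
    if len(speech) >= MAX_SAMPLES:
        return speech[:MAX_SAMPLES]
    return np.pad(speech, (0, MAX_SAMPLES - len(speech)))
```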
```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/powsm_ctc",
    device="cuda",
    use_flash_attn=True,
    lang_sym="<unk>",
    task_sym="<pr>",
)

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav"],  # a list of audios (paths or 1-D arrays/tensors)
    batch_size=16,
)  # res is a list of str
```
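Since PR hypotheses come back as slash-delimited phone strings (as described above), a small helper can split them into a phone list for downstream scoring. This is an illustrative sketch, not part of the ESPnet API:

```python
import re

def split_phones(hyp: str) -> list[str]:
    """Split a slash-delimited phone string like '/pʰ//ɔ//s//ə//m/'
    into its individual phones."""
    return re.findall(r"/([^/]+)/", hyp)

# split_phones("/pʰ//ɔ//s//ə//m/") -> ["pʰ", "ɔ", "s", "ə", "m"]
```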
### Citations
```bibtex
@article{prism,
  title={PRiSM: Benchmarking Phone Realization in Speech Models},
  author={Shikhar Bharadwaj and Chin-Jou Li and Yoonjae Kim and Kwanghee Choi and Eunjung Yeo and Ryan Soh-Eun Shim and Hanyu Zhou and Brendon Boldt and Karen Rosero Jacome and Kalvin Chang and Darsh Agrawal and Keer Xu and Chao-Han Huck Yang and Jian Zhu and Shinji Watanabe and David R. Mortensen},
  year={2026},
  eprint={2601.14046},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.14046},
}