---
datasets:
- anyspeech/ipapack_plus_train_1
- anyspeech/ipapack_plus_train_2
- anyspeech/ipapack_plus_train_3
- anyspeech/ipapack_plus_train_4
language: multilingual
library_name: espnet
license: cc-by-4.0
metrics:
- pfer
- cer
tags:
- espnet
- audio
- phone-recognition
- automatic-speech-recognition
- grapheme-to-phoneme
- phoneme-to-grapheme
pipeline_tag: automatic-speech-recognition
---
# 🐁POWSM-CTC
POWSM-CTC is a variant of POWSM, the first phonetic foundation model capable of performing four phone-related tasks: phone recognition (PR), ASR, grapheme-to-phoneme (G2P), and phoneme-to-grapheme (P2G). Its multi-task encoder-CTC architecture is based on OWSM-CTC, and it is trained on IPAPack++, the same dataset as POWSM.

This model is proposed together with our paper PRiSM, the first open-source benchmark for phone recognition systems. Its decoding is much faster than that of encoder-decoder models, with similar or better PR performance on unseen domains.
Check out POWSM-CTC's predecessor: 🐁POWSM
To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

```
torch
espnet
espnet_model_zoo
```
The recipe can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/powsm_ctc/s2t1
## Example script for PR/ASR/G2P/P2G
Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please ensure the input speech is 16 kHz and pad or truncate it to 20 s. To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat it as a special token. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.
```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/powsm_ctc",
    device="cuda",
    use_flash_attn=True,
    lang_sym='<unk>',
    task_sym='<pr>',
)

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav"],  # a list of audios (path or 1-D array/tensor)
    batch_size=16,
)  # res is a list of str
```
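The pad-or-truncate preprocessing described above can be sketched as follows. This is a minimal NumPy sketch for preparing your own 1-D waveforms before passing them to `batch_decode`; the helper name `pad_or_trim` is ours and not part of ESPnet.

```python
import numpy as np

TARGET_SR = 16000             # POWSM-CTC expects 16 kHz audio
TARGET_LEN = 20 * TARGET_SR   # fixed 20 s input window (320000 samples)

def pad_or_trim(wav: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a 1-D 16 kHz waveform to exactly 20 s."""
    if len(wav) >= TARGET_LEN:
        return wav[:TARGET_LEN]
    return np.pad(wav, (0, TARGET_LEN - len(wav)))

# Example: a 5 s clip is zero-padded up to the 20 s window.
clip = np.random.randn(5 * TARGET_SR).astype(np.float32)
print(pad_or_trim(clip).shape)  # (320000,)
```

Audio loaded at a different sampling rate must be resampled to 16 kHz first; this sketch assumes the waveform is already at 16 kHz.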
## Citations
```bibtex
@article{prism,
  title={PRiSM: Benchmarking Phone Realization in Speech Models},
  author={Shikhar Bharadwaj and Chin-Jou Li and Yoonjae Kim and Kwanghee Choi and Eunjung Yeo and Ryan Soh-Eun Shim and Hanyu Zhou and Brendon Boldt and Karen Rosero Jacome and Kalvin Chang and Darsh Agrawal and Keer Xu and Chao-Han Huck Yang and Jian Zhu and Shinji Watanabe and David R. Mortensen},
  year={2026},
  eprint={2601.14046},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.14046},
}
```