---
datasets:
- anyspeech/ipapack_plus_train_1
- anyspeech/ipapack_plus_train_2
- anyspeech/ipapack_plus_train_3
- anyspeech/ipapack_plus_train_4
language: multilingual
library_name: espnet
license: cc-by-4.0
metrics:
- pfer
- cer
tags:
- espnet
- audio
- phone-recognition
- automatic-speech-recognition
- grapheme-to-phoneme
- phoneme-to-grapheme
pipeline_tag: automatic-speech-recognition
---

### 🐁POWSM-CTC
POWSM-CTC is a variant of [POWSM](https://huggingface.co/espnet/powsm), the first phonetic foundation model that can perform four phone-related tasks. Its multi-task encoder-CTC architecture is based on [OWSM-CTC](https://aclanthology.org/2024.acl-long.549/), and it is trained on [IPAPack++](https://huggingface.co/anyspeech), the same dataset used for POWSM. This model is released together with our paper [PRiSM](https://arxiv.org/abs/2601.14046), the first open-source benchmark for phone recognition systems. Its decoding is much faster than that of encoder-decoder models, with similar or better phone recognition (PR) performance on unseen domains.

> [!TIP]
> Check out POWSM-CTC's predecessor: [🐁POWSM](https://huggingface.co/espnet/powsm)

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

```
torch
espnet
espnet_model_zoo
```

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm_ctc/s2t1

### Example script for PR/ASR/G2P/P2G

Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please ensure the input speech is sampled at 16 kHz, and pad or truncate it to 20 s. To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat it as a special token. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.

```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/powsm_ctc",
    device="cuda",
    use_flash_attn=True,
    lang_sym='
```
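The 16 kHz / 20 s input constraint above can be sketched as a small preprocessing helper. This is an illustrative snippet, not part of the ESPnet API: the helper name `pad_or_truncate` and the constants are our own, and the waveform is assumed to already be mono and resampled to 16 kHz (e.g. with `librosa` or `torchaudio`).

```python
import numpy as np

SAMPLE_RATE = 16000          # POWSM-CTC expects 16 kHz input
MAX_SECONDS = 20             # fixed training duration
TARGET_LEN = SAMPLE_RATE * MAX_SECONDS  # 320000 samples

def pad_or_truncate(speech: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a 1-D 16 kHz waveform to exactly 20 s."""
    if len(speech) >= TARGET_LEN:
        return speech[:TARGET_LEN]
    return np.pad(speech, (0, TARGET_LEN - len(speech)))

# A 5-second clip is zero-padded up to the fixed 20 s window:
clip = np.zeros(SAMPLE_RATE * 5, dtype=np.float32)
assert pad_or_truncate(clip).shape == (TARGET_LEN,)
```

The padded array can then be passed directly to the `s2t` callable shown above.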