---
datasets:
- anyspeech/ipapack_plus_train_1
- anyspeech/ipapack_plus_train_2
- anyspeech/ipapack_plus_train_3
- anyspeech/ipapack_plus_train_4
language: multilingual
library_name: espnet
license: cc-by-4.0
metrics:
- pfer
- cer
tags:
- espnet
- audio
- phone-recognition
- automatic-speech-recognition
- grapheme-to-phoneme
- phoneme-to-grapheme
pipeline_tag: automatic-speech-recognition
---
| |
### 🐁POWSM
|
|
<p align="left">
  <a href="https://arxiv.org/abs/2510.24992"><img src="https://img.shields.io/badge/Paper-2510.24992-red.svg?logo=arxiv&logoColor=red"/></a>
  <a href="https://huggingface.co/espnet/powsm"><img src="https://img.shields.io/badge/Model-powsm-yellow.svg?logo=huggingface&logoColor=yellow"/></a>
  <a href="https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1"><img src="https://img.shields.io/badge/Recipe-powsm-blue.svg?logo=github&logoColor=black"/></a>
</p>
|
|
POWSM is the first phonetic foundation model that can perform four phone-related tasks: Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme conversion (P2G).
|
|
Based on the [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained on [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.
|
|
> [!TIP]
> Check out our new model [🐁POWSM-CTC](https://huggingface.co/espnet/powsm_ctc), an encoder-only variant based on the OWSM-CTC architecture,
> and [💎PRiSM](https://arxiv.org/abs/2601.14046): Benchmarking Phone Realization in Speech Models!
|
|
|
|
To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
torch
espnet
espnet_model_zoo
```
|
|
**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1


### Example script for PR/ASR/G2P/P2G


Our models are trained on 16kHz audio with a fixed duration of 20s. When using the pre-trained model, please ensure the input speech is sampled at 16kHz, and pad or truncate it to 20s.
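A minimal preprocessing sketch is shown below. The `pad_or_trim` helper is illustrative, not part of the ESPnet API; if your audio is not already 16kHz, resample it first (e.g., with `librosa.resample`):

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz, expected by POWSM
MAX_SECONDS = 20     # fixed training duration


def pad_or_trim(speech: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a mono 16 kHz waveform to exactly 20 s."""
    target_len = SAMPLE_RATE * MAX_SECONDS
    if len(speech) >= target_len:
        return speech[:target_len]
    return np.pad(speech, (0, target_len - len(speech)))
```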
|
|
To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat these slash-delimited entries as special tokens. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.
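As an illustration, slash-delimited output can be split back into individual phones with a small regex (the `split_phones` helper is ours, not part of the model's tokenizer):

```python
import re


def split_phones(seq: str) -> list[str]:
    """Split a slash-delimited phone string into individual phones."""
    return re.findall(r"/([^/]+)/", seq)


print(split_phones("/pʰ//ɔ//s//ə//m/"))  # ['pʰ', 'ɔ', 's', 'ə', 'm']
```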
|
|
|
|
> [!NOTE]
> Jan 2026: We release a retrained version with improved ASR text normalization.
> It is located in the subfolder `textnorm_retrained` and has the same structure as the main model.
> Additional details are provided in the updated arXiv appendix.
```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf  # or librosa


task = "<pr>"
s2t = Speech2Text.from_pretrained(
    "espnet/powsm",
    device="cuda",
    lang_sym="<eng>",  # ISO 639-3; set to <unk> for unseen languages
    task_sym=task,  # <pr>, <asr>, <g2p>, <p2g>
)

speech, rate = sf.read("sample.wav")
prompt = "<na>"  # G2P: set to the ASR transcript; P2G: set to the phone transcription with slashes
pred = s2t(speech, text_prev=prompt)[0][0]

# post-processing for a cleaner format
pred = pred.split("<notimestamps>")[1].strip()
if task in ("<pr>", "<g2p>"):
    pred = pred.replace("/", "")
print(pred)
```
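The post-processing step can also be packaged as a reusable helper (the function name `postprocess` is ours); it assumes the decoder output contains a `<notimestamps>` token followed by the hypothesis, as in the snippet above:

```python
def postprocess(pred: str, task: str) -> str:
    """Drop everything up to <notimestamps>; strip slashes for phone outputs."""
    pred = pred.split("<notimestamps>")[1].strip()
    if task in ("<pr>", "<g2p>"):
        pred = pred.replace("/", "")
    return pred


print(postprocess("<eng><pr><notimestamps> /pʰ//ɔ//s//ə//m/", "<pr>"))  # pʰɔsəm
```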
| |
#### Other tasks

See `force_align.py` in the [ESPnet recipe](https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1) to try out CTC forced alignment with POWSM's encoder!
|
|
Language identification (LID) is learned implicitly during training; you can run it with the script below:
|
|
```python
from espnet2.bin.s2t_inference_language import Speech2Language
import soundfile as sf  # or librosa

s2l = Speech2Language.from_pretrained(
    "espnet/powsm",
    device="cuda",
    nbest=1,  # number of candidate languages to return
    first_lang_sym="<afr>",  # fixed; defined in the vocab list
    last_lang_sym="<zul>",  # fixed; defined in the vocab list
)

speech, rate = sf.read("sample.wav")
pred = s2l(speech)[0]  # a list of (language, probability) pairs
print(pred)
```
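Assuming the result is a list of (language token, probability) pairs as described above, picking the most probable language is straightforward (the `top_language` helper is illustrative, not part of the ESPnet API):

```python
def top_language(nbest: list[tuple[str, float]]) -> str:
    """Return the ISO 639-3 code of the most probable language."""
    lang, _prob = max(nbest, key=lambda pair: pair[1])
    return lang.strip("<>")


print(top_language([("<deu>", 0.04), ("<eng>", 0.93)]))  # eng
```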
|
|
### Citations
|
|
```bibtex
@article{powsm,
  title={POWSM: A Phonetic Open Whisper-Style Speech Foundation Model},
  author={Chin-Jou Li and Kalvin Chang and Shikhar Bharadwaj and Eunjung Yeo and Kwanghee Choi and Jian Zhu and David Mortensen and Shinji Watanabe},
  year={2025},
  eprint={2510.24992},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.24992},
}
```