Update README.md
README.md CHANGED
@@ -20,12 +20,25 @@ tags:
pipeline_tag: automatic-speech-recognition
---

-🐁POWSM
+### 🐁POWSM
+
+<p align="left">
+<a href="https://arxiv.org/abs/2510.24992"><img src="https://img.shields.io/badge/Paper-2510.24992-red.svg?logo=arxiv&logoColor=red"/></a>
+<a href="https://huggingface.co/espnet/powsm"><img src="https://img.shields.io/badge/Model-powsm-yellow.svg?logo=huggingface&logoColor=yellow"/></a>
+<a href="https://github.com/espnet/egs2/powsm/s2t1"><img src="https://img.shields.io/badge/Recipe-powsm-blue.svg?logo=github&logoColor=black"/></a>
+</p>
+
+POWSM is the first phonetic foundation model that can perform four phone-related tasks:
Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme
conversion (P2G).

Based on [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained with [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.

+> [!TIP]
+> Check out our new model: [🐁POWSM-CTC](https://huggingface.co/espnet/powsm_ctc), an encoder-only variant based on OWSM-CTC structure,
+> and [💎PRiSM](https://arxiv.org/abs/2601.14046): Benchmarking Phone Realization in Speech Models!
+
+
To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
torch
@@ -35,17 +48,18 @@ espnet_model_zoo

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

-> [!NOTE]
-> Jan 2026: We release a retrained version with improved ASR text normalization.
-> It is located in the subfolder `textnorm_retrained` and has the same structure as the main model.
-> Additional details are provided in the updated arXiv appendix.
-
### Example script for PR/ASR/G2P/P2G

Our models are trained on 16kHz audio with a fixed duration of 20s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 20s.

To distinguish phone entries from BPE tokens that share the same Unicode, we enclose every phone in slashes and treat them as special tokens. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.

+
+> [!NOTE]
+> Jan 2026: We release a retrained version with improved ASR text normalization.
+> It is located in the subfolder `textnorm_retrained` and has the same structure as the main model.
+> Additional details are provided in the updated arXiv appendix.
+
```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf  # or librosa
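
# A minimal preprocessing sketch (not part of the released example script):
# it assumes librosa for resampling and applies the 16 kHz / 20 s input
# requirement described above. The file name is a placeholder.
import librosa
import numpy as np

speech, rate = sf.read("example.wav")  # placeholder path
if speech.ndim > 1:
    speech = speech.mean(axis=1)  # downmix multi-channel audio to mono
if rate != 16000:
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000

# Pad with zeros or truncate to a fixed 20 s, matching the training setup.
target_len = 20 * rate
if len(speech) < target_len:
    speech = np.pad(speech, (0, target_len - len(speech)))
else:
    speech = speech[:target_len]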
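
# A hedged inference sketch, assuming the OWSM-style Speech2Text interface.
# The model tag, decoding options, and the language/task tokens below
# ("<eng>", "<pr>") are illustrative assumptions; see the recipe linked
# above for the exact values.
s2t = Speech2Text.from_pretrained(
    "espnet/powsm",    # assumed model tag
    device="cpu",      # or "cuda"
    beam_size=5,
    lang_sym="<eng>",  # assumed language token
    task_sym="<pr>",   # assumed task token for Phone Recognition
)

# Decode the 20 s, 16 kHz waveform; the decoded text is the first field of
# the best hypothesis in OWSM-style inference.
text, *_ = s2t(speech)[0]
print(text)

# For PR output, every phone is wrapped in slashes (e.g. /pʰ//ɔ//s//ə//m/),
# so individual phones can be recovered with a simple pattern match.
import re
phones = re.findall(r"/([^/]+)/", text)
print(phones)
```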