pipeline_tag: automatic-speech-recognition
---

🐁 POWSM is the first phonetic foundation model that can perform four phone-related tasks: Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme conversion (P2G).

Based on [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained with [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

```
torch
espnet
espnet_model_zoo
```
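Assuming a standard Python environment with pip available, the packages above can be installed in one step:

```shell
pip install torch espnet espnet_model_zoo
```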

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

### Example script

Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please ensure the input speech is 16 kHz, and pad or truncate it to 20 s.
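As a sketch, the padding/truncation step could be done with NumPy before calling the model (the `pad_or_truncate` helper below is illustrative, not part of the ESPnet API):

```python
import numpy as np

TARGET_SR = 16000
TARGET_LEN = 20 * TARGET_SR  # 20 s at 16 kHz = 320000 samples

def pad_or_truncate(speech: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a 1-D waveform to exactly 20 s."""
    if len(speech) >= TARGET_LEN:
        return speech[:TARGET_LEN]
    return np.pad(speech, (0, TARGET_LEN - len(speech)))
```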

To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat them as special tokens. For example, /pʰɔsəm/ is tokenized as /pʰ//ɔ//s//ə//m/.
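For illustration, this slash-delimited format can be converted to and from a plain phone list with two small helpers (`split_phones` and `join_phones` are hypothetical, not part of the POWSM API):

```python
import re

def split_phones(s: str) -> list[str]:
    """'/pʰ//ɔ//s//ə//m/' -> ['pʰ', 'ɔ', 's', 'ə', 'm']"""
    return re.findall(r"/([^/]+)/", s)

def join_phones(phones: list[str]) -> str:
    """['pʰ', 'ɔ'] -> '/pʰ//ɔ/'"""
    return "".join(f"/{p}/" for p in phones)
```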

```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf

task = '<pr>'
s2t = Speech2Text.from_pretrained(
    "espnet/powsm",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',  # ISO 639-3; set to <unk> for unseen languages
    task_sym=task,  # <pr>, <asr>, <g2p>, <p2g>
)

# input must already be 16 kHz; use librosa.load("sample.wav", sr=16000) to resample
speech, rate = sf.read("sample.wav")
prompt = "<na>"  # G2P: set to the ASR transcript; P2G: set to the slash-delimited phone transcription
pred = s2t(speech, text_prev=prompt)[0][0]
if task in ('<pr>', '<g2p>'):
    pred = pred.replace("/", "")  # strip the phone delimiters
print(pred)
```

See `force_align.py` in the [ESPnet recipe](https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1) to try out CTC forced alignment with POWSM's encoder!

### Citations

```bibtex
@article{powsm
}
```