cjli committed on
Commit 21ffa41 · verified · 1 Parent(s): 5ece935

Update README.md

Files changed (1):
  1. README.md +20 -6
README.md CHANGED
@@ -20,12 +20,25 @@ tags:
 pipeline_tag: automatic-speech-recognition
 ---
 
-🐁POWSM is the first phonetic foundation model that can perform four phone-related tasks:
+### 🐁POWSM
+
+<p align="left">
+<a href="https://arxiv.org/abs/2510.24992"><img src="https://img.shields.io/badge/Paper-2510.24992-red.svg?logo=arxiv&logoColor=red"/></a>
+<a href="https://huggingface.co/espnet/powsm"><img src="https://img.shields.io/badge/Model-powsm-yellow.svg?logo=huggingface&logoColor=yellow"/></a>
+<a href="https://github.com/espnet/egs2/powsm/s2t1"><img src="https://img.shields.io/badge/Recipe-powsm-blue.svg?logo=github&logoColor=black"/></a>
+</p>
+
+POWSM is the first phonetic foundation model that can perform four phone-related tasks:
 Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme
 conversion (P2G).
 
 Based on [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained with [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.
 
+> [!TIP]
+> Check out our new model: [🐁POWSM-CTC](https://huggingface.co/espnet/powsm_ctc), an encoder-only variant based on the OWSM-CTC structure,
+> and [💎PRiSM](https://arxiv.org/abs/2601.14046): Benchmarking Phone Realization in Speech Models!
+
 To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
 ```
 torch
@@ -35,17 +48,18 @@ espnet_model_zoo
 
 **The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1
 
-> [!NOTE]
-> Jan 2026: We release a retrained version with improved ASR text normalization.
-> It is located in the subfolder `textnorm_retrained` and has the same structure as the main model.
-> Additional details are provided in the updated arXiv appendix.
-
 ### Example script for PR/ASR/G2P/P2G
 
 Our models are trained on 16kHz audio with a fixed duration of 20s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 20s.
 
 To distinguish phone entries from BPE tokens that share the same Unicode, we enclose every phone in slashes and treat them as special tokens. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.
 
+> [!NOTE]
+> Jan 2026: We release a retrained version with improved ASR text normalization.
+> It is located in the subfolder `textnorm_retrained` and has the same structure as the main model.
+> Additional details are provided in the updated arXiv appendix.
+
 ```python
 from espnet2.bin.s2t_inference import Speech2Text
 import soundfile as sf  # or librosa
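The example script is truncated in this diff. A minimal sketch of the input preparation the README describes — 16 kHz audio padded or truncated to a fixed 20 s window, and slash-delimited phone tokens — follows, assuming only what the README states. The helper name `pad_or_truncate` and the commented-out `Speech2Text.from_pretrained` usage are illustrative assumptions, not the repository's exact script:

```python
import numpy as np

SAMPLE_RATE = 16000                      # model expects 16 kHz input
MAX_SAMPLES = SAMPLE_RATE * 20           # fixed 20 s context window

def pad_or_truncate(speech: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a 1-D waveform to exactly 20 s at 16 kHz."""
    if len(speech) >= MAX_SAMPLES:
        return speech[:MAX_SAMPLES]
    return np.pad(speech, (0, MAX_SAMPLES - len(speech)))

# Phone outputs use the slash convention from the README: every phone is a
# slash-enclosed special token, so "pʰɔsəm" is written /pʰ//ɔ//s//ə//m/.
phones = ["pʰ", "ɔ", "s", "ə", "m"]
tokenized = "".join(f"/{p}/" for p in phones)   # "/pʰ//ɔ//s//ə//m/"

# Hypothetical end-to-end usage (not run here; audio must be 16 kHz):
# speech, rate = sf.read("example.wav")
# s2t = Speech2Text.from_pretrained("espnet/powsm")
# result = s2t(pad_or_truncate(speech))
```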