Update README.md
README.md CHANGED
@@ -20,12 +20,25 @@ tags:
pipeline_tag: automatic-speech-recognition
---

-🐁POWSM
+### 🐁POWSM
+
+<p align="left">
+<a href="https://arxiv.org/abs/2510.24992"><img src="https://img.shields.io/badge/Paper-2510.24992-red.svg?logo=arxiv&logoColor=red"/></a>
+<a href="https://huggingface.co/espnet/powsm"><img src="https://img.shields.io/badge/Model-powsm-yellow.svg?logo=huggingface&logoColor=yellow"/></a>
+<a href="https://github.com/espnet/egs2/powsm/s2t1"><img src="https://img.shields.io/badge/Recipe-powsm-blue.svg?logo=github&logoColor=black"/></a>
+</p>
+
+POWSM is the first phonetic foundation model that can perform four phone-related tasks:
Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme
conversion (P2G).

Based on [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained with [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.

+> [!TIP]
+> Check out our new model: [🐁POWSM-CTC](https://huggingface.co/espnet/powsm_ctc), an encoder-only variant based on OWSM-CTC structure,
+> and [💎PRiSM](https://arxiv.org/abs/2601.14046): Benchmarking Phone Realization in Speech Models!
+
+
To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
torch
@@ -35,17 +48,18 @@ espnet_model_zoo

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

-> [!NOTE]
-> Jan 2026: We release a retrained version with improved ASR text normalization.
-> It is located in the subfolder `textnorm_retrained` and has the same structure as the main model.
-> Additional details are provided in the updated arXiv appendix.
-
### Example script for PR/ASR/G2P/P2G

Our models are trained on 16kHz audio with a fixed duration of 20s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 20s.

To distinguish phone entries from BPE tokens that share the same Unicode, we enclose every phone in slashes and treat them as special tokens. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.

+
+> [!NOTE]
+> Jan 2026: We release a retrained version with improved ASR text normalization.
+> It is located in the subfolder `textnorm_retrained` and has the same structure as the main model.
+> Additional details are provided in the updated arXiv appendix.
+
```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf  # or librosa
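
# A minimal preprocessing sketch (not part of the released example script):
# it assumes librosa for resampling and applies the 16 kHz / 20 s input
# requirement described above. The file name is a placeholder.
import librosa
import numpy as np

speech, rate = sf.read("example.wav")  # placeholder path
if speech.ndim > 1:
    speech = speech.mean(axis=1)  # downmix multi-channel audio to mono
if rate != 16000:
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000

# Pad with zeros or truncate to a fixed 20 s, matching the training setup.
target_len = 20 * rate
if len(speech) < target_len:
    speech = np.pad(speech, (0, target_len - len(speech)))
else:
    speech = speech[:target_len]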
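
# A hedged inference sketch, assuming the OWSM-style Speech2Text interface.
# The model tag, decoding options, and the language/task tokens below
# ("<eng>", "<pr>") are illustrative assumptions; see the recipe linked
# above for the exact values.
s2t = Speech2Text.from_pretrained(
    "espnet/powsm",    # assumed model tag
    device="cpu",      # or "cuda"
    beam_size=5,
    lang_sym="<eng>",  # assumed language token
    task_sym="<pr>",   # assumed task token for Phone Recognition
)

# Decode the 20 s, 16 kHz waveform; the decoded text is the first field of
# the best hypothesis in OWSM-style inference.
text, *_ = s2t(speech)[0]
print(text)

# For PR output, every phone is wrapped in slashes (e.g. /pʰ//ɔ//s//ə//m/),
# so individual phones can be recovered with a simple pattern match.
import re
phones = re.findall(r"/([^/]+)/", text)
print(phones)
```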