pipeline_tag: automatic-speech-recognition
---

🐁 POWSM is the first phonetic foundation model that can perform four phone-related tasks: Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme conversion (P2G).

Based on [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained with [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

```
torch
espnet
espnet_model_zoo
```
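Assuming a standard Python environment with pip available, the packages above can be installed in one step:

```shell
pip install torch espnet espnet_model_zoo
```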

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

### Example script

Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please ensure the input speech is 16 kHz, and pad or truncate it to 20 s.
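As a sketch, the padding/truncation step could be done with NumPy before calling the model (the `pad_or_truncate` helper below is illustrative, not part of the ESPnet API):

```python
import numpy as np

TARGET_SR = 16000
TARGET_LEN = 20 * TARGET_SR  # 20 s at 16 kHz = 320000 samples

def pad_or_truncate(speech: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a 1-D waveform to exactly 20 s."""
    if len(speech) >= TARGET_LEN:
        return speech[:TARGET_LEN]
    return np.pad(speech, (0, TARGET_LEN - len(speech)))
```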

To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat them as special tokens. For example, /pʰɔsəm/ is tokenized as /pʰ//ɔ//s//ə//m/.
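For illustration, this slash-delimited format can be converted to and from a plain phone list with two small helpers (`split_phones` and `join_phones` are hypothetical, not part of the POWSM API):

```python
import re

def split_phones(s: str) -> list[str]:
    """'/pʰ//ɔ//s//ə//m/' -> ['pʰ', 'ɔ', 's', 'ə', 'm']"""
    return re.findall(r"/([^/]+)/", s)

def join_phones(phones: list[str]) -> str:
    """['pʰ', 'ɔ'] -> '/pʰ//ɔ/'"""
    return "".join(f"/{p}/" for p in phones)
```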

```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf

task = '<pr>'
s2t = Speech2Text.from_pretrained(
    "espnet/powsm",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',  # ISO 639-3; set to <unk> for unseen languages
    task_sym=task,  # <pr>, <asr>, <g2p>, <p2g>
)

# input must already be 16 kHz; use librosa.load("sample.wav", sr=16000) to resample
speech, rate = sf.read("sample.wav")
prompt = "<na>"  # G2P: set to the ASR transcript; P2G: set to the slash-delimited phone transcription
pred = s2t(speech, text_prev=prompt)[0][0]
if task in ('<pr>', '<g2p>'):
    pred = pred.replace("/", "")  # strip the phone delimiters
print(pred)
```

See `force_align.py` in the [ESPnet recipe](https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1) to try out CTC forced alignment with POWSM's encoder!

### Citations

```bibtex
@article{powsm
}
```