cjli committed on
Commit b59aa36 · 1 Parent(s): b9127ba

update readme

README.md CHANGED
@@ -20,4 +20,55 @@ tags:
pipeline_tag: automatic-speech-recognition
---

- POWSM
🐁POWSM is the first phonetic foundation model that can perform four phone-related tasks: Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme conversion (P2G).

Based on [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained with [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.
To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

```
torch
espnet
espnet_model_zoo
```

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

### Example script

Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please ensure the input speech is 16 kHz and pad or truncate it to 20 s.
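The padding/truncation step can be sketched with NumPy as below (a minimal sketch; `pad_or_trim` is a hypothetical helper for illustration, not part of the ESPnet API):

```python
import numpy as np

SAMPLE_RATE = 16000                      # POWSM expects 16 kHz input
MAX_SAMPLES = SAMPLE_RATE * 20           # fixed 20 s training duration

def pad_or_trim(speech: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a 1-D waveform to exactly 20 s."""
    if len(speech) >= MAX_SAMPLES:
        return speech[:MAX_SAMPLES]
    return np.pad(speech, (0, MAX_SAMPLES - len(speech)))

short = np.zeros(SAMPLE_RATE * 5)        # 5 s clip  -> zero-padded
long = np.zeros(SAMPLE_RATE * 30)        # 30 s clip -> truncated
assert len(pad_or_trim(short)) == MAX_SAMPLES
assert len(pad_or_trim(long)) == MAX_SAMPLES
```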

To distinguish phone entries from BPE tokens that share the same Unicode, we enclose every phone in slashes and treat them as special tokens. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.
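Converting between a slashed phone string and a plain phone list can be done with the standard `re` module (a sketch for illustration; these helpers are not part of the POWSM tooling):

```python
import re

def phones_to_string(phones):
    """Wrap each phone in slashes: ['pʰ', 'ɔ'] -> '/pʰ//ɔ/'."""
    return "".join(f"/{p}/" for p in phones)

def string_to_phones(s):
    """Recover the phone list from a slashed string."""
    return re.findall(r"/([^/]+)/", s)

slashed = phones_to_string(["pʰ", "ɔ", "s", "ə", "m"])
print(slashed)                    # /pʰ//ɔ//s//ə//m/
print(string_to_phones(slashed))  # ['pʰ', 'ɔ', 's', 'ə', 'm']
```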

```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf  # or librosa

task = '<pr>'  # <pr>, <asr>, <g2p>, <p2g>
s2t = Speech2Text.from_pretrained(
    "espnet/powsm",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',  # ISO 639-3; set to <unk> for unseen languages
    task_sym=task,
)

speech, rate = sf.read("sample.wav")  # must be 16 kHz; resample first if it is not
prompt = "<na>"  # G2P: set to ASR transcript; P2G: set to phone transcription with slashes
pred = s2t(speech, text_prev=prompt)[0][0]
if task in ('<pr>', '<g2p>'):
    pred = pred.replace("/", "")
print(pred)
```

See `force_align.py` in the [ESPnet recipe](https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1) to try out CTC forced alignment with POWSM's encoder!

### Citations

```bibtex
@article{powsm
}
```