cjli committed on
Commit b7fd466 · verified
1 Parent(s): ef28b52

Create README.md

  1. README.md +70 -0
README.md ADDED
@@ -0,0 +1,70 @@
+ ---
+ datasets:
+ - anyspeech/ipapack_plus_train_1
+ - anyspeech/ipapack_plus_train_2
+ - anyspeech/ipapack_plus_train_3
+ - anyspeech/ipapack_plus_train_4
+ language: multilingual
+ library_name: espnet
+ license: cc-by-4.0
+ metrics:
+ - pfer
+ - cer
+ tags:
+ - espnet
+ - audio
+ - phone-recognition
+ - automatic-speech-recognition
+ - grapheme-to-phoneme
+ - phoneme-to-grapheme
+ pipeline_tag: automatic-speech-recognition
+ ---
+
+ 🐁 POWSM-CTC is a variant of [POWSM](https://arxiv.org/abs/2510.24992), the first phonetic foundation model that can perform four phone-related tasks: phone recognition (PR), automatic speech recognition (ASR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G).
+ Its multi-task encoder-CTC architecture is based on [OWSM-CTC](https://aclanthology.org/2024.acl-long.549/), and the model is trained on [IPAPack++](https://huggingface.co/anyspeech), the same dataset used for POWSM.
+
+ POWSM-CTC was introduced alongside our paper [PRiSM](https://arxiv.org/abs/2601.14046), the first open-source benchmark for phone recognition systems.
+ It decodes much faster than encoder-decoder models while achieving similar or better PR performance on unseen domains.
+
+ To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
+ ```
+ torch
+ espnet
+ espnet_model_zoo
+ ```
+
+ **The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm_ctc/s2t1
+
+
+ ### Example script for PR/ASR/G2P/P2G
+ Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please ensure the input speech is sampled at 16 kHz and padded or truncated to 20 s.
+ To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat each one as a special token. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.
+ ```python
+ from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
+
+ # Load the pre-trained model and fix the language and task tokens.
+ s2t = Speech2TextGreedySearch.from_pretrained(
+     "espnet/powsm_ctc",
+     device="cuda",
+     use_flash_attn=True,
+     lang_sym="<unk>",
+     task_sym="<pr>",  # <pr> selects phone recognition (PR)
+ )
+
+ # Decode a batch of inputs.
+ res = s2t.batch_decode(
+     ["audio1.wav", "audio2.wav"],  # a list of audios (path or 1-D array/tensor)
+     batch_size=16,
+ )  # res is a list of str
+ ```
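The fixed-length requirement above can be sketched as a small preprocessing helper. This is a minimal illustration, assuming the audio is already a mono 1-D NumPy array sampled at 16 kHz; the helper name `fix_length` and the choice of trailing zero-padding are assumptions made for this example, not part of the ESPnet recipe.

```python
import numpy as np

SAMPLE_RATE = 16_000                    # Hz, as stated in the README
DURATION_S = 20                         # seconds, as stated in the README
TARGET_LEN = SAMPLE_RATE * DURATION_S   # number of samples in 20 s


def fix_length(speech: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Zero-pad or truncate a 1-D waveform to exactly target_len samples.

    Hypothetical helper: trailing zero-padding is an assumption.
    """
    if speech.ndim != 1:
        raise ValueError("expected a mono 1-D waveform")
    if len(speech) >= target_len:
        return speech[:target_len]      # truncate long inputs
    # Pad short inputs with zeros at the end.
    return np.pad(speech, (0, target_len - len(speech)))


short = fix_length(np.ones(16_000, dtype=np.float32))    # 1 s of audio
long = fix_length(np.ones(400_000, dtype=np.float32))    # 25 s of audio
print(short.shape, long.shape)
```

Both outputs have the same length, so they could be passed as 1-D arrays in the list given to `batch_decode`.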
+
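The slash convention described in this section also makes phone strings straightforward to post-process. As an illustration, the sketch below splits a string such as `/pʰ//ɔ//s//ə//m/` into individual phones with a regular expression. The helper name `split_phones` is hypothetical, and the sketch assumes the string contains only slash-enclosed phone tokens; real model output may contain other special tokens that need separate handling.

```python
import re

# One phone token: a slash, one or more non-slash characters, a closing slash.
PHONE_PATTERN = re.compile(r"/([^/]+)/")


def split_phones(text: str) -> list[str]:
    """Return the phones enclosed in slashes, in order (hypothetical helper)."""
    return PHONE_PATTERN.findall(text)


phones = split_phones("/pʰ//ɔ//s//ə//m/")
print(phones)            # ['pʰ', 'ɔ', 's', 'ə', 'm']
joined = "".join(phones)
print(joined)            # pʰɔsəm
```

Keeping `pʰ` as one item shows why the slashes matter: the aspirated phone spans two Unicode code points but is treated as a single token.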
+ ### Citations
+
+ ```BibTex
+ @article{prism,
+   title={PRiSM: Benchmarking Phone Realization in Speech Models},
+   author={Shikhar Bharadwaj and Chin-Jou Li and Yoonjae Kim and Kwanghee Choi and Eunjung Yeo and Ryan Soh-Eun Shim and Hanyu Zhou and Brendon Boldt and Karen Rosero Jacome and Kalvin Chang and Darsh Agrawal and Keer Xu and Chao-Han Huck Yang and Jian Zhu and Shinji Watanabe and David R. Mortensen},
+   year={2026},
+   eprint={2601.14046},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2601.14046},
+ }
+ ```