Add pipeline tag and link to paper
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,25 +1,26 @@
 ---
-tags:
-- espnet
-- audio
-- automatic-speech-recognition
-- speech-translation
-- language-identification
-language: multilingual
 datasets:
 - espnet/yodas_owsmv4
+language: multilingual
+library_name: espnet
 license: cc-by-4.0
 metrics:
 - cer
 - bleu
 - accuracy
-
+tags:
+- espnet
+- audio
+- automatic-speech-recognition
+- speech-translation
+- language-identification
+pipeline_tag: automatic-speech-recognition
 ---
 
 [OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
 It follows the design of the project, [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/).
 
-[OWSM-CTC v4](https://
+[OWSM-CTC v4](https://huggingface.co/papers/2506.00338) is trained for three epochs on 320k hours of public audio data covering multilingual speech recognition, any-to-any speech translation, and language identification.
 The newly curated data will be publicly released. Please stay tuned!
 
 To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
@@ -32,8 +33,6 @@ espnet_model_zoo
 
 **The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
 
-
-
 ### Example script for batched inference
 
 `Speech2TextGreedySearch` now provides a unified batched inference method `batch_decode`. It performs CTC greedy decoding for a batch of short-form or long-form audios. If an audio is shorter than 30s, it will be padded to 30s; otherwise it will be split into overlapped segments (same as the "long-form ASR/ST" method below).
@@ -157,8 +156,6 @@ segments = aligner(speech, text)
 print(segments)
 ```
 
-
-
 ### OWSM series
 
 #### Encoder-decoder OWSM
@@ -173,7 +170,6 @@ print(segments)
 | OWSM v4 small | 370M | https://huggingface.co/espnet/owsm_v4_small_370M |
 | OWSM v4 medium | 1.02B | https://huggingface.co/espnet/owsm_v4_medium_1B |
 
-
 #### CTC-based OWSM
 
 | Name | Size | Hugging Face Repo |
@@ -182,8 +178,6 @@ print(segments)
 | OWSM-CTC v3.2 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v3.2_ft_1B |
 | OWSM-CTC v4 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v4_1B |
 
-
-
 ### Citations
 
 #### OWSM v4