Add pipeline tag and link to paper
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,25 +1,26 @@
 ---
-tags:
-- espnet
-- audio
-- automatic-speech-recognition
-- speech-translation
-- language-identification
-language: multilingual
 datasets:
 - espnet/yodas_owsmv4
+language: multilingual
+library_name: espnet
 license: cc-by-4.0
 metrics:
 - cer
 - bleu
 - accuracy
-
+tags:
+- espnet
+- audio
+- automatic-speech-recognition
+- speech-translation
+- language-identification
+pipeline_tag: automatic-speech-recognition
 ---
 
 [OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
 It follows the design of the project, [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/).
 
-[OWSM-CTC v4](https://
+[OWSM-CTC v4](https://huggingface.co/papers/2506.00338) is trained for three epochs on 320k hours of public audio data covering multilingual speech recognition, any-to-any speech translation, and language identification.
 The newly curated data will be publicly released. Please stay tuned!
 
 To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
@@ -32,8 +33,6 @@ espnet_model_zoo
 
 **The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
 
-
-
 ### Example script for batched inference
 
 `Speech2TextGreedySearch` now provides a unified batched inference method `batch_decode`. It performs CTC greedy decoding for a batch of short-form or long-form audios. If an audio is shorter than 30s, it will be padded to 30s; otherwise it will be split into overlapped segments (same as the "long-form ASR/ST" method below).
@@ -157,8 +156,6 @@ segments = aligner(speech, text)
 print(segments)
 ```
 
-
-
 ### OWSM series
 
 #### Encoder-decoder OWSM
@@ -173,7 +170,6 @@ print(segments)
 | OWSM v4 small | 370M | https://huggingface.co/espnet/owsm_v4_small_370M |
 | OWSM v4 medium | 1.02B | https://huggingface.co/espnet/owsm_v4_medium_1B |
 
-
 #### CTC-based OWSM
 
 | Name | Size | Hugging Face Repo |
@@ -182,8 +178,6 @@ print(segments)
 | OWSM-CTC v3.2 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v3.2_ft_1B |
 | OWSM-CTC v4 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v4_1B |
 
-
-
 ### Citations
 
 #### OWSM v4