espnet/owsm_ctc_v4_1B

Automatic Speech Recognition

speech-translation

language-identification


pyf98 committed on Aug 30, 2025

Commit 2422ee9 · verified · 1 Parent(s): 0f96e13

Update README.md

Files changed (1)

README.md +9 -5


@@ -17,11 +17,15 @@ tags:
 pipeline_tag: automatic-speech-recognition
 ---
-[OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
-It follows the design of the project, [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/).
-[OWSM-CTC v4](https://huggingface.co/papers/2506.00338) is trained for three epochs on 320k hours of public audio data covering multilingual speech recognition, any-to-any speech translation, and language identification.
-The newly curated data will be publicly released. Please stay tuned!
+[Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) is the first **fully open** Whisper-style speech foundation model.
+It reproduces and advances OpenAI's Whisper-style training using publicly available data and open-source toolkits.
+The code, pre-trained model weights, and training logs are publicly released to promote open science in speech foundation models.
+[OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is a novel encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
+It supports multilingual speech recognition, speech translation, and language identification within a single non-autoregressive model.
+[OWSM-CTC v4](https://www.isca-archive.org/interspeech_2025/peng25c_interspeech.html) is trained for three epochs on 320k hours of public audio data covering multilingual speech recognition, any-to-any speech translation, and language identification.
+The newly curated data are publicly released: https://huggingface.co/datasets/espnet/yodas_owsmv4
 To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
 ```
@@ -31,7 +35,7 @@ espnet
 espnet_model_zoo
 ```
-**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
+**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v4/s2t1
 ### Example script for batched inference