## Open Whisper-style Speech Model (OWSM)

OWSM aims to develop fully open speech foundation models using publicly available data and open-source toolkits, including [ESPnet](https://github.com/espnet/espnet).
Inference examples can be found on our [project page](https://www.wavlab.org/activities/2024/owsm/).
The Gradio demo is [here](https://huggingface.co/spaces/pyf98/OWSM_v3_demo).

[OWSM v4]() is the latest version in the OWSM series; it significantly outperforms OWSM v3.1 in language identification (LID) and multilingual ASR.
Additionally, OWSM v4 applies 8x subsampling (instead of 4x in OWSM v3.1) to the log-Mel features, leading to a final time resolution of 80 ms in the encoder.
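
As a quick sanity check, the 80 ms figure follows directly from the subsampling factor. This sketch assumes the standard 10 ms log-Mel frame shift (the usual ESPnet default), which is not stated explicitly above:

```python
# Back-of-the-envelope check of the encoder's time resolution.
# Assumes a 10 ms log-Mel frame shift (the common ESPnet default).
FRAME_SHIFT_MS = 10

def encoder_resolution_ms(subsampling_factor: int) -> int:
    """Time span covered by one encoder frame after subsampling."""
    return FRAME_SHIFT_MS * subsampling_factor

print(encoder_resolution_ms(4))  # OWSM v3.1 -> 40
print(encoder_resolution_ms(8))  # OWSM v4   -> 80
```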
When running inference, we recommend setting `maxlenratio=1.0` (the default) instead of smaller values.
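
One way to see why this matters: in ESPnet-style beam search, a positive `maxlenratio` caps the decoded length at roughly `maxlenratio` times the number of encoder frames, and the 8x subsampling halves the frame count relative to v3.1, so a small `maxlenratio` can truncate long outputs. A minimal sketch under that assumption (the helper function and the 10 ms frame shift are illustrative, not part of the OWSM API):

```python
# Hypothetical sketch: why a small maxlenratio is risky with 8x subsampling.
# Assumes ESPnet-style beam search, where the decoded length is capped at
# roughly maxlenratio * (number of encoder frames).

def max_decode_len(audio_seconds: float, subsampling: int, maxlenratio: float,
                   frame_shift_ms: int = 10) -> int:
    """Upper bound on output tokens for one utterance."""
    encoder_frames = int(audio_seconds * 1000 / frame_shift_ms) // subsampling
    return max(1, int(maxlenratio * encoder_frames))

# For a 30 s utterance, 8x subsampling emits half as many encoder frames
# as 4x, so the same small maxlenratio allows half as many output tokens.
print(max_decode_len(30, subsampling=4, maxlenratio=0.5))  # 375
print(max_decode_len(30, subsampling=8, maxlenratio=0.5))  # 187
print(max_decode_len(30, subsampling=8, maxlenratio=1.0))  # 375
```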

This repo contains a base-sized model with 102M parameters, developed by [Yifan Peng](https://pyf98.github.io/) (CMU).
It is trained on 320k hours of public speech data.
The newly curated data will be publicly released. Please stay tuned!

It supports the following speech-to-text tasks:
- Language identification
- Speech recognition
- Speech translation
- Utterance-level timestamp prediction
- Long-form recognition or translation

### OWSM series