---
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
language: multilingual
datasets:
- owsm_v3.1
license: cc-by-4.0
---

## OWLS: Open Whisper-style Large-scale neural model Suite

OWLS is a suite of Whisper-style models designed to help researchers understand the scaling properties of speech models.
OWLS models range from 0.25B to 18B parameters and are trained on up to 360K hours of data.

OWLS models are developed using [ESPnet](https://github.com/espnet/espnet) and support multilingual speech recognition and translation.

OWLS is part of the [OWSM](https://www.wavlab.org/activities/2024/owsm/) project, which aims to develop fully open speech foundation models using publicly available data and open-source toolkits.

The model in this repo has 9.31B parameters in total and is trained on 180K hours of public speech data.
It supports the following speech-to-text tasks:
- Speech recognition
- Any-to-any-language speech translation
- Utterance-level alignment
- Long-form transcription
- Language identification
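
OWSM-style models usually select between recognition and translation through special language and task tokens. The sketch below assumes this checkpoint follows that convention, with `<eng>`-style language tokens plus `<asr>` and `<st_xxx>` task tokens supplied through the `lang_sym` and `task_sym` arguments of ESPnet's `Speech2Text`. This card does not state the token format, so treat it as an assumption; `task_symbols` is a hypothetical helper name used only for illustration.

```python
def _check_code(code):
    # Language codes are assumed to be three lowercase ASCII letters.
    if not (len(code) == 3 and code.isascii() and code.isalpha() and code.islower()):
        raise ValueError(f"expected a three-letter language code, got {code!r}")


def task_symbols(source_lang, target_lang=None):
    """Build (lang_sym, task_sym) for recognition or translation.

    Assumes OWSM-style tokens: "<eng>" for the spoken language, "<asr>" for
    recognition, and "<st_xxx>" for translation into language xxx.
    """
    _check_code(source_lang)
    if target_lang is None:
        return f"<{source_lang}>", "<asr>"
    _check_code(target_lang)
    return f"<{source_lang}>", f"<st_{target_lang}>"


# English speech recognition
print(task_symbols("eng"))         # ('<eng>', '<asr>')
# English speech translated into German
print(task_symbols("eng", "deu"))  # ('<eng>', '<st_deu>')
```

With a loaded model, the pair would then be passed as `model(speech, lang_sym=lang_sym, task_sym=task_sym)`, provided the checkpoint accepts these tokens.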

## Use this model

You can use this model in your projects with the following code:

```python
# make sure espnet is installed: pip install espnet
# loading from the Hugging Face Hub may also require: pip install espnet_model_zoo
# soundfile and librosa are used here to read and resample the audio
import librosa
import soundfile
from espnet2.bin.s2t_inference import Speech2Text

model = Speech2Text.from_pretrained(
    "espnet/owls_9B_180K"
)

# read the audio and make sure it has a 16k sampling rate
speech, rate = soundfile.read("speech.wav")
speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)

# keep the text of the top hypothesis
text, *_ = model(speech)[0]
```
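
Because the input must be at a 16k sampling rate, it can help to inspect a file before loading a multi-billion-parameter checkpoint. The following is a minimal sketch using only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. `wav_info`, `needs_resampling`, and `TARGET_RATE` are illustrative names, not part of ESPnet.

```python
import wave

TARGET_RATE = 16000  # sampling rate the usage example resamples to


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True if the file's sampling rate differs from target_rate."""
    rate, _, _ = wav_info(path)
    return rate != target_rate
```

The channel count is reported because `soundfile.read` returns a two-dimensional array for multi-channel files, whereas the usage example passes the array on without reshaping it.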

## OWLS models

| Model Name | Checkpoint | Training Artifacts |
| --- | --- | --- |
| OWLS 0.25B 180K | https://huggingface.co/espnet/owls_025B_180K | TBA |
| OWLS 0.50B 180K | https://huggingface.co/espnet/owls_05B_180K | https://huggingface.co/espnet/owls_05B_180K_intermediates/tree/main |
| OWLS 1B 11K | TBA | TBA |
| OWLS 1B 22K | TBA | TBA |
| OWLS 1B 45K | TBA | TBA |
| OWLS 1B 90K | TBA | TBA |
| OWLS 1B 180K | https://huggingface.co/espnet/owls_1B_180K | TBA |
| OWLS 2B 180K | https://huggingface.co/espnet/owls_2B_180K | TBA |
| OWLS 4B 180K | https://huggingface.co/espnet/owls_4B_180K | https://huggingface.co/espnet/owls_4B_180K_intermediates |
| OWLS 9B 180K | https://huggingface.co/espnet/owls_9B_180K | https://huggingface.co/espnet/owls_9B_180K_intermediates |
| OWLS 18B 180K | https://huggingface.co/espnet/owls_18B_180K | TBA |
| OWLS 18B 360K | https://huggingface.co/espnet/owls_18B_360K | TBA |
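
When choosing a checkpoint from the table above, a small lookup plus a back-of-the-envelope weight-size estimate can be handy. The sketch below copies the released repository IDs from the table and estimates the size of the weights alone from a parameter count, at 4 bytes per parameter for 32-bit floats or 2 bytes for half precision. `CHECKPOINTS`, `checkpoint_url`, and `weight_memory_gb` are illustrative names. The estimate ignores activations and decoding state, so real memory use will be higher.

```python
# Released checkpoints from the table, keyed by (model size, training hours).
# Entries listed as TBA are omitted.
CHECKPOINTS = {
    ("0.25B", "180K"): "espnet/owls_025B_180K",
    ("0.50B", "180K"): "espnet/owls_05B_180K",
    ("1B", "180K"): "espnet/owls_1B_180K",
    ("2B", "180K"): "espnet/owls_2B_180K",
    ("4B", "180K"): "espnet/owls_4B_180K",
    ("9B", "180K"): "espnet/owls_9B_180K",
    ("18B", "180K"): "espnet/owls_18B_180K",
    ("18B", "360K"): "espnet/owls_18B_360K",
}


def checkpoint_url(size, hours="180K"):
    """Return the Hugging Face URL of a released checkpoint."""
    try:
        repo_id = CHECKPOINTS[(size, hours)]
    except KeyError:
        raise KeyError(f"no released checkpoint for OWLS {size} {hours}") from None
    return f"https://huggingface.co/{repo_id}"


def weight_memory_gb(n_params, bytes_per_param=4):
    """Rough size of the weights alone, in decimal GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


print(checkpoint_url("9B"))         # https://huggingface.co/espnet/owls_9B_180K
# The model in this repo has 9.31B parameters.
print(weight_memory_gb(9.31e9))     # 37.24
print(weight_memory_gb(9.31e9, 2))  # 18.62
```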

## Citations

```
@article{chen2025owls,
  title={OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models},
  author={Chen, William and Tian, Jinchuan and Peng, Yifan and Yan, Brian and Yang, Chao-Han Huck and Watanabe, Shinji},
  journal={arXiv preprint arXiv:2502.10373},
  year={2025}
}
```