	---
	tags:
	- espnet
	- audio
	- automatic-speech-recognition
	- speech-translation
	language: multilingual
	datasets:
	- owsm_v3.1
	license: cc-by-4.0
	---

	## OWLS: Open Whisper-style Large-scale neural model Suite

	OWLS is a suite of Whisper-style models, designed to help researchers understand the scaling properties of speech models.
	OWLS models range from 0.25B to 18B parameters, and are trained on up to 360K hours of data.

	OWLS models are developed using [ESPnet](https://github.com/espnet/espnet) and support multilingual speech recognition and speech translation.

	OWLS is part of the [OWSM](https://www.wavlab.org/activities/2024/owsm/) project, which aims to develop fully open speech foundation models using publicly available data and open-source toolkits.

	The model in this repo has 9.31B parameters in total and is trained on 180K hours of public speech data.
	Specifically, it supports the following speech-to-text tasks:
	- Speech recognition
	- Any-to-any-language speech translation
	- Utterance-level alignment
	- Long-form transcription
	- Language identification
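
In ESPnet's speech-to-text inference interface, the task and language are typically selected with special symbols. The helper below is a minimal sketch of how such a symbol pair could be built. `s2t_symbols` is a hypothetical name, and the token formats (`<eng>`-style ISO 639-3 language codes, `<asr>` for recognition, `<st_xxx>` for translation into language `xxx`) are assumed from the OWSM convention rather than confirmed for this checkpoint.

```python
from typing import Dict, Optional


def s2t_symbols(src_lang: str, tgt_lang: Optional[str] = None) -> Dict[str, str]:
    """Build the language/task symbol pair for one decoding request.

    Assumed convention (OWSM-style, not confirmed for this checkpoint):
    "<eng>"-style language tokens, "<asr>" for recognition, and
    "<st_xxx>" for translation into language xxx.
    """
    if tgt_lang is None or tgt_lang == src_lang:
        # Same source and target language: plain speech recognition.
        task = "<asr>"
    else:
        # Different target language: speech translation into tgt_lang.
        task = f"<st_{tgt_lang}>"
    return {"lang_sym": f"<{src_lang}>", "task_sym": task}
```

Under the same assumption, the returned dictionary could be passed as keyword arguments when calling the `Speech2Text` object from the next section, for example `model(speech, **s2t_symbols("eng", "deu"))`.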

	## Use this model

	You can use this model in your projects with the following code:

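	```python
	# make sure espnet is installed: pip install espnet
	# (from_pretrained may also require: pip install espnet_model_zoo)
	import librosa
	import soundfile
	from espnet2.bin.s2t_inference import Speech2Text

	# Download the checkpoint from the Hugging Face Hub and build the inference object.
	model = Speech2Text.from_pretrained(
	    "espnet/owls_9B_180K"
	)

	# Load the audio, then resample it so the input has a 16 kHz sampling rate.
	speech, rate = soundfile.read("speech.wav")
	speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)

	# Decode and keep the text of the top hypothesis.
	text, *_ = model(speech)[0]
	```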
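
Since the snippet above resamples every input to 16 kHz, it can be useful to check a file's sampling rate first and skip resampling when it is not needed. The following is a minimal sketch using only Python's standard `wave` module, which reads uncompressed PCM WAV files. `needs_resampling` and `TARGET_RATE` are hypothetical names introduced here for illustration.

```python
import wave

TARGET_RATE = 16000  # sampling rate used in the usage snippet above


def needs_resampling(path: str, target_rate: int = TARGET_RATE) -> bool:
    """Return True if the PCM WAV file at `path` is not already at `target_rate`."""
    with wave.open(path, "rb") as wav:
        # The sampling rate is stored in the WAV header, so no audio is decoded.
        return wav.getframerate() != target_rate
```

For example, `needs_resampling("speech.wav")` would return `False` for a file already recorded at 16 kHz, in which case the `librosa.resample` call could be skipped.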

	## OWLS models
	| Model Name | Checkpoint | Training Artifacts |
	| ---------- | ---------- | ------------------ |
	| OWLS 0.25B 180K | https://huggingface.co/espnet/owls_025B_180K | TBA |
	| OWLS 0.50B 180K | https://huggingface.co/espnet/owls_05B_180K | https://huggingface.co/espnet/owls_05B_180K_intermediates/tree/main |
	| OWLS 1B 11K | TBA | TBA |
	| OWLS 1B 22K | TBA | TBA |
	| OWLS 1B 45K | TBA | TBA |
	| OWLS 1B 90K | TBA | TBA |
	| OWLS 1B 180K | https://huggingface.co/espnet/owls_1B_180K | TBA |
	| OWLS 2B 180K | https://huggingface.co/espnet/owls_2B_180K | TBA |
	| OWLS 4B 180K | https://huggingface.co/espnet/owls_4B_180K | https://huggingface.co/espnet/owls_4B_180K_intermediates |
	| OWLS 9B 180K | https://huggingface.co/espnet/owls_9B_180K | https://huggingface.co/espnet/owls_9B_180K_intermediates |
	| OWLS 18B 180K | https://huggingface.co/espnet/owls_18B_180K | TBA |
	| OWLS 18B 360K | https://huggingface.co/espnet/owls_18B_360K | TBA |
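
When comparing models across scales, the released checkpoints in the table can be collected into a small lookup. This is only a sketch: `RELEASED_CHECKPOINTS` and `checkpoint_for` are hypothetical names, and the repository IDs are copied from the table's checkpoint URLs (rows marked TBA are left out).

```python
# (model size, training hours) -> Hugging Face repo ID, taken from the table above.
RELEASED_CHECKPOINTS = {
    ("0.25B", "180K"): "espnet/owls_025B_180K",
    ("0.50B", "180K"): "espnet/owls_05B_180K",
    ("1B", "180K"): "espnet/owls_1B_180K",
    ("2B", "180K"): "espnet/owls_2B_180K",
    ("4B", "180K"): "espnet/owls_4B_180K",
    ("9B", "180K"): "espnet/owls_9B_180K",
    ("18B", "180K"): "espnet/owls_18B_180K",
    ("18B", "360K"): "espnet/owls_18B_360K",
}


def checkpoint_for(size: str, hours: str = "180K") -> str:
    """Return the repo ID for a released OWLS model, e.g. checkpoint_for("9B")."""
    try:
        return RELEASED_CHECKPOINTS[(size, hours)]
    except KeyError:
        # Covers both unknown sizes and rows still listed as TBA.
        raise ValueError(f"No released checkpoint for OWLS {size} {hours}") from None
```

The returned string can be passed to `Speech2Text.from_pretrained` in place of the hard-coded repository name.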



	## Citations

	```
	@article{chen2025owls,
	title={OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models},
	author={Chen, William and Tian, Jinchuan and Peng, Yifan and Yan, Brian and Yang, Chao-Han Huck and Watanabe, Shinji},
	journal={arXiv preprint arXiv:2502.10373},
	year={2025}
	}
	```