---
license: apache-2.0
pipeline_tag: audio-text-to-text
---
# OLMoASR
OLMoASR is a series of English automatic speech recognition (ASR) models introduced in the paper [OLMoASR: Open Models and Data for Training Robust Speech Recognition Models](https://github.com/allenai/OLMoASR.git) by Huong Ngo et al. from Ai2. Trained on 440K hours of weakly-supervised audio-text pairs collected from the public internet, OLMoASR demonstrates strong robustness and zero-shot capabilities. Visit the [OLMoASR repository](https://github.com/allenai/OLMoASR.git) for access to the data processing, training, and evaluation code.
# Model Details
OLMoASR uses a Transformer-based encoder-decoder architecture, pairing an audio encoder with a language-model (LM) decoder.
OLMoASR comes in five model sizes, and all checkpoints are trained on English-only data. The table below lists each model size and its parameter count.
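As a rough sketch of this encoder-decoder flow, the snippet below pushes mel-spectrogram frames through a toy encoder, then lets decoder token states cross-attend over the encoder output to produce token logits. All dimensions and weights here are illustrative placeholders, not OLMoASR's actual configuration.

```python
import numpy as np

# Hypothetical dimensions for illustration only; the real models
# differ by size (see the parameter table in this card).
n_frames, n_mels, d_model, n_tokens, vocab = 1500, 80, 512, 10, 1000

rng = np.random.default_rng(0)

# Audio encoder: mel-spectrogram frames -> hidden states.
mel = rng.standard_normal((n_frames, n_mels))
W_enc = rng.standard_normal((n_mels, d_model)) * 0.01
enc_states = np.tanh(mel @ W_enc)                 # (n_frames, d_model)

# LM decoder: token states cross-attend over encoder states.
tokens = rng.standard_normal((n_tokens, d_model))  # embedded prefix tokens
scores = tokens @ enc_states.T / np.sqrt(d_model)  # cross-attention scores
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)           # softmax over frames
context = attn @ enc_states                        # (n_tokens, d_model)

# Project to vocabulary logits for the next-token prediction.
W_out = rng.standard_normal((d_model, vocab)) * 0.01
logits = context @ W_out
print(logits.shape)  # (10, 1000)
```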
| Size     | Parameters |
|----------|------------|
| tiny     | 39 M       |
| base     | 74 M       |
| small    | 244 M      |
| medium   | 769 M      |
| large    | 1.5 B      |
| large-v2 | 1.5 B      |
# Training Data
OLMoASR is trained on 440K hours of weakly-supervised data subsampled from OLMoASR-Mix, a filtered version of [OLMoASR-Pool](link).
OLMoASR-Mix is a collection of 1M hours of audio-text pairs, curated from the 3M hours in OLMoASR-Pool.
# Usage
To perform transcription, you can run:
```python
import olmoasr

# Load a checkpoint by size (see the table above) in inference mode.
model = olmoasr.load_model("medium", inference=True)
result = model.transcribe("audio.mp3")
print(result)
```
# Evaluation
For evaluation details, see the [OLMoASR repository](https://github.com/allenai/OLMoASR.git).
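ASR evaluation is typically reported as word error rate (WER): the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal reference implementation (a sketch for illustration, not the repository's evaluation code) looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min over deletion, insertion, and (mis)match via the diagonal.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"):
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.33
```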
| # BibTeX entry and citation info |