Automatic Speech Recognition (ASR)
==================================

ASR, or Automatic Speech Recognition, refers to the problem of getting a program to automatically transcribe spoken language
(speech-to-text). Our goal is usually to have a model that minimizes the Word Error Rate (WER) metric when transcribing speech input.
In other words, given some audio file (e.g. a WAV file) containing speech, how do we transform this into the corresponding text with
as few errors as possible?
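
To make the metric concrete, here is a minimal sketch of WER: the word-level edit distance between a hypothesis and a reference,
divided by the reference length. The function below is illustrative only and is not part of NeMo.

.. code-block:: python

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + deletions + insertions) / reference word count."""
        ref, hyp = reference.split(), hypothesis.split()
        # Standard dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    # One substitution ("cat" -> "bat") and one insertion ("down"): 2 / 3 = 0.67
    print(word_error_rate("the cat sat", "the bat sat down"))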
Traditional speech recognition takes a generative approach, modeling the full pipeline of how speech sounds are produced in order to
evaluate a speech sample. The pipeline runs from a language model that captures the most likely orderings of words (e.g. an n-gram
model), through a pronunciation model for each word in that ordering (e.g. a pronunciation table), to an acoustic model that
translates those pronunciations into audio waveforms (e.g. a Gaussian Mixture Model).

Then, if we receive some spoken input, our goal is to find the most likely sequence of text that would produce the given audio
according to our generative pipeline of models. Overall, with traditional speech recognition, we try to model
``Pr(audio|transcript)*Pr(transcript)`` and take the argmax of this quantity over possible transcripts.
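
To make the decoding objective concrete, here is a toy sketch of that argmax with made-up log-scores; in a real system the two
score tables would come from an acoustic model and a language model rather than hard-coded dictionaries.

.. code-block:: python

    # Hypothetical candidate transcripts with made-up log-probabilities.
    log_p_audio_given_transcript = {   # acoustic model: log Pr(audio|transcript)
        "recognize speech": -12.3,
        "wreck a nice beach": -11.9,
    }
    log_p_transcript = {               # language model: log Pr(transcript)
        "recognize speech": -4.1,
        "wreck a nice beach": -9.7,
    }

    # Decoding: argmax over candidate transcripts of
    # log Pr(audio|transcript) + log Pr(transcript).
    best = max(
        log_p_audio_given_transcript,
        key=lambda t: log_p_audio_given_transcript[t] + log_p_transcript[t],
    )
    print(best)  # "recognize speech": the language model outweighs the small acoustic edge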
Over time, neural networks advanced to the point where each component of the traditional speech recognition pipeline could be replaced
by a neural model with better performance and greater potential for generalization. For example, we could replace an n-gram model
with a neural language model, and replace a pronunciation table with a neural pronunciation model, and so on. However, each of
these neural models needs to be trained individually on a different task, and an error in any model in the pipeline can throw off
the whole prediction.

Thus, we can see the appeal of end-to-end ASR architectures: discriminative models that simply take an audio input and give a textual
output, and in which all components of the architecture are trained together towards the same goal. The model's encoder is
akin to an acoustic model that extracts speech features, which can then be piped directly to a decoder that outputs text. If desired,
we could also integrate a language model to further improve our predictions. And the entire end-to-end ASR model can be trained
at once, which makes for a much easier pipeline to handle!
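
As a quick taste of such an end-to-end model, the sketch below loads a pretrained NeMo checkpoint and transcribes an audio file.
The checkpoint name is one published English model and ``audio.wav`` is a placeholder path; the exact ``transcribe()`` signature
may vary between NeMo versions.

.. code-block:: python

    import nemo.collections.asr as nemo_asr

    # Download a pretrained end-to-end model from the cloud; any ASR model
    # name from the NGC catalog can be substituted here.
    asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_small")

    # The encoder extracts acoustic features and the decoder emits text in a
    # single forward pass; no separately trained pronunciation model is needed.
    transcripts = asr_model.transcribe(["audio.wav"])  # placeholder path
    print(transcripts[0])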
The demo below allows evaluation of NeMo ASR models in multiple languages from the browser:

.. raw:: html

    <iframe src="https://hf.space/embed/smajumdar/nemo_multilingual_language_id/+"
        width="100%" class="gradio-asr"></iframe>

    <script type="text/javascript" language="javascript">
        $('.gradio-asr').css('height', $(window).height()+'px');
    </script>
The full documentation tree is as follows:

.. toctree::
   :maxdepth: 8

   models
   datasets
   asr_language_modeling
   results
   scores
   configs
   api
   resources
   examples/kinyarwanda_asr.rst

.. include:: resources.rst