Update README.md
README.md CHANGED

@@ -3,7 +3,7 @@ language:
 - ca
 library_name: nemo
 datasets:
--
+- mozilla-foundation/common_voice_9_0
 thumbnail: null
 tags:
 - automatic-speech-recognition
@@ -17,11 +17,6 @@ tags:
 - hf-asr-leaderboard
 - Riva
 license: cc-by-4.0
-widget:
-- example_title: Librispeech sample 1
-  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
-- example_title: Librispeech sample 2
-  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
 model-index:
 - name: stt_ca_conformer_ctc_large
   results:
@@ -29,16 +24,16 @@ model-index:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
-      name:
-      type:
-      config:
+      name: Mozilla Common Voice 9.0
+      type: mozilla-foundation/common_voice_9_0
+      config: ca
       split: test
       args:
-        language:
+        language: ca
     metrics:
     - name: Test WER
       type: wer
-      value:
+      value: 4.27
 
 ---
 
@@ -93,7 +88,7 @@ asr_model.transcribe(['2086-149220-0033.wav'])
 
 ```shell
 python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
-  pretrained_name="nvidia/
+  pretrained_name="nvidia/stt_ca_conformer_ctc_large"
   audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
 ```
 
@@ -115,40 +110,32 @@ The NeMo toolkit [3] was used for training the models for over several hundred e
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
-The
+The vocabulary we use contains 44 characters:
+```python
+['s','e','r','v','i','d','p','o','g','a','m','t','u','l','f','c','z','b','q','n','é',"'",'x','ó','è','h','í','ü','j','à','ï','w','k','y','ç','ú','ò','á','ı','·','ñ','—','–','-']
+```
 
+Full config can be found inside the .nemo files.
 
+The checkpoint of the language model used as the neural rescorer can be found [here](https://ngc.nvidia.com/catalog/models/nvidia:nemo:asrlm_en_transformer_large_ls). You may find more info on how to train and use language models for ASR models here: [ASR Language Modeling](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html)
 
-- Fisher Corpus
-- Switchboard-1 Dataset
-- WSJ-0 and WSJ-1
-- National Speech Corpus (Part 1, Part 6)
-- VCTK
-- VoxPopuli (EN)
-- Europarl-ASR (EN)
-- Multilingual Librispeech (MLS EN) - 2,000 hours subset
-- Mozilla Common Voice (v7.0)
+### Datasets
 
+All the models in this collection are trained on the MCV-9.0 Catalan dataset, which contains around 1203 hours of training, 28 hours of development, and 27 hours of test speech.
 
 ## Performance
 
 The list of the available models in this collection is shown in the following table. Performances of the ASR models are reported in terms of Word Error Rate (WER%) with greedy decoding.
 
-| Version | Tokenizer
-|---------|-----------------------|-----------------|-----
-| 1.
+| Version | Tokenizer             | Vocabulary Size | Dev WER | Test WER | Train Dataset     |
+|---------|-----------------------|-----------------|---------|----------|-------------------|
+| 1.11.0  | SentencePiece Unigram | 128             | 4.70    | 4.27     | MCV-9.0 Train set |
 
+You may use language models (LMs) and beam search to improve the accuracy of the models, as reported in the following table.
 
-| Language
-|--------------------------
-| N-gram LM
-| Neural Rescorer (Transformer)          | LS Train + LS LM Corpus | 3.4 | 1.7 | N=10, beam_width=128 |
-| N-gram + Neural Rescorer (Transformer) | LS Train + LS LM Corpus | 3.2 | 1.8 | N=10, beam_width=128, n_gram_alpha=1.0, n_gram_beta=1.0 |
+| Language Model | Test WER | Test WER w/ Oracle LM | Train Dataset     | Settings                                             |
+|----------------|----------|-----------------------|-------------------|------------------------------------------------------|
+| N-gram LM      | 3.77     | 1.54                  | MCV-9.0 Train set | N=6, beam_width=128, ngram_alpha=1.5, ngram_beta=2.0 |
 
 
 ## Limitations