| ------------------------------------------------------------------------------- | |
| DeepSpeech Scorer for Icelandic 22.06 | |
| ------------------------------------------------------------------------------- | |
| Authors : Carlos Daniel Hernández Mena (carlosm@ru.is). | |
| Language : Icelandic. | |
| Recommended use : speech recognition. | |
| ------------------------------------------------------------------------------- | |
| Description | |
| ------------------------------------------------------------------------------- | |
| "DeepSpeech Scorer for Icelandic 22.06" is a scorer suitable for recognizers | |
| based on Mozilla's DeepSpeech recognizer [1]. A "scorer" is a single file | |
| used to perform language modeling. It is composed of two sub-components, a | |
| KenLM language model and a trie data structure containing all words in the | |
| vocabulary [2]. | |
| This scorer was originally created to be used with the following DeepSpeech | |
| recipe, developed by the Language and Voice Lab (LVL) at Reykjavík University | |
| in 2022: | |
| https://github.com/cadia-lvl/samromur-asr/tree/d5_samromur/d5_samromur | |
| Nevertheless, because resources of this kind are flexible and may be useful | |
| in other tasks, systems, or code recipes, it was decided to publish this | |
| scorer as an independent item. | |
| ------------------------------------------------------------------------------- | |
| The Language Model | |
| ------------------------------------------------------------------------------- | |
| The language model was created using the Icelandic Gigaword Corpus [3]. The | |
| Gigaword corpus contains text from newspaper articles, parliamentary speeches, | |
| adjudications, books, transcribed radio/television news and more. The | |
| sentences used to build the language model were normalized by keeping only | |
| characters belonging to the Icelandic alphabet, expanding numbers and | |
| abbreviations, and removing punctuation marks [4]. The resulting text | |
| contains more than 44 million lines (approximately 5.3 GB), and it was used | |
| to create the scorer. | |
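| The normalization described above can be illustrated with a minimal Python | |
| sketch. It is not the script used to build this scorer: the function name | |
| normalize_line and the policy of discarding lines that still contain digits | |
| or non-Icelandic letters are assumptions made for illustration, and the | |
| expansion of numbers and abbreviations is not implemented here. | |

```python
import unicodedata

# The 32 letters of the Icelandic alphabet, lowercase (c, q, w and z are excluded).
ICELANDIC_ALPHABET = set("aábdðeéfghiíjklmnoóprstuúvxyýþæö")


def normalize_line(line):
    """Return a cleaned, lowercased line, or None if the line must be dropped.

    Hypothetical policy: punctuation and symbols are replaced by spaces, and a
    line is dropped if any remaining character is outside the Icelandic
    alphabet (for example digits, which this sketch does not expand).
    """
    text = unicodedata.normalize("NFC", line).lower()
    # Unicode categories starting with "P" (punctuation) or "S" (symbols)
    # are turned into spaces so that adjacent words stay separated.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    # Collapse runs of whitespace into single spaces.
    text = " ".join(text.split())
    if not text:
        return None
    if any(ch != " " and ch not in ICELANDIC_ALPHABET for ch in text):
        return None
    return text


if __name__ == "__main__":
    for sample in ["Halló, heimur!", "Árið 2022", "Hello world"]:
        print(repr(sample), "->", repr(normalize_line(sample)))
```

| In this sketch, a line such as "Halló, heimur!" is kept once its punctuation | |
| is removed, while lines containing digits or letters such as "w" are | |
| discarded rather than repaired. | |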
| ------------------------------------------------------------------------------- | |
| Citation | |
| ------------------------------------------------------------------------------- | |
| When publishing results based on this scorer, please refer to: | |
| Mena, Carlos; "DeepSpeech Scorer for Icelandic 22.06". Web Download. | |
| Reykjavik University: Language and Voice Lab, 2022. | |
| Contact: Carlos Mena (carlosm@ru.is) | |
| License: CC BY 4.0 | |
| ------------------------------------------------------------------------------- | |
| Acknowledgements | |
| ------------------------------------------------------------------------------- | |
| This initiative was funded by the Language Technology Programme for Icelandic | |
| 2019-2023. The programme, which is managed and coordinated by Almannarómur, | |
| is funded by the Icelandic Ministry of Education, Science and Culture. | |
| ------------------------------------------------------------------------------- | |
| References | |
| ------------------------------------------------------------------------------- | |
| [1] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, | |
| E., Case, C., ... & Zhu, Z. (2016, June). Deep Speech 2: End-to-end | |
| speech recognition in English and Mandarin. In International Conference | |
| on Machine Learning (pp. 173-182). PMLR. | |
| [2] Mozilla's DeepSpeech online documentation: | |
| https://deepspeech.readthedocs.io/en/r0.9/Scorer.html | |
| [3] Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S., | |
| & Guðnason, J. (2018, May). Risamálheild: A very large Icelandic text | |
| corpus. In Proceedings of the Eleventh International Conference on | |
| Language Resources and Evaluation (LREC 2018). | |
| [4] Nikulásdóttir, A. B., Helgadóttir, I. R., Pétursson, M., & Guðnason, | |
| J. (2018, May). Open ASR for Icelandic: Resources and a baseline system. | |
| In Proceedings of the Eleventh International Conference on Language | |
| Resources and Evaluation (LREC 2018). | |
| ------------------------------------------------------------------------------- | |