Copied from Clarin: http://hdl.handle.net/20.500.12537/227

	-------------------------------------------------------------------------------
	DeepSpeech Scorer for Icelandic 22.06
	-------------------------------------------------------------------------------

	Authors : Carlos Daniel Hernández Mena (carlosm@ru.is).

	Language : Icelandic.

	Recommended use : speech recognition.

	-------------------------------------------------------------------------------
	Description
	-------------------------------------------------------------------------------

	"DeepSpeech Scorer for Icelandic 22.06" is a scorer suitable for recognizers
	based on Mozilla's DeepSpeech recognizer [1]. A "scorer" is a single file
	used to perform language modeling. It is composed of two sub-components: a
	KenLM language model and a trie data structure containing all words in the
	vocabulary [2].
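
	As an illustration of how a scorer file is typically attached to a
	recognizer, the following is a minimal sketch that assumes the Python
	bindings of DeepSpeech 0.9 (the `deepspeech` package). The helper name
	`attach_scorer` and the file names in the usage comment are hypothetical;
	the method names `enableExternalScorer` and `setScorerAlphaBeta` come from
	the DeepSpeech 0.9 Python API.

```python
def attach_scorer(model, scorer_path, alpha=None, beta=None):
    """Enable an external scorer on an already loaded DeepSpeech model.

    `model` is expected to behave like `deepspeech.Model` (0.9 API).
    `alpha` and `beta` are the language model weight and the word
    insertion bonus; when omitted, the defaults stored in the scorer
    package are kept.
    """
    model.enableExternalScorer(scorer_path)
    if alpha is not None and beta is not None:
        model.setScorerAlphaBeta(alpha, beta)
    return model


# Usage sketch (hypothetical file names, not part of this release):
#
#   from deepspeech import Model
#   model = attach_scorer(Model("acoustic_model.pbmm"), "icelandic.scorer")
#   text = model.stt(audio)  # audio: 16 kHz mono samples as a NumPy int16 array
```

	Passing the model in as an argument keeps the helper independent of how
	the acoustic model was loaded.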

	This scorer was originally created to be used with the following DeepSpeech
	recipe, developed by the Language and Voice Lab (LVL) at Reykjavík University
	in 2022:

	https://github.com/cadia-lvl/samromur-asr/tree/d5_samromur/d5_samromur

	Nevertheless, because resources of this kind are flexible and can be applied
	to other tasks, systems or code recipes, it was decided to publish this
	scorer as an independent item.

	-------------------------------------------------------------------------------
	The Language Model
	-------------------------------------------------------------------------------

	The language model was created using the Icelandic Gigaword Corpus [3]. The
	Gigaword corpus contains text from newspaper articles, parliamentary speeches,
	adjudications, books, transcribed radio/television news and more. The
	sentences used to generate the language model were normalized by allowing
	only characters belonging to the Icelandic alphabet, expanding numbers and
	abbreviations, and removing punctuation marks [4]. The resulting text
	contains more than 44 million lines (approximately 5.3 GB), and it was used
	to create the scorer.
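
	The normalization steps described above can be illustrated with a small
	sketch. It is not the pipeline that was actually used [4]: lowercasing is an
	assumption, number and abbreviation expansion are not implemented (they
	would need dedicated lexicons), and lines that still contain characters
	outside the Icelandic alphabet are simply rejected here. The names
	`ICELANDIC_LETTERS` and `normalize_line` are hypothetical.

```python
import re
from typing import Optional

# The 32 letters of the modern Icelandic alphabet, in lowercase.
ICELANDIC_LETTERS = "aábdðeéfghiíjklmnoóprstuúvxyýþæö"
_ALLOWED = set(ICELANDIC_LETTERS + " ")

# Matches punctuation and symbols (anything that is not a letter, digit,
# or whitespace), plus the underscore that \w would otherwise let through.
_PUNCTUATION = re.compile(r"[^\w\s]|_")


def normalize_line(line: str) -> Optional[str]:
    """Return a normalized line, or None if the line should be discarded."""
    text = line.lower()                 # assumption: the text is lowercased
    text = _PUNCTUATION.sub(" ", text)  # remove punctuation marks
    text = " ".join(text.split())       # collapse runs of whitespace
    if not text:
        return None
    # Digits and foreign letters (c, q, w, z, ...) are not expanded or
    # transliterated in this sketch, so such lines are rejected.
    if any(ch not in _ALLOWED for ch in text):
        return None
    return text


print(normalize_line("Halló, heimur!"))
```

	A real pipeline would expand "5" into its spoken form before filtering
	instead of discarding the whole line.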

	-------------------------------------------------------------------------------
	Citation
	-------------------------------------------------------------------------------

	When publishing results based on this scorer, please refer to:

	Mena, Carlos; "DeepSpeech Scorer for Icelandic 22.06". Web Download.
	Reykjavik University: Language and Voice Lab, 2022.

	Contact: Carlos Mena (carlosm@ru.is)

	License: CC BY 4.0

	-------------------------------------------------------------------------------
	Acknowledgements
	-------------------------------------------------------------------------------

	This initiative was funded by the Language Technology Programme for Icelandic
	2019-2023. The programme, which is managed and coordinated by Almannarómur,
	is funded by the Icelandic Ministry of Education, Science and Culture.

	-------------------------------------------------------------------------------
	References
	-------------------------------------------------------------------------------

	[1] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg,
	E., Case, C., ... & Zhu, Z. (2016, June). Deep Speech 2: End-to-end
	speech recognition in English and Mandarin. In International Conference
	on Machine Learning (pp. 173-182). PMLR.

	[2] Mozilla's DeepSpeech online documentation:
	https://deepspeech.readthedocs.io/en/r0.9/Scorer.html

	[3] Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S.,
	& Guðnason, J. (2018, May). Risamálheild: A very large Icelandic text
	corpus. In Proceedings of the Eleventh International Conference on
	Language Resources and Evaluation (LREC 2018).

	[4] Nikulásdóttir, A. B., Helgadóttir, I. R., Pétursson, M., & Guðnason,
	J. (2018, May). Open ASR for Icelandic: Resources and a baseline system.
	In Proceedings of the Eleventh International Conference on Language
	Resources and Evaluation (LREC 2018).

	-------------------------------------------------------------------------------
	-------------------------------------------------------------------------------