Commit 9b39732
Parent(s): d2e439b
Copied from Clarin: http://hdl.handle.net/20.500.12537/227
Use original Readme.txt => README.md
Signed-off-by: Daniel Schnell <dschnell@grammatek.com>
- .gitattributes +1 -0
- 10_trials_optim_kenlm.scorer +3 -0
- README.md +89 -3
.gitattributes
CHANGED
|
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
10_trials_optim_kenlm.scorer filter=lfs diff=lfs merge=lfs -text
|
10_trials_optim_kenlm.scorer
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:6082bcc551041a630d54c01746b8e8b6d4c2368d9ba7f1e774e32a4b6c95ab11
|
| 3 |
+
size 1043308192
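
The three added lines are a Git LFS pointer: the repository stores this small text stub, while the scorer itself (about 1 GB, per the `size` field) lives in LFS storage. As a minimal sketch, assuming the pointer text is already in memory, it could be parsed like this; the function name `parse_lfs_pointer` is hypothetical and not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict.

    Hypothetical helper for illustration only; it handles just the three
    fields that appear in the pointer shown above.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value

    # The oid field has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:6082bcc551041a630d54c01746b8e8b6d4c2368d9ba7f1e774e32a4b6c95ab11\n"
    "size 1043308192\n"
)
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

After downloading the real file, its SHA-256 digest and byte size can be compared against these values to confirm the download is complete.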
|
README.md
CHANGED
|
@@ -1,3 +1,89 @@
|
|
| 1 |
-
---
|
| 2 |
-
|
| 3 |
-
---
|
| 1 |
+
-------------------------------------------------------------------------------
|
| 2 |
+
DeepSpeech Scorer for Icelandic 22.06
|
| 3 |
+
-------------------------------------------------------------------------------
|
| 4 |
+
|
| 5 |
+
Authors : Carlos Daniel Hernández Mena (carlosm@ru.is).
|
| 6 |
+
|
| 7 |
+
Language : Icelandic.
|
| 8 |
+
|
| 9 |
+
Recommended use : speech recognition.
|
| 10 |
+
|
| 11 |
+
-------------------------------------------------------------------------------
|
| 12 |
+
Description
|
| 13 |
+
-------------------------------------------------------------------------------
|
| 14 |
+
|
| 15 |
+
"DeepSpeech Scorer for Icelandic 22.06" is a scorer suitable for recognizers
|
| 16 |
+
based on Mozilla's DeepSpeech recognizer [1]. A "scorer" is a single file
|
| 17 |
+
used to perform language modeling. It is composed of two sub-components, a
|
| 18 |
+
KenLM language model and a trie data structure containing all words in the
|
| 19 |
+
vocabulary [2].
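
As an illustration of how a scorer is attached to a recognizer, the following is a minimal sketch that assumes the `deepspeech` 0.9 Python package and `numpy` are installed. The acoustic model path and the WAV file are assumptions: this resource ships only the scorer, so a compatible Icelandic acoustic model has to be obtained separately. The function name `transcribe` is hypothetical.

```python
import wave


def transcribe(model_path, scorer_path, wav_path):
    """Transcribe a 16-bit mono WAV file using an acoustic model plus a scorer.

    Sketch only: model_path must point to a DeepSpeech acoustic model that
    is compatible with the scorer; none is included with this resource.
    """
    # Imported here so the sketch can be read and loaded without the
    # third-party packages installed.
    import numpy as np
    from deepspeech import Model

    model = Model(model_path)
    # Attach the external scorer (KenLM language model + vocabulary trie).
    model.enableExternalScorer(scorer_path)

    with wave.open(wav_path, "rb") as wav:
        if wav.getframerate() != model.sampleRate():
            raise ValueError("WAV sample rate does not match the model")
        frames = wav.readframes(wav.getnframes())

    # DeepSpeech expects the audio as a buffer of 16-bit integer samples.
    audio = np.frombuffer(frames, dtype=np.int16)
    return model.stt(audio)
```

A call might look like `transcribe("output_graph.pbmm", "10_trials_optim_kenlm.scorer", "sample.wav")`, where the first and last file names are placeholders.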
|
| 20 |
+
|
| 21 |
+
This scorer was originally created to be used with the following DeepSpeech
|
| 22 |
+
recipe, developed by the Language and Voice Lab (LVL) at Reykjavík University
|
| 23 |
+
in 2022:
|
| 24 |
+
|
| 25 |
+
https://github.com/cadia-lvl/samromur-asr/tree/d5_samromur/d5_samromur
|
| 26 |
+
|
| 27 |
+
Nevertheless, because resources of this kind are flexible and may be useful
|
| 28 |
+
in other tasks, systems, or code recipes, it was decided to publish this
|
| 29 |
+
scorer as an independent item.
|
| 30 |
+
|
| 31 |
+
-------------------------------------------------------------------------------
|
| 32 |
+
The Language Model
|
| 33 |
+
-------------------------------------------------------------------------------
|
| 34 |
+
|
| 35 |
+
The language model was created using the Icelandic Gigaword Corpus [3]. The
|
| 36 |
+
Gigaword corpus contains text from newspaper articles, parliamentary speeches,
|
| 37 |
+
adjudications, books, transcribed radio/television news and more. The
|
| 38 |
+
normalization process of the sentences utilized to generate the language
|
| 39 |
+
model includes allowing only characters belonging to the Icelandic alphabet,
|
| 40 |
+
expanding numbers and abbreviations, and removing punctuation marks [4]. The
|
| 41 |
+
resulting text contains more than 44 million lines (approximately 5.3 GB)
|
| 42 |
+
and was used to create the scorer.
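
The normalization described above can be sketched roughly as follows. This is not the authors' pipeline (see [4] for that): lowercasing, replacing punctuation with spaces, and discarding any line that still contains characters outside the alphabet are assumptions made for illustration, and number and abbreviation expansion is left out entirely. The name `normalize_line` is hypothetical.

```python
import re
import unicodedata

# The 32 letters of the modern Icelandic alphabet, in lowercase.
ICELANDIC_ALPHABET = set("aábdðeéfghiíjklmnoóprstuúvxyýþæö")

# Anything that is neither a word character nor whitespace counts as punctuation.
PUNCTUATION = re.compile(r"[^\w\s]", re.UNICODE)


def normalize_line(line):
    """Return a normalized line, or None if the line should be discarded.

    Assumption: lines containing digits or foreign letters are dropped here,
    because expanding numbers and abbreviations is outside this sketch.
    """
    # Use precomposed characters so accented letters compare correctly.
    text = unicodedata.normalize("NFC", line).lower()
    text = PUNCTUATION.sub(" ", text)   # remove punctuation marks
    text = " ".join(text.split())       # collapse runs of whitespace
    if not text:
        return None
    if all(ch == " " or ch in ICELANDIC_ALPHABET for ch in text):
        return text
    return None


print(normalize_line("Halló, heimur!"))
print(normalize_line("Windows 10"))
```

The second call returns `None` because "w" is not part of the Icelandic alphabet and the digits have not been expanded into words.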
|
| 43 |
+
|
| 44 |
+
-------------------------------------------------------------------------------
|
| 45 |
+
Citation
|
| 46 |
+
-------------------------------------------------------------------------------
|
| 47 |
+
|
| 48 |
+
When publishing results based on the models, please refer to:
|
| 49 |
+
|
| 50 |
+
Mena, Carlos; "DeepSpeech Scorer for Icelandic 22.06". Web Download.
|
| 51 |
+
Reykjavik University: Language and Voice Lab, 2022.
|
| 52 |
+
|
| 53 |
+
Contact: Carlos Mena (carlosm@ru.is)
|
| 54 |
+
|
| 55 |
+
License: CC BY 4.0
|
| 56 |
+
|
| 57 |
+
-------------------------------------------------------------------------------
|
| 58 |
+
Acknowledgements
|
| 59 |
+
-------------------------------------------------------------------------------
|
| 60 |
+
|
| 61 |
+
This initiative was funded by the Language Technology Programme for Icelandic
|
| 62 |
+
2019-2023. The programme, which is managed and coordinated by Almannarómur,
|
| 63 |
+
is funded by the Icelandic Ministry of Education, Science and Culture.
|
| 64 |
+
|
| 65 |
+
-------------------------------------------------------------------------------
|
| 66 |
+
References
|
| 67 |
+
-------------------------------------------------------------------------------
|
| 68 |
+
|
| 69 |
+
[1] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg,
|
| 70 |
+
E., Case, C., ... & Zhu, Z. (2016, June). Deep speech 2: End-to-end
|
| 71 |
+
speech recognition in English and Mandarin. In International conference
|
| 72 |
+
on machine learning (pp. 173-182). PMLR.
|
| 73 |
+
|
| 74 |
+
[2] Mozilla's DeepSpeech online documentation:
|
| 75 |
+
https://deepspeech.readthedocs.io/en/r0.9/Scorer.html
|
| 76 |
+
|
| 77 |
+
[3] Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S.,
|
| 78 |
+
& Guðnason, J. (2018, May). Risamálheild: A very large Icelandic text
|
| 79 |
+
corpus. In Proceedings of the Eleventh International Conference on
|
| 80 |
+
Language Resources and Evaluation (LREC 2018).
|
| 81 |
+
|
| 82 |
+
[4] Nikulásdóttir, A. B., Helgadóttir, I. R., Pétursson, M., & Guðnason,
|
| 83 |
+
J. (2018, May). Open ASR for Icelandic: Resources and a baseline system.
|
| 84 |
+
In Proceedings of the Eleventh International Conference on Language
|
| 85 |
+
Resources and Evaluation (LREC 2018).
|
| 86 |
+
|
| 87 |
+
-------------------------------------------------------------------------------
|
| 88 |
+
-------------------------------------------------------------------------------
|
| 89 |
+
|