| # IndT5: A Text-to-Text Transformer for 10 Indigenous Languages | |
| | |
| <img src="https://huggingface.co/UBC-NLP/IndT5/raw/main/IND_langs_large7.png" alt="drawing" width="45%" height="45%" align="right"/> | |
| In this work, we introduce IndT5, the first Transformer language model for Indigenous languages. To train IndT5, we build IndCorpu, a new corpus for 10 Indigenous languages and Spanish. | |
| | |
| # IndT5 | |
| We train an Indigenous language model adopting the unified and flexible | |
| text-to-text transfer Transformer (T5) approach. T5 treats every | |
| text-based language task as a “text-to-text" problem, taking text format | |
| as input and producing new text format as output. T5 is essentially an | |
| encoder-decoder Transformer, with the encoder and decoder similar in | |
| configuration and size to a BERT<sub>Base</sub> but with some | |
| architectural modifications. Modifications include applying a | |
| normalization layer before a sub-block and adding a pre-norm (i.e., | |
| initial input to the sub-block output). | |
| # IndCourpus | |
| We build IndCorpus, a collection of 10 Indigeous languages and Spanish comprising 1.17GB of text, from both Wikipedia and the Bible. | |
| ### Data size and number of sentences in monolingual dataset (collected from Wikipedia and Bible) | |
| | **Target Language** | **Wiki Size (MB)** | **Wiki #Sentences** | **Bible Size (MB)** | **Bible #Sentences**| | |
| |-------------------|------------------|-------------------|------------------------|-| | |
| |Hñähñu | - | - | 1.4 | 7.5K | | |
| |Wixarika | - | - | 1.3 | 7.5K| | |
| |Nahuatl | 5.8 | 61.1K | 1.5 | 7.5K| | |
| |Guarani | 3.7 | 28.2K | 1.3 | 7.5K | | |
| |Bribri | - | - | 1.5 | 7.5K | | |
| |Rarámuri | - | - | 1.9 | 7.5K | | |
| |Quechua | 5.9 | 97.3K | 4.9 | 31.1K | | |
| |Aymara | 1.7 | 32.9K | 5 | 30.7K| | |
| |Shipibo-Konibo | - | - | 1 | 7.9K | | |
| |Asháninka | - | - | 1.4 | 7.8K | | |
| |Spanish | 1.13K | 5M | - | - | | |
| |Total | 1.15K | 5.22M | 19.8 | 125.3K| | |
| # Github | |
| More details about our model can be found here: https://github.com/UBC-NLP/IndT5 | |
| # BibTex | |
| ```bibtex | |
| @inproceedings{nagoudi-etal-2021-indt5, | |
| title = "{I}nd{T}5: A Text-to-Text Transformer for 10 Indigenous Languages", | |
| author = "Nagoudi, El Moatez Billah and Chen, Wei-Rui and Abdul-Mageed, Muhammad and Cavusoglu, Hasan", | |
| booktitle = "Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas", | |
| month = jun, | |
| year = "2021", | |
| address = "Online", | |
| publisher = "Association for Computational Linguistics", | |
| url = "https://aclanthology.org/2021.americasnlp-1.30", | |
| doi = "10.18653/v1/2021.americasnlp-1.30", | |
| pages = "265--271" | |
| } | |
| ``` | |