# IndT5: A Text-to-Text Transformer for 10 Indigenous Languages

<img src="https://huggingface.co/UBC-NLP/IndT5/raw/main/IND_langs_large7.png" alt="drawing" width="45%" height="45%" align="right"/>

In this work, we introduce IndT5, the first Transformer language model for Indigenous languages. To train IndT5, we build IndCorpus, a new corpus for 10 Indigenous languages and Spanish.
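The checkpoint can likely be used with the standard Hugging Face T5 classes. The sketch below is illustrative, not documented usage: it assumes the model is published as `UBC-NLP/IndT5` on the Hugging Face Hub (the repository hosting the image above) and loads through `T5ForConditionalGeneration`; the input string is only a placeholder.

```python
# Minimal sketch, assuming the checkpoint is hosted as "UBC-NLP/IndT5" on the
# Hugging Face Hub and is compatible with the standard T5 classes.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/IndT5")
model = T5ForConditionalGeneration.from_pretrained("UBC-NLP/IndT5")

# T5 casts every task as text-to-text: text in, generated text out.
inputs = tokenizer("Aymar aru", return_tensors="pt")  # placeholder input text
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```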
# IndT5
We train an Indigenous language model adopting the unified and flexible text-to-text transfer Transformer (T5) approach. T5 treats every text-based language task as a “text-to-text” problem, taking text as input and producing new text as output. T5 is essentially an encoder-decoder Transformer, with the encoder and decoder similar in configuration and size to BERT<sub>Base</sub> but with some architectural modifications. These include applying layer normalization before each sub-block and adding a pre-norm residual connection (i.e., each sub-block's initial input is added to its output).
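To make the pre-norm modification concrete, here is a minimal PyTorch sketch of the sub-block pattern just described. The class and variable names are illustrative and not taken from the T5 or IndT5 code; T5 itself uses a simplified RMS-style layer norm, for which `nn.LayerNorm` stands in here.

```python
import torch
import torch.nn as nn

class PreNormSubBlock(nn.Module):
    """Illustrative pre-norm sub-block: normalization is applied *before*
    the inner layer, and a residual connection adds the sub-block's
    initial input to its output."""

    def __init__(self, d_model: int, inner: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # stand-in for T5's RMS-style norm
        self.inner = inner  # e.g., self-attention or a feed-forward layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm: normalize, transform, then add the residual input back.
        return x + self.inner(self.norm(x))

# Example: wrap a feed-forward layer, as inside each Transformer block.
d_model = 512
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
block = PreNormSubBlock(d_model, ffn)
print(block(torch.randn(2, 16, d_model)).shape)  # torch.Size([2, 16, 512])
```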
# IndCorpus

We build IndCorpus, a collection of 10 Indigenous languages and Spanish comprising 1.17 GB of text drawn from both Wikipedia and the Bible.

| **Language** | **Language Code** | **Main Location** | **Number of Speakers** |
|------------------|-------------------|-------------------|------------------------|
| Aymara | aym | Bolivia | 1,677,100 |
| Asháninka | cni | Peru | 35,200 |
| Bribri | bzd | Costa Rica | 7,000 |
| Guarani | gn | Paraguay | 6,652,790 |
| Hñähñu | oto | Mexico | 88,500 |
| Nahuatl | nah | Mexico | 410,000 |
| Quechua | quy | Peru | 7,384,920 |
| Rarámuri | tar | Mexico | 9,230 |
| Shipibo-Konibo | shp | Peru | 22,500 |
| Wixarika | hch | Mexico | 52,500 |

### Data size and number of sentences in the monolingual dataset (collected from Wikipedia and the Bible)

| **Target Language** | **Wiki Size (MB)** | **Wiki #Sentences** | **Bible Size (MB)** | **Bible #Sentences** |
|---------------------|--------------------|---------------------|---------------------|----------------------|
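For illustration, per-language sizes and sentence counts like those in the table above could be computed with a small script such as the sketch below. It assumes one sentence per line in a plain-text file; that layout is an assumption for illustration, not the documented IndCorpus release format.

```python
from pathlib import Path

def corpus_stats(path: str) -> tuple[float, int]:
    """Return (size in MB, sentence count) for a plain-text file,
    assuming one sentence per line (an illustrative layout, not
    necessarily the IndCorpus release format)."""
    p = Path(path)
    size_mb = p.stat().st_size / (1024 * 1024)
    with p.open(encoding="utf-8") as f:
        n_sentences = sum(1 for line in f if line.strip())
    return round(size_mb, 2), n_sentences

# Example with a hypothetical per-language file:
# print(corpus_stats("indcorpus/quy_wiki.txt"))
```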
|