Commit 56423fb · Parent: da9fda5 · Update app.py

app.py CHANGED
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

article='''
# Spanish Nahuatl Automatic Translation
Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for neural machine translation is hard due to the lack of structured data. The most popular datasets, the Axolotl dataset and the bible-corpus, contain only ~16,000 and ~7,000 samples respectively. Moreover, there are multiple variants of Nahuatl, which makes the task even more difficult. For example, a single word from the Axolotl dataset can be found written in more than three different ways. Therefore, in this work we leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first teach the multilingual model Spanish using English, then we make the transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report ChrF and BLEU results.

## Motivation

One of the Sustainable Development Goals is "Reduced Inequalities". We know for sure that language is one

## Model description
This model is a T5 Transformer ([t5-small](https://huggingface.co/t5-small)) fine-tuned on Spanish and Nahuatl sentences collected from the web. The dataset is normalized using 'sep' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl).
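Loading and querying such a fine-tuned checkpoint can be sketched as follows. The Hub model id and the exact task prefix below are illustrative assumptions, not taken from this card:

```python
def build_prompt(text: str) -> str:
    # T5-style task prefix; the exact prefix string is an assumption —
    # use whichever prefix the model was fine-tuned with.
    return "translate Spanish to Nahuatl: " + text

def main():
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_id = "t5-small"  # placeholder: substitute the fine-tuned checkpoint's Hub id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(build_prompt("muchas gracias"), return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

if __name__ == "__main__":
    main()
```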

Since the Axolotl corpus contains misalignments, we select only the best samples:

|:-----------------------------------------------------:|
| Anales de Tlatelolco |
| Diario |
| Documentos nauas de la Ciudad de México del siglo XVI |
| Historia de México narrada en náhuatl y español |
| La tinta negra y roja (antología de poesía náhuatl) |
| Memorial Breve (Libro las ocho relaciones) |
| Método auto-didáctico náhuatl-español |
| Nican Mopohua |
| Quinta Relación (Libro las ocho relaciones) |
| Recetario Nahua de Milpa Alta D.F |
| Testimonios de la antigua palabra |
| Trece Poetas del Mundo Azteca |
| Una tortillita nomás - Se taxkaltsin saj |
| Vida económica de Tenochtitlan |

Also, to increase the amount of data, we collected 3,000 extra samples from the web.
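The 'sep' normalization mentioned above is applied to the Nahuatl side of each collected pair; a minimal sketch, assuming py-elotl's documented `Normalizer` API (treat the import path and scheme name as assumptions against your installed version):

```python
def normalize_pairs(pairs, normalize):
    # Apply an orthographic normalizer to the Nahuatl side of
    # (spanish, nahuatl) sentence pairs.
    return [(es, normalize(na)) for es, na in pairs]

def main():
    # Assumed py-elotl API: Normalizer("sep") selects the SEP orthography scheme.
    from elotl.nahuatl.orthography import Normalizer

    norm = Normalizer("sep")
    pairs = [("buenos días", "cualli tonalli")]
    print(normalize_pairs(pairs, norm.normalize))

if __name__ == "__main__":
    main()
```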

For a fair comparison, the models are evaluated on the same 505 validation Nahuatl sentences.

The English-Spanish pretraining improves BLEU and ChrF, and leads to faster convergence.
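For intuition, ChrF's core quantity — an averaged character n-gram F-score — can be sketched in plain Python. This is a simplified illustration only; published numbers should come from a standard implementation such as sacrebleu:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # Character n-grams, ignoring spaces (as ChrF does by default).
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_score(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    # Average character n-gram precision and recall over n = 1..max_n,
    # combined into an F-beta score (beta=2 weights recall more heavily).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue  # strings too short for this n-gram order
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
```

An exact match scores 1.0, disjoint strings score 0.0, and near-matches with orthographic variation (common across Nahuatl variants) fall in between — which is why a character-level metric suits this task.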

# Team members
- Emilio Alejandro Morales [(milmor)](https://huggingface.co/milmor)
- Rodrigo Martínez Arzate [(rockdrigoma)](https://huggingface.co/rockdrigoma)
- Luis Armando Mercado [(luisarmando)](https://huggingface.co/luisarmando)
- Jacobo del Valle [(jjdv)](https://huggingface.co/jjdv)

## References
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
|