guillermoruiz
/

bilma_CL

guillermoruiz committed on Apr 2, 2024

Commit c0b77c8 · verified · 1 Parent(s): 390260f

Create README.md

Files changed (1)

README.md +82 -0

README.md ADDED

	@@ -0,0 +1,82 @@

+---
+license: mit
+language:
+- es
+metrics:
+- accuracy
+tags:
+- code
+- nlp
+- custom
+- bilma
+tokenizer:
+- yes
+---
+# BILMA (Bert In Latin aMericA)
+Bilma is a BERT implementation in TensorFlow, trained on the Masked Language Model (MLM) task with the datasets from
+https://sadit.github.io/regional-spanish-models-talk-2022/. The models are trained on regionalized
+Spanish short texts from the Twitter (now X) platform.
+We provide pretrained models for Argentina, Chile, Colombia, Spain, Mexico, the United States, Uruguay, and Venezuela.
+The accuracy of the models on the MLM task for the different regions is shown below:
+![bilma-mlm-comp](https://user-images.githubusercontent.com/392873/163045798-89bd45c5-b654-4f16-b3e2-5cf404e12ddd.png)
+# Pre-requisites
+You will need TensorFlow 2.4 or newer.
+# Quick guide
+Install the following version of the transformers library:
+```
+!pip install transformers==4.30.2
+```
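To confirm that an environment meets these requirements before loading the model, a small check along the following lines may help. It is only a sketch: the helper names `parse_version`, `meets_minimum`, and `installed_version` are hypothetical, the distribution names `tensorflow` and `transformers` are assumed to match what is installed, and the comparison only looks at the leading numeric components of a version string.

```python
import re
from importlib import metadata


def parse_version(version):
    """Return the leading numeric components of a version string as a tuple."""
    # "2.4.0rc1" -> (2, 4, 0); anything after the numeric prefix is ignored.
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (numeric parts only)."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so (2, 4) compares equal to (2, 4, 0).
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    tf_version = installed_version("tensorflow")
    hf_version = installed_version("transformers")
    tf_ok = tf_version is not None and meets_minimum(tf_version, "2.4")
    print("tensorflow:", tf_version, "(ok)" if tf_ok else "(needs 2.4 or newer)")
    print("transformers:", hf_version,
          "(ok)" if hf_version == "4.30.2" else "(this guide pins 4.30.2)")
```

Run as a script, it prints the detected versions next to the requirement stated in this guide; a missing package is reported as `None` rather than raising.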
+Instantiate the tokenizer and the trained model:
+```
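+from transformers import AutoTokenizer
+from transformers import TFAutoModel
+# Load the tokenizer and the model weights from the Hugging Face Hub.
+tok = AutoTokenizer.from_pretrained("guillermoruiz/bilma_mx")
+# trust_remote_code=True allows the custom model code shipped with the repository to run.
+model = TFAutoModel.from_pretrained("guillermoruiz/bilma_mx", trust_remote_code=True)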
+```
+Now, we need some text to pass through the tokenizer:
+```
+text = ["Vamos a comer [MASK].",
+        "Hace mucho que no voy al [MASK]."]
+t = tok(text, padding="max_length", return_tensors="tf", max_length=280)
+```
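The tokenizer call above pads every text to a fixed length of 280. The plain-Python sketch below illustrates what padding to `max_length` means for a batch of id sequences. The token ids, the pad id `0`, the right-side padding, and the helper name `pad_to_max_length` are all assumptions made for illustration; none of them are taken from the BILMA tokenizer.

```python
def pad_to_max_length(sequences, max_length, pad_id=0):
    """Right-pad each id sequence to `max_length` and build a 0/1 mask."""
    padded, mask = [], []
    for ids in sequences:
        if len(ids) > max_length:
            raise ValueError("sequence longer than max_length")
        n_pad = max_length - len(ids)
        # Real tokens keep their ids; the remaining slots get the pad id.
        padded.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        mask.append([1] * len(ids) + [0] * n_pad)
    return padded, mask


# Made-up token ids for two short texts of different lengths.
batch = [[5, 17, 9], [5, 8, 21, 30, 9]]
input_ids, attention_mask = pad_to_max_length(batch, max_length=8)
print(input_ids[0])       # [5, 17, 9, 0, 0, 0, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Because every row ends up with the same length, the batch can be stacked into a single rectangular tensor, which is what `return_tensors="tf"` requires.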
+With this, we are ready to use the model:
+```
+p = model(t)
+```
+Now, we get the most likely words with:
+```
+import tensorflow as tf
+tok.batch_decode(tf.argmax(p["logits"], 2)[:,1:], skip_special_tokens=True)
+```
+which produces the output:
+```
+['vamos a comer tacos.', 'hace mucho que no voy al gym.']
+```
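The decoding step picks, at every position, the vocabulary entry with the highest logit. A toy version of that step in plain Python is sketched below. The six-entry vocabulary and the logit values are invented for illustration, and the `top_k` helper is a hypothetical addition for inspecting alternative candidates at one position; it is not part of the model's API.

```python
import math

# Invented toy vocabulary; the real tokenizer vocabulary is much larger.
VOCAB = ["[PAD]", "vamos", "a", "comer", "tacos", "pizza"]


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(logits):
    """Pick the highest-scoring vocabulary index at every position.

    `logits` has shape (batch, sequence, vocabulary), as nested lists.
    """
    return [[argmax(position) for position in sequence] for sequence in logits]


def top_k(position_logits, k=2):
    """Return the k most likely (token, probability) pairs at one position."""
    # Softmax with the maximum subtracted for numerical stability.
    peak = max(position_logits)
    exps = [math.exp(v - peak) for v in position_logits]
    total = sum(exps)
    ranked = sorted(range(len(exps)), key=lambda i: exps[i], reverse=True)
    return [(VOCAB[i], exps[i] / total) for i in ranked[:k]]


# One sequence of four positions with made-up logits over the toy vocabulary.
logits = [[
    [0.1, 4.0, 0.2, 0.1, 0.0, 0.0],  # "vamos"
    [0.0, 0.1, 3.5, 0.2, 0.1, 0.0],  # "a"
    [0.0, 0.2, 0.1, 5.0, 0.3, 0.1],  # "comer"
    [0.0, 0.0, 0.1, 0.2, 2.0, 1.5],  # masked slot: "tacos" scores above "pizza"
]]

ids = greedy_decode(logits)
print(" ".join(VOCAB[i] for i in ids[0]))           # vamos a comer tacos
print([token for token, _ in top_k(logits[0][3])])  # ['tacos', 'pizza']
```

Looking at more than the single best candidate can be useful when comparing regional models, since the runner-up words at a masked position may differ between regions even when the top prediction agrees.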
+If you find this model useful for your research, please cite the following paper:
+```
+@misc{tellez2022regionalized,
+      title={Regionalized models for Spanish language variations based on Twitter},
+      author={Eric S. Tellez and Daniela Moctezuma and Sabino Miranda and Mario Graff and Guillermo Ruiz},
+      year={2022},
+      eprint={2110.06128},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```