---
license: cc-by-nc-4.0
language:
- es
base_model:
- dccuchile/bert-base-spanish-wwm-uncased
datasets:
- manueltonneau/spanish-hate-speech-superset
tags:
- BETO
- beto
- hate_speech
- immigrant
- misogyny
- BERT
- spanish
pipeline_tag: fill-mask
library_name: transformers
widget:
- text: Los [MASK] son los causantes del aumento del desempleo
---
# immisoBETO

immisoBETO is a domain-adapted version of the [Spanish BERT](https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased) language model, adapted to the immigrant and misogyny domains.

It was adapted using a guided lexical masking strategy during masked language model (MLM) pretraining.
Instead of masking tokens at random, we prioritized masking words that appear in [immigrant](https://github.com/fmplaza/hate-speech-spanish-lexicons/blob/master/immigrant_lexicon.txt)- and [misogyny](https://github.com/fmplaza/hate-speech-spanish-lexicons/blob/master/misogyny_lexicon.txt)-specific lexicons.
The base corpus used for domain adaptation was the [Spanish Hate Speech Superset](https://huggingface.co/datasets/manueltonneau/spanish-hate-speech-superset).
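
The idea of prioritizing lexicon words over random tokens can be illustrated with a minimal sketch. This is not the training code used for immisoBETO: the function name `guided_mask`, the masking budget (`mask_rate`), the word-level tokenization, and the rule of filling any leftover budget with random tokens are all assumptions made for illustration.

```python
import math
import random

MASK_TOKEN = "[MASK]"


def guided_mask(tokens, lexicon, mask_rate=0.15, rng=None):
    """Mask lexicon words first, then random tokens until the budget is used.

    Hypothetical helper: the budget and the fallback rule are assumptions.
    """
    rng = rng or random.Random()
    budget = max(1, math.ceil(len(tokens) * mask_rate))

    # Positions whose lowercased token appears in the lexicon get priority.
    priority = [i for i, tok in enumerate(tokens) if tok.lower() in lexicon]
    others = [i for i, tok in enumerate(tokens) if tok.lower() not in lexicon]
    rng.shuffle(priority)
    rng.shuffle(others)

    # Take lexicon positions first; fall back to random positions if needed.
    chosen = set((priority + others)[:budget])

    masked = [MASK_TOKEN if i in chosen else tok for i, tok in enumerate(tokens)]
    # Labels keep the original token at masked positions and None elsewhere.
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return masked, labels


# Hypothetical two-word lexicon, for illustration only.
lexicon = {"inmigrantes", "desempleo"}
tokens = "los inmigrantes son los causantes del aumento del desempleo".split()
masked, labels = guided_mask(tokens, lexicon, mask_rate=0.2, rng=random.Random(0))
print(" ".join(masked))
```

With nine tokens and a 20% budget, two positions are masked, and both lexicon words are selected before any other token is considered. A real implementation would operate on subword token IDs and would typically be wrapped in a custom data collator.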

We trained the model for four epochs with a batch size of 8 and a learning rate of 2e-5 on an NVIDIA GeForce RTX 5090 GPU.

## Usage

```python
from transformers import pipeline

# Load a fill-mask pipeline with the immisoBETO checkpoint.
pipe = pipeline("fill-mask", model="citiusLTL/immisoBETO")

# Each prediction is a dict with a candidate token and its score.
predictions = pipe("Los [MASK] son los causantes del aumento del desempleo")
print(predictions)
```
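
The fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to reduce that list to the top candidates. The helper name `top_candidates` is hypothetical, and the `sample` list contains made-up placeholder values, not real output from immisoBETO.

```python
def top_candidates(predictions, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 3)) for p in ranked[:k]]


# Placeholder predictions with made-up tokens and scores, NOT real model output.
sample = [
    {"score": 0.3, "token": 101, "token_str": "token_a", "sequence": "..."},
    {"score": 0.6, "token": 102, "token_str": "token_b", "sequence": "..."},
    {"score": 0.1, "token": 103, "token_str": "token_c", "sequence": "..."},
]

for token, score in top_candidates(sample, k=2):
    print(f"{token}\t{score}")
```

In practice, pass the list returned by `pipe(...)` instead of `sample`.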

## Load model directly

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("citiusLTL/immisoBETO")
model = AutoModelForMaskedLM.from_pretrained("citiusLTL/immisoBETO")
```