|
|
--- |
|
|
library_name: transformers |
|
|
license: other |
|
|
license_name: rigoberta-nc |
|
|
license_link: https://huggingface.co/IIC/RigoBERTa-2.0/blob/main/LICENSE |
|
|
language: |
|
|
- es |
|
|
pipeline_tag: fill-mask |
|
|
--- |
|
|
|
|
|
# RigoBERTa 2.0 |
|
|
|
|
|
 |
|
|
|
|
|
**RigoBERTa 2.0** is a state-of-the-art encoder language model for Spanish, developed through language-adaptive pretraining. It significantly outperforms every previous Spanish encoder model, offering robust language understanding. |
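
The model can be loaded directly for masked-token prediction through the Hugging Face Transformers `fill-mask` pipeline. A minimal usage sketch (the example sentence is purely illustrative):

```python
from transformers import pipeline

# The mask token for this XLM-RoBERTa-based model is "<mask>".
fill_mask = pipeline("fill-mask", model="IIC/RigoBERTa-2.0")

# Predict the masked token in an illustrative Spanish sentence.
for pred in fill_mask("Madrid es la <mask> de España."):
    print(f"{pred['token_str']}: {pred['score']:.3f}")
```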
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
**RigoBERTa 2.0** was built by further pretraining the general-purpose FacebookAI/xlm-roberta-large model on a meticulously curated Spanish corpus. Pretraining uses masked language modeling (MLM) to adapt the model’s linguistic knowledge to Spanish. |
|
|
|
|
|
- **Developed by:** IIC |
|
|
- **Model type:** Encoder |
|
|
- **Language(s) (NLP):** Spanish |
|
|
- **License:** rigoberta-nc (permissive, non-commercial) |
|
|
- **Finetuned from model:** FacebookAI/xlm-roberta-large |
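
As a rough illustration of the adaptation objective, here is a minimal sketch of standard MLM training with `transformers` (not the actual training script, which is not published here; the 15% masking rate and the sample sentence are assumptions):

```python
import torch
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

# Start from the same base checkpoint that RigoBERTa 2.0 adapted.
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

# Standard MLM collator; the 15% masking rate is the conventional
# default, assumed here rather than confirmed for RigoBERTa 2.0.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# One illustrative sentence; real pretraining iterates over the
# full curated Spanish corpus.
batch = collator(
    [tokenizer("El Instituto de Ingeniería del Conocimiento adaptó el modelo con un corpus en español cuidadosamente curado.")]
)
with torch.no_grad():
    loss = model(**batch).loss
print(f"MLM loss: {loss.item():.3f}")
```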
|
|
|
|
|
## Intended Use & Limitations |
|
|
|
|
|
### Intended Use |
|
|
|
|
|
**RigoBERTa 2.0** is designed for: |
|
|
|
|
|
- General text understanding in Spanish. |
|
|
- Applications in NLP tasks such as text classification, named entity recognition, and related downstream tasks (see the loading sketch after this list). |
|
|
- Research and development purposes, including benchmarking and further model adaptation. |
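
For the downstream tasks listed above, the model plugs into the standard `Auto*` classes. A minimal loading sketch (the three-label setup is a hypothetical placeholder for your task):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IIC/RigoBERTa-2.0")

# A freshly initialized classification head is attached on top of the
# pretrained encoder; num_labels is task-specific (3 is a placeholder).
model = AutoModelForSequenceClassification.from_pretrained(
    "IIC/RigoBERTa-2.0",
    num_labels=3,
)

# For NER, AutoModelForTokenClassification works analogously.
# From here, fine-tune as usual, e.g. with transformers.Trainer.
```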
|
|
|
|
|
Note that the license is **non-commercial**. For commercial use, please contact us. |
|
|
|
|
|
### Limitations & Caveats |
|
|
|
|
|
- **Data Biases:** While we used a highly curated dataset, it may contain biases due to source selection and the inherent limitations of public data. |
|
|
- **Operational Cost:** Despite being an encoder-based model with relatively lower computational costs compared to generative LLMs, deployment in resource-constrained settings should be carefully evaluated. |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing |
|
|
|
|
|
- **Tokenizer:** Uses the tokenizer from FacebookAI/xlm-roberta-large to ensure consistency with the base model. |
|
|
- **Handling Long Sequences:** Sequences exceeding 512 tokens are segmented with a stride of 128 tokens; shorter sequences are padded as necessary (see the tokenization sketch after this list). |
|
|
- **OOV Handling:** Out-of-vocabulary words are managed through subword tokenization, ensuring robust handling of any kind of text. |
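
The long-sequence handling described above maps directly onto the tokenizer's built-in windowing options. A minimal sketch (the sample text is synthetic):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IIC/RigoBERTa-2.0")

# Synthetic long document, just to exceed the 512-token window.
long_text = "El modelo procesa documentos largos en español. " * 200

encoded = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,                      # 128-token overlap between windows
    return_overflowing_tokens=True,  # one entry per 512-token window
    padding="max_length",            # pad the (shorter) final window
)
print(f"{len(encoded['input_ids'])} overlapping 512-token windows")
```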
|
|
|
|
|
## Evaluation |
|
|
|
|
|
RigoBERTa 2.0 was evaluated on several Spanish NLP tasks. The evaluation metrics indicate that the model outperforms previous multilingual models and general-purpose Spanish language models. |
|
|
|
|
|
**Key Results:** |
|
|
|
|
|
- Achieves top performance on most of the tested datasets. |
|
|
|
|
|
Breakdown of the results: |
|
|
|
|
|
 |
|
|
*Source: [García Subies et al.](https://academic.oup.com/jamia/article-abstract/31/9/2137/7630016)* |
|
|
|
|
|
 |
|
|
|
|
|
 |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use RigoBERTa 2.0 in your research, please cite the associated paper: |
|
|
|
|
|
**BibTeX:** |
|
|
|
|
|
```bibtex |
|
|
@misc{rigoberta2, |
|
|
author = { Instituto de Ingeniería del Conocimiento }, |
|
|
title = { RigoBERTa-2.0 }, |
|
|
year = 2025, |
|
|
url = { https://huggingface.co/IIC/RigoBERTa-2.0 }, |
|
|
doi = { 10.57967/hf/7048 }, |
|
|
publisher = { Hugging Face } |
|
|
} |
|
|
``` |
|
|
|