---
library_name: transformers
license: other
license_name: rigoberta-nc
license_link: https://huggingface.co/IIC/RigoBERTa-2.0/blob/main/LICENSE
language:
- es
pipeline_tag: fill-mask
---
# RigoBERTa 2.0
![Logo](./data/rigoberta.jpg)
**RigoBERTa 2.0** is a state-of-the-art encoder language model for Spanish, developed through language-adaptive pretraining. This model significantly outperforms every previous Spanish encoder model, offering robust language understanding.
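Since the model's pipeline tag is fill-mask, it can be tried out directly with the standard transformers pipeline. A minimal usage sketch (the example sentence is illustrative; XLM-RoBERTa-based tokenizers use `<mask>` as the mask token):

```python
from transformers import pipeline

# Load RigoBERTa 2.0 through the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="IIC/RigoBERTa-2.0")

# XLM-RoBERTa-based tokenizers use "<mask>" as the mask token.
for pred in fill_mask("Madrid es la <mask> de España."):
    print(f"{pred['token_str']!r}: {pred['score']:.4f}")
```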
## Model Details
### Model Description
**RigoBERTa 2.0** was built by further pretraining the general-purpose FacebookAI/xlm-roberta-large model on a meticulously curated Spanish corpus. The pretraining leverages masked language modeling (MLM) to adapt the model’s linguistic knowledge to the Spanish language.
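As a rough illustration of what language-adaptive MLM pretraining looks like in transformers, here is a minimal sketch built on `DataCollatorForLanguageModeling`. The masking probability and example text are illustrative placeholders, not the model's documented training configuration:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

# Start from the multilingual base model, as RigoBERTa 2.0 did.
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

# The MLM collator randomly masks tokens; the model is trained to
# recover them, adapting its weights to the target-language corpus.
# The 15% masking probability is an illustrative default, not the
# model's actual training hyperparameter.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
batch = collator([tokenizer("Texto en español del corpus de preentrenamiento.")])
outputs = model(**batch)
print(outputs.loss)  # MLM cross-entropy loss minimized during adaptation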
- **Developed by:** IIC
- **Model type:** Encoder
- **Language(s) (NLP):** Spanish
- **License:** rigoberta-nc (permissive, non-commercial)
- **Finetuned from model:** FacebookAI/xlm-roberta-large
## Intended Use & Limitations
### Intended Use
**RigoBERTa 2.0** is designed for:
- General text understanding in Spanish.
- Applications in NLP tasks such as text classification, named entity recognition, and related downstream tasks (a fine-tuning sketch follows below).
- Research and development purposes, including benchmarking and further model adaptation.
Note that the license is **non-commercial**. For commercial use, please contact us.
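As a sketch of the downstream use mentioned above, here is a hypothetical setup for Spanish text classification. The label count and example sentence are placeholders; the classification head is newly initialized and would still need fine-tuning on labeled data:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical binary classification setup; num_labels and the example
# sentence are placeholders for a real labeled dataset.
tokenizer = AutoTokenizer.from_pretrained("IIC/RigoBERTa-2.0")
model = AutoModelForSequenceClassification.from_pretrained(
    "IIC/RigoBERTa-2.0", num_labels=2
)

inputs = tokenizer("El servicio fue excelente.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))  # predicted class id (head untrained here)
```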
### Limitations & Caveats
- **Data Biases:** While we used a highly curated dataset, it may contain biases due to source selection and the inherent limitations of public data.
- **Operational Cost:** Despite being an encoder-based model with relatively lower computational costs compared to generative LLMs, deployment in resource-constrained settings should be carefully evaluated.
## Training Details
### Training Procedure
#### Preprocessing
- **Tokenizer:** Uses the tokenizer from FacebookAI/xlm-roberta-large to ensure consistency with the base model.
- **Handling Long Sequences:** Sequences exceeding 512 tokens are segmented with a stride of 128 tokens; shorter sequences are padded as necessary (see the sketch after this list).
- **OOV Handling:** Out-of-vocabulary words are managed through subword tokenization, maintaining robust handling of any kind of text.
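A minimal sketch of the windowing described above, assuming the fast tokenizer's `return_overflowing_tokens` support; the sample text is a placeholder:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IIC/RigoBERTa-2.0")

# Placeholder long document (real inputs would be corpus text).
long_text = " ".join(["palabra"] * 2000)

# Split into overlapping 512-token windows with a 128-token stride,
# mirroring the preprocessing described above; short inputs are padded.
encoded = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,
    return_overflowing_tokens=True,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # number of overlapping segments
```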
## Evaluation
RigoBERTa 2.0 was evaluated on several Spanish NLP tasks. Evaluation metrics indicate that the model outperforms previous multilingual models as well as general-purpose Spanish language models.
**Key Results:**
- Achieves top performance on most of the tested datasets.
Breakdown of the results:
![Clinical Bench](./data/becnh1.jpg)
Clinical benchmark results from [García Subies et al.](https://academic.oup.com/jamia/article-abstract/31/9/2137/7630016)
![Bench2](./data/becnh2.jpg)
![Bench3](./data/becnh3.jpg)
## Citation
If you use RigoBERTa 2.0 in your research, please cite it as follows:
**BibTeX:**
```bibtex
@misc{rigoberta2,
  author    = {Instituto de Ingeniería del Conocimiento},
  title     = {RigoBERTa-2.0},
  year      = {2025},
  url       = {https://huggingface.co/IIC/RigoBERTa-2.0},
  doi       = {10.57967/hf/7048},
  publisher = {Hugging Face}
}
```