---
license: apache-2.0
language:
- es
base_model:
- BSC-LT/mRoBERTa
pipeline_tag: text-classification
library_name: transformers
---
# mRoBERTa_FT2_DFT2_lenguaje_claro
## Description
This model is fine-tuned from `BSC-LT/mRoBERTa` for **clear language classification** in Spanish. It assigns each text one of **three levels of linguistic clarity**:
- **TXT**: original (unsimplified) text
- **FAC**: facilitated text
- **LF**: easy-to-read text (*lectura fácil*)
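
For inference, the model can be loaded with the standard `transformers` `text-classification` pipeline. A minimal sketch (the example sentence and the deferred import are illustrative choices, not part of the model card):

```python
MODEL_ID = "gplsi/mRoBERTa_FT2_DFT2_lenguaje_claro"

# The three clarity labels and their meanings, as described above
LABEL_DESCRIPTIONS = {
    "TXT": "Original text",
    "FAC": "Facilitated text",
    "LF": "Easy-to-read text",
}

def classify(texts):
    """Classify Spanish texts into clarity levels with the fine-tuned model."""
    # Deferred import so the label mapping is usable without transformers installed
    from transformers import pipeline
    clf = pipeline("text-classification", model=MODEL_ID)
    return clf(texts)

if __name__ == "__main__":
    preds = classify(["El procedimiento administrativo se iniciará de oficio."])
    for pred in preds:
        print(pred["label"], LABEL_DESCRIPTIONS.get(pred["label"]), round(pred["score"], 4))
```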
## Dataset
The dataset consists of **Spanish texts annotated with clarity levels**:
- **Training set**: 9,299 instances
- **Test set**: 3,723 instances
- **Extra test set**: 465 instances (texts from non-contiguous categories not seen during training, used to evaluate generalization)
## Training Parameters
- learning_rate: 2e-5
- num_train_epochs: 2
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- overwrite_output_dir: true
- logging_strategy: steps
- logging_steps: 10
- seed: 852
- fp16: true
## Results
### Combined test set (4,188 instances)
**Confusion Matrix**
| | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 1373 | 15 | 8 |
| **True LF** | 29 | 1367 | 0 |
| **True TXT** | 16 | 1 | 1379 |

**Classification Report**
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC | 0.9683 | 0.9835 | 0.9758 | 1396 |
| LF | 0.9884 | 0.9792 | 0.9838 | 1396 |
| TXT | 0.9942 | 0.9878 | 0.9910 | 1396 |
- Accuracy: **0.9835**
- Macro Avg F1: **0.9836**
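
The per-class metrics above follow directly from the confusion matrix. The short script below (a sanity check, not part of the training code) recomputes them from the combined-test matrix:

```python
labels = ["FAC", "LF", "TXT"]
# Combined-test confusion matrix: rows = true class, columns = predicted class
cm = [
    [1373, 15, 8],   # True FAC
    [29, 1367, 0],   # True LF
    [16, 1, 1379],   # True TXT
]

def class_metrics(cm, i):
    """Precision, recall, F1, and support for class index i."""
    tp = cm[i][i]
    support = sum(cm[i])                    # row sum: true instances
    predicted = sum(row[i] for row in cm)   # column sum: predicted instances
    precision = tp / predicted
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
f1s = []
for i, name in enumerate(labels):
    p, r, f1, n = class_metrics(cm, i)
    f1s.append(f1)
    print(f"{name}: P={p:.4f} R={r:.4f} F1={f1:.4f} support={n}")
print(f"accuracy={accuracy:.4f} macro-F1={sum(f1s)/len(f1s):.4f}")
# → accuracy=0.9835 macro-F1=0.9836, matching the table above
```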
---
### Test set (3,723 instances)
**Confusion Matrix**
| | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 1220 | 13 | 8 |
| **True LF** | 28 | 1213 | 0 |
| **True TXT** | 13 | 1 | 1227 |

**Classification Report**
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC | 0.9675 | 0.9831 | 0.9752 | 1241 |
| LF | 0.9886 | 0.9774 | 0.9830 | 1241 |
| TXT | 0.9935 | 0.9887 | 0.9911 | 1241 |
- Accuracy: **0.9831**
- Macro Avg F1: **0.9831**
---
### Extra test set (465 instances)
**Confusion Matrix**
| | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 153 | 2 | 0 |
| **True LF** | 1 | 154 | 0 |
| **True TXT** | 3 | 0 | 152 |

**Classification Report**
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC | 0.9745 | 0.9871 | 0.9808 | 155 |
| LF | 0.9872 | 0.9936 | 0.9903 | 155 |
| TXT | 1.0000 | 0.9806 | 0.9902 | 155 |
- Accuracy: **0.9871**
- Macro Avg F1: **0.9871**
---
## Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública, co-financed by the EU – NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.
## Reference
```bibtex
@misc{gplsi-mroberta-lenguajeclaro,
  author       = {Sepúlveda-Torres, Robiert and Martínez-Murillo, Iván and Bonora, Mar and Consuegra-Ayala, Juan Pablo},
  title        = {mRoBERTa_FT2_DFT2_lenguaje_claro: Fine-tuned model for clear language classification (TXT, FAC, LF)},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/gplsi/mRoBERTa_FT2_DFT2_lenguaje_claro}},
  note         = {Accessed: 2025-10-03}
}
```