---
license: apache-2.0
language:
- es
base_model:
- BSC-LT/mRoBERTa
pipeline_tag: text-classification
library_name: transformers
---

# mRoBERTa_FT2_DFT2_lenguaje_claro

## Description

This model is fine-tuned from `BSC-LT/mRoBERTa` for the task of **clear language classification** in Spanish texts.

It predicts one of **three categories of linguistic clarity**:

- **TXT**: Original text
- **FAC**: Facilitated text
- **LF**: Easy-to-read text (*lectura fácil*)
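
At inference time the model emits one logit per label; the following is a minimal, self-contained sketch of the decoding step only (the logit values and the `id2label` order below are illustrative, not taken from the model's actual config):

```python
import math

# Illustrative values only: real logits come from the fine-tuned model,
# and the real label order comes from the model config's id2label mapping.
id2label = {0: "FAC", 1: "LF", 2: "TXT"}
logits = [0.3, 2.1, -0.5]

# Softmax over the logits, then pick the most probable clarity label.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]
pred = id2label[max(range(len(probs)), key=lambda i: probs[i])]
print(pred)  # "LF" for these illustrative logits
```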

## Dataset

The dataset consists of **Spanish texts annotated with clarity levels**:

- **Training set**: 9,299 instances
- **Test set**: 3,723 instances
- **Extra test set**: 465 instances (texts from non-contiguous categories not seen during training, used to evaluate generalization)

## Training Parameters

- learning_rate: 2e-5
- num_train_epochs: 2
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- overwrite_output_dir: true
- logging_strategy: steps
- logging_steps: 10
- seed: 852
- fp16: true
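
These hyperparameters map one-to-one onto Hugging Face `TrainingArguments` keyword arguments; a hedged sketch of the configuration (the `output_dir` path is a placeholder, not from the original run):

```python
# Placeholder output_dir; all other values mirror the list above.
training_kwargs = dict(
    output_dir="output",
    overwrite_output_dir=True,
    learning_rate=2e-5,
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_strategy="steps",
    logging_steps=10,
    seed=852,
    fp16=True,
)
# When reproducing the fine-tuning with the transformers Trainer, these
# would be passed as TrainingArguments(**training_kwargs).
print(training_kwargs["learning_rate"])  # 2e-05
```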

## Results

### Combined test set (4,188 instances)

**Confusion Matrix**

|              | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 1373     | 15      | 8        |
| **True LF**  | 29       | 1367    | 0        |
| **True TXT** | 16       | 1       | 1379     |

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC   | 0.9683    | 0.9835 | 0.9758   | 1396    |
| LF    | 0.9884    | 0.9792 | 0.9838   | 1396    |
| TXT   | 0.9942    | 0.9878 | 0.9910   | 1396    |

- Accuracy: **0.9835**
- Macro Avg F1: **0.9836**
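
The per-class scores follow directly from the confusion matrix; a small script to recompute them for the combined test set (rows = true class, columns = predicted class, in the order FAC, LF, TXT):

```python
# Combined test set confusion matrix, values copied from the table above.
labels = ["FAC", "LF", "TXT"]
cm = [
    [1373, 15, 8],
    [29, 1367, 0],
    [16, 1, 1379],
]

def class_metrics(cm, i):
    """Precision, recall and F1 for class index i."""
    tp = cm[i][i]
    precision = tp / sum(row[i] for row in cm)  # column sum = predicted count
    recall = tp / sum(cm[i])                    # row sum = support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for i, label in enumerate(labels):
    p, r, f1 = class_metrics(cm, i)
    print(f"{label}: P={p:.4f} R={r:.4f} F1={f1:.4f}")

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print(f"Accuracy: {accuracy:.4f}")  # 0.9835, matching the figure above
```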

---

### Test set (3,723 instances)

**Confusion Matrix**

|              | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 1220     | 13      | 8        |
| **True LF**  | 28       | 1213    | 0        |
| **True TXT** | 13       | 1       | 1227     |

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC   | 0.9675    | 0.9831 | 0.9752   | 1241    |
| LF    | 0.9886    | 0.9774 | 0.9830   | 1241    |
| TXT   | 0.9935    | 0.9887 | 0.9911   | 1241    |

- Accuracy: **0.9831**
- Macro Avg F1: **0.9831**

---

### Extra test set (465 instances)

**Confusion Matrix**

|              | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 153      | 2       | 0        |
| **True LF**  | 1        | 154     | 0        |
| **True TXT** | 3        | 0       | 152      |

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC   | 0.9745    | 0.9871 | 0.9808   | 155     |
| LF    | 0.9872    | 0.9936 | 0.9903   | 155     |
| TXT   | 1.0000    | 0.9806 | 0.9902   | 155     |

- Accuracy: **0.9871**
- Macro Avg F1: **0.9871**

---

## Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública, co-financed by the EU – NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.

## Reference

```bibtex
@misc{gplsi-mroberta-lenguajeclaro,
  author = {Sepúlveda-Torres, Robiert and Martínez-Murillo, Iván and Bonora, Mar and Consuegra-Ayala, Juan Pablo},
  title = {mRoBERTa_FT2_DFT2_lenguaje_claro: Fine-tuned model for clear language classification (TXT, FAC, LF)},
  year = {2025},
  howpublished = {\url{https://huggingface.co/gplsi/mRoBERTa_FT2_DFT2_lenguaje_claro}},
  note = {Accessed: 2025-10-03}
}
```