---
license: apache-2.0
language:
- es
base_model:
- BSC-LT/mRoBERTa
pipeline_tag: text-classification
library_name: transformers
---
# mRoBERTa_FT2_DFT2_lenguaje_claro
## Description
This model is fine-tuned from `BSC-LT/mRoBERTa` for **clear language classification** of Spanish texts.
It assigns each text one of **three categories of linguistic clarity**:
- **TXT**: Original text
- **FAC**: Facilitated text
- **LF**: Easy-to-read text
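A minimal inference sketch using the `transformers` pipeline (the model ID is this repository; the example sentence is illustrative, and the returned label names are assumed to match the categories above):

```python
from transformers import pipeline

# Model ID of this repository on the Hugging Face Hub
model_id = "gplsi/mRoBERTa_FT2_DFT2_lenguaje_claro"

# Build a text-classification pipeline (downloads the weights on first use)
classifier = pipeline("text-classification", model=model_id)

# Returns a list with the predicted clarity label and its score
result = classifier("El ayuntamiento ha aprobado nuevas ayudas para las familias.")
print(result)
```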
## Dataset
The dataset consists of **Spanish texts annotated with clarity levels**:
- **Training set**: 9,299 instances
- **Test set**: 3,723 instances
- **Extra test set**: 465 instances (texts from non-contiguous categories not seen during training, used to evaluate generalization)
## Training Parameters
- learning_rate: 2e-5
- num_train_epochs: 2
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- overwrite_output_dir: true
- logging_strategy: steps
- logging_steps: 10
- seed: 852
- fp16: true
## Results
### Combined test set (4,188 instances)
**Confusion Matrix**
| | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 1373 | 15 | 8 |
| **True LF** | 29 | 1367 | 0 |
| **True TXT** | 16 | 1 | 1379 |

**Classification Report**
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC | 0.9683 | 0.9835 | 0.9758 | 1396 |
| LF | 0.9884 | 0.9792 | 0.9838 | 1396 |
| TXT | 0.9942 | 0.9878 | 0.9910 | 1396 |
- Accuracy: **0.9835**
- Macro Avg F1: **0.9836**
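The per-class figures above follow directly from the confusion matrix: precision divides each diagonal entry by its column sum, recall by its row sum. A small pure-Python sketch that recomputes them from the combined-test-set matrix:

```python
# Confusion matrix for the combined test set (rows = true, cols = predicted),
# class order FAC, LF, TXT; values copied from the table above
labels = ["FAC", "LF", "TXT"]
cm = [
    [1373,   15,    8],  # true FAC
    [  29, 1367,    0],  # true LF
    [  16,    1, 1379],  # true TXT
]

f1s = []
for i, label in enumerate(labels):
    tp = cm[i][i]
    precision = tp / sum(row[i] for row in cm)  # column sum = all predicted as i
    recall = tp / sum(cm[i])                    # row sum = all truly i
    f1 = 2 * precision * recall / (precision + recall)
    f1s.append(f1)
    print(f"{label}: P={precision:.4f} R={recall:.4f} F1={f1:.4f}")

accuracy = sum(cm[i][i] for i in range(3)) / sum(map(sum, cm))
macro_f1 = sum(f1s) / len(f1s)
print(f"Accuracy={accuracy:.4f} MacroF1={macro_f1:.4f}")
```

Rounded to four decimals, the recomputed values match the reported accuracy (0.9835) and macro F1 (0.9836).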
---
### Test set (3,723 instances)
**Confusion Matrix**
| | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 1220 | 13 | 8 |
| **True LF** | 28 | 1213 | 0 |
| **True TXT** | 13 | 1 | 1227 |

**Classification Report**
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC | 0.9675 | 0.9831 | 0.9752 | 1241 |
| LF | 0.9886 | 0.9774 | 0.9830 | 1241 |
| TXT | 0.9935 | 0.9887 | 0.9911 | 1241 |
- Accuracy: **0.9831**
- Macro Avg F1: **0.9831**
---
### Extra test set (465 instances)
**Confusion Matrix**
| | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 153 | 2 | 0 |
| **True LF** | 1 | 154 | 0 |
| **True TXT** | 3 | 0 | 152 |

**Classification Report**
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC | 0.9745 | 0.9871 | 0.9808 | 155 |
| LF | 0.9872 | 0.9936 | 0.9903 | 155 |
| TXT | 1.0000 | 0.9806 | 0.9902 | 155 |
- Accuracy: **0.9871**
- Macro Avg F1: **0.9871**
---
## Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and co-financed by the European Union – NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.
## Reference
```bibtex
@misc{gplsi-mroberta-lenguajeclaro,
author = {Sepúlveda-Torres, Robiert and Martínez-Murillo, Iván and Bonora, Mar and Consuegra-Ayala, Juan Pablo},
title = {mRoBERTa_FT2_DFT2_lenguaje_claro: Fine-tuned model for clear language classification (TXT, FAC, LF)},
year = {2025},
howpublished = {\url{https://huggingface.co/gplsi/mRoBERTa_FT2_DFT2_lenguaje_claro}},
note = {Accessed: 2025-10-03}
}
```