---
license: apache-2.0
language:
- es
base_model:
- BSC-LT/mRoBERTa
pipeline_tag: text-classification
library_name: transformers
---

# mRoBERTa_FT2_DFT2_lenguaje_claro

## Description
This model is fine-tuned from `BSC-LT/mRoBERTa` for **clear-language classification** of Spanish texts.

It assigns each text to one of **three clarity categories**:
- **TXT**: Original text
- **FAC**: Facilitated text
- **LF**: Easy-to-read text
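
## Usage
The model can be queried through the standard `transformers` text-classification pipeline. A minimal sketch (the repository id is taken from the citation below; the example sentence is illustrative):

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
clf = pipeline(
    "text-classification",
    model="gplsi/mRoBERTa_FT2_DFT2_lenguaje_claro",
)

# Classify a Spanish text; the pipeline returns the top label
# (TXT, FAC, or LF) together with its confidence score.
result = clf("El ayuntamiento pone en marcha un nuevo servicio de atención al ciudadano.")
print(result)
```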


## Dataset
The dataset consists of **Spanish texts annotated with clarity levels**:  

- **Training set**: 9,299 instances  
- **Test set**: 3,723 instances  
- **Extra test set**: 465 instances (texts from non-contiguous categories not seen during training, used to evaluate generalization)  

## Training Parameters
- learning_rate: 2e-5  
- num_train_epochs: 2  
- per_device_train_batch_size: 8  
- per_device_eval_batch_size: 8  
- overwrite_output_dir: true  
- logging_strategy: steps  
- logging_steps: 10  
- seed: 852  
- fp16: true  
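
The hyperparameters above map directly onto `transformers.TrainingArguments`; a sketch of the equivalent configuration (`output_dir` is a placeholder):

```python
from transformers import TrainingArguments

# Training configuration mirroring the parameters listed above.
args = TrainingArguments(
    output_dir="mRoBERTa_FT2_DFT2_lenguaje_claro",  # placeholder
    learning_rate=2e-5,
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    overwrite_output_dir=True,
    logging_strategy="steps",
    logging_steps=10,
    seed=852,
    fp16=True,  # mixed-precision training; requires a compatible GPU
)
```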

## Results

### Combined test set (4,188 instances)
**Confusion Matrix**

|              | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 1373     | 15      | 8        |
| **True LF**  | 29       | 1367    | 0        |
| **True TXT** | 16       | 1       | 1379     |


| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC   | 0.9683    | 0.9835 | 0.9758   | 1396    |
| LF    | 0.9884    | 0.9792 | 0.9838   | 1396    |
| TXT   | 0.9942    | 0.9878 | 0.9910   | 1396    |

- Accuracy: **0.9835**  
- Macro Avg F1: **0.9836**  
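
The per-class metrics above can be re-derived from the confusion matrix alone; a minimal sketch in plain Python:

```python
# Re-derive precision/recall/F1 from the combined-test confusion matrix.
# Rows are true labels, columns are predicted labels (order: FAC, LF, TXT).
labels = ["FAC", "LF", "TXT"]
cm = [
    [1373, 15, 8],    # True FAC
    [29, 1367, 0],    # True LF
    [16, 1, 1379],    # True TXT
]

metrics = {}
for i, label in enumerate(labels):
    tp = cm[i][i]
    pred_total = sum(row[i] for row in cm)  # column sum: all predicted as this class
    true_total = sum(cm[i])                 # row sum: all truly in this class
    precision = tp / pred_total
    recall = tp / true_total
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)

accuracy = sum(cm[i][i] for i in range(3)) / sum(map(sum, cm))
macro_f1 = sum(f for _, _, f in metrics.values()) / 3

print(f"accuracy = {accuracy:.4f}")  # 0.9835
print(f"macro F1 = {macro_f1:.4f}")  # 0.9836
for label, (p, r, f) in metrics.items():
    print(f"{label}: P={p:.4f} R={r:.4f} F1={f:.4f}")
```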
---

### Test set (3,723 instances)
**Confusion Matrix**

|              | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 1220     | 13      | 8        |
| **True LF**  | 28       | 1213    | 0        |
| **True TXT** | 13       | 1       | 1227     |


| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC   | 0.9675    | 0.9831 | 0.9752   | 1241    |
| LF    | 0.9886    | 0.9774 | 0.9830   | 1241    |
| TXT   | 0.9935    | 0.9887 | 0.9911   | 1241    |

- Accuracy: **0.9831**  
- Macro Avg F1: **0.9831**  
---

### Extra test set (465 instances)
**Confusion Matrix**

|              | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 153      | 2       | 0        |
| **True LF**  | 1        | 154     | 0        |
| **True TXT** | 3        | 0       | 152      |


| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC   | 0.9745    | 0.9871 | 0.9808   | 155     |
| LF    | 0.9872    | 0.9936 | 0.9903   | 155     |
| TXT   | 1.0000    | 0.9806 | 0.9902   | 155     |

- Accuracy: **0.9871**  
- Macro Avg F1: **0.9871**  

---

## Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública, co-financed by the EU – NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.

## Reference
```bibtex
@misc{gplsi-mroberta-lenguajeclaro,
  author       = {Sepúlveda-Torres, Robiert and Martínez-Murillo, Iván and Bonora, Mar and Consuegra-Ayala, Juan Pablo},
  title        = {mRoBERTa_FT2_DFT2_lenguaje_claro: Fine-tuned model for clear language classification (TXT, FAC, LF)},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/gplsi/mRoBERTa_FT2_DFT2_lenguaje_claro}},
  note         = {Accessed: 2025-10-03}
}
```