---
license: apache-2.0
language:
- es
base_model:
- BSC-LT/mRoBERTa
pipeline_tag: text-classification
library_name: transformers
---

# mRoBERTa_FT2_DFT2_lenguaje_claro

## Description

This model is fine-tuned from `BSC-LT/mRoBERTa` for the task of **clear language classification** in Spanish texts.

It predicts one of **three categories of linguistic clarity**:

- **TXT**: Original text
- **FAC**: Facilitated text
- **LF**: Easy-to-read text (*lectura fácil*)
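
At inference time the model emits one logit per label; the following is a minimal, self-contained sketch of the decoding step only (the logit values and the `id2label` order below are illustrative, not taken from the model's actual config):

```python
import math

# Illustrative values only: real logits come from the fine-tuned model,
# and the real label order comes from the model config's id2label mapping.
id2label = {0: "FAC", 1: "LF", 2: "TXT"}
logits = [0.3, 2.1, -0.5]

# Softmax over the logits, then pick the most probable clarity label.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]
pred = id2label[max(range(len(probs)), key=lambda i: probs[i])]
print(pred)  # "LF" for these illustrative logits
```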

## Dataset

The dataset consists of **Spanish texts annotated with clarity levels**:

- **Training set**: 9,299 instances
- **Test set**: 3,723 instances
- **Extra test set**: 465 instances (texts from non-contiguous categories not seen during training, used to evaluate generalization)

## Training Parameters

- learning_rate: 2e-5
- num_train_epochs: 2
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- overwrite_output_dir: true
- logging_strategy: steps
- logging_steps: 10
- seed: 852
- fp16: true
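
These hyperparameters map one-to-one onto Hugging Face `TrainingArguments` keyword arguments; a hedged sketch of the configuration (the `output_dir` path is a placeholder, not from the original run):

```python
# Placeholder output_dir; all other values mirror the list above.
training_kwargs = dict(
    output_dir="output",
    overwrite_output_dir=True,
    learning_rate=2e-5,
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_strategy="steps",
    logging_steps=10,
    seed=852,
    fp16=True,
)
# When reproducing the fine-tuning with the transformers Trainer, these
# would be passed as TrainingArguments(**training_kwargs).
print(training_kwargs["learning_rate"])  # 2e-05
```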

## Results

### Combined test set (4,188 instances)

**Confusion Matrix**

|              | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 1373     | 15      | 8        |
| **True LF**  | 29       | 1367    | 0        |
| **True TXT** | 16       | 1       | 1379     |

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC   | 0.9683    | 0.9835 | 0.9758   | 1396    |
| LF    | 0.9884    | 0.9792 | 0.9838   | 1396    |
| TXT   | 0.9942    | 0.9878 | 0.9910   | 1396    |

- Accuracy: **0.9835**
- Macro Avg F1: **0.9836**
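
The per-class scores follow directly from the confusion matrix; a small script to recompute them for the combined test set (rows = true class, columns = predicted class, in the order FAC, LF, TXT):

```python
# Combined test set confusion matrix, values copied from the table above.
labels = ["FAC", "LF", "TXT"]
cm = [
    [1373, 15, 8],
    [29, 1367, 0],
    [16, 1, 1379],
]

def class_metrics(cm, i):
    """Precision, recall and F1 for class index i."""
    tp = cm[i][i]
    precision = tp / sum(row[i] for row in cm)  # column sum = predicted count
    recall = tp / sum(cm[i])                    # row sum = support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for i, label in enumerate(labels):
    p, r, f1 = class_metrics(cm, i)
    print(f"{label}: P={p:.4f} R={r:.4f} F1={f1:.4f}")

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print(f"Accuracy: {accuracy:.4f}")  # 0.9835, matching the figure above
```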

---

### Test set (3,723 instances)

**Confusion Matrix**

|              | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 1220     | 13      | 8        |
| **True LF**  | 28       | 1213    | 0        |
| **True TXT** | 13       | 1       | 1227     |

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC   | 0.9675    | 0.9831 | 0.9752   | 1241    |
| LF    | 0.9886    | 0.9774 | 0.9830   | 1241    |
| TXT   | 0.9935    | 0.9887 | 0.9911   | 1241    |

- Accuracy: **0.9831**
- Macro Avg F1: **0.9831**

---

### Extra test set (465 instances)

**Confusion Matrix**

|              | Pred FAC | Pred LF | Pred TXT |
| ------------ | -------- | ------- | -------- |
| **True FAC** | 153      | 2       | 0        |
| **True LF**  | 1        | 154     | 0        |
| **True TXT** | 3        | 0       | 152      |

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| FAC   | 0.9745    | 0.9871 | 0.9808   | 155     |
| LF    | 0.9872    | 0.9936 | 0.9903   | 155     |
| TXT   | 1.0000    | 0.9806 | 0.9902   | 155     |

- Accuracy: **0.9871**
- Macro Avg F1: **0.9871**

---

## Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública, co-financed by the EU – NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.

## Reference

```bibtex
@misc{gplsi-mroberta-lenguajeclaro,
  author = {Sepúlveda-Torres, Robiert and Martínez-Murillo, Iván and Bonora, Mar and Consuegra-Ayala, Juan Pablo},
  title = {mRoBERTa_FT2_DFT2_lenguaje_claro: Fine-tuned model for clear language classification (TXT, FAC, LF)},
  year = {2025},
  howpublished = {\url{https://huggingface.co/gplsi/mRoBERTa_FT2_DFT2_lenguaje_claro}},
  note = {Accessed: 2025-10-03}
}
```