---
identifier: https://huggingface.co/oeg/RoBERTaSense-FACIL
name: RoBERTaSense-FACIL
version: 0.1.0
keywords:
- easy-to-read
- meaning preservation
- accessibility
- spanish
- text pair classification
headline: >-
Spanish RoBERTa fine-tuned to assess meaning preservation in Easy-to-Read
(E2R) adaptations.
description: >
RoBERTaSense-FACIL is a Spanish RoBERTa model fine-tuned to assess meaning
preservation in Easy-to-Read (E2R) adaptations. Given a pair {original,
adapted}, it predicts whether the adaptation preserves the meaning of the
original. ⚠️ Deprecation notice (base model): fine-tuned from
PlanTL-GOB-ES/roberta-base-bne, which is deprecated as of 2025. For actively
maintained Spanish RoBERTa models, see BSC-LT.
task:
- Text classification
- Pairwise classification
modelCategory:
- Supervised classification
language:
- es
license: apache-2.0
parameterSize: 125M
developmentStatus: Active
dateCreated: 2025-09-25
dateModified: 2025-10-06
citation: >
Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). RoBERTaSense-FACIL: Meaning
Preservation for Easy-to-Read in Spanish. Retrieved from
https://huggingface.co/oeg/RoBERTaSense-FACIL
codeRepository: ''
referencePublication: ''
developmentLibrary: PyTorch + Transformers
usageInstructions: |
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

  repo = "oeg/RoBERTaSense-FACIL"
  model = AutoModelForSequenceClassification.from_pretrained(repo)
  tokenizer = AutoTokenizer.from_pretrained(repo)

  original = "El lobo, que parecía amable, engañó a Caperucita."
  adapted = "El lobo parecía amable. El lobo engañó a Caperucita."

  inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)
  with torch.no_grad():
      logits = model(**inputs).logits
  probs = logits.softmax(-1).squeeze().tolist()
  print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
modelRisks:
- Trained for Spanish E2R; out-of-domain performance may degrade.
- >-
Binary labels compress nuanced cases; borderline adaptations may require human
review.
- Synthetic negatives do not cover all real-world human errors.
- Base model is deprecated; security/robustness updates will not be inherited.
evaluationMetrics:
- Accuracy
- F1
- ROC-AUC
evaluationResults: |
  80/20 stratified split (seed=42). Results on the held-out split:
  - Accuracy: 0.81
  - F1: 0.84
  - ROC-AUC: 0.83
softwareRequirements:
- python>=3.9
- torch>=2.0
- transformers>=4.40
- datasets>=2.18
storageRequirements:
- ~500 MB
memoryRequirements:
- >-
>= 8 GB RAM (CPU inference), >= 12 GB VRAM recommended for large batch
inference
operatingSystem:
- Linux
- macOS
- Windows
processorRequirements:
- x86_64 CPU (AVX recommended)
GPURequirements:
- >-
Not required for single-pair inference; CUDA GPU recommended for batch
processing
distribution:
- encodingFormat: ''
contentUrl: ''
contentSize: ''
quantizationBits: ''
quantizationMethod: ''
trainedOn:
- identifier: internal:e2r-positives
name: Expert-validated E2R pairs (Spanish)
description: >
Positive pairs (original↔adapted) from an existing corpus validated by
experts; used as the positive class.
url: ''
- identifier: internal:synthetic-negatives
name: Synthetic hard negatives (Spanish)
description: >
Negatives generated via sentence shuffle, dropout, mismatch (derangement),
paraphrase-with-distortion, and zero-shot NLI contradictions; trivial pairs
filtered by BLEU/ROUGE-L thresholds.
url: ''
testedOn:
- identifier: internal:heldout-20
name: Held-out 20% stratified split
description: >
Stratified 80/20 split by Label (seed=42); pairwise tokenization up to 512
tokens.
evaluatedOn:
- identifier: internal:heldout-20
name: Held-out 20% stratified split
description: >
Metrics: Accuracy, F1, ROC-AUC; operating threshold tuned via Youden’s J
(ROC).
validatedOn: ''
author:
- name: Isam Diab Lozano
identifier: https://orcid.org/0000-0002-3967-0672
- name: Mari Carmen Suárez-Figueroa
identifier: https://orcid.org/0000-0003-3807-5019
successorOf: ''
funder:
- name: Comunidad de Madrid PIPF-2022/COM-25762
identifier: ''
sharedBy:
- name: Ontology Engineering Group (UPM)
identifier: https://oeg.fi.upm.es/index.php/en/index.html
wasGeneratedBy:
- trainingRegion:
- name: Europe (West)
cloudProvider:
- name: ''
url: ''
duration: ''
hardwareType: ''
fineTunedFromModel: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne
sdPublisher:
- name: Ontology Engineering Group
url: https://oeg.fi.upm.es/index.php/en/index.html
sdLicense: apache-2.0
metrics:
- accuracy
- f1
- roc_auc
base_model:
- PlanTL-GOB-ES/roberta-base-bne
pipeline_tag: text-classification
tags:
- easy-to-read
- meaning-preservation
---
## Model Card for RoBERTaSense-FACIL
**RoBERTaSense-FACIL** (RoBERTa Fine-tuned for Accessible Comprehension In Language) is a Spanish RoBERTa model fine-tuned to assess **meaning preservation** in **Easy-to-Read (E2R)** adaptations. Given a pair of texts {original, adapted}, it predicts whether the adaptation **preserves** the meaning of the original.
⚠️ **Deprecation notice (base model):** This model was fine-tuned from `PlanTL-GOB-ES/roberta-base-bne`. As of September 2025, this checkpoint is **deprecated** and no longer actively maintained. For actively maintained Spanish RoBERTa models, please see the **BSC-LT** organization: https://huggingface.co/BSC-LT
---
## 🚀 How to Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
repo = "oeg/RoBERTaSense-FACIL"
model = AutoModelForSequenceClassification.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)
original = "El lobo, que parecía amable, engañó a Caperucita."
adapted = "El lobo parecía amable.
El lobo engañó a Caperucita."
# Encode the pair (original, adapted)
inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(-1).squeeze().tolist()
print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
```
**Suggested labels (adjust to your checkpoint):**
```json
{
"id2label": {"0": "DOES_NOT_PRESERVE", "1": "PRESERVES_MEANING"},
"label2id": {"DOES_NOT_PRESERVE": 0, "PRESERVES_MEANING": 1}
}
```
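Rather than relying on this suggested mapping, you can confirm what the published checkpoint actually ships:

```python
from transformers import AutoConfig

# Inspect the label mapping stored in the checkpoint config
config = AutoConfig.from_pretrained("oeg/RoBERTaSense-FACIL")
print(config.id2label)
print(config.label2id)
```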
---
## Model Description
* **Developed by:** Ontology Engineering Group (UPM) / Authors: Isam Diab Lozano and Mari Carmen Suárez-Figueroa
* **Funded by:** "Ayudas para la contratación de personal investigador predoctoral en formación para el año 2022" (grants for hiring predoctoral researchers in training, 2022; Reference: PIPF-2022/COM-25762), Comunidad Autónoma de Madrid (Spain)
* **Model type:** Encoder-only Transformer (RoBERTa) with a classification head
* **Language:** Spanish (es)
* **License:** Apache-2.0
* **Finetuned from model:** `PlanTL-GOB-ES/roberta-base-bne` (deprecated; see notice above)
---
## Uses
### Direct Use
* Automatic scoring of **meaning preservation** for Spanish **Easy-to-Read** adaptations.
* As a signal in content quality checks for accessibility pipelines (see the batch-scoring sketch below).
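For pipeline use, pairs can be scored in batches. The sketch below is illustrative rather than a released API: the `score_pairs` helper and its batch size are assumptions, and it takes label id `1` to be the positive (`PRESERVES_MEANING`) class, as suggested above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "oeg/RoBERTaSense-FACIL"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

def score_pairs(originals, adaptations, batch_size=16):
    """Illustrative helper: probability that each adaptation preserves meaning."""
    scores = []
    for i in range(0, len(originals), batch_size):
        batch = tokenizer(
            originals[i:i + batch_size],
            adaptations[i:i + batch_size],
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt",
        )
        with torch.no_grad():
            logits = model(**batch).logits
        # Assumes id 1 = PRESERVES_MEANING (check config.id2label first)
        scores.extend(logits.softmax(-1)[:, 1].tolist())
    return scores

print(score_pairs(
    ["El lobo, que parecía amable, engañó a Caperucita."],
    ["El lobo parecía amable. El lobo engañó a Caperucita."],
))
```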
### Out-of-Scope Use
* Clinical, legal, or other high-stakes decisions without human expert oversight.
* Non-Spanish or out-of-domain texts without prior adaptation or re-training.
---
## Bias, Risks, and Limitations
* **Domain limitation:** trained for Spanish E2R; performance may degrade on other genres/domains.
* **Binary labels:** compress nuanced cases; borderline adaptations may require human review.
* **Synthetic negatives:** not all human errors are covered by synthetic negative strategies.
* **Base deprecation:** the upstream base model is deprecated; security/robustness updates won’t be inherited.
### Recommendations
* Calibrate probabilities (e.g., temperature scaling) and expose confidence scores.
* Use threshold tuning (e.g., Youden’s J) to trade precision/recall for your setting; see the sketch after this list.
* Keep a **human-in-the-loop** for critical use cases and periodic error audits.
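A minimal sketch of that threshold tuning, assuming held-out labels and positive-class probabilities are already available (the arrays below are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Placeholder held-out labels and positive-class probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])

# Youden's J = TPR - FPR; pick the threshold where J is maximal
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
operating_threshold = thresholds[np.argmax(tpr - fpr)]
print(f"Operating threshold: {operating_threshold:.3f}")
```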
---
## How to Get Started with the Model
See **How to Use** above. For pairwise inputs, encode as sentence pairs:
```python
inputs = tokenizer(text_original, text_adapted, return_tensors="pt", truncation=True, max_length=512)
```
---
## Training Details
### Training Data
* **Source:** Spanish pairs (*original - adapted*) curated/validated by experts.
* **Columns:** `text1` (original), `text2` (adaptation), `Label` (0/1), `neg_type`.
* **Labels:** `1 = PRESERVES_MEANING`, `0 = DOES_NOT_PRESERVE`.
* **Negative types** used in training data construction: `shuffle`, `dropout`, `mismatch` (derangement), `paraphrase_distortion`, `nli_contradiction`.
* **Split:** 80/20, stratified by `Label` (random_state=42); see the sketch after this list.
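A minimal sketch of this split, assuming the pairs live in a pandas DataFrame with the columns above (the toy rows are placeholders; the real corpus is not public):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the curated pairs, with the column layout described above
df = pd.DataFrame({
    "text1": [f"original {i}" for i in range(10)],
    "text2": [f"adapted {i}" for i in range(10)],
    "Label": [1, 0] * 5,
    "neg_type": [None, "shuffle"] * 5,
})

# 80/20 split, stratified by Label, with the seed reported above
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["Label"], random_state=42
)
```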
### Training Procedure
#### Preprocessing
* Pair tokenization with truncation at 512 tokens:
```python
tokenizer(text1, text2, truncation=True, max_length=512)
```
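Applied over a Hugging Face `datasets.Dataset`, the same pair tokenization might look as follows (the toy rows are illustrative):

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("oeg/RoBERTaSense-FACIL")

# Toy dataset with the column layout described above
ds = Dataset.from_dict({
    "text1": ["El lobo, que parecía amable, engañó a Caperucita."],
    "text2": ["El lobo parecía amable. El lobo engañó a Caperucita."],
    "Label": [1],
})

def preprocess(batch):
    # Pair tokenization with truncation at 512 tokens
    return tokenizer(batch["text1"], batch["text2"],
                     truncation=True, max_length=512)

ds = ds.map(preprocess, batched=True)
```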
#### Training Hyperparameters
* **Training regime:** fp16 mixed precision (if supported; otherwise fp32)
* **Arguments:**
* `num_train_epochs=5`
* `per_device_train_batch_size=32`
* `per_device_eval_batch_size=16`
* `learning_rate=2e-5`
* `weight_decay=0.01`
* `warmup_ratio=0.1`
* `evaluation_strategy="epoch"`, `save_strategy="epoch"`
* `load_best_model_at_end=True`, `metric_for_best_model="f1"`
* **Optimizer:** AdamW
* **Loss:** CrossEntropy (2 logits)
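A sketch of how these arguments might be assembled with the `Trainer` API. This is not the exact training script: the `compute_metrics` helper is illustrative, and `model`, `train_ds`, `eval_ds`, and `tokenizer` are assumed to come from the preceding sketches.

```python
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # Illustrative helper; the exact training script is not published
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds)}

args = TrainingArguments(
    output_dir="robertasense-facil",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=torch.cuda.is_available(),  # mixed precision only when supported
)

trainer = Trainer(
    model=model,             # base model with a 2-label head
    args=args,
    train_dataset=train_ds,  # tokenized splits, e.g. from the sketches above
    eval_dataset=eval_ds,
    tokenizer=tokenizer,     # enables dynamic padding during batching
    compute_metrics=compute_metrics,
)
trainer.train()
```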
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
* Held-out 20% stratified split of the curated E2R pairs.
#### Factors
* Report per-negative-type breakdown (e.g., performance on `mismatch`, `paraphrase_distortion`, etc.); a possible sketch follows.
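One possible way to produce that breakdown, assuming a DataFrame of test predictions carrying the `neg_type` column described in Training Data (the rows below are hypothetical):

```python
import pandas as pd

# Hypothetical test-set predictions; neg_type is set only for negatives
results = pd.DataFrame({
    "neg_type": ["shuffle", "mismatch", "paraphrase_distortion", "shuffle"],
    "Label":    [0, 0, 0, 0],
    "pred":     [0, 1, 0, 0],
})

# Accuracy per synthetic-negative strategy
per_type = (results.assign(correct=results["Label"] == results["pred"])
                   .groupby("neg_type")["correct"].mean())
print(per_type)
```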
#### Metrics
* Accuracy, F1, ROC-AUC.
### Results
* Accuracy: `0.81`
* F1: `0.84`
* ROC-AUC: `0.83`
* Threshold tuned via Youden’s J for operating point selection.
## Technical Specifications
### Model Architecture and Objective
* Encoder-only RoBERTa with a classification head (`Linear(hidden → 2)`).
* Objective: supervised cross-entropy on binary label.
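Assuming the standard `RobertaForSequenceClassification` layout, the head can be inspected directly:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("oeg/RoBERTaSense-FACIL")
print(model.config.num_labels)  # 2
print(model.classifier)         # classification head ending in Linear(hidden -> 2)
```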
---
## Citation
**BibTeX:**
```bibtex
@software{roberta_facil_2025,
title = {RoBERTaSense-FACIL: Meaning Preservation for Easy-to-Read in Spanish},
author = {Diab Lozano, Isam and Suárez-Figueroa, Mari Carmen},
year = {2025},
url = {https://huggingface.co/oeg/RoBERTaSense-FACIL}
}
```
**APA:**
Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). *RoBERTaSense-FACIL: Meaning Preservation for Easy-to-Read in Spanish*. Hugging Face. [https://huggingface.co/oeg/RoBERTaSense-FACIL](https://huggingface.co/oeg/RoBERTaSense-FACIL)