---
identifier: https://huggingface.co/oeg/RoBERTaSense-FACIL
name: RoBERTaSense-FACIL
version: 0.1.0
keywords:
- easy-to-read
- meaning preservation
- accessibility
- spanish
- text pair classification
headline: >-
Spanish RoBERTa fine-tuned to assess meaning preservation in Easy-to-Read
(E2R) adaptations.
description: >
RoBERTaSense-FACIL is a Spanish RoBERTa model fine-tuned to assess meaning
preservation in Easy-to-Read (E2R) adaptations. Given a pair {original,
adapted}, it predicts whether the adaptation preserves the meaning of the
original. ⚠️ Deprecation notice (base model): fine-tuned from
PlanTL-GOB-ES/roberta-base-bne, which is deprecated as of 2025. For actively
maintained Spanish RoBERTa models, see BSC-LT.
task:
- Text classification
- Pairwise classification
modelCategory:
- Supervised classification
language:
- es
license: apache-2.0
parameterSize: 125M
developmentStatus: Active
dateCreated: 2025-09-25
dateModified: 2025-10-06
citation: >
Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). RoBERTaSense-FACIL: Meaning
Preservation for Easy-to-Read in Spanish. Retrieved from
https://huggingface.co/oeg/RoBERTaSense-FACIL
codeRepository: ''
referencePublication: ''
developmentLibrary: PyTorch + Transformers
usageInstructions: |
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

  repo = "oeg/RoBERTaSense-FACIL"
  model = AutoModelForSequenceClassification.from_pretrained(repo)
  tokenizer = AutoTokenizer.from_pretrained(repo)

  original = "El lobo, que parecía amable, engañó a Caperucita."
  adapted = "El lobo parecía amable. El lobo engañó a Caperucita."

  inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)
  with torch.no_grad():
      logits = model(**inputs).logits
  probs = logits.softmax(-1).squeeze().tolist()
  print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
modelRisks:
- Trained for Spanish E2R; out-of-domain performance may degrade.
- >-
Binary labels compress nuanced cases; borderline adaptations may require human
review.
- Synthetic negatives do not cover all real-world human errors.
- Base model is deprecated; security/robustness updates will not be inherited.
evaluationMetrics:
- Accuracy
- F1
- ROC-AUC
evaluationResults: |
  80/20 stratified split (seed=42). Results on the held-out split:
  - Accuracy: 0.81
  - F1: 0.84
  - ROC-AUC: 0.83
softwareRequirements:
- python>=3.9
- torch>=2.0
- transformers>=4.40
- datasets>=2.18
storageRequirements:
- ~500 MB
memoryRequirements:
- >-
>= 8 GB RAM (CPU inference), >= 12 GB VRAM recommended for large batch
inference
operatingSystem:
- Linux
- macOS
- Windows
processorRequirements:
- x86_64 CPU (AVX recommended)
GPURequirements:
- >-
Not required for single-pair inference; CUDA GPU recommended for batch
processing
distribution:
- encodingFormat: ''
contentUrl: ''
contentSize: ''
quantizationBits: ''
quantizationMethod: ''
trainedOn:
- identifier: internal:e2r-positives
name: Expert-validated E2R pairs (Spanish)
description: >
Positive pairs (original↔adapted) from an existing corpus validated by
experts; used as the positive class.
url: ''
- identifier: internal:synthetic-negatives
name: Synthetic hard negatives (Spanish)
description: >
Negatives generated via sentence shuffle, dropout, mismatch (derangement),
paraphrase-with-distortion, and zero-shot NLI contradictions; trivial pairs
filtered by BLEU/ROUGE-L thresholds.
url: ''
testedOn:
- identifier: internal:heldout-20
name: Held-out 20% stratified split
description: >
Stratified 80/20 split by Label (seed=42); pairwise tokenization up to 512
tokens.
evaluatedOn:
- identifier: internal:heldout-20
name: Held-out 20% stratified split
description: >
Metrics: Accuracy, F1, ROC-AUC; operating threshold tuned via Youden’s J
(ROC).
validatedOn: ''
author:
- name: Isam Diab Lozano
identifier: https://orcid.org/0000-0002-3967-0672
- name: Mari Carmen Suárez-Figueroa
identifier: https://orcid.org/0000-0003-3807-5019
successorOf: ''
funder:
- name: Comunidad de Madrid PIPF-2022/COM-25762
identifier: ''
sharedBy:
- name: Ontology Engineering Group (UPM)
identifier: https://oeg.fi.upm.es/index.php/en/index.html
wasGeneratedBy:
- trainingRegion:
- name: Europe (West)
cloudProvider:
- name: ''
url: ''
duration: ''
hardwareType: ''
fineTunedFromModel: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne
sdPublisher:
- name: Ontology Engineering Group
url: https://oeg.fi.upm.es/index.php/en/index.html
sdLicense: apache-2.0
metrics:
- accuracy
- f1
- roc_auc
base_model:
- PlanTL-GOB-ES/roberta-base-bne
pipeline_tag: text-classification
tags:
- easy-to-read
- meaning-preservation
---
## Model Card for RoBERTaSense-FACIL
**RoBERTaSense-FACIL** (RoBERTa Fine-tuned for Accessible Comprehension In Language) is a Spanish RoBERTa model fine-tuned to assess **meaning preservation** in **Easy-to-Read (E2R)** adaptations. Given a pair of texts {original, adapted}, it predicts whether the adaptation **preserves** the meaning of the original.
⚠️ **Deprecation notice (base model):** This model was fine-tuned from `PlanTL-GOB-ES/roberta-base-bne`. As of September 2025, this checkpoint is **deprecated** and no longer actively maintained. For actively maintained Spanish RoBERTa models, please see the **BSC-LT** organization: https://huggingface.co/BSC-LT
---
## 🚀 How to Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
repo = "oeg/RoBERTaSense-FACIL"
model = AutoModelForSequenceClassification.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)
original = "El lobo, que parecía amable, engañó a Caperucita."
adapted = "El lobo parecía amable.
El lobo engañó a Caperucita."
# Encode the pair (original, adapted)
inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(-1).squeeze().tolist()
print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
```
**Suggested labels (adjust to your checkpoint):**
```json
{
"id2label": {"0": "DOES_NOT_PRESERVE", "1": "PRESERVES_MEANING"},
"label2id": {"DOES_NOT_PRESERVE": 0, "PRESERVES_MEANING": 1}
}
```
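Rather than relying on this suggested mapping, you can confirm what the published checkpoint actually ships:

```python
from transformers import AutoConfig

# Inspect the label mapping stored in the checkpoint config
config = AutoConfig.from_pretrained("oeg/RoBERTaSense-FACIL")
print(config.id2label)
print(config.label2id)
```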
---
## Model Description
* **Developed by:** Ontology Engineering Group (UPM) / Authors: Isam Diab Lozano and Mari Carmen Suárez-Figueroa
* **Funded by:** "Ayudas para la contratación de personal investigador predoctoral en formación para el año 2022" (grants for hiring predoctoral researchers in training, 2022; Reference: PIPF-2022/COM-25762), Comunidad Autónoma de Madrid (Spain)
* **Model type:** Encoder-only Transformer (RoBERTa) with a classification head
* **Language:** Spanish (es)
* **License:** Apache-2.0
* **Finetuned from model:** `PlanTL-GOB-ES/roberta-base-bne` (deprecated; see notice above)
---
## Uses
### Direct Use
* Automatic scoring of **meaning preservation** for Spanish **Easy-to-Read** adaptations.
* As a signal in content quality checks for accessibility pipelines (see the batch-scoring sketch below).
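For pipeline use, pairs can be scored in batches. The sketch below is illustrative rather than a released API: the `score_pairs` helper and its batch size are assumptions, and it takes label id `1` to be the positive (`PRESERVES_MEANING`) class, as suggested above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "oeg/RoBERTaSense-FACIL"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

def score_pairs(originals, adaptations, batch_size=16):
    """Illustrative helper: probability that each adaptation preserves meaning."""
    scores = []
    for i in range(0, len(originals), batch_size):
        batch = tokenizer(
            originals[i:i + batch_size],
            adaptations[i:i + batch_size],
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt",
        )
        with torch.no_grad():
            logits = model(**batch).logits
        # Assumes id 1 = PRESERVES_MEANING (check config.id2label first)
        scores.extend(logits.softmax(-1)[:, 1].tolist())
    return scores

print(score_pairs(
    ["El lobo, que parecía amable, engañó a Caperucita."],
    ["El lobo parecía amable. El lobo engañó a Caperucita."],
))
```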
### Out-of-Scope Use
* Clinical, legal, or other high-stakes decisions without human expert oversight.
* Non-Spanish or out-of-domain texts without prior adaptation or re-training.
---
## Bias, Risks, and Limitations
* **Domain limitation:** trained for Spanish E2R; performance may degrade on other genres/domains.
* **Binary labels:** compress nuanced cases; borderline adaptations may require human review.
* **Synthetic negatives:** not all human errors are covered by synthetic negative strategies.
* **Base deprecation:** the upstream base model is deprecated; security/robustness updates won’t be inherited.
### Recommendations
* Calibrate probabilities (e.g., temperature scaling) and expose confidence scores.
* Use threshold tuning (e.g., Youden’s J) to trade precision/recall for your setting; see the sketch after this list.
* Keep a **human-in-the-loop** for critical use cases and periodic error audits.
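A minimal sketch of that threshold tuning, assuming held-out labels and positive-class probabilities are already available (the arrays below are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Placeholder held-out labels and positive-class probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])

# Youden's J = TPR - FPR; pick the threshold where J is maximal
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
operating_threshold = thresholds[np.argmax(tpr - fpr)]
print(f"Operating threshold: {operating_threshold:.3f}")
```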
---
## How to Get Started with the Model
See **How to Use** above. For pairwise inputs, encode as sentence pairs:
```python
inputs = tokenizer(text_original, text_adapted, return_tensors="pt", truncation=True, max_length=512)
```
---
## Training Details
### Training Data
* **Source:** Spanish pairs (*original - adapted*) curated/validated by experts.
* **Columns:** `text1` (original), `text2` (adaptation), `Label` (0/1), `neg_type`.
* **Labels:** `1 = PRESERVES_MEANING`, `0 = DOES_NOT_PRESERVE`.
* **Negative types** used in training data construction: `shuffle`, `dropout`, `mismatch` (derangement), `paraphrase_distortion`, `nli_contradiction`.
* **Split:** 80/20, stratified by `Label` (random_state=42); see the sketch after this list.
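A minimal sketch of this split, assuming the pairs live in a pandas DataFrame with the columns above (the toy rows are placeholders; the real corpus is not public):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the curated pairs, with the column layout described above
df = pd.DataFrame({
    "text1": [f"original {i}" for i in range(10)],
    "text2": [f"adapted {i}" for i in range(10)],
    "Label": [1, 0] * 5,
    "neg_type": [None, "shuffle"] * 5,
})

# 80/20 split, stratified by Label, with the seed reported above
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["Label"], random_state=42
)
```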
### Training Procedure
#### Preprocessing
* Pair tokenization with truncation at 512 tokens:
```python
tokenizer(text1, text2, truncation=True, max_length=512)
```
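Applied over a Hugging Face `datasets.Dataset`, the same pair tokenization might look as follows (the toy rows are illustrative):

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("oeg/RoBERTaSense-FACIL")

# Toy dataset with the column layout described above
ds = Dataset.from_dict({
    "text1": ["El lobo, que parecía amable, engañó a Caperucita."],
    "text2": ["El lobo parecía amable. El lobo engañó a Caperucita."],
    "Label": [1],
})

def preprocess(batch):
    # Pair tokenization with truncation at 512 tokens
    return tokenizer(batch["text1"], batch["text2"],
                     truncation=True, max_length=512)

ds = ds.map(preprocess, batched=True)
```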
#### Training Hyperparameters
* **Training regime:** fp16 mixed precision (if supported; otherwise fp32)
* **Arguments:**
* `num_train_epochs=5`
* `per_device_train_batch_size=32`
* `per_device_eval_batch_size=16`
* `learning_rate=2e-5`
* `weight_decay=0.01`
* `warmup_ratio=0.1`
* `evaluation_strategy="epoch"`, `save_strategy="epoch"`
* `load_best_model_at_end=True`, `metric_for_best_model="f1"`
* **Optimizer:** AdamW
* **Loss:** CrossEntropy (2 logits)
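A sketch of how these arguments might be assembled with the `Trainer` API. This is not the exact training script: the `compute_metrics` helper is illustrative, and `model`, `train_ds`, `eval_ds`, and `tokenizer` are assumed to come from the preceding sketches.

```python
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # Illustrative helper; the exact training script is not published
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds)}

args = TrainingArguments(
    output_dir="robertasense-facil",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=torch.cuda.is_available(),  # mixed precision only when supported
)

trainer = Trainer(
    model=model,             # base model with a 2-label head
    args=args,
    train_dataset=train_ds,  # tokenized splits, e.g. from the sketches above
    eval_dataset=eval_ds,
    tokenizer=tokenizer,     # enables dynamic padding during batching
    compute_metrics=compute_metrics,
)
trainer.train()
```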
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
* Held-out 20% stratified split of the curated E2R pairs.
#### Factors
* Report per-negative-type breakdown (e.g., performance on `mismatch`, `paraphrase_distortion`, etc.); a possible sketch follows.
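One possible way to produce that breakdown, assuming a DataFrame of test predictions carrying the `neg_type` column described in Training Data (the rows below are hypothetical):

```python
import pandas as pd

# Hypothetical test-set predictions; neg_type is set only for negatives
results = pd.DataFrame({
    "neg_type": ["shuffle", "mismatch", "paraphrase_distortion", "shuffle"],
    "Label":    [0, 0, 0, 0],
    "pred":     [0, 1, 0, 0],
})

# Accuracy per synthetic-negative strategy
per_type = (results.assign(correct=results["Label"] == results["pred"])
                   .groupby("neg_type")["correct"].mean())
print(per_type)
```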
#### Metrics
* Accuracy, F1, ROC-AUC.
### Results
* Accuracy: `0.81`
* F1: `0.84`
* ROC-AUC: `0.83`
* Threshold tuned via Youden’s J for operating point selection.
## Technical Specifications
### Model Architecture and Objective
* Encoder-only RoBERTa with a classification head (`Linear(hidden → 2)`).
* Objective: supervised cross-entropy on binary label.
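Assuming the standard `RobertaForSequenceClassification` layout, the head can be inspected directly:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("oeg/RoBERTaSense-FACIL")
print(model.config.num_labels)  # 2
print(model.classifier)         # classification head ending in Linear(hidden -> 2)
```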
---
## Citation
**BibTeX:**
```bibtex
@software{roberta_facil_2025,
title = {RoBERTaSense-FACIL: Meaning Preservation for Easy-to-Read in Spanish},
author = {Diab Lozano, Isam and Suárez-Figueroa, Mari Carmen},
year = {2025},
url = {https://huggingface.co/oeg/RoBERTaSense-FACIL}
}
```
**APA:**
Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). *RoBERTaSense-FACIL: Meaning Preservation for Easy-to-Read in Spanish*. Hugging Face. [https://huggingface.co/oeg/RoBERTaSense-FACIL](https://huggingface.co/oeg/RoBERTaSense-FACIL)