---
identifier: https://huggingface.co/oeg/RoBERTaSense-FACIL
name: RoBERTaSense-FACIL
version: 0.1.0
keywords:
  - easy-to-read
  - meaning preservation
  - accessibility
  - spanish
  - text pair classification
headline: >-
  Spanish RoBERTa fine-tuned to assess meaning preservation in Easy-to-Read
  (E2R) adaptations.
description: >
  RoBERTaSense-FACIL is a Spanish RoBERTa model fine-tuned to assess meaning
  preservation in Easy-to-Read (E2R) adaptations. Given a pair {original,
  adapted}, it predicts whether the adaptation preserves the meaning of the
  original. ⚠️ Deprecation notice (base model): fine-tuned from
  PlanTL-GOB-ES/roberta-base-bne, which is deprecated as of 2025. For actively
  maintained Spanish RoBERTa models, see BSC-LT.
task:
  - Text classification
  - Pairwise classification
modelCategory:
  - Supervised classification
language:
  - es
license: apache-2.0
parameterSize: 125M
developmentStatus: Active
dateCreated: 2025-09-25
dateModified: 2025-10-06
citation: >
  Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). RoBERTaSense-FACIL: Meaning
  Preservation for Easy-to-Read in Spanish. Retrieved from
  https://huggingface.co/oeg/RoBERTaSense-FACIL
codeRepository: ''
referencePublication: ''
developmentLibrary: PyTorch + Transformers
usageInstructions: |
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

  repo = "oeg/RoBERTaSense-FACIL"
  model = AutoModelForSequenceClassification.from_pretrained(repo)
  tokenizer = AutoTokenizer.from_pretrained(repo)

  original = "El lobo, que parecía amable, engañó a Caperucita."
  adapted = "El lobo parecía amable. El lobo engañó a Caperucita."

  inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)
  with torch.no_grad():
      logits = model(**inputs).logits
  probs = logits.softmax(-1).squeeze().tolist()
  print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
modelRisks:
  - Trained for Spanish E2R; out-of-domain performance may degrade.
  - >-
    Binary labels compress nuanced cases; borderline adaptations may require
    human review.
  - Synthetic negatives do not cover all real-world human errors.
  - Base model is deprecated; security/robustness updates will not be inherited.
evaluationMetrics:
  - Accuracy
  - F1
  - ROC-AUC
evaluationResults: |
  80/20 stratified split (seed=42). Example results:
  - Accuracy: 0.81
  - F1: 0.84
  - ROC-AUC: 0.83
softwareRequirements:
  - python>=3.9
  - torch>=2.0
  - transformers>=4.40
  - datasets>=2.18
storageRequirements:
  - ~500 MB
memoryRequirements:
  - >-
    >= 8 GB RAM (CPU inference), >= 12 GB VRAM recommended for large batch
    inference
operatingSystem:
  - Linux
  - macOS
  - Windows
processorRequirements:
  - x86_64 CPU (AVX recommended)
GPURequirements:
  - >-
    Not required for single-pair inference; CUDA GPU recommended for batch
    processing
distribution:
  - encodingFormat: ''
    contentUrl: ''
    contentSize: ''
    quantizationBits: ''
    quantizationMethod: ''
trainedOn:
  - identifier: internal:e2r-positives
    name: Expert-validated E2R pairs (Spanish)
    description: >
      Positive pairs (original↔adapted) from an existing corpus validated by
      experts; used as the positive class.
    url: ''
  - identifier: internal:synthetic-negatives
    name: Synthetic hard negatives (Spanish)
    description: >
      Negatives generated via sentence shuffle, dropout, mismatch (derangement),
      paraphrase-with-distortion, and zero-shot NLI contradictions; trivial
      pairs filtered by BLEU/ROUGE-L thresholds.
    url: ''
testedOn:
  - identifier: internal:heldout-20
    name: Held-out 20% stratified split
    description: >
      Stratified 80/20 split by Label (seed=42); pairwise tokenization up to
      512 tokens.
evaluatedOn:
  - identifier: internal:heldout-20
    name: Held-out 20% stratified split
    description: >
      Metrics: Accuracy, F1, ROC-AUC; operating threshold tuned via Youden’s J
      (ROC).
validatedOn: ''
author:
  - name: Isam Diab Lozano
    identifier: https://orcid.org/0000-0002-3967-0672
  - name: Mari Carmen Suárez-Figueroa
    identifier: https://orcid.org/0000-0003-3807-5019
successorOf: ''
funder:
  - name: Comunidad de Madrid — PIPF-2022/COM-25762
    identifier: ''
sharedBy:
  - name: Ontology Engineering Group (UPM)
    identifier: https://oeg.fi.upm.es/index.php/en/index.html
wasGeneratedBy:
  - trainingRegion:
      - name: Europe (West)
    cloudProvider:
      - name: ''
        url: ''
    duration: ''
    hardwareType: ''
fineTunedFromModel: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne
sdPublisher:
  - name: Ontology Engineering Group
    url: https://oeg.fi.upm.es/index.php/en/index.html
sdLicense: apache-2.0
metrics:
  - accuracy
  - f1
  - roc_auc
base_model:
  - PlanTL-GOB-ES/roberta-base-bne
pipeline_tag: text-classification
tags:
  - easy-to-read
  - meaning-preservation
---

## Model Card for RoBERTaSense-FACIL

**RoBERTaSense-FACIL** (RoBERTa Fine-tuned for Accessible Comprehension In Language) is a Spanish RoBERTa model fine-tuned to assess **meaning preservation** in **Easy-to-Read (E2R)** adaptations. Given a pair of texts {original, adapted}, it predicts whether the adaptation **preserves** the meaning of the original.

⚠️ **Deprecation notice (base model):** This model was fine-tuned from `PlanTL-GOB-ES/roberta-base-bne`. As of September 2025, this checkpoint is **deprecated** and no longer actively maintained. For actively maintained Spanish RoBERTa models, please see the **BSC-LT** organization: https://huggingface.co/BSC-LT

---

## 🚀 How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "oeg/RoBERTaSense-FACIL"
model = AutoModelForSequenceClassification.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

original = "El lobo, que parecía amable, engañó a Caperucita."
adapted = "El lobo parecía amable. El lobo engañó a Caperucita."

# Encode the pair (original, adapted)
inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(-1).squeeze().tolist()
print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
```

**Suggested labels (adjust to your checkpoint):**

```json
{
  "id2label": {"0": "DOES_NOT_PRESERVE", "1": "PRESERVES_MEANING"},
  "label2id": {"DOES_NOT_PRESERVE": 0, "PRESERVES_MEANING": 1}
}
```
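
If your checkpoint's `config.json` lacks these names, they can be supplied at load time. A minimal sketch; the label names above are suggestions, not guaranteed to match every checkpoint:

```python
# Optionally supply the label mapping when loading (names as suggested above).
from transformers import AutoModelForSequenceClassification

repo = "oeg/RoBERTaSense-FACIL"
model = AutoModelForSequenceClassification.from_pretrained(
    repo,
    id2label={0: "DOES_NOT_PRESERVE", 1: "PRESERVES_MEANING"},
    label2id={"DOES_NOT_PRESERVE": 0, "PRESERVES_MEANING": 1},
)
```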

---

## Model Description

* **Developed by:** Ontology Engineering Group (UPM) / Authors: Isam Diab Lozano and Mari Carmen Suárez-Figueroa
* **Funded by:** "Ayudas para la contratación de personal investigador predoctoral en formación para el año 2022" (grants for hiring predoctoral researchers in training, 2022 call; reference: PIPF-2022/COM-25762), Comunidad Autónoma de Madrid (Spain)
* **Model type:** Encoder-only Transformer (RoBERTa) with a classification head
* **Language:** Spanish (es)
* **License:** Apache-2.0
* **Finetuned from model:** `PlanTL-GOB-ES/roberta-base-bne` (deprecated; see notice above)

---

## Uses

### Direct Use

* Automatic scoring of **meaning preservation** for Spanish **Easy-to-Read** adaptations.
* As a signal in content quality checks for accessibility pipelines.

### Out-of-Scope Use

* Clinical, legal, or other high-stakes decisions without human expert oversight.
* Non-Spanish or out-of-domain texts without prior adaptation or re-training.

---

## Bias, Risks, and Limitations

* **Domain limitation:** trained for Spanish E2R; performance may degrade on other genres/domains.
* **Binary labels:** compress nuanced cases; borderline adaptations may require human review.
* **Synthetic negatives:** not all human errors are covered by synthetic negative strategies.
* **Base deprecation:** the upstream base model is deprecated; security/robustness updates won’t be inherited.

### Recommendations

* Calibrate probabilities (e.g., temperature scaling) and expose confidence scores.
* Use threshold tuning (e.g., Youden’s J) to trade precision/recall for your setting; see the sketch after this list.
* Keep a **human-in-the-loop** for critical use cases and periodic error audits.
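
As a concrete illustration of the threshold-tuning recommendation, a minimal sketch with scikit-learn (not among the card's listed dependencies); `y_true` and `y_score` are placeholders standing in for held-out labels and positive-class probabilities:

```python
# Hypothetical threshold tuning via Youden's J (TPR - FPR); placeholder data.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.7, 0.6, 0.2, 0.55, 0.8, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
threshold = thresholds[np.argmax(tpr - fpr)]  # maximizes Youden's J
print(f"Operating threshold: {threshold:.3f}")
# At inference: preds = (positive_probs >= threshold).astype(int)
```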

---

## How to Get Started with the Model

See **How to Use** above. For pairwise inputs, encode as sentence pairs:

```python
inputs = tokenizer(text_original, text_adapted, return_tensors="pt", truncation=True, max_length=512)
```
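
The card recommends a CUDA GPU for batch processing; a hedged batched-inference sketch, where the `pairs` list is placeholder data:

```python
# Hypothetical batched-inference sketch; `pairs` is placeholder data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "oeg/RoBERTaSense-FACIL"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

pairs = [
    ("El lobo, que parecía amable, engañó a Caperucita.",
     "El lobo parecía amable. El lobo engañó a Caperucita."),
]
originals, adapteds = (list(t) for t in zip(*pairs))

enc = tokenizer(originals, adapteds, return_tensors="pt",
                padding=True, truncation=True, max_length=512).to(device)
with torch.no_grad():
    probs = model(**enc).logits.softmax(-1)
print(probs.cpu().tolist())
```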

---

## Training Details

### Training Data

* **Source:** Spanish pairs (*original - adapted*) curated/validated by experts.
* **Columns:** `text1` (original), `text2` (adaptation), `Label` (0/1), `neg_type`.
* **Labels:** `1 = PRESERVES_MEANING`, `0 = DOES_NOT_PRESERVE`.
* **Negative types** used in training data construction: `shuffle`, `dropout`, `mismatch` (derangement), `paraphrase_distortion`, `nli_contradiction`; an illustrative sketch of two of these follows this list.
* **Split:** 80/20, stratified by `Label` (random_state=42).
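
The exact generation pipeline is not published here; purely as an illustration, a sketch of two of the listed strategies (`shuffle` and `mismatch`), with all helper names invented:

```python
# Illustrative sketches of two negative-generation strategies (`shuffle` and
# `mismatch`); invented helpers, not the authors' pipeline.
import random

def shuffle_negative(adapted: str, seed: int = 0) -> str:
    """`shuffle`: permute the sentences of an adaptation, breaking discourse order."""
    sentences = [s for s in adapted.split(". ") if s]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return ". ".join(sentences)

def mismatch_negatives(originals, adapteds, seed=0):
    """`mismatch` (derangement): pair each original with a *different* item's
    adaptation; assumes at least two items."""
    rng = random.Random(seed)
    idx = list(range(len(adapteds)))
    while any(i == j for i, j in enumerate(idx)):  # resample until a true derangement
        rng.shuffle(idx)
    return [(originals[i], adapteds[j]) for i, j in enumerate(idx)]
```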

### Training Procedure

#### Preprocessing

* Pair tokenization with truncation at 512 tokens:

```python
tokenizer(text1, text2, truncation=True, max_length=512)
```

#### Training Hyperparameters

* **Training regime:** fp16 mixed precision (if supported; otherwise fp32)
* **Arguments** (assembled into a sketch after this list):

  * `num_train_epochs=5`
  * `per_device_train_batch_size=32`
  * `per_device_eval_batch_size=16`
  * `learning_rate=2e-5`
  * `weight_decay=0.01`
  * `warmup_ratio=0.1`
  * `evaluation_strategy="epoch"`, `save_strategy="epoch"`
  * `load_best_model_at_end=True`, `metric_for_best_model="f1"`
* **Optimizer:** AdamW
* **Loss:** CrossEntropy (2 logits)
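
A minimal sketch wiring these values into `TrainingArguments`/`Trainer`; dataset loading and `compute_metrics` are omitted, and the output path is illustrative, not the authors' exact script:

```python
# Hypothetical fine-tuning sketch using the hyperparameters listed above.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "PlanTL-GOB-ES/roberta-base-bne"  # deprecated base model; see notice above
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

args = TrainingArguments(
    output_dir="robertasense-facil",  # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",      # renamed to `eval_strategy` in newer transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",       # requires a compute_metrics that returns "f1"
    fp16=True,                        # if supported; otherwise fp32
)
# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=..., compute_metrics=...)
# trainer.train()
```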

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

* Held-out 20% stratified split of the curated E2R pairs.

#### Factors

* Report per-negative-type breakdown (e.g., performance on `mismatch`, `paraphrase_distortion`, etc.); a sketch follows.
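
A hedged sketch of such a breakdown with pandas, assuming a frame holding the data schema above plus model predictions in a `pred` column; the rows are placeholders:

```python
# Hypothetical per-negative-type accuracy breakdown; `pred` (model predictions)
# is an assumed column on top of the schema above, rows are placeholders.
import pandas as pd

df = pd.DataFrame({
    "neg_type": ["shuffle", "mismatch", "paraphrase_distortion", "nli_contradiction"],
    "Label":    [0, 0, 0, 0],
    "pred":     [0, 1, 0, 0],
})
per_type = df.assign(correct=df["pred"].eq(df["Label"])).groupby("neg_type")["correct"].mean()
print(per_type)
```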

#### Metrics

* Accuracy, F1, ROC-AUC.

### Results

* Accuracy: `0.81`
* F1: `0.84`
* ROC-AUC: `0.83`
* Threshold tuned via Youden’s J for operating point selection; a sketch for reproducing these metrics follows.
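
A minimal sketch for reproducing the reported metrics with scikit-learn; `y_true` and `y_prob` are placeholders standing in for held-out labels and positive-class probabilities:

```python
# Hypothetical reproduction of the reported metrics; placeholder data.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.92, 0.35, 0.66, 0.71, 0.48, 0.81, 0.22, 0.55])
y_pred = (y_prob >= 0.5).astype(int)  # or a threshold tuned via Youden's J

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:      ", f1_score(y_true, y_pred))
print("ROC-AUC: ", roc_auc_score(y_true, y_prob))
```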

## Technical Specifications

### Model Architecture and Objective

* Encoder-only RoBERTa with a classification head (`Linear(hidden → 2)`).
* Objective: supervised cross-entropy on the binary label.

---

## Citation

**BibTeX:**

```bibtex
@software{roberta_facil_2025,
  title = {RoBERTaSense-FACIL: Meaning Preservation for Easy-to-Read in Spanish},
  author = {Diab Lozano, Isam and Suárez-Figueroa, Mari Carmen},
  year = {2025},
  url = {https://huggingface.co/oeg/RoBERTaSense-FACIL}
}
```

**APA:**
Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). *RoBERTaSense-FACIL: Meaning Preservation for Easy-to-Read in Spanish*. Hugging Face. [https://huggingface.co/oeg/RoBERTaSense-FACIL](https://huggingface.co/oeg/RoBERTaSense-FACIL)