isamdiablo committed · Commit 11dfd26 · verified · 1 Parent(s): 52b37d7

Update README.md

Files changed (1): README.md (+116, -95)
README.md CHANGED

identifier: https://huggingface.co/oeg/RoBERTaSense-FACIL
name: RoBERTaSense-FACIL
version: 0.1.0
keywords:
  - easy-to-read
  - meaning preservation
  - accessibility
  - spanish
  - text pair classification
headline: >-
  Spanish RoBERTa fine-tuned to assess meaning preservation in Easy-to-Read
  (E2R) adaptations.
description: >
  RoBERTaSense-FACIL is a Spanish RoBERTa model fine-tuned to assess meaning
  preservation in Easy-to-Read (E2R) adaptations. Given a pair {original,
  adapted}, it predicts whether the adaptation preserves the meaning of the
  original. ⚠️ Deprecation notice (base model): fine-tuned from
  PlanTL-GOB-ES/roberta-base-bne, which is deprecated as of 2025. For actively
  maintained Spanish RoBERTa models, see BSC-LT.
task:
  - Text classification
  - Pairwise classification
modelCategory:
  - Supervised classification
language:
  - es
license: apache-2.0
parameterSize: 125M
developmentStatus: Active
dateCreated: 25-09-2025
dateModified: 06-10-2025
citation: >
  Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). RoBERTaSense-FACIL: Meaning
  Preservation for Easy-to-Read in Spanish. Retrieved from
  https://huggingface.co/oeg/RoBERTaSense-FACIL
codeRepository: ''
referencePublication: ''
developmentLibrary: PyTorch + Transformers
usageInstructions: |
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

  repo = "oeg/RoBERTaSense-FACIL"
  model = AutoModelForSequenceClassification.from_pretrained(repo)
  tokenizer = AutoTokenizer.from_pretrained(repo)

  original = "El lobo, que parecía amable, engañó a Caperucita."
  adapted = "El lobo parecía amable. El lobo engañó a Caperucita."

  inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)
  with torch.no_grad():
      logits = model(**inputs).logits
  probs = logits.softmax(-1).squeeze().tolist()
  print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
modelRisks:
  - Trained for Spanish E2R; out-of-domain performance may degrade.
  - >-
    Binary labels compress nuanced cases; borderline adaptations may require
    human review.
  - Synthetic negatives do not cover all real-world human errors.
  - Base model is deprecated; security/robustness updates will not be inherited.
evaluationMetrics:
  - Accuracy
  - F1
  - ROC-AUC
evaluationResults: |
  80/20 stratified split (seed=42). Example results:
  - Accuracy: 0.81
  - F1: 0.84
  - ROC-AUC: 0.83
softwareRequirements:
  - python>=3.9
  - torch>=2.0
  - transformers>=4.40
  - datasets>=2.18
storageRequirements:
  - ~500 MB
memoryRequirements:
  - >-
    >= 8 GB RAM (CPU inference); >= 12 GB VRAM recommended for large-batch
    inference
operatingSystem:
  - Linux
  - macOS
  - Windows
processorRequirements:
  - x86_64 CPU (AVX recommended)
GPURequirements:
  - >-
    Not required for single-pair inference; CUDA GPU recommended for batch
    processing
distribution:
  - encodingFormat: ''
    contentUrl: ''
    contentSize: ''
    quantizationBits: ''
    quantizationMethod: ''
trainedOn:
  - identifier: internal:e2r-positives
    name: Expert-validated E2R pairs (Spanish)
    description: >
      Positive pairs (original↔adapted) from an existing corpus validated by
      experts; used as the positive class.
    url: ''
  - identifier: internal:synthetic-negatives
    name: Synthetic hard negatives (Spanish)
    description: >
      Negatives generated via sentence shuffle, dropout, mismatch
      (derangement), paraphrase-with-distortion, and zero-shot NLI
      contradictions; trivial pairs filtered by BLEU/ROUGE-L thresholds.
    url: ''
testedOn:
  - identifier: internal:heldout-20
    name: Held-out 20% stratified split
    description: >
      Stratified 80/20 split by label (seed=42); pairwise tokenization up to
      512 tokens.
evaluatedOn:
  - identifier: internal:heldout-20
    name: Held-out 20% stratified split
    description: >
      Metrics: Accuracy, F1, ROC-AUC; operating threshold tuned via Youden's J
      (ROC).
validatedOn: ''
author:
  - name: Isam Diab Lozano
    identifier: https://orcid.org/0000-0002-3967-0672
  - name: Mari Carmen Suárez-Figueroa
    identifier: https://orcid.org/0000-0003-3807-5019
successorOf: ''
funder:
  - name: Comunidad de Madrid — PIPF-2022/COM-25762
    identifier: ''
sharedBy:
  - name: Ontology Engineering Group (UPM)
    identifier: https://oeg.fi.upm.es/index.php/en/index.html
wasGeneratedBy:
  - trainingRegion:
      - name: Europe (West)
    cloudProvider:
      - name: ''
        url: ''
    duration: ''
    hardwareType: ''
    fineTunedFromModel: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne
sdPublisher:
  - name: Ontology Engineering Group
    url: https://oeg.fi.upm.es/index.php/en/index.html
sdLicense: apache-2.0
metrics:
  - accuracy
  - f1
  - roc_auc
base_model:
  - PlanTL-GOB-ES/roberta-base-bne
pipeline_tag: text-classification
---

## Model Card for RoBERTaSense-FACIL
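The frontmatter notes that the operating threshold was tuned via Youden's J on the ROC curve. As a minimal illustrative sketch (not the authors' actual code), the selection amounts to scanning candidate thresholds over the model's positive-class probabilities and keeping the one that maximizes J = TPR - FPR. The `y_true` and `y_score` values below are made-up toy data, not model outputs:

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    y_true: 0/1 gold labels; y_score: positive-class probabilities.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_t, best_j = None, -1.0
    # Each distinct score is a candidate operating threshold.
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= t)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Toy example (synthetic labels and scores):
t, j = youden_threshold([0, 0, 1, 1, 1], [0.2, 0.6, 0.4, 0.7, 0.9])
# t == 0.7, j == 2/3
```

In practice the same selection is usually done with `sklearn.metrics.roc_curve`, taking the threshold at the index where `tpr - fpr` peaks.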
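The `trainedOn` entry says trivial synthetic negatives were filtered by BLEU/ROUGE-L thresholds. A minimal sketch of the ROUGE-L side, assuming whitespace tokenization and a hypothetical cutoff (the card does not state the actual thresholds or tokenizer used):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f(reference, candidate):
    """ROUGE-L F1 over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)

# Hypothetical filter: a negative that barely differs from the original
# is too easy (or mislabeled) and gets dropped. The 0.95 cutoff is made up.
def is_trivial_negative(original, negative, cutoff=0.95):
    return rouge_l_f(original, negative) >= cutoff
```

The BLEU side of the filter would follow the same pattern with an n-gram-precision score (e.g. via `sacrebleu`) in place of `rouge_l_f`.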