Commit e046a5f (verified) by nicolauduran45 · 1 Parent(s): f86f0fe

Update README.md

Files changed (1)
  1. README.md +218 -143
README.md CHANGED
@@ -1,199 +1,274 @@
1
  ---
2
  library_name: transformers
3
- tags: []
 
 
 
 
 
4
  ---
5
 
6
- # Model Card for Model ID
7
 
8
- <!-- Provide a quick summary of what the model is/does. -->
 
9
 
 
 
 
 
10
 
 
11
 
12
- ## Model Details
13
-
14
- ### Model Description
15
-
16
- <!-- Provide a longer summary of what this model is. -->
17
-
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
-
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
-
28
- ### Model Sources [optional]
29
-
30
- <!-- Provide the basic links for the model. -->
31
-
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
-
36
- ## Uses
37
-
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
-
40
- ### Direct Use
41
-
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
-
44
- [More Information Needed]
45
-
46
- ### Downstream Use [optional]
47
-
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
-
50
- [More Information Needed]
51
-
52
- ### Out-of-Scope Use
53
-
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
-
56
- [More Information Needed]
57
-
58
- ## Bias, Risks, and Limitations
59
-
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
-
62
- [More Information Needed]
63
-
64
- ### Recommendations
65
-
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
-
70
- ## How to Get Started with the Model
71
-
72
- Use the code below to get started with the model.
73
-
74
- [More Information Needed]
75
-
76
- ## Training Details
77
-
78
- ### Training Data
79
-
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
-
82
- [More Information Needed]
83
-
84
- ### Training Procedure
85
-
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
-
88
- #### Preprocessing [optional]
89
-
90
- [More Information Needed]
91
-
92
-
93
- #### Training Hyperparameters
94
-
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
-
97
- #### Speeds, Sizes, Times [optional]
98
-
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
-
101
- [More Information Needed]
102
 
103
- ## Evaluation
104
 
105
- <!-- This section describes the evaluation protocols and provides the results. -->
 
106
 
107
- ### Testing Data, Factors & Metrics
108
 
109
- #### Testing Data
 
 
 
 
 
 
110
 
111
- <!-- This should link to a Dataset Card if possible. -->
112
 
113
- [More Information Needed]
114
 
115
- #### Factors
116
 
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
 
119
- [More Information Needed]
120
 
121
- #### Metrics
 
 
 
 
 
 
 
 
 
122
 
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
 
125
- [More Information Needed]
 
 
126
 
127
- ### Results
128
 
129
- [More Information Needed]
 
 
 
130
 
131
- #### Summary
132
 
 
 
 
 
133
 
 
134
 
135
- ## Model Examination [optional]
136
 
137
- <!-- Relevant interpretability work for the model goes here -->
138
 
139
- [More Information Needed]
 
140
 
141
- ## Environmental Impact
 
142
 
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
 
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
 
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
 
153
- ## Technical Specifications [optional]
154
 
155
- ### Model Architecture and Objective
156
 
157
- [More Information Needed]
 
 
 
 
158
 
159
- ### Compute Infrastructure
160
 
161
- [More Information Needed]
 
 
 
 
 
 
162
 
163
- #### Hardware
164
 
165
- [More Information Needed]
166
 
167
- #### Software
168
 
169
- [More Information Needed]
170
 
171
- ## Citation [optional]
172
 
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
174
 
175
- **BibTeX:**
176
 
177
- [More Information Needed]
 
 
 
178
 
179
- **APA:**
180
 
181
- [More Information Needed]
 
 
 
 
 
 
 
 
182
 
183
- ## Glossary [optional]
184
 
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
 
187
- [More Information Needed]
 
 
 
 
 
 
 
188
 
189
- ## More Information [optional]
190
 
191
- [More Information Needed]
192
 
193
- ## Model Card Authors [optional]
194
 
195
- [More Information Needed]
196
 
197
- ## Model Card Contact
198
 
199
- [More Information Needed]
 
 
 
 
1
  ---
2
  library_name: transformers
3
+ license: apache-2.0
4
+ datasets:
5
+ - SIRIS-Lab/erc-classification-dataset
6
+ base_model:
7
+ - allenai/specter2_base
8
+ pipeline_tag: text-classification
9
  ---
10
 
11
+ # ERC Panels Classifier
12
 
13
+ This model is a fine-tuned version of **allenai/specter2_base** for multilabel scientific domain classification aligned with **ERC panel taxonomy**.
14
+ At the final epoch (6), it achieves the following results on the held-out evaluation set used during training:
15
 
16
+ - **Best validation loss:** 0.0361
17
+ - **Micro F1:** 0.9386
18
+ - **Micro ROC-AUC:** 0.9718
19
+ - **Subset accuracy:** 0.7943
20
 
21
+ ---
22
 
23
+ ## Model description
24
 
25
+ This model is a fine-tuned variant of **SPECTER2** (`allenai/specter2_base`) adapted for **multilabel classification of scientific documents** into ERC research panels.
26
 
27
+ The model takes as input the **title and abstract** of a scientific publication and predicts **one or more research panels**.
28
+ Since scientific outputs may legitimately span multiple domains, the model is trained using **sigmoid activation** with **binary cross-entropy loss**, allowing independent assignment of multiple labels.
29
 
30
+ ### Key characteristics
31
 
32
+ - **Base model:** allenai/specter2_base
33
+ - **Task:** multilabel document classification
34
+ - **Labels:** 28 ERC scientific panels
35
+ - **Activation:** sigmoid (independent scores per label)
36
+ - **Loss:** BCEWithLogitsLoss
37
+ - **Output:** list of predicted panels with associated probabilities
38
+ - **Decision threshold:** 0.5 (tunable)
39
 
40
+ This model enables automatic research-domain tagging aligned with the ERC panel structure.
41
 
42
+ ---
43
 
44
+ ## Intended uses & limitations
45
 
46
+ ### Intended uses
47
 
48
+ This model is designed for:
49
 
50
+ - Automatic assignment of ERC research panels
+ - Metadata enrichment for:
+   - research project databases
+   - institutional repositories
+   - funding and grant analysis pipelines
+ - Large-scale analytics such as:
+   - portfolio mapping
+   - thematic analysis of research outputs
+   - monitoring disciplinary coverage of funded projects
+ - Predicting subject areas for documents lacking structured domain metadata
60
 
61
+ The model supports:
62
 
63
+ - title only
64
+ - abstract only
65
+ - **title + abstract (recommended)**
66
 
67
+ ### Limitations
68
 
69
+ - ERC panels are **high-level categories** and do not represent fine-grained subdisciplines
70
+ - Labels are derived from curated datasets and from semi-automatically annotated data
71
+ - Class imbalance may affect recall for underrepresented panels
72
+ - The model does not encode explicit hierarchical relationships between panels
73
 
74
+ Not suited for:
75
 
76
+ - fine-grained subfield classification
77
+ - journal recommendation
78
+ - evaluation of research quality or impact
79
+ - clinical, legal, or regulatory decision-making
80
 
81
+ Predictions should be treated as **supportive metadata**, not authoritative classifications.
82
 
83
+ ---
84
 
85
+ ## How to use
86
 
87
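+ ```python
+ from transformers import pipeline
+
+ # Replace with your actual model repo name on Hugging Face
+ MODEL_NAME = "nicolauduran45/erc_classifier_demo"
+
+ # top_k=None returns a score for every panel instead of only the highest-scoring one
+ classifier = pipeline(task="text-classification", model=MODEL_NAME, tokenizer=MODEL_NAME, top_k=None)
+
+ text = ["Climate change impacts on Arctic ecosystems."]
+
+ # One list of {"label": ..., "score": ...} dicts per input text
+ predictions = classifier(text)
+ print(predictions)
+ ```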
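The card lists a tunable decision threshold of 0.5. The sketch below shows one way to apply it to the pipeline output, assuming the pipeline was called with `top_k=None` so that it returns one `{"label", "score"}` dict per panel. The helper name `select_panels` and the example scores are made up for illustration and are not part of the model repository.

```python
from typing import Dict, List


def select_panels(scores: List[Dict], threshold: float = 0.5) -> List[Dict]:
    """Keep the labels whose score reaches the threshold, highest score first."""
    kept = [s for s in scores if s["score"] >= threshold]
    return sorted(kept, key=lambda s: s["score"], reverse=True)


# Made-up scores in the shape the pipeline returns for a single text.
example_scores = [
    {"label": "Environmental Biology, Ecology and Evolution", "score": 0.64},
    {"label": "Earth System Science", "score": 0.91},
    {"label": "Mathematics", "score": 0.02},
]

selected = select_panels(example_scores)
print([s["label"] for s in selected])
```

Raising the threshold trades recall for precision; an empty result simply means no panel reached the chosen cut-off.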
99
+ ---
 
 
100
 
101
+ ## Training and evaluation data
102
 
103
+ ### Training data
104
 
105
+ - Scientific documents with ERC-style panel annotations
+ - Inputs:
+   - title
+   - abstract
+ - Task type: **multilabel classification**
110
 
111
+ ### Dataset characteristics
112
 
113
+ | Property | Value |
+ |--------|------|
+ | Documents | ~40k |
+ | Labels | 28 panels |
+ | Input fields | Title, Abstract |
+ | Task type | Multilabel |
+ | License | Dataset-dependent |
120
 
121
+ ---
122
 
123
+ ## Training procedure
124
 
125
+ ### Preprocessing
126
 
127
+ - Input text constructed as:
128
 
129
+ `title + ". " + abstract`
130
 
131
+ - Tokenization using the SPECTER2 tokenizer
132
+ - Maximum sequence length: **512 tokens**
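The input construction above can be sketched as a small helper. The function name `build_input`, the whitespace stripping, and the fallback to a single field when the other is missing are assumptions made for illustration; the card only specifies `title + ". " + abstract`.

```python
from typing import Optional


def build_input(title: Optional[str], abstract: Optional[str]) -> str:
    """Join title and abstract as `title + ". " + abstract`.

    Assumption: when one field is missing or blank, the other is used alone,
    matching the title-only and abstract-only modes the card mentions.
    """
    parts = [p.strip() for p in (title, abstract) if p and p.strip()]
    return ". ".join(parts)


print(build_input("Arctic ecosystems", "We study climate change impacts."))
```

Truncation to 512 tokens would then be left to the tokenizer rather than handled at the string level.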
133
 
134
+ ### Model
135
 
136
+ - Base model: `allenai/specter2_base`
137
+ - Classification head: linear → sigmoid
138
+ - Loss function: BCEWithLogitsLoss
139
+ - Predictions: independent probability per label
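To make the loss concrete, here is a pure-Python sketch of what binary cross-entropy on logits computes for one document: each label gets its own sigmoid, and the per-label losses are averaged (mean reduction, which is the PyTorch default for `BCEWithLogitsLoss`). This is an illustrative re-derivation, not the training code; the function names are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean over labels of max(x, 0) - x*y + log(1 + exp(-|x|))."""
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Three labels for one document: logits and 0/1 targets (made-up values).
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]
loss = bce_with_logits(logits, targets)
probs = [sigmoid(x) for x in logits]
print(loss, probs)
```

Because every label has its own sigmoid, the probabilities need not sum to 1, which is what allows several panels to be assigned to the same document.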
140
 
141
+ ### Training hyperparameters
142
 
143
+ | Hyperparameter | Value |
+ |--------------|------|
+ | Learning rate | 2e-5 |
+ | Train batch size | 16 |
+ | Eval batch size | 16 |
+ | Epochs | 6 |
+ | Weight decay | 0.01 |
+ | Optimizer | AdamW |
+ | Metric for best model | Micro F1 |
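For readers who want to reproduce a similar setup, the table could map onto `transformers.TrainingArguments` roughly as follows. This is a sketch under assumptions: the output directory, the per-epoch evaluation and saving strategy, treating the batch sizes as per-device values, and the metric key `f1_micro` are not stated in the card and depend on how the training script and its `compute_metrics` function were written.

```python
# Keyword arguments mirroring the hyperparameter table.
training_kwargs = {
    "output_dir": "erc-panels-classifier",  # hypothetical path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,      # assumption: 16 is per device
    "per_device_eval_batch_size": 16,       # assumption: 16 is per device
    "num_train_epochs": 6,
    "weight_decay": 0.01,
    "optim": "adamw_torch",                 # AdamW
    "eval_strategy": "epoch",               # assumption
    "save_strategy": "epoch",               # assumption, must match eval_strategy
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1_micro",    # hypothetical compute_metrics key
}


def make_training_arguments():
    """Build TrainingArguments lazily so the dict can be inspected without transformers."""
    from transformers import TrainingArguments

    return TrainingArguments(**training_kwargs)
```

The `eval_strategy` name applies to recent Transformers releases such as the 4.57.x line listed under framework versions; older releases used `evaluation_strategy`.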
152
 
153
+ ---
154
 
155
+ ## Training results
156
 
157
+ | Epoch | Training Loss | Validation Loss | Micro F1 | Micro ROC-AUC | Subset accuracy |
+ |------|---------------|-----------------|----------|---------------|-----------------|
+ | 1 | 0.2089 | 0.0968 | 0.7576 | 0.8347 | 0.4043 |
+ | 2 | 0.0961 | 0.0713 | 0.8231 | 0.8888 | 0.5171 |
+ | 3 | 0.0719 | 0.0578 | 0.8614 | 0.9209 | 0.5829 |
+ | 4 | 0.0579 | 0.0458 | 0.9072 | 0.9546 | 0.7029 |
+ | 5 | 0.0479 | 0.0390 | 0.9264 | 0.9620 | 0.7614 |
+ | 6 | 0.0407 | 0.0361 | **0.9386** | **0.9718** | **0.7943** |
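The accuracy reported for multilabel outputs is subset accuracy (the card's summary labels the 0.7943 figure that way): a document counts as correct only if its predicted label set matches the true label set exactly. A minimal sketch, with made-up label sets and a hypothetical function name:

```python
from typing import Iterable, Sequence


def subset_accuracy(
    y_true: Sequence[Iterable[str]], y_pred: Sequence[Iterable[str]]
) -> float:
    """Fraction of documents whose predicted label set equals the true label set."""
    matches = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


y_true = [{"Mathematics"}, {"Earth System Science", "Universe Sciences"}, {"Mathematics"}]
y_pred = [{"Mathematics"}, {"Earth System Science"}, {"Mathematics", "Universe Sciences"}]
score = subset_accuracy(y_true, y_pred)
print(score)
```

This strictness explains why subset accuracy sits well below micro F1 in the table: a single missing or extra panel makes the whole document count as wrong.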
165
 
166
+ ---
167
 
168
+ ## Evaluation results (multilabel test set)
169
+
170
+ | Panel | Precision | Recall | F1-score | Support |
+ |------|-----------|--------|----------|---------|
+ | Biotechnology and Biosystems Engineering | 0.88 | 0.70 | 0.78 | 30 |
+ | Cell Biology, Development, Stem Cells and Regeneration | 0.98 | 0.94 | 0.96 | 54 |
+ | Computer Science and Informatics | 0.96 | 0.98 | 0.97 | 95 |
+ | Condensed Matter Physics | 0.97 | 0.99 | 0.98 | 68 |
+ | Earth System Science | 0.94 | 0.98 | 0.96 | 64 |
+ | Environmental Biology, Ecology and Evolution | 0.91 | 0.96 | 0.94 | 54 |
+ | Fundamental Constituents of Matter | 0.97 | 0.94 | 0.95 | 32 |
+ | Human Mobility, Environment, and Space | 0.81 | 0.81 | 0.81 | 21 |
+ | Immunity, Infection and Immunotherapy | 1.00 | 0.97 | 0.99 | 40 |
+ | Individuals, Markets and Organisations | 0.94 | 0.98 | 0.96 | 48 |
+ | Institutions, Governance and Legal Systems | 0.89 | 0.92 | 0.91 | 26 |
+ | Integrative Biology: from Genes and Genomes to Systems | 0.91 | 0.98 | 0.94 | 49 |
+ | Materials Engineering | 0.81 | 0.93 | 0.87 | 75 |
+ | Mathematics | 1.00 | 1.00 | 1.00 | 36 |
+ | Molecules of Life: Biological Mechanisms, Structures and Functions | 0.94 | 0.98 | 0.96 | 111 |
+ | Neuroscience and Disorders of the Nervous System | 1.00 | 1.00 | 1.00 | 30 |
+ | Physical and Analytical Chemical Sciences | 0.89 | 0.93 | 0.91 | 94 |
+ | Physiology in Health, Disease and Ageing | 0.94 | 1.00 | 0.97 | 34 |
+ | Prevention, Diagnosis and Treatment of Human Diseases | 0.97 | 0.96 | 0.96 | 68 |
+ | Products and Processes Engineering | 0.90 | 0.97 | 0.93 | 109 |
+ | Studies of Cultures and Arts | 1.00 | 0.78 | 0.88 | 9 |
+ | Synthetic Chemistry and Materials | 0.82 | 0.77 | 0.79 | 47 |
+ | Systems and Communication Engineering | 0.94 | 0.97 | 0.95 | 87 |
+ | Texts and Concepts | 0.87 | 0.93 | 0.90 | 14 |
+ | The Human Mind and Its Complexity | 1.00 | 0.93 | 0.97 | 30 |
+ | The Social World and Its Interactions | 0.97 | 0.94 | 0.96 | 34 |
+ | The Study of the Human Past | 0.89 | 0.94 | 0.91 | 17 |
+ | Universe Sciences | 1.00 | 1.00 | 1.00 | 25 |
200
+
201
+
202
+ **Overall performance**
+
+ | | Precision | Recall | F1-score | Support |
+ |------|-----------|--------|----------|---------|
+ | **Micro avg** | **0.93** | **0.95** | **0.94** | **1401** |
+ | **Macro avg** | **0.93** | **0.94** | **0.93** | **1401** |
+ | **Weighted avg** | **0.93** | **0.95** | **0.94** | **1401** |
+ | **Samples avg** | **0.93** | **0.94** | **0.93** | **1401** |
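The micro and macro rows differ in how per-panel results are combined: micro averaging pools true positives, false positives, and false negatives across all panels before computing F1, while macro averaging computes F1 per panel and takes the unweighted mean. The sketch below illustrates this with made-up counts for two hypothetical panels; it does not use the model's actual confusion counts, which the card does not report.

```python
from typing import Dict, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) from per-label (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Made-up counts: one frequent panel and one rare panel.
toy_counts = {"frequent panel": (90, 5, 10), "rare panel": (5, 1, 5)}
micro, macro = micro_macro_f1(toy_counts)
print(round(micro, 3), round(macro, 3))
```

In the toy data the rare panel pulls the macro average down much more than the micro average, which is why macro scores are the more sensitive indicator for small panels such as those with single-digit or low double-digit support.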
209
+
210
+ ---
211
 
212
+ ## ERC-funded projects evaluation (multiclass recall)
213
+
214
+ This evaluation uses **ERC-funded projects**, where each project belongs to **exactly one panel**.
215
+ Only **recall** is reported.
216
+
217
+ | Panel | Recall |
+ |------|--------|
+ | Biotechnology and Biosystems Engineering | 0.26 |
+ | Cell Biology, Development, Stem Cells and Regeneration | 0.81 |
+ | Computer Science and Informatics | 1.00 |
+ | Condensed Matter Physics | 0.77 |
+ | Earth System Science | 0.92 |
+ | Environmental Biology, Ecology and Evolution | 0.85 |
+ | Fundamental Constituents of Matter | 0.84 |
+ | Human Mobility, Environment, and Space | 0.61 |
+ | Immunity, Infection and Immunotherapy | 0.83 |
+ | Individuals, Markets and Organisations | 0.96 |
+ | Institutions, Governance and Legal Systems | 0.58 |
+ | Integrative Biology: from Genes and Genomes to Systems | 0.73 |
+ | Materials Engineering | 0.75 |
+ | Mathematics | 0.96 |
+ | Molecules of Life: Biological Mechanisms, Structures and Functions | 0.95 |
+ | Neuroscience and Disorders of the Nervous System | 0.92 |
+ | Physical and Analytical Chemical Sciences | 0.83 |
+ | Physiology in Health, Disease and Ageing | 0.60 |
+ | Prevention, Diagnosis and Treatment of Human Diseases | 0.94 |
+ | Products and Processes Engineering | 0.58 |
+ | Studies of Cultures and Arts | 0.27 |
+ | Synthetic Chemistry and Materials | 0.67 |
+ | Systems and Communication Engineering | 0.75 |
+ | Texts and Concepts | 0.62 |
+ | The Human Mind and Its Complexity | 0.85 |
+ | The Social World and Its Interactions | 0.73 |
+ | The Study of the Human Past | 0.83 |
+ | Universe Sciences | 1.00 |
247
+
248
+ **Overall recall**
250
+
251
+ - **Micro recall:** 0.77
252
+ - **Macro recall:** 0.76
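One way such recall figures could be computed from multilabel predictions is sketched below. The matching rule is an assumption: a project counts as recalled when its single true panel appears among the predicted panels. The card does not state which rule was actually used (for example, top-1 versus thresholded predictions), and the function names and toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple


def per_panel_recall(
    true_panels: Sequence[str], predicted_sets: Sequence[Set[str]]
) -> Dict[str, float]:
    """Recall per panel when each project has exactly one true panel."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for true_panel, predicted in zip(true_panels, predicted_sets):
        totals[true_panel] += 1
        if true_panel in predicted:  # assumed matching rule
            hits[true_panel] += 1
    return {panel: hits[panel] / totals[panel] for panel in totals}


def micro_macro_recall(
    true_panels: Sequence[str], predicted_sets: Sequence[Set[str]]
) -> Tuple[float, float]:
    """Micro recall pools all projects; macro recall averages the per-panel values."""
    recalls = per_panel_recall(true_panels, predicted_sets)
    micro = sum(t in p for t, p in zip(true_panels, predicted_sets)) / len(true_panels)
    macro = sum(recalls.values()) / len(recalls)
    return micro, macro


# Toy data: two projects in Mathematics, four in Universe Sciences.
true_panels = ["Mathematics", "Mathematics"] + ["Universe Sciences"] * 4
predicted_sets = [
    {"Mathematics"},
    {"Universe Sciences"},
    {"Universe Sciences"},
    {"Universe Sciences", "Mathematics"},
    {"Universe Sciences"},
    set(),
]
micro_r, macro_r = micro_macro_recall(true_panels, predicted_sets)
print(round(micro_r, 3), round(macro_r, 3))
```

As a consistency check on the reported numbers, the unweighted mean of the 28 per-panel recalls in the table is about 0.76, which agrees with the macro recall listed above.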
253
+
254
+ ## Citation
255
+
256
+ ```
+ @inproceedings{bovenzi2022mapping,
+   title={Mapping STI ecosystems via Open Data: Overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark},
+   author={Bovenzi, Nicandro and Duran-Silva, Nicolau and Massucci, Francesco Alessandro and Multari, Francesco and Parra-Rojas, C{\'e}sar and Pujol-Llatse, Josep},
+   booktitle={International Conference on Theory and Practice of Digital Libraries (TPDL)},
+   pages={495--499},
+   year={2022},
+   publisher={Springer International Publishing}
+ }
+ ```
266
 
267
+ ---
268
 
269
+ ## Framework versions
270
 
271
+ - **Transformers:** 4.57.x
272
+ - **PyTorch:** 2.8.0
273
+ - **Datasets:** 3.x
274
+ - **Tokenizers:** 0.22.x