---
library_name: transformers
license: apache-2.0
datasets:
- SIRIS-Lab/erc-classification-dataset
base_model:
- allenai/specter2_base
pipeline_tag: text-classification
---

# ERC Panels Classifier

This model is a fine-tuned version of **allenai/specter2_base** for multilabel scientific domain classification aligned with the **ERC panel taxonomy**.
It achieves the following results on the held-out evaluation set after the final epoch (see Training results):

- **Best validation loss:** 0.0361
- **Micro F1:** 0.9386
- **Micro ROC-AUC:** 0.9718
- **Subset accuracy:** 0.7943

---

## Model description

This model is a fine-tuned variant of **SPECTER2** (`allenai/specter2_base`) adapted for **multilabel classification of scientific documents** into ERC research panels.

The model takes as input the **title and abstract** of a scientific publication and predicts **one or more research panels**.
Since scientific outputs may legitimately span multiple domains, the model is trained using **sigmoid activation** with **binary cross-entropy loss**, allowing independent assignment of multiple labels.
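
To make the sigmoid plus binary cross-entropy setup concrete, here is a minimal pure-Python sketch of the per-document loss. It uses a toy 4-label example with made-up logits rather than the 28 panels, and it is an illustration of the arithmetic, not the training code used for this model.

```python
import math


def sigmoid(z: float) -> float:
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over all labels of one document."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)


# Toy example: 4 labels, two of them relevant (values are made up)
logits = [2.0, -1.0, 0.5, -3.0]
targets = [1, 0, 1, 0]

probs = [sigmoid(z) for z in logits]
loss = multilabel_bce(logits, targets)
print([round(p, 3) for p in probs], round(loss, 4))
```

Because each label gets its own sigmoid, the probabilities do not need to sum to 1, which is what allows several panels to be predicted at once. With the default mean reduction, `torch.nn.BCEWithLogitsLoss` computes the same quantity in a numerically more stable form.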

### Key characteristics

- **Base model:** allenai/specter2_base
- **Task:** multilabel document classification
- **Labels:** 28 ERC scientific panels
- **Activation:** sigmoid (independent scores per label)
- **Loss:** BCEWithLogitsLoss
- **Output:** list of predicted panels with associated probabilities
- **Decision threshold:** 0.5 (tunable)

This model enables automatic research-domain tagging aligned with the ERC panel structure.
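
The decision threshold listed above can be applied to per-label scores as in the following sketch. The helper name `select_panels`, the `>=` comparison, and the example scores are assumptions for illustration; the input is assumed to be a list of `{"label", "score"}` dicts, one per panel.

```python
def select_panels(scores, threshold=0.5):
    """Keep labels whose score reaches the threshold, highest score first."""
    kept = [s for s in scores if s["score"] >= threshold]
    return sorted(kept, key=lambda s: s["score"], reverse=True)


# Hypothetical per-label scores for one document (values are made up)
example_scores = [
    {"label": "Environmental Biology, Ecology and Evolution", "score": 0.62},
    {"label": "Earth System Science", "score": 0.91},
    {"label": "Mathematics", "score": 0.03},
]

default_panels = [s["label"] for s in select_panels(example_scores)]
strict_panels = [s["label"] for s in select_panels(example_scores, threshold=0.8)]
print(default_panels)
print(strict_panels)
```

Raising the threshold trades recall for precision, so a value tuned on validation data may suit a given application better than 0.5.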

---

## Intended uses & limitations

### Intended uses

This model is designed for:

- Automatic assignment of ERC research panels
- Metadata enrichment for:
  - research project databases
  - institutional repositories
  - funding and grant analysis pipelines
- Large-scale analytics such as:
  - portfolio mapping
  - thematic analysis of research outputs
  - monitoring disciplinary coverage of funded projects
- Predicting subject areas for documents lacking structured domain metadata

The model supports:

- title only
- abstract only
- **title + abstract (recommended)**

### Limitations

- ERC panels are **high-level categories** and do not represent fine-grained subdisciplines
- Labels are derived from curated datasets and semi-automatically annotated data
- Class imbalance may affect recall for underrepresented panels
- The model does not encode explicit hierarchical relationships between panels

Not suited for:

- fine-grained subfield classification
- journal recommendation
- evaluation of research quality or impact
- clinical, legal, or regulatory decision-making

Predictions should be treated as **supportive metadata**, not authoritative classifications.

---

## How to use

```python
from transformers import pipeline

# Replace with your actual model repo name on the Hugging Face Hub
MODEL_NAME = "nicolauduran45/erc_classifier_demo"

# top_k=None returns a score for every panel; sigmoid matches the multilabel training setup
classifier = pipeline(
    task="text-classification", model=MODEL_NAME, tokenizer=MODEL_NAME,
    top_k=None, function_to_apply="sigmoid",
)

text = ["Climate change impacts on Arctic ecosystems."]

# One list of {"label": ..., "score": ...} dicts per input text
print(classifier(text))
```

---

## Training and evaluation data

### Training data

- Scientific documents with ERC-style panel annotations
- Inputs:
  - title
  - abstract
- Task type: **multilabel classification**

### Dataset characteristics

| Property | Value |
|--------|------|
| Documents | ~40k |
| Labels | 28 panels |
| Input fields | Title, Abstract |
| Task type | Multilabel |
| License | Dataset-dependent |

---

## Training procedure

### Preprocessing

- Input text constructed as:

  `title + ". " + abstract`

- Tokenization using the SPECTER2 tokenizer
- Maximum sequence length: **512 tokens**
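
The input construction above can be sketched as a small helper. The function name `build_input`, the whitespace stripping, and the fallback to a single field when the other is missing are assumptions for illustration; only the `title + ". " + abstract` pattern comes from this card.

```python
def build_input(title=None, abstract=None):
    """Join title and abstract as `title + ". " + abstract`.

    Falls back to whichever field is present when the other is missing.
    """
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    if title and abstract:
        return title + ". " + abstract
    return title or abstract


full_text = build_input(
    "Climate change impacts on Arctic ecosystems",
    "We study shifts in species ranges under warming.",
)
title_only = build_input(title="Climate change impacts on Arctic ecosystems")
print(full_text)
```

The resulting string would then be passed to the tokenizer with `truncation=True` and `max_length=512` so that long abstracts are cut to the 512-token limit.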

### Model

- Base model: `allenai/specter2_base`
- Classification head: linear → sigmoid (during training the sigmoid is folded into BCEWithLogitsLoss, which takes raw logits)
- Loss function: BCEWithLogitsLoss
- Predictions: independent probability per label

### Training hyperparameters

| Hyperparameter | Value |
|--------------|------|
| Learning rate | 2e-5 |
| Train batch size | 16 |
| Eval batch size | 16 |
| Epochs | 6 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Metric for best model | Micro F1 |
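
As a rough guide to reproducing this setup, the table can be written as keyword arguments for `transformers.TrainingArguments`. This is a sketch, not the original training script: the output path and the metric key `micro_f1` are hypothetical, the batch sizes are assumed to be per device, and the per-epoch evaluation and checkpointing settings are assumptions inferred from the per-epoch results reported in this card.

```python
# Keyword arguments mirroring the hyperparameter table. They could be passed
# as transformers.TrainingArguments(**training_kwargs).
training_kwargs = {
    "output_dir": "erc-panels-classifier",  # hypothetical path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,      # assumes a single device
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 6,
    "weight_decay": 0.01,
    "optim": "adamw_torch",                 # AdamW
    "eval_strategy": "epoch",               # assumption: evaluate once per epoch
    "save_strategy": "epoch",               # assumption: same cadence as evaluation
    "load_best_model_at_end": True,         # assumption
    "metric_for_best_model": "micro_f1",    # hypothetical metric key
    "greater_is_better": True,
}

for name, value in training_kwargs.items():
    print(f"{name} = {value!r}")
```

The metric key has to match a name returned by the `compute_metrics` function given to the `Trainer`, so it should be adapted to whatever that function reports.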

---

## Training results

| Epoch | Training Loss | Validation Loss | Micro F1 | Micro ROC-AUC | Subset accuracy |
|------|---------------|-----------------|----------|---------------|-----------------|
| 1 | 0.2089 | 0.0968 | 0.7576 | 0.8347 | 0.4043 |
| 2 | 0.0961 | 0.0713 | 0.8231 | 0.8888 | 0.5171 |
| 3 | 0.0719 | 0.0578 | 0.8614 | 0.9209 | 0.5829 |
| 4 | 0.0579 | 0.0458 | 0.9072 | 0.9546 | 0.7029 |
| 5 | 0.0479 | 0.0390 | 0.9264 | 0.9620 | 0.7614 |
| 6 | 0.0407 | 0.0361 | **0.9386** | **0.9718** | **0.7943** |
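
The accuracy column is consistently lower than micro F1 because subset accuracy only counts a document as correct when its entire predicted label set matches the gold set. A minimal sketch with made-up label sets (the labels `A`, `B`, `C` are placeholders, not ERC panels):

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label set matches exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if set(t) == set(p))
    return exact / len(y_true)


# Made-up gold and predicted label sets for four documents
y_true = [{"A"}, {"A", "B"}, {"C"}, {"B", "C"}]
y_pred = [{"A"}, {"A"}, {"C"}, {"B", "C"}]

acc = subset_accuracy(y_true, y_pred)
print(acc)
```

The second document gets one of its two labels right, which still contributes a true positive to micro F1 but counts as a full miss for subset accuracy.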

---

## Evaluation results (multilabel test set)

| Panel | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| Biotechnology and Biosystems Engineering | 0.88 | 0.70 | 0.78 | 30 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.98 | 0.94 | 0.96 | 54 |
| Computer Science and Informatics | 0.96 | 0.98 | 0.97 | 95 |
| Condensed Matter Physics | 0.97 | 0.99 | 0.98 | 68 |
| Earth System Science | 0.94 | 0.98 | 0.96 | 64 |
| Environmental Biology, Ecology and Evolution | 0.91 | 0.96 | 0.94 | 54 |
| Fundamental Constituents of Matter | 0.97 | 0.94 | 0.95 | 32 |
| Human Mobility, Environment, and Space | 0.81 | 0.81 | 0.81 | 21 |
| Immunity, Infection and Immunotherapy | 1.00 | 0.97 | 0.99 | 40 |
| Individuals, Markets and Organisations | 0.94 | 0.98 | 0.96 | 48 |
| Institutions, Governance and Legal Systems | 0.89 | 0.92 | 0.91 | 26 |
| Integrative Biology: from Genes and Genomes to Systems | 0.91 | 0.98 | 0.94 | 49 |
| Materials Engineering | 0.81 | 0.93 | 0.87 | 75 |
| Mathematics | 1.00 | 1.00 | 1.00 | 36 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.94 | 0.98 | 0.96 | 111 |
| Neuroscience and Disorders of the Nervous System | 1.00 | 1.00 | 1.00 | 30 |
| Physical and Analytical Chemical Sciences | 0.89 | 0.93 | 0.91 | 94 |
| Physiology in Health, Disease and Ageing | 0.94 | 1.00 | 0.97 | 34 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.97 | 0.96 | 0.96 | 68 |
| Products and Processes Engineering | 0.90 | 0.97 | 0.93 | 109 |
| Studies of Cultures and Arts | 1.00 | 0.78 | 0.88 | 9 |
| Synthetic Chemistry and Materials | 0.82 | 0.77 | 0.79 | 47 |
| Systems and Communication Engineering | 0.94 | 0.97 | 0.95 | 87 |
| Texts and Concepts | 0.87 | 0.93 | 0.90 | 14 |
| The Human Mind and Its Complexity | 1.00 | 0.93 | 0.97 | 30 |
| The Social World and Its Interactions | 0.97 | 0.94 | 0.96 | 34 |
| The Study of the Human Past | 0.89 | 0.94 | 0.91 | 17 |
| Universe Sciences | 1.00 | 1.00 | 1.00 | 25 |

**Overall performance**

| | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| **Micro avg** | **0.93** | **0.95** | **0.94** | **1401** |
| **Macro avg** | **0.93** | **0.94** | **0.93** | **1401** |
| **Weighted avg** | **0.93** | **0.95** | **0.94** | **1401** |
| **Samples avg** | **0.93** | **0.94** | **0.93** | **1401** |
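
The difference between the micro and macro rows can be illustrated with a small sketch. Micro averaging pools true positives, false positives and false negatives over all labels before computing F1, while macro averaging computes F1 per label and takes the unweighted mean. The counts below are made up and do not come from the test set above.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps label -> (tp, fp, fn). Returns (micro F1, macro F1)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Made-up counts: one frequent label scored well, one rare label scored poorly
counts = {"frequent": (90, 5, 10), "rare": (5, 0, 5)}
micro, macro = micro_macro_f1(counts)
print(round(micro, 4), round(macro, 4))
```

Micro F1 is dominated by frequent labels, so a gap between the two averages usually points to weaker performance on low-support panels.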

---

## ERC-funded projects evaluation (multiclass recall)

This evaluation uses **ERC-funded projects**, where each project belongs to **exactly one panel**.
Only **recall** is reported.

| Panel | Recall |
|------|--------|
| Biotechnology and Biosystems Engineering | 0.26 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.81 |
| Computer Science and Informatics | 1.00 |
| Condensed Matter Physics | 0.77 |
| Earth System Science | 0.92 |
| Environmental Biology, Ecology and Evolution | 0.85 |
| Fundamental Constituents of Matter | 0.84 |
| Human Mobility, Environment, and Space | 0.61 |
| Immunity, Infection and Immunotherapy | 0.83 |
| Individuals, Markets and Organisations | 0.96 |
| Institutions, Governance and Legal Systems | 0.58 |
| Integrative Biology: from Genes and Genomes to Systems | 0.73 |
| Materials Engineering | 0.75 |
| Mathematics | 0.96 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.95 |
| Neuroscience and Disorders of the Nervous System | 0.92 |
| Physical and Analytical Chemical Sciences | 0.83 |
| Physiology in Health, Disease and Ageing | 0.60 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.94 |
| Products and Processes Engineering | 0.58 |
| Studies of Cultures and Arts | 0.27 |
| Synthetic Chemistry and Materials | 0.67 |
| Systems and Communication Engineering | 0.75 |
| Texts and Concepts | 0.62 |
| The Human Mind and Its Complexity | 0.85 |
| The Social World and Its Interactions | 0.73 |
| The Study of the Human Past | 0.83 |
| Universe Sciences | 1.00 |

**Overall recall**

- **Micro recall:** 0.77
- **Macro recall:** 0.76
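
For readers who want to run a similar single-label check, the sketch below computes per-panel, micro and macro recall. It assumes each project is scored against the model's single top-scoring panel; this card does not state how multilabel outputs were reduced to one prediction, so that choice, the function name and the toy data are all assumptions.

```python
from collections import Counter


def panel_recall(gold, predicted):
    """Per-panel, micro and macro recall for single-label gold annotations."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    per_panel = {panel: hits[panel] / n for panel, n in totals.items()}
    micro = sum(hits.values()) / len(gold)
    macro = sum(per_panel.values()) / len(per_panel)
    return per_panel, micro, macro


# Toy data: gold panel per project and the assumed top-1 prediction
gold = ["Mathematics", "Mathematics", "Mathematics", "Universe Sciences"]
predicted = ["Mathematics", "Mathematics", "Earth System Science", "Universe Sciences"]

per_panel, micro, macro = panel_recall(gold, predicted)
print(per_panel, micro, round(macro, 4))
```

With exactly one gold panel per project, micro recall equals plain accuracy, while macro recall weights every panel equally regardless of how many projects it contains.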

## Citation

```bibtex
@inproceedings{bovenzi2022mapping,
  title={Mapping STI ecosystems via Open Data: Overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark},
  author={Bovenzi, Nicandro and Duran-Silva, Nicolau and Massucci, Francesco Alessandro and Multari, Francesco and Parra-Rojas, C{\'e}sar and Pujol-Llatse, Josep},
  booktitle={International Conference on Theory and Practice of Digital Libraries (TPDL)},
  pages={495--499},
  year={2022},
  publisher={Springer International Publishing}
}
```

---

## Framework versions

- **Transformers:** 4.57.x
- **PyTorch:** 2.8.0
- **Datasets:** 3.x
- **Tokenizers:** 0.22.x