ERC Panels Classifier

This model is a fine-tuned version of allenai/specter2_base for multilabel scientific domain classification aligned with ERC panel taxonomy.
It achieves the following results on the evaluation set:

  • Best validation loss: 0.0361
  • Micro F1: 0.9386
  • Micro ROC-AUC: 0.9718
  • Subset accuracy: 0.7943

Model description

This model is a fine-tuned variant of SPECTER2 (allenai/specter2_base) adapted for multilabel classification of scientific documents into ERC research panels.

The model takes as input the title and abstract of a scientific publication and predicts one or more research panels.
Since scientific outputs may legitimately span multiple domains, the model is trained using sigmoid activation with binary cross-entropy loss, allowing independent assignment of multiple labels.

Key characteristics

  • Base model: allenai/specter2_base
  • Task: multilabel document classification
  • Labels: 28 ERC scientific panels
  • Activation: sigmoid (independent scores per label)
  • Loss: BCEWithLogitsLoss
  • Output: list of predicted panels with associated probabilities
  • Decision threshold: 0.5 (tunable)
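The decision rule above (independent sigmoid scores, 0.5 cut-off) can be sketched in a few lines. This is an illustration only; the logits and the three panel names are dummy values, not model output:

```python
import math

def sigmoid(logit):
    """Map a raw logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-logit))

def decode_panels(logits, labels, threshold=0.5):
    """Return every panel whose sigmoid score clears the threshold."""
    return [(lab, round(sigmoid(x), 3))
            for lab, x in zip(labels, logits)
            if sigmoid(x) >= threshold]

labels = ["Mathematics", "Universe Sciences", "Earth System Science"]
logits = [2.1, -1.3, 0.4]   # dummy logits
print(decode_panels(logits, labels))
# sigmoid scores: 0.891, 0.214, 0.599 -> two panels survive the 0.5 cut
```

Because each label is scored independently, lowering the threshold trades precision for recall without retraining.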

This model enables automatic research-domain tagging aligned with the ERC panel structure.


Intended uses & limitations

Intended uses

This model is designed for:

  • Automatic assignment of ERC research panels
  • Metadata enrichment for:
    • research project databases
    • institutional repositories
    • funding and grant analysis pipelines
  • Large-scale analytics such as:
    • portfolio mapping
    • thematic analysis of research outputs
    • monitoring disciplinary coverage of funded projects
  • Predicting subject areas for documents lacking structured domain metadata

The model supports:

  • title only
  • abstract only
  • title + abstract (recommended)

Limitations

  • ERC panels are high-level categories and do not represent fine-grained subdisciplines
  • Labels are derived from curated and semi-automatically annotated datasets, which may introduce annotation noise
  • Class imbalance may affect recall for underrepresented panels
  • The model does not encode explicit hierarchical relationships between panels

Not suited for:

  • fine-grained subfield classification
  • journal recommendation
  • evaluation of research quality or impact
  • clinical, legal, or regulatory decision-making

Predictions should be treated as supportive metadata, not authoritative classifications.


How to use

from transformers import pipeline

# Replace with your actual model repo name on Hugging Face
MODEL_NAME = "nicolauduran45/erc_classifier_demo"

# top_k=None returns a score for every panel rather than only the single
# best one, which is what a multilabel classifier needs.
classifier = pipeline(
    task="text-classification",
    model=MODEL_NAME,
    tokenizer=MODEL_NAME,
    top_k=None,
)

texts = ["Climate change impacts on Arctic ecosystems."]
classifier(texts)
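With all scores returned, the multilabel prediction is simply every panel above the threshold. A minimal post-processing sketch over a mocked result (the shape matches what a text-classification pipeline with `top_k=None` returns; the scores below are made up):

```python
# Hypothetical pipeline output: one list per input text,
# each a list of {"label", "score"} dicts.
mock_output = [[
    {"label": "Earth System Science", "score": 0.94},
    {"label": "Environmental Biology, Ecology and Evolution", "score": 0.71},
    {"label": "Mathematics", "score": 0.02},
]]

def panels_above_threshold(result, threshold=0.5):
    """Keep every panel whose independent score clears the threshold."""
    return [[d["label"] for d in doc if d["score"] >= threshold]
            for doc in result]

print(panels_above_threshold(mock_output))
# [['Earth System Science', 'Environmental Biology, Ecology and Evolution']]
```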

Training and evaluation data

Training data

  • Scientific documents with ERC-style panel annotations
  • Inputs:
    • title
    • abstract
  • Task type: multilabel classification

Dataset characteristics

  • Documents: ~40k
  • Labels: 28 ERC panels
  • Input fields: title, abstract
  • Task type: multilabel
  • License: dataset-dependent

Training procedure

Preprocessing

  • Input text constructed as:

    title + ". " + abstract

  • Tokenization using the SPECTER2 tokenizer

  • Maximum sequence length: 512 tokens
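The input construction above can be sketched as a small helper (assumed behavior when the abstract is missing; the card only specifies the title + ". " + abstract concatenation):

```python
def build_input(title, abstract=None):
    """Concatenate title and abstract as 'title. abstract'.

    Falls back to the title alone when no abstract is available,
    and avoids a doubled period after the title.
    """
    title = title.strip().rstrip(".")
    return f"{title}. {abstract.strip()}" if abstract else title

print(build_input("Climate change impacts on Arctic ecosystems.",
                  "We study warming-driven shifts in tundra vegetation."))
```

The concatenated string is then tokenized with the SPECTER2 tokenizer and truncated to 512 tokens.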

Model

  • Base model: allenai/specter2_base
  • Classification head: linear → sigmoid
  • Loss function: BCEWithLogitsLoss
  • Predictions: independent probability per label
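BCEWithLogitsLoss treats each panel as an independent binary problem and operates directly on logits for numerical stability. A pure-Python sketch of the per-document loss (the stable form matches PyTorch's documented formulation; the logits and targets are dummy values):

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly on logits.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# One document, three panels: the first two are true labels.
loss = bce_with_logits([2.0, -0.5, -3.0], [1.0, 1.0, 0.0])
print(round(loss, 4))  # prints 0.3832
```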

Training hyperparameters

  • Learning rate: 2e-5
  • Train batch size: 16
  • Eval batch size: 16
  • Epochs: 6
  • Weight decay: 0.01
  • Optimizer: AdamW
  • Metric for best model: micro F1
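Assuming the standard Hugging Face Trainer was used (the card does not say), the hyperparameters above would map onto a TrainingArguments configuration roughly like this; the output directory and metric key are hypothetical:

```python
from transformers import TrainingArguments

# Sketch only: argument names from the Trainer API, values from the table above.
args = TrainingArguments(
    output_dir="erc-classifier",       # hypothetical path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=6,
    weight_decay=0.01,
    metric_for_best_model="micro_f1",  # assumed metric key
    load_best_model_at_end=True,
    eval_strategy="epoch",
    save_strategy="epoch",
)
```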

Training results

| Epoch | Training loss | Validation loss | Micro F1 | ROC-AUC | Accuracy |
|---|---|---|---|---|---|
| 1 | 0.2089 | 0.0968 | 0.7576 | 0.8347 | 0.4043 |
| 2 | 0.0961 | 0.0713 | 0.8231 | 0.8888 | 0.5171 |
| 3 | 0.0719 | 0.0578 | 0.8614 | 0.9209 | 0.5829 |
| 4 | 0.0579 | 0.0458 | 0.9072 | 0.9546 | 0.7029 |
| 5 | 0.0479 | 0.0390 | 0.9264 | 0.9620 | 0.7614 |
| 6 | 0.0407 | 0.0361 | 0.9386 | 0.9718 | 0.7943 |

Evaluation results (multilabel test set)

| Panel | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Biotechnology and Biosystems Engineering | 0.88 | 0.70 | 0.78 | 30 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.98 | 0.94 | 0.96 | 54 |
| Computer Science and Informatics | 0.96 | 0.98 | 0.97 | 95 |
| Condensed Matter Physics | 0.97 | 0.99 | 0.98 | 68 |
| Earth System Science | 0.94 | 0.98 | 0.96 | 64 |
| Environmental Biology, Ecology and Evolution | 0.91 | 0.96 | 0.94 | 54 |
| Fundamental Constituents of Matter | 0.97 | 0.94 | 0.95 | 32 |
| Human Mobility, Environment, and Space | 0.81 | 0.81 | 0.81 | 21 |
| Immunity, Infection and Immunotherapy | 1.00 | 0.97 | 0.99 | 40 |
| Individuals, Markets and Organisations | 0.94 | 0.98 | 0.96 | 48 |
| Institutions, Governance and Legal Systems | 0.89 | 0.92 | 0.91 | 26 |
| Integrative Biology: from Genes and Genomes to Systems | 0.91 | 0.98 | 0.94 | 49 |
| Materials Engineering | 0.81 | 0.93 | 0.87 | 75 |
| Mathematics | 1.00 | 1.00 | 1.00 | 36 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.94 | 0.98 | 0.96 | 111 |
| Neuroscience and Disorders of the Nervous System | 1.00 | 1.00 | 1.00 | 30 |
| Physical and Analytical Chemical Sciences | 0.89 | 0.93 | 0.91 | 94 |
| Physiology in Health, Disease and Ageing | 0.94 | 1.00 | 0.97 | 34 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.97 | 0.96 | 0.96 | 68 |
| Products and Processes Engineering | 0.90 | 0.97 | 0.93 | 109 |
| Studies of Cultures and Arts | 1.00 | 0.78 | 0.88 | 9 |
| Synthetic Chemistry and Materials | 0.82 | 0.77 | 0.79 | 47 |
| Systems and Communication Engineering | 0.94 | 0.97 | 0.95 | 87 |
| Texts and Concepts | 0.87 | 0.93 | 0.90 | 14 |
| The Human Mind and Its Complexity | 1.00 | 0.93 | 0.97 | 30 |
| The Social World and Its Interactions | 0.97 | 0.94 | 0.96 | 34 |
| The Study of the Human Past | 0.89 | 0.94 | 0.91 | 17 |
| Universe Sciences | 1.00 | 1.00 | 1.00 | 25 |

Overall performance

| Average | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Micro avg | 0.93 | 0.95 | 0.94 | 1401 |
| Macro avg | 0.93 | 0.94 | 0.93 | 1401 |
| Weighted avg | 0.93 | 0.95 | 0.94 | 1401 |
| Samples avg | 0.93 | 0.94 | 0.93 | 1401 |

ERC-funded projects evaluation (multiclass recall)

This evaluation uses ERC-funded projects, where each project belongs to exactly one panel.
Only recall is reported.
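Since each project has exactly one true panel, a natural evaluation (assumed here; the card does not detail the procedure) is to take the top-scoring panel per project and compute per-panel recall as the fraction of that panel's projects ranked first. A minimal sketch with dummy scores:

```python
from collections import defaultdict

def per_class_recall(score_rows, true_labels):
    """Per-class recall for single-label items:
    hits / total, counting a hit when the argmax matches the truth."""
    hits, totals = defaultdict(int), defaultdict(int)
    for scores, truth in zip(score_rows, true_labels):
        pred = max(scores, key=scores.get)  # argmax over label->score dict
        totals[truth] += 1
        if pred == truth:
            hits[truth] += 1
    return {lab: hits[lab] / totals[lab] for lab in totals}

scores = [  # dummy per-panel scores for three projects
    {"Mathematics": 0.9, "Universe Sciences": 0.2},
    {"Mathematics": 0.4, "Universe Sciences": 0.7},
    {"Mathematics": 0.3, "Universe Sciences": 0.8},
]
truths = ["Mathematics", "Mathematics", "Universe Sciences"]
print(per_class_recall(scores, truths))
# Mathematics: 1 of 2 correct; Universe Sciences: 1 of 1 correct
```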

| Panel | Recall |
|---|---|
| Biotechnology and Biosystems Engineering | 0.26 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.81 |
| Computer Science and Informatics | 1.00 |
| Condensed Matter Physics | 0.77 |
| Earth System Science | 0.92 |
| Environmental Biology, Ecology and Evolution | 0.85 |
| Fundamental Constituents of Matter | 0.84 |
| Human Mobility, Environment, and Space | 0.61 |
| Immunity, Infection and Immunotherapy | 0.83 |
| Individuals, Markets and Organisations | 0.96 |
| Institutions, Governance and Legal Systems | 0.58 |
| Integrative Biology: from Genes and Genomes to Systems | 0.73 |
| Materials Engineering | 0.75 |
| Mathematics | 0.96 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.95 |
| Neuroscience and Disorders of the Nervous System | 0.92 |
| Physical and Analytical Chemical Sciences | 0.83 |
| Physiology in Health, Disease and Ageing | 0.60 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.94 |
| Products and Processes Engineering | 0.58 |
| Studies of Cultures and Arts | 0.27 |
| Synthetic Chemistry and Materials | 0.67 |
| Systems and Communication Engineering | 0.75 |
| Texts and Concepts | 0.62 |
| The Human Mind and Its Complexity | 0.85 |
| The Social World and Its Interactions | 0.73 |
| The Study of the Human Past | 0.83 |
| Universe Sciences | 1.00 |

Overall recall

  • Micro recall: 0.77
  • Macro recall: 0.76

Citation

@inproceedings{bovenzi2022mapping,
  title={Mapping STI ecosystems via Open Data: Overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark},
  author={Bovenzi, Nicandro and Duran-Silva, Nicolau and Massucci, Francesco Alessandro and Multari, Francesco and Parra-Rojas, C{\'e}sar and Pujol-Llatse, Josep},
  booktitle={International Conference on Theory and Practice of Digital Libraries (TPDL)},
  pages={495--499},
  year={2022},
  publisher={Springer International Publishing}
}

Framework versions

  • Transformers: 4.57.x
  • PyTorch: 2.8.0
  • Datasets: 3.x
  • Tokenizers: 0.22.x
Model size: ~0.1B parameters (F32, safetensors)