---
library_name: transformers
license: apache-2.0
datasets:
- SIRIS-Lab/erc-classification-dataset
base_model:
- allenai/specter2_base
pipeline_tag: text-classification
---

# ERC Panels Classifier

This model is a fine-tuned version of **allenai/specter2_base** for multilabel scientific domain classification aligned with the **ERC panel taxonomy**.

It achieves the following results on the held-out test set:

- **Best validation loss:** 0.0361
- **Micro F1:** 0.9386
- **Micro ROC-AUC:** 0.9718
- **Subset accuracy:** 0.7943

---

## Model description

This model is a fine-tuned variant of **SPECTER2** (`allenai/specter2_base`) adapted for **multilabel classification of scientific documents** into ERC research panels.

The model takes as input the **title and abstract** of a scientific publication and predicts **one or more research panels**. Since scientific outputs may legitimately span multiple domains, the model is trained using **sigmoid activation** with **binary cross-entropy loss**, allowing independent assignment of multiple labels.

### Key characteristics

- **Base model:** allenai/specter2_base
- **Task:** multilabel document classification
- **Labels:** 28 ERC scientific panels
- **Activation:** sigmoid (independent scores per label)
- **Loss:** BCEWithLogitsLoss
- **Output:** list of predicted panels with associated probabilities
- **Decision threshold:** 0.5 (tunable)

This model enables automatic research-domain tagging aligned with the ERC panel structure.
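
The sigmoid activation, binary cross-entropy loss, and decision threshold listed above can be illustrated with a minimal sketch in plain Python. It is not the training code used for this model: the panel subset, the logits, and the helper names `sigmoid`, `decode_multilabel`, and `bce_with_logits` are assumptions made up for illustration.

```python
import math

# Hypothetical subset of panel names and made-up logits, for illustration only.
PANELS = ["Mathematics", "Computer Science and Informatics", "Earth System Science"]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_multilabel(logits, labels, threshold=0.5):
    """Score each label independently and keep those at or above the threshold."""
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    kept = [(label, p) for label, p in scored if p >= threshold]
    # Highest probability first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from logits."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


example_logits = [2.0, -1.0, 0.3]
predicted = decode_multilabel(example_logits, PANELS, threshold=0.5)
loss = bce_with_logits(example_logits, [1.0, 0.0, 1.0])
print(predicted, loss)
```

Because each label is scored independently, raising the threshold trades recall for precision on a per-panel basis without affecting the scores of the other panels.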

---

## Intended uses & limitations

### Intended uses

This model is designed for:

- Automatic assignment of ERC research panels
- Metadata enrichment for:
  - research project databases
  - institutional repositories
  - funding and grant analysis pipelines
- Large-scale analytics such as:
  - portfolio mapping
  - thematic analysis of research outputs
  - monitoring disciplinary coverage of funded projects
- Predicting subject areas for documents lacking structured domain metadata

The model supports:

- title only
- abstract only
- **title + abstract (recommended)**

### Limitations

- ERC panels are **high-level categories** and do not represent fine-grained subdisciplines
- Labels are derived from curated datasets and semi-automatically annotated data
- Class imbalance may affect recall for underrepresented panels
- The model does not encode explicit hierarchical relationships between panels

Not suited for:

- fine-grained subfield classification
- journal recommendation
- evaluation of research quality or impact
- clinical, legal, or regulatory decision-making

Predictions should be treated as **supportive metadata**, not authoritative classifications.

---

## How to use

```
from transformers import pipeline

# Replace with your actual model repo name on the Hugging Face Hub
MODEL_NAME = "nicolauduran45/erc_classifier_demo"

classifier = pipeline(task="text-classification", model=MODEL_NAME, tokenizer=MODEL_NAME)

text = ["Climate change impacts on Arctic ecosystems."]

# By default the pipeline returns only the top-scoring label per input;
# call classifier(text, top_k=None) to get the score of every panel.
predictions = classifier(text)
print(predictions)
```

---

## Training and evaluation data

### Training data

- Scientific documents with ERC-style panel annotations
- Inputs:
  - title
  - abstract
- Task type: **multilabel classification**

### Dataset characteristics

| Property | Value |
|--------|------|
| Documents | ~40k |
| Labels | 28 panels |
| Input fields | Title, Abstract |
| Task type | Multilabel |
| License | Dataset-dependent |

---

## Training procedure

### Preprocessing

- Input text constructed as: `title + ". " + abstract`
- Tokenization using the SPECTER2 tokenizer
- Maximum sequence length: **512 tokens**

### Model

- Base model: `allenai/specter2_base`
- Classification head: linear → sigmoid
- Loss function: BCEWithLogitsLoss
- Predictions: independent probability per label

### Training hyperparameters

| Hyperparameter | Value |
|--------------|------|
| Learning rate | 2e-5 |
| Train batch size | 16 |
| Eval batch size | 16 |
| Epochs | 6 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Metric for best model | Micro F1 |

---

## Training results

| Epoch | Training Loss | Validation Loss | Micro F1 | ROC-AUC | Accuracy |
|------|---------------|-----------------|----------|---------|----------|
| 1 | 0.2089 | 0.0968 | 0.7576 | 0.8347 | 0.4043 |
| 2 | 0.0961 | 0.0713 | 0.8231 | 0.8888 | 0.5171 |
| 3 | 0.0719 | 0.0578 | 0.8614 | 0.9209 | 0.5829 |
| 4 | 0.0579 | 0.0458 | 0.9072 | 0.9546 | 0.7029 |
| 5 | 0.0479 | 0.0390 | 0.9264 | 0.9620 | 0.7614 |
| 6 | 0.0407 | 0.0361 | **0.9386** | **0.9718** | **0.7943** |

---

## Evaluation results (multilabel test set)

| Panel | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| Biotechnology and Biosystems Engineering | 0.88 | 0.70 | 0.78 | 30 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.98 | 0.94 | 0.96 | 54 |
| Computer Science and Informatics | 0.96 | 0.98 | 0.97 | 95 |
| Condensed Matter Physics | 0.97 | 0.99 | 0.98 | 68 |
| Earth System Science | 0.94 | 0.98 | 0.96 | 64 |
| Environmental Biology, Ecology and Evolution | 0.91 | 0.96 | 0.94 | 54 |
| Fundamental Constituents of Matter | 0.97 | 0.94 | 0.95 | 32 |
| Human Mobility, Environment, and Space | 0.81 | 0.81 | 0.81 | 21 |
| Immunity, Infection and Immunotherapy | 1.00 | 0.97 | 0.99 | 40 |
| Individuals, Markets and Organisations | 0.94 | 0.98 | 0.96 | 48 |
| Institutions, Governance and Legal Systems | 0.89 | 0.92 | 0.91 | 26 |
| Integrative Biology: from Genes and Genomes to Systems | 0.91 | 0.98 | 0.94 | 49 |
| Materials Engineering | 0.81 | 0.93 | 0.87 | 75 |
| Mathematics | 1.00 | 1.00 | 1.00 | 36 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.94 | 0.98 | 0.96 | 111 |
| Neuroscience and Disorders of the Nervous System | 1.00 | 1.00 | 1.00 | 30 |
| Physical and Analytical Chemical Sciences | 0.89 | 0.93 | 0.91 | 94 |
| Physiology in Health, Disease and Ageing | 0.94 | 1.00 | 0.97 | 34 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.97 | 0.96 | 0.96 | 68 |
| Products and Processes Engineering | 0.90 | 0.97 | 0.93 | 109 |
| Studies of Cultures and Arts | 1.00 | 0.78 | 0.88 | 9 |
| Synthetic Chemistry and Materials | 0.82 | 0.77 | 0.79 | 47 |
| Systems and Communication Engineering | 0.94 | 0.97 | 0.95 | 87 |
| Texts and Concepts | 0.87 | 0.93 | 0.90 | 14 |
| The Human Mind and Its Complexity | 1.00 | 0.93 | 0.97 | 30 |
| The Social World and Its Interactions | 0.97 | 0.94 | 0.96 | 34 |
| The Study of the Human Past | 0.89 | 0.94 | 0.91 | 17 |
| Universe Sciences | 1.00 | 1.00 | 1.00 | 25 |

**Overall performance**

| | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| **Micro avg** | **0.93** | **0.95** | **0.94** | **1401** |
| **Macro avg** | **0.93** | **0.94** | **0.93** | **1401** |
| **Weighted avg** | **0.93** | **0.95** | **0.94** | **1401** |
| **Samples avg** | **0.93** | **0.94** | **0.93** | **1401** |

---

## ERC-funded projects evaluation (multiclass recall)

This evaluation uses **ERC-funded projects**, where each project belongs to **exactly one panel**. Only **recall** is reported.
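
As a rough illustration of how per-panel, micro, and macro recall relate in a single-label setting like this one, the sketch below scores a toy set of projects. The card does not say how a single panel is chosen from the multilabel scores; taking the top-scoring panel is an assumption here, and the data and the helper names `top_panel` and `recall_report` are hypothetical.

```python
from collections import Counter


def top_panel(scores: dict) -> str:
    """Assumed decision rule: pick the panel with the highest score."""
    return max(scores, key=scores.get)


def recall_report(true_labels, predicted_labels):
    """Return (per-panel recall, micro recall, macro recall)."""
    support = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    per_panel = {panel: hits[panel] / n for panel, n in support.items()}
    micro = sum(hits.values()) / len(true_labels)    # every project counts equally
    macro = sum(per_panel.values()) / len(per_panel)  # every panel counts equally
    return per_panel, micro, macro


# Toy data: four projects, two panels, made-up scores.
true_labels = ["Mathematics", "Mathematics", "Mathematics", "Universe Sciences"]
scores = [
    {"Mathematics": 0.9, "Universe Sciences": 0.2},
    {"Mathematics": 0.7, "Universe Sciences": 0.1},
    {"Mathematics": 0.3, "Universe Sciences": 0.6},
    {"Mathematics": 0.4, "Universe Sciences": 0.8},
]
predicted_labels = [top_panel(s) for s in scores]
per_panel, micro, macro = recall_report(true_labels, predicted_labels)
print(per_panel, micro, macro)
```

Micro recall weights every project equally, so large panels dominate it, while macro recall weights every panel equally, so a weak small panel pulls it down more.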

| Panel | Recall |
|------|--------|
| Biotechnology and Biosystems Engineering | 0.26 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.81 |
| Computer Science and Informatics | 1.00 |
| Condensed Matter Physics | 0.77 |
| Earth System Science | 0.92 |
| Environmental Biology, Ecology and Evolution | 0.85 |
| Fundamental Constituents of Matter | 0.84 |
| Human Mobility, Environment, and Space | 0.61 |
| Immunity, Infection and Immunotherapy | 0.83 |
| Individuals, Markets and Organisations | 0.96 |
| Institutions, Governance and Legal Systems | 0.58 |
| Integrative Biology: from Genes and Genomes to Systems | 0.73 |
| Materials Engineering | 0.75 |
| Mathematics | 0.96 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.95 |
| Neuroscience and Disorders of the Nervous System | 0.92 |
| Physical and Analytical Chemical Sciences | 0.83 |
| Physiology in Health, Disease and Ageing | 0.60 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.94 |
| Products and Processes Engineering | 0.58 |
| Studies of Cultures and Arts | 0.27 |
| Synthetic Chemistry and Materials | 0.67 |
| Systems and Communication Engineering | 0.75 |
| Texts and Concepts | 0.62 |
| The Human Mind and Its Complexity | 0.85 |
| The Social World and Its Interactions | 0.73 |
| The Study of the Human Past | 0.83 |
| Universe Sciences | 1.00 |

**Overall recall**

- **Micro recall:** 0.77
- **Macro recall:** 0.76

---

## Citation

```
@inproceedings{bovenzi2022mapping,
  title={Mapping STI ecosystems via Open Data: Overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark},
  author={Bovenzi, Nicandro and Duran-Silva, Nicolau and Massucci, Francesco Alessandro and Multari, Francesco and Parra-Rojas, C{\'e}sar and Pujol-Llatse, Josep},
  booktitle={International Conference on Theory and Practice of Digital Libraries (TPDL)},
  pages={495--499},
  year={2022},
  publisher={Springer International Publishing}
}
```

---

## Framework versions

- **Transformers:** 4.57.x
- **PyTorch:** 2.8.0
- **Datasets:** 3.x
- **Tokenizers:** 0.22.x
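
For reference, the input construction described under Preprocessing (`title + ". " + abstract`) can be sketched as a small helper. The card states that title-only and abstract-only inputs are supported but not how a missing field is handled, so the fallback below is an assumption, and `build_input` is a hypothetical name.

```python
def build_input(title=None, abstract=None):
    """Join title and abstract as `title + ". " + abstract`.

    Assumption: when one field is missing or empty, the other is used alone.
    """
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    if title and abstract:
        return title + ". " + abstract
    return title or abstract


example = build_input(
    "Climate change impacts on Arctic ecosystems",
    "We study how warming alters species distributions.",
)
print(example)
```

The resulting string can be passed to the classifier as a single text; truncation to the 512-token limit is left to the tokenizer.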