@@ -1,199 +1,274 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
 library_name: transformers
+license: apache-2.0
+datasets:
+- SIRIS-Lab/erc-classification-dataset
+base_model:
+- allenai/specter2_base
+pipeline_tag: text-classification
 ---
+# ERC Panels Classifier
+This model is a fine-tuned version of **allenai/specter2_base** for multilabel scientific domain classification aligned with the **European Research Council (ERC) panel taxonomy**.
+It achieves the following results on the evaluation set (epoch 6):
+- **Best validation loss:** 0.0361
+- **Micro F1:** 0.9386
+- **Micro ROC-AUC:** 0.9718
+- **Subset accuracy:** 0.7943
+---
+## Model description
+This model is a fine-tuned variant of **SPECTER2** (`allenai/specter2_base`) adapted for **multilabel classification of scientific documents** into ERC research panels.
+The model takes as input the **title and abstract** of a scientific publication and predicts **one or more research panels**.
+Since scientific outputs may legitimately span multiple domains, the model is trained using **sigmoid activation** with **binary cross-entropy loss**, allowing independent assignment of multiple labels.
+### Key characteristics
+- **Base model:** allenai/specter2_base
+- **Task:** multilabel document classification
+- **Labels:** 28 ERC scientific panels
+- **Activation:** sigmoid (independent scores per label)
+- **Loss:** BCEWithLogitsLoss
+- **Output:** list of predicted panels with associated probabilities
+- **Decision threshold:** 0.5 (tunable)
+This model enables automatic research-domain tagging aligned with the ERC panel structure.
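As a rough illustration of the sigmoid-plus-threshold decision rule listed above, the sketch below turns raw logits into per-panel probabilities and keeps those at or above 0.5. The function names (`sigmoid`, `select_panels`) and the logit values are hypothetical and chosen only for illustration; the actual model produces 28 logits per document.

```python
import math

def sigmoid(x):
    """Logistic function mapping a raw logit to a probability in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form that avoids overflow for large negative logits.
    e = math.exp(x)
    return e / (1.0 + e)

def select_panels(logits, labels, threshold=0.5):
    """Return (label, probability) pairs whose probability reaches the threshold."""
    probs = [sigmoid(z) for z in logits]
    picked = [(lab, p) for lab, p in zip(labels, probs) if p >= threshold]
    # Sort by descending probability so the strongest panel comes first.
    return sorted(picked, key=lambda pair: pair[1], reverse=True)

# Hypothetical logits for three of the 28 panels (illustrative values only).
labels = [
    "Earth System Science",
    "Mathematics",
    "Environmental Biology, Ecology and Evolution",
]
logits = [2.0, -3.0, 0.4]
predicted = select_panels(logits, labels)
print(predicted)
```

Because each label is scored independently, a document can receive several panels or none. Raising the threshold trades recall for precision.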
+---
+## Intended uses & limitations
+### Intended uses
+This model is designed for:
+- Automatic assignment of ERC research panels
+- Metadata enrichment for:
+  - research project databases
+  - institutional repositories
+  - funding and grant analysis pipelines
+- Large-scale analytics such as:
+  - portfolio mapping
+  - thematic analysis of research outputs
+  - monitoring disciplinary coverage of funded projects
+- Predicting subject areas for documents lacking structured domain metadata
+The model supports:
+- title only
+- abstract only
+- **title + abstract (recommended)**
+### Limitations
+- ERC panels are **high-level categories** and do not represent fine-grained subdisciplines
+- Labels are derived from curated datasets and semi-automatically annotated data
+- Class imbalance may affect recall for underrepresented panels
+- The model does not encode explicit hierarchical relationships between panels
+Not suited for:
+- fine-grained subfield classification
+- journal recommendation
+- evaluation of research quality or impact
+- clinical, legal, or regulatory decision-making
+Predictions should be treated as **supportive metadata**, not authoritative classifications.
+---
+## How to use
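+```python
+from transformers import pipeline
+# Model repository on the Hugging Face Hub
+MODEL_NAME = "nicolauduran45/erc_classifier_demo"
+classifier = pipeline(task="text-classification", model=MODEL_NAME, tokenizer=MODEL_NAME)
+text = ["Climate change impacts on Arctic ecosystems."]
+# top_k=None returns a score for every panel instead of only the top one;
+# sigmoid yields independent per-label probabilities, matching the multilabel setup
+scores = classifier(text, top_k=None, function_to_apply="sigmoid")
+print(scores)
+```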
+---
+## Training and evaluation data
+### Training data
+- Scientific documents with ERC-style panel annotations
+- Inputs:
+  - title
+  - abstract
+- Task type: **multilabel classification**
+### Dataset characteristics
+| Property | Value |
+|--------|------|
+| Documents | ~40k |
+| Labels | 28 panels |
+| Input fields | Title, Abstract |
+| Task type | Multilabel |
+| License | Dataset-dependent |
+---
+## Training procedure
+### Preprocessing
+- Input text constructed as:
+  `title + ". " + abstract`
+- Tokenization using the SPECTER2 tokenizer
+- Maximum sequence length: **512 tokens**
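The input construction described above can be sketched as a small helper. The name `build_input` is hypothetical, and the fallback to a single field when the other is missing is an assumption; the card only states the `title + ". " + abstract` format for the combined case.

```python
def build_input(title=None, abstract=None):
    """Join title and abstract as `title + ". " + abstract`.

    Assumption: if only one field is available, it is used on its own.
    """
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    if title and abstract:
        return title + ". " + abstract
    return title or abstract

# Illustrative record; the abstract text is made up for this example.
text = build_input(
    "Climate change impacts on Arctic ecosystems",
    "We review observed shifts in species ranges across the Arctic.",
)
print(text)

# Truncation to 512 tokens would then be left to the tokenizer, e.g. by
# calling it with truncation=True and max_length=512.
```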
+### Model
+- Base model: `allenai/specter2_base`
+- Classification head: linear → sigmoid
+- Loss function: BCEWithLogitsLoss
+- Predictions: independent probability per label
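For readers unfamiliar with the loss, here is a minimal pure-Python sketch of binary cross-entropy computed directly from logits, averaged over labels. It is meant to mirror the default mean reduction of `torch.nn.BCEWithLogitsLoss`; it is not the training code used for this model, and the function name `bce_with_logits` is illustrative.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# A logit of 0 means probability 0.5, so each label contributes log(2).
print(bce_with_logits([0.0, 0.0], [1, 0]))
```

Each label contributes its own term, which is why the model can assign several panels to one document without the scores competing as they would under softmax.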
+### Training hyperparameters
+| Hyperparameter | Value |
+|--------------|------|
+| Learning rate | 2e-5 |
+| Train batch size | 16 |
+| Eval batch size | 16 |
+| Epochs | 6 |
+| Weight decay | 0.01 |
+| Optimizer | AdamW |
+| Metric for best model | Micro F1 |
+---
+## Training results
+| Epoch | Training Loss | Validation Loss | Micro F1 | Micro ROC-AUC | Subset accuracy |
+|------|---------------|-----------------|----------|---------|----------|
+| 1 | 0.2089 | 0.0968 | 0.7576 | 0.8347 | 0.4043 |
+| 2 | 0.0961 | 0.0713 | 0.8231 | 0.8888 | 0.5171 |
+| 3 | 0.0719 | 0.0578 | 0.8614 | 0.9209 | 0.5829 |
+| 4 | 0.0579 | 0.0458 | 0.9072 | 0.9546 | 0.7029 |
+| 5 | 0.0479 | 0.0390 | 0.9264 | 0.9620 | 0.7614 |
+| 6 | 0.0407 | 0.0361 | **0.9386** | **0.9718** | **0.7943** |
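The gap between micro F1 and subset accuracy in the table follows from how the two metrics are defined. The sketch below is a plain-Python illustration with made-up label matrices, not the evaluation script used for this model: micro F1 pools decisions over all labels, while subset accuracy counts a document as correct only if every label matches.

```python
def micro_f1(y_true, y_pred):
    """Micro F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def subset_accuracy(y_true, y_pred):
    """Fraction of documents whose full predicted label set matches exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)

# Toy data: 3 documents, 3 labels; one label is missed in the last document.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]
print(micro_f1(y_true, y_pred), subset_accuracy(y_true, y_pred))
```

A single missed label lowers micro F1 only slightly but makes the whole document count as wrong for subset accuracy, so the latter is always the stricter number.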
+---
+## Evaluation results (multilabel test set)
+| Panel | Precision | Recall | F1-score | Support |
+|------|-----------|--------|----------|---------|
+| Biotechnology and Biosystems Engineering | 0.88 | 0.70 | 0.78 | 30 |
+| Cell Biology, Development, Stem Cells and Regeneration | 0.98 | 0.94 | 0.96 | 54 |
+| Computer Science and Informatics | 0.96 | 0.98 | 0.97 | 95 |
+| Condensed Matter Physics | 0.97 | 0.99 | 0.98 | 68 |
+| Earth System Science | 0.94 | 0.98 | 0.96 | 64 |
+| Environmental Biology, Ecology and Evolution | 0.91 | 0.96 | 0.94 | 54 |
+| Fundamental Constituents of Matter | 0.97 | 0.94 | 0.95 | 32 |
+| Human Mobility, Environment, and Space | 0.81 | 0.81 | 0.81 | 21 |
+| Immunity, Infection and Immunotherapy | 1.00 | 0.97 | 0.99 | 40 |
+| Individuals, Markets and Organisations | 0.94 | 0.98 | 0.96 | 48 |
+| Institutions, Governance and Legal Systems | 0.89 | 0.92 | 0.91 | 26 |
+| Integrative Biology: from Genes and Genomes to Systems | 0.91 | 0.98 | 0.94 | 49 |
+| Materials Engineering | 0.81 | 0.93 | 0.87 | 75 |
+| Mathematics | 1.00 | 1.00 | 1.00 | 36 |
+| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.94 | 0.98 | 0.96 | 111 |
+| Neuroscience and Disorders of the Nervous System | 1.00 | 1.00 | 1.00 | 30 |
+| Physical and Analytical Chemical Sciences | 0.89 | 0.93 | 0.91 | 94 |
+| Physiology in Health, Disease and Ageing | 0.94 | 1.00 | 0.97 | 34 |
+| Prevention, Diagnosis and Treatment of Human Diseases | 0.97 | 0.96 | 0.96 | 68 |
+| Products and Processes Engineering | 0.90 | 0.97 | 0.93 | 109 |
+| Studies of Cultures and Arts | 1.00 | 0.78 | 0.88 | 9 |
+| Synthetic Chemistry and Materials | 0.82 | 0.77 | 0.79 | 47 |
+| Systems and Communication Engineering | 0.94 | 0.97 | 0.95 | 87 |
+| Texts and Concepts | 0.87 | 0.93 | 0.90 | 14 |
+| The Human Mind and Its Complexity | 1.00 | 0.93 | 0.97 | 30 |
+| The Social World and Its Interactions | 0.97 | 0.94 | 0.96 | 34 |
+| The Study of the Human Past | 0.89 | 0.94 | 0.91 | 17 |
+| Universe Sciences | 1.00 | 1.00 | 1.00 | 25 |
+**Overall performance**
+|  | Precision | Recall | F1-score | Support |
+|------|-----------|--------|----------|---------|
+| **Micro avg** | **0.93** | **0.95** | **0.94** | **1401** |
+| **Macro avg** | **0.93** | **0.94** | **0.93** | **1401** |
+| **Weighted avg** | **0.93** | **0.95** | **0.94** | **1401** |
+| **Samples avg** | **0.93** | **0.94** | **0.93** | **1401** |
+---
+## ERC-funded projects evaluation (multiclass recall)
+This evaluation uses **ERC-funded projects**, where each project belongs to **exactly one panel**.
+Only **recall** is reported.
+| Panel | Recall |
+|------|--------|
+| Biotechnology and Biosystems Engineering | 0.26 |
+| Cell Biology, Development, Stem Cells and Regeneration | 0.81 |
+| Computer Science and Informatics | 1.00 |
+| Condensed Matter Physics | 0.77 |
+| Earth System Science | 0.92 |
+| Environmental Biology, Ecology and Evolution | 0.85 |
+| Fundamental Constituents of Matter | 0.84 |
+| Human Mobility, Environment, and Space | 0.61 |
+| Immunity, Infection and Immunotherapy | 0.83 |
+| Individuals, Markets and Organisations | 0.96 |
+| Institutions, Governance and Legal Systems | 0.58 |
+| Integrative Biology: from Genes and Genomes to Systems | 0.73 |
+| Materials Engineering | 0.75 |
+| Mathematics | 0.96 |
+| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.95 |
+| Neuroscience and Disorders of the Nervous System | 0.92 |
+| Physical and Analytical Chemical Sciences | 0.83 |
+| Physiology in Health, Disease and Ageing | 0.60 |
+| Prevention, Diagnosis and Treatment of Human Diseases | 0.94 |
+| Products and Processes Engineering | 0.58 |
+| Studies of Cultures and Arts | 0.27 |
+| Synthetic Chemistry and Materials | 0.67 |
+| Systems and Communication Engineering | 0.75 |
+| Texts and Concepts | 0.62 |
+| The Human Mind and Its Complexity | 0.85 |
+| The Social World and Its Interactions | 0.73 |
+| The Study of the Human Past | 0.83 |
+| Universe Sciences | 1.00 |
+**Overall recall**
+- **Micro recall:** 0.77
+- **Macro recall:** 0.76
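To make the difference between the two overall figures concrete, the following sketch computes per-panel, micro, and macro recall for a single-label evaluation. It assumes one predicted panel per project is already available; how that panel is chosen from the multilabel scores (for example, the highest-scoring one) is not stated in this card. The function name and toy data are illustrative only.

```python
from collections import Counter

def recall_report(true_labels, pred_labels):
    """Return (per-panel recall, micro recall, macro recall)."""
    support = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, pred_labels) if t == p)
    per_panel = {panel: hits[panel] / n for panel, n in support.items()}
    # Micro recall pools all projects; macro recall averages panels equally.
    micro = sum(hits.values()) / len(true_labels)
    macro = sum(per_panel.values()) / len(per_panel)
    return per_panel, micro, macro

# Toy data: four Mathematics projects and one Universe Sciences project.
true_labels = ["Mathematics"] * 4 + ["Universe Sciences"]
pred_labels = ["Mathematics"] * 3 + ["Earth System Science", "Universe Sciences"]
per_panel, micro, macro = recall_report(true_labels, pred_labels)
print(per_panel, micro, macro)
```

Macro recall weights small panels as heavily as large ones, so weak recall on sparsely represented panels pulls it down more than it does micro recall.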
+## Citation
+```
+@inproceedings{bovenzi2022mapping,
+  title={Mapping STI ecosystems via Open Data: Overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark},
+  author={Bovenzi, Nicandro and Duran-Silva, Nicolau and Massucci, Francesco Alessandro and Multari, Francesco and Parra-Rojas, C{\'e}sar and Pujol-Llatse, Josep},
+  booktitle={International Conference on Theory and Practice of Digital Libraries (TPDL)},
+  pages={495--499},
+  year={2022},
+  publisher={Springer International Publishing}
+}
+```
+---
+## Framework versions
+- **Transformers:** 4.57.x
+- **PyTorch:** 2.8.0
+- **Datasets:** 3.x
+- **Tokenizers:** 0.22.x