	---
	library_name: transformers
	license: apache-2.0
	datasets:
	- SIRIS-Lab/erc-classification-dataset
	base_model:
	- allenai/specter2_base
	pipeline_tag: text-classification
	---

	# ERC Panels Classifier

	This model is a fine-tuned version of `allenai/specter2_base` for multilabel scientific domain classification aligned with the ERC panel taxonomy.
	It achieves the following results on the evaluation set at the final epoch (epoch 6):

	- Best validation loss: 0.0361
	- Micro F1: 0.9386
	- Micro ROC-AUC: 0.9718
	- Subset accuracy: 0.7943

	---

	## Model description

	This model is a fine-tuned variant of SPECTER2 (`allenai/specter2_base`) adapted for multilabel classification of scientific documents into ERC research panels.

	The model takes as input the title and abstract of a scientific publication and predicts one or more research panels.
	Since scientific outputs may legitimately span multiple domains, the model is trained using sigmoid activation with binary cross-entropy loss, allowing independent assignment of multiple labels.
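
As a minimal sketch of the objective described above (not the training code used for this model), the following pure-Python snippet computes independent sigmoid scores and the mean binary cross-entropy for a single document. The three-label setup and the logit values are hypothetical; the actual model has 28 labels and relies on PyTorch's `BCEWithLogitsLoss`.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits: list[float], targets: list[int]) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


# Hypothetical logits for three labels; the document belongs to labels 0 and 2.
logits = [2.0, -1.5, 0.3]
targets = [1, 0, 1]

probs = [sigmoid(z) for z in logits]
loss = bce_with_logits(logits, targets)
print([round(p, 3) for p in probs], round(loss, 4))
```

Unlike a softmax, the sigmoid scores do not have to sum to 1, which is what lets several panels be active for the same document.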

	### Key characteristics

	- Base model: allenai/specter2_base
	- Task: multilabel document classification
	- Labels: 28 ERC scientific panels
	- Activation: sigmoid (independent scores per label)
	- Loss: BCEWithLogitsLoss
	- Output: list of predicted panels with associated probabilities
	- Decision threshold: 0.5 (tunable)
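
The decision threshold can be applied as a simple post-processing step. The sketch below assumes per-label scores arrive as a list of `{"label", "score"}` dictionaries (the shape returned by the `transformers` text-classification pipeline when all scores are requested); the helper name `select_panels` and the example scores are hypothetical.

```python
def select_panels(scores, threshold=0.5):
    """Return (label, score) pairs at or above the threshold, highest first."""
    kept = [(s["label"], s["score"]) for s in scores if s["score"] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-label scores for one document (not real model output).
example_scores = [
    {"label": "Earth System Science", "score": 0.91},
    {"label": "Environmental Biology, Ecology and Evolution", "score": 0.62},
    {"label": "Mathematics", "score": 0.03},
]

print(select_panels(example_scores))                 # default threshold of 0.5
print(select_panels(example_scores, threshold=0.8))  # stricter threshold
```

Raising the threshold generally favours precision, while lowering it favours recall, which may be worth considering for underrepresented panels.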

	This model enables automatic research-domain tagging aligned with the ERC panel structure.

	---

	## Intended uses & limitations

	### Intended uses

	This model is designed for:

	- Automatic assignment of ERC research panels
	- Metadata enrichment for:
	  - research project databases
	  - institutional repositories
	  - funding and grant analysis pipelines
	- Large-scale analytics such as:
	  - portfolio mapping
	  - thematic analysis of research outputs
	  - monitoring disciplinary coverage of funded projects
	- Predicting subject areas for documents lacking structured domain metadata

	The model supports:

	- title only
	- abstract only
	- title + abstract (recommended)

	### Limitations

	- ERC panels are high-level categories and do not represent fine-grained subdisciplines
	- Labels are derived from a mix of curated datasets and semi-automatically annotated data
	- Class imbalance may affect recall for underrepresented panels
	- The model does not encode explicit hierarchical relationships between panels

	Not suited for:

	- fine-grained subfield classification
	- journal recommendation
	- evaluation of research quality or impact
	- clinical, legal, or regulatory decision-making

	Predictions should be treated as supportive metadata, not authoritative classifications.

	---

	## How to use

	```
	from transformers import pipeline

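	# Replace with your actual model repo name on Hugging Face
	MODEL_NAME = "nicolauduran45/erc_classifier_demo"

	# top_k=None returns a score for every panel instead of only the top one;
	# function_to_apply="sigmoid" keeps the scores independent per label.
	classifier = pipeline(
	    task="text-classification",
	    model=MODEL_NAME,
	    tokenizer=MODEL_NAME,
	    top_k=None,
	    function_to_apply="sigmoid",
	)

	text = ["Climate change impacts on Arctic ecosystems."]

	# One list of {"label": ..., "score": ...} dicts per input text
	print(classifier(text))
	```

	---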

	## Training and evaluation data

	### Training data

	- Scientific documents with ERC-style panel annotations
	- Inputs:
	  - title
	  - abstract
	- Task type: multilabel classification

	### Dataset characteristics

	| Property | Value |
	|----------|-------|
	| Documents | ~40k |
	| Labels | 28 panels |
	| Input fields | Title, Abstract |
	| Task type | Multilabel |
	| License | Dataset-dependent |

	---

	## Training procedure

	### Preprocessing

	- Input text constructed as:

	`title + ". " + abstract`

	- Tokenization using the SPECTER2 tokenizer
	- Maximum sequence length: 512 tokens
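
The input construction above can be sketched as a small helper. The function name `build_input` and the fallback to a single field when the other is missing are assumptions made for illustration; only the `title + ". " + abstract` pattern comes from this card.

```python
def build_input(title: str = "", abstract: str = "") -> str:
    """Join title and abstract as title + ". " + abstract, skipping empty parts."""
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    if title and abstract:
        return title + ". " + abstract
    # Assumption: with only one field available, use it on its own.
    return title or abstract


# Hypothetical example record.
print(build_input("Arctic ecosystems under climate change",
                  "We analyse long-term vegetation records."))
print(build_input(title="Arctic ecosystems under climate change"))
```

Truncation to 512 tokens would then be left to the tokenizer, for example by passing `truncation=True` and `max_length=512` when encoding the text.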

	### Model

	- Base model: `allenai/specter2_base`
	- Classification head: linear → sigmoid
	- Loss function: BCEWithLogitsLoss
	- Predictions: independent probability per label

	### Training hyperparameters

	| Hyperparameter | Value |
	|----------------|-------|
	| Learning rate | 2e-5 |
	| Train batch size | 16 |
	| Eval batch size | 16 |
	| Epochs | 6 |
	| Weight decay | 0.01 |
	| Optimizer | AdamW |
	| Metric for best model | Micro F1 |
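
For readers who want to reproduce a similar setup, the table could be expressed as keyword arguments along these lines. This is a sketch, not the original training script: the key names follow `transformers.TrainingArguments`, while the metric key `micro_f1`, the per-device interpretation of the batch sizes, and the evaluation and checkpointing settings are assumptions not stated in this card.

```python
# Keyword arguments mirroring the hyperparameter table; names follow
# transformers.TrainingArguments. Entries marked "assumption" are not in the card.
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,   # assumption: batch size is per device
    "per_device_eval_batch_size": 16,    # assumption: batch size is per device
    "num_train_epochs": 6,
    "weight_decay": 0.01,
    "optim": "adamw_torch",              # AdamW
    "metric_for_best_model": "micro_f1", # hypothetical key from compute_metrics
    "greater_is_better": True,
    "load_best_model_at_end": True,      # assumption: used for best-model selection
    "eval_strategy": "epoch",            # assumption: evaluate once per epoch
    "save_strategy": "epoch",            # assumption: kept in sync with eval_strategy
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

These could be unpacked into `TrainingArguments` together with an `output_dir` of your choice.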

	---

	## Training results

	| Epoch | Training Loss | Validation Loss | Micro F1 | Micro ROC-AUC | Subset Accuracy |
	|-------|---------------|-----------------|----------|---------------|-----------------|
	| 1 | 0.2089 | 0.0968 | 0.7576 | 0.8347 | 0.4043 |
	| 2 | 0.0961 | 0.0713 | 0.8231 | 0.8888 | 0.5171 |
	| 3 | 0.0719 | 0.0578 | 0.8614 | 0.9209 | 0.5829 |
	| 4 | 0.0579 | 0.0458 | 0.9072 | 0.9546 | 0.7029 |
	| 5 | 0.0479 | 0.0390 | 0.9264 | 0.9620 | 0.7614 |
	| 6 | 0.0407 | 0.0361 | 0.9386 | 0.9718 | 0.7943 |

	---

	## Evaluation results (multilabel test set)

	| Panel | Precision | Recall | F1-score | Support |
	|-------|-----------|--------|----------|---------|
	| Biotechnology and Biosystems Engineering | 0.88 | 0.70 | 0.78 | 30 |
	| Cell Biology, Development, Stem Cells and Regeneration | 0.98 | 0.94 | 0.96 | 54 |
	| Computer Science and Informatics | 0.96 | 0.98 | 0.97 | 95 |
	| Condensed Matter Physics | 0.97 | 0.99 | 0.98 | 68 |
	| Earth System Science | 0.94 | 0.98 | 0.96 | 64 |
	| Environmental Biology, Ecology and Evolution | 0.91 | 0.96 | 0.94 | 54 |
	| Fundamental Constituents of Matter | 0.97 | 0.94 | 0.95 | 32 |
	| Human Mobility, Environment, and Space | 0.81 | 0.81 | 0.81 | 21 |
	| Immunity, Infection and Immunotherapy | 1.00 | 0.97 | 0.99 | 40 |
	| Individuals, Markets and Organisations | 0.94 | 0.98 | 0.96 | 48 |
	| Institutions, Governance and Legal Systems | 0.89 | 0.92 | 0.91 | 26 |
	| Integrative Biology: from Genes and Genomes to Systems | 0.91 | 0.98 | 0.94 | 49 |
	| Materials Engineering | 0.81 | 0.93 | 0.87 | 75 |
	| Mathematics | 1.00 | 1.00 | 1.00 | 36 |
	| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.94 | 0.98 | 0.96 | 111 |
	| Neuroscience and Disorders of the Nervous System | 1.00 | 1.00 | 1.00 | 30 |
	| Physical and Analytical Chemical Sciences | 0.89 | 0.93 | 0.91 | 94 |
	| Physiology in Health, Disease and Ageing | 0.94 | 1.00 | 0.97 | 34 |
	| Prevention, Diagnosis and Treatment of Human Diseases | 0.97 | 0.96 | 0.96 | 68 |
	| Products and Processes Engineering | 0.90 | 0.97 | 0.93 | 109 |
	| Studies of Cultures and Arts | 1.00 | 0.78 | 0.88 | 9 |
	| Synthetic Chemistry and Materials | 0.82 | 0.77 | 0.79 | 47 |
	| Systems and Communication Engineering | 0.94 | 0.97 | 0.95 | 87 |
	| Texts and Concepts | 0.87 | 0.93 | 0.90 | 14 |
	| The Human Mind and Its Complexity | 1.00 | 0.93 | 0.97 | 30 |
	| The Social World and Its Interactions | 0.97 | 0.94 | 0.96 | 34 |
	| The Study of the Human Past | 0.89 | 0.94 | 0.91 | 17 |
	| Universe Sciences | 1.00 | 1.00 | 1.00 | 25 |


	Overall performance

	| Average | Precision | Recall | F1-score | Support |
	|---------|-----------|--------|----------|---------|
	| Micro avg | 0.93 | 0.95 | 0.94 | 1401 |
	| Macro avg | 0.93 | 0.94 | 0.93 | 1401 |
	| Weighted avg | 0.93 | 0.95 | 0.94 | 1401 |
	| Samples avg | 0.93 | 0.94 | 0.93 | 1401 |
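
To make the averaging schemes concrete, here is a small pure-Python sketch of micro F1, macro F1 and subset accuracy on multi-hot label matrices. The function names and the toy data are hypothetical and are not taken from the evaluation code behind the numbers above.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def multilabel_report(y_true: list[list[int]], y_pred: list[list[int]]) -> dict:
    """Micro F1, macro F1 and subset accuracy for multi-hot label matrices."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # tp, fp, fn per label
    for truth, pred in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(truth, pred)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    # Micro: pool the counts over all labels, then compute a single score.
    micro_f1 = prf(*(sum(c[k] for c in counts) for k in range(3)))[2]
    # Macro: compute F1 per label, then take the unweighted mean.
    macro_f1 = sum(prf(*c)[2] for c in counts) / n_labels
    # Subset accuracy: the full label vector must match exactly.
    subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"micro_f1": micro_f1, "macro_f1": macro_f1, "subset_accuracy": subset_acc}


# Hypothetical toy data: 4 documents, 3 labels.
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1]]
print(multilabel_report(y_true, y_pred))
```

Because subset accuracy requires every label of a document to be correct at once, it is expected to sit well below micro F1, as it does in the results reported in this card.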

	---

	## ERC-funded projects evaluation (multiclass recall)

	This evaluation uses ERC-funded projects, where each project belongs to exactly one panel.
	Only recall is reported.

	| Panel | Recall |
	|-------|--------|
	| Biotechnology and Biosystems Engineering | 0.26 |
	| Cell Biology, Development, Stem Cells and Regeneration | 0.81 |
	| Computer Science and Informatics | 1.00 |
	| Condensed Matter Physics | 0.77 |
	| Earth System Science | 0.92 |
	| Environmental Biology, Ecology and Evolution | 0.85 |
	| Fundamental Constituents of Matter | 0.84 |
	| Human Mobility, Environment, and Space | 0.61 |
	| Immunity, Infection and Immunotherapy | 0.83 |
	| Individuals, Markets and Organisations | 0.96 |
	| Institutions, Governance and Legal Systems | 0.58 |
	| Integrative Biology: from Genes and Genomes to Systems | 0.73 |
	| Materials Engineering | 0.75 |
	| Mathematics | 0.96 |
	| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.95 |
	| Neuroscience and Disorders of the Nervous System | 0.92 |
	| Physical and Analytical Chemical Sciences | 0.83 |
	| Physiology in Health, Disease and Ageing | 0.60 |
	| Prevention, Diagnosis and Treatment of Human Diseases | 0.94 |
	| Products and Processes Engineering | 0.58 |
	| Studies of Cultures and Arts | 0.27 |
	| Synthetic Chemistry and Materials | 0.67 |
	| Systems and Communication Engineering | 0.75 |
	| Texts and Concepts | 0.62 |
	| The Human Mind and Its Complexity | 0.85 |
	| The Social World and Its Interactions | 0.73 |
	| The Study of the Human Past | 0.83 |
	| Universe Sciences | 1.00 |

	Overall recall

	- Micro recall: 0.77
	- Macro recall: 0.76
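
One way such recall figures can be computed when each project has a single gold panel is sketched below. It assumes that a project counts as a hit whenever its gold panel appears among the predicted panels; this card does not state the exact matching rule, so treat that as an assumption. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def recall_single_gold(gold, predicted):
    """Per-panel, micro and macro recall when each item has one gold panel.

    Assumption: an item is a hit if its gold panel is among the predicted panels.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for panel, preds in zip(gold, predicted):
        totals[panel] += 1
        hits[panel] += panel in preds
    per_panel = {panel: hits[panel] / totals[panel] for panel in totals}
    micro = sum(hits.values()) / sum(totals.values())  # pooled over all projects
    macro = sum(per_panel.values()) / len(per_panel)   # unweighted mean over panels
    return per_panel, micro, macro


# Hypothetical toy data: gold panel per project and predicted panel sets.
gold = ["Mathematics", "Mathematics", "Universe Sciences", "Texts and Concepts"]
predicted = [
    {"Mathematics"},
    {"Computer Science and Informatics"},
    {"Universe Sciences", "Fundamental Constituents of Matter"},
    set(),
]
per_panel, micro, macro = recall_single_gold(gold, predicted)
print(per_panel, micro, macro)
```

Micro recall weights every project equally, whereas macro recall weights every panel equally, so small panels with low recall pull the macro figure down more strongly.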

	## Citation

	```
	@inproceedings{bovenzi2022mapping,
	title={Mapping STI ecosystems via Open Data: Overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark},
	author={Bovenzi, Nicandro and Duran-Silva, Nicolau and Massucci, Francesco Alessandro and Multari, Francesco and Parra-Rojas, C{\'e}sar and Pujol-Llatse, Josep},
	booktitle={International Conference on Theory and Practice of Digital Libraries (TPDL)},
	pages={495--499},
	year={2022},
	publisher={Springer International Publishing}
	}
	```

	---

	## Framework versions

	- Transformers: 4.57.x
	- PyTorch: 2.8.0
	- Datasets: 3.x
	- Tokenizers: 0.22.x