	# ERC Classifiers

	This repository contains a model for multi-label classification of scientific papers into ERC (European Research Council) categories. Given a paper's title and abstract, the model predicts one or more categories, such as its research domain or topic.

	## Model Description

	The model is based on SPECTER (a transformer-based model pre-trained on scientific literature) and fine-tuned for multi-label classification on a dataset of scientific papers. Given the title and abstract of a paper, it predicts one or more of the ERC categories.

	### Preprocessing

	The preprocessing pipeline involves:

	1. Data Loading: Papers are loaded from a Parquet file containing the title, abstract, and their respective categories.
	2. Label Cleaning: Labels (categories) are processed to remove any unnecessary information (like content within parentheses).
	3. Label Encoding: Categories are transformed into a binary matrix using the MultiLabelBinarizer from scikit-learn. Each category corresponds to a column, and the value is `1` if the paper belongs to that category, `0` otherwise.
	4. Statistics and Visualization: Basic statistics and visualizations, such as label distributions, are generated to help understand the dataset better.
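
Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the repository's code: the label strings are invented, `clean_label` and `binarize` are hypothetical helpers, and `binarize` is a plain-Python stand-in that mimics the sorted, one-column-per-class output of scikit-learn's `MultiLabelBinarizer` so the example runs without extra dependencies.

```python
import re


def clean_label(label: str) -> str:
    """Drop any parenthesised text and surrounding whitespace."""
    return re.sub(r"\s*\([^)]*\)", "", label).strip()


def binarize(label_lists):
    """Return (classes, matrix) with one column per class, sorted by name."""
    classes = sorted({lab for labs in label_lists for lab in labs})
    index = {c: i for i, c in enumerate(classes)}
    matrix = []
    for labs in label_lists:
        row = [0] * len(classes)
        for lab in labs:
            row[index[lab]] = 1  # 1 if the paper belongs to this category
        matrix.append(row)
    return classes, matrix


# Invented example labels, one list per paper
raw = [
    ["PE6 (Computer Science and Informatics)", "LS2 (Genetics)"],
    ["PE6 (Computer Science and Informatics)"],
    ["SH1 (Economics)"],
]
cleaned = [[clean_label(lab) for lab in labs] for labs in raw]
classes, matrix = binarize(cleaned)
print(classes)
print(matrix)
```

Each row of `matrix` is the binary target vector for one paper, and `classes` records which category each column stands for.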

	### Training

	The model is fine-tuned on the preprocessed dataset using the following setup:

	* Base Model: The model uses the `allenai/specter` transformer as the base model for sequence classification.
	* Optimizer: AdamW optimizer with a learning rate of `5e-5` is used.
	* Loss Function: Binary Cross-Entropy with logits (`BCEWithLogitsLoss`) is employed, as the task is multi-label classification.
	* Epochs: The model is trained for 5 epochs with a batch size of 4.
	* Training Data: The model is trained on a processed dataset stored in `train_ready.parquet`.
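
To make the loss concrete, the sketch below reimplements in plain Python what `BCEWithLogitsLoss` computes with its default mean reduction: an independent binary cross-entropy term for every (paper, category) pair, averaged over all pairs. The function name `bce_with_logits` and the sample numbers are illustrative assumptions; actual training would call the PyTorch class directly.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (paper, label) pairs.

    Uses the numerically stable form
    max(x, 0) - x * y + log(1 + exp(-|x|)).
    """
    total, count = 0.0, 0
    for row_x, row_y in zip(logits, targets):
        for x, y in zip(row_x, row_y):
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


# One paper, three candidate categories; the paper belongs to the first two.
logits = [[2.0, -1.0, 0.0]]
targets = [[1.0, 1.0, 0.0]]
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```

Because every category gets its own sigmoid and its own loss term, a paper can be assigned any number of categories, which is why this loss suits multi-label classification better than a softmax-based one.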

	### Evaluation

	The model is evaluated using both single-label and multi-label metrics:

	#### Single-Label Evaluation

	* Accuracy: The fraction of papers whose true label appears among the predicted labels.
	* Precision, Recall, F1: These metrics are calculated for each class and averaged for the entire dataset.
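
The accuracy described above can be read as a hit rate. The following is a small sketch under that reading; `hit_accuracy` is a hypothetical helper and the labels are invented.

```python
def hit_accuracy(true_labels, predicted_sets):
    """Fraction of papers whose single true label is among the predictions."""
    hits = sum(1 for t, preds in zip(true_labels, predicted_sets) if t in preds)
    return hits / len(true_labels)


true_labels = ["PE6", "LS2", "SH1", "PE6"]
predicted_sets = [{"PE6", "PE7"}, {"LS1"}, {"SH1"}, set()]
accuracy = hit_accuracy(true_labels, predicted_sets)
print(accuracy)
```

A paper with no predicted labels, like the last one here, simply counts as a miss.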

	#### Multi-Label Evaluation

	* Micro and Macro Metrics: Precision, recall, and F1 scores are computed using both micro-averaging (pooling decisions across all labels, so frequent labels dominate) and macro-averaging (the unweighted mean of per-label scores, so every label counts equally).
	* Label Frequency Plot: A plot showing the frequency distribution of labels in the test set.
	* Top and Bottom F1 Plot: A plot visualizing the top and bottom labels based on their F1 scores.
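
The difference between the two averaging schemes is easiest to see on a toy example. The sketch below computes micro and macro F1 from binary label matrices in plain Python; the helper names and the data are illustrative assumptions, and a label with nothing to score is given an F1 of 0.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for two binary matrices of equal shape."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # tp, fp, fn per label
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [1, 0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

Here the frequent first label is predicted perfectly while the rare second label is always wrong, so the micro score stays higher than the macro score.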

	## Dataset

	The dataset consists of scientific papers, each with the following columns:

	* title: The title of the paper.
	* abstract: The abstract of the paper.
	* label: A list of categories (labels) assigned to the paper.

	The dataset is preprocessed and stored in a `train_ready.parquet` file.

	## Files

	* `config.json`: Model configuration file.
	* `model.safetensors`: Saved fine-tuned model weights.
	* `tokenizer.json`: Serialized tokenizer (vocabulary and tokenization rules) used by the fine-tuned model.
	* `tokenizer_config.json`: Tokenizer settings.
	* `special_tokens_map.json`: Special tokens used by the tokenizer.
	* `vocab.txt`: Vocabulary file used by the tokenizer.

	## Usage

	To use the model, follow these steps:

	1. Install Dependencies:

	```bash
	pip install transformers torch datasets
	```

	2. Load the Model and Tokenizer:

	```python
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	model_name = "SIRIS-Lab/erc-classifiers"

	# Load fine-tuned model and tokenizer
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	```

	3. Use the Model for Prediction:

	```python
	import torch

	# Example paper title and abstract
	text = "Example title and abstract of a scientific paper."

	# Tokenize the input text (truncated to the model's maximum length)
	inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

	# Run inference without tracking gradients
	with torch.no_grad():
	    logits = model(**inputs).logits

	# Apply sigmoid activation to get probabilities
	probabilities = torch.sigmoid(logits)

	# Get predicted labels (threshold at 0.5)
	predicted_labels = (probabilities >= 0.5).long().cpu().numpy()
	print(predicted_labels)
	```
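
The prediction step yields a 0/1 matrix with one column per category, so a final step is to turn column indices back into category names. The sketch below shows one way to do that on plain Python lists. `decode_labels` is a hypothetical helper, and the `id2label` mapping and probabilities are invented for illustration.

```python
def decode_labels(probabilities, id2label, threshold=0.5):
    """Return, for each input row, the names of labels at or above threshold."""
    return [
        [id2label[i] for i, p in enumerate(row) if p >= threshold]
        for row in probabilities
    ]


# Invented mapping and scores for a single paper
id2label = {0: "LS2", 1: "PE6", 2: "SH1"}
probs = [[0.10, 0.92, 0.55]]
names = decode_labels(probs, id2label)
print(names)
```

With the objects from the prediction step, the call would be `decode_labels(probabilities.tolist(), model.config.id2label)`. In Transformers, `model.config.id2label` holds the index-to-name mapping, but whether this checkpoint stores the ERC category names there (rather than generic `LABEL_0`, `LABEL_1`, ... placeholders) is an assumption worth checking in `config.json`.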

	## Conclusion

	This model classifies scientific papers into multiple ERC categories based on their title and abstract. It builds on the SPECTER transformer and is fine-tuned on a dataset of ERC-related scientific papers.