---
license: mit
base_model: distilbert-base-uncased
tags:
- text-classification
- arxiv
- academic-papers
- distilbert
datasets:
- ccdv/arxiv-classification
metrics:
- accuracy
- f1
pipeline_tag: text-classification
---
|
|
|
|
|
# Academic Paper Classifier |
|
|
|
|
|
A DistilBERT model fine-tuned to classify academic paper abstracts into arXiv
subject categories. Given the abstract of a research paper, the model predicts
which area of computer science or statistics the paper belongs to.
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed for: |
|
|
|
|
|
- **Automated paper triage** -- quickly routing new submissions to the
  appropriate reviewers or reading lists.
- **Literature search** -- filtering large collections of papers by
  predicted subject area.
- **Research tooling** -- as a building block in larger academic-paper
  analysis pipelines.
|
|
|
|
|
The model is **not** intended for high-stakes decisions such as publication
acceptance or funding allocation.
|
|
|
|
|
## Labels |
|
|
|
|
|
| ID | Label   | Description                       |
|----|---------|-----------------------------------|
| 0  | cs.AI   | Artificial Intelligence           |
| 1  | cs.CL   | Computation and Language (NLP)    |
| 2  | cs.CV   | Computer Vision                   |
| 3  | cs.LG   | Machine Learning                  |
| 4  | cs.NE   | Neural and Evolutionary Computing |
| 5  | cs.RO   | Robotics                          |
| 6  | math.ST | Statistics Theory                 |
| 7  | stat.ML | Machine Learning (Statistics)     |
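For programmatic use, the table above corresponds to an `id2label` mapping along the following lines (a sketch reconstructed from the table; the model's `config.json` is the authoritative source):

```python
# id2label mapping reconstructed from the Labels table above.
# The model's config.json on the Hub is the authoritative source.
id2label = {
    0: "cs.AI",
    1: "cs.CL",
    2: "cs.CV",
    3: "cs.LG",
    4: "cs.NE",
    5: "cs.RO",
    6: "math.ST",
    7: "stat.ML",
}
label2id = {label: idx for idx, label in id2label.items()}

def label_name(class_id: int) -> str:
    """Map a predicted class id to its arXiv category string."""
    return id2label[class_id]
```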
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Base Model |
|
|
|
|
|
[`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) --
a distilled version of BERT that is 60% faster while retaining 97% of BERT's
language-understanding performance.
|
|
|
|
|
### Dataset |
|
|
|
|
|
[`ccdv/arxiv-classification`](https://huggingface.co/datasets/ccdv/arxiv-classification)
-- a curated collection of arXiv paper abstracts with subject category labels.
|
|
|
|
|
### Hyperparameters |
|
|
|
|
|
| Parameter               | Value              |
|-------------------------|--------------------|
| Learning rate           | 2e-5               |
| LR scheduler            | Linear with warmup |
| Warmup ratio            | 0.1                |
| Weight decay            | 0.01               |
| Epochs                  | 5                  |
| Batch size (train)      | 16                 |
| Batch size (eval)       | 32                 |
| Max sequence length     | 512                |
| Early stopping patience | 3                  |
| Seed                    | 42                 |
|
|
|
|
### Metrics |
|
|
|
|
|
The model is evaluated on accuracy, weighted F1, weighted precision, and
weighted recall. The best checkpoint is selected by weighted F1.
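Weighted F1 averages the per-class F1 scores weighted by each class's support, so frequent categories count proportionally more. A small self-contained illustration of the metric (mirroring `sklearn`'s `f1_score` with `average="weighted"`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for c in support:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += support[c] * f1
    return total / len(y_true)
```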
|
|
|
|
|
## How to Use |
|
|
|
|
|
### With the `transformers` pipeline |
|
|
|
|
|
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/paper-classifier-model",
)

abstract = (
    "We introduce a new method for neural machine translation that uses "
    "attention mechanisms to align source and target sentences, achieving "
    "state-of-the-art results on WMT benchmarks."
)

result = classifier(abstract)
print(result)
# [{'label': 'cs.CL', 'score': 0.95}]
```
|
|
|
|
|
### With the included inference script |
|
|
|
|
|
```bash
python inference.py \
    --model_path gr8monk3ys/paper-classifier-model \
    --abstract "We propose a convolutional neural network for image recognition..."
```
|
|
|
|
|
### Training from scratch |
|
|
|
|
|
```bash
pip install -r requirements.txt

python train.py \
    --num_train_epochs 5 \
    --learning_rate 2e-5 \
    --per_device_train_batch_size 16 \
    --push_to_hub
```
|
|
|
|
|
## Limitations |
|
|
|
|
|
- The model only covers a fixed set of eight arXiv categories. Papers from
  other fields will be forced into one of these buckets.
- Performance may degrade on abstracts that are unusually short, written in a
  language other than English, or that span multiple subject areas.
- The model inherits any biases present in the DistilBERT base weights and in
  the training dataset.
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex
@misc{scaturchio2025paperclassifier,
    title  = {Academic Paper Classifier},
    author = {Lorenzo Scaturchio},
    year   = {2025},
    url    = {https://huggingface.co/gr8monk3ys/paper-classifier-model}
}
```
|
|
|