Academic Paper Classifier

A DistilBERT model fine-tuned to classify academic paper abstracts into arXiv subject categories. Given the abstract of a research paper, the model predicts which of eight computer science or statistics categories the paper belongs to.

Intended Use

This model is designed for:

  • Automated paper triage -- quickly routing new submissions to the appropriate reviewers or reading lists.
  • Literature search -- filtering large collections of papers by predicted subject area.
  • Research tooling -- as a building block in larger academic-paper analysis pipelines.

The model is not intended for high-stakes decisions such as publication acceptance or funding allocation.

Labels

Id  Label    Description
0   cs.AI    Artificial Intelligence
1   cs.CL    Computation and Language (NLP)
2   cs.CV    Computer Vision
3   cs.LG    Machine Learning
4   cs.NE    Neural and Evolutionary Computing
5   cs.RO    Robotics
6   math.ST  Statistics Theory
7   stat.ML  Machine Learning (Statistics)
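For downstream code it can help to have the table above as a lookup. The sketch below transcribes it into `id2label` / `label2id` style dictionaries; the names `ID2LABEL`, `LABEL2ID`, and `label_for` are illustrative, and it is an assumption that the published checkpoint's config uses the same ordering.

```python
# Label mapping transcribed from the table above (ids 0-7).
# Assumption: the checkpoint's config follows this same id order.
ID2LABEL = {
    0: "cs.AI",
    1: "cs.CL",
    2: "cs.CV",
    3: "cs.LG",
    4: "cs.NE",
    5: "cs.RO",
    6: "math.ST",
    7: "stat.ML",
}

# Reverse lookup: category code -> class id.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def label_for(class_id: int) -> str:
    """Return the arXiv category code for a predicted class id."""
    return ID2LABEL[class_id]
```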

Training Procedure

Base Model

distilbert-base-uncased -- a distilled version of BERT that its authors report to be 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance.

Dataset

ccdv/arxiv-classification -- a curated collection of arXiv paper abstracts with subject category labels.

Hyperparameters

Parameter                 Value
Learning rate             2e-5
LR scheduler              Linear with warmup
Warmup ratio              0.1
Weight decay              0.01
Epochs                    5
Batch size (train)        16
Batch size (eval)         32
Max sequence length       512
Early stopping patience   3
Seed                      42
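To make the "linear with warmup" schedule concrete, here is a minimal sketch of how the learning rate would evolve with the values above: it rises linearly from 0 to 2e-5 over the first 10% of steps, then decays linearly to 0. The function name `linear_warmup_lr` and the step counts in the usage note are hypothetical; the actual number of optimizer steps depends on the dataset size, which this card does not state.

```python
import math


def linear_warmup_lr(step: int, total_steps: int,
                     base_lr: float = 2e-5,
                     warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given optimizer step under linear warmup + linear decay."""
    # Number of warmup steps implied by the warmup ratio.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

For example, with a hypothetical 1,000 total steps, `linear_warmup_lr(50, 1000)` is halfway through warmup and `linear_warmup_lr(100, 1000)` is the peak rate.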

Metrics

The model is evaluated on accuracy, weighted F1, weighted precision, and weighted recall. The best checkpoint is selected by weighted F1.
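As a reminder of what "weighted F1" means, the sketch below computes per-class F1 and averages it weighted by each class's support (its number of true examples). This is a plain-Python illustration, not the evaluation code used for this model; the function name `weighted_f1` and the sample labels in the usage note are made up for the example.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: predicted this class and it was correct.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight each class's F1 by its share of the true labels.
        score += f1 * n_true / total
    return score
```

Calling it as `weighted_f1(["cs.CL", "cs.CL", "cs.CV"], ["cs.CL", "cs.LG", "cs.CV"])` scores three hypothetical predictions; classes that never appear in `y_true` carry zero weight.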

How to Use

With the transformers pipeline

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/paper-classifier-model",
)

abstract = (
    "We introduce a new method for neural machine translation that uses "
    "attention mechanisms to align source and target sentences, achieving "
    "state-of-the-art results on WMT benchmarks."
)

result = classifier(abstract)
print(result)
# Example output (the score shown is illustrative):
# [{'label': 'cs.CL', 'score': 0.95}]

With the included inference script

python inference.py \
    --model_path gr8monk3ys/paper-classifier-model \
    --abstract "We propose a convolutional neural network for image recognition..."

Fine-tuning from the base model

pip install -r requirements.txt

python train.py \
    --num_train_epochs 5 \
    --learning_rate 2e-5 \
    --per_device_train_batch_size 16 \
    --push_to_hub

Limitations

  • The model only covers a fixed set of eight arXiv categories. Papers from other fields will be forced into one of these buckets.
  • Performance may degrade on abstracts that are unusually short, written in a language other than English, or that span multiple subject areas.
  • The model inherits any biases present in the DistilBERT base weights and in the training dataset.
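One common way to soften the forced-bucket limitation is to discard low-confidence predictions instead of always accepting the top label. The sketch below assumes the list-of-dicts output format shown in the pipeline example; the helper name `accept_prediction` and the 0.8 threshold are hypothetical and uncalibrated, and a high softmax score does not guarantee that a paper is actually in scope.

```python
def accept_prediction(result, min_score=0.8):
    """Return the top label, or None when its score is below min_score.

    `result` is a list of {'label': str, 'score': float} dicts, as in the
    pipeline example. `min_score` is an illustrative threshold, not a
    value tuned for this model.
    """
    top = max(result, key=lambda r: r["score"])
    return top["label"] if top["score"] >= min_score else None
```

Predictions that come back as `None` can then be routed to manual review rather than filed under one of the eight categories.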

Citation

If you use this model in your research, please cite:

@misc{scaturchio2025paperclassifier,
    title  = {Academic Paper Classifier},
    author = {Lorenzo Scaturchio},
    year   = {2025},
    url    = {https://huggingface.co/gr8monk3ys/paper-classifier-model}
}