# Academic Paper Classifier
A DistilBERT model fine-tuned to classify academic paper abstracts into arXiv subject categories. Given the abstract of a research paper, the model predicts which area of computer science or statistics the paper belongs to.
## Intended Use
This model is designed for:
- Automated paper triage -- quickly routing new submissions to the appropriate reviewers or reading lists.
- Literature search -- filtering large collections of papers by predicted subject area.
- Research tooling -- as a building block in larger academic-paper analysis pipelines.
The model is not intended for high-stakes decisions such as publication acceptance or funding allocation.
## Labels
| Id | Label | Description |
|---|---|---|
| 0 | cs.AI | Artificial Intelligence |
| 1 | cs.CL | Computation and Language (NLP) |
| 2 | cs.CV | Computer Vision |
| 3 | cs.LG | Machine Learning |
| 4 | cs.NE | Neural and Evolutionary Computing |
| 5 | cs.RO | Robotics |
| 6 | math.ST | Statistics Theory |
| 7 | stat.ML | Machine Learning (Statistics) |
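For post-processing raw model outputs, the id-to-label mapping in the table above can be written as a plain Python dict. (The same mapping should also be recoverable from the model's `config.id2label`; the dict below is just a convenience transcription of the table.)

```python
# Mapping from class index to arXiv category label, as listed in the table above.
ID2LABEL = {
    0: "cs.AI",    # Artificial Intelligence
    1: "cs.CL",    # Computation and Language (NLP)
    2: "cs.CV",    # Computer Vision
    3: "cs.LG",    # Machine Learning
    4: "cs.NE",    # Neural and Evolutionary Computing
    5: "cs.RO",    # Robotics
    6: "math.ST",  # Statistics Theory
    7: "stat.ML",  # Machine Learning (Statistics)
}

# Inverse mapping, useful when preparing labels for training.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```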
## Training Procedure

### Base Model
`distilbert-base-uncased` -- a distilled version of BERT that is 60% faster while retaining 97% of BERT's language-understanding performance.
### Dataset

`ccdv/arxiv-classification` -- a curated collection of arXiv paper abstracts with subject-category labels.
### Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| LR scheduler | Linear with warmup |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Epochs | 5 |
| Batch size (train) | 16 |
| Batch size (eval) | 32 |
| Max sequence length | 512 |
| Early stopping patience | 3 |
| Seed | 42 |
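The hyperparameters above map onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch of how such a run could be configured, not a verbatim copy of the repository's `train.py`; the `output_dir` name is illustrative, and the max sequence length of 512 is applied at tokenization time rather than here.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="paper-classifier",   # illustrative path
    learning_rate=2e-5,
    lr_scheduler_type="linear",      # linear schedule with warmup
    warmup_ratio=0.1,
    weight_decay=0.01,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    seed=42,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="f1",      # best checkpoint selected by weighted F1
)

# Stop training after 3 evaluations without improvement;
# pass this to Trainer(callbacks=[early_stopping]).
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```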
## Metrics
The model is evaluated on accuracy, weighted F1, weighted precision, and weighted recall. The best checkpoint is selected by weighted F1.
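The metrics above can be computed with a `compute_metrics` function in the style expected by the transformers `Trainer`. The sketch below is an illustrative reconstruction using scikit-learn, not necessarily the exact code used in training:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Compute accuracy and weighted precision/recall/F1 from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```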
## How to Use

### With the transformers pipeline
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/paper-classifier-model",
)

abstract = (
    "We introduce a new method for neural machine translation that uses "
    "attention mechanisms to align source and target sentences, achieving "
    "state-of-the-art results on WMT benchmarks."
)

result = classifier(abstract)
print(result)
# [{'label': 'cs.CL', 'score': 0.95}]
```
### With the included inference script

```bash
python inference.py \
    --model_path gr8monk3ys/paper-classifier-model \
    --abstract "We propose a convolutional neural network for image recognition..."
```
### Training from scratch

```bash
pip install -r requirements.txt
python train.py \
    --num_train_epochs 5 \
    --learning_rate 2e-5 \
    --per_device_train_batch_size 16 \
    --push_to_hub
```
## Limitations
- The model covers only a fixed set of 8 arXiv categories. Papers from other fields will be forced into one of these buckets.
- Performance may degrade on abstracts that are unusually short, written in a language other than English, or that span multiple subject areas.
- The model inherits any biases present in the DistilBERT base weights and in the training dataset.
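Because every abstract is forced into one of the eight labels, one simple guard against out-of-scope inputs is to treat low-confidence predictions as uncertain. A minimal sketch, assuming the `[{'label': ..., 'score': ...}]` output shape of the pipeline example above; the 0.5 threshold is an arbitrary illustrative choice and should be tuned on held-out data:

```python
def route_prediction(result, threshold=0.5):
    """Return the top prediction, or flag it as uncertain if its score is too low."""
    top = result[0] if isinstance(result, list) else result
    if top["score"] < threshold:
        # Low confidence: likely out-of-scope or multi-topic; route to a human.
        return {"label": "UNCERTAIN", "score": top["score"]}
    return top
```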
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{scaturchio2025paperclassifier,
  title  = {Academic Paper Classifier},
  author = {Lorenzo Scaturchio},
  year   = {2025},
  url    = {https://huggingface.co/gr8monk3ys/paper-classifier-model}
}
```