---
license: mit
base_model: distilbert-base-uncased
tags:
- text-classification
- arxiv
- academic-papers
- distilbert
datasets:
- ccdv/arxiv-classification
metrics:
- accuracy
- f1
pipeline_tag: text-classification
---
# Academic Paper Classifier
A DistilBERT model fine-tuned to classify academic paper abstracts into arXiv
subject categories. Given the abstract of a research paper, the model predicts
which area of computer science or statistics the paper belongs to.
## Intended Use
This model is designed for:
- **Automated paper triage** -- quickly routing new submissions to the
appropriate reviewers or reading lists.
- **Literature search** -- filtering large collections of papers by
predicted subject area.
- **Research tooling** -- as a building block in larger academic-paper
analysis pipelines.
The model is **not** intended for high-stakes decisions such as publication
acceptance or funding allocation.
## Labels
| Id | Label | Description |
|----|----------|-----------------------------------|
| 0 | cs.AI | Artificial Intelligence |
| 1 | cs.CL | Computation and Language (NLP) |
| 2 | cs.CV | Computer Vision |
| 3 | cs.LG | Machine Learning |
| 4 | cs.NE | Neural and Evolutionary Computing |
| 5 | cs.RO | Robotics |
| 6 | math.ST | Statistics Theory |
| 7 | stat.ML | Machine Learning (Statistics) |
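When working with raw model outputs rather than the pipeline, the table above corresponds to the model's `id2label` mapping. A minimal sketch (the dictionary below is transcribed from the table, not read from the model config, so verify it against `model.config.id2label` before relying on it):

```python
# Label mapping transcribed from the table above.
ID2LABEL = {
    0: "cs.AI", 1: "cs.CL", 2: "cs.CV", 3: "cs.LG",
    4: "cs.NE", 5: "cs.RO", 6: "math.ST", 7: "stat.ML",
}

def label_for(class_id: int) -> str:
    """Map a predicted class index (e.g. an argmax over logits) to its label."""
    return ID2LABEL[class_id]

print(label_for(1))  # cs.CL
```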
## Training Procedure
### Base Model
[`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) --
a distilled version of BERT that is 60% faster while retaining 97% of BERT's
language-understanding performance.
### Dataset
[`ccdv/arxiv-classification`](https://huggingface.co/datasets/ccdv/arxiv-classification)
-- a curated collection of arXiv paper abstracts with subject category labels.
### Hyperparameters
| Parameter | Value |
|------------------------|--------|
| Learning rate | 2e-5 |
| LR scheduler | Linear with warmup |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Epochs | 5 |
| Batch size (train) | 16 |
| Batch size (eval) | 32 |
| Max sequence length | 512 |
| Early stopping patience| 3 |
| Seed | 42 |
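The table above maps onto `transformers.TrainingArguments` roughly as follows. This is a sketch, not the exact training script; the early-stopping patience of 3 is supplied separately via `EarlyStoppingCallback`, which requires `load_best_model_at_end=True`:

```python
from transformers import TrainingArguments

# Configuration sketch matching the hyperparameter table; the output_dir
# name is illustrative.
args = TrainingArguments(
    output_dir="paper-classifier-model",
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.01,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    seed=42,
    load_best_model_at_end=True,   # required for EarlyStoppingCallback
    metric_for_best_model="f1",    # best checkpoint selected by weighted F1
)
```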
### Metrics
The model is evaluated on accuracy, weighted F1, weighted precision, and
weighted recall. The best checkpoint is selected by weighted F1.
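Weighted F1 averages the per-class F1 scores, weighting each class by its support (its number of true examples), so frequent categories count proportionally more. A self-contained sketch of the computation:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        pred_pos = sum(1 for p in y_pred if p == cls)
        precision = tp / pred_pos if pred_pos else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        total += n * f1
    return total / len(y_true)

# Toy example with three classes.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.787
```

This matches `sklearn.metrics.f1_score(..., average="weighted")`, which is what the `evaluate`/`sklearn` stack typically computes during training.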
## How to Use
### With the `transformers` pipeline
```python
from transformers import pipeline
classifier = pipeline(
"text-classification",
model="gr8monk3ys/paper-classifier-model",
)
abstract = (
"We introduce a new method for neural machine translation that uses "
"attention mechanisms to align source and target sentences, achieving "
"state-of-the-art results on WMT benchmarks."
)
result = classifier(abstract)
print(result)
# [{'label': 'cs.CL', 'score': 0.95}]
```
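The `score` in the output is a softmax probability over the model's 8 class logits. The step can be sketched in plain Python (the logit values below are illustrative, not real model outputs):

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for the 8 classes; index 1 (cs.CL) dominates.
logits = [0.1, 5.2, -0.3, 1.0, -1.2, -0.8, -2.0, 0.4]
probs = softmax(logits)
print(probs.index(max(probs)))  # 1, i.e. cs.CL
```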
### With the included inference script
```bash
python inference.py \
--model_path gr8monk3ys/paper-classifier-model \
--abstract "We propose a convolutional neural network for image recognition..."
```
### Training from scratch
```bash
pip install -r requirements.txt
python train.py \
--num_train_epochs 5 \
--learning_rate 2e-5 \
--per_device_train_batch_size 16 \
--push_to_hub
```
## Limitations
- The model only covers a fixed set of 8 arXiv categories. Papers from other
fields will be forced into one of these buckets.
- Performance may degrade on abstracts that are unusually short, written in a
language other than English, or span multiple subject areas.
- The model inherits any biases present in the DistilBERT base weights and in
the training dataset.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{scaturchio2025paperclassifier,
title = {Academic Paper Classifier},
author = {Lorenzo Scaturchio},
year = {2025},
url = {https://huggingface.co/gr8monk3ys/paper-classifier-model}
}
```