---
license: mit
base_model: distilbert-base-uncased
tags:
  - text-classification
  - arxiv
  - academic-papers
  - distilbert
datasets:
  - ccdv/arxiv-classification
metrics:
  - accuracy
  - f1
pipeline_tag: text-classification
---

# Academic Paper Classifier

A DistilBERT model fine-tuned to classify academic paper abstracts into arXiv
subject categories. Given the abstract of a research paper, the model predicts
which area of computer science or statistics the paper belongs to.

## Intended Use

This model is designed for:

- **Automated paper triage** -- quickly routing new submissions to the
  appropriate reviewers or reading lists.
- **Literature search** -- filtering large collections of papers by
  predicted subject area.
- **Research tooling** -- as a building block in larger academic-paper
  analysis pipelines.

The model is **not** intended for high-stakes decisions such as publication
acceptance or funding allocation.

## Labels

| Id | Label    | Description                       |
|----|----------|-----------------------------------|
| 0  | cs.AI    | Artificial Intelligence           |
| 1  | cs.CL    | Computation and Language (NLP)    |
| 2  | cs.CV    | Computer Vision                   |
| 3  | cs.LG    | Machine Learning                  |
| 4  | cs.NE    | Neural and Evolutionary Computing |
| 5  | cs.RO    | Robotics                          |
| 6  | math.ST  | Statistics Theory                 |
| 7  | stat.ML  | Machine Learning (Statistics)     |

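When working with the raw model outputs instead of the pipeline, the integer class ids above map to category labels. A minimal sketch of that mapping, assuming the model's config mirrors this table:

```python
# Mapping between integer class ids and arXiv category labels,
# as listed in the table above.
ID2LABEL = {
    0: "cs.AI",
    1: "cs.CL",
    2: "cs.CV",
    3: "cs.LG",
    4: "cs.NE",
    5: "cs.RO",
    6: "math.ST",
    7: "stat.ML",
}
LABEL2ID = {label: i for i, label in ID2LABEL.items()}

def label_for(class_id: int) -> str:
    """Return the arXiv category for a predicted class id."""
    return ID2LABEL[class_id]
```

In practice the same mapping is available from the uploaded model as `model.config.id2label`, which should be preferred as the source of truth.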
## Training Procedure

### Base Model

[`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) --
a distilled version of BERT that is 60% faster while retaining 97% of BERT's
language-understanding performance.

### Dataset

[`ccdv/arxiv-classification`](https://huggingface.co/datasets/ccdv/arxiv-classification)
-- a curated collection of arXiv paper abstracts with subject category labels.

### Hyperparameters

| Parameter              | Value  |
|------------------------|--------|
| Learning rate          | 2e-5   |
| LR scheduler           | Linear with warmup |
| Warmup ratio           | 0.1    |
| Weight decay           | 0.01   |
| Epochs                 | 5      |
| Batch size (train)     | 16     |
| Batch size (eval)      | 32     |
| Max sequence length    | 512    |
| Early stopping patience| 3      |
| Seed                   | 42     |

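The table above corresponds roughly to the following `TrainingArguments`. This is a sketch, not a copy of `train.py`: the output path is illustrative, and note that the max sequence length is applied at tokenization time while early stopping is configured via an `EarlyStoppingCallback` rather than an argument here.

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters above as transformers TrainingArguments.
# output_dir is illustrative, not taken from train.py.
training_args = TrainingArguments(
    output_dir="paper-classifier-model",
    learning_rate=2e-5,
    lr_scheduler_type="linear",        # linear decay after warmup
    warmup_ratio=0.1,
    weight_decay=0.01,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    seed=42,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="f1",        # best checkpoint by weighted F1
)
```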
### Metrics

The model is evaluated on accuracy, weighted F1, weighted precision, and
weighted recall. The best checkpoint is selected by weighted F1.
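A metrics function in this style can be written with scikit-learn; the sketch below assumes the `(logits, labels)` tuple that the `transformers` `Trainer` passes to `compute_metrics`, and may differ from the actual implementation in `train.py`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Compute accuracy and weighted precision/recall/F1 from a
    (logits, labels) tuple as passed by the transformers Trainer."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Weighted averaging accounts for class imbalance across the eight categories by weighting each class's score by its support.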

## How to Use

### With the `transformers` pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/paper-classifier-model",
)

abstract = (
    "We introduce a new method for neural machine translation that uses "
    "attention mechanisms to align source and target sentences, achieving "
    "state-of-the-art results on WMT benchmarks."
)

result = classifier(abstract)
print(result)
# e.g. [{'label': 'cs.CL', 'score': 0.95}]
```

### With the included inference script

```bash
python inference.py \
    --model_path gr8monk3ys/paper-classifier-model \
    --abstract "We propose a convolutional neural network for image recognition..."
```

### Training from scratch

```bash
pip install -r requirements.txt

python train.py \
    --num_train_epochs 5 \
    --learning_rate 2e-5 \
    --per_device_train_batch_size 16 \
    --push_to_hub
```

## Limitations

- The model covers only a fixed set of eight arXiv categories. Papers from
  other fields will be forced into one of these buckets.
- Performance may degrade on abstracts that are unusually short, written in a
  language other than English, or that span multiple subject areas.
- The model inherits any biases present in the DistilBERT base weights and in
  the training dataset.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{scaturchio2025paperclassifier,
    title  = {Academic Paper Classifier},
    author = {Lorenzo Scaturchio},
    year   = {2025},
    url    = {https://huggingface.co/gr8monk3ys/paper-classifier-model}
}
```