---
license: mit
base_model: distilbert-base-uncased
tags:
  - text-classification
  - arxiv
  - academic-papers
  - distilbert
datasets:
  - ccdv/arxiv-classification
metrics:
  - accuracy
  - f1
pipeline_tag: text-classification
---

# Academic Paper Classifier

A DistilBERT model fine-tuned to classify academic paper abstracts into arXiv
subject categories. Given the abstract of a research paper, the model predicts
which area of computer science or statistics the paper belongs to.

## Intended Use

This model is designed for:

- **Automated paper triage** -- quickly routing new submissions to the
  appropriate reviewers or reading lists.
- **Literature search** -- filtering large collections of papers by
  predicted subject area.
- **Research tooling** -- as a building block in larger academic-paper
  analysis pipelines.

The model is **not** intended for high-stakes decisions such as publication
acceptance or funding allocation.

## Labels

| Id | Label    | Description                       |
|----|----------|-----------------------------------|
| 0  | cs.AI    | Artificial Intelligence           |
| 1  | cs.CL    | Computation and Language (NLP)    |
| 2  | cs.CV    | Computer Vision                   |
| 3  | cs.LG    | Machine Learning                  |
| 4  | cs.NE    | Neural and Evolutionary Computing |
| 5  | cs.RO    | Robotics                          |
| 6  | math.ST  | Statistics Theory                 |
| 7  | stat.ML  | Machine Learning (Statistics)     |

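When working with the raw model outputs instead of the pipeline, the integer class ids above map to category labels. A minimal sketch of that mapping, assuming the model's config mirrors this table:

```python
# Mapping between integer class ids and arXiv category labels,
# as listed in the table above.
ID2LABEL = {
    0: "cs.AI",
    1: "cs.CL",
    2: "cs.CV",
    3: "cs.LG",
    4: "cs.NE",
    5: "cs.RO",
    6: "math.ST",
    7: "stat.ML",
}
LABEL2ID = {label: i for i, label in ID2LABEL.items()}

def label_for(class_id: int) -> str:
    """Return the arXiv category for a predicted class id."""
    return ID2LABEL[class_id]
```

In practice the same mapping is available from the uploaded model as `model.config.id2label`, which should be preferred as the source of truth.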
## Training Procedure

### Base Model

[`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) --
a distilled version of BERT that is 60% faster while retaining 97% of BERT's
language-understanding performance.

### Dataset

[`ccdv/arxiv-classification`](https://huggingface.co/datasets/ccdv/arxiv-classification)
-- a curated collection of arXiv paper abstracts with subject category labels.

### Hyperparameters

| Parameter              | Value  |
|------------------------|--------|
| Learning rate          | 2e-5   |
| LR scheduler           | Linear with warmup |
| Warmup ratio           | 0.1    |
| Weight decay           | 0.01   |
| Epochs                 | 5      |
| Batch size (train)     | 16     |
| Batch size (eval)      | 32     |
| Max sequence length    | 512    |
| Early stopping patience| 3      |
| Seed                   | 42     |

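The table above corresponds roughly to the following `TrainingArguments`. This is a sketch, not a copy of `train.py`: the output path is illustrative, and note that the max sequence length is applied at tokenization time while early stopping is configured via an `EarlyStoppingCallback` rather than an argument here.

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters above as transformers TrainingArguments.
# output_dir is illustrative, not taken from train.py.
training_args = TrainingArguments(
    output_dir="paper-classifier-model",
    learning_rate=2e-5,
    lr_scheduler_type="linear",        # linear decay after warmup
    warmup_ratio=0.1,
    weight_decay=0.01,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    seed=42,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="f1",        # best checkpoint by weighted F1
)
```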
### Metrics

The model is evaluated on accuracy, weighted F1, weighted precision, and
weighted recall. The best checkpoint is selected by weighted F1.
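A metrics function in this style can be written with scikit-learn; the sketch below assumes the `(logits, labels)` tuple that the `transformers` `Trainer` passes to `compute_metrics`, and may differ from the actual implementation in `train.py`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Compute accuracy and weighted precision/recall/F1 from a
    (logits, labels) tuple as passed by the transformers Trainer."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Weighted averaging accounts for class imbalance across the eight categories by weighting each class's score by its support.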

## How to Use

### With the `transformers` pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/paper-classifier-model",
)

abstract = (
    "We introduce a new method for neural machine translation that uses "
    "attention mechanisms to align source and target sentences, achieving "
    "state-of-the-art results on WMT benchmarks."
)

result = classifier(abstract)
print(result)
# e.g. [{'label': 'cs.CL', 'score': 0.95}]
```

### With the included inference script

```bash
python inference.py \
    --model_path gr8monk3ys/paper-classifier-model \
    --abstract "We propose a convolutional neural network for image recognition..."
```

### Training from scratch

```bash
pip install -r requirements.txt

python train.py \
    --num_train_epochs 5 \
    --learning_rate 2e-5 \
    --per_device_train_batch_size 16 \
    --push_to_hub
```

## Limitations

- The model covers only a fixed set of eight arXiv categories. Papers from
  other fields will be forced into one of these buckets.
- Performance may degrade on abstracts that are unusually short, written in a
  language other than English, or that span multiple subject areas.
- The model inherits any biases present in the DistilBERT base weights and in
  the training dataset.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{scaturchio2025paperclassifier,
    title  = {Academic Paper Classifier},
    author = {Lorenzo Scaturchio},
    year   = {2025},
    url    = {https://huggingface.co/gr8monk3ys/paper-classifier-model}
}
```