---
license: mit
base_model: distilbert-base-uncased
tags:
- text-classification
- arxiv
- academic-papers
- distilbert
datasets:
- ccdv/arxiv-classification
metrics:
- accuracy
- f1
pipeline_tag: text-classification
---
# Academic Paper Classifier
A DistilBERT model fine-tuned to classify academic paper abstracts into arXiv
subject categories. Given the abstract of a research paper, the model predicts
which area of computer science or statistics the paper belongs to.
## Intended Use
This model is designed for:
- **Automated paper triage** -- quickly routing new submissions to the
appropriate reviewers or reading lists.
- **Literature search** -- filtering large collections of papers by
predicted subject area.
- **Research tooling** -- as a building block in larger academic-paper
analysis pipelines.
The model is **not** intended for high-stakes decisions such as publication
acceptance or funding allocation.
## Labels
| Id | Label | Description |
|----|----------|-----------------------------------|
| 0 | cs.AI | Artificial Intelligence |
| 1 | cs.CL | Computation and Language (NLP) |
| 2 | cs.CV | Computer Vision |
| 3 | cs.LG | Machine Learning |
| 4 | cs.NE | Neural and Evolutionary Computing |
| 5 | cs.RO | Robotics |
| 6 | math.ST | Statistics Theory |
| 7 | stat.ML | Machine Learning (Statistics) |
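For programmatic use, the table above corresponds to the model's `id2label` mapping (assuming the config was saved with these labels during fine-tuning; the dict below is transcribed from the table, not read from the Hub):

```python
# Label mapping transcribed from the table above; after loading the model,
# the same mapping should be available as `model.config.id2label`.
id2label = {
    0: "cs.AI", 1: "cs.CL", 2: "cs.CV", 3: "cs.LG",
    4: "cs.NE", 5: "cs.RO", 6: "math.ST", 7: "stat.ML",
}
label2id = {label: idx for idx, label in id2label.items()}
```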
## Training Procedure
### Base Model
[`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) --
a distilled version of BERT that is 40% smaller and 60% faster while retaining
97% of BERT's language-understanding performance.
### Dataset
[`ccdv/arxiv-classification`](https://huggingface.co/datasets/ccdv/arxiv-classification)
-- a curated collection of arXiv paper abstracts with subject category labels.
### Hyperparameters
| Parameter | Value |
|------------------------|--------|
| Learning rate | 2e-5 |
| LR scheduler | Linear with warmup |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Epochs | 5 |
| Batch size (train) | 16 |
| Batch size (eval) | 32 |
| Max sequence length | 512 |
| Early stopping patience| 3 |
| Seed | 42 |
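The early-stopping behaviour implied by the patience of 3 can be sketched in plain Python (this illustrates the semantics only; the actual training script presumably relies on `transformers`' `EarlyStoppingCallback`):

```python
def stops_early(eval_scores, patience=3):
    """Return True if training would halt early: the monitored metric
    failed to improve for `patience` consecutive evaluations."""
    best = float("-inf")
    bad_rounds = 0
    for score in eval_scores:
        if score > best:
            best = score
            bad_rounds = 0  # improvement resets the counter
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                return True
    return False
```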
### Metrics
The model is evaluated on accuracy, weighted F1, weighted precision, and
weighted recall. The best checkpoint is selected by weighted F1.
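A `compute_metrics` function matching this description might look like the following (a sketch assuming scikit-learn; the exact implementation in `train.py` may differ):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # eval_pred is the (logits, labels) pair that the transformers
    # Trainer passes to its `compute_metrics` hook.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```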
## How to Use
### With the `transformers` pipeline
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/paper-classifier-model",
)

abstract = (
    "We introduce a new method for neural machine translation that uses "
    "attention mechanisms to align source and target sentences, achieving "
    "state-of-the-art results on WMT benchmarks."
)

result = classifier(abstract)
print(result)
# [{'label': 'cs.CL', 'score': 0.95}]
```
### With the included inference script
```bash
python inference.py \
    --model_path gr8monk3ys/paper-classifier-model \
    --abstract "We propose a convolutional neural network for image recognition..."
```
### Training from scratch
```bash
pip install -r requirements.txt
python train.py \
    --num_train_epochs 5 \
    --learning_rate 2e-5 \
    --per_device_train_batch_size 16 \
    --push_to_hub
```
## Limitations
- The model only covers a fixed set of 8 arXiv categories. Papers from other
  fields will be forced into one of these buckets.
- Performance may degrade on abstracts that are unusually short, written in a
  language other than English, or that span multiple subject areas.
- The model inherits any biases present in the DistilBERT base weights and in
  the training dataset.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{scaturchio2025paperclassifier,
  title  = {Academic Paper Classifier},
  author = {Lorenzo Scaturchio},
  year   = {2025},
  url    = {https://huggingface.co/gr8monk3ys/paper-classifier-model}
}
```