---
license: mit
base_model: distilbert-base-uncased
tags:
- text-classification
- arxiv
- academic-papers
- distilbert
datasets:
- ccdv/arxiv-classification
metrics:
- accuracy
- f1
pipeline_tag: text-classification
---

# Academic Paper Classifier

A DistilBERT model fine-tuned to classify academic paper abstracts into arXiv subject categories. Given the abstract of a research paper, the model predicts which area of computer science or statistics the paper belongs to.

## Intended Use

This model is designed for:

- **Automated paper triage** -- quickly routing new submissions to the appropriate reviewers or reading lists.
- **Literature search** -- filtering large collections of papers by predicted subject area.
- **Research tooling** -- serving as a building block in larger academic-paper analysis pipelines.

The model is **not** intended for high-stakes decisions such as publication acceptance or funding allocation.

## Labels

| Id | Label   | Description                       |
|----|---------|-----------------------------------|
| 0  | cs.AI   | Artificial Intelligence           |
| 1  | cs.CL   | Computation and Language (NLP)    |
| 2  | cs.CV   | Computer Vision                   |
| 3  | cs.LG   | Machine Learning                  |
| 4  | cs.NE   | Neural and Evolutionary Computing |
| 5  | cs.RO   | Robotics                          |
| 6  | math.ST | Statistics Theory                 |
| 7  | stat.ML | Machine Learning (Statistics)     |

## Training Procedure

### Base Model

[`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) -- a distilled version of BERT that runs about 60% faster while retaining roughly 97% of BERT's language-understanding performance.

### Dataset

[`ccdv/arxiv-classification`](https://huggingface.co/datasets/ccdv/arxiv-classification) -- a curated collection of arXiv paper abstracts with subject-category labels.
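The label ids above are fixed at training time. A minimal sketch of mapping raw prediction ids back to category names -- the `id2label` dict is copied directly from the table above, and `label_name` is a hypothetical helper for illustration, not part of the released code:

```python
# Id-to-label mapping, copied from the Labels table above.
id2label = {
    0: "cs.AI",
    1: "cs.CL",
    2: "cs.CV",
    3: "cs.LG",
    4: "cs.NE",
    5: "cs.RO",
    6: "math.ST",
    7: "stat.ML",
}

def label_name(pred_id: int) -> str:
    """Map a raw class index to its arXiv category name."""
    return id2label[pred_id]
```

The same mapping is stored in the model's `config.json`, so the `transformers` pipeline resolves it automatically; the dict is only needed when working with raw logits.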
### Hyperparameters

| Parameter               | Value              |
|-------------------------|--------------------|
| Learning rate           | 2e-5               |
| LR scheduler            | Linear with warmup |
| Warmup ratio            | 0.1                |
| Weight decay            | 0.01               |
| Epochs                  | 5                  |
| Batch size (train)      | 16                 |
| Batch size (eval)       | 32                 |
| Max sequence length     | 512                |
| Early stopping patience | 3                  |
| Seed                    | 42                 |

### Metrics

The model is evaluated on accuracy, weighted F1, weighted precision, and weighted recall. The best checkpoint is selected by weighted F1.

## How to Use

### With the `transformers` pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/paper-classifier-model",
)

abstract = (
    "We introduce a new method for neural machine translation that uses "
    "attention mechanisms to align source and target sentences, achieving "
    "state-of-the-art results on WMT benchmarks."
)

result = classifier(abstract)
print(result)
# [{'label': 'cs.CL', 'score': 0.95}]
```

### With the included inference script

```bash
python inference.py \
    --model_path gr8monk3ys/paper-classifier-model \
    --abstract "We propose a convolutional neural network for image recognition..."
```

### Training from scratch

```bash
pip install -r requirements.txt

python train.py \
    --num_train_epochs 5 \
    --learning_rate 2e-5 \
    --per_device_train_batch_size 16 \
    --push_to_hub
```

## Limitations

- The model only covers a fixed set of 8 arXiv categories. Papers from other fields will be forced into one of these buckets.
- Performance may degrade on abstracts that are unusually short, written in a language other than English, or spanning multiple subject areas.
- The model inherits any biases present in the DistilBERT base weights and in the training dataset.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{scaturchio2025paperclassifier,
  title  = {Academic Paper Classifier},
  author = {Lorenzo Scaturchio},
  year   = {2025},
  url    = {https://huggingface.co/gr8monk3ys/paper-classifier-model}
}
```
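## Appendix: How the Pipeline Score Is Computed

The `score` field in the pipeline output above is a softmax probability over the eight class logits. A minimal pure-Python sketch of that final step -- the logit values below are invented for illustration and are not real model outputs:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the 8 classes; index 1 (cs.CL) dominates.
logits = [0.1, 5.2, -1.0, 1.3, -0.5, -2.0, -1.5, 0.8]
probs = softmax(logits)

# The pipeline reports the argmax class and its probability as `score`.
top_id = max(range(len(probs)), key=probs.__getitem__)
```

By default the pipeline returns only this top class; passing `top_k=None` to the `transformers` text-classification pipeline returns the full probability distribution over all eight labels.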