---
title: arXiv Topic Classifier
emoji: 📚
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.33.0
app_file: app.py
pinned: false
license: mit
short_description: Transformer-powered topic classification for arXiv papers
---

# arXiv Topic Classifier

`arXiv Topic Classifier` is a Streamlit app that classifies research papers into arXiv-style topic categories from the paper title and abstract. The interface accepts the two fields separately, supports title-only inference, and returns the shortest prefix of the probability-ranked labels whose cumulative probability exceeds 95%.

The project is designed as a lightweight end-to-end ML application: collect data, fine-tune a transformer classifier, package the trained model with local inference code, and expose the result through a public web interface.

## Features

- topic prediction from `title` and `abstract`
- inference from `title` only when the abstract is missing
- top-95% output: the shortest ranked label list whose cumulative probability exceeds 95%
- full ranked list of class probabilities
- cached model loading for faster repeated requests
- self-contained deployment with local model weights
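
The top-95% selection can be sketched in a few lines. This is a minimal illustration, not the code in `inference.py`: the function names `softmax` and `top_p_prefix` are hypothetical, and the sketch assumes the model produces one raw score (logit) per category.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_prefix(labels, probs, threshold=0.95):
    """Return (ranked, prefix).

    ranked: all (label, prob) pairs, highest probability first.
    prefix: the shortest leading slice of `ranked` whose cumulative
            probability exceeds `threshold`.
    """
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    prefix, cumulative = [], 0.0
    for label, prob in ranked:
        prefix.append((label, prob))
        cumulative += prob
        if cumulative > threshold:
            break
    return ranked, prefix


if __name__ == "__main__":
    # Hypothetical probabilities for three of the categories.
    ranked, prefix = top_p_prefix(["cs.CL", "cs.CV", "cs.RO"], [0.27, 0.70, 0.03])
    print([label for label, _ in prefix])
```

A confident prediction yields a one-label prefix, while an ambiguous paper yields several labels, so the length of the prefix doubles as a rough uncertainty signal.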

## Categories

The current model predicts 10 categories:

- `astro-ph.GA`
- `cond-mat.mtrl-sci`
- `cs.CL`
- `cs.CV`
- `cs.RO`
- `econ.EM`
- `math.PR`
- `physics.optics`
- `q-bio.BM`
- `quant-ph`

## Model

The production model is based on `distilbert-base-uncased` fine-tuned for multi-class text classification.

Configuration:

- max sequence length: `256`
- epochs: `3`
- learning rate: `2e-5`

The model consumes a single formatted text built from the input fields:

```text
title: <paper title> abstract: <paper abstract>
```

If the abstract is missing, inference falls back to:

```text
title: <paper title>
```
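
The two templates above could be produced by a small helper like the one below. This is a sketch under stated assumptions: the name `build_model_input` is hypothetical, and the whitespace normalization and the empty-title check are illustrative choices that the project may or may not make.

```python
def build_model_input(title, abstract=None):
    """Format the input fields into the single text string the model consumes."""
    # Assumption: collapse runs of whitespace and newlines into single spaces.
    title = " ".join((title or "").split())
    abstract = " ".join((abstract or "").split())
    if not title:
        # Assumption: a title is always required.
        raise ValueError("title must not be empty")
    if abstract:
        return f"title: {title} abstract: {abstract}"
    # Title-only fallback when the abstract is missing or blank.
    return f"title: {title}"
```

Using the same formatting function at training and inference time keeps the input distribution consistent between the two.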

## Dataset

The dataset was collected from the arXiv API and processed into train, validation, and test splits.

Prepared split sizes:

- train: `3120`
- validation: `391`
- test: `388`
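
As a quick arithmetic check, the split sizes above sum to 3,899 examples, which corresponds to roughly an 80/10/10 split. The snippet only restates the numbers listed here; it is not part of the data pipeline.

```python
# Split sizes as listed in this README.
splits = {"train": 3120, "validation": 391, "test": 388}

total = sum(splits.values())
shares = {name: size / total for name, size in splits.items()}

print(f"total: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```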

## Metrics

Evaluation metrics from the bundled model artifact:

- validation accuracy: `0.8696`
- validation macro-F1: `0.8696`
- test accuracy: `0.8789`
- test macro-F1: `0.8769`
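
For reference, macro-F1 is the unweighted mean of the per-class F1 scores, so every category counts equally regardless of how many examples it has. The dependency-free sketch below illustrates the definition; the function name `macro_f1` is hypothetical, and the project may well compute its metrics with a library instead.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

When the classes are roughly balanced, accuracy and macro-F1 tend to land close together, as they do in the figures above.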

## Local Run

Install dependencies:

```bash
python3 -m pip install -r requirements.txt
```

Start the app:

```bash
streamlit run app.py --server.port 8080
```

## Repository Layout

- `app.py` - Streamlit UI
- `inference.py` - model loading and inference pipeline
- `configs/app_config.json` - runtime configuration
- `artifacts/large_model/best_model/` - trained model weights and tokenizer
- `artifacts/large_model/metrics.json` - evaluation metrics
- `data/processed_large/label_mapping.json` - label mapping used by inference

## Deployment

This repository is prepared for Hugging Face Spaces with `sdk: streamlit`. The app runs directly from local artifacts and does not require downloading model weights at runtime.

## Example Use Cases

- quick topic tagging for arXiv drafts
- sanity-checking paper metadata before submission
- exploring how transformer classifiers separate neighboring scientific fields

## Notes

- Predictions are limited by the training taxonomy and dataset coverage.
- The model is intended as a lightweight demo application, not a substitute for expert annotation.