title: arXiv Topic Classifier
emoji: 📚
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.33.0
app_file: app.py
pinned: false
license: mit
short_description: Transformer-powered topic classification for arXiv papers
arXiv Topic Classifier
arXiv Topic Classifier is a Streamlit app for classifying research papers into arXiv-style topic categories from the paper title and abstract. The interface accepts the two fields separately, supports title-only inference, and returns the smallest prefix of labels whose cumulative probability exceeds 95%.
The project is designed as a lightweight end-to-end ML application: collect data, fine-tune a transformer classifier, package the trained model with local inference code, and expose the result through a public web interface.
Features
- topic prediction from title and abstract
- inference from title only when abstract is missing
- top-95% cumulative probability output
- full ranked list of class probabilities
- cached model loading for faster repeated requests
- self-contained deployment with local model weights
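The top-95% output can be illustrated with a minimal sketch. The function name `top_p_prefix` and the example probabilities are hypothetical and are not taken from the app's code; the sketch only assumes the model yields one probability per label.

```python
def top_p_prefix(probs, threshold=0.95):
    """Return the shortest ranked prefix of (label, probability) pairs
    whose cumulative probability exceeds `threshold`.

    `probs` maps label -> probability (assumed to sum to roughly 1).
    """
    # Rank labels from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    prefix = []
    cumulative = 0.0
    for label, p in ranked:
        prefix.append((label, p))
        cumulative += p
        # Stop as soon as the running total passes the threshold.
        if cumulative > threshold:
            break
    return prefix


# Illustrative values only, not real model output.
example = {"cs.CL": 0.70, "cs.CV": 0.20, "cs.RO": 0.06, "quant-ph": 0.04}
for label, p in top_p_prefix(example):
    print(f"{label}: {p:.2f}")
```

If the probabilities never exceed the threshold, the loop runs to the end and the full ranked list is returned.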
Categories
The current model predicts 10 categories:
- astro-ph.GA
- cond-mat.mtrl-sci
- cs.CL
- cs.CV
- cs.RO
- econ.EM
- math.PR
- physics.optics
- q-bio.BM
- quant-ph
Model
The production model is distilbert-base-uncased, fine-tuned for multi-class text classification.
Configuration:
- max sequence length: 256
- epochs: 3
- learning rate: 2e-5
The model consumes a single formatted string built from the input fields:
title: <paper title> abstract: <paper abstract>
If the abstract is missing, inference falls back to:
title: <paper title>
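The two input formats can be produced by one small helper. This is a sketch under assumptions: the name `build_model_input` is hypothetical, and the whitespace normalization is an illustrative choice rather than something the project documents.

```python
def build_model_input(title, abstract=None):
    """Format title and optional abstract into the single text the model consumes."""
    # Collapse runs of whitespace (assumption: not confirmed by the source).
    title = " ".join(title.split())
    abstract = " ".join((abstract or "").split())

    if abstract:
        return f"title: {title} abstract: {abstract}"
    # Fall back to title-only input when the abstract is missing or blank.
    return f"title: {title}"


print(build_model_input("Attention Is All You Need", "We propose a new architecture."))
print(build_model_input("Attention Is All You Need"))
```

Treating a blank abstract the same as a missing one keeps the title-only fallback from emitting a dangling `abstract:` tag.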
Dataset
The dataset was collected from the arXiv API and processed into train, validation, and test splits.
Prepared split sizes:
- train: 3120
- validation: 391
- test: 388
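As a quick check on the proportions, the following sketch computes each split's share of the total from the sizes listed above. The variable names are illustrative only.

```python
# Split sizes as listed in this README.
split_sizes = {"train": 3120, "validation": 391, "test": 388}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

print(f"total examples: {total}")
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

The sizes correspond to roughly an 80/10/10 split.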
Metrics
Evaluation metrics from the bundled model artifact:
- validation accuracy: 0.8696
- validation macro-F1: 0.8696
- test accuracy: 0.8789
- test macro-F1: 0.8769
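For reference, accuracy and macro-F1 can be computed as in the sketch below. It is a plain-Python illustration of the standard definitions, not the project's evaluation code, and the toy labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only.
y_true = ["cs.CL", "cs.CL", "cs.CV", "cs.CV"]
y_pred = ["cs.CL", "cs.CV", "cs.CV", "cs.CV"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Macro-F1 weights every class equally, so it penalizes poor performance on small classes more than accuracy does.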
Local Run
Install dependencies:
python3 -m pip install -r requirements.txt
Start the app:
streamlit run app.py --server.port 8080
Repository Layout
- app.py - Streamlit UI
- inference.py - model loading and inference pipeline
- configs/app_config.json - runtime configuration
- artifacts/large_model/best_model/ - trained model weights and tokenizer
- artifacts/large_model/metrics.json - evaluation metrics
- data/processed_large/label_mapping.json - label mapping used by inference
Deployment
This repository is prepared for Hugging Face Spaces with sdk: streamlit. The app runs directly from local artifacts and does not require downloading model weights at runtime.
Example Use Cases
- quick topic tagging for arXiv drafts
- sanity-checking paper metadata before submission
- exploring how transformer classifiers separate neighboring scientific fields
Notes
- Predictions are limited by the training taxonomy and dataset coverage.
- The model is intended as a lightweight demo application, not a substitute for expert annotation.