---
title: arXiv Topic Classifier
emoji: 📚
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.33.0
app_file: app.py
pinned: false
license: mit
short_description: Transformer-powered topic classification for arXiv papers
---

# arXiv Topic Classifier

`arXiv Topic Classifier` is a Streamlit app for classifying research papers into arXiv-style topic categories from the paper title and abstract. The interface accepts the two fields separately, supports title-only inference, and returns the smallest prefix of labels whose cumulative probability exceeds 95%.

The project is designed as a lightweight end-to-end ML application: collect data, fine-tune a transformer classifier, package the trained model with local inference code, and expose the result through a public web interface.

## Features

- topic prediction from `title` and `abstract`
- inference from `title` only when abstract is missing
- top-95% cumulative probability output
- full ranked list of class probabilities
- cached model loading for faster repeated requests
- self-contained deployment with local model weights

## Categories

The current model predicts 10 categories:

- `astro-ph.GA`
- `cond-mat.mtrl-sci`
- `cs.CL`
- `cs.CV`
- `cs.RO`
- `econ.EM`
- `math.PR`
- `physics.optics`
- `q-bio.BM`
- `quant-ph`

## Model

The production model is based on `distilbert-base-uncased` fine-tuned for multi-class text classification.

Configuration:

- max sequence length: `256`
- epochs: `3`
- learning rate: `2e-5`

The model consumes a single formatted text built from the input fields:

```text
title:
abstract:
```

If the abstract is missing, inference falls back to:

```text
title:
```

## Dataset

The dataset was collected from the arXiv API and processed into train, validation, and test splits.

Prepared split sizes:

- train: `3120`
- validation: `391`
- test: `388`

## Metrics

Evaluation metrics from the bundled model artifact:

- validation accuracy: `0.8696`
- validation macro-F1: `0.8696`
- test accuracy: `0.8789`
- test macro-F1: `0.8769`

## Local Run

Install dependencies:

```bash
python3 -m pip install -r requirements.txt
```

Start the app:

```bash
streamlit run app.py --server.port 8080
```

## Repository Layout

- `app.py` - Streamlit UI
- `inference.py` - model loading and inference pipeline
- `configs/app_config.json` - runtime configuration
- `artifacts/large_model/best_model/` - trained model weights and tokenizer
- `artifacts/large_model/metrics.json` - evaluation metrics
- `data/processed_large/label_mapping.json` - label mapping used by inference

## Deployment

This repository is prepared for Hugging Face Spaces with `sdk: streamlit`. The app runs directly from local artifacts and does not require downloading model weights at runtime.

## Example Use Cases

- quick topic tagging for arXiv drafts
- sanity-checking paper metadata before submission
- exploring how transformer classifiers separate neighboring scientific fields

## Notes

- Predictions are limited by the training taxonomy and dataset coverage.
- The model is intended as a lightweight demo application, not a substitute for expert annotation.
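
As an illustration of the behavior this README describes (a single formatted input text with a title-only fallback, and output limited to the smallest label prefix whose cumulative probability exceeds 95%), here is a minimal sketch. It is not the code in `inference.py`: the function names `build_model_input` and `top_cumulative` are hypothetical, and the newline separator between the two fields is an assumption, since the exact formatting is not specified here.

```python
from typing import Dict, List, Optional, Tuple


def build_model_input(title: str, abstract: Optional[str] = None) -> str:
    """Format the title and abstract into one text string (hypothetical helper)."""
    title = title.strip()
    abstract = (abstract or "").strip()
    if not abstract:
        # Title-only fallback when the abstract is missing or blank.
        return f"title: {title}"
    # Assumption: the two fields are joined with a newline.
    return f"title: {title}\nabstract: {abstract}"


def top_cumulative(
    probs: Dict[str, float], threshold: float = 0.95
) -> List[Tuple[str, float]]:
    """Return the shortest ranked prefix whose cumulative probability exceeds `threshold`."""
    # Rank labels from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    selected: List[Tuple[str, float]] = []
    cumulative = 0.0
    for label, prob in ranked:
        selected.append((label, prob))
        cumulative += prob
        if cumulative > threshold:
            # Stop as soon as the running total passes the threshold.
            break
    return selected
```

For example, calling `top_cumulative` on a distribution such as `{"cs.CL": 0.70, "cs.CV": 0.20, "cs.RO": 0.07, "quant-ph": 0.03}` keeps the first three labels, because the running total only passes 0.95 once `cs.RO` is included.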