---
title: arXiv Topic Classifier
emoji: ๐
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.33.0
app_file: app.py
pinned: false
license: mit
short_description: Transformer-powered topic classification for arXiv papers
---

# arXiv Topic Classifier

`arXiv Topic Classifier` is a Streamlit app that classifies research papers into arXiv-style topic categories from the paper title and abstract. The interface accepts the two fields separately, supports title-only inference, and returns the shortest prefix of the probability-ranked labels whose cumulative probability exceeds 95%.

The project is designed as a lightweight end-to-end ML application: collect data, fine-tune a transformer classifier, package the trained model with local inference code, and expose the result through a public web interface.

## Features

- topic prediction from `title` and `abstract`
- inference from `title` only when the abstract is missing
- top-95% cumulative probability output
- full ranked list of class probabilities
- cached model loading for faster repeated requests
- self-contained deployment with local model weights
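
The top-95% output can be illustrated with a short sketch. This is not the code from `inference.py`; the function name `top_cumulative_prefix` and the example probabilities are made up for illustration. It ranks labels by probability and keeps adding them until the running total exceeds the threshold.

```python
from typing import Dict, List, Tuple


def top_cumulative_prefix(
    probs: Dict[str, float], threshold: float = 0.95
) -> List[Tuple[str, float]]:
    """Return the shortest probability-ranked prefix whose total exceeds threshold."""
    # Rank labels from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    prefix: List[Tuple[str, float]] = []
    total = 0.0
    for label, p in ranked:
        prefix.append((label, p))
        total += p
        if total > threshold:
            break
    # If the threshold is never exceeded, every label is returned.
    return prefix


# Hypothetical probabilities, not real model output.
example = {"cs.CL": 0.70, "cs.CV": 0.20, "cs.RO": 0.07, "quant-ph": 0.03}
selected = top_cumulative_prefix(example)
```

With these made-up values the first two labels total 0.90, so a third label is needed before the cumulative probability passes 0.95.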

## Categories

The current model predicts 10 categories:

- `astro-ph.GA`
- `cond-mat.mtrl-sci`
- `cs.CL`
- `cs.CV`
- `cs.RO`
- `econ.EM`
- `math.PR`
- `physics.optics`
- `q-bio.BM`
- `quant-ph`

## Model

The production model is `distilbert-base-uncased` fine-tuned for multi-class text classification.

Configuration:

- max sequence length: `256`
- epochs: `3`
- learning rate: `2e-5`

The model consumes a single formatted string built from the input fields:

```text
title: <paper title> abstract: <paper abstract>
```

If the abstract is missing, inference falls back to:

```text
title: <paper title>
```
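
A minimal sketch of how the two templates above could be produced from the form fields. The helper name `build_model_input` is hypothetical, and two behaviours are assumptions rather than documented facts: surrounding whitespace is stripped, and an empty or whitespace-only abstract is treated as missing.

```python
from typing import Optional


def build_model_input(title: str, abstract: Optional[str] = None) -> str:
    """Format title and optional abstract into the single text the model reads."""
    title = title.strip()
    # Assumption: None, "" and whitespace-only abstracts all count as missing.
    abstract = (abstract or "").strip()
    if abstract:
        return f"title: {title} abstract: {abstract}"
    return f"title: {title}"


with_abstract = build_model_input("Attention Is All You Need", "We propose ...")
title_only = build_model_input("Attention Is All You Need")
```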

## Dataset

The dataset was collected from the arXiv API and processed into train, validation, and test splits.
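
The collection script is not part of this README, so the following is only a sketch of how a per-category request to the public arXiv API might be built. The endpoint and the `search_query`, `start`, and `max_results` parameters are part of the arXiv API; the helper name `build_query_url` and the page size are assumptions. No request is sent here, and the API responds with an Atom XML feed that would still need parsing.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category: str, start: int = 0, max_results: int = 100) -> str:
    """Build an arXiv API URL that lists papers in one category."""
    params = {
        "search_query": f"cat:{category}",  # restrict results to a single category
        "start": start,                     # offset for paging through results
        "max_results": max_results,         # page size (assumed value)
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url("cs.CL", max_results=50)
```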

Prepared split sizes:

- train: `3120`
- validation: `391`
- test: `388`
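
As a quick arithmetic check, the sizes above can be turned into fractions of the whole dataset. The numbers are consistent with a roughly 80/10/10 split, although the intended ratio is not stated in this README.

```python
# Split sizes as listed above.
splits = {"train": 3120, "validation": 391, "test": 388}

total = sum(splits.values())
# Fraction of the full dataset in each split, rounded to three decimals.
fractions = {name: round(count / total, 3) for name, count in splits.items()}
```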

## Metrics

Evaluation metrics from the bundled model artifact:

- validation accuracy: `0.8696`
- validation macro-F1: `0.8696`
- test accuracy: `0.8789`
- test macro-F1: `0.8769`
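
For readers unfamiliar with macro-F1: it is the unweighted mean of the per-class F1 scores, so every category counts equally regardless of how many examples it has. The sketch below is a plain-Python illustration of that definition. It is not the project's evaluation code, which may well rely on a library implementation, and the function name `macro_f1` is chosen here for illustration.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("macro_f1 needs at least one example")
    pairs = list(zip(y_true, y_pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only.
score = macro_f1(["cs.CL", "cs.CL", "cs.CV", "cs.CV"],
                 ["cs.CL", "cs.CV", "cs.CV", "cs.CV"])
```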

## Local Run

Install dependencies:

```bash
python3 -m pip install -r requirements.txt
```

Start the app:

```bash
streamlit run app.py --server.port 8080
```

## Repository Layout

- `app.py` - Streamlit UI
- `inference.py` - model loading and inference pipeline
- `configs/app_config.json` - runtime configuration
- `artifacts/large_model/best_model/` - trained model weights and tokenizer
- `artifacts/large_model/metrics.json` - evaluation metrics
- `data/processed_large/label_mapping.json` - label mapping used by inference

## Deployment

This repository is prepared for Hugging Face Spaces with `sdk: streamlit`. The app runs directly from local artifacts and does not require downloading model weights at runtime.
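
Because the app depends on files shipped in the repository, a pre-deployment check that they are present can catch a broken upload early. The sketch below is a hypothetical helper, not part of the repository; it only tests that the paths from the Repository Layout section exist and does not validate their contents.

```python
from pathlib import Path
from typing import List

# Paths taken from the Repository Layout section.
REQUIRED_PATHS = [
    "configs/app_config.json",
    "artifacts/large_model/best_model",
    "artifacts/large_model/metrics.json",
    "data/processed_large/label_mapping.json",
]


def missing_artifacts(repo_root: str = ".") -> List[str]:
    """Return the required paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]
```

Running `missing_artifacts()` from the repository root should return an empty list when all local artifacts are in place.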

## Example Use Cases

- quick topic tagging for arXiv drafts
- sanity-checking paper metadata before submission
- exploring how transformer classifiers separate neighboring scientific fields

## Notes

- Predictions are limited by the training taxonomy and dataset coverage.
- The model is intended as a lightweight demo application, not a substitute for expert annotation.