---
title: arXiv Topic Classifier
emoji: 📑
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
license: mit
---
# arXiv Topic Classifier
A small web app that takes a paper's **title** and (optionally) **abstract** and predicts the most likely arXiv top-level categories (cs, math, physics, q-bio, stat, ...).
The model is a fine-tuned `distilbert-base-uncased`. Predictions are displayed as a top-95% list: the smallest set of categories whose cumulative probability is at least 95%, sorted by descending confidence.
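The top-95% selection can be sketched as follows. This is a minimal illustration, not the code from `app.py`: the function name `top_p_categories` and the example probabilities are hypothetical, and the input is assumed to be a mapping from category to softmax probability.

```python
from typing import Dict, List, Tuple


def top_p_categories(
    probs: Dict[str, float], threshold: float = 0.95
) -> List[Tuple[str, float]]:
    """Return the smallest prefix of categories, sorted by descending
    probability, whose cumulative probability reaches `threshold`."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    selected: List[Tuple[str, float]] = []
    cumulative = 0.0
    for label, p in ranked:
        selected.append((label, p))
        cumulative += p
        if cumulative >= threshold:
            break  # threshold reached; remaining categories are dropped
    return selected


# Hypothetical model output for one paper (values are made up).
example = {"cs": 0.70, "stat": 0.20, "math": 0.06, "physics": 0.03, "q-bio": 0.01}
top = top_p_categories(example)
print([label for label, _ in top])
```

A confident prediction yields a short list (often a single category), while an ambiguous paper yields a longer one.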
## Files
- [app.py](app.py): Streamlit UI and inference code.
- [train.ipynb](train.ipynb): End-to-end training notebook (data loading, fine-tuning, evaluation, model saving).
- [requirements.txt](requirements.txt): Python dependencies for HuggingFace Spaces.
- [PROJECT.md](PROJECT.md): Detailed project write-up (data, model choices, experiments, results).
## Run locally
```bash
pip install -r requirements.txt
# Either: train your own model with train.ipynb (produces ./model/)
# Or: set ARXIV_MODEL_REPO=your-username/arxiv-topic-classifier
streamlit run app.py
```
The app and the training notebook auto-detect the best available device: **MPS** (Apple Silicon) → **CUDA** → **CPU**. On an M1 Max, one inference call takes ~30–80 ms; on the HF Spaces free tier (CPU), ~150–300 ms.
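The device priority above could be implemented along these lines. This is a sketch under assumptions rather than the actual code in `app.py`: the helper names `pick_device` and `detect_device` are hypothetical, and the fallback to CPU when PyTorch is not installed is added only so the snippet runs anywhere.

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Apply the priority MPS -> CUDA -> CPU to two availability flags."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends and pick one by priority."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption: degrade gracefully without PyTorch
    # Older PyTorch builds have no torch.backends.mps attribute.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(mps_ok, torch.cuda.is_available())


print(detect_device())
```

The returned string can be passed to `model.to(...)` and to the tensors produced by the tokenizer.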
## Deploy to HuggingFace Spaces
1. Create a new Space at https://huggingface.co/new-space and choose the SDK that matches the `sdk` field in the front matter above: `docker`, with the Streamlit app served on port 7860.
2. Push this directory (`app.py`, `requirements.txt`, `README.md`) to the Space's git repo.
3. **Either** push the trained `./model/` directory alongside the code, **or** publish your model to HF Hub and add a Space secret named `ARXIV_MODEL_REPO` with the repo id (e.g. `your-username/arxiv-topic-classifier`).
4. The Space will rebuild automatically (≈ 2–4 minutes). Once it's green, your app is live.
## Model loading priority
1. If the env var `ARXIV_MODEL_REPO` is set, the app loads weights from that HF Hub repo.
2. Otherwise it looks for a local `./model/` directory (the one produced by `train.ipynb`).
3. If neither is available, the app shows a friendly error explaining what to do.
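The priority order above can be sketched as a small resolver. This is an illustration under assumptions, not the code from `app.py`: the function name `resolve_model_source` and the error message are hypothetical, and only the `ARXIV_MODEL_REPO` variable and the `./model/` path come from this README.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_model_source(
    env: Mapping[str, str] = os.environ, local_dir: str = "./model"
) -> str:
    """Return an HF Hub repo id or a local path, following the README order."""
    repo = env.get("ARXIV_MODEL_REPO", "").strip()
    if repo:
        return repo  # 1. the environment variable wins
    if Path(local_dir).is_dir():
        return local_dir  # 2. fall back to the locally trained model
    # 3. nothing available: raise so the UI can show an explanatory message
    raise FileNotFoundError(
        "No model found. Set ARXIV_MODEL_REPO or run train.ipynb "
        f"to create {local_dir}."
    )


print(resolve_model_source({"ARXIV_MODEL_REPO": "your-username/arxiv-topic-classifier"}))
```

Both return values are valid arguments to `from_pretrained` in the `transformers` library, which accepts either a Hub repo id or a local directory, so the caller does not need to distinguish the two cases.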