Pyotr Lisov

Add article classifier app

70b2ea0 about 1 month ago

	---
	title: arXiv Topic Classifier
	emoji: 📚
	colorFrom: blue
	colorTo: green
	sdk: streamlit
	sdk_version: 1.33.0
	app_file: app.py
	pinned: false
	license: mit
	short_description: Transformer-powered topic classification for arXiv papers
	---

	# arXiv Topic Classifier

	`arXiv Topic Classifier` is a Streamlit app that classifies research papers into arXiv topic categories from the paper title and abstract. The interface accepts the two fields separately, supports title-only inference, and returns the shortest prefix of the probability-ranked labels whose cumulative probability exceeds 95%.
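The cumulative-probability rule can be sketched in a few lines. This is a minimal illustration, not the code in `inference.py`: the function name `top_cumulative` and the example probabilities are hypothetical, and the input is assumed to be a mapping from label to probability.

```python
from typing import Dict, List, Tuple


def top_cumulative(
    probs: Dict[str, float], threshold: float = 0.95
) -> List[Tuple[str, float]]:
    """Return the shortest ranked prefix whose cumulative probability exceeds `threshold`."""
    # Rank labels from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    selected: List[Tuple[str, float]] = []
    total = 0.0
    for label, p in ranked:
        selected.append((label, p))
        total += p
        if total > threshold:
            break  # stop as soon as the threshold is crossed
    return selected


# Made-up probabilities for illustration only.
example = {"cs.CL": 0.70, "cs.CV": 0.20, "cs.RO": 0.07, "quant-ph": 0.03}
print([label for label, _ in top_cumulative(example)])
```

With these example values the first two labels sum to 0.90, so a third label is needed before the cumulative probability passes 0.95.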

	The project is designed as a lightweight end-to-end ML application: collect data, fine-tune a transformer classifier, package the trained model with local inference code, and expose the result through a public web interface.

	## Features

	- topic prediction from `title` and `abstract`
	- inference from `title` only when the abstract is missing
	- top-95% output: the shortest ranked label list whose cumulative probability exceeds 95%
	- full ranked list of class probabilities
	- cached model loading for faster repeated requests
	- self-contained deployment with local model weights
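The README does not say how model caching is implemented. In a Streamlit app it is commonly done with the `st.cache_resource` decorator; the sketch below shows the same idea with the standard library's `functools.lru_cache` so that it runs without Streamlit. The loader `load_model` and its dictionary return value are hypothetical stand-ins for the real model and tokenizer.

```python
from functools import lru_cache

# Counts how many times the expensive load actually runs.
LOAD_CALLS = {"count": 0}


@lru_cache(maxsize=1)
def load_model(model_dir: str) -> dict:
    """Pretend to load a model; the body runs once per distinct `model_dir`."""
    LOAD_CALLS["count"] += 1
    # A real loader would read weights and tokenizer files from `model_dir` here.
    return {"model_dir": model_dir}


first = load_model("artifacts/large_model/best_model")
second = load_model("artifacts/large_model/best_model")
print(first is second, LOAD_CALLS["count"])
```

The second call returns the cached object, so repeated requests skip the load step entirely.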

	## Categories

	The current model predicts 10 categories:

	- `astro-ph.GA`
	- `cond-mat.mtrl-sci`
	- `cs.CL`
	- `cs.CV`
	- `cs.RO`
	- `econ.EM`
	- `math.PR`
	- `physics.optics`
	- `q-bio.BM`
	- `quant-ph`

	## Model

	The production model is `distilbert-base-uncased`, fine-tuned for multi-class text classification.

	Configuration:

	- max sequence length: `256`
	- epochs: `3`
	- learning rate: `2e-5`

	The model consumes a single formatted text built from the input fields:

	```text
	title: <paper title> abstract: <paper abstract>
	```

	If the abstract is missing, inference falls back to:

	```text
	title: <paper title>
	```
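The two formats above could be produced by a small helper like the following. This is a sketch, not the function used in `inference.py`: the name `build_model_input` is hypothetical, and trimming whitespace and treating a blank abstract as missing are assumptions.

```python
from typing import Optional


def build_model_input(title: str, abstract: Optional[str] = None) -> str:
    """Format the title and optional abstract into one string for the model."""
    title = title.strip()
    # Assumption: None, empty, and whitespace-only abstracts all count as missing.
    abstract = (abstract or "").strip()
    if abstract:
        return f"title: {title} abstract: {abstract}"
    return f"title: {title}"


print(build_model_input("Attention Is All You Need", "We propose the Transformer."))
print(build_model_input("Attention Is All You Need"))
```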

	## Dataset

	The dataset was collected from the arXiv API and processed into train, validation, and test splits.

	Prepared split sizes:

	- train: `3120`
	- validation: `391`
	- test: `388`
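As a quick arithmetic check, the split sizes listed above total 3899 examples and correspond to roughly an 80/10/10 split. The snippet only recomputes those proportions from the listed sizes.

```python
splits = {"train": 3120, "validation": 391, "test": 388}

total = sum(splits.values())
fractions = {name: size / total for name, size in splits.items()}

print(f"total examples: {total}")
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```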

	## Metrics

	Evaluation metrics from the bundled model artifact:

	- validation accuracy: `0.8696`
	- validation macro-F1: `0.8696`
	- test accuracy: `0.8789`
	- test macro-F1: `0.8769`
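For readers unfamiliar with macro-F1, it is the unweighted mean of the per-class F1 scores, so every category counts equally regardless of its size. The sketch below is a plain-Python illustration on toy labels; it is not the evaluation script that produced the numbers above, and the function name `macro_f1` is hypothetical.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy labels for illustration only.
y_true = ["cs.CL", "cs.CL", "cs.CV", "cs.CV"]
y_pred = ["cs.CL", "cs.CV", "cs.CV", "cs.CV"]
print(round(macro_f1(y_true, y_pred), 4))
```

In the toy example the per-class F1 scores are 2/3 for `cs.CL` and 4/5 for `cs.CV`, and macro-F1 is their plain average.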

	## Local Run

	Install dependencies:

	```bash
	python3 -m pip install -r requirements.txt
	```

	Start the app:

	```bash
	streamlit run app.py --server.port 8080
	```

	## Repository Layout

	- `app.py` - Streamlit UI
	- `inference.py` - model loading and inference pipeline
	- `configs/app_config.json` - runtime configuration
	- `artifacts/large_model/best_model/` - trained model weights and tokenizer
	- `artifacts/large_model/metrics.json` - evaluation metrics
	- `data/processed_large/label_mapping.json` - label mapping used by inference
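Because the app depends on every path listed above being present, a small pre-launch check can catch an incomplete checkout. This is an optional sketch and not part of the repository: the helper `missing_paths` is hypothetical, and the path list is copied from the layout above.

```python
from pathlib import Path
from typing import List

# Paths taken from the repository layout listed above.
EXPECTED_PATHS = [
    "app.py",
    "inference.py",
    "configs/app_config.json",
    "artifacts/large_model/best_model/",
    "artifacts/large_model/metrics.json",
    "data/processed_large/label_mapping.json",
]


def missing_paths(repo_root: str) -> List[str]:
    """Return the expected files or directories that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


# Run from the repository root; an empty list means nothing is missing.
print(missing_paths("."))
```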

	## Deployment

	This repository is prepared for Hugging Face Spaces with `sdk: streamlit`. The app runs directly from local artifacts and does not require downloading model weights at runtime.

	## Example Use Cases

	- quick topic tagging for arXiv drafts
	- sanity-checking paper metadata before submission
	- exploring how transformer classifiers separate neighboring scientific fields

	## Notes

	- Predictions are limited by the training taxonomy and dataset coverage.
	- The model is intended as a lightweight demo application, not a substitute for expert annotation.