Article-classifier / README.md
Pyotr Lisov

metadata
title: arXiv Topic Classifier
emoji: 📚
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.33.0
app_file: app.py
pinned: false
license: mit
short_description: Transformer-powered topic classification for arXiv papers

arXiv Topic Classifier

arXiv Topic Classifier is a Streamlit app that classifies research papers into arXiv-style topic categories from the paper title and abstract. The interface accepts the two fields separately, supports title-only inference, and returns the shortest prefix of the probability-ranked labels whose cumulative probability exceeds 95%.

The project is designed as a lightweight end-to-end ML application: collect data, fine-tune a transformer classifier, package the trained model with local inference code, and expose the result through a public web interface.

Features

  • topic prediction from title and abstract
  • inference from title only when abstract is missing
  • top-95% cumulative probability output
  • full ranked list of class probabilities
  • cached model loading for faster repeated requests
  • self-contained deployment with local model weights
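
The top-95% output can be illustrated with a short sketch. This is not the code from inference.py (which is not reproduced here); the function name `top_cumulative_prefix` and the dictionary input are assumptions made for illustration. It sorts the class probabilities in descending order and keeps labels until their running sum exceeds the threshold.

```python
from typing import Dict, List, Tuple


def top_cumulative_prefix(
    probs: Dict[str, float], threshold: float = 0.95
) -> List[Tuple[str, float]]:
    """Return the shortest ranked prefix whose cumulative probability exceeds `threshold`.

    Hypothetical helper: the name and signature are illustrative assumptions.
    """
    # Rank labels from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    prefix: List[Tuple[str, float]] = []
    cumulative = 0.0
    for label, p in ranked:
        prefix.append((label, p))
        cumulative += p
        if cumulative > threshold:
            break
    return prefix


example = {"cs.CL": 0.70, "cs.CV": 0.20, "cs.RO": 0.08, "quant-ph": 0.02}
selected = top_cumulative_prefix(example)
print([label for label, _ in selected])
```

In this example the first two labels cover only about 90% of the probability mass, so the third label is included as well. A confident prediction collapses to a single label, while an ambiguous one yields a longer list.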

Categories

The current model predicts 10 categories:

  • astro-ph.GA
  • cond-mat.mtrl-sci
  • cs.CL
  • cs.CV
  • cs.RO
  • econ.EM
  • math.PR
  • physics.optics
  • q-bio.BM
  • quant-ph

Model

The production model is distilbert-base-uncased, fine-tuned for multi-class text classification.

Configuration:

  • max sequence length: 256
  • epochs: 3
  • learning rate: 2e-5

The model consumes a single formatted string built from the input fields:

title: <paper title> abstract: <paper abstract>

If the abstract is missing, inference falls back to:

title: <paper title>
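
A minimal sketch of how the two templates above could be produced. The function name `build_input_text` is hypothetical, and collapsing runs of whitespace is an assumption rather than documented behavior of the app.

```python
from typing import Optional


def build_input_text(title: str, abstract: Optional[str] = None) -> str:
    """Format the title and optional abstract into the single string fed to the model.

    Hypothetical helper; whitespace normalization is an assumption.
    """
    # Collapse newlines and repeated spaces so pasted text becomes one line.
    title = " ".join(title.split())
    abstract = " ".join((abstract or "").split())

    if abstract:
        return f"title: {title} abstract: {abstract}"
    # Title-only fallback when the abstract is missing or blank.
    return f"title: {title}"


print(build_input_text("Attention Is All You Need"))
print(build_input_text("Attention Is All You Need", "We propose a new architecture."))
```

Treating a blank abstract the same as a missing one keeps the inference-time format consistent with the title-only template.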

Dataset

The dataset was collected from the arXiv API and processed into train, validation, and test splits.

Prepared split sizes:

  • train: 3120
  • validation: 391
  • test: 388
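
The split sizes above sum to 3,899 examples, which works out to roughly an 80/10/10 split. The snippet below only reproduces that arithmetic from the listed counts.

```python
# Split sizes as listed in this README.
split_sizes = {"train": 3120, "validation": 391, "test": 388}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

print(f"total: {total}")
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```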

Metrics

Evaluation metrics from the bundled model artifact:

  • validation accuracy: 0.8696
  • validation macro-F1: 0.8696
  • test accuracy: 0.8789
  • test macro-F1: 0.8769
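
For reference, macro-F1 is the unweighted mean of the per-class F1 scores, so each of the 10 categories counts equally regardless of how many examples it has. The sketch below follows that standard definition in plain Python. It is not the evaluation script behind the numbers above, and the function names are illustrative.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["cs.CL", "cs.CL", "cs.CV", "cs.CV"]
y_pred = ["cs.CL", "cs.CV", "cs.CV", "cs.CV"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```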

Local Run

Install dependencies:

python3 -m pip install -r requirements.txt

Start the app:

streamlit run app.py --server.port 8080

Repository Layout

  • app.py - Streamlit UI
  • inference.py - model loading and inference pipeline
  • configs/app_config.json - runtime configuration
  • artifacts/large_model/best_model/ - trained model weights and tokenizer
  • artifacts/large_model/metrics.json - evaluation metrics
  • data/processed_large/label_mapping.json - label mapping used by inference

Deployment

This repository is prepared for Hugging Face Spaces with sdk: streamlit. The app runs directly from local artifacts and does not require downloading model weights at runtime.
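
One way to load the bundled weights without any network access is sketched below. It assumes the artifacts directory holds a standard Hugging Face checkpoint and uses `functools.lru_cache` as a stand-in for caching; inside a Streamlit app, `st.cache_resource` would be the usual decorator. The function name `load_classifier` is hypothetical, and the actual inference.py may be organized differently.

```python
from functools import lru_cache
from pathlib import Path

# Path taken from the repository layout in this README.
MODEL_DIR = Path("artifacts/large_model/best_model")


@lru_cache(maxsize=1)
def load_classifier(model_dir: str = str(MODEL_DIR)):
    """Load the tokenizer and model once from local files and reuse them afterwards.

    Hypothetical helper; assumes a standard Hugging Face checkpoint layout.
    """
    # Imported lazily so the module can be imported without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # local_files_only=True prevents any attempt to download from the Hub.
    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_dir, local_files_only=True
    )
    model.eval()  # disable dropout for inference
    return tokenizer, model
```

Calling `load_classifier()` repeatedly returns the same cached objects, which matches the cached model loading listed under Features.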

Example Use Cases

  • quick topic tagging for arXiv drafts
  • sanity-checking paper metadata before submission
  • exploring how transformer classifiers separate neighboring scientific fields

Notes

  • Predictions are limited by the training taxonomy and dataset coverage.
  • The model is intended as a lightweight demo application, not a substitute for expert annotation.