Paper-to-Field Classifier

Lightweight CPU-based topic classifier for scientific paper abstracts using the OpenAlex taxonomy (4,516 topics → 245 subfields → 26 fields → 4 domains).

Usage

from paper_classifier import PaperClassifier

classifier = PaperClassifier()
classifier.initialize()

result = classifier.classify(
    title="Attention Is All You Need",
    abstract="The dominant sequence transduction models are based on complex recurrent or convolutional neural networks..."
)

print(result)
# {
#   'topic': {'id': 10209, 'name': 'Neural Machine Translation and Sequence Models', 'score': 0.87},
#   'subfield': {'id': 1702, 'name': 'Artificial Intelligence'},
#   'field': {'id': 17, 'name': 'Computer Science'},
#   'domain': {'id': 3, 'name': 'Physical Sciences'}
# }
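
The result is a nested dict with one entry per taxonomy level, and only the topic entry carries a score. As a minimal sketch of how it might be consumed, the helper below formats the hierarchy as a single path and flags low-scoring predictions. The `describe` function and the `min_score` threshold of 0.5 are illustrative assumptions, not part of the package.

```python
def describe(result, min_score=0.5):
    """Format a classification result as a domain-to-topic path.

    `min_score` is a hypothetical cutoff chosen for illustration.
    """
    levels = ("domain", "field", "subfield", "topic")
    path = " > ".join(result[level]["name"] for level in levels)
    # Only the topic level has a score in the example output.
    score = result["topic"]["score"]
    flag = "" if score >= min_score else " (low confidence)"
    return f"{path} [{score:.2f}]{flag}"


# Same shape as the example output shown above.
example = {
    "topic": {"id": 10209, "name": "Neural Machine Translation and Sequence Models", "score": 0.87},
    "subfield": {"id": 1702, "name": "Artificial Intelligence"},
    "field": {"id": 17, "name": "Computer Science"},
    "domain": {"id": 3, "name": "Physical Sciences"},
}

print(describe(example))
```

A suitable threshold depends on the application; the value here is only a placeholder.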

Model Details

  • Base model: minishlab/potion-base-32M (Model2Vec)
  • Fine-tuned on: ~50K domain-balanced paper abstracts from OpenAlex
  • Taxonomy: OpenAlex (4,516 topics, 245 subfields, 26 fields, 4 domains)
  • Input: Paper title + abstract (truncated to 500 chars)
  • Inference: CPU-only, ~3,000 papers/second
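
The input described above can be sketched as follows. This assumes the title and abstract are joined with a single space and that the 500-character limit applies to the combined string; the classifier may instead truncate the abstract alone. The `build_input` helper is hypothetical and not part of the package.

```python
def build_input(title, abstract, max_chars=500):
    """Join title and abstract, then cut the text to `max_chars` characters.

    Assumption: truncation is applied to the combined string.
    """
    text = f"{title.strip()} {abstract.strip()}"
    return text[:max_chars]


text = build_input(
    "Attention Is All You Need",
    "The dominant sequence transduction models are based on complex "
    "recurrent or convolutional neural networks...",
)
print(len(text), text[:40])
```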

Training

Trained on papers from the OpenAlex bulk data, selected using these criteria:

  • English language
  • Has abstract
  • Primary topic confidence score > 0.8
  • Domain-balanced sampling (~12.5K per domain)
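
The selection criteria above could be sketched roughly as below. The record keys (`language`, `abstract`, `topic_score`, `domain`) and the function name are assumptions made for illustration; they do not reflect the actual OpenAlex schema or the training script.

```python
import random
from collections import defaultdict


def select_training_papers(papers, per_domain, min_score=0.8, seed=0):
    """Filter papers, then draw up to `per_domain` papers from each domain."""
    by_domain = defaultdict(list)
    for paper in papers:
        if paper.get("language") != "en":
            continue  # keep English-language papers only
        if not paper.get("abstract"):
            continue  # require a non-empty abstract
        if paper.get("topic_score", 0.0) <= min_score:
            continue  # primary topic score must exceed the threshold
        by_domain[paper["domain"]].append(paper)

    rng = random.Random(seed)  # fixed seed for reproducible sampling
    selected = []
    for domain in sorted(by_domain):
        pool = by_domain[domain]
        # Sample without replacement; take the whole pool if it is small.
        selected.extend(rng.sample(pool, min(per_domain, len(pool))))
    return selected
```

With four domains, calling this with `per_domain=12_500` would give the roughly 50K papers mentioned under Model Details, provided each domain has enough papers that pass the filters.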

Install

pip install paper-classifier

Or from source:

git clone https://github.com/jimnoneill/paper-to-field.git
cd paper-to-field
pip install -e .