Paper-to-Field Classifier
Lightweight CPU-based topic classifier for scientific paper abstracts using the OpenAlex taxonomy (4,516 topics → 245 subfields → 26 fields → 4 domains).
Usage
from paper_classifier import PaperClassifier
classifier = PaperClassifier()
classifier.initialize()
result = classifier.classify(
    title="Attention Is All You Need",
    abstract="The dominant sequence transduction models are based on complex recurrent or convolutional neural networks..."
)
print(result)
# {
#     'topic': {'id': 10209, 'name': 'Neural Machine Translation and Sequence Models', 'score': 0.87},
#     'subfield': {'id': 1702, 'name': 'Artificial Intelligence'},
#     'field': {'id': 17, 'name': 'Computer Science'},
#     'domain': {'id': 3, 'name': 'Physical Sciences'}
# }
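A minimal sketch of consuming the result, assuming it is a plain dict shaped like the output printed above. `taxonomy_path`, `is_confident`, and `MIN_SCORE` are hypothetical names used for illustration and are not part of the package.

```python
# Hypothetical helpers; not part of paper_classifier.
# Assumes the result dict has the shape shown in the example output above.

MIN_SCORE = 0.5  # illustrative cutoff, not a recommended value


def taxonomy_path(result: dict) -> str:
    """Join the four taxonomy levels from broadest to narrowest."""
    levels = ("domain", "field", "subfield", "topic")
    return " > ".join(result[level]["name"] for level in levels)


def is_confident(result: dict, min_score: float = MIN_SCORE) -> bool:
    """Return True when the topic score reaches the chosen cutoff."""
    return result["topic"]["score"] >= min_score


example = {
    "topic": {"id": 10209, "name": "Neural Machine Translation and Sequence Models", "score": 0.87},
    "subfield": {"id": 1702, "name": "Artificial Intelligence"},
    "field": {"id": 17, "name": "Computer Science"},
    "domain": {"id": 3, "name": "Physical Sciences"},
}

print(taxonomy_path(example))
print(is_confident(example))
```

Only the topic entry carries a score in the example output, so the cutoff is applied at the topic level.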
Model Details
- Base model: minishlab/potion-base-32M (Model2Vec)
- Fine-tuned on: ~50K domain-balanced paper abstracts from OpenAlex
- Taxonomy: OpenAlex (4,516 topics, 245 subfields, 26 fields, 4 domains)
- Input: Paper title + abstract (truncated to 500 chars)
- Inference: CPU-only, ~3,000 papers/second
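The input rule above (title plus abstract, truncated to 500 characters) could look roughly like the sketch below. How the package actually joins the two fields, and whether the limit applies before or after joining, is not stated here; `build_input`, the single-space separator, and truncating the joined string are all assumptions.

```python
MAX_CHARS = 500  # limit stated in Model Details


def build_input(title: str, abstract: str, max_chars: int = MAX_CHARS) -> str:
    """Join title and abstract, then cut to max_chars characters.

    Assumption: a single space joins the fields and the limit is applied
    to the combined string. The real package may do this differently.
    """
    text = f"{title.strip()} {abstract.strip()}".strip()
    return text[:max_chars]


long_abstract = "The dominant sequence transduction models " * 20
text = build_input("Attention Is All You Need", long_abstract)
print(len(text))
```

Under this assumption, anything past the first 500 characters of a long abstract would not influence the prediction.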
Training
Trained on papers from the OpenAlex bulk data, selected by:
- English language
- Has abstract
- Primary topic confidence score > 0.8
- Domain-balanced sampling (~12.5K per domain)
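The selection criteria above can be sketched as a filter followed by per-domain sampling. This is an illustration, not the training script: the record keys (`language`, `abstract`, `topic_score`, `domain`) and the function names are hypothetical and do not mirror the OpenAlex schema.

```python
import random
from collections import defaultdict


def keep(paper: dict, min_score: float = 0.8) -> bool:
    """Apply the three filters: English, has abstract, topic score > min_score."""
    return (
        paper.get("language") == "en"
        and bool(paper.get("abstract"))
        and paper.get("topic_score", 0.0) > min_score
    )


def balanced_sample(papers, per_domain: int, seed: int = 0) -> list:
    """Draw up to per_domain filtered papers from each domain."""
    by_domain = defaultdict(list)
    for paper in papers:
        if keep(paper):
            by_domain[paper["domain"]].append(paper)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for domain in sorted(by_domain):
        group = by_domain[domain]
        sample.extend(rng.sample(group, min(per_domain, len(group))))
    return sample


# Toy records using the hypothetical keys described above.
papers = [
    {"language": "en", "abstract": "text", "topic_score": 0.9, "domain": "Physical Sciences"},
    {"language": "en", "abstract": "text", "topic_score": 0.95, "domain": "Physical Sciences"},
    {"language": "en", "abstract": "text", "topic_score": 0.85, "domain": "Life Sciences"},
    {"language": "de", "abstract": "text", "topic_score": 0.9, "domain": "Life Sciences"},
    {"language": "en", "abstract": "", "topic_score": 0.9, "domain": "Health Sciences"},
    {"language": "en", "abstract": "text", "topic_score": 0.6, "domain": "Social Sciences"},
]
sample = balanced_sample(papers, per_domain=1)
print([p["domain"] for p in sample])
```

With `per_domain=12_500` and four domains, a sampler like this would yield roughly the 50K abstracts mentioned under Model Details, provided each domain has enough papers that pass the filters.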
Install
pip install paper-classifier
Or from source:
git clone https://github.com/jimnoneill/paper-to-field.git
cd paper-to-field
pip install -e .