How to use jimnoneill/paper-to-field with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="jimnoneill/paper-to-field")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("jimnoneill/paper-to-field")
model = AutoModelForSequenceClassification.from_pretrained("jimnoneill/paper-to-field")
```

Transformer-based topic classifier for scientific paper abstracts using the OpenAlex taxonomy (4,516 topics → 245 subfields → 26 fields → 4 domains).
| Level | Accuracy |
|---|---|
| Field (26 classes) | 86.3% |
| Domain (4 classes) | 94.4% |
```python
from paper_classifier import PaperClassifier

classifier = PaperClassifier()
classifier.initialize()

result = classifier.classify(
    title="Attention Is All You Need",
    abstract="The dominant sequence transduction models are based on complex recurrent or convolutional neural networks..."
)
print(result)
# {
#   'topic': {'id': 10209, 'name': 'Neural Machine Translation and Sequence Models', 'score': 0.87},
#   'subfield': {'id': 1702, 'name': 'Artificial Intelligence'},
#   'field': {'id': 17, 'name': 'Computer Science', 'score': 0.95},
#   'domain': {'id': 3, 'name': 'Physical Sciences'}
# }
```
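
The example result above has one entry per taxonomy level. As a rough illustration of how the four levels nest, the sketch below walks a topic up to its domain using plain lookup tables. The tables are a hypothetical excerpt that contains only the IDs and names shown in the example output, and `roll_up` is an illustrative helper, not part of `paper_classifier`. The package may well predict some levels separately (the field has its own score), so treat this as a picture of the hierarchy rather than of the classifier's internals.

```python
# Hypothetical excerpt of the taxonomy: only the IDs from the example output.
TOPIC_TO_SUBFIELD = {10209: 1702}
SUBFIELD_TO_FIELD = {1702: 17}
FIELD_TO_DOMAIN = {17: 3}

NAMES = {
    ("topic", 10209): "Neural Machine Translation and Sequence Models",
    ("subfield", 1702): "Artificial Intelligence",
    ("field", 17): "Computer Science",
    ("domain", 3): "Physical Sciences",
}


def roll_up(topic_id: int) -> dict:
    """Walk topic -> subfield -> field -> domain and return id and name per level."""
    subfield_id = TOPIC_TO_SUBFIELD[topic_id]
    field_id = SUBFIELD_TO_FIELD[subfield_id]
    domain_id = FIELD_TO_DOMAIN[field_id]
    ids = {
        "topic": topic_id,
        "subfield": subfield_id,
        "field": field_id,
        "domain": domain_id,
    }
    return {level: {"id": i, "name": NAMES[(level, i)]} for level, i in ids.items()}


lineage = roll_up(10209)
print(lineage["field"]["name"], "->", lineage["domain"]["name"])
```

With a full mapping table, the same walk would let a topic-level prediction be reported at any coarser level.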
Trained on 200K domain-balanced paper abstracts from OpenAlex bulk data. Field labels were re-annotated with a DeepSeek LLM, and only labels with confidence >= 0.7 were kept.
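
A minimal sketch of what a confidence filter followed by domain balancing could look like. The record keys (`domain`, `field`, `confidence`), the helper name `filter_and_balance`, the toy records, and the choice to filter before balancing are all assumptions made for illustration. The actual data pipeline is not described here.

```python
import random
from collections import defaultdict


def filter_and_balance(records, min_confidence=0.7, per_domain=None, seed=0):
    """Keep confident labels, then sample the same number of records per domain."""
    by_domain = defaultdict(list)
    for rec in records:
        if rec["confidence"] >= min_confidence:
            by_domain[rec["domain"]].append(rec)
    if not by_domain:
        return []
    # Default to the smallest domain's size so every domain contributes equally.
    if per_domain is None:
        per_domain = min(len(group) for group in by_domain.values())
    rng = random.Random(seed)
    balanced = []
    for domain in sorted(by_domain):
        group = by_domain[domain]
        balanced.extend(rng.sample(group, min(per_domain, len(group))))
    return balanced


# Toy records, invented for this example.
records = [
    {"domain": "Physical Sciences", "field": "Computer Science", "confidence": 0.95},
    {"domain": "Physical Sciences", "field": "Engineering", "confidence": 0.80},
    {"domain": "Physical Sciences", "field": "Mathematics", "confidence": 0.55},
    {"domain": "Life Sciences", "field": "Neuroscience", "confidence": 0.72},
    {"domain": "Life Sciences", "field": "Immunology and Microbiology", "confidence": 0.40},
]
subset = filter_and_balance(records)
print(len(subset), sorted({r["domain"] for r in subset}))
```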
Hyperparameters: lr=1e-5, cosine schedule, batch=32 (grad accum 2 = effective 64), epochs=8, warmup=6%, label smoothing=0.1, fp16, early stopping (patience 5), sqrt inverse-frequency class weights.
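
To make two of these settings concrete, the plain-Python sketch below computes sqrt inverse-frequency class weights and a learning rate that warms up linearly and then decays along a cosine curve. It is an illustration under stated assumptions, not the training script: the function names are hypothetical, rescaling the weights to average 1 is one common convention that the settings above do not specify, and the step budget assumes all 200K examples are used for training.

```python
import math
from collections import Counter


def sqrt_inverse_frequency_weights(labels):
    """Weight each class by 1/sqrt(count), rescaled so the weights average to 1."""
    counts = Counter(labels)
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    mean = sum(raw.values()) / len(raw)  # rescaling is an assumption
    return {cls: w / mean for cls, w in raw.items()}


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.06):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Step budget implied by the settings, assuming 200K training examples.
num_examples = 200_000
effective_batch = 32 * 2                                     # batch 32, grad accum 2
steps_per_epoch = math.ceil(num_examples / effective_batch)  # 3125
total_steps = steps_per_epoch * 8                            # 25000 over 8 epochs

# Toy label counts: a frequent, a medium, and a rare class.
weights = sqrt_inverse_frequency_weights(["a"] * 100 + ["b"] * 25 + ["c"] * 4)
print(weights)
print(lr_at_step(0, total_steps), lr_at_step(1500, total_steps))
```

The square root softens the correction. In the toy counts the rare class is 25 times less frequent than the common one but receives only 5 times the weight, which limits how much a handful of rare examples can dominate the loss.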
```bash
pip install paper-classifier
```

Or from source:

```bash
git clone https://github.com/jimnoneill/paper-to-field.git
cd paper-to-field
pip install -e .
```