---
license: mit
datasets:
- RichardErkhov/April_2023_Public_Data_File_from_Crossref
metrics:
- precision
- recall
- f1
base_model:
- allenai/scibert_scivocab_uncased
pipeline_tag: text-classification
tags:
- scientometrics
- asjc
- multi-label
task_categories:
- text-classification
widget:
- text: >-
    title={Jodometrie}, container_title={Fresenius' Zeitschrift für
    analytische Chemie, Zeitschrift für analytische Chemie}, abstract={}
---
# 🧠 Open Multi-Label ASJC Classification

## Model Overview

This model fine-tunes allenai/scibert_scivocab_uncased for multi-label classification across 307 ASJC (All Science Journal Classification) subject categories, enabling document-level classification beyond traditional journal-level schemes.
- Task: Multi-label classification
- Labels: 307 ASJC subjects (granular level)
- Base Model: SciBERT
- Training Data: Crossref 2023 dataset (titles, abstracts, container titles)
- License: MIT
- Framework: Hugging Face Transformers
## 📚 Intended Use
- Classify individual research documents into multiple ASJC subjects.
- Analyze disciplinary orientation of collections (authors, institutions, databases).
- Works with title, abstract, and optionally container title metadata.
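The model consumes its metadata as a single flat string in the `key={value}` format shown in the widget example. A minimal helper for assembling that string from the three fields might look like this (`format_input` is a hypothetical name, not part of the released code):

```python
def format_input(title: str, container_title: str = "", abstract: str = "") -> str:
    """Assemble metadata fields into the key={value} string the model expects."""
    return (
        f"title={{{title}}}, "
        f"container_title={{{container_title}}}, "
        f"abstract={{{abstract}}}"
    )

print(format_input("Jodometrie", "Fresenius' Zeitschrift für analytische Chemie"))
# title={Jodometrie}, container_title={Fresenius' Zeitschrift für analytische Chemie}, abstract={}
```

Leaving `container_title` or `abstract` empty still produces a valid input, matching the "optionally container title" behavior described above.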
## 🛠 Training Details

- Preprocessing:
  - Removed "multidisciplinary" and 26 "miscellaneous" categories → 307 subjects.
  - Multi-hot encoding for multi-label classification.
  - Data augmentation for underrepresented classes.
- Fine-tuning:
  - Optimizer: AdamW
  - Loss: Binary Cross-Entropy
  - Learning Rate: 2e-5
  - Epochs: 1
  - Batch Size: 16
  - Threshold for label assignment: 0.3
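The combination of multi-hot targets, binary cross-entropy loss, and the 0.3 decision threshold can be sketched as follows. This is a toy three-label example in pure Python (a stand-in for PyTorch's `BCEWithLogitsLoss`); the real model has 307 labels and the logits below are invented:

```python
import math

# Toy label space; the real model covers 307 ASJC subjects
LABELS = ["Analytical Chemistry", "Biochemistry", "Clinical Biochemistry"]
label_to_id = {name: i for i, name in enumerate(LABELS)}

def multi_hot(names):
    """Encode a set of subject names as a multi-hot target vector."""
    vec = [0.0] * len(LABELS)
    for name in names:
        vec[label_to_id[name]] = 1.0
    return vec

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(logits, targets):
    """Binary cross-entropy averaged over labels: each label is an
    independent yes/no decision, which is what makes the task multi-label."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(logits)

target = multi_hot(["Analytical Chemistry", "Clinical Biochemistry"])  # [1.0, 0.0, 1.0]
logits = [2.0, -1.5, 1.2]  # hypothetical per-label model outputs

loss = bce_loss(logits, target)

# A label is assigned wherever sigmoid(logit) >= 0.3
preds = [sigmoid(z) >= 0.3 for z in logits]
print(preds)  # [True, False, True]
```

Because each label is thresholded independently, a document can receive any number of subjects, including zero.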
## 📈 Metrics
| Input Features | Labels | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Title + Container Title + Abstract | 307 | 0.912 | 0.885 | 0.892 |
| Title + Abstract | 307 | 0.607 | 0.503 | 0.532 |
| Title + Container Title | 307 | 0.949 | 0.957 | 0.952 |
| Title only | 307 | 0.528 | 0.416 | 0.448 |
For 26 parent subjects, F1-score improves to 0.934 with full metadata.
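For reference, multi-label precision, recall, and F1 are computed by counting per-label true/false positives and negatives over the multi-hot matrices. The card does not state which averaging mode was used; the sketch below assumes micro-averaging:

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision/recall/F1 over multi-hot label matrices
    (lists of rows, one row of 0/1 flags per document)."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            tp += t and p            # label present and predicted
            fp += (not t) and p      # predicted but absent
            fn += t and (not p)      # present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two toy documents over three labels
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
p, r, f1 = micro_prf(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.667 0.667
```

scikit-learn's `precision_recall_fscore_support(..., average="micro")` computes the same quantities on NumPy arrays.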
## ✅ Model Strengths

- Handles interdisciplinary and general science journals.
- Works even without the container title, though with lower accuracy.
- Scalable to large collections.
## ⚠️ Limitations
- Performance relies on metadata completeness (title, abstract, container title).
- Lower accuracy for rare subjects and missing source info.
- Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
## 🔍 Example Usage

```python
from transformers import TextClassificationPipeline, pipeline
import torch

# --- Custom multi-label pipeline ---
class ASJCMultiLabelPipeline(TextClassificationPipeline):
    """
    Multi-label classification pipeline for ASJC categories.
    Returns all labels whose score meets a configurable threshold.
    """

    def __init__(self, *args, **kwargs):
        # Allow a threshold override; default falls back to the model config
        self.threshold = kwargs.pop("threshold", None)
        super().__init__(*args, **kwargs)
        if self.threshold is None:
            self.threshold = getattr(self.model.config, "threshold", 0.3)

    def postprocess(self, model_outputs, **kwargs):
        # Convert logits to independent per-label probabilities via sigmoid
        scores = torch.sigmoid(model_outputs["logits"]).tolist()
        results = [
            {"label": self.model.config.id2label[i], "score": float(score)}
            for i, score in enumerate(scores[0])
            if score >= self.threshold
        ]
        # Sort by descending score
        return sorted(results, key=lambda x: x["score"], reverse=True)

# --- Create the pipeline explicitly using the custom class ---
pipe = pipeline(
    task="text-classification",
    model="asjc-classification/scibert_multilabel_asjc_classifier",
    pipeline_class=ASJCMultiLabelPipeline,
)

# --- Example text input ---
text = (
    "title={Jodometrie}, "
    "container_title={Fresenius' Zeitschrift für analytische Chemie, "
    "Zeitschrift für analytische Chemie}, "
    "abstract={}"
)

# --- Get multi-label predictions ---
result = pipe(text)
print(result)
# Predicted labels:
# [{'label': 'Analytical Chemistry', 'score': 0.933479368686676},
#  {'label': 'Clinical Biochemistry', 'score': 0.9108470678329468},
#  {'label': 'Biochemistry', 'score': 0.494137704372406}]

# Expected labels:
# - Clinical Biochemistry
# - Analytical Chemistry
```

Because `threshold` is popped in `__init__`, a stricter cutoff should be settable at construction time, e.g. `pipeline(..., pipeline_class=ASJCMultiLabelPipeline, threshold=0.5)`, which here would drop the borderline Biochemistry prediction.
## 📖 Citation

If you use this work, please cite:

```bibtex
@article{gusenbauer2025open,
  author  = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
  title   = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
  journal = {Scientometrics},
  year    = {2025}
}
```