+# 🧠 Open Multi-Label ASJC Classification
+## Model Overview
+This model is a fine-tuned version of **allenai/scibert_scivocab_uncased** that assigns documents to **307 ASJC (All Science Journal Classification) subject categories**, enabling document-level classification beyond traditional journal-level schemes.
+- **Task**: Multi-label classification
+- **Labels**: 307 ASJC subjects (granular level)
+- **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
+- **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
+- **License**: MIT
+- **Framework**: Hugging Face Transformers
+---
+## 📚 Intended Use
+- Classify individual research documents into multiple ASJC subjects.
+- Analyze disciplinary orientation of **collections** (authors, institutions, databases).
+- Works with **title**, **abstract**, and optionally **container title** metadata.
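To illustrate the collection-level use case, the following minimal sketch aggregates per-document predictions into a subject profile. The function name `subject_profile` and the example labels are hypothetical. This is one plausible way to summarize a collection, not necessarily the procedure used by the authors.

```python
from collections import Counter

def subject_profile(doc_labels):
    """Return each subject's share of documents in a collection.

    doc_labels: list of label lists, one per document (hypothetical input,
    e.g. the thresholded predictions for every paper of one author).
    """
    n_docs = len(doc_labels)
    if n_docs == 0:
        return {}
    counts = Counter()
    for labels in doc_labels:
        counts.update(set(labels))  # count each subject once per document
    # most_common() orders subjects from most to least frequent
    return {label: count / n_docs for label, count in counts.most_common()}

# Illustrative predictions for a four-document collection
collection = [
    ["Artificial Intelligence", "Information Systems"],
    ["Artificial Intelligence"],
    ["Library and Information Sciences", "Information Systems"],
    ["Artificial Intelligence", "Software"],
]
profile = subject_profile(collection)
```

Because documents can carry several labels, the shares in `profile` need not sum to 1.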
+---
+## 🛠 Training Details
+- **Dataset**: [Crossref](https://doi.org/10.13003/8wx5k)
+- **Preprocessing**:
+  - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
+  - Multi-hot encoding for multi-label classification.
+  - Data augmentation for underrepresented classes.
+- **Fine-tuning**:
+  - Optimizer: AdamW
+  - Loss: Binary Cross-Entropy
+  - Learning Rate: 2e-5
+  - Epochs: 1
+  - Batch Size: 16
+  - Threshold for label assignment: 0.3
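The multi-hot encoding and binary cross-entropy loss listed above can be sketched in plain Python. The three-label vocabulary and the logit values are made up for illustration. Actual training would more likely rely on the framework equivalent (for example PyTorch's `BCEWithLogitsLoss`), which this sketch only mirrors numerically.

```python
import math

def multi_hot(doc_labels, label2id):
    """Encode a document's label set as a 0/1 vector over all labels."""
    vec = [0.0] * len(label2id)
    for label in doc_labels:
        vec[label2id[label]] = 1.0
    return vec

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy, scoring every label independently."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns a logit into a probability
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(logits)

# Illustrative vocabulary of three ASJC codes (the real model has 307 labels)
label2id = {"1702": 0, "1710": 1, "3309": 2}
target = multi_hot(["1702", "3309"], label2id)
loss = bce_with_logits([2.0, -1.5, 0.5], target)
```

Each label contributes its own sigmoid term, so a document can be assigned any number of subjects. A softmax would instead force the labels to compete for a single assignment.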
+---
+## 📈 Metrics
+| Input Features                     | Labels | Precision | Recall | F1-Score |
+|------------------------------------|--------|-----------|--------|----------|
+| Title + Container Title + Abstract | 307    | 0.912     | 0.885  | 0.892    |
+| Title + Abstract                   | 307    | 0.607     | 0.503  | 0.532    |
+| Title + Container Title            | 307    | 0.949     | 0.957  | 0.952    |
+| Title only                         | 307    | 0.528     | 0.416  | 0.448    |
+
+At the level of the **26 parent subjects**, the F1-score with full metadata (title, container title, abstract) rises from 0.892 to **0.934**.
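How fine-grained predictions roll up to parent subjects can be sketched as follows. This assumes that labels are four-digit ASJC codes whose first two digits identify the parent subject area, and that scores are micro-averaged. Neither the label format nor the averaging method behind the table is stated here, so both are assumptions.

```python
def to_parents(codes):
    """Collapse four-digit ASJC codes to their two-digit parent prefixes."""
    return {code[:2] for code in codes}

def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall and F1 over a list of documents."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_true = sum(len(t) for t in true_sets)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy gold labels and predictions for two documents
true = [{"1702", "1710"}, {"3309"}]
pred = [{"1702", "1712"}, {"3309", "3310"}]

fine = micro_prf(true, pred)
coarse = micro_prf([to_parents(t) for t in true],
                   [to_parents(p) for p in pred])
```

In this toy case the fine-grained F1 is about 0.57 while the parent-level F1 is 1.0. Near-miss predictions within the same subject area count as correct after roll-up, which is consistent with the higher parent-level score.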
+---
+## ✅ Model Strengths
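+- Classifies documents from **interdisciplinary** and **general science journals** individually, rather than inheriting a single journal-level label.
+- Works without a container title, at lower accuracy (F1 of 0.532 for title + abstract versus 0.892 with full metadata).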
+- Scalable for large collections.
+---
+## ⚠️ Limitations
+- Performance relies on metadata completeness (title, abstract, container title).
+- Lower accuracy for rare subjects and for records without source (container title) information.
+- Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
+---
+## 🔍 Example Usage
+```python
+import json
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+# Load model and tokenizer (replace the placeholder with the actual repository ID)
+model_name = "your-hf-username/open-asjc-multilabel"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# Load sample input: a JSON object with a "title" and, optionally, an "abstract"
+with open("small_example.json") as f:
+    data = json.load(f)
+text = (data["title"] + " " + data.get("abstract", "")).strip()
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+# Predict: one independent sigmoid probability per label
+with torch.no_grad():
+    logits = model(**inputs).logits
+probs = torch.sigmoid(logits)[0].tolist()
+# Apply threshold; look labels up by index instead of relying on dict order
+threshold = 0.3
+predicted_labels = [model.config.id2label[i] for i, prob in enumerate(probs) if prob >= threshold]
+```
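With a fixed threshold, some documents may receive no label at all. One possible refinement is to fall back to the highest-scoring label in that case. This is a hypothetical addition, not part of the pipeline described above, and the helper name `select_labels` and its `fallback_top_k` parameter are invented for illustration.

```python
def select_labels(probs, id2label, threshold=0.3, fallback_top_k=1):
    """Return (label, probability) pairs at or above the threshold.

    Results are sorted by descending probability. If no label reaches the
    threshold, the top `fallback_top_k` labels are returned instead.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    selected = [(id2label[i], p) for i, p in ranked if p >= threshold]
    if not selected:
        selected = [(id2label[i], p) for i, p in ranked[:fallback_top_k]]
    return selected

# Illustrative three-label mapping and probabilities
id2label = {0: "Artificial Intelligence", 1: "Software", 2: "Information Systems"}
confident = select_labels([0.82, 0.10, 0.45], id2label)
uncertain = select_labels([0.05, 0.20, 0.12], id2label)
```

Whether a fallback label is desirable depends on the application. For collection-level profiles, leaving low-confidence documents unlabeled may be preferable to adding noise.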
+---
+## 📖 Citation
+If you use this work, please cite:
+```bibtex
+@article{gusenbauer2025open,
+  author    = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
+  title     = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
+  journal   = {Scientometrics},
+  year      = {2025}
+}
+```