asjc-classification/scibert_multilabel_asjc_classifier

@@ -1,103 +1,118 @@
-# 🧠 Open Multi-Label ASJC Classification
-## Model Overview
-This model fine-tunes **allenai/scibert_scivocab_uncased** across **307 ASJC subject categories**, enabling document-level classification beyond traditional journal-level schemes.
-- **Task**: Multi-label classification
-- **Labels**: 307 ASJC subjects (granular level)
-- **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
-- **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
-- **License**: MIT
-- **Framework**: Hugging Face Transformers
----
-## 📚 Intended Use
-- Classify individual research documents into multiple ASJC subjects.
-- Analyze disciplinary orientation of **collections** (authors, institutions, databases).
-- Works with **title**, **abstract**, and optionally **container title** metadata.
----
-## 🛠 Training Details
-- **Dataset**: [Crossref](https://doi.org/10.13003/8wx5k)
-- **Preprocessing**:
-  - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
-  - Multi-hot encoding for multi-label classification.
-  - Data augmentation for underrepresented classes.
-- **Fine-tuning**:
-  - Optimizer: AdamW
-  - Loss: Binary Cross-Entropy
-  - Learning Rate: 2e-5
-  - Epochs: 1
-  - Batch Size: 16
-  - Threshold for label assignment: 0.3
----
-## 📈 Metrics
-| Input Features                    | Labels | Precision | Recall | F1-Score |
-|-----------------------------------|--------|-----------|--------|----------|
-| Title + Container Title + Abstract| 307    | 0.912     | 0.885  | 0.892    |
-| Title + Abstract                  | 307    | 0.607     | 0.503  | 0.532    |
-| Title + Container Title           | 307    | 0.949     | 0.957  | 0.952    |
-| Title only                        | 307    | 0.528     | 0.416  | 0.448    |
-For **26 parent subjects**, F1-score improves to **0.934** with full metadata.
----
-## ✅ Model Strengths
-- Handles **interdisciplinary** and **general science journals**.
-- Works even without container title (lower accuracy).
-- Scalable for large collections.
----
-## ⚠️ Limitations
-- Performance relies on metadata completeness (title, abstract, container title).
-- Lower accuracy for rare subjects and missing source info.
-- Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
----
-## 🔍 Example Usage
-```python
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
-import torch
-import json
-# Load model and tokenizer
-model_name = "your-hf-username/open-asjc-multilabel"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForSequenceClassification.from_pretrained(model_name)
-# Load sample input
-with open("small_example.json") as f:
-    data = json.load(f)
-text = data["title"] + " " + data.get("abstract", "")
-inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
-# Predict
-with torch.no_grad():
-    outputs = model(**inputs)
-    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]
-# Apply threshold
-threshold = 0.3
-predicted_labels = [label for label, prob in zip(model.config.id2label.values(), probs) if prob >= threshold]
-```
-## 📖 Citation
-If you use this work, please cite:
-```bibtex
-@article{gusenbauer2025open,
-  author    = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
-  title     = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
-  journal   = {Scientometrics},
-  year      = {2025}
-}

+---
+license: mit
+datasets:
+- RichardErkhov/April_2023_Public_Data_File_from_Crossref
+metrics:
+- precision
+- recall
+- f1
+base_model:
+- allenai/scibert_scivocab_uncased
+pipeline_tag: text-classification
+tags:
+- scientometrics
+- asjc
+---
+# 🧠 Open Multi-Label ASJC Classification
+## Model Overview
+This model is **allenai/scibert_scivocab_uncased** fine-tuned on **307 ASJC subject categories**, enabling document-level classification beyond traditional journal-level schemes.
+- **Task**: Multi-label classification
+- **Labels**: 307 ASJC subjects (granular level)
+- **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
+- **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
+- **License**: MIT
+- **Framework**: Hugging Face Transformers
+---
+## 📚 Intended Use
+- Classify individual research documents into multiple ASJC subjects.
+- Analyze disciplinary orientation of **collections** (authors, institutions, databases).
+- Works with **title**, **abstract**, and optionally **container title** metadata.
+---
+## 🛠 Training Details
+- **Dataset**: [Crossref](https://doi.org/10.13003/8wx5k)
+- **Preprocessing**:
+  - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
+  - Multi-hot encoding for multi-label classification.
+  - Data augmentation for underrepresented classes.
+- **Fine-tuning**:
+  - Optimizer: AdamW
+  - Loss: Binary Cross-Entropy
+  - Learning Rate: 2e-5
+  - Epochs: 1
+  - Batch Size: 16
+  - Threshold for label assignment: 0.3
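The multi-hot encoding, binary cross-entropy loss, and 0.3 threshold listed above can be illustrated with a minimal pure-Python sketch. The label names, logits, and helper functions are hypothetical and only illustrate the technique; this is not the authors' training code.

```python
import math

# Hypothetical label set; the real model uses 307 ASJC subjects.
LABELS = ["1702", "1706", "2613"]


def multi_hot(doc_labels, all_labels=LABELS):
    """Encode a document's subjects as a 0/1 vector over all labels."""
    present = set(doc_labels)
    return [1.0 if label in present else 0.0 for label in all_labels]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits, targets):
    """Mean binary cross-entropy, treating each label as an independent yes/no task."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(logits)


def assign_labels(logits, threshold=0.3, all_labels=LABELS):
    """Keep every label whose sigmoid probability reaches the threshold."""
    return [label for label, z in zip(all_labels, logits) if sigmoid(z) >= threshold]


targets = multi_hot(["1702", "2613"])  # multi-hot target vector
logits = [2.0, -3.0, -0.5]             # made-up model outputs
loss = bce_loss(logits, targets)
predicted = assign_labels(logits)
```

In this made-up example the third label has a probability of roughly 0.38, so it is kept at a threshold of 0.3 but would be dropped at 0.5.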
+---
+## 📈 Metrics
+| Input Features                     | Labels | Precision | Recall | F1-Score |
+|------------------------------------|--------|-----------|--------|----------|
+| Title + Container Title + Abstract | 307    | 0.912     | 0.885  | 0.892    |
+| Title + Abstract                   | 307    | 0.607     | 0.503  | 0.532    |
+| Title + Container Title            | 307    | 0.949     | 0.957  | 0.952    |
+| Title only                         | 307    | 0.528     | 0.416  | 0.448    |
+For **26 parent subjects**, F1-score improves to **0.934** with full metadata.
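One way to obtain parent-level predictions from the granular labels is to roll each ASJC code up to its two-digit subject area. The sketch below assumes that labels are available as four-digit ASJC codes and that a parent's score is the maximum over its children. Both are assumptions for illustration; the evaluation behind the 0.934 figure may have been carried out differently.

```python
def parent_code(asjc_code):
    """Map a four-digit ASJC code to its subject area, e.g. '1702' -> '17'."""
    return str(asjc_code)[:2]


def roll_up(probs_by_code, threshold=0.3):
    """Aggregate granular probabilities to parent areas using the maximum child score."""
    parent_scores = {}
    for code, prob in probs_by_code.items():
        parent = parent_code(code)
        parent_scores[parent] = max(prob, parent_scores.get(parent, 0.0))
    return sorted(p for p, score in parent_scores.items() if score >= threshold)


# Made-up probabilities for a single document (hypothetical values)
example = {"1702": 0.81, "1706": 0.42, "2613": 0.35, "3304": 0.05}
parents = roll_up(example)
```

Because several granular subjects collapse into one parent, a prediction that misses the exact subject can still land in the correct parent area.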
+---
+## ✅ Model Strengths
+- Handles **interdisciplinary** and **general science journals**.
+- Works even without container title (lower accuracy).
+- Scalable for large collections.
+---
+## ⚠️ Limitations
+- Performance relies on metadata completeness (title, abstract, container title).
+- Lower accuracy for rare subjects and when source information (container title) is missing.
+- Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
+---
+## 🔍 Example Usage
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+import json
+# Load model and tokenizer
+model_name = "your-hf-username/open-asjc-multilabel"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# Load sample input
+with open("small_example.json") as f:
+    data = json.load(f)
+text = data["title"] + " " + data.get("abstract", "")
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+# Predict
+with torch.no_grad():
+    outputs = model(**inputs)
+    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]
+# Apply threshold; look labels up by index so each name matches its logit position
+threshold = 0.3
+predicted_labels = [model.config.id2label[i] for i, prob in enumerate(probs) if prob >= threshold]
+```
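For the collection-level use case, per-document predictions can be aggregated into a subject profile. A minimal sketch, assuming `doc_probs` holds one probability vector per document (as produced by the sigmoid step above) and `labels` lists the label names in logit order; the function name and the sample values are hypothetical.

```python
from collections import Counter


def collection_profile(doc_probs, labels, threshold=0.3):
    """Return the share of documents assigned to each label."""
    counts = Counter()
    for probs in doc_probs:
        for label, prob in zip(labels, probs):
            if prob >= threshold:
                counts[label] += 1
    n_docs = len(doc_probs)
    # Labels that were never assigned are left out of the profile
    return {label: counts[label] / n_docs for label in labels if counts[label]}


labels = ["1702", "1706", "2613"]  # hypothetical label names
doc_probs = [                      # made-up probabilities for four documents
    [0.90, 0.10, 0.40],
    [0.70, 0.35, 0.05],
    [0.20, 0.05, 0.80],
    [0.95, 0.02, 0.10],
]
profile = collection_profile(doc_probs, labels)
```

Since a document can carry several labels, the shares in a profile need not sum to 1.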
+## 📖 Citation
+If you use this work, please cite:
+```bibtex
+@article{gusenbauer2025open,
+  author    = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
+  title     = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
+  journal   = {Scientometrics},
+  year      = {2025}
+}
+```