asjc-classification/scibert_multilabel_asjc_classifier

@@ -1,158 +1,159 @@
----
-license: mit
-datasets:
-- RichardErkhov/April_2023_Public_Data_File_from_Crossref
-metrics:
-- precision
-- recall
-- f1
-base_model:
-- allenai/scibert_scivocab_uncased
-pipeline_tag: text-classification
-tags:
-- scientometrics
-- asjc
-- multi-label
-task_categories:
-- text-classification
-widget:
-- text: "title={Jodometrie}, container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, abstract={}"
----
-# 🧠 Open Multi-Label ASJC Classification
-## Model Overview
-This model fine-tunes **allenai/scibert_scivocab_uncased** across **307 ASJC subject categories**, enabling document-level classification beyond traditional journal-level schemes.
-- **Task**: Multi-label classification
-- **Labels**: 307 ASJC subjects (granular level)
-- **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
-- **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
-- **License**: MIT
-- **Framework**: Hugging Face Transformers
----
-## 📚 Intended Use
-- Classify individual research documents into multiple ASJC subjects.
-- Analyze disciplinary orientation of **collections** (authors, institutions, databases).
-- Works with **title**, **abstract**, and optionally **container title** metadata.
----
-## 🛠 Training Details
-- **Preprocessing**:
-  - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
-  - Multi-hot encoding for multi-label classification.
-  - Data augmentation for underrepresented classes.
-- **Fine-tuning**:
-  - Optimizer: AdamW
-  - Loss: Binary Cross-Entropy
-  - Learning Rate: 2e-5
-  - Epochs: 1
-  - Batch Size: 16
-  - Threshold for label assignment: 0.3
----
-## 📈 Metrics
-| Input Features                    | Labels | Precision | Recall | F1-Score |
-|-----------------------------------|--------|-----------|--------|----------|
-| Title + Container Title + Abstract| 307    | 0.912     | 0.885  | 0.892    |
-| Title + Abstract                  | 307    | 0.607     | 0.503  | 0.532    |
-| Title + Container Title           | 307    | 0.949     | 0.957  | 0.952    |
-| Title only                        | 307    | 0.528     | 0.416  | 0.448    |
-For **26 parent subjects**, F1-score improves to **0.934** with full metadata.
----
-## ✅ Model Strengths
-- Handles **interdisciplinary** and **general science journals**.
-- Works even without container title (lower accuracy).
-- Scalable for large collections.
----
-## ⚠️ Limitations
-- Performance relies on metadata completeness (title, abstract, container title).
-- Lower accuracy for rare subjects and missing source info.
-- Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
----
-## 🔍 Example Usage
-```python
-from transformers import TextClassificationPipeline, pipeline
-import torch
-# --- Custom multi-label pipeline ---
-class ASJCMultiLabelPipeline(TextClassificationPipeline):
-    """
-    Multi-label classification pipeline for ASJC categories.
-    Uses a configurable threshold to return all labels with scores above the threshold.
-    """
-    def __init__(self, *args, **kwargs):
-        # Allow threshold override; default falls back to model config
-        self.threshold = kwargs.pop("threshold", None)
-        super().__init__(*args, **kwargs)
-        if self.threshold is None:
-            self.threshold = getattr(self.model.config, "threshold", 0.3)
-    def postprocess(self, model_outputs, **kwargs):
-        # Convert logits to probabilities using sigmoid
-        scores = torch.sigmoid(torch.tensor(model_outputs["logits"])).tolist()
-        results = []
-        for i, score in enumerate(scores[0]):
-            if score >= self.threshold:
-                label = self.model.config.id2label[(i)]
-                results.append({"label": label, "score": float(score)})
-        # Sort by descending score
-        results = sorted(results, key=lambda x: x["score"], reverse=True)
-        return results
-# --- Create the pipeline explicitly using the custom class ---
-pipe = pipeline(
-    task="text-classification",
-    model="asjc-classification/scibert_multilabel_asjc_classifier",
-    pipeline_class=ASJCMultiLabelPipeline
-)
-# --- Example text input ---
-text = (
-    "title={Jodometrie}, "
-    "container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, "
-    "abstract={}"
-)
-# --- Get multi-label predictions ---
-result = pipe(text)
-print(result)
-# Predicted labels:
-[
-  {'label': 'Analytical Chemistry', 'score': 0.933479368686676},
-  {'label': 'Clinical Biochemistry', 'score': 0.9108470678329468},
-  {'label': 'Biochemistry', 'score': 0.494137704372406}
-]
-# Expected labels:
-# - Clinical Biochemistry
-# - Analytical Chemistry
-```
----
-## 📖 Citation
-If you use this work, please cite:
-```bibtex
-@article{gusenbauer2025open,
-  author    = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
-  title     = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
-  journal   = {Scientometrics},
-  year      = {2025}
 }

+---
+license: mit
+datasets:
+- RichardErkhov/April_2023_Public_Data_File_from_Crossref
+metrics:
+- precision
+- recall
+- f1
+base_model:
+- allenai/scibert_scivocab_uncased
+pipeline_tag: text-classification
+tags:
+- scientometrics
+- asjc
+- multi-label
+task_categories:
+- text-classification
+widget:
+- text: "title={Jodometrie}, container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, abstract={}"
+---
+# 🧠 Open Multi-Label ASJC Classification
+## Model Overview
+This model is **allenai/scibert_scivocab_uncased** fine-tuned to assign documents to **307 ASJC subject categories**, enabling document-level classification rather than the traditional journal-level assignment.
+- **Task**: Multi-label classification
+- **Labels**: 307 ASJC subjects (granular level)
+- **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
+- **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
+- **License**: MIT
+- **Framework**: Hugging Face Transformers
+---
+## 📚 Intended Use
+- Classify individual research documents into multiple ASJC subjects.
+- Analyze disciplinary orientation of **collections** (authors, institutions, databases).
+- Works with **title**, **abstract**, and optionally **container title** metadata.
+---
+## 🛠 Training Details
+- **Preprocessing**:
+  - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
+  - Multi-hot encoding for multi-label classification.
+  - Data augmentation for underrepresented classes.
+- **Fine-tuning**:
+  - Optimizer: AdamW
+  - Loss: Binary Cross-Entropy
+  - Learning Rate: 2e-5
+  - Epochs: 1
+  - Batch Size: 16
+  - Threshold for label assignment: 0.3
+---
+## 📈 Metrics
+| Input Features                    | Labels | Precision | Recall | F1-Score |
+|-----------------------------------|--------|-----------|--------|----------|
+| Title + Container Title + Abstract| 307    | 0.912     | 0.885  | 0.892    |
+| Title + Abstract                  | 307    | 0.607     | 0.503  | 0.532    |
+| Title + Container Title           | 307    | 0.949     | 0.957  | 0.952    |
+| Title only                        | 307    | 0.528     | 0.416  | 0.448    |
+For **26 parent subjects**, F1-score improves to **0.934** with full metadata.
+---
+## ✅ Model Strengths
+- Handles **interdisciplinary** and **general science journals**.
+- Works without a container title, though at lower accuracy (F1 of 0.532 for title + abstract versus 0.892 with the container title included).
+- Scalable for large collections.
+---
+## ⚠️ Limitations
+- Performance relies on metadata completeness (title, abstract, container title).
+- Lower accuracy for rare subjects and missing source info.
+- Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
+---
+## 🔍 Example Usage
+```python
+from transformers import TextClassificationPipeline, pipeline
+import torch
+# --- Custom multi-label pipeline ---
+class ASJCMultiLabelPipeline(TextClassificationPipeline):
+    """
+    Multi-label classification pipeline for ASJC categories.
+    Uses a configurable threshold to return all labels with scores at or above the threshold.
+    """
+    def __init__(self, *args, **kwargs):
+        # Allow threshold override; default falls back to model config
+        self.threshold = kwargs.pop("threshold", None)
+        super().__init__(*args, **kwargs)
+        if self.threshold is None:
+            self.threshold = getattr(self.model.config, "threshold", 0.3)
+    def postprocess(self, model_outputs, **kwargs):
+        # Convert logits to independent per-label probabilities using sigmoid.
+        # as_tensor avoids the copy warning raised by torch.tensor() on an existing tensor.
+        logits = torch.as_tensor(model_outputs["logits"])
+        scores = torch.sigmoid(logits).tolist()
+        results = []
+        for i, score in enumerate(scores[0]):
+            if score >= self.threshold:
+                label = self.model.config.id2label[i]
+                results.append({"label": label, "score": float(score)})
+        # Sort by descending score
+        results = sorted(results, key=lambda x: x["score"], reverse=True)
+        return results
+# --- Create the pipeline explicitly using the custom class ---
+pipe = pipeline(
+    task="text-classification",
+    model="asjc-classification/scibert_multilabel_asjc_classifier",
+    pipeline_class=ASJCMultiLabelPipeline
+)
+# --- Example text input ---
+text = (
+    "title={Jodometrie}, "
+    "container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, "
+    "abstract={}"
+)
+# --- Get multi-label predictions ---
+result = pipe(text)
+print(result)
+# Predicted labels:
+# [
+#   {'label': 'Analytical Chemistry', 'score': 0.933479368686676},
+#   {'label': 'Clinical Biochemistry', 'score': 0.9108470678329468},
+#   {'label': 'Biochemistry', 'score': 0.494137704372406}
+# ]
+# Expected labels:
+# - Clinical Biochemistry
+# - Analytical Chemistry
+```
+---
+## 📖 Citation
+If you use this work, please cite:
+```bibtex
+@article{gusenbauer2025open,
+  author    = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
+  title     = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
+  journal   = {Scientometrics},
+  year      = {2025},
+  doi       = {10.1007/s11192-025-05490-0}
 }