asjc-classification/scibert_multilabel_asjc_classifier

@@ -1,161 +1,160 @@
----
-license: mit
-datasets:
-- RichardErkhov/April_2023_Public_Data_File_from_Crossref
-metrics:
-- precision
-- recall
-- f1
-base_model:
-- allenai/scibert_scivocab_uncased
-pipeline_tag: text-classification
-tags:
-- scientometrics
-- asjc
-- multi-label
-task_categories:
-- text-classification
-widget:
-- text: "title={Jodometrie}, container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, abstract={}"
----
-# 🧠 Open Multi-Label ASJC Classification
-## Model Overview
-This model fine-tunes **allenai/scibert_scivocab_uncased** across **307 ASJC subject categories**, enabling document-level classification beyond traditional journal-level schemes.
-- **Task**: Multi-label classification
-- **Labels**: 307 ASJC subjects (granular level)
-- **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
-- **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
-- **License**: MIT
-- **Framework**: Hugging Face Transformers
----
-## 📚 Intended Use
-- Classify individual research documents into multiple ASJC subjects.
-- Analyze disciplinary orientation of **collections** (authors, institutions, databases).
-- Works with **title**, **abstract**, and optionally **container title** metadata.
----
-## 🛠 Training Details
-- **Preprocessing**:
-  - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
-  - Multi-hot encoding for multi-label classification.
-  - Data augmentation for underrepresented classes.
-- **Fine-tuning**:
-  - Optimizer: AdamW
-  - Loss: Binary Cross-Entropy
-  - Learning Rate: 2e-5
-  - Epochs: 1
-  - Batch Size: 16
-  - Threshold for label assignment: 0.3
----
-## 📈 Metrics
-| Input Features                    | Labels | Precision | Recall | F1-Score |
-|-----------------------------------|--------|-----------|--------|----------|
-| Title + Container Title + Abstract| 307    | 0.912     | 0.885  | 0.892    |
-| Title + Abstract                  | 307    | 0.607     | 0.503  | 0.532    |
-| Title + Container Title           | 307    | 0.949     | 0.957  | 0.952    |
-| Title only                        | 307    | 0.528     | 0.416  | 0.448    |
-For **26 parent subjects**, F1-score improves to **0.934** with full metadata.
----
-## ✅ Model Strengths
-- Handles **interdisciplinary** and **general science journals**.
-- Works even without container title (lower accuracy).
-- Scalable for large collections.
----
-## ⚠️ Limitations
-- Performance relies on metadata completeness (title, abstract, container title).
-- Lower accuracy for rare subjects and missing source info.
-- Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
----
-## 🔍 Example Usage
-```python
-from transformers import TextClassificationPipeline, pipeline
-import torch
-# --- Custom multi-label pipeline ---
-class ASJCMultiLabelPipeline(TextClassificationPipeline):
-    """
-    Multi-label classification pipeline for ASJC categories.
-    Uses a configurable threshold to return all labels with scores above the threshold.
-    """
-    def __init__(self, *args, **kwargs):
-        # Allow threshold override; default falls back to model config
-        self.threshold = kwargs.pop("threshold", None)
-        super().__init__(*args, **kwargs)
-        if self.threshold is None:
-            self.threshold = getattr(self.model.config, "threshold", 0.3)
-    def postprocess(self, model_outputs, **kwargs):
-        # Convert logits to probabilities using sigmoid
-        scores = torch.sigmoid(torch.tensor(model_outputs["logits"])).tolist()
-        results = []
-        for i, score in enumerate(scores[0]):
-            if score >= self.threshold:
-                label = self.model.config.id2label[(i)]
-                results.append({"label": label, "score": float(score)})
-        # Sort by descending score
-        results = sorted(results, key=lambda x: x["score"], reverse=True)
-        return results
-# --- Create the pipeline explicitly using the custom class ---
-pipe = pipeline(
-    task="text-classification",
-    model="asjc-classification/scibert_multilabel_asjc_classifier",
-    pipeline_class=ASJCMultiLabelPipeline
-)
-# --- Example text input ---
-text = (
-    "title={Jodometrie}, "
-    "container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, "
-    "abstract={}"
-)
-# --- Get multi-label predictions ---
-result = pipe(text)
-print(result)
-# Predicted labels:
-[
-  {'label': 'Analytical Chemistry', 'score': 0.933479368686676},
-  {'label': 'Clinical Biochemistry', 'score': 0.9108470678329468},
-  {'label': 'Biochemistry', 'score': 0.494137704372406}
-]
-# Expected labels:
-# - Clinical Biochemistry
-# - Analytical Chemistry
-```
----
-## 📖 Citation
-If you use this work, please cite:
-```bibtex
-@article{gusenbauer2025asjc,
-  author    = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
-  title     = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
-  journal   = {Scientometrics},
-  year      = {2025},
-  doi       = {10.1007/s11192-025-05490-0},
-  issn      = {0138-9130},
-  keywords  = {All Science Journal Classification;Disciplinary coverage;Fine-tuning;multi-label}
 }

+---
+license: mit
+datasets:
+- RichardErkhov/April_2023_Public_Data_File_from_Crossref
+metrics:
+- precision
+- recall
+- f1
+base_model:
+- allenai/scibert_scivocab_uncased
+pipeline_tag: text-classification
+tags:
+- scientometrics
+- asjc
+- multi-label
+task_categories:
+- text-classification
+widget:
+- text: "title={Jodometrie}, container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, abstract={}"
+---
+# 🧠 Open Multi-Label ASJC Classification
+## Model Overview
+This model is **allenai/scibert_scivocab_uncased** fine-tuned to assign documents to **307 ASJC subject categories**, enabling document-level classification rather than relying on traditional journal-level schemes.
+- **Task**: Multi-label classification
+- **Labels**: 307 ASJC subjects (granular level)
+- **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
+- **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
+- **License**: MIT
+- **Framework**: Hugging Face Transformers
+---
+## 📚 Intended Use
+- Classify individual research documents into multiple ASJC subjects.
+- Analyze disciplinary orientation of **collections** (authors, institutions, databases).
+- Works with **title**, **abstract**, and optionally **container title** metadata.
+---
+## 🛠 Training Details
+- **Preprocessing**:
+  - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
+  - Multi-hot encoding for multi-label classification.
+  - Data augmentation for underrepresented classes.
+- **Fine-tuning**:
+  - Optimizer: AdamW
+  - Loss: Binary Cross-Entropy
+  - Learning Rate: 2e-5
+  - Epochs: 1
+  - Batch Size: 16
+  - Threshold for label assignment: 0.3
+---
+## 📈 Metrics
+| Input Features                    | Labels | Precision | Recall | F1-Score |
+|-----------------------------------|--------|-----------|--------|----------|
+| Title + Container Title + Abstract| 307    | 0.912     | 0.885  | 0.892    |
+| Title + Abstract                  | 307    | 0.607     | 0.503  | 0.532    |
+| Title + Container Title           | 307    | 0.949     | 0.957  | 0.952    |
+| Title only                        | 307    | 0.528     | 0.416  | 0.448    |
+For **26 parent subjects**, F1-score improves to **0.934** with full metadata and **0.694** with Title + Abstract.
+---
+## ✅ Model Strengths
+- Handles **interdisciplinary** and **general science journals**.
+- Works without a container title, at reduced accuracy (F1 of 0.532 with Title + Abstract versus 0.892 with full metadata).
+- Scalable for large collections.
+---
+## ⚠️ Limitations
+- Performance relies on metadata completeness (title, abstract, container title).
+- Lower accuracy for rare subjects and missing source info.
+- Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
+---
+## 🔍 Example Usage
+```python
+from transformers import TextClassificationPipeline, pipeline
+import torch
+# --- Custom multi-label pipeline ---
+class ASJCMultiLabelPipeline(TextClassificationPipeline):
+    """
+    Multi-label classification pipeline for ASJC categories.
+    Uses a configurable threshold and returns every label whose score is at or above it.
+    """
+    def __init__(self, *args, **kwargs):
+        # Allow threshold override; default falls back to model config
+        self.threshold = kwargs.pop("threshold", None)
+        super().__init__(*args, **kwargs)
+        if self.threshold is None:
+            self.threshold = getattr(self.model.config, "threshold", 0.3)
+    def postprocess(self, model_outputs, **kwargs):
+        # Convert logits to probabilities using sigmoid
+        # as_tensor avoids re-wrapping (and the resulting warning) when logits are already a tensor
+        scores = torch.sigmoid(torch.as_tensor(model_outputs["logits"])).tolist()
+        results = []
+        for i, score in enumerate(scores[0]):
+            if score >= self.threshold:
+                label = self.model.config.id2label[i]
+                results.append({"label": label, "score": float(score)})
+        # Sort by descending score
+        results = sorted(results, key=lambda x: x["score"], reverse=True)
+        return results
+# --- Create the pipeline explicitly using the custom class ---
+pipe = pipeline(
+    task="text-classification",
+    model="asjc-classification/scibert_multilabel_asjc_classifier",
+    pipeline_class=ASJCMultiLabelPipeline
+)
+# --- Example text input ---
+text = (
+    "title={Dose optimization of β-lactams antibiotics in pediatrics and adults: A systematic review}, "
+    "container_title={Frontiers in Pharmacology}, "
+    "abstract={Background: β-lactams remain the cornerstone of the empirical therapy to treat various bacterial infections. This systematic review aimed to analyze the data describing the dosing regimen of β-lactams.Methods: Systematic scientific and grey literature was performed in accordance with Preferred Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines. The studies were retrieved and screened on the basis of pre-defined exclusion and inclusion criteria. The cohort studies, randomized controlled trials (RCT) and case reports that reported the dosing schedule of β-lactams are included in this study.Results: A total of 52 studies met the inclusion criteria, of which 40 were cohort studies, 2 were case reports and 10 were RCTs. The majority of the studies (34/52) studied the pharmacokinetic (PK) parameters of a drug. A total of 20 studies proposed dosing schedule in pediatrics while 32 studies proposed dosing regimen among adults. Piperacillin (12/52) and Meropenem (11/52) were the most commonly used β-lactams used in hospitalized patients. As per available evidence, continuous infusion is considered as the most appropriate mode of administration to optimize the safety and efficacy of the treatment and improve the clinical outcomes.Conclusion: Appropriate antibiotic therapy is challenging due to pathophysiological changes among different age groups. The optimization of pharmacokinetic/pharmacodynamic parameters is useful to support alternative dosing regimens such as an increase in dosing interval, continuous infusion, and increased bolus doses.}"
+)
+# --- Get multi-label predictions ---
+result = pipe(text)
+print(result)
+# Predicted labels:
+# [
+#   {'label': 'Pharmacology (medical)', 'score': 0.9922493696212769},
+#   {'label': 'Pharmacology', 'score': 0.902540922164917}
+# ]
+# Expected labels:
+# - Pharmacology (medical)
+# - Pharmacology
+```
+---
+## 📖 Citation
+If you use this work, please cite:
+```bibtex
+@article{Gusenbauer.2025,
+author = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
+year = {2025},
+title = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
+keywords = {All Science Journal Classification;Disciplinary coverage;Fine-tuning;multi-label classification;SciBERT;Transformer-based language models},
+issn = {0138-9130},
+journal = {Scientometrics},
+doi = {10.1007/s11192-025-05490-0},
 }
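
The usage example in the README passes the model a single string of the form `title={...}, container_title={...}, abstract={...}`. The sketch below shows one way to assemble that string from separate metadata fields. The helper name `format_asjc_input` is hypothetical, and the template is only inferred from the widget text and the usage example; the model card does not document it as a formal specification. Rendering missing fields as empty braces is likewise an assumption based on the `abstract={}` case.

```python
from typing import Optional, Sequence


def format_asjc_input(
    title: str,
    container_titles: Optional[Sequence[str]] = None,
    abstract: Optional[str] = None,
) -> str:
    """Build a single input string in the style of the README examples.

    Assumption: the template is inferred from the widget text; missing
    fields are rendered as empty braces, as in the `abstract={}` example.
    """
    # Several container titles are joined with ", ", as in the widget example.
    container = ", ".join(container_titles or [])
    return (
        f"title={{{title}}}, "
        f"container_title={{{container}}}, "
        f"abstract={{{abstract or ''}}}"
    )


text = format_asjc_input(
    title="Jodometrie",
    container_titles=[
        "Fresenius' Zeitschrift für analytische Chemie",
        "Zeitschrift für analytische Chemie",
    ],
)
print(text)
```

The resulting `text` can be passed to `pipe(text)` from the usage example. How inputs without a container title were formatted during evaluation is not stated in the model card, so the empty-braces fallback may not reproduce the reported Title + Abstract scores.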