---
license: mit
datasets:
  - RichardErkhov/April_2023_Public_Data_File_from_Crossref
metrics:
  - precision
  - recall
  - f1
base_model:
  - allenai/scibert_scivocab_uncased
pipeline_tag: text-classification
tags:
  - scientometrics
  - asjc
  - multi-label
task_categories:
  - text-classification
widget:
  - text: >-
      title={Jodometrie}, container_title={Fresenius' Zeitschrift für
      analytische Chemie, Zeitschrift für analytische Chemie}, abstract={}
---

🧠 Open Multi-Label ASJC Classification

Model Overview

This model fine-tunes allenai/scibert_scivocab_uncased for multi-label classification across 307 ASJC (All Science Journal Classification) subject categories, enabling document-level classification beyond traditional journal-level schemes.

  • Task: Multi-label classification
  • Labels: 307 ASJC subjects (granular level)
  • Base Model: SciBERT
  • Training Data: Crossref April 2023 public data file (titles, abstracts, container titles)
  • License: MIT
  • Framework: Hugging Face Transformers

📚 Intended Use

  • Classify individual research documents into multiple ASJC subjects.
  • Analyze disciplinary orientation of collections (authors, institutions, databases).
  • Works with title, abstract, and optionally container title metadata.
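The model expects a single input string that concatenates the metadata fields in the `title={...}, container_title={...}, abstract={...}` form shown in the widget example. A minimal helper for building that string (the function name is illustrative, not part of the released code):

```python
def build_input(title: str, container_title: str = "", abstract: str = "") -> str:
    """Format document metadata the way the widget example does:
    title={...}, container_title={...}, abstract={...}
    Missing fields are left as empty braces."""
    return (
        f"title={{{title}}}, "
        f"container_title={{{container_title}}}, "
        f"abstract={{{abstract}}}"
    )

text = build_input("Jodometrie", "Fresenius' Zeitschrift für analytische Chemie")
# -> "title={Jodometrie}, container_title={Fresenius' Zeitschrift für analytische Chemie}, abstract={}"
```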

🛠 Training Details

  • Preprocessing:
    • Removed the “multidisciplinary” category and the 26 “miscellaneous” categories → 307 subjects.
    • Multi-hot encoding for multi-label classification.
    • Data augmentation for underrepresented classes.
  • Fine-tuning:
    • Optimizer: AdamW
    • Loss: Binary Cross-Entropy
    • Learning Rate: 2e-5
    • Epochs: 1
    • Batch Size: 16
    • Threshold for label assignment: 0.3
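The multi-hot encoding and binary cross-entropy objective above can be sketched as follows (a minimal, dependency-free illustration; the label ids and zero logits are placeholders, not the actual training code):

```python
import math

NUM_LABELS = 307
# Illustrative excerpt of a label-to-index mapping; real ids come from the model config
label2id = {"Analytical Chemistry": 0, "Clinical Biochemistry": 1}

def multi_hot(labels):
    """Encode a document's subject list as a 307-dim multi-hot target vector."""
    target = [0.0] * NUM_LABELS
    for name in labels:
        target[label2id[name]] = 1.0
    return target

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent per-label sigmoids,
    the loss used for multi-label fine-tuning."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(logits)

target = multi_hot(["Analytical Chemistry", "Clinical Biochemistry"])
loss = bce_with_logits([0.0] * NUM_LABELS, target)  # zero logits -> p = 0.5 everywhere
```

Each label gets its own sigmoid, so a document can activate any number of the 307 subjects independently.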

📈 Metrics

Input Features                       Labels   Precision   Recall   F1-Score
Title + Container Title + Abstract   307      0.912       0.885    0.892
Title + Abstract                     307      0.607       0.503    0.532
Title + Container Title              307      0.949       0.957    0.952
Title only                           307      0.528       0.416    0.448

For the 26 parent subjects, the F1-score improves to 0.934 with full metadata.


✅ Model Strengths

  • Handles interdisciplinary and general science journals.
  • Works even without a container title, though at lower accuracy (see metrics above).
  • Scalable for large collections.

⚠️ Limitations

  • Performance relies on metadata completeness (title, abstract, container title).
  • Lower accuracy for rare subjects and missing source info.
  • Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).

🔍 Example Usage

from transformers import TextClassificationPipeline, pipeline
import torch

# --- Custom multi-label pipeline ---
class ASJCMultiLabelPipeline(TextClassificationPipeline):
    """
    Multi-label classification pipeline for ASJC categories.
    Uses a configurable threshold to return all labels with scores above the threshold.
    """
    def __init__(self, *args, **kwargs):
        # Allow threshold override; default falls back to model config
        self.threshold = kwargs.pop("threshold", None)
        super().__init__(*args, **kwargs)
        if self.threshold is None:
            self.threshold = getattr(self.model.config, "threshold", 0.3)

    def postprocess(self, model_outputs, **kwargs):
        # Convert logits to independent per-label probabilities with a sigmoid
        scores = torch.sigmoid(model_outputs["logits"])[0].tolist()

        # Keep every label whose probability clears the threshold
        results = [
            {"label": self.model.config.id2label[i], "score": float(score)}
            for i, score in enumerate(scores)
            if score >= self.threshold
        ]

        # Sort by descending score
        return sorted(results, key=lambda x: x["score"], reverse=True)

# --- Create the pipeline explicitly using the custom class ---
pipe = pipeline(
    task="text-classification",
    model="asjc-classification/scibert_multilabel_asjc_classifier",
    pipeline_class=ASJCMultiLabelPipeline
)

# --- Example text input ---
text = (
    "title={Jodometrie}, "
    "container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, "
    "abstract={}"
)

# --- Get multi-label predictions ---
result = pipe(text)
print(result)

# Predicted labels:
# [{'label': 'Analytical Chemistry', 'score': 0.933479368686676},
#  {'label': 'Clinical Biochemistry', 'score': 0.9108470678329468},
#  {'label': 'Biochemistry', 'score': 0.494137704372406}]

# Expected labels:
# - Clinical Biochemistry
# - Analytical Chemistry

📖 Citation

If you use this work, please cite:

@article{gusenbauer2025open,
  author    = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
  title     = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
  journal   = {Scientometrics},
  year      = {2025}
}