---
license: mit
datasets:
  - RichardErkhov/April_2023_Public_Data_File_from_Crossref
metrics:
  - precision
  - recall
  - f1
base_model:
  - allenai/scibert_scivocab_uncased
pipeline_tag: text-classification
tags:
  - scientometrics
  - asjc
  - multi-label
task_categories:
  - text-classification
widget:
  - text: >-
      title={Jodometrie}, container_title={Fresenius' Zeitschrift für
      analytische Chemie, Zeitschrift für analytische Chemie}, abstract={}
---

🧠 Open Multi-Label ASJC Classification

Model Overview

This model fine-tunes allenai/scibert_scivocab_uncased for multi-label classification across 307 ASJC (All Science Journal Classification) subject categories, enabling document-level classification beyond traditional journal-level schemes.

  • Task: Multi-label classification
  • Labels: 307 ASJC subjects (granular level)
  • Base Model: SciBERT
  • Training Data: Crossref April 2023 public data file (titles, abstracts, container titles)
  • License: MIT
  • Framework: Hugging Face Transformers

📚 Intended Use

  • Classify individual research documents into multiple ASJC subjects.
  • Analyze disciplinary orientation of collections (authors, institutions, databases).
  • Works with title, abstract, and optionally container title metadata.
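The model expects a single input string that concatenates the metadata fields in the `title={...}, container_title={...}, abstract={...}` form shown in the widget example. A minimal helper for building that string (the function name is illustrative, not part of the released code):

```python
def build_input(title: str, container_title: str = "", abstract: str = "") -> str:
    """Format document metadata the way the widget example does:
    title={...}, container_title={...}, abstract={...}
    Missing fields are left as empty braces."""
    return (
        f"title={{{title}}}, "
        f"container_title={{{container_title}}}, "
        f"abstract={{{abstract}}}"
    )

text = build_input("Jodometrie", "Fresenius' Zeitschrift für analytische Chemie")
# -> "title={Jodometrie}, container_title={Fresenius' Zeitschrift für analytische Chemie}, abstract={}"
```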

🛠 Training Details

  • Preprocessing:
    • Removed the “multidisciplinary” category and the 26 “miscellaneous” categories → 307 subjects.
    • Multi-hot encoding for multi-label classification.
    • Data augmentation for underrepresented classes.
  • Fine-tuning:
    • Optimizer: AdamW
    • Loss: Binary Cross-Entropy
    • Learning Rate: 2e-5
    • Epochs: 1
    • Batch Size: 16
    • Threshold for label assignment: 0.3
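The multi-hot encoding and binary cross-entropy objective above can be sketched as follows (a minimal, dependency-free illustration; the label ids and zero logits are placeholders, not the actual training code):

```python
import math

NUM_LABELS = 307
# Illustrative excerpt of a label-to-index mapping; real ids come from the model config
label2id = {"Analytical Chemistry": 0, "Clinical Biochemistry": 1}

def multi_hot(labels):
    """Encode a document's subject list as a 307-dim multi-hot target vector."""
    target = [0.0] * NUM_LABELS
    for name in labels:
        target[label2id[name]] = 1.0
    return target

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent per-label sigmoids,
    the loss used for multi-label fine-tuning."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(logits)

target = multi_hot(["Analytical Chemistry", "Clinical Biochemistry"])
loss = bce_with_logits([0.0] * NUM_LABELS, target)  # zero logits -> p = 0.5 everywhere
```

Each label gets its own sigmoid, so a document can activate any number of the 307 subjects independently.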

📈 Metrics

Input Features                       Labels   Precision   Recall   F1-Score
Title + Container Title + Abstract   307      0.912       0.885    0.892
Title + Abstract                     307      0.607       0.503    0.532
Title + Container Title              307      0.949       0.957    0.952
Title only                           307      0.528       0.416    0.448

For the 26 parent subjects, the F1-score improves to 0.934 with full metadata.


✅ Model Strengths

  • Handles interdisciplinary and general science journals.
  • Works even without a container title, though at lower accuracy (see metrics above).
  • Scalable for large collections.

⚠️ Limitations

  • Performance relies on metadata completeness (title, abstract, container title).
  • Lower accuracy for rare subjects and missing source info.
  • Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).

🔍 Example Usage

from transformers import TextClassificationPipeline, pipeline
import torch

# --- Custom multi-label pipeline ---
class ASJCMultiLabelPipeline(TextClassificationPipeline):
    """
    Multi-label classification pipeline for ASJC categories.
    Uses a configurable threshold to return all labels with scores above the threshold.
    """
    def __init__(self, *args, **kwargs):
        # Allow threshold override; default falls back to model config
        self.threshold = kwargs.pop("threshold", None)
        super().__init__(*args, **kwargs)
        if self.threshold is None:
            self.threshold = getattr(self.model.config, "threshold", 0.3)

    def postprocess(self, model_outputs, **kwargs):
        # Convert logits to independent per-label probabilities with a sigmoid
        scores = torch.sigmoid(model_outputs["logits"])[0].tolist()

        # Keep every label whose probability clears the threshold
        results = [
            {"label": self.model.config.id2label[i], "score": float(score)}
            for i, score in enumerate(scores)
            if score >= self.threshold
        ]

        # Sort by descending score
        return sorted(results, key=lambda x: x["score"], reverse=True)

# --- Create the pipeline explicitly using the custom class ---
pipe = pipeline(
    task="text-classification",
    model="asjc-classification/scibert_multilabel_asjc_classifier",
    pipeline_class=ASJCMultiLabelPipeline
)

# --- Example text input ---
text = (
    "title={Jodometrie}, "
    "container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, "
    "abstract={}"
)

# --- Get multi-label predictions ---
result = pipe(text)
print(result)

# Predicted labels:
# [{'label': 'Analytical Chemistry', 'score': 0.933479368686676},
#  {'label': 'Clinical Biochemistry', 'score': 0.9108470678329468},
#  {'label': 'Biochemistry', 'score': 0.494137704372406}]

# Expected labels:
# - Clinical Biochemistry
# - Analytical Chemistry

📖 Citation

If you use this work, please cite:

@article{gusenbauer2025open,
  author    = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
  title     = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
  journal   = {Scientometrics},
  year      = {2025}
}