grano1 committed
Commit 443ac06 · verified · 1 Parent(s): 68bf154

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +159 -160
README.md CHANGED
@@ -1,161 +1,160 @@
- ---
- license: mit
- datasets:
- - RichardErkhov/April_2023_Public_Data_File_from_Crossref
- metrics:
- - precision
- - recall
- - f1
- base_model:
- - allenai/scibert_scivocab_uncased
- pipeline_tag: text-classification
- tags:
- - scientometrics
- - asjc
- - multi-label
- task_categories:
- - text-classification
- widget:
- - text: "title={Jodometrie}, container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, abstract={}"
-
- ---
- # 🧠 Open Multi-Label ASJC Classification
-
- ## Model Overview
- This model fine-tunes **allenai/scibert_scivocab_uncased** across **307 ASJC subject categories**, enabling document-level classification beyond traditional journal-level schemes.
-
- - **Task**: Multi-label classification
- - **Labels**: 307 ASJC subjects (granular level)
- - **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
- - **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
- - **License**: MIT
- - **Framework**: Hugging Face Transformers
-
- ---
-
- ## 📚 Intended Use
- - Classify individual research documents into multiple ASJC subjects.
- - Analyze disciplinary orientation of **collections** (authors, institutions, databases).
- - Works with **title**, **abstract**, and optionally **container title** metadata.
-
- ---
-
- ## 🛠 Training Details
- - **Preprocessing**:
-   - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
-   - Multi-hot encoding for multi-label classification.
-   - Data augmentation for underrepresented classes.
- - **Fine-tuning**:
-   - Optimizer: AdamW
-   - Loss: Binary Cross-Entropy
-   - Learning Rate: 2e-5
-   - Epochs: 1
-   - Batch Size: 16
-   - Threshold for label assignment: 0.3
-
- ---
-
- ## 📈 Metrics
- | Input Features | Labels | Precision | Recall | F1-Score |
- |-----------------------------------|--------|-----------|--------|----------|
- | Title + Container Title + Abstract| 307 | 0.912 | 0.885 | 0.892 |
- | Title + Abstract | 307 | 0.607 | 0.503 | 0.532 |
- | Title + Container Title | 307 | 0.949 | 0.957 | 0.952 |
- | Title only | 307 | 0.528 | 0.416 | 0.448 |
-
- For **26 parent subjects**, F1-score improves to **0.934** with full metadata.
-
- ---
-
- ## ✅ Model Strengths
- - Handles **interdisciplinary** and **general science journals**.
- - Works even without container title (lower accuracy).
- - Scalable for large collections.
-
- ---
-
- ## ⚠️ Limitations
- - Performance relies on metadata completeness (title, abstract, container title).
- - Lower accuracy for rare subjects and missing source info.
- - Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
-
- ---
-
- ## 🔍 Example Usage
-
- ```python
- from transformers import TextClassificationPipeline, pipeline
- import torch
-
- # --- Custom multi-label pipeline ---
- class ASJCMultiLabelPipeline(TextClassificationPipeline):
-     """
-     Multi-label classification pipeline for ASJC categories.
-     Uses a configurable threshold to return all labels with scores above the threshold.
-     """
-     def __init__(self, *args, **kwargs):
-         # Allow threshold override; default falls back to model config
-         self.threshold = kwargs.pop("threshold", None)
-         super().__init__(*args, **kwargs)
-         if self.threshold is None:
-             self.threshold = getattr(self.model.config, "threshold", 0.3)
-
-     def postprocess(self, model_outputs, **kwargs):
-         # Convert logits to probabilities using sigmoid
-         scores = torch.sigmoid(torch.tensor(model_outputs["logits"])).tolist()
-
-         results = []
-         for i, score in enumerate(scores[0]):
-             if score >= self.threshold:
-                 label = self.model.config.id2label[(i)]
-                 results.append({"label": label, "score": float(score)})
-
-         # Sort by descending score
-         results = sorted(results, key=lambda x: x["score"], reverse=True)
-         return results
-
- # --- Create the pipeline explicitly using the custom class ---
- pipe = pipeline(
-     task="text-classification",
-     model="asjc-classification/scibert_multilabel_asjc_classifier",
-     pipeline_class=ASJCMultiLabelPipeline
- )
-
- # --- Example text input ---
- text = (
-     "title={Jodometrie}, "
-     "container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, "
-     "abstract={}"
- )
-
- # --- Get multi-label predictions ---
- result = pipe(text)
- print(result)
-
- # Predicted labels:
- [
-     {'label': 'Analytical Chemistry', 'score': 0.933479368686676},
-     {'label': 'Clinical Biochemistry', 'score': 0.9108470678329468},
-     {'label': 'Biochemistry', 'score': 0.494137704372406}
- ]
-
- # Expected labels:
- # - Clinical Biochemistry
- # - Analytical Chemistry
- ```
-
- ---
-
- ## 📖 Citation
- If you use this work, please cite:
-
- ```bibtex
- @article{gusenbauer2025asjc,
- author = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
- title = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
- journal = {Scientometrics},
- year = {2025},
- doi = {10.1007/s11192-025-05490-0},
- issn = {0138-9130},
- keywords = {All Science Journal Classification;Disciplinary coverage;Fine-tuning;multi-label}
+ ---
+ license: mit
+ datasets:
+ - RichardErkhov/April_2023_Public_Data_File_from_Crossref
+ metrics:
+ - precision
+ - recall
+ - f1
+ base_model:
+ - allenai/scibert_scivocab_uncased
+ pipeline_tag: text-classification
+ tags:
+ - scientometrics
+ - asjc
+ - multi-label
+ task_categories:
+ - text-classification
+ widget:
+ - text: "title={Jodometrie}, container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, abstract={}"
+
+ ---
+ # 🧠 Open Multi-Label ASJC Classification
+
+ ## Model Overview
+ This model fine-tunes **allenai/scibert_scivocab_uncased** across **307 ASJC subject categories**, enabling document-level classification beyond traditional journal-level schemes.
+
+ - **Task**: Multi-label classification
+ - **Labels**: 307 ASJC subjects (granular level)
+ - **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
+ - **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
+ - **License**: MIT
+ - **Framework**: Hugging Face Transformers
+
+ ---
+
+ ## 📚 Intended Use
+ - Classify individual research documents into multiple ASJC subjects.
+ - Analyze disciplinary orientation of **collections** (authors, institutions, databases).
+ - Works with **title**, **abstract**, and optionally **container title** metadata.
+
+ ---
+
+ ## 🛠 Training Details
+ - **Preprocessing**:
+   - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
+   - Multi-hot encoding for multi-label classification.
+   - Data augmentation for underrepresented classes.
+ - **Fine-tuning**:
+   - Optimizer: AdamW
+   - Loss: Binary Cross-Entropy
+   - Learning Rate: 2e-5
+   - Epochs: 1
+   - Batch Size: 16
+   - Threshold for label assignment: 0.3
+
+ ---
+
+ ## 📈 Metrics
+ | Input Features | Labels | Precision | Recall | F1-Score |
+ |-----------------------------------|--------|-----------|--------|----------|
+ | Title + Container Title + Abstract| 307 | 0.912 | 0.885 | 0.892 |
+ | Title + Abstract | 307 | 0.607 | 0.503 | 0.532 |
+ | Title + Container Title | 307 | 0.949 | 0.957 | 0.952 |
+ | Title only | 307 | 0.528 | 0.416 | 0.448 |
+
+ For **26 parent subjects**, F1-score improves to **0.934** with full metadata and **0.694** with Title + Abstract.
+
+ ---
+
+ ## ✅ Model Strengths
+ - Handles **interdisciplinary** and **general science journals**.
+ - Works even without container title (lower accuracy).
+ - Scalable for large collections.
+
+ ---
+
+ ## ⚠️ Limitations
+ - Performance relies on metadata completeness (title, abstract, container title).
+ - Lower accuracy for rare subjects and missing source info.
+ - Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
+
+ ---
+
+ ## 🔍 Example Usage
+
+ ```python
+ from transformers import TextClassificationPipeline, pipeline
+ import torch
+
+ # --- Custom multi-label pipeline ---
+ class ASJCMultiLabelPipeline(TextClassificationPipeline):
+     """
+     Multi-label classification pipeline for ASJC categories.
+     Uses a configurable threshold to return all labels with scores above the threshold.
+     """
+     def __init__(self, *args, **kwargs):
+         # Allow threshold override; default falls back to model config
+         self.threshold = kwargs.pop("threshold", None)
+         super().__init__(*args, **kwargs)
+         if self.threshold is None:
+             self.threshold = getattr(self.model.config, "threshold", 0.3)
+
+     def postprocess(self, model_outputs, **kwargs):
+         # Convert logits to probabilities using sigmoid
+         scores = torch.sigmoid(torch.tensor(model_outputs["logits"])).tolist()
+
+         results = []
+         for i, score in enumerate(scores[0]):
+             if score >= self.threshold:
+                 label = self.model.config.id2label[(i)]
+                 results.append({"label": label, "score": float(score)})
+
+         # Sort by descending score
+         results = sorted(results, key=lambda x: x["score"], reverse=True)
+         return results
+
+ # --- Create the pipeline explicitly using the custom class ---
+ pipe = pipeline(
+     task="text-classification",
+     model="asjc-classification/scibert_multilabel_asjc_classifier",
+     pipeline_class=ASJCMultiLabelPipeline
+ )
+
+ # --- Example text input ---
+ text = (
+     "title={Dose optimization of β-lactams antibiotics in pediatrics and adults: A systematic review}, "
+     "container_title={Frontiers in Pharmacology}, "
+     "abstract={Background: β-lactams remain the cornerstone of the empirical therapy to treat various bacterial infections. This systematic review aimed to analyze the data describing the dosing regimen of β-lactams.Methods: Systematic scientific and grey literature was performed in accordance with Preferred Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines. The studies were retrieved and screened on the basis of pre-defined exclusion and inclusion criteria. The cohort studies, randomized controlled trials (RCT) and case reports that reported the dosing schedule of β-lactams are included in this study.Results: A total of 52 studies met the inclusion criteria, of which 40 were cohort studies, 2 were case reports and 10 were RCTs. The majority of the studies (34/52) studied the pharmacokinetic (PK) parameters of a drug. A total of 20 studies proposed dosing schedule in pediatrics while 32 studies proposed dosing regimen among adults. Piperacillin (12/52) and Meropenem (11/52) were the most commonly used β-lactams used in hospitalized patients. As per available evidence, continuous infusion is considered as the most appropriate mode of administration to optimize the safety and efficacy of the treatment and improve the clinical outcomes.Conclusion: Appropriate antibiotic therapy is challenging due to pathophysiological changes among different age groups. The optimization of pharmacokinetic/pharmacodynamic parameters is useful to support alternative dosing regimens such as an increase in dosing interval, continuous infusion, and increased bolus doses.}"
+ )
+
+ # --- Get multi-label predictions ---
+ result = pipe(text)
+ print(result)
+
+ # Predicted labels:
+ # [
+ # {'label': 'Pharmacology (medical)', 'score': 0.9922493696212769},
+ # {'label': 'Pharmacology', 'score': 0.902540922164917}
+ # ]
+
+ # Expected labels:
+ # - Pharmacology (medical)
+ # - Pharmacology
+ ```
+
+ ---
+
+ ## 📖 Citation
+ If you use this work, please cite:
+
+ ```bibtex
+ @article{Gusenbauer.2025,
+ author = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
+ year = {2025},
+ title = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
+ keywords = {All Science Journal Classification;Disciplinary coverage;Fine-tuning;multi-label classification;SciBERT;Transformer-based language models},
+ issn = {0138-9130},
+ journal = {Scientometrics},
+ doi = {10.1007/s11192-025-05490-0},
  }
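
The README in this commit relies on three conventions: the `title={...}, container_title={...}, abstract={...}` input string, multi-hot targets, and a sigmoid threshold of 0.3 for label assignment. The following is a minimal, model-free sketch of those conventions, assuming nothing beyond what the README states. The helper names (`format_input`, `multi_hot`, `select_labels`) and the tiny three-label vocabulary are hypothetical and are not part of the repository.

```python
import math


def format_input(title, container_title="", abstract=""):
    """Build the input string in the layout used by the README example.

    Hypothetical helper: the layout is copied from the README's example text.
    """
    return (
        f"title={{{title}}}, "
        f"container_title={{{container_title}}}, "
        f"abstract={{{abstract}}}"
    )


def multi_hot(labels, label2id):
    """Encode a set of label names as a 0/1 vector (multi-hot encoding)."""
    vector = [0] * len(label2id)
    for name in labels:
        vector[label2id[name]] = 1
    return vector


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def select_labels(logits, id2label, threshold=0.3):
    """Keep every label whose sigmoid score reaches the threshold,
    sorted by descending score (same idea as the README's postprocess)."""
    scored = [
        {"label": id2label[i], "score": sigmoid(logit)}
        for i, logit in enumerate(logits)
    ]
    kept = [item for item in scored if item["score"] >= threshold]
    return sorted(kept, key=lambda item: item["score"], reverse=True)


# Illustrative three-label vocabulary (the real model has 307 labels).
id2label = {0: "Pharmacology", 1: "Analytical Chemistry", 2: "Biochemistry"}
label2id = {name: idx for idx, name in id2label.items()}

text = format_input("Jodometrie", "Zeitschrift für analytische Chemie")
target = multi_hot(["Analytical Chemistry"], label2id)
predicted = select_labels([-2.0, 2.0, -0.5], id2label)
print(text)
print(target)
print([item["label"] for item in predicted])
```

Because each label is scored independently with a sigmoid, lowering the threshold trades precision for recall; the 0.3 default here simply mirrors the value listed under Training Details.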