grano1 committed on
Commit 95f8b7e · verified · 1 Parent(s): 5458186

Update README.md

Files changed (1)
  1. README.md +158 -157
README.md CHANGED
@@ -1,158 +1,159 @@
1
- ---
2
- license: mit
3
- datasets:
4
- - RichardErkhov/April_2023_Public_Data_File_from_Crossref
5
- metrics:
6
- - precision
7
- - recall
8
- - f1
9
- base_model:
10
- - allenai/scibert_scivocab_uncased
11
- pipeline_tag: text-classification
12
- tags:
13
- - scientometrics
14
- - asjc
15
- - multi-label
16
- task_categories:
17
- - text-classification
18
- widget:
19
- - text: "title={Jodometrie}, container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, abstract={}"
20
-
21
- ---
22
- # 🧠 Open Multi-Label ASJC Classification
23
-
24
- ## Model Overview
25
- This model fine-tunes **allenai/scibert_scivocab_uncased** across **307 ASJC subject categories**, enabling document-level classification beyond traditional journal-level schemes.
26
-
27
- - **Task**: Multi-label classification
28
- - **Labels**: 307 ASJC subjects (granular level)
29
- - **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
30
- - **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
31
- - **License**: MIT
32
- - **Framework**: Hugging Face Transformers
33
-
34
- ---
35
-
36
- ## 📚 Intended Use
37
- - Classify individual research documents into multiple ASJC subjects.
38
- - Analyze disciplinary orientation of **collections** (authors, institutions, databases).
39
- - Works with **title**, **abstract**, and optionally **container title** metadata.
40
-
41
- ---
42
-
43
- ## 🛠 Training Details
44
- - **Preprocessing**:
45
-   - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
46
-   - Multi-hot encoding for multi-label classification.
47
-   - Data augmentation for underrepresented classes.
48
- - **Fine-tuning**:
49
-   - Optimizer: AdamW
50
-   - Loss: Binary Cross-Entropy
51
-   - Learning Rate: 2e-5
52
-   - Epochs: 1
53
-   - Batch Size: 16
54
-   - Threshold for label assignment: 0.3
55
-
56
- ---
57
-
58
- ## 📈 Metrics
59
- | Input Features | Labels | Precision | Recall | F1-Score |
60
- |-----------------------------------|--------|-----------|--------|----------|
61
- | Title + Container Title + Abstract| 307 | 0.912 | 0.885 | 0.892 |
62
- | Title + Abstract | 307 | 0.607 | 0.503 | 0.532 |
63
- | Title + Container Title | 307 | 0.949 | 0.957 | 0.952 |
64
- | Title only | 307 | 0.528 | 0.416 | 0.448 |
65
-
66
- For **26 parent subjects**, F1-score improves to **0.934** with full metadata.
67
-
68
- ---
69
-
70
- ## ✅ Model Strengths
71
- - Handles **interdisciplinary** and **general science journals**.
72
- - Works even without container title (lower accuracy).
73
- - Scalable for large collections.
74
-
75
- ---
76
-
77
- ## ⚠️ Limitations
78
- - Performance relies on metadata completeness (title, abstract, container title).
79
- - Lower accuracy for rare subjects and missing source info.
80
- - Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
81
-
82
- ---
83
-
84
- ## 🔍 Example Usage
85
-
86
- ```python
87
- from transformers import TextClassificationPipeline, pipeline
88
- import torch
89
-
90
- # --- Custom multi-label pipeline ---
91
- class ASJCMultiLabelPipeline(TextClassificationPipeline):
92
-     """
93
-     Multi-label classification pipeline for ASJC categories.
94
-     Uses a configurable threshold to return all labels with scores above the threshold.
95
-     """
96
-     def __init__(self, *args, **kwargs):
97
-         # Allow threshold override; default falls back to model config
98
-         self.threshold = kwargs.pop("threshold", None)
99
-         super().__init__(*args, **kwargs)
100
-         if self.threshold is None:
101
-             self.threshold = getattr(self.model.config, "threshold", 0.3)
102
-
103
-     def postprocess(self, model_outputs, **kwargs):
104
-         # Convert logits to probabilities using sigmoid
105
-         scores = torch.sigmoid(torch.tensor(model_outputs["logits"])).tolist()
106
-
107
-         results = []
108
-         for i, score in enumerate(scores[0]):
109
-             if score >= self.threshold:
110
-                 label = self.model.config.id2label[(i)]
111
-                 results.append({"label": label, "score": float(score)})
112
-
113
-         # Sort by descending score
114
-         results = sorted(results, key=lambda x: x["score"], reverse=True)
115
-         return results
116
-
117
- # --- Create the pipeline explicitly using the custom class ---
118
- pipe = pipeline(
119
-     task="text-classification",
120
-     model="asjc-classification/scibert_multilabel_asjc_classifier",
121
-     pipeline_class=ASJCMultiLabelPipeline
122
- )
123
-
124
- # --- Example text input ---
125
- text = (
126
-     "title={Jodometrie}, "
127
-     "container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, "
128
-     "abstract={}"
129
- )
130
-
131
- # --- Get multi-label predictions ---
132
- result = pipe(text)
133
- print(result)
134
-
135
- # Predicted labels:
136
- [
137
-     {'label': 'Analytical Chemistry', 'score': 0.933479368686676},
138
-     {'label': 'Clinical Biochemistry', 'score': 0.9108470678329468},
139
-     {'label': 'Biochemistry', 'score': 0.494137704372406}
140
- ]
141
-
142
- # Expected labels:
143
- # - Clinical Biochemistry
144
- # - Analytical Chemistry
145
- ```
146
-
147
- ---
148
-
149
- ## 📖 Citation
150
- If you use this work, please cite:
151
-
152
- ```bibtex
153
- @article{gusenbauer2025open,
154
- author = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
155
- title = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
156
- journal = {Scientometrics},
157
- year = {2025}
 
158
  }
 
1
+ ---
2
+ license: mit
3
+ datasets:
4
+ - RichardErkhov/April_2023_Public_Data_File_from_Crossref
5
+ metrics:
6
+ - precision
7
+ - recall
8
+ - f1
9
+ base_model:
10
+ - allenai/scibert_scivocab_uncased
11
+ pipeline_tag: text-classification
12
+ tags:
13
+ - scientometrics
14
+ - asjc
15
+ - multi-label
16
+ task_categories:
17
+ - text-classification
18
+ widget:
19
+ - text: "title={Jodometrie}, container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, abstract={}"
20
+
21
+ ---
22
+ # 🧠 Open Multi-Label ASJC Classification
23
+
24
+ ## Model Overview
25
+ This model fine-tunes **allenai/scibert_scivocab_uncased** across **307 ASJC subject categories**, enabling document-level classification beyond traditional journal-level schemes.
26
+
27
+ - **Task**: Multi-label classification
28
+ - **Labels**: 307 ASJC subjects (granular level)
29
+ - **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
30
+ - **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
31
+ - **License**: MIT
32
+ - **Framework**: Hugging Face Transformers
33
+
34
+ ---
35
+
36
+ ## 📚 Intended Use
37
+ - Classify individual research documents into multiple ASJC subjects.
38
+ - Analyze disciplinary orientation of **collections** (authors, institutions, databases).
39
+ - Works with **title**, **abstract**, and optionally **container title** metadata.
40
+
41
+ ---
42
+
43
+ ## 🛠 Training Details
44
+ - **Preprocessing**:
45
+   - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
46
+   - Multi-hot encoding for multi-label classification.
47
+   - Data augmentation for underrepresented classes.
48
+ - **Fine-tuning**:
49
+   - Optimizer: AdamW
50
+   - Loss: Binary Cross-Entropy
51
+   - Learning Rate: 2e-5
52
+   - Epochs: 1
53
+   - Batch Size: 16
54
+   - Threshold for label assignment: 0.3
55
+
56
+ ---
57
+
58
+ ## 📈 Metrics
59
+ | Input Features | Labels | Precision | Recall | F1-Score |
60
+ |-----------------------------------|--------|-----------|--------|----------|
61
+ | Title + Container Title + Abstract| 307 | 0.912 | 0.885 | 0.892 |
62
+ | Title + Abstract | 307 | 0.607 | 0.503 | 0.532 |
63
+ | Title + Container Title | 307 | 0.949 | 0.957 | 0.952 |
64
+ | Title only | 307 | 0.528 | 0.416 | 0.448 |
65
+
66
+ For **26 parent subjects**, F1-score improves to **0.934** with full metadata.
67
+
68
+ ---
69
+
70
+ ## ✅ Model Strengths
71
+ - Handles **interdisciplinary** and **general science journals**.
72
+ - Works even without container title (lower accuracy).
73
+ - Scalable for large collections.
74
+
75
+ ---
76
+
77
+ ## ⚠️ Limitations
78
+ - Performance relies on metadata completeness (title, abstract, container title).
79
+ - Lower accuracy for rare subjects and missing source info.
80
+ - Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
81
+
82
+ ---
83
+
84
+ ## 🔍 Example Usage
85
+
86
+ ```python
87
+ from transformers import TextClassificationPipeline, pipeline
88
+ import torch
89
+
90
+ # --- Custom multi-label pipeline ---
91
+ class ASJCMultiLabelPipeline(TextClassificationPipeline):
92
+     """
93
+     Multi-label classification pipeline for ASJC categories.
94
+     Uses a configurable threshold to return all labels with scores above the threshold.
95
+     """
96
+     def __init__(self, *args, **kwargs):
97
+         # Allow threshold override; default falls back to model config
98
+         self.threshold = kwargs.pop("threshold", None)
99
+         super().__init__(*args, **kwargs)
100
+         if self.threshold is None:
101
+             self.threshold = getattr(self.model.config, "threshold", 0.3)
102
+
103
+     def postprocess(self, model_outputs, **kwargs):
104
+         # Convert logits to probabilities using sigmoid
105
+         scores = torch.sigmoid(torch.tensor(model_outputs["logits"])).tolist()
106
+
107
+         results = []
108
+         for i, score in enumerate(scores[0]):
109
+             if score >= self.threshold:
110
+                 label = self.model.config.id2label[(i)]
111
+                 results.append({"label": label, "score": float(score)})
112
+
113
+         # Sort by descending score
114
+         results = sorted(results, key=lambda x: x["score"], reverse=True)
115
+         return results
116
+
117
+ # --- Create the pipeline explicitly using the custom class ---
118
+ pipe = pipeline(
119
+     task="text-classification",
120
+     model="asjc-classification/scibert_multilabel_asjc_classifier",
121
+     pipeline_class=ASJCMultiLabelPipeline
122
+ )
123
+
124
+ # --- Example text input ---
125
+ text = (
126
+     "title={Jodometrie}, "
127
+     "container_title={Fresenius' Zeitschrift für analytische Chemie, Zeitschrift für analytische Chemie}, "
128
+     "abstract={}"
129
+ )
130
+
131
+ # --- Get multi-label predictions ---
132
+ result = pipe(text)
133
+ print(result)
134
+
135
+ # Predicted labels:
136
+ [
137
+     {'label': 'Analytical Chemistry', 'score': 0.933479368686676},
138
+     {'label': 'Clinical Biochemistry', 'score': 0.9108470678329468},
139
+     {'label': 'Biochemistry', 'score': 0.494137704372406}
140
+ ]
141
+
142
+ # Expected labels:
143
+ # - Clinical Biochemistry
144
+ # - Analytical Chemistry
145
+ ```
146
+
147
+ ---
148
+
149
+ ## 📖 Citation
150
+ If you use this work, please cite:
151
+
152
+ ```bibtex
153
+ @article{gusenbauer2025open,
154
+ author = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
155
+ title = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
156
+ journal = {Scientometrics},
157
+ year = {2025},
158
+ doi = {10.1007/s11192-025-05490-0}
159
  }
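
The README in this diff assigns labels at a threshold of 0.3 and reports precision, recall, and F1. As a rough illustration of how the threshold interacts with those metrics, the sketch below scores the README's own usage example (three predicted scores, two expected labels) at two thresholds. The helper names `select_labels` and `precision_recall_f1` are hypothetical and not part of the model repository. The README does not say how the reported table values are averaged across documents and labels, so this per-document calculation is an assumption for illustration only.

```python
# Illustrative sketch only: per-document precision/recall/F1 at a given threshold.
# Scores and expected labels are copied from the README's usage example;
# the helper functions are hypothetical, not part of the model repository.

def select_labels(scores, threshold):
    """Return the set of labels whose score is at or above the threshold."""
    return {label for label, score in scores.items() if score >= threshold}


def precision_recall_f1(predicted, expected):
    """Compute precision, recall and F1 for one document's label sets."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


scores = {
    "Analytical Chemistry": 0.933479368686676,
    "Clinical Biochemistry": 0.9108470678329468,
    "Biochemistry": 0.494137704372406,
}
expected = {"Clinical Biochemistry", "Analytical Chemistry"}

for threshold in (0.3, 0.5):
    predicted = select_labels(scores, threshold)
    p, r, f1 = precision_recall_f1(predicted, expected)
    print(f"threshold={threshold}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

For this one document, the default threshold of 0.3 keeps "Biochemistry" as an extra label, so precision falls below recall. Raising the threshold to 0.5 drops that label. A single example says nothing about which threshold works best across the full collection.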