grano1 committed on
Commit 460c565 · verified · 1 Parent(s): d63ac78

Update README.md

Files changed (1)
  1. README.md +118 -103
README.md CHANGED
@@ -1,103 +1,118 @@
- # 🧠 Open Multi-Label ASJC Classification
-
- ## Model Overview
- This model fine-tunes **allenai/scibert_scivocab_uncased** across **307 ASJC subject categories**, enabling document-level classification beyond traditional journal-level schemes.
-
- - **Task**: Multi-label classification
- - **Labels**: 307 ASJC subjects (granular level)
- - **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
- - **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
- - **License**: MIT
- - **Framework**: Hugging Face Transformers
-
- ---
-
- ## 📚 Intended Use
- - Classify individual research documents into multiple ASJC subjects.
- - Analyze disciplinary orientation of **collections** (authors, institutions, databases).
- - Works with **title**, **abstract**, and optionally **container title** metadata.
-
- ---
-
- ## 🛠 Training Details
- - **Dataset**: [Crossref](https://doi.org/10.13003/8wx5k)
- - **Preprocessing**:
-   - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
-   - Multi-hot encoding for multi-label classification.
-   - Data augmentation for underrepresented classes.
- - **Fine-tuning**:
-   - Optimizer: AdamW
-   - Loss: Binary Cross-Entropy
-   - Learning Rate: 2e-5
-   - Epochs: 1
-   - Batch Size: 16
-   - Threshold for label assignment: 0.3
-
- ---
-
- ## 📈 Metrics
- | Input Features | Labels | Precision | Recall | F1-Score |
- |-----------------------------------|--------|-----------|--------|----------|
- | Title + Container Title + Abstract| 307 | 0.912 | 0.885 | 0.892 |
- | Title + Abstract | 307 | 0.607 | 0.503 | 0.532 |
- | Title + Container Title | 307 | 0.949 | 0.957 | 0.952 |
- | Title only | 307 | 0.528 | 0.416 | 0.448 |
-
- For **26 parent subjects**, F1-score improves to **0.934** with full metadata.
-
- ---
-
- ## ✅ Model Strengths
- - Handles **interdisciplinary** and **general science journals**.
- - Works even without container title (lower accuracy).
- - Scalable for large collections.
-
- ---
-
- ## ⚠️ Limitations
- - Performance relies on metadata completeness (title, abstract, container title).
- - Lower accuracy for rare subjects and missing source info.
- - Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
-
- ---
-
- ## 🔍 Example Usage
-
- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- import torch
- import json
-
- # Load model and tokenizer
- model_name = "your-hf-username/open-asjc-multilabel"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForSequenceClassification.from_pretrained(model_name)
-
- # Load sample input
- with open("small_example.json") as f:
-     data = json.load(f)
-
- text = data["title"] + " " + data.get("abstract", "")
- inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
-
- # Predict
- with torch.no_grad():
-     outputs = model(**inputs)
- probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]
-
- # Apply threshold
- threshold = 0.3
- predicted_labels = [label for label, prob in zip(model.config.id2label.values(), probs) if prob >= threshold]
- ```
-
-
- ## 📖 Citation
- If you use this work, please cite:
-
- ```bibtex
- @article{gusenbauer2025open,
- author = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
- title = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
- journal = {Scientometrics},
- year = {2025}
- }
+ ---
+ license: mit
+ datasets:
+ - RichardErkhov/April_2023_Public_Data_File_from_Crossref
+ metrics:
+ - precision
+ - recall
+ - f1
+ base_model:
+ - allenai/scibert_scivocab_uncased
+ pipeline_tag: text-classification
+ tags:
+ - scientometrics
+ - asjc
+ ---
+ # 🧠 Open Multi-Label ASJC Classification
+
+ ## Model Overview
+ This model fine-tunes **allenai/scibert_scivocab_uncased** across **307 ASJC subject categories**, enabling document-level classification beyond traditional journal-level schemes.
+
+ - **Task**: Multi-label classification
+ - **Labels**: 307 ASJC subjects (granular level)
+ - **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
+ - **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
+ - **License**: MIT
+ - **Framework**: Hugging Face Transformers
+
+ ---
+
+ ## 📚 Intended Use
+ - Classify individual research documents into multiple ASJC subjects.
+ - Analyze disciplinary orientation of **collections** (authors, institutions, databases).
+ - Works with **title**, **abstract**, and optionally **container title** metadata.
+
+ ---
+
+ ## 🛠 Training Details
+ - **Dataset**: [Crossref](https://doi.org/10.13003/8wx5k)
+ - **Preprocessing**:
+   - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
+   - Multi-hot encoding for multi-label classification.
+   - Data augmentation for underrepresented classes.
+ - **Fine-tuning**:
+   - Optimizer: AdamW
+   - Loss: Binary Cross-Entropy
+   - Learning Rate: 2e-5
+   - Epochs: 1
+   - Batch Size: 16
+   - Threshold for label assignment: 0.3
+
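The preprocessing and fine-tuning bullets above (multi-hot encoding, binary cross-entropy, a 0.3 assignment threshold) can be sketched in plain Python as follows. This is an illustration only, not the authors' training code: the four-label list, the logits, and the helper names `multi_hot`, `bce_with_logits`, and `assign_labels` are hypothetical, and the real model scores 307 labels.

```python
import math

# Hypothetical label subset for illustration; the model itself uses 307 ASJC subjects.
LABELS = ["Artificial Intelligence", "Software", "Oncology", "Genetics"]


def multi_hot(doc_labels, all_labels=LABELS):
    """Encode a document's label set as a 0/1 vector over all labels."""
    present = set(doc_labels)
    return [1.0 if name in present else 0.0 for name in all_labels]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy, treating each label as an independent yes/no decision."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return total / len(logits)


def assign_labels(logits, threshold=0.3, all_labels=LABELS):
    """Keep every label whose sigmoid probability reaches the threshold."""
    return [name for name, z in zip(all_labels, logits) if sigmoid(z) >= threshold]


targets = multi_hot(["Artificial Intelligence", "Software"])
logits = [2.0, -0.5, -3.0, -4.0]  # made-up model outputs
loss = bce_with_logits(logits, targets)
predicted = assign_labels(logits)
```

With these made-up logits, "Software" has a probability of about 0.38, so it is assigned at the 0.3 threshold but would be dropped at 0.5. A lower threshold trades some precision for recall.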
+ ---
+
+ ## 📈 Metrics
+ | Input Features | Labels | Precision | Recall | F1-Score |
+ |-----------------------------------|--------|-----------|--------|----------|
+ | Title + Container Title + Abstract| 307 | 0.912 | 0.885 | 0.892 |
+ | Title + Abstract | 307 | 0.607 | 0.503 | 0.532 |
+ | Title + Container Title | 307 | 0.949 | 0.957 | 0.952 |
+ | Title only | 307 | 0.528 | 0.416 | 0.448 |
+
+ For **26 parent subjects**, F1-score improves to **0.934** with full metadata.
+
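One way to read the parent-subject figure above: ASJC codes are four digits, and the first two digits identify the subject area, so fine-grained predictions can be rolled up before scoring. The sketch below assumes labels are available as integer ASJC codes and scores a single document with set-based precision, recall, and F1. The README does not state how the roll-up or the averaging was actually done, and `parent_code`, `roll_up`, and `prf` are hypothetical helpers.

```python
def parent_code(asjc_code: int) -> int:
    """Map a four-digit ASJC code to its subject-area base code, e.g. 1702 -> 1700."""
    return (asjc_code // 100) * 100


def roll_up(codes):
    """Collapse fine-grained codes to the sorted set of parent subject areas."""
    return sorted({parent_code(c) for c in codes})


def prf(predicted, gold):
    """Set-based precision, recall, and F1 for one document."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example: two Computer Science codes and one Medicine code predicted.
predicted_fine = [1702, 1712, 2730]
gold_fine = [1702, 1705]

_, _, f1_fine = prf(predicted_fine, gold_fine)
_, _, f1_parent = prf(roll_up(predicted_fine), roll_up(gold_fine))
```

In this made-up case the fine-grained F1 is 0.4 and the parent-level F1 is about 0.67, because a near miss within Computer Science still counts as a hit after roll-up.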
+ ---
+
+ ## ✅ Model Strengths
+ - Handles **interdisciplinary** and **general science journals**.
+ - Works even without container title (lower accuracy).
+ - Scalable for large collections.
+
+ ---
+
+ ## ⚠️ Limitations
+ - Performance relies on metadata completeness (title, abstract, container title).
+ - Lower accuracy for rare subjects and missing source info.
+ - Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).
+
+ ---
+
+ ## 🔍 Example Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+ import json
+
+ # Load model and tokenizer
+ model_name = "your-hf-username/open-asjc-multilabel"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Load sample input
+ with open("small_example.json") as f:
+     data = json.load(f)
+
+ text = data["title"] + " " + data.get("abstract", "")
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+
+ # Predict
+ with torch.no_grad():
+     outputs = model(**inputs)
+ probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]
+
+ # Apply threshold
+ threshold = 0.3
+ predicted_labels = [label for label, prob in zip(model.config.id2label.values(), probs) if prob >= threshold]
+ ```
+
+
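The example above builds its input from title and abstract only and pairs probabilities with `id2label.values()`, which relies on dictionary order. A hedged variant is sketched below: it also includes the container title when present and looks labels up by index. The record keys, the field order, and the space separator are assumptions (the README does not say how fields were concatenated during training), and `build_text` and `labels_above_threshold` are hypothetical helpers.

```python
def build_text(record, fields=("title", "container_title", "abstract")):
    """Join the available metadata fields, skipping missing or empty ones.

    The key names and their order are assumptions for illustration.
    """
    parts = [(record.get(name) or "").strip() for name in fields]
    return " ".join(p for p in parts if p)


def labels_above_threshold(probs, id2label, threshold=0.3):
    """Look each label up by its index instead of relying on dict ordering."""
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]


# Made-up record and probabilities; id2label is deliberately listed out of order.
record = {
    "title": "Deep learning for tumor detection",
    "container_title": "Journal of Medical Imaging",
    "abstract": "",
}
text = build_text(record)
id2label = {1: "Oncology", 0: "Artificial Intelligence"}
selected = labels_above_threshold([0.9, 0.2], id2label)
```

In the original snippet, `build_text(data)` could stand in for the `text = ...` line and `labels_above_threshold(probs, model.config.id2label)` for the final list comprehension, provided the JSON file uses matching keys.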
+ ## 📖 Citation
+ If you use this work, please cite:
+
+ ```bibtex
+ @article{gusenbauer2025open,
+ author = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
+ title = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
+ journal = {Scientometrics},
+ year = {2025}
+ }