grano1 committed · Commit 0814639 · verified · Parent(s): 460c565

Upload README.md with huggingface_hub

Files changed (1): README.md (+117 −117)
---
license: mit
datasets:
- RichardErkhov/April_2023_Public_Data_File_from_Crossref
metrics:
- precision
- recall
- f1
base_model:
- allenai/scibert_scivocab_uncased
pipeline_tag: text-classification
tags:
- scientometrics
- asjc
---
# 🧠 Open Multi-Label ASJC Classification

## Model Overview
This model fine-tunes **allenai/scibert_scivocab_uncased** across **307 ASJC subject categories**, enabling document-level classification beyond traditional journal-level schemes.

- **Task**: Multi-label classification
- **Labels**: 307 ASJC subjects (granular level)
- **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
- **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
- **License**: MIT
- **Framework**: Hugging Face Transformers

---

## 📚 Intended Use
- Classify individual research documents into multiple ASJC subjects.
- Analyze the disciplinary orientation of **collections** (authors, institutions, databases).
- Works with **title**, **abstract**, and optionally **container title** metadata.

---

## 🛠 Training Details
- **Dataset**: [Crossref](https://doi.org/10.13003/8wx5k)
- **Preprocessing**:
  - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
  - Multi-hot encoding for multi-label classification.
  - Data augmentation for underrepresented classes.
- **Fine-tuning**:
  - Optimizer: AdamW
  - Loss: Binary Cross-Entropy
  - Learning Rate: 2e-5
  - Epochs: 1
  - Batch Size: 16
  - Threshold for label assignment: 0.3

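The multi-hot encoding and BCE setup above can be sketched as follows. This is a minimal illustration with 5 mock labels, not the authors' training code (the real model uses 307 subjects):

```python
import torch

# Hypothetical 5-label example (the real model uses 307 ASJC subjects).
num_labels = 5

# Multi-hot encoding: a document can carry several subject indices at once.
doc_subjects = [0, 3]  # e.g. a paper tagged with two ASJC categories
labels = torch.zeros(num_labels)
labels[doc_subjects] = 1.0  # -> tensor([1., 0., 0., 1., 0.])

# Binary cross-entropy over raw logits: one independent sigmoid per label.
logits = torch.tensor([2.0, -3.0, -1.0, 1.5, -2.0])  # mock model output
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)

# At inference time, a label is assigned where sigmoid(logit) >= 0.3.
predicted = (torch.sigmoid(logits) >= 0.3).nonzero().flatten().tolist()
print(predicted)  # -> [0, 3]
```

Unlike softmax classification, each label's probability is independent, which is what lets a single paper receive multiple ASJC subjects.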
---

## 📈 Metrics
| Input Features                      | Labels | Precision | Recall | F1-Score |
|-------------------------------------|--------|-----------|--------|----------|
| Title + Container Title + Abstract  | 307    | 0.912     | 0.885  | 0.892    |
| Title + Abstract                    | 307    | 0.607     | 0.503  | 0.532    |
| Title + Container Title             | 307    | 0.949     | 0.957  | 0.952    |
| Title only                          | 307    | 0.528     | 0.416  | 0.448    |

For the **26 parent subjects**, the F1-score improves to **0.934** with full metadata.

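These are standard multi-label precision/recall/F1 scores. As an illustration of how such numbers are computed from multi-hot predictions (the card does not state the averaging scheme, so sample-wise averaging here is our assumption):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical multi-hot ground truth and predictions: 4 documents x 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]])

# Sample-averaged scores: per-document precision/recall, averaged over documents.
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="samples", zero_division=0
)
print(p, r)  # 0.875 0.875 for this toy data
```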
---

## ✅ Model Strengths
- Handles **interdisciplinary** and **general science journals**.
- Works even without a container title (at lower accuracy).
- Scales to large collections.

---

## ⚠️ Limitations
- Performance depends on metadata completeness (title, abstract, container title).
- Lower accuracy for rare subjects and when source information is missing.
- Snapshot of the ASJC schema as of April 2023 (not updated for emerging fields).

---

## 🔍 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import json

# Load model and tokenizer
model_name = "your-hf-username/open-asjc-multilabel"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Load sample input
with open("small_example.json") as f:
    data = json.load(f)

text = data["title"] + " " + data.get("abstract", "")
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Predict
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply threshold; look labels up by index so they stay aligned with the logits
threshold = 0.3
predicted_labels = [
    model.config.id2label[i] for i, prob in enumerate(probs) if prob >= threshold
]
```
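For collection-level analysis (see Intended Use), the per-document label lists produced above can be aggregated into a subject profile. A minimal sketch with hypothetical labels and predictions:

```python
from collections import Counter

# Hypothetical per-document predictions (label lists as produced above).
doc_predictions = [
    ["Artificial Intelligence", "Information Systems"],
    ["Artificial Intelligence"],
    ["Library and Information Sciences", "Information Systems"],
]

# Collection-level profile: share of documents carrying each label.
counts = Counter(label for labels in doc_predictions for label in labels)
profile = {label: round(n / len(doc_predictions), 2) for label, n in counts.items()}
print(profile)
# -> {'Artificial Intelligence': 0.67, 'Information Systems': 0.67,
#     'Library and Information Sciences': 0.33}
```

Because the classifier is multi-label, the shares sum to more than 1 for interdisciplinary collections.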

## 📖 Citation
If you use this work, please cite:

```bibtex
@article{gusenbauer2025open,
  author  = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
  title   = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
  journal = {Scientometrics},
  year    = {2025}
}
```