Commit d63ac78 (verified) · committed by grano1 · 1 parent: 748c815

Upload README.md with huggingface_hub

  1. README.md +103 -0
README.md ADDED
@@ -0,0 +1,103 @@
# 🧠 Open Multi-Label ASJC Classification

## Model Overview
This model is a fine-tuned version of **allenai/scibert_scivocab_uncased** that assigns individual documents to **307 ASJC (All Science Journal Classification) subject categories**, enabling document-level classification rather than the journal-level assignment of traditional schemes.

- **Task**: Multi-label classification
- **Labels**: 307 ASJC subjects (granular level)
- **Base Model**: [SciBERT](https://arxiv.org/pdf/1903.10676)
- **Training Data**: Crossref 2023 dataset (titles, abstracts, container titles)
- **License**: MIT
- **Framework**: Hugging Face Transformers

---

## 📚 Intended Use
- Classify individual research documents into multiple ASJC subjects.
- Analyze the disciplinary orientation of **collections** (authors, institutions, databases).
- Works with **title**, **abstract**, and optionally **container title** metadata.

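One way to summarize the disciplinary orientation of a collection is to pool the labels predicted for its documents and report each subject's share. The following is a minimal sketch under that assumption; the helper name `collection_profile` and the sample predictions are hypothetical and not part of the released model.

```python
from collections import Counter


def collection_profile(doc_labels):
    """Return each subject's share of all label assignments in a collection."""
    counts = Counter(label for labels in doc_labels for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders subjects from most to least frequent
    return {label: n / total for label, n in counts.most_common()}


# Hypothetical per-document predictions for one author's publications
docs = [
    ["Artificial Intelligence", "Information Systems"],
    ["Artificial Intelligence"],
    ["Library and Information Sciences", "Information Systems"],
]
profile = collection_profile(docs)
for subject, share in profile.items():
    print(f"{subject}: {share:.0%}")
```

Shares are computed over label assignments rather than documents, so a document with several labels contributes more than once; dividing by the number of documents instead is an equally reasonable choice.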
---

## 🛠 Training Details
- **Dataset**: [Crossref](https://doi.org/10.13003/8wx5k)
- **Preprocessing**:
  - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
  - Multi-hot encoding for multi-label classification.
  - Data augmentation for underrepresented classes.
- **Fine-tuning**:
  - Optimizer: AdamW
  - Loss: Binary Cross-Entropy
  - Learning Rate: 2e-5
  - Epochs: 1
  - Batch Size: 16
  - Threshold for label assignment: 0.3

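To illustrate how multi-hot targets, binary cross-entropy, and the 0.3 threshold fit together, here is a small pure-Python sketch. It uses a toy label set of four subjects instead of 307, and the helper names, the logits, and the mean reduction of the loss are assumptions made for illustration, not details taken from the training code.

```python
import math


def multi_hot(doc_labels, all_labels):
    """Encode a document's labels as a 0/1 vector over the full label set."""
    index = {label: i for i, label in enumerate(all_labels)}
    vec = [0.0] * len(all_labels)
    for label in doc_labels:
        vec[index[label]] = 1.0
    return vec


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits, targets):
    """Binary cross-entropy, averaged over labels (assumed reduction)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)


labels = ["A", "B", "C", "D"]         # toy label set; the model uses 307
target = multi_hot(["B", "D"], labels)
logits = [-2.0, 1.5, -0.5, 0.2]       # hypothetical model outputs
loss = bce_loss(logits, target)

# Each label is decided independently against the 0.3 threshold
predicted = [l for l, z in zip(labels, logits) if sigmoid(z) >= 0.3]
print(target, round(loss, 3), predicted)
```

Because every label has its own sigmoid, a document can receive any number of labels. A threshold below 0.5 admits labels such as `C` here (probability about 0.38), trading some precision for recall.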
---

## 📈 Metrics
| Input Features                     | Labels | Precision | Recall | F1-Score |
|------------------------------------|--------|-----------|--------|----------|
| Title + Container Title + Abstract | 307    | 0.912     | 0.885  | 0.892    |
| Title + Abstract                   | 307    | 0.607     | 0.503  | 0.532    |
| Title + Container Title            | 307    | 0.949     | 0.957  | 0.952    |
| Title only                         | 307    | 0.528     | 0.416  | 0.448    |

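For readers who want to compute comparable scores on their own data, the sketch below calculates precision, recall, and F1 for multi-label predictions. The averaging scheme behind the table is not stated here, so micro-averaging is used purely as an assumption; the function name and sample data are hypothetical.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over per-document label sets."""
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        true, pred = set(true), set(pred)
        tp += len(true & pred)   # labels correctly predicted
        fp += len(pred - true)   # labels predicted but not in the gold set
        fn += len(true - pred)   # gold labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted labels for two documents
y_true = [{"A", "B"}, {"C"}]
y_pred = [{"A"}, {"C", "D"}]
precision, recall, f1 = micro_prf(y_true, y_pred)
print(precision, recall, f1)
```

Macro- or sample-averaged variants weight rare subjects differently and can give noticeably different numbers on an imbalanced label set.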
At the level of the **26 parent subjects**, the F1-score with full metadata improves to **0.934** (from 0.892 at the 307-subject level).

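ASJC codes are four digits long, and the first two digits identify the parent subject area (for example, 1702 "Artificial Intelligence" belongs to 17 "Computer Science"). Assuming predictions are available as numeric ASJC codes, which this README does not confirm, a roll-up to parent subjects could look like the sketch below. The helper names are hypothetical, and this is not necessarily how the parent-level score above was computed.

```python
def to_parent(code):
    """Map a four-digit ASJC code to its two-digit parent group, e.g. 1702 -> 17."""
    return int(code) // 100


def parent_labels(predicted_codes):
    """Collapse granular ASJC codes into a sorted list of unique parent groups."""
    return sorted({to_parent(code) for code in predicted_codes})


# 1702 and 1710 share the parent 17; 3309 falls under 33
parents = parent_labels([1702, 1710, 3309])
print(parents)
```

Several granular predictions often collapse into one parent, which is one reason scores at the parent level tend to be higher.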
---

## ✅ Model Strengths
- Handles **interdisciplinary** and **general science journals**.
- Works without a container title, at lower accuracy (F1 of 0.532 for title + abstract vs. 0.892 with full metadata).
- Scalable to large collections.

---

## ⚠️ Limitations
- Performance depends on metadata completeness (title, abstract, container title).
- Lower accuracy for rare subjects and for records with missing source information.
- Reflects the ASJC schema as of April 2023; fields that emerged later are not covered.

---

## 🔍 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import json

# Load model and tokenizer
model_name = "your-hf-username/open-asjc-multilabel"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Load sample input (a JSON object with "title" and, optionally, "abstract")
with open("small_example.json") as f:
    data = json.load(f)

text = data["title"] + " " + data.get("abstract", "")
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Predict: the sigmoid yields an independent probability for each label
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply threshold; look labels up by index instead of relying on dict order
threshold = 0.3
predicted_labels = [
    model.config.id2label[i] for i, prob in enumerate(probs) if prob >= threshold
]
```

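The usage example combines only the title and abstract, while the metrics are considerably better when the container title is included. Below is a hedged sketch of a helper that joins whichever fields are present. The key names (`container_title` in particular), the field order, and the single-space separator are assumptions; the format used during training is not documented here and may differ.

```python
def build_input_text(record):
    """Join the available metadata fields into one input string."""
    # Hypothetical key names and order; adjust to match your data
    fields = ("title", "container_title", "abstract")
    parts = [str(record[key]).strip() for key in fields if record.get(key)]
    # Drop fields that were whitespace-only after stripping
    return " ".join(part for part in parts if part)


record = {
    "title": "Fine-tuning SciBERT for subject classification",
    "container_title": "Scientometrics",
    "abstract": "",
}
text = build_input_text(record)
print(text)
```

The resulting string can be passed to the tokenizer in place of the `text` variable from the usage example.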
## 📖 Citation
If you use this work, please cite:

```bibtex
@article{gusenbauer2025open,
  author  = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
  title   = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
  journal = {Scientometrics},
  year    = {2025}
}
```