Commit 6019380 (verified) by zehralx · 1 Parent(s): ef979ea

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +102 -6
README.md CHANGED
@@ -1,9 +1,105 @@
  ---
- base_model:
- - allenai/scibert_scivocab_uncased
  pipeline_tag: text-classification
  tags:
- - s-index
- - nih
- - datasharing
- ---
  ---
+ license: apache-2.0
+ library_name: transformers
  pipeline_tag: text-classification
  tags:
+ - scibert
+ - data-paper-classification
+ - scholarly-papers
+ - binary-classification
+ base_model: allenai/scibert_scivocab_uncased
+ datasets:
+ - custom
+ metrics:
+ - accuracy
+ - f1
+ model-index:
+ - name: scibert-data-paper
+   results:
+   - task:
+       type: text-classification
+       name: Data Paper Classification
+     metrics:
+     - name: Edge Case Accuracy
+       type: accuracy
+       value: 1.0
+     - name: Mean Confidence
+       type: accuracy
+       value: 0.94
+ ---
+
+ # SciBERT Data-Paper Classifier
+
+ A fine-tuned [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased) model for binary classification of scholarly papers as **data papers** (datasets, databases, atlases, benchmarks) vs **non-data papers** (methods, reviews, surveys, clinical trials).
+
+ Built for the [DataRank Portal](https://github.com/zehrakorkusuz/sindex-portal) — a data-sharing influence engine using Personalized PageRank on citation graphs.
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ clf = pipeline("text-classification", model="zehralx/scibert-data-paper", top_k=None, device=-1)
+ result = clf("MIMIC-III, a freely accessible critical care database")
+ # [{'label': 'LABEL_1', 'score': 0.9519}, {'label': 'LABEL_0', 'score': 0.0481}]
+ # LABEL_1 = data paper, LABEL_0 = not data paper
+ ```
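
The raw `LABEL_0` / `LABEL_1` names in the output above can be mapped to readable names with a small helper. This is a minimal sketch, not part of the model repository: the `LABEL_NAMES` mapping follows the comment in the usage example, the function name `top_label` is hypothetical, and the sample scores are copied from that example rather than produced by a fresh run.

```python
# Hypothetical helper; the label names follow the comment in the usage example.
LABEL_NAMES = {"LABEL_1": "data_paper", "LABEL_0": "not_data_paper"}


def top_label(result):
    """Return (readable_label, score) for the highest-scoring entry.

    Accepts the flat list shown in the usage example and also a nested
    list ([[...]]), which some transformers versions return.
    """
    if result and isinstance(result[0], list):
        result = result[0]
    best = max(result, key=lambda entry: entry["score"])
    return LABEL_NAMES.get(best["label"], best["label"]), best["score"]


# Sample output copied from the usage example, not a fresh model run.
sample = [
    {"label": "LABEL_1", "score": 0.9519},
    {"label": "LABEL_0", "score": 0.0481},
]
label, score = top_label(sample)
print(f"{label} ({score:.2f})")
```

Unknown label strings are passed through unchanged, so the helper does not fail if the model config is later updated with explicit `id2label` names.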
+
+ ## Model Details
+
+ | Property | Value |
+ |----------|-------|
+ | Base model | `allenai/scibert_scivocab_uncased` |
+ | Architecture | BertForSequenceClassification (12 layers, 768 hidden, 12 heads) |
+ | Parameters | ~110M |
+ | Max tokens | 512 |
+ | Output | Binary: `data_paper` (1) / `not_data_paper` (0) |
+ | Inference | CPU (no GPU required) |
59
+ ## Training
60
+
61
+ Two-phase continued fine-tuning:
62
+
63
+ 1. **Phase 1**: 5 epochs, learning rate 2e-5
64
+ 2. **Phase 2**: 3 epochs, learning rate 5e-6 (lower LR for refinement)
65
+
66
+ | Hyperparameter | Value |
67
+ |----------------|-------|
68
+ | Batch size | 24 |
69
+ | Label smoothing | 0.1 |
70
+ | Edge case weight | 5x |
71
+ | Mixed precision | FP16 |
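
The table does not say how label smoothing and the edge-case weight enter the loss. One common way to combine them is a per-example weighted cross-entropy against smoothed targets. The sketch below illustrates that under this assumption; the function name, the per-example `weight` argument, and the toy logits are hypothetical and may not match the actual training code.

```python
import math


def smoothed_weighted_loss(logits, label, weight=1.0, smoothing=0.1):
    """Cross-entropy against a smoothed target, scaled by a per-example weight."""
    # Numerically stable log-softmax.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_norm for x in logits]

    # Smoothed target: `smoothing` mass spread uniformly over all classes,
    # the remaining (1 - smoothing) added to the true class.
    num_classes = len(logits)
    target = [smoothing / num_classes] * num_classes
    target[label] += 1.0 - smoothing

    loss = -sum(t * lp for t, lp in zip(target, log_probs))
    return weight * loss


# Toy logits for one paper whose true class is 1 (data paper).
regular = smoothed_weighted_loss([-1.0, 2.0], label=1)
edge_case = smoothed_weighted_loss([-1.0, 2.0], label=1, weight=5.0)  # 5x, as in the table
print(f"regular={regular:.4f}, edge_case={edge_case:.4f}")
```

Under this reading, a misclassified edge case contributes five times the gradient of a regular example, while smoothing keeps the model from being pushed toward fully saturated probabilities.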
+
+ ## Evaluation
+
+ Tested on 38 curated edge cases spanning diverse categories:
+
+ | Category | Examples | Correctly classified |
+ |----------|----------|---------------------|
+ | Data papers | UniProt, GTEx, ImageNet, TCGA, MIMIC-III, UK Biobank | All |
+ | Non-data papers | Methods, reviews, surveys, perspectives, protocols | All |
+
+ - **Edge case accuracy**: 100% (38/38)
+ - **Confidence range**: 0.80 - 0.96
+ - **Mean confidence**: 0.94
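
For reference, summary numbers of this kind can be derived from per-example predictions as follows. This is a minimal sketch: the `summarize` function and the three toy records are hypothetical placeholders, not the 38 curated cases, and it assumes that "confidence" means the score of the predicted label.

```python
def summarize(records):
    """Summarize (predicted_label, gold_label, confidence) tuples."""
    records = list(records)
    correct = sum(1 for pred, gold, _ in records if pred == gold)
    confidences = [conf for _, _, conf in records]
    return {
        "accuracy": correct / len(records),
        "min_confidence": min(confidences),
        "max_confidence": max(confidences),
        "mean_confidence": sum(confidences) / len(confidences),
    }


# Toy placeholder records, not the real evaluation set.
toy = [(1, 1, 0.95), (0, 0, 0.90), (1, 1, 0.85)]
stats = summarize(toy)
print(stats)
```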
+
+ ## Input Format
+
+ Concatenated `title + abstract`, truncated to 512 tokens. The model works well with title-only input when abstracts are unavailable.
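
A minimal sketch of building the input text described above. The helper name `build_input` and the single-space separator are assumptions, since the README does not specify how title and abstract are joined. Truncation to 512 tokens is left to the tokenizer, for example through its `truncation=True, max_length=512` arguments.

```python
def build_input(title, abstract=None):
    """Join title and abstract; fall back to the title alone if no abstract is given."""
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    # The separator is an assumption; the README only says "title + abstract".
    return f"{title} {abstract}" if abstract else title


print(build_input("MIMIC-III, a freely accessible critical care database"))
```

Treating an empty or whitespace-only abstract the same as a missing one avoids a trailing separator in the title-only case.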
+
+ ## Limitations
+
+ - Trained primarily on biomedical/life sciences papers; may underperform on other domains
+ - Binary classification only (no multi-class dataset subtypes)
+ - Confidence may be lower for interdisciplinary papers that mix methods and data contributions
+
+ ## Citation
+
+ ```bibtex
+ @misc{scibert-data-paper-2026,
+   title={SciBERT Data-Paper Classifier},
+   author={Zehra Korkusuz},
+   year={2026},
+   url={https://huggingface.co/zehralx/scibert-data-paper}
+ }
+ ```