---
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- scibert
- data-paper-classification
- scholarly-papers
- binary-classification
base_model: allenai/scibert_scivocab_uncased
metrics:
- accuracy
- f1
model-index:
- name: scibert-data-paper
  results:
  - task:
      type: text-classification
      name: Data Paper Classification
    metrics:
    - name: Edge Case Accuracy
      type: accuracy
      value: 1
    - name: Mean Confidence
      type: accuracy
      value: 0.94
---

# SciBERT Data-Paper Classifier

A fine-tuned [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased) model for binary classification of scholarly papers as **data papers** (datasets, databases, atlases, benchmarks) vs **non-data papers** (methods, reviews, surveys, clinical trials).

Built for the [DataRank Portal](https://github.com/zehrakorkusuz/sindex-portal) — a data-sharing influence engine using Personalized PageRank on citation graphs.

## Usage

```python
from transformers import pipeline

clf = pipeline("text-classification", model="zehralx/scibert-data-paper", top_k=None, device=-1)
result = clf("MIMIC-III, a freely accessible critical care database")
# [{'label': 'LABEL_1', 'score': 0.9519}, {'label': 'LABEL_0', 'score': 0.0481}]
# LABEL_1 = data paper, LABEL_0 = not data paper
```

## Model Details

| Property | Value |
|----------|-------|
| Base model | `allenai/scibert_scivocab_uncased` |
| Architecture | BertForSequenceClassification (12 layers, 768 hidden, 12 heads) |
| Parameters | ~110M |
| Max tokens | 512 |
| Output | Binary: `data_paper` (1) / `not_data_paper` (0) |
| Inference | CPU (no GPU required) |
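
If you prefer to work below the `pipeline` abstraction, the checkpoint loads with the standard Auto classes. A minimal CPU-only sketch (the label-index mapping follows the Output row above):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zehralx/scibert-data-paper")
model = AutoModelForSequenceClassification.from_pretrained("zehralx/scibert-data-paper")
model.eval()

text = "MIMIC-III, a freely accessible critical care database"
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)

# Index 1 = data_paper, index 0 = not_data_paper (see the Output row above).
print({"not_data_paper": round(probs[0].item(), 4),
       "data_paper": round(probs[1].item(), 4)})
```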



## Training

Training data: [Labeling 4K papers with Gemini Flash 2.0](https://www.kaggle.com/datasets/zehrakorkusuz/labeling-4k-datasets-with-gemini-flash-2-0) (Kaggle)

Two-phase continued fine-tuning:

1. **Phase 1**: 5 epochs, learning rate 2e-5
2. **Phase 2**: 3 epochs, learning rate 5e-6 (lower LR for refinement)

| Hyperparameter | Value |
|----------------|-------|
| Batch size | 24 |
| Label smoothing | 0.1 |
| Edge case weight | 5x |
| Mixed precision | FP16 |
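
The two-phase schedule can be approximated with the standard `Trainer` API. The sketch below is illustrative, not the exact training script: the toy dataset stands in for the Kaggle data linked above, and the 5x edge-case weighting is omitted (it could be implemented via oversampling or a custom loss).

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2
)

# Toy stand-in for the training data: label 1 = data paper, 0 = not a data paper.
toy = Dataset.from_dict({
    "text": [
        "UniProt: the universal protein knowledgebase",
        "A survey of transformer architectures for NLP",
    ],
    "label": [1, 0],
})
train_dataset = toy.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

def phase_args(output_dir, epochs, lr):
    return TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        learning_rate=lr,
        per_device_train_batch_size=24,
        label_smoothing_factor=0.1,
        fp16=True,           # mixed precision; requires a GPU
        report_to="none",
    )

# Phase 1 (5 epochs @ 2e-5), then continued fine-tuning in Phase 2 (3 epochs @ 5e-6).
for out_dir, epochs, lr in [("phase1", 5, 2e-5), ("phase2", 3, 5e-6)]:
    Trainer(
        model=model,
        args=phase_args(out_dir, epochs, lr),
        train_dataset=train_dataset,
        data_collator=DataCollatorWithPadding(tokenizer),
    ).train()
```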

## Evaluation

Tested on 38 curated edge cases spanning diverse categories:

| Category | Examples | Correctly classified |
|----------|----------|---------------------|
| Data papers | UniProt, GTEx, ImageNet, TCGA, MIMIC-III, UK Biobank | All |
| Non-data papers | Methods, reviews, surveys, perspectives, protocols | All |

- **Edge case accuracy**: 100% (38/38)
- **Confidence range**: 0.80 - 0.96
- **Mean confidence**: 0.94

## Input Format

Concatenated `title + abstract`, truncated to 512 tokens. The model works well with title-only input when abstracts are unavailable.
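
A minimal sketch of assembling that input (the plain-space separator and the example abstract are illustrative assumptions; the tokenizer enforces the 512-token cap):

```python
from transformers import pipeline

clf = pipeline("text-classification", model="zehralx/scibert-data-paper", device=-1)

title = "MIMIC-III, a freely accessible critical care database"
abstract = "MIMIC-III is a freely available database of de-identified ICU admissions."
text = f"{title} {abstract}" if abstract else title  # title-only also works

# truncation/max_length are forwarded to the tokenizer.
print(clf(text, truncation=True, max_length=512))
```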

## Limitations

- Trained primarily on biomedical/life sciences papers; may underperform on other domains
- Binary classification only (no multi-class dataset subtypes)
- Confidence may be lower for interdisciplinary papers that mix methods and data contributions

## Citation

```bibtex
@misc{scibert-data-paper-2026,
  title={SciBERT Data-Paper Classifier},
  author={Korkusuz, Zehra and Huang, Kuan-Lin},
  year={2026},
  url={https://huggingface.co/zehralx/scibert-data-paper}
}
```