zehralx/scibert-data-paper

@@ -1,9 +1,105 @@
 ---
-base_model:
-- allenai/scibert_scivocab_uncased
 pipeline_tag: text-classification
 tags:
-- s-index
-- nih
-- datasharing
----

 ---
+license: apache-2.0
+library_name: transformers
 pipeline_tag: text-classification
 tags:
+  - scibert
+  - data-paper-classification
+  - scholarly-papers
+  - binary-classification
+base_model: allenai/scibert_scivocab_uncased
+datasets:
+  - custom
+metrics:
+  - accuracy
+  - f1
+model-index:
+  - name: scibert-data-paper
+    results:
+      - task:
+          type: text-classification
+          name: Data Paper Classification
+        metrics:
+          - name: Edge Case Accuracy
+            type: accuracy
+            value: 1.0
+          - name: Mean Confidence
+            type: mean_confidence
+            value: 0.94
+---
+# SciBERT Data-Paper Classifier
+A fine-tuned [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased) model for binary classification of scholarly papers as **data papers** (datasets, databases, atlases, benchmarks) or **non-data papers** (methods, reviews, surveys, clinical trials).
+Built for the [DataRank Portal](https://github.com/zehrakorkusuz/sindex-portal), a data-sharing influence engine that uses Personalized PageRank on citation graphs.
+## Usage
+```python
+from transformers import pipeline
+
+# top_k=None returns the scores for both labels; device=-1 runs inference on CPU.
+clf = pipeline("text-classification", model="zehralx/scibert-data-paper", top_k=None, device=-1)
+
+# Title-only input; a title followed by its abstract is passed the same way.
+result = clf("MIMIC-III, a freely accessible critical care database")
+# [{'label': 'LABEL_1', 'score': 0.9519}, {'label': 'LABEL_0', 'score': 0.0481}]
+# LABEL_1 = data paper, LABEL_0 = not data paper
+```
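The pipeline returns generic `LABEL_0`/`LABEL_1` names, so downstream code usually needs a small mapping step. The following is a minimal sketch, assuming the output shape shown in the Usage snippet; `LABEL_NAMES`, `to_prediction`, and the 0.5 threshold are illustrative choices and are not part of the model repository.

```python
# Hypothetical helper: the names and the threshold are illustrative, not from the model repo.
LABEL_NAMES = {"LABEL_1": "data_paper", "LABEL_0": "not_data_paper"}

def to_prediction(scores, threshold=0.5):
    """Map a list of {'label', 'score'} dicts to (readable_label, data_paper_score)."""
    by_label = {item["label"]: item["score"] for item in scores}
    p_data = by_label["LABEL_1"]
    label = LABEL_NAMES["LABEL_1"] if p_data >= threshold else LABEL_NAMES["LABEL_0"]
    return label, p_data

# Example using the scores quoted in the Usage snippet.
example = [{"label": "LABEL_1", "score": 0.9519}, {"label": "LABEL_0", "score": 0.0481}]
print(to_prediction(example))  # ('data_paper', 0.9519)
```

Raising the threshold trades recall for precision on the data-paper class; the card does not state which operating point the portal uses.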
+## Model Details
+| Property | Value |
+|----------|-------|
+| Base model | `allenai/scibert_scivocab_uncased` |
+| Architecture | BertForSequenceClassification (12 layers, 768 hidden, 12 heads) |
+| Parameters | ~110M |
+| Max tokens | 512 |
+| Output | Binary: `data_paper` (1, returned as `LABEL_1`) / `not_data_paper` (0, returned as `LABEL_0`) |
+| Inference | CPU (no GPU required) |
+## Training
+Two-phase continued fine-tuning:
+1. **Phase 1**: 5 epochs, learning rate 2e-5
+2. **Phase 2**: 3 epochs, learning rate 5e-6 (lower LR for refinement)
+
+| Hyperparameter | Value |
+|----------------|-------|
+| Batch size | 24 |
+| Label smoothing | 0.1 |
+| Edge case weight | 5x |
+| Mixed precision | FP16 |
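The card does not say how label smoothing and the 5x edge case weight enter the loss. One common reading, sketched here purely as an assumption, is cross-entropy against a uniformly smoothed target, scaled by a per-example weight. `PHASES` restates the schedule listed above; `smoothed_weighted_loss` and its arguments are hypothetical names, not the author's training code.

```python
import math

# The two-phase schedule as listed in the card.
PHASES = [
    {"name": "phase_1", "epochs": 5, "learning_rate": 2e-5},
    {"name": "phase_2", "epochs": 3, "learning_rate": 5e-6},
]

def smoothed_weighted_loss(probs, target, smoothing=0.1, weight=1.0):
    """Cross-entropy against a label-smoothed target, scaled by a per-example weight.

    probs  : predicted class probabilities, e.g. [p_not_data, p_data]
    target : integer class index (0 or 1)
    """
    num_classes = len(probs)
    loss = 0.0
    for k, p in enumerate(probs):
        # Smoothed target: eps / K on every class, plus (1 - eps) on the true class.
        q = smoothing / num_classes + ((1.0 - smoothing) if k == target else 0.0)
        loss -= q * math.log(p)
    return weight * loss

# Assumption: edge cases get weight 5.0, ordinary examples 1.0.
regular = smoothed_weighted_loss([0.1, 0.9], target=1)
edge = smoothed_weighted_loss([0.1, 0.9], target=1, weight=5.0)
print(sum(phase["epochs"] for phase in PHASES), edge / regular)
```

With smoothing 0.1 and two classes, the target for a data paper becomes 0.95 / 0.05 rather than 1 / 0. If the edge cases were instead oversampled five times, the effect on the expected gradient would be similar, but that is equally unconfirmed by the card.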
+## Evaluation
+Tested on 38 curated edge cases covering both classes:
+
+| Category | Examples | Correctly classified |
+|----------|----------|---------------------|
+| Data papers | UniProt, GTEx, ImageNet, TCGA, MIMIC-III, UK Biobank | All |
+| Non-data papers | Methods, reviews, surveys, perspectives, protocols | All |
+- **Edge case accuracy**: 100% (38/38)
+- **Confidence range**: 0.80 to 0.96
+- **Mean confidence**: 0.94
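Summary figures of this kind can be computed from per-example predictions as sketched below. This assumes that "confidence" means the score of the predicted class, which the card does not state explicitly; `summarize` is a hypothetical helper, and the rows are placeholders rather than the 38 edge cases.

```python
def summarize(predictions):
    """predictions: list of (true_label, predicted_label, confidence) tuples."""
    correct = sum(1 for true, pred, _ in predictions if true == pred)
    confidences = [conf for _, _, conf in predictions]
    return {
        "accuracy": correct / len(predictions),
        "min_confidence": min(confidences),
        "max_confidence": max(confidences),
        "mean_confidence": sum(confidences) / len(confidences),
    }

# Placeholder rows for illustration only; these are NOT the card's edge cases.
toy = [(1, 1, 0.95), (0, 0, 0.85), (1, 0, 0.60), (0, 0, 0.80)]
stats = summarize(toy)
print(stats["accuracy"])  # 0.75
```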
+## Input Format
+Pass the title and abstract concatenated into a single string (`title + abstract`); input is truncated to 512 tokens. Title-only input also works when the abstract is unavailable.
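A minimal sketch of preparing the input string follows. The card does not specify a separator between title and abstract, so the single space used here is an assumption, and `build_input` is a hypothetical helper name.

```python
def build_input(title, abstract=None, sep=" "):
    """Join title and abstract into one string; fall back to the title alone."""
    title = title.strip()
    if abstract and abstract.strip():
        return f"{title}{sep}{abstract.strip()}"
    return title

text = build_input("MIMIC-III, a freely accessible critical care database")
print(text)
```

Truncation itself happens at tokenization time. With the `clf` pipeline from the Usage section, it can typically be requested per call by forwarding tokenizer arguments, for example `clf(text, truncation=True, max_length=512)`.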
+## Limitations
+- Trained primarily on biomedical/life sciences papers; may underperform on other domains
+- Binary classification only (no multi-class dataset subtypes)
+- Confidence may be lower for interdisciplinary papers that mix methods and data contributions
+## Citation
+```bibtex
+@misc{scibert-data-paper-2026,
+  title={SciBERT Data-Paper Classifier},
+  author={Zehra Korkusuz},
+  year={2026},
+  url={https://huggingface.co/zehralx/scibert-data-paper}
+}
+```