zehralx
/

scibert-data-paper

Text Classification

data-paper-classification

scholarly-papers

binary-classification

Eval Results (legacy)

text-embeddings-inference

Model card Files Files and versions

scibert-data-paper / README.md

zehralx's picture

Update README.md

2e4161c verified 3 days ago

|

history blame contribute delete

3.17 kB

	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- scibert
	- data-paper-classification
	- scholarly-papers
	- binary-classification
	base_model: allenai/scibert_scivocab_uncased
	metrics:
	- accuracy
	- f1
	model-index:
	- name: scibert-data-paper
	results:
	- task:
	type: text-classification
	name: Data Paper Classification
	metrics:
	- name: Edge Case Accuracy
	type: accuracy
	value: 1
	- name: Mean Confidence
	type: accuracy
	value: 0.94
	---

	# SciBERT Data-Paper Classifier

	A fine-tuned [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased) model for binary classification of scholarly papers as data papers (datasets, databases, atlases, benchmarks) vs non-data papers (methods, reviews, surveys, clinical trials).

	Built for the [DataRank Portal](https://github.com/zehrakorkusuz/sindex-portal) — a data-sharing influence engine using Personalized PageRank on citation graphs.

	## Usage

	```python
	from transformers import pipeline

	clf = pipeline("text-classification", model="zehralx/scibert-data-paper", top_k=None, device=-1)
	result = clf("MIMIC-III, a freely accessible critical care database")
	# [{'label': 'LABEL_1', 'score': 0.9519}, {'label': 'LABEL_0', 'score': 0.0481}]
	# LABEL_1 = data paper, LABEL_0 = not data paper
	```

	## Model Details

	\| Property \| Value \|
	\|----------\|-------\|
	\| Base model \| `allenai/scibert_scivocab_uncased` \|
	\| Architecture \| BertForSequenceClassification (12 layers, 768 hidden, 12 heads) \|
	\| Parameters \| ~110M \|
	\| Max tokens \| 512 \|
	\| Output \| Binary: `data_paper` (1) / `not_data_paper` (0) \|
	\| Inference \| CPU (no GPU required) \|



	## Training

	[Train Data](https://www.kaggle.com/datasets/zehrakorkusuz/labeling-4k-datasets-with-gemini-flash-2-0)

	Two-phase continued fine-tuning:

	1. Phase 1: 5 epochs, learning rate 2e-5
	2. Phase 2: 3 epochs, learning rate 5e-6 (lower LR for refinement)

	\| Hyperparameter \| Value \|
	\|----------------\|-------\|
	\| Batch size \| 24 \|
	\| Label smoothing \| 0.1 \|
	\| Edge case weight \| 5x \|
	\| Mixed precision \| FP16 \|

	## Evaluation

	Tested on 38 curated edge cases spanning diverse categories:

	\| Category \| Examples \| Correctly classified \|
	\|----------\|----------\|---------------------\|
	\| Data papers \| UniProt, GTEx, ImageNet, TCGA, MIMIC-III, UK Biobank \| All \|
	\| Non-data papers \| Methods, reviews, surveys, perspectives, protocols \| All \|

	- Edge case accuracy: 100% (38/38)
	- Confidence range: 0.80 - 0.96
	- Mean confidence: 0.94

	## Input Format

	Concatenated `title + abstract`, truncated to 512 tokens. The model works well with title-only input when abstracts are unavailable.

	## Limitations

	- Trained primarily on biomedical/life sciences papers; may underperform on other domains
	- Binary classification only (no multi-class dataset subtypes)
	- Confidence may be lower for interdisciplinary papers that mix methods and data contributions

	## Citation

	```bibtex
	@misc{scibert-data-paper-2026,
	title={SciBERT Data-Paper Classifier},
	author={Zehra Korkusuz, Kuan-Lin Huang},
	year={2026},
	url={https://huggingface.co/zehralx/scibert-data-paper}
	}
	```