# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="vgainullin/citation_classifier")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("vgainullin/citation_classifier")
model = AutoModelForSequenceClassification.from_pretrained("vgainullin/citation_classifier")
Citation Classifier
A DistilBERT-based binary classifier that identifies sentences in biomedical text that require citations.
Model Description
This model takes a sentence from a scientific/biomedical article and predicts whether it should contain a citation (1) or not (0). It is a key component of the pubciter pipeline for automated citation generation.
- Base model: distilbert-base-uncased
- Task: Binary text classification
- Domain: Biomedical / scientific literature
Variants
- coteaching/ — Trained with co-teaching strategy for noise-robust learning
- self_filtering/ — Trained with self-filtering for label noise reduction
- last-checkpoint/ — Standard training final checkpoint
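The trailing slashes in the variant names suggest that each one is a subdirectory of the model repository. Under that assumption, a variant could be loaded with the `subfolder` argument of `from_pretrained`. The helper `load_variant` and the `VARIANTS` tuple below are hypothetical names used for illustration, not part of the model's own code.

```python
REPO_ID = "vgainullin/citation_classifier"

# Variant directory names as listed above (assumed to be repo subfolders).
VARIANTS = ("coteaching", "self_filtering", "last-checkpoint")


def load_variant(variant: str):
    """Load the tokenizer and model for one variant (hypothetical helper)."""
    name = variant.strip("/")
    if name not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")

    # Imported lazily so the name check works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, subfolder=name)
    model = AutoModelForSequenceClassification.from_pretrained(REPO_ID, subfolder=name)
    return tokenizer, model
```

If the variants turn out to be stored differently (for example as branches), the `revision` argument would be the place to look instead.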
Training
- Dataset: vgainullin/xciting_data — PubMed sentences annotated for citation presence
- Samples: 100k balanced (50k cited, 50k uncited)
- Epochs: 3
- Learning rate: 1e-6
- Batch size: 16 (train), 64 (eval)
- Optimizer: AdamW
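The hyperparameters above imply a fixed number of optimizer steps. The sketch below works through that arithmetic and maps the same values onto `transformers.TrainingArguments`. It assumes that all 100k samples are used for training, on a single device, with no gradient accumulation; the card does not state the split or hardware. The `output_dir` value and the function names are hypothetical.

```python
import math

# Values taken from the list above; num_train_samples assumes no held-out split.
HPARAMS = {
    "num_train_samples": 100_000,
    "num_train_epochs": 3,
    "learning_rate": 1e-6,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 64,
}


def steps_per_epoch(h: dict = HPARAMS) -> int:
    """Optimizer steps in one epoch on a single device."""
    return math.ceil(h["num_train_samples"] / h["per_device_train_batch_size"])


def total_steps(h: dict = HPARAMS) -> int:
    """Optimizer steps over the whole run."""
    return steps_per_epoch(h) * h["num_train_epochs"]


def build_training_args(output_dir: str = "citation_classifier_run"):
    """Map the hyperparameters onto TrainingArguments (illustrative only)."""
    # Imported lazily so the arithmetic above runs without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,  # hypothetical path
        num_train_epochs=HPARAMS["num_train_epochs"],
        learning_rate=HPARAMS["learning_rate"],
        per_device_train_batch_size=HPARAMS["per_device_train_batch_size"],
        per_device_eval_batch_size=HPARAMS["per_device_eval_batch_size"],
        optim="adamw_torch",  # AdamW, as listed above
    )
```

Under these assumptions one epoch is 100,000 / 16 = 6,250 steps, so three epochs come to 18,750 steps.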
Usage
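A minimal inference sketch, assuming the checkpoint keeps the default label names (`LABEL_0`, `LABEL_1`) and has two output classes; the card only states that 1 means a citation is needed and 0 means it is not. The helpers `needs_citation` and `classify` and the `threshold` parameter are hypothetical. Because the model is gated, downloading it likely requires `hf auth login` first.

```python
def needs_citation(prediction: dict, threshold: float = 0.5) -> bool:
    """Interpret one pipeline result such as {"label": "LABEL_1", "score": 0.93}."""
    # Assumption: the positive class is named "LABEL_1" (default naming) or "1".
    is_positive = prediction["label"] in ("LABEL_1", "1")
    # The pipeline reports the score of the predicted label, so convert it
    # to the probability of the positive class (valid for two classes).
    p_positive = prediction["score"] if is_positive else 1.0 - prediction["score"]
    return p_positive >= threshold


def classify(sentences: list) -> list:
    """Return one boolean per sentence: True if a citation seems to be needed."""
    # Imported lazily; calling this function downloads the (gated) model.
    from transformers import pipeline

    pipe = pipeline("text-classification", model="vgainullin/citation_classifier")
    return [needs_citation(p) for p in pipe(sentences)]


# Example call (requires network access and an authorized token):
# flags = classify(["Prior studies have linked this gene to tumor growth."])
```

Raising `threshold` trades recall for precision when only high-confidence citation candidates are wanted.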
Citation
If you use this model, please cite:
Note: this is a gated model. Before downloading it, log in with a Hugging Face token that has gated access permission:
hf auth login