---
language: en
license: mit
tags:
- text-classification
- distilbert
- biomedical
- citation-detection
- scientific-text
datasets:
- vgainullin/xciting_data
metrics:
- accuracy
- f1
pipeline_tag: text-classification
---

# Citation Classifier

A DistilBERT-based binary classifier that identifies sentences in biomedical text that require citations.

## Model Description

This model takes a sentence from a scientific/biomedical article and predicts whether it should contain a citation (1) or not (0). It is a key component of the [pubciter](https://github.com/vgainullin/pubciter) pipeline for automated citation generation.

- **Base model:** distilbert-base-uncased
- **Task:** Binary text classification
- **Domain:** Biomedical / scientific literature

## Variants

- **coteaching/**: Trained with a co-teaching strategy for noise-robust learning
- **self_filtering/**: Trained with self-filtering to reduce label noise
- **last-checkpoint/**: Final checkpoint of standard training
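
For readers unfamiliar with co-teaching, the core idea is that two networks train side by side and each is updated only on the samples its peer assigns a small loss to, on the assumption that small-loss samples are more likely to be correctly labeled. The sketch below illustrates that selection step in plain Python. It is not taken from this model's training code: the function names and the `keep_ratio` parameter are hypothetical, and the actual schedule used for the `coteaching/` variant is not documented here.

```python
from typing import List, Sequence, Tuple


def small_loss_indices(losses: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the keep_ratio fraction of samples with the smallest loss."""
    num_keep = max(1, int(len(losses) * keep_ratio))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:num_keep])


def co_teaching_exchange(
    losses_a: Sequence[float],
    losses_b: Sequence[float],
    keep_ratio: float,
) -> Tuple[List[int], List[int]]:
    """Return (indices used to update network A, indices used to update network B).

    Each network is updated on the samples its peer considers likely clean.
    """
    update_a = small_loss_indices(losses_b, keep_ratio)
    update_b = small_loss_indices(losses_a, keep_ratio)
    return update_a, update_b


# Illustrative per-sample losses for one mini-batch of four sentences (made-up values).
batch_losses_a = [0.1, 2.0, 0.3, 0.2]
batch_losses_b = [0.2, 0.1, 1.5, 0.4]
update_a, update_b = co_teaching_exchange(batch_losses_a, batch_losses_b, keep_ratio=0.5)
```

In a real training loop the selected indices would be used to mask the per-sample loss before the backward pass of each network.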

## Training

- **Dataset:** [vgainullin/xciting_data](https://huggingface.co/datasets/vgainullin/xciting_data), PubMed sentences annotated for citation presence
- **Samples:** 100k, balanced (50k cited, 50k uncited)
- **Epochs:** 3
- **Learning rate:** 1e-6
- **Batch size:** 16 (train), 64 (eval)
- **Optimizer:** AdamW
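
The hyperparameters above could be expressed roughly as follows. This is a minimal sketch that assumes training went through the Transformers `Trainer` API, which this card does not state; the `HPARAMS` dictionary, the helper names, and the output directory are illustrative. The step count additionally assumes that all 100k samples were used for training on a single device with no gradient accumulation.

```python
import math

# Values copied from the training list above; the key names are illustrative.
HPARAMS = {
    "num_train_epochs": 3,
    "learning_rate": 1e-6,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 64,
    "optim": "adamw_torch",  # AdamW
}


def total_optimizer_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_samples / batch_size) * epochs


def build_training_arguments(output_dir: str):
    """Build TrainingArguments from HPARAMS (assumes the Trainer API was used)."""
    # Imported lazily so the arithmetic helper works without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HPARAMS)


# Assumption: the full 100k balanced sample set is the training set.
steps = total_optimizer_steps(
    100_000,
    HPARAMS["per_device_train_batch_size"],
    HPARAMS["num_train_epochs"],
)
```

Under those assumptions each epoch is 6,250 steps, so the full run is 18,750 optimizer steps.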

## Usage
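
A minimal inference sketch with the Transformers library is shown below. It assumes that each variant directory (for example `coteaching/`) contains both the model weights and the tokenizer files, and that class index 1 means "should contain a citation" as described under Model Description. The repository id is a placeholder, and the function names and the 0.5 threshold are illustrative rather than part of any published API.

```python
from typing import List, Sequence, Tuple


def load_variant(repo_id: str, variant: str):
    """Load tokenizer and model from one variant subfolder (assumed layout)."""
    # Imported lazily so flag_sentences can be used without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=variant)
    model = AutoModelForSequenceClassification.from_pretrained(repo_id, subfolder=variant)
    model.eval()
    return tokenizer, model


def citation_probabilities(tokenizer, model, sentences: Sequence[str]) -> List[float]:
    """Return the probability of class 1 (citation needed) for each sentence."""
    import torch

    inputs = tokenizer(list(sentences), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[:, 1].tolist()


def flag_sentences(
    sentences: Sequence[str],
    probabilities: Sequence[float],
    threshold: float = 0.5,
) -> List[Tuple[str, bool]]:
    """Pair each sentence with True if its class-1 probability reaches the threshold."""
    return [(s, p >= threshold) for s, p in zip(sentences, probabilities)]


# Example (requires `pip install transformers torch` and access to the weights):
# tokenizer, model = load_variant("path/or/repo-id", "coteaching")  # placeholder id
# sentences = [
#     "Aspirin reduces the risk of myocardial infarction.",
#     "We thank the reviewers for their comments.",
# ]
# probs = citation_probabilities(tokenizer, model, sentences)
# print(flag_sentences(sentences, probs))
```

Raising the threshold above 0.5 trades recall for precision, which may be preferable when flagged sentences feed an automated citation step.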

## Citation

If you use this model, please cite: