license: apache-2.0
datasets:
  - ade-benchmark-corpus/ade_corpus_v2
language:
  - en
base_model:
  - dmis-lab/biobert-base-cased-v1.2
pipeline_tag: text-classification
tags:
  - biomedical
  - nlp
  - adverse-drug-effects
  - bert
  - biobert

BioBERT for Adverse Drug Effect (ADE) Classification

This model is a fine-tuned version of dmis-lab/biobert-base-cased-v1.2 for binary sentence classification: does a given sentence describe an adverse drug effect (ADE)? It was fine-tuned on the ADE Corpus V2 dataset and compared against a TF-IDF + Logistic Regression baseline as part of a broader project benchmarking classical and transformer approaches on imbalanced biomedical text.

Project Repo: GitHub

Results (Test Set: N=3,528)

Model	Weighted F1	ADE Class F1	Accuracy	Total Errors
TF-IDF + Logistic Regression	0.90	0.84	90%	349
BioBERT (this model)	0.96	0.93	96%	145

BioBERT reduced misclassifications by 58% (349 → 145 errors) compared to the classical baseline.
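
The headline figures can be re-derived from the error counts alone. The short check below is a sketch that assumes "Total Errors" counts misclassified test sentences, so that accuracy equals 1 minus errors divided by N; the variable names are illustrative.

```python
# Figures copied from the results table above.
n_test = 3528
errors = {"TF-IDF + Logistic Regression": 349, "BioBERT": 145}

# Assumption: every test sentence is either correct or an error,
# so accuracy = 1 - errors / n_test.
accuracy = {name: 1 - n_err / n_test for name, n_err in errors.items()}
for name, acc in accuracy.items():
    print(f"{name}: {acc:.1%}")

# Relative error reduction of BioBERT over the baseline.
baseline = errors["TF-IDF + Logistic Regression"]
reduction = (baseline - errors["BioBERT"]) / baseline
print(f"Error reduction: {reduction:.1%}")
```

Under that assumption the computed accuracies round to the 90% and 96% reported in the table, and the reduction of 204 out of 349 errors rounds to the quoted 58%.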

Training Details

Base model: dmis-lab/biobert-base-cased-v1.2 (110M parameters)
Epochs: 3 (best checkpoint selected by validation F1)
Learning rate: 2e-5
Batch size: 16
Max sequence length: 128
Precision: fp16
Data split: stratified 70/15/15 train/val/test (seed=42)
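
To illustrate what a stratified 70/15/15 split with a fixed seed means, here is a minimal standard-library sketch. It is not the project's actual splitting code (which is not shown here); the function name `stratified_split` and the toy labels are hypothetical. The idea is to shuffle each class separately and cut it at the same fractions, so every split keeps the original class ratio.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy labels with a 30% positive rate (hypothetical, not the real corpus).
labels = [1] * 300 + [0] * 700
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Each of the three index lists contains positives at the same 30% rate as the toy input, which is the property that matters for an imbalanced dataset.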

Epoch	Train Loss	Val F1	Val Accuracy
1	0.175	0.943	0.943
2	0.114	0.952	0.952
3	0.043	0.952	0.952

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "scheun/biobert-ade-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Truncate to the 128-token maximum sequence length used during fine-tuning.
inputs = tokenizer(
    "Patient developed severe nausea after taking the medication.",
    return_tensors="pt", truncation=True, max_length=128,
)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()
print(prediction)  # 0 = not ADE, 1 = ADE
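
Taking the argmax discards how confident the model is. The sketch below shows, in plain Python, how a pair of logits maps to class probabilities via softmax and how a decision threshold could be applied. The logit values and the 0.5 threshold are hypothetical and not taken from this model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the order [not ADE, ADE].
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)

ade_probability = probs[1]
threshold = 0.5  # assumed cut-off; could be tuned on validation data
is_ade = ade_probability >= threshold
print(f"P(ADE) = {ade_probability:.3f}, flagged = {is_ade}")
```

With the tensors from the usage snippet, `torch.softmax(outputs.logits, dim=-1)` computes the same quantities. A threshold of 0.5 reproduces the argmax decision for two classes; a different value trades precision against recall on the ADE class.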

Limitations

Trained on MEDLINE case report sentences. Performance may vary on other text domains.
Binary classification only. It does not extract which drug or which effect is mentioned.

References

Gurulingappa et al. (2012), Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports
Lee et al. (2020), BioBERT: a pre-trained biomedical language representation model for biomedical text mining