
	---
	license: apache-2.0
	datasets:
	- ade-benchmark-corpus/ade_corpus_v2
	language:
	- en
	base_model:
	- dmis-lab/biobert-base-cased-v1.2
	pipeline_tag: text-classification
	tags:
	- biomedical
	- nlp
	- adverse-drug-effects
	- bert
	- biobert
	---

	# BioBERT for Adverse Drug Effect (ADE) Classification

	This model is a fine-tuned version of [`dmis-lab/biobert-base-cased-v1.2`](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2) for binary sentence classification: given a sentence, it predicts whether the sentence describes an adverse drug effect (ADE).
	It was fine-tuned on the [ADE Corpus V2](https://huggingface.co/datasets/ade-benchmark-corpus/ade_corpus_v2) dataset and compared against a classical TF-IDF + Logistic Regression baseline as part of a broader project benchmarking classical vs. transformer approaches on imbalanced biomedical text.

	Project Repo: [GitHub](https://github.com/steven-cheun/nlp-ade-classification)

	## Results (Test Set: N=3,528)

	| Model | Weighted F1 | ADE Class F1 | Accuracy | Total Errors |
	|---|---|---|---|---|
	| TF-IDF + Logistic Regression | 0.90 | 0.84 | 90% | 349 |
	| BioBERT (this model) | 0.96 | 0.93 | 96% | 145 |

	BioBERT reduced misclassifications by 58% (349 → 145 errors) compared to the classical baseline.
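
The accuracy and error-reduction figures follow directly from the error counts and the test set size. The short check below recomputes them; the function and variable names are illustrative and not taken from the project code.

```python
# Figures taken from the results table above.
TEST_SIZE = 3528
ERRORS = {"tfidf_logreg": 349, "biobert": 145}


def accuracy(errors: int, total: int) -> float:
    """Fraction of test sentences classified correctly."""
    return 1 - errors / total


def relative_error_reduction(baseline: int, improved: int) -> float:
    """Share of the baseline's errors that the improved model removes."""
    return (baseline - improved) / baseline


for name, errors in ERRORS.items():
    print(f"{name}: accuracy = {accuracy(errors, TEST_SIZE):.1%}")
# tfidf_logreg: accuracy = 90.1%
# biobert: accuracy = 95.9%

reduction = relative_error_reduction(ERRORS["tfidf_logreg"], ERRORS["biobert"])
print(f"relative error reduction = {reduction:.1%}")
# relative error reduction = 58.5%
```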

	## Training Details

	- Base model: `dmis-lab/biobert-base-cased-v1.2` (110M parameters)
	- Epochs: 3 (best checkpoint selected by validation F1)
	- Learning rate: 2e-5
	- Batch size: 16
	- Max sequence length: 128
	- Precision: fp16
	- Data split: stratified 70/15/15 train/val/test (seed=42)

	| Epoch | Train Loss | Val F1 | Val Accuracy |
	|---|---|---|---|
	| 1 | 0.175 | 0.943 | 0.943 |
	| 2 | 0.114 | 0.952 | 0.952 |
	| 3 | 0.043 | 0.952 | 0.952 |
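
A stratified 70/15/15 split keeps the ADE/non-ADE ratio the same in each partition, which matters on an imbalanced corpus. The sketch below illustrates the idea with the standard library only. It is not the project's actual splitting code (which may well use a library helper instead), and `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy labels (not the real corpus): 700 negatives, 300 positives.
labels = [0] * 700 + [1] * 300
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

Each partition of the toy data keeps the 30% positive rate of the full label list.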

	## Usage

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	model_id = "scheun/biobert-ade-classifier"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSequenceClassification.from_pretrained(model_id)

	# Truncate to the 128-token limit used during fine-tuning.
	text = "Patient developed severe nausea after taking the medication."
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

	with torch.no_grad():  # gradients are not needed for inference
	    outputs = model(**inputs)

	prediction = outputs.logits.argmax(-1).item()
	print(prediction)  # 0 = not ADE, 1 = ADE
	```
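
The snippet above returns only the predicted class. If a confidence score is also useful, the two logits can be converted to probabilities with a softmax. The sketch below does this in plain Python on made-up logits. With PyTorch tensors, `outputs.logits.softmax(dim=-1)` would play the same role. The logit values and `THRESHOLD` are assumptions for illustration, not model outputs or a recommended setting.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence, ordered [not ADE, ADE].
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)
p_ade = probs[1]

# Assumed decision threshold; for two classes, 0.5 matches the argmax rule.
THRESHOLD = 0.5
label = int(p_ade >= THRESHOLD)
print(f"P(ADE) = {p_ade:.3f}, label = {label}")
```

Raising the threshold trades recall on the ADE class for precision, and lowering it does the reverse. Any such change should be checked on validation data.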

	## Limitations

	- Trained on MEDLINE case report sentences. Performance may vary on other text domains.
	- Binary classification only. It does not extract which drug or which effect is mentioned.

	## References

	- Gurulingappa et al. (2012), Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports
	- Lee et al. (2020), BioBERT: a pre-trained biomedical language representation model for biomedical text mining