---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- biomedical
- clinical
- variant-classification
- genetics
- bert
- fine-tuned
base_model: dmis-lab/biobert-large-cased-v1.1
datasets:
- clinvar
pipeline_tag: text-classification
---
# ClinVarBERT
A **BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification**, built upon **BioBERT-Large**.
ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.
---
## 🧬 Model Details
### Model Description
**ClinVarBERT-Large** is a domain-specific transformer model fine-tuned from **BioBERT-Large** for **genetic variant interpretation**.
It is trained to capture linguistic patterns in **ClinVar submissions** and related clinical genetics texts in order to classify variant pathogenicity from the wording of an interpretation.
- **Model Type:** BERT-based transformer for sequence classification
- **Languages:** English (biomedical / clinical domain)
- **License:** Apache 2.0
- **Fine-tuned From:** [dmis-lab/biobert-large-cased-v1.1](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
- **Training Data:** Curated ClinVar submission texts describing genetic variants and their clinical interpretations
### Model Sources
- **Repository:** [Your GitHub Repository or Project Page]
- **Base Model:** [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
- **Dataset:** [ClinVar Database](https://www.ncbi.nlm.nih.gov/clinvar/)
---
## 🚀 Uses
### Direct Use
ClinVarBERT can be directly applied to:
- **Variant pathogenicity classification:** Classify genetic variants as *Pathogenic/Likely Pathogenic (P/LP)*, *Benign/Likely Benign (B/LB)*, or *Variant of Uncertain Significance (VUS)*.
- **Clinical interpretation mining:** Analyze and categorize textual variant interpretations from clinical databases or research reports.
- **Biomedical NLP tasks:** Serve as a domain-specific encoder for text classification in clinical genetics.
## Label Mapping
| Class ID | Label | Description |
|-----------|--------|-------------|
| 0 | **P/LP** | Pathogenic or Likely Pathogenic |
| 1 | **VUS** | Variant of Uncertain Significance |
| 2 | **B/LB** | Benign or Likely Benign |
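
The table above can be written down as a lookup and used to sanity-check a loaded checkpoint. The sketch below is a minimal illustration, not part of the released code: `EXPECTED_ID2LABEL` and `mapping_mismatches` are hypothetical names, and the card does not state whether the checkpoint's `config.id2label` carries these names or generic ones such as `LABEL_0`. If the model is loaded as in the Quick Start, passing `model.config.id2label` to `mapping_mismatches` should return an empty dict when the config agrees with the table.

```python
# Mapping copied from the Label Mapping table (assumed to match the checkpoint).
EXPECTED_ID2LABEL = {0: "P/LP", 1: "VUS", 2: "B/LB"}

# Reverse lookup, handy when preparing labelled data for evaluation.
LABEL2ID = {label: class_id for class_id, label in EXPECTED_ID2LABEL.items()}


def mapping_mismatches(id2label):
    """Return {class_id: (expected, found)} for entries that differ from the table.

    Keys are coerced to int because a raw config.json stores them as strings.
    """
    found = {int(key): value for key, value in id2label.items()}
    return {
        class_id: (expected, found.get(class_id))
        for class_id, expected in EXPECTED_ID2LABEL.items()
        if found.get(class_id) != expected
    }
```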
---
## ⚡ Quick Start
### Option 1: Use via Hugging Face Pipeline
```python
from transformers import pipeline
# Load the pipeline
pipe = pipeline("text-classification", model="weijiang99/clinvarbert")
# Example text
text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."
# Predict
result = pipe(text)
print(result)
```
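
The pipeline returns a list of dictionaries with `label` and `score` keys. The helper below is a small sketch for pulling out the highest-scoring entry; `top_prediction` is a hypothetical name, not part of `transformers`. It also tolerates a nested list, which some library versions produce when scores for all classes are requested, so treat that branch as a defensive assumption rather than documented behavior of this model.

```python
def top_prediction(results):
    """Return (label, score) for the highest-scoring entry in pipeline output."""
    # Some configurations wrap the per-class scores in an extra list; unwrap it.
    if results and isinstance(results[0], list):
        results = results[0]
    if not results:
        raise ValueError("empty pipeline output")
    best = max(results, key=lambda entry: entry["score"])
    return best["label"], best["score"]
```

With the `result` variable from the example above, `label, score = top_prediction(result)` would give the predicted class and its score.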
### Option 2: Manual Inference
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")
# Input text
text = "This variant was reported as likely benign in multiple submissions."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
# Inference (gradients are not needed for prediction)
with torch.no_grad():
    outputs = model(**inputs)
# Convert logits to class probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class_id = torch.argmax(probs, dim=-1).item()
predicted_label = model.config.id2label[predicted_class_id]
print(f"Predicted label: {predicted_label}")
print(f"Probabilities: {probs}")
```
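
The manual example prints the probabilities as a raw tensor, which does not show which value belongs to which class. A minimal sketch for pairing each probability with its label name follows; `label_probabilities` is a hypothetical helper, and it assumes the probability row is ordered by class ID, as softmax over the logits produces.

```python
def label_probabilities(prob_row, id2label):
    """Return {label: probability} for one row of class probabilities."""
    if len(prob_row) != len(id2label):
        raise ValueError("probability row and label mapping differ in length")
    # Keys may be ints or strings depending on where the mapping came from.
    names = {int(key): value for key, value in id2label.items()}
    return {names[i]: float(p) for i, p in enumerate(prob_row)}
```

With the variables from Option 2, the call would be `label_probabilities(probs[0].tolist(), model.config.id2label)`.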