ClinVarBERT
A BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification, built upon BioBERT-Large.
ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.
🧬 Model Details
Model Description
ClinVarBERT-Large is a domain-specific transformer model fine-tuned from BioBERT-Large for the task of genetic variant interpretation.
It is trained to capture subtle linguistic patterns in ClinVar submissions and related clinical genetics texts, which it uses to classify variant pathogenicity.
- Model Type: BERT-based transformer for sequence classification
- Languages: English (biomedical / clinical domain)
- License: Apache 2.0
- Fine-tuned From: dmis-lab/biobert-large-cased-v1.1
- Training Data: Curated ClinVar submission texts describing genetic variants and their clinical interpretations
Model Sources
- Repository: [Your GitHub Repository or Project Page]
- Base Model: BioBERT-Large
- Dataset: ClinVar Database
Uses
Direct Use
ClinVarBERT can be directly applied to:
- Variant pathogenicity classification: Classify text describing a genetic variant as Pathogenic/Likely Pathogenic (P/LP), Benign/Likely Benign (B/LB), or Variant of Uncertain Significance (VUS).
- Clinical interpretation mining: Analyze and categorize textual variant interpretations from clinical databases or research reports.
- Biomedical NLP tasks: Serve as a strong domain-specific encoder for clinical genetics-related text classification.
Label Mapping
| Class ID | Label | Description |
|---|---|---|
| 0 | P/LP | Pathogenic or Likely Pathogenic |
| 1 | VUS | Variant of Uncertain Significance |
| 2 | B/LB | Benign or Likely Benign |
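The table above can be carried into code as a small lookup. The sketch below is illustrative: `normalize_label` is a hypothetical helper, and the `LABEL_<n>` form it accepts is the generic Transformers default for configs without custom label names. Whether this checkpoint's config uses that form or the short labels is an assumption worth checking against `model.config.id2label`.

```python
# Label table from this model card: class ID -> (short label, description).
LABELS = {
    0: ("P/LP", "Pathogenic or Likely Pathogenic"),
    1: ("VUS", "Variant of Uncertain Significance"),
    2: ("B/LB", "Benign or Likely Benign"),
}


def normalize_label(raw):
    """Return the short label for a class ID, a 'LABEL_<n>' string, or a short label.

    Hypothetical helper, not part of the model or of Transformers.
    """
    if isinstance(raw, int):
        return LABELS[raw][0]
    if raw.startswith("LABEL_"):
        # Assumed fallback naming: 'LABEL_0', 'LABEL_1', ...
        return LABELS[int(raw.split("_", 1)[1])][0]
    known = {short for short, _ in LABELS.values()}
    if raw in known:
        return raw
    raise ValueError(f"Unrecognized label: {raw!r}")


print(normalize_label(0))          # P/LP
print(normalize_label("LABEL_2"))  # B/LB
print(normalize_label("VUS"))      # VUS
```

Normalizing in one place keeps downstream code independent of how the checkpoint happens to name its classes.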
⚡ Quick Start
Option 1: Use via Hugging Face Pipeline
from transformers import pipeline
# Load the pipeline
pipe = pipeline("text-classification", model="weijiang99/clinvarbert")
# Example text
text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."
# Predict
result = pipe(text)
print(result)
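A text-classification pipeline returns a list of dictionaries with `label` and `score` keys, one per input. As a minimal sketch of how that output might be consumed, the hypothetical `triage` helper below separates predictions that clear a score cutoff from those that may deserve manual review. The `min_score` value of 0.8 is an arbitrary illustration, not a validated threshold, and the example scores are stand-ins rather than real model output.

```python
def triage(predictions, min_score=0.8):
    """Split pipeline-style predictions into confident and needs-review lists.

    `predictions` is a list of {"label": str, "score": float} dicts, the shape
    a text-classification pipeline returns for a single input or a batch.
    `min_score` is an illustrative cutoff (assumption), not a clinical threshold.
    """
    confident, needs_review = [], []
    for pred in predictions:
        target = confident if pred["score"] >= min_score else needs_review
        target.append(pred)
    return confident, needs_review


# Stand-in values for illustration only; real scores come from pipe(text).
example = [
    {"label": "P/LP", "score": 0.93},
    {"label": "VUS", "score": 0.55},
]
confident, needs_review = triage(example)
print(confident)     # predictions at or above the cutoff
print(needs_review)  # predictions below the cutoff
```

In practice, `triage(pipe(texts))` could be called with a list of variant descriptions, with the cutoff chosen from a held-out evaluation set.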
Option 2: Manual Inference
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")
# Input text
text = "This variant was reported as likely benign in multiple submissions."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
# Inference
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class_id = torch.argmax(probs, dim=-1).item()
predicted_label = model.config.id2label[predicted_class_id]
print(f"Predicted label: {predicted_label}")
print(f"Probabilities: {probs}")
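To make the softmax and argmax steps above concrete, here is a dependency-free sketch of the same arithmetic on a plain list of logits. The logits are made up for illustration, `predict` is a hypothetical helper, and the `id2label` dictionary simply mirrors the label table in this card rather than being read from the checkpoint.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, id2label):
    """Return (label, probabilities) for one example's logits."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return id2label[class_id], probs


# Made-up logits for illustration; real ones come from outputs.logits.
logits = [0.2, 0.1, 2.5]
id2label = {0: "P/LP", 1: "VUS", 2: "B/LB"}  # assumed to match the card's table
label, probs = predict(logits, id2label)
print(label)  # B/LB
print([round(p, 3) for p in probs])
```

The probabilities sum to 1, so the largest one can be read as the model's relative preference among the three classes, not as a calibrated clinical likelihood.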