ClinVarBERT

A BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification, built upon BioBERT-Large.
ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.

🧬 Model Details

Model Description

ClinVarBERT is a domain-specific transformer model fine-tuned from BioBERT-Large for the task of genetic variant interpretation.
It is trained to capture subtle linguistic patterns in ClinVar submissions and related clinical genetics texts so that it can classify variant pathogenicity from the text of an interpretation.

Model Type: BERT-based transformer for sequence classification
Languages: English (biomedical / clinical domain)
License: Apache 2.0
Fine-tuned From: dmis-lab/biobert-large-cased-v1.1
Training Data: Curated ClinVar submission texts describing genetic variants and their clinical interpretations

Model Sources

Repository: [Your GitHub Repository or Project Page]
Base Model: BioBERT-Large
Dataset: ClinVar Database

🚀 Uses

Direct Use

ClinVarBERT can be directly applied to:

Variant pathogenicity classification: Classify the text describing a genetic variant as Pathogenic/Likely Pathogenic (P/LP), Benign/Likely Benign (B/LB), or Variant of Uncertain Significance (VUS).
Clinical interpretation mining: Analyze and categorize textual variant interpretations from clinical databases or research reports.
Biomedical NLP tasks: Serve as a strong domain-specific encoder for clinical genetics-related text classification.

Label Mapping

Class ID	Label	Description
0	P/LP	Pathogenic or Likely Pathogenic
1	VUS	Variant of Uncertain Significance
2	B/LB	Benign or Likely Benign
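
The table above can be mirrored in code when predictions need to be mapped back to readable labels. The sketch below is illustrative only: `ID2LABEL`, `LABEL2ID`, and `normalize_label` are hypothetical helper names, and it assumes that a checkpoint without custom label names would report the transformers default names `LABEL_0`, `LABEL_1`, and so on. If `model.config.id2label` already contains the labels from the table, the helper simply passes them through.

```python
# Class IDs and labels copied from the Label Mapping table.
ID2LABEL = {0: "P/LP", 1: "VUS", 2: "B/LB"}
LABEL2ID = {label: class_id for class_id, label in ID2LABEL.items()}


def normalize_label(raw_label: str) -> str:
    """Map a raw model label to the label used in the table.

    Assumption: a checkpoint without custom label names reports
    "LABEL_0", "LABEL_1", ... (the transformers default). Labels that
    already match the table are returned unchanged.
    """
    if raw_label in LABEL2ID:
        return raw_label
    prefix = "LABEL_"
    if raw_label.startswith(prefix) and raw_label[len(prefix):].isdigit():
        class_id = int(raw_label[len(prefix):])
        if class_id in ID2LABEL:
            return ID2LABEL[class_id]
    raise ValueError(f"Unrecognized label: {raw_label!r}")


print(normalize_label("LABEL_0"))
print(normalize_label("VUS"))
```

Raising on an unknown label, rather than guessing, keeps a mismatch between the checkpoint and the table visible.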

⚡ Quick Start

Option 1: Use via Hugging Face Pipeline

from transformers import pipeline

# Load the pipeline
pipe = pipeline("text-classification", model="weijiang99/clinvarbert")

# Example text
text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."

# Predict
result = pipe(text)
print(result)
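
By default the pipeline reports only the top-scoring label. Passing `top_k=None` asks it for the score of every class, which helps when the margin between classes matters. The sketch below shows one way to post-process that output; `scores_by_label` is a hypothetical helper, the exact nesting of the pipeline's return value has varied across transformers versions, and the scores are made-up placeholders that only illustrate the shape.

```python
from typing import Dict, List


def scores_by_label(class_scores: List[Dict[str, object]]) -> Dict[str, float]:
    """Turn one input's list of {"label", "score"} dicts into a
    label -> score mapping, ordered from highest to lowest score."""
    ranked = sorted(class_scores, key=lambda item: item["score"], reverse=True)
    return {str(item["label"]): float(item["score"]) for item in ranked}


# With the real pipeline, the input would come from a call such as:
#   all_scores = pipe([text], top_k=None)[0]
# The values below are made-up placeholders that only show the shape;
# they are not actual model output.
placeholder_scores = [
    {"label": "P/LP", "score": 0.2},
    {"label": "VUS", "score": 0.7},
    {"label": "B/LB", "score": 0.1},
]

ranked = scores_by_label(placeholder_scores)
top_label = next(iter(ranked))  # first key is the highest-scoring label
print(top_label, ranked)
```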

Option 2: Manual Inference

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")

# Input text
text = "This variant was reported as likely benign in multiple submissions."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = torch.argmax(probs, dim=-1).item()
    predicted_label = model.config.id2label[predicted_class_id]

print(f"Predicted label: {predicted_label}")
print(f"Probabilities: {probs}")
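
The softmax and argmax steps in Option 2 can be checked by hand without loading the model. The following is a minimal pure-Python sketch of the same arithmetic; the logits are made-up values chosen for illustration, not output from ClinVarBERT, and `softmax` here is a local helper rather than the PyTorch function.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the three classes (index = class ID); not model output.
example_logits = [0.5, 2.0, -1.0]

probs = softmax(example_logits)
# argmax: the index of the largest probability is the predicted class ID
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class_id, [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the logits, the predicted class is the same whether argmax is applied to the logits or to the probabilities; the probabilities are only needed when the scores themselves are reported.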
