ClinVarBERT

A BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification, built upon BioBERT-Large.
ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.


🧬 Model Details

Model Description

ClinVarBERT-Large is a domain-specific transformer model fine-tuned from BioBERT-Large for genetic variant interpretation.
It is trained to capture linguistic patterns in ClinVar submissions and related clinical genetics texts so that it can classify variant pathogenicity from free-text descriptions.

  • Model Type: BERT-based transformer for sequence classification
  • Languages: English (biomedical / clinical domain)
  • License: Apache 2.0
  • Fine-tuned From: dmis-lab/biobert-large-cased-v1.1
  • Training Data: Curated ClinVar submission texts describing genetic variants and their clinical interpretations

Model Sources


🚀 Uses

Direct Use

ClinVarBERT can be directly applied to:

  • Variant pathogenicity classification: Classify textual descriptions of genetic variants as Pathogenic/Likely Pathogenic (P/LP), Benign/Likely Benign (B/LB), or Variant of Uncertain Significance (VUS).
  • Clinical interpretation mining: Analyze and categorize textual variant interpretations from clinical databases or research reports.
  • Biomedical NLP tasks: Serve as a domain-specific encoder for text classification in clinical genetics.

Label Mapping

| Class ID | Label | Description |
|----------|-------|-------------|
| 0 | P/LP | Pathogenic or Likely Pathogenic |
| 1 | VUS | Variant of Uncertain Significance |
| 2 | B/LB | Benign or Likely Benign |
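
The mapping above can be written down as a small lookup, which is handy when post-processing raw class IDs. This is a minimal sketch: the names `ID2LABEL`, `DESCRIPTIONS`, `describe`, and `normalize_label` are hypothetical helpers, not part of the model. The card does not say whether the model config stores these label names or the generic `LABEL_0`/`LABEL_1`/`LABEL_2` names that Transformers uses when none are set, so `normalize_label` assumes either may appear.

```python
# Label mapping copied from the table in this card.
ID2LABEL = {0: "P/LP", 1: "VUS", 2: "B/LB"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

DESCRIPTIONS = {
    "P/LP": "Pathogenic or Likely Pathogenic",
    "VUS": "Variant of Uncertain Significance",
    "B/LB": "Benign or Likely Benign",
}


def describe(class_id: int) -> str:
    """Return a human-readable string for a class ID."""
    label = ID2LABEL[class_id]
    return f"{label}: {DESCRIPTIONS[label]}"


def normalize_label(raw_label: str) -> str:
    """Map a generic 'LABEL_<n>' name to the card's label.

    Assumption: if the config has no custom names, labels look like
    'LABEL_0'. Any other string is returned unchanged.
    """
    if raw_label.startswith("LABEL_"):
        return ID2LABEL[int(raw_label.split("_", 1)[1])]
    return raw_label
```

Comparing `model.config.id2label` against `ID2LABEL` after loading the model is a quick way to confirm which naming the checkpoint actually uses.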

⚡ Quick Start

Option 1: Use via Hugging Face Pipeline

```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline("text-classification", model="weijiang99/clinvarbert")

# Example text
text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."

# Predict
result = pipe(text)
print(result)
```
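
By default the text-classification pipeline returns only the top label. Passing `top_k=None` at call time asks it for the scores of all classes, which is useful when the margin between P/LP, VUS, and B/LB matters. The sketch below is illustrative: `rank_scores` and `load_and_score` are hypothetical helper names, and because the nesting of the pipeline output has varied across Transformers versions, `rank_scores` accepts both a flat list of dicts and a singly nested list.

```python
def rank_scores(pipeline_output):
    """Return (label, score) pairs sorted from highest to lowest score.

    Accepts either [{'label': ..., 'score': ...}, ...] or the same list
    wrapped in one extra outer list.
    """
    rows = pipeline_output
    if rows and isinstance(rows[0], list):
        rows = rows[0]
    pairs = [(row["label"], row["score"]) for row in rows]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def load_and_score(text, model_name="weijiang99/clinvarbert"):
    """Load the pipeline and return ranked scores for one text."""
    # Imported lazily so rank_scores can be used without transformers installed.
    from transformers import pipeline

    pipe = pipeline("text-classification", model=model_name)
    # top_k=None requests scores for every class instead of only the best one.
    return rank_scores(pipe(text, top_k=None))
```

For example, `load_and_score("This variant was reported as likely benign in multiple submissions.")` would return three `(label, score)` pairs, best first.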

Option 2: Manual Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")

# Input text
text = "This variant was reported as likely benign in multiple submissions."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = torch.argmax(probs, dim=-1).item()
    predicted_label = model.config.id2label[predicted_class_id]

print(f"Predicted label: {predicted_label}")
print(f"Probabilities: {probs}")
```
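
The manual inference example handles one text at a time. For many variant descriptions, the same steps can be run in batches. The following is a rough sketch rather than a tuned implementation: `chunks` and `classify_batch` are hypothetical helpers, and `batch_size=16` and `max_length=512` are assumed defaults (512 being the usual BERT input limit), not values documented for this model.

```python
from typing import Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_batch(texts: Sequence[str], tokenizer, model,
                   batch_size: int = 16, max_length: int = 512) -> List[str]:
    """Return one predicted label per input text.

    `tokenizer` and `model` are the objects loaded with
    AutoTokenizer / AutoModelForSequenceClassification.
    """
    import torch  # imported lazily so `chunks` works without torch

    model.eval()  # disable dropout for inference
    labels: List[str] = []
    for batch in chunks(texts, batch_size):
        inputs = tokenizer(list(batch), return_tensors="pt",
                           truncation=True, padding=True,
                           max_length=max_length)
        with torch.no_grad():
            logits = model(**inputs).logits
        # argmax over the class dimension gives one class ID per text
        for class_id in logits.argmax(dim=-1).tolist():
            labels.append(model.config.id2label[class_id])
    return labels
```

With the `tokenizer` and `model` loaded as in Option 2, `classify_batch(list_of_texts, tokenizer, model)` returns labels in the same order as the inputs.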

Model size: 0.1B params (Safetensors, tensor type F32)