---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- biomedical
- clinical
- variant-classification
- genetics
- bert
- fine-tuned
base_model: dmis-lab/biobert-large-cased-v1.1
datasets:
- clinvar
pipeline_tag: text-classification
---

# ClinVarBERT

A **BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification**, built upon **BioBERT-Large**.  
ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.

---

## 🧬 Model Details

### Model Description

**ClinVarBERT-Large** is a domain-specific transformer model fine-tuned from **BioBERT-Large** for the task of **genetic variant interpretation**.  
It is trained to capture linguistic patterns in **ClinVar submissions** and related clinical genetics texts, so that it can classify variant pathogenicity from free-text descriptions.

- **Model Type:** BERT-based transformer for sequence classification  
- **Languages:** English (biomedical / clinical domain)  
- **License:** Apache 2.0  
- **Fine-tuned From:** [dmis-lab/biobert-large-cased-v1.1](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)  
- **Training Data:** Curated ClinVar submission texts describing genetic variants and their clinical interpretations  

### Model Sources

- **Base Model:** [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)  
- **Dataset:** [ClinVar Database](https://www.ncbi.nlm.nih.gov/clinvar/)

---

## 🚀 Uses

### Direct Use

ClinVarBERT can be directly applied to:
- **Variant pathogenicity classification:** Classify genetic variants as *Pathogenic/Likely Pathogenic (P/LP)*, *Benign/Likely Benign (B/LB)*, or *Variant of Uncertain Significance (VUS)*.  
- **Clinical interpretation mining:** Analyze and categorize textual variant interpretations from clinical databases or research reports.  
- **Biomedical NLP tasks:** Serve as a domain-specific encoder for text classification in clinical genetics.

### Label Mapping

| Class ID | Label | Description |
|-----------|--------|-------------|
| 0 | **P/LP** | Pathogenic or Likely Pathogenic |
| 1 | **VUS** | Variant of Uncertain Significance |
| 2 | **B/LB** | Benign or Likely Benign |
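
The mapping above can be applied to raw classifier logits as in the minimal sketch below. It assumes the class IDs in the table match the checkpoint's `model.config.id2label`, which remains the authoritative source. The names `ID2LABEL`, `LABEL2ID`, `softmax`, and `decode` are illustrative helpers rather than part of the model's API, and the example logits are made up.

```python
import math

# Class IDs and labels copied from the table above (assumed to match
# model.config.id2label in the released checkpoint).
ID2LABEL = {0: "P/LP", 1: "VUS", 2: "B/LB"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[class_id], probs[class_id]


# Hypothetical logits for illustration only, not real model output.
example_logits = [2.0, 0.5, -1.0]
label, prob = decode(example_logits)
print(f"{label} ({prob:.3f})")
```

In practice, reading `model.config.id2label` at runtime is safer than hard-coding the dictionary, since it stays correct if the checkpoint's label order ever differs from the table.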

---

## ⚡ Quick Start

### Option 1: Use via Hugging Face Pipeline

```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline("text-classification", model="weijiang99/clinvarbert")

# Example text
text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."

# Predict
result = pipe(text)
print(result)
```

### Option 2: Manual Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")

# Input text
text = "This variant was reported as likely benign in multiple submissions."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = torch.argmax(probs, dim=-1).item()
    predicted_label = model.config.id2label[predicted_class_id]

print(f"Predicted label: {predicted_label}")
print(f"Probabilities: {probs}")
```