---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- biomedical
- clinical
- variant-classification
- genetics
- bert
- fine-tuned
base_model: dmis-lab/biobert-large-cased-v1.1
datasets:
- clinvar
pipeline_tag: text-classification
---

# ClinVarBERT

A **BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification**, built upon **BioBERT-Large**.  
ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.

---

## 🧬 Model Details

### Model Description

**ClinVarBERT-Large** is a domain-specific transformer model fine-tuned from **BioBERT-Large** for the task of **genetic variant interpretation**.  
It is trained to capture subtle linguistic patterns in **ClinVar submissions** and related clinical genetics texts, enabling accurate classification of variant pathogenicity.

- **Model Type:** BERT-based transformer for sequence classification  
- **Languages:** English (biomedical / clinical domain)  
- **License:** Apache 2.0  
- **Fine-tuned From:** [dmis-lab/biobert-large-cased-v1.1](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)  
- **Training Data:** Curated ClinVar submission texts describing genetic variants and their clinical interpretations  

### Model Sources

- **Repository:** [Your GitHub Repository or Project Page]  
- **Base Model:** [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)  
- **Dataset:** [ClinVar Database](https://www.ncbi.nlm.nih.gov/clinvar/)

---

## 🚀 Uses

### Direct Use

ClinVarBERT can be directly applied to:
- **Variant pathogenicity classification:** Classify genetic variants as *Pathogenic/Likely Pathogenic (P/LP)*, *Benign/Likely Benign (B/LB)*, or *Variant of Uncertain Significance (VUS)*.  
- **Clinical interpretation mining:** Analyze and categorize textual variant interpretations from clinical databases or research reports.  
- **Biomedical NLP tasks:** Serve as a strong domain-specific encoder for clinical genetics-related text classification (see the feature-extraction sketch below).
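
For the encoder use case, the model can also be loaded without its classification head to produce sentence-level features. A minimal sketch (the example sentence is illustrative; `AutoModel` loads only the encoder and ignores the fine-tuned classification head):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the fine-tuned encoder only; the classification head is discarded.
tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
encoder = AutoModel.from_pretrained("weijiang99/clinvarbert")

text = "Heterozygous frameshift variant predicted to result in a truncated protein."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

cls_embedding = hidden[:, 0, :]  # [CLS] vector as a simple sentence-level feature
print(cls_embedding.shape)
```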

## Label Mapping

| Class ID | Label | Description |
|-----------|--------|-------------|
| 0 | **P/LP** | Pathogenic or Likely Pathogenic |
| 1 | **VUS** | Variant of Uncertain Significance |
| 2 | **B/LB** | Benign or Likely Benign |
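
As a sanity check, the mapping above should agree with the `id2label` entry shipped in the model's configuration. A minimal sketch, assuming the repository's `config.json` defines these labels:

```python
from transformers import AutoConfig

# Inspect the label mapping stored in the published config
config = AutoConfig.from_pretrained("weijiang99/clinvarbert")
print(config.id2label)  # expected: {0: 'P/LP', 1: 'VUS', 2: 'B/LB'}
print(config.label2id)  # inverse mapping, useful when preparing training labels
```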

---

## ⚡ Quick Start

### Option 1: Use via Hugging Face Pipeline

```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline("text-classification", model="weijiang99/clinvarbert")

# Example text
text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."

# Predict
result = pipe(text)
print(result)
```
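
The pipeline also accepts a list of texts, which is convenient when screening many variant descriptions at once. A minimal sketch (example sentences are illustrative; depending on your `transformers` version, `top_k=None` or the older `return_all_scores=True` returns scores for every class):

```python
# Batch several variant descriptions in one call and print per-class scores
texts = [
    "The variant is common in population databases and observed in unaffected individuals.",
    "Functional studies demonstrate a deleterious effect on protein function.",
]

for text, scores in zip(texts, pipe(texts, top_k=None)):
    print(text)
    for entry in scores:
        print(f"  {entry['label']}: {entry['score']:.3f}")
```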

### Option 2: Manual Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")

# Input text
text = "This variant was reported as likely benign in multiple submissions."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = torch.argmax(probs, dim=-1).item()
    predicted_label = model.config.id2label[predicted_class_id]

print(f"Predicted label: {predicted_label}")
print(f"Probabilities: {probs}")
```
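
For larger workloads, the same manual path extends to batches and GPU execution. A minimal sketch, reusing the model and tokenizer loaded above (example sentences are illustrative):

```python
# Move the model to GPU if available and classify a batch in one forward pass
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

texts = [
    "This variant was reported as likely benign in multiple submissions.",
    "Segregation with disease was observed in three affected family members.",
]
batch = tokenizer(texts, return_tensors="pt", truncation=True, padding=True).to(device)

with torch.no_grad():
    logits = model(**batch).logits
    probs = torch.nn.functional.softmax(logits, dim=-1)

for text, p in zip(texts, probs):
    label = model.config.id2label[int(p.argmax())]
    print(f"{label}\t{text}")
```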