Update README.md
README.md
CHANGED
@@ -18,51 +18,90 @@ pipeline_tag: text-classification
18
19   # ClinVarBERT
20
21 - A BERT model fine-tuned for clinical variant interpretation and classification
22
23 -
24
25   ### Model Description
26
27 - ClinVarBERT-Large is a domain-specific
28
29 - - **Model
30 - - **
31 - - **License:** Apache 2.0
32 - - **
33
34   ### Model Sources
35
36 - - **Repository:** [Your GitHub Repository]
37 - - **Base Model:** [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
38 - - **
39
40 -
41
42   ### Direct Use
43
44 -
45 - - **Variant pathogenicity classification:**
46 - - **Clinical interpretation
47 - - **Biomedical
48
49 -
50
51   ```python
52 - from transformers import AutoTokenizer, AutoModelForSequenceClassification
53   import torch
54
55   # Load model and tokenizer
56   tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
57   model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")
58
59 - #
60 - text = "This
61   inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
62
63   with torch.no_grad():
64       outputs = model(**inputs)
65 -
66
67 -
68 -
18
19   # ClinVarBERT
20
21 + A **BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification**, built upon **BioBERT-Large**.
22 + ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.
23
24 + ---
25 +
26 + ## 🧬 Model Details
27
28   ### Model Description
29
30 + **ClinVarBERT-Large** is a domain-specific transformer model fine-tuned from **BioBERT-Large** for the task of **genetic variant interpretation**.
31 + It is trained to capture subtle linguistic patterns in **ClinVar submissions** and related clinical genetics texts, enabling accurate classification of variant pathogenicity.
32
33 + - **Model Type:** BERT-based transformer for sequence classification
34 + - **Languages:** English (biomedical / clinical domain)
35 + - **License:** Apache 2.0
36 + - **Fine-tuned From:** [dmis-lab/biobert-large-cased-v1.1](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
37 + - **Training Data:** Curated ClinVar submission texts describing genetic variants and their clinical interpretations
38
39   ### Model Sources
40
41 + - **Repository:** [Your GitHub Repository or Project Page]
42 + - **Base Model:** [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
43 + - **Dataset:** [ClinVar Database](https://www.ncbi.nlm.nih.gov/clinvar/)
44
45 + ---
46 +
47 + ## 🚀 Uses
48
49   ### Direct Use
50
51 + ClinVarBERT can be directly applied to:
52 + - **Variant pathogenicity classification:** Classify genetic variants as *Pathogenic/Likely Pathogenic (P/LP)*, *Benign/Likely Benign (B/LB)*, or *Variant of Uncertain Significance (VUS)*.
53 + - **Clinical interpretation mining:** Analyze and categorize textual variant interpretations from clinical databases or research reports.
54 + - **Biomedical NLP tasks:** Serve as a strong domain-specific encoder for clinical genetics-related text classification.
55 +
56 + ## Label Mapping
57
58 + | Class ID | Label | Description |
59 + |-----------|--------|-------------|
60 + | 0 | **B/LB** | Benign or Likely Benign |
61 + | 1 | **VUS** | Variant of Uncertain Significance |
62 + | 2 | **P/LP** | Pathogenic or Likely Pathogenic |
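
The label mapping added in this change can be mirrored in code as a pair of lookup dictionaries. This is a minimal sketch that assumes the checkpoint's class IDs follow the table exactly; `ID2LABEL`, `LABEL2ID`, `DESCRIPTIONS`, and `describe` are illustrative names and are not part of the repository.

```python
# Assumed mapping, copied from the README's Label Mapping table.
ID2LABEL = {0: "B/LB", 1: "VUS", 2: "P/LP"}
LABEL2ID = {label: class_id for class_id, label in ID2LABEL.items()}

DESCRIPTIONS = {
    "B/LB": "Benign or Likely Benign",
    "VUS": "Variant of Uncertain Significance",
    "P/LP": "Pathogenic or Likely Pathogenic",
}


def describe(class_id: int) -> str:
    """Return 'LABEL (description)' for a class ID (hypothetical helper)."""
    label = ID2LABEL[class_id]
    return f"{label} ({DESCRIPTIONS[label]})"


print(describe(2))  # P/LP (Pathogenic or Likely Pathogenic)
```

If the loaded model's `model.config.id2label` disagrees with this table, the config should probably be treated as the source of truth.
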
63 +
64 + ---
65 +
66 + ## ⚡ Quick Start
67 +
68 + ### Option 1: Use via Hugging Face Pipeline
69 +
70 + ```python
71 + from transformers import pipeline
72 +
73 + # Load the pipeline
74 + pipe = pipeline("text-classification", model="weijiang99/clinvarbert")
75 +
76 + # Example text
77 + text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."
78 +
79 + # Predict
80 + result = pipe(text)
81 + print(result)
82 + ```
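
For a single string, a `text-classification` pipeline returns a list of dictionaries with `label` and `score` keys. The sketch below shows one way such output might be post-processed without loading the model. The `low_confidence` helper, the `0.8` threshold, and the example values are assumptions made for illustration; the values are placeholders in the pipeline's output shape, not real predictions from ClinVarBERT.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def low_confidence(results: List[Prediction], threshold: float = 0.8) -> List[Prediction]:
    """Keep predictions whose score is below the threshold (hypothetical helper)."""
    return [r for r in results if float(r["score"]) < threshold]


# Placeholder values in the pipeline's output shape; NOT real model output.
example_results = [
    {"label": "P/LP", "score": 0.95},
    {"label": "VUS", "score": 0.55},
]

for pred in low_confidence(example_results):
    print(f"Review manually: {pred['label']} (score={pred['score']:.2f})")
```

The label strings the pipeline actually emits depend on the `id2label` entry in the checkpoint's config, so they may differ from the ones used here.
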
83 +
84 + ### Option 2: Manual Inference
85
86   ```python
87   import torch
88 + from transformers import AutoTokenizer, AutoModelForSequenceClassification
89
90   # Load model and tokenizer
91   tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
92   model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")
93
94 + # Input text
95 + text = "This variant was reported as likely benign in multiple submissions."
96   inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
97
98 + # Inference
99   with torch.no_grad():
100       outputs = model(**inputs)
101 + probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
102 + predicted_class_id = torch.argmax(probs, dim=-1).item()
103 + predicted_label = model.config.id2label[predicted_class_id]
104
105 + print(f"Predicted label: {predicted_label}")
106 + print(f"Probabilities: {probs}")
107 + ```
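
The post-processing in the manual inference example (softmax over the logits, then argmax) can be reproduced in plain Python to see what each step does. This is a sketch under stated assumptions: the logits are made-up numbers for a single input, not output from ClinVarBERT, and `softmax` here is a local helper rather than the `torch` function used in the README.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one input, one value per class; NOT real model output.
logits = [0.2, 1.5, -0.7]
probs = softmax(logits)

# Argmax: the index of the largest probability is the predicted class ID.
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)

print(f"Predicted class ID: {predicted_class_id}")
print(f"Probabilities: {[round(p, 3) for p in probs]}")
```

Softmax preserves the ordering of the logits, so the argmax could equally be taken on the raw logits; the probabilities are mainly useful for reporting confidence.
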