---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- biomedical
- clinical
- variant-classification
- genetics
- bert
- fine-tuned
base_model: dmis-lab/biobert-large-cased-v1.1
datasets:
- clinvar
pipeline_tag: text-classification
---
# ClinVarBERT
A **BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification**, built upon **BioBERT-Large**.
ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.
---
## 🧬 Model Details
### Model Description
**ClinVarBERT-Large** is a domain-specific transformer model fine-tuned from **BioBERT-Large** for the task of **genetic variant interpretation**.
It is trained on **ClinVar submissions** and related clinical genetics texts so that it can classify variant pathogenicity from free-text descriptions.
- **Model Type:** BERT-based transformer for sequence classification
- **Languages:** English (biomedical / clinical domain)
- **License:** Apache 2.0
- **Fine-tuned From:** [dmis-lab/biobert-large-cased-v1.1](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
- **Training Data:** Curated ClinVar submission texts describing genetic variants and their clinical interpretations
### Model Sources
- **Repository:** [Your GitHub Repository or Project Page]
- **Base Model:** [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
- **Dataset:** [ClinVar Database](https://www.ncbi.nlm.nih.gov/clinvar/)
---
## 🚀 Uses
### Direct Use
ClinVarBERT can be directly applied to:
- **Variant pathogenicity classification:** Classify free-text variant descriptions as *Pathogenic/Likely Pathogenic (P/LP)*, *Variant of Uncertain Significance (VUS)*, or *Benign/Likely Benign (B/LB)*.
- **Clinical interpretation mining:** Analyze and categorize textual variant interpretations from clinical databases or research reports.
- **Biomedical NLP tasks:** Serve as a domain-specific encoder for text classification in clinical genetics.
### Label Mapping
| Class ID | Label | Description |
|-----------|--------|-------------|
| 0 | **P/LP** | Pathogenic or Likely Pathogenic |
| 1 | **VUS** | Variant of Uncertain Significance |
| 2 | **B/LB** | Benign or Likely Benign |
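
The table above can be mirrored in code when post-processing raw class IDs or probability vectors. The sketch below is a minimal illustration that assumes the class order shown in the table. `ID2LABEL`, `label_from_probs`, and the `min_confidence` threshold are hypothetical names introduced here, not part of the released model; the checkpoint's own `model.config.id2label` should be treated as the authoritative mapping.

```python
# Hypothetical mapping that mirrors the label table (assumed class order).
ID2LABEL = {0: "P/LP", 1: "VUS", 2: "B/LB"}


def label_from_probs(probs, min_confidence=0.0):
    """Return (label, probability) for the highest-probability class.

    If the top probability is below `min_confidence`, the label is None,
    which a caller may treat as "no confident call".
    """
    if len(probs) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} probabilities, got {len(probs)}")
    # Index of the largest probability.
    class_id = max(range(len(probs)), key=lambda i: probs[i])
    top = probs[class_id]
    if top < min_confidence:
        return None, top
    return ID2LABEL[class_id], top


# Illustrative probabilities, not real model output.
print(label_from_probs([0.05, 0.15, 0.80]))
```

Returning `None` below the threshold, rather than defaulting to a class such as VUS, keeps "the model was unsure" distinct from "the model predicted uncertain significance".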
---
## ⚡ Quick Start
### Option 1: Use via Hugging Face Pipeline
```python
from transformers import pipeline
# Load the pipeline
pipe = pipeline("text-classification", model="weijiang99/clinvarbert")
# Example text
text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."
# Predict
result = pipe(text)
print(result)
```
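
By default the pipeline reports only the top label for each input. To inspect the scores for every class, the `text-classification` pipeline accepts `top_k=None`. The sketch below wraps that call: `score_all_classes` is a hypothetical helper, the label strings it returns depend on the checkpoint's `id2label` configuration, and `fake_pipe` is a stand-in with made-up scores so the example runs without downloading the model.

```python
def score_all_classes(pipe, texts):
    """Run a text-classification pipeline and return one {label: score} dict per text."""
    # top_k=None asks the pipeline for every class score, not just the best one.
    # Passing a list yields one list of {"label", "score"} dicts per input text.
    results = pipe(list(texts), top_k=None)
    return [{item["label"]: item["score"] for item in per_text} for per_text in results]


# Real usage (downloads the model):
#   from transformers import pipeline
#   pipe = pipeline("text-classification", model="weijiang99/clinvarbert")
#   scores = score_all_classes(pipe, ["<variant description>"])


def fake_pipe(texts, top_k=None):
    """Offline stand-in that mimics the pipeline's output shape with made-up scores."""
    return [
        [
            {"label": "B/LB", "score": 0.7},
            {"label": "VUS", "score": 0.2},
            {"label": "P/LP", "score": 0.1},
        ]
        for _ in texts
    ]


scores = score_all_classes(
    fake_pipe,
    ["This variant was reported as likely benign in multiple submissions."],
)
print(scores)
```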
### Option 2: Manual Inference
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")
# Input text
text = "This variant was reported as likely benign in multiple submissions."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
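# Inference (no gradients needed at prediction time)
with torch.no_grad():
    outputs = model(**inputs)
    # Convert logits to class probabilities
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = torch.argmax(probs, dim=-1).item()
# Map the class ID to its label using the model config
predicted_label = model.config.id2label[predicted_class_id]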
print(f"Predicted label: {predicted_label}")
print(f"Probabilities: {probs}")
```