@@ -18,51 +18,90 @@ pipeline_tag: text-classification
 # ClinVarBERT
-A BERT model fine-tuned for clinical variant interpretation and classification tasks, based on BioBERT-Large.
-## Model Details
 ### Model Description
-ClinVarBERT-Large is a domain-specific language model fine-tuned from BioBERT-Large for understanding and classifying genetic variant descriptions and clinical interpretations. The model has been trained to understand the nuanced language used in clinical genetics, particularly for variant pathogenicity assessment and clinical significance classification.
-- **Model type:** BERT-based transformer for sequence classification
-- **Language(s):** English (biomedical/clinical domain)
-- **License:** Apache 2.0
-- **Finetuned from model:** dmis-lab/biobert-large-cased-v1.1
 ### Model Sources
-- **Repository:** [Your GitHub Repository]
-- **Base Model:** [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
-- **Training Data:** ClinVar database submissions text
-## Uses
 ### Direct Use
-This model is designed for:
-- **Variant pathogenicity classification:** Classifying genetic variants as P/LP, B/LB, or VUS
-- **Clinical interpretation analysis:** Understanding and categorizing clinical variant descriptions
-- **Biomedical text classification:** General classification tasks in the clinical genetics domain
-## How to Get Started with the Model
 ```python
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
 import torch
 # Load model and tokenizer
 tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
 model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")
-# Example usage
-text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."
 inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
 with torch.no_grad():
     outputs = model(**inputs)
-    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
-# Get predicted class
-predicted_class = torch.argmax(predictions, dim=-1)

 # ClinVarBERT
+A **BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification**, built upon **BioBERT-Large**.
+ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.
+---
+## 🧬 Model Details
 ### Model Description
+**ClinVarBERT-Large** is a domain-specific transformer model fine-tuned from **BioBERT-Large** for the task of **genetic variant interpretation**.
+It is trained to capture subtle linguistic patterns in **ClinVar submissions** and related clinical genetics texts, so that it can classify variant pathogenicity from free-text descriptions.
+- **Model Type:** BERT-based transformer for sequence classification
+- **Languages:** English (biomedical / clinical domain)
+- **License:** Apache 2.0
+- **Fine-tuned From:** [dmis-lab/biobert-large-cased-v1.1](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
+- **Training Data:** Curated ClinVar submission texts describing genetic variants and their clinical interpretations
 ### Model Sources
+- **Repository:** [Your GitHub Repository or Project Page]
+- **Base Model:** [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
+- **Dataset:** [ClinVar Database](https://www.ncbi.nlm.nih.gov/clinvar/)
+---
+## 🚀 Uses
 ### Direct Use
+ClinVarBERT can be directly applied to:
+- **Variant pathogenicity classification:** Classify genetic variants as *Pathogenic/Likely Pathogenic (P/LP)*, *Benign/Likely Benign (B/LB)*, or *Variant of Uncertain Significance (VUS)*.
+- **Clinical interpretation mining:** Analyze and categorize textual variant interpretations from clinical databases or research reports.
+- **Biomedical NLP tasks:** Serve as a domain-specific encoder for text classification in clinical genetics.
+## Label Mapping
+| Class ID | Label | Description |
+|-----------|--------|-------------|
+| 0 | **B/LB** | Benign or Likely Benign |
+| 1 | **VUS** | Variant of Uncertain Significance |
+| 2 | **P/LP** | Pathogenic or Likely Pathogenic |
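For code that consumes raw class IDs, the table above can be written as a plain dictionary. This is a minimal sketch: the names `ID2LABEL`, `LABEL_DESCRIPTIONS`, and `describe_class` are illustrative and not part of the model repository, and it assumes the model's `config.id2label` follows the same ordering as the table.

```python
# Label mapping transcribed from the table above (names are illustrative).
ID2LABEL = {0: "B/LB", 1: "VUS", 2: "P/LP"}

LABEL_DESCRIPTIONS = {
    "B/LB": "Benign or Likely Benign",
    "VUS": "Variant of Uncertain Significance",
    "P/LP": "Pathogenic or Likely Pathogenic",
}


def describe_class(class_id: int) -> str:
    """Return 'LABEL (description)' for a class ID; raise on unknown IDs."""
    if class_id not in ID2LABEL:
        raise ValueError(f"Unknown class ID: {class_id}")
    label = ID2LABEL[class_id]
    return f"{label} ({LABEL_DESCRIPTIONS[label]})"


if __name__ == "__main__":
    for class_id in sorted(ID2LABEL):
        print(class_id, describe_class(class_id))
```

Before relying on a hard-coded mapping like this, it may be worth comparing it against `model.config.id2label` from the loaded checkpoint.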
+---
+## ⚡ Quick Start
+### Option 1: Use via Hugging Face Pipeline
+```python
+from transformers import pipeline
+# Load the pipeline
+pipe = pipeline("text-classification", model="weijiang99/clinvarbert")
+# Example text
+text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."
+# Predict
+result = pipe(text)
+print(result)  # a list containing one dict with "label" and "score" keys
+```
+### Option 2: Manual Inference
 ```python
 import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
 # Load model and tokenizer
 tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
 model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")
+# Input text
+text = "This variant was reported as likely benign in multiple submissions."
 inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+# Inference
 with torch.no_grad():
     outputs = model(**inputs)
+    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    predicted_class_id = torch.argmax(probs, dim=-1).item()
+    predicted_label = model.config.id2label[predicted_class_id]
+print(f"Predicted label: {predicted_label}")
+# Print the probabilities as a plain list, ordered by class ID
+print(f"Probabilities: {probs.squeeze(0).tolist()}")
+```
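The post-processing in the manual inference example (softmax over the logits, then argmax) can be illustrated without loading the model. The following is a dependency-free sketch: the logits are made-up numbers for illustration only, not output from ClinVarBERT, and `decode_logits` is a hypothetical helper that assumes the class order given in the Label Mapping table.

```python
import math

# Assumed class order, taken from the Label Mapping table.
ID2LABEL = {0: "B/LB", 1: "VUS", 2: "P/LP"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_logits(logits):
    """Return (predicted label, {label: probability}) for one example."""
    if len(logits) != len(ID2LABEL):
        raise ValueError("Expected one logit per class")
    probs = softmax(logits)
    best_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
    by_label = {ID2LABEL[i]: p for i, p in enumerate(probs)}
    return ID2LABEL[best_id], by_label


if __name__ == "__main__":
    # Illustrative logits only; real values come from outputs.logits.
    label, probabilities = decode_logits([2.0, 0.5, -1.0])
    print(f"Predicted label: {label}")
    for name, p in probabilities.items():
        print(f"  {name}: {p:.3f}")
```

With a real model, the same steps are what `torch.nn.functional.softmax` and `torch.argmax` perform on `outputs.logits` in the example above.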