---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- biomedical
- clinical
- variant-classification
- genetics
- bert
- fine-tuned
base_model: dmis-lab/biobert-large-cased-v1.1
datasets:
- clinvar
pipeline_tag: text-classification
---

# ClinVarBERT

A **BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification**, built upon **BioBERT-Large**. ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.

---

## 🧬 Model Details

### Model Description

**ClinVarBERT-Large** is a domain-specific transformer model fine-tuned from **BioBERT-Large** for the task of **genetic variant interpretation**. It is trained to capture subtle linguistic patterns in **ClinVar submissions** and related clinical genetics texts, enabling accurate classification of variant pathogenicity.

- **Model Type:** BERT-based transformer for sequence classification
- **Languages:** English (biomedical / clinical domain)
- **License:** Apache 2.0
- **Fine-tuned From:** [dmis-lab/biobert-large-cased-v1.1](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
- **Training Data:** Curated ClinVar submission texts describing genetic variants and their clinical interpretations

### Model Sources

- **Repository:** [Your GitHub Repository or Project Page]
- **Base Model:** [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
- **Dataset:** [ClinVar Database](https://www.ncbi.nlm.nih.gov/clinvar/)

---

## 🚀 Uses

### Direct Use

ClinVarBERT can be directly applied to:

- **Variant pathogenicity classification:** Classify genetic variants as *Pathogenic/Likely Pathogenic (P/LP)*, *Benign/Likely Benign (B/LB)*, or *Variant of Uncertain Significance (VUS)*.
- **Clinical interpretation mining:** Analyze and categorize textual variant interpretations from clinical databases or research reports.
- **Biomedical NLP tasks:** Serve as a strong domain-specific encoder for clinical genetics-related text classification.

## Label Mapping

| Class ID | Label | Description |
|-----------|--------|-------------|
| 0 | **P/LP** | Pathogenic or Likely Pathogenic |
| 1 | **VUS** | Variant of Uncertain Significance |
| 2 | **B/LB** | Benign or Likely Benign |

---

## ⚡ Quick Start

### Option 1: Use via Hugging Face Pipeline

```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline("text-classification", model="weijiang99/clinvarbert")

# Example text
text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."

# Predict
result = pipe(text)
print(result)
```

### Option 2: Manual Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")

# Input text
text = "This variant was reported as likely benign in multiple submissions."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class_id = torch.argmax(probs, dim=-1).item()
predicted_label = model.config.id2label[predicted_class_id]

print(f"Predicted label: {predicted_label}")
print(f"Probabilities: {probs}")
```
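
### Mapping Logits to Labels

The softmax step in Option 2 and the Label Mapping table can be combined into a small post-processing helper. The following is a minimal sketch, assuming the class IDs follow the Label Mapping table; the checkpoint's `model.config.id2label` remains the authoritative source and should be checked first. `ID2LABEL`, `logits_to_prediction`, and the example logits are hypothetical names and values chosen for illustration, not part of the released model.

```python
import math

# Assumed mapping, copied from the Label Mapping table.
# Verify against model.config.id2label before relying on it.
ID2LABEL = {0: "P/LP", 1: "VUS", 2: "B/LB"}


def logits_to_prediction(logits):
    """Convert one row of raw logits into a label and per-class probabilities."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")

    # Subtract the largest logit before exponentiating for numerical stability.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Index of the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": ID2LABEL[best],
        "score": probs[best],
        "probabilities": {ID2LABEL[i]: p for i, p in enumerate(probs)},
    }


# Hypothetical logits for a single input text.
example = logits_to_prediction([2.0, 0.5, -1.0])
print(example["label"])  # prints "P/LP"
```

With the Option 2 code, the input for a single text would be `outputs.logits[0].tolist()`. Returning the full probability dictionary alongside the top label makes it easier to spot low-margin predictions, for example when P/LP and VUS receive similar scores.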