weijiang99 committed on
Commit a3c85d9 · verified · 1 Parent(s): 708b4f3

Update README.md

Files changed (1)
  1. README.md +61 -22
README.md CHANGED
@@ -18,51 +18,90 @@ pipeline_tag: text-classification
 
  # ClinVarBERT

- A BERT model fine-tuned for clinical variant interpretation and classification tasks, based on BioBERT-Large.

- ## Model Details

  ### Model Description

- ClinVarBERT-Large is a domain-specific language model fine-tuned from BioBERT-Large for understanding and classifying genetic variant descriptions and clinical interpretations. The model has been trained to understand the nuanced language used in clinical genetics, particularly for variant pathogenicity assessment and clinical significance classification.

- - **Model type:** BERT-based transformer for sequence classification
- - **Language(s):** English (biomedical/clinical domain)
- - **License:** Apache 2.0
- - **Finetuned from model:** dmis-lab/biobert-large-cased-v1.1
 
  ### Model Sources

- - **Repository:** [Your GitHub Repository]
- - **Base Model:** [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
- - **Training Data:** ClinVar database submissions text

- ## Uses

  ### Direct Use

- This model is designed for:
- - **Variant pathogenicity classification:** Classifying genetic variants as P/LP, B/LB, or VUS
- - **Clinical interpretation analysis:** Understanding and categorizing clinical variant descriptions
- - **Biomedical text classification:** General classification tasks in the clinical genetics domain
 
 
- ## How to Get Started with the Model

  ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

  # Load model and tokenizer
  tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
  model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")

- # Example usage
- text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."
  inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

  with torch.no_grad():
      outputs = model(**inputs)
- predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

- # Get predicted class
- predicted_class = torch.argmax(predictions, dim=-1)
 
 
  # ClinVarBERT

+ A **BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification**, built upon **BioBERT-Large**.
+ ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.
 
+ ---
+
+ ## 🧬 Model Details

  ### Model Description

+ **ClinVarBERT-Large** is a domain-specific transformer model fine-tuned from **BioBERT-Large** for the task of **genetic variant interpretation**.
+ It is trained to capture subtle linguistic patterns in **ClinVar submissions** and related clinical genetics texts, enabling accurate classification of variant pathogenicity.

+ - **Model Type:** BERT-based transformer for sequence classification
+ - **Languages:** English (biomedical / clinical domain)
+ - **License:** Apache 2.0
+ - **Fine-tuned From:** [dmis-lab/biobert-large-cased-v1.1](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
+ - **Training Data:** Curated ClinVar submission texts describing genetic variants and their clinical interpretations
 
  ### Model Sources

+ - **Repository:** [Your GitHub Repository or Project Page]
+ - **Base Model:** [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
+ - **Dataset:** [ClinVar Database](https://www.ncbi.nlm.nih.gov/clinvar/)

+ ---
+
+ ## 🚀 Uses

  ### Direct Use

+ ClinVarBERT can be directly applied to:
+ - **Variant pathogenicity classification:** Classify genetic variants as *Pathogenic/Likely Pathogenic (P/LP)*, *Benign/Likely Benign (B/LB)*, or *Variant of Uncertain Significance (VUS)*.
+ - **Clinical interpretation mining:** Analyze and categorize textual variant interpretations from clinical databases or research reports.
+ - **Biomedical NLP tasks:** Serve as a strong domain-specific encoder for clinical genetics-related text classification.
+
+ ## Label Mapping
 
+ | Class ID | Label | Description |
+ |-----------|--------|-------------|
+ | 0 | **B/LB** | Benign or Likely Benign |
+ | 1 | **VUS** | Variant of Uncertain Significance |
+ | 2 | **P/LP** | Pathogenic or Likely Pathogenic |
+
+ ---
+
+ ## ⚡ Quick Start
+
+ ### Option 1: Use via Hugging Face Pipeline
+
+ ```python
+ from transformers import pipeline
+
+ # Load the pipeline
+ pipe = pipeline("text-classification", model="weijiang99/clinvarbert")
+
+ # Example text
+ text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."
+
+ # Predict
+ result = pipe(text)
+ print(result)
+ ```
+
+ ### Option 2: Manual Inference
 
  ```python
  import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification

  # Load model and tokenizer
  tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
  model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")

+ # Input text
+ text = "This variant was reported as likely benign in multiple submissions."
  inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

+ # Inference
  with torch.no_grad():
      outputs = model(**inputs)
+ probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
+ predicted_class_id = torch.argmax(probs, dim=-1).item()
+ predicted_label = model.config.id2label[predicted_class_id]

+ print(f"Predicted label: {predicted_label}")
+ print(f"Probabilities: {probs}")
+ ```
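
The Label Mapping table and the manual-inference snippet added in this commit can be tied together with a small decoding helper. The following is a minimal sketch, not code from the repository: it assumes the checkpoint's class IDs follow the README table (0 = B/LB, 1 = VUS, 2 = P/LP), and `ID2LABEL`, `decode_logits`, and `normalize_pipeline_label` are hypothetical names introduced here for illustration. It uses only the standard library so the decoding logic can be checked without downloading the model.

```python
import math

# Assumption: this mirrors the README's Label Mapping table. The authoritative
# mapping is whatever model.config.id2label contains in the published checkpoint.
ID2LABEL = {0: "B/LB", 1: "VUS", 2: "P/LP"}


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_logits(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(id2label):
        raise ValueError("expected one logit per class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def normalize_pipeline_label(raw, id2label=ID2LABEL):
    """Map a generic 'LABEL_<id>' name onto the table; pass other names through."""
    prefix = "LABEL_"
    if raw.startswith(prefix) and raw[len(prefix):].isdigit():
        return id2label.get(int(raw[len(prefix):]), raw)
    return raw


if __name__ == "__main__":
    # Made-up logits for illustration only, not real model output.
    label, prob = decode_logits([-1.2, 0.3, 2.5])
    print(f"{label} ({prob:.3f})")
    print(normalize_pipeline_label("LABEL_0"))
```

With the manual-inference snippet, `outputs.logits[0].tolist()` could be passed to `decode_logits`. `normalize_pipeline_label` is only relevant if the checkpoint's config does not define `id2label`, in which case Transformers falls back to generic names such as `LABEL_0` in pipeline results.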