	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	tags:
	- biomedical
	- clinical
	- variant-classification
	- genetics
	- bert
	- fine-tuned
	base_model: dmis-lab/biobert-large-cased-v1.1
	datasets:
	- clinvar
	pipeline_tag: text-classification
	---

	# ClinVarBERT

	A BERT-based model fine-tuned for clinical variant interpretation and pathogenicity classification, built upon BioBERT-Large.
	ClinVarBERT is designed to understand the nuanced biomedical language used in variant descriptions and clinical genetics reports.

	---

	## 🧬 Model Details

	### Model Description

	ClinVarBERT is a domain-specific transformer model fine-tuned from BioBERT-Large for genetic variant interpretation.
	It is trained on ClinVar submissions and related clinical genetics texts to pick up the linguistic patterns that distinguish pathogenic, benign, and uncertain interpretations, and it classifies variant pathogenicity from that text.

	- Model Type: BERT-based transformer for sequence classification
	- Languages: English (biomedical / clinical domain)
	- License: Apache 2.0
	- Fine-tuned From: [dmis-lab/biobert-large-cased-v1.1](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
	- Training Data: Curated ClinVar submission texts describing genetic variants and their clinical interpretations

	### Model Sources

	- Repository: [Your GitHub Repository or Project Page]
	- Base Model: [BioBERT-Large](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
	- Dataset: [ClinVar Database](https://www.ncbi.nlm.nih.gov/clinvar/)

	---

	## 🚀 Uses

	### Direct Use

	ClinVarBERT can be directly applied to:
	- Variant pathogenicity classification: Classify genetic variants as Pathogenic/Likely Pathogenic (P/LP), Benign/Likely Benign (B/LB), or Variant of Uncertain Significance (VUS).
	- Clinical interpretation mining: Analyze and categorize textual variant interpretations from clinical databases or research reports.
	- Biomedical NLP tasks: Serve as a strong domain-specific encoder for clinical genetics-related text classification.

	## Label Mapping

	| Class ID | Label | Description |
	|----------|-------|-------------|
	| 0 | P/LP | Pathogenic or Likely Pathogenic |
	| 1 | VUS | Variant of Uncertain Significance |
	| 2 | B/LB | Benign or Likely Benign |
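
The mapping above can be applied in code. The sketch below assumes the checkpoint's `config.id2label` may hold either the named labels or the generic `LABEL_<id>` strings that Transformers assigns when no names are configured; this card does not say which one the model uses. `ID2LABEL` and `normalize_label` are illustrative names, not part of the model or the library.

```python
# Class IDs and labels copied from the Label Mapping table above.
ID2LABEL = {0: "P/LP", 1: "VUS", 2: "B/LB"}


def normalize_label(raw_label: str) -> str:
    """Map a raw prediction label to the label used in the table.

    Named labels pass through unchanged; generic "LABEL_<id>" strings
    (the Transformers default) are converted through ID2LABEL.
    """
    if raw_label in ID2LABEL.values():
        return raw_label
    if raw_label.startswith("LABEL_"):
        return ID2LABEL[int(raw_label.split("_", 1)[1])]
    raise ValueError(f"Unrecognized label: {raw_label!r}")


print(normalize_label("LABEL_1"))  # VUS
print(normalize_label("B/LB"))     # B/LB
```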

	---

	## ⚡ Quick Start

	### Option 1: Use via Hugging Face Pipeline

	```python
	from transformers import pipeline

	# Load the pipeline
	pipe = pipeline("text-classification", model="weijiang99/clinvarbert")

	# Example text
	text = "This missense variant in exon 5 of the BRCA1 gene has been observed in multiple families with breast cancer."

	# Predict
	result = pipe(text)
	print(result)
	```
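
By default the pipeline returns only the top-scoring class. A minimal sketch for getting the score of every class is shown below; it assumes a recent Transformers release in which the text-classification pipeline accepts `top_k=None` and, for a list of inputs, returns one list of `{"label", "score"}` dicts per input. `scores_by_label` and `build_pipeline` are hypothetical helper names introduced here for illustration.

```python
from typing import Callable, Dict, List


def scores_by_label(pipe: Callable, texts: List[str]) -> List[Dict[str, float]]:
    """Return one {label: score} dict per input text."""
    # top_k=None asks the pipeline for all class scores, not just the best one.
    results = pipe(texts, top_k=None)
    return [
        {item["label"]: item["score"] for item in per_text}
        for per_text in results
    ]


def build_pipeline():
    """Create the same pipeline as in Option 1 (downloads the model on first use)."""
    from transformers import pipeline

    return pipeline("text-classification", model="weijiang99/clinvarbert")
```

With the model available, `scores_by_label(build_pipeline(), [text])` would return a single dict with one entry per class for the example sentence.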

	### Option 2: Manual Inference

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	# Load model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained("weijiang99/clinvarbert")
	model = AutoModelForSequenceClassification.from_pretrained("weijiang99/clinvarbert")

	# Input text
	text = "This variant was reported as likely benign in multiple submissions."
	inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

	# Inference
	with torch.no_grad():
	    # Forward pass without gradient tracking
	    outputs = model(**inputs)
	    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

	# Highest-probability class for the single input text
	predicted_class_id = torch.argmax(probs, dim=-1).item()
	predicted_label = model.config.id2label[predicted_class_id]

	print(f"Predicted label: {predicted_label}")
	print(f"Probabilities: {probs}")
	```
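
In the manual path, `probs` is a tensor of shape `(batch_size, num_labels)`, so printing it shows raw numbers without label names. The sketch below pairs each probability with its label. It works on plain Python lists (for example `probs.tolist()`) so it does not depend on PyTorch. `label_probabilities` and `top_prediction` are hypothetical helper names, and the numbers in the example are made up for illustration, not model output.

```python
from typing import Dict, List, Mapping, Tuple


def label_probabilities(
    prob_rows: List[List[float]], id2label: Mapping[int, str]
) -> List[Dict[str, float]]:
    """Turn rows of class probabilities into {label: probability} dicts."""
    return [{id2label[i]: p for i, p in enumerate(row)} for row in prob_rows]


def top_prediction(label_probs: Dict[str, float]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    label = max(label_probs, key=label_probs.get)
    return label, label_probs[label]


# Made-up probabilities for two inputs, for illustration only.
example_rows = [[0.10, 0.20, 0.70], [0.80, 0.15, 0.05]]
# Labels taken from the Label Mapping table in this card.
example_id2label = {0: "P/LP", 1: "VUS", 2: "B/LB"}

for row in label_probabilities(example_rows, example_id2label):
    print(top_prediction(row))
```

With the objects from Option 2, the equivalent call would be `label_probabilities(probs.tolist(), model.config.id2label)`.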