	---
	model_name: PathoPreter-4B-SNV
	base_model:
	- unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
	library_name: peft
	pipeline_tag: text-generation
	language:
	- en
	tags:
	- genomics
	- variant-classification
	- pathogenicity
	- clinvar
	- gnomad
	- lora
	- unsloth
	- gguf
	- research
	datasets:
	- clinvar
	- gnomad
	metrics:
	- accuracy
	- precision
	- recall
	- f1
	license: apache-2.0
	---


	<h1 style="font-size: 56px; text-align: center;">
	PathoPreter-4B-SNV
	</h1>

	<div style="text-align: center;">
	(Base: unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit)
	</div>
	<div style="text-align: center;">
	An LLM for SNV Pathogenicity Classification
	</div>

	<div style="font-size: 25px;text-decoration-color:blue;text-align: center;">
	<a href="https://github.com/YADAV1825/PathoPreter">Github: https://github.com/YADAV1825/PathoPreter</a>
	</div>

	---

	## ⚠️ IMPORTANT DISCLAIMER

	PathoPreter is a research-only predictive model.
	It is NOT a clinical diagnostic system or a diagnostic device. It is a research tool for risk prioritization that outputs "Pathogenic Indication" signals to assist in workflow prioritization.
	It should never be used to confirm a diagnosis or determine medical treatment without expert ACMG review.

	---

	## 🔬 Model Summary

	PathoPreter-4B-SNV is a parameter-efficient large language model fine-tuned to classify
	Single Nucleotide Variants (SNVs) as Pathogenic or Benign using structured genomic context.

	- Base model: unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
	- Fine-tuning: LoRA via Unsloth
	- Training data: 1.2 million SNVs
	- Genome assembly: GRCh38
	- Scope: SNVs only
	- Deployment: safetensors + GGUF (Q4/Q8)

	Strict HGVS-level and coordinate-level train–test disjointness was enforced.

	---

	## 🧬 Dataset Description


	The dataset consists exclusively of Single Nucleotide Variants (SNVs) curated from ClinVar and augmented with gnomAD population allele frequencies.
	<div><h3 align="center">The training dataset contains approximately 144k pathogenic variants and 1.05 million benign variants, for a total of about 1.2 million samples. The test set has 55k distinct samples, evaluated under 11 separate ablation tests (on the same 55k rows), for around 12 × 55k = 660k evaluated rows.</h3></div>

	<h3 align="center">A sample of 12k training rows and 1.2k test rows is available to give an idea of the dataset format.</h3>

	<div>GitHub links: <a href="https://github.com/YADAV1825/PathoPreter/blob/main/sample_train.parquet">Train</a> | <a href="https://github.com/YADAV1825/PathoPreter/blob/main/sample_test.parquet">Test</a>.</div>
	<div>Hugging Face link: <a href="https://huggingface.co/datasets/YADAV0206/Clinvar_SNV_GnomAD_Pathogen_Variants">https://huggingface.co/datasets/YADAV0206/Clinvar_SNV_GnomAD_Pathogen_Variants</a></div>

	```json
	{"text":"Below is a biological context regarding a genetic variant. Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: SERPINH1\nVariant: NC_000011.10:g.75561608C>T\n\n### Response:","clean_label":"Pathogenic","variant_id":"NC_000011.10:g.75561608C>T","join_key":"NC_000011.10:g.75561608C>T","Name":"NC_000011.10:g.75561608C>T","Assembly":"GRCh38","chrom":"11","pos":"75561608","ref":"C","alt":"T","gnomad_af":0}
	```

	Included signals:
	- Gene symbol
	- HGVS transcript-level variant (join_key)
	- Genomic coordinates (chrom, pos, ref, alt)
	- gnomAD allele frequency
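The signals listed above are serialized into a plain-text prompt. The sketch below rebuilds the full-context layout seen in the usage example later in this README. The function name `build_prompt` and the six-decimal frequency formatting are assumptions inferred from that example, not the author's actual preprocessing code.

```python
# Hypothetical helper: reproduces the prompt layout shown in this README's
# usage example. The real preprocessing pipeline may differ.
INSTRUCTION = (
    "Below is a biological context regarding a genetic variant. "
    "Determine if it is Pathogenic or Benign."
)


def build_prompt(gene, variant, chrom, pos, ref, alt, gnomad_af):
    """Assemble a full-context prompt from one dataset row."""
    lines = [
        f"Gene: {gene}",
        f"Variant: {variant}",
        f"Location: chr{chrom}:{pos} {ref}>{alt}",
        # Six decimals is an assumption based on "0.000000" in the example.
        f"gnomAD Frequency: {gnomad_af:.6f}",
    ]
    return f"{INSTRUCTION}\n\n### Input:\n" + "\n".join(lines) + "\n\n### Response:"


prompt = build_prompt(
    gene="CDK8",
    variant="NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    chrom="13",
    pos="26385259",
    ref="C",
    alt="G",
    gnomad_af=0,
)
print(prompt)
```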

	Excluded:
	- Variants of Uncertain Significance (VUS)
	- Conflicting interpretations
	- Indels, CNVs, structural variants

	Dataset integrity guarantees:
	- No detectable label leakage under automated audits (zero duplicate variants, zero label leakage, and zero train–test overlap)
	- Allele frequency sanity checks passed
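A minimal sketch of what an overlap audit of this kind could look like, assuming rows are dictionaries with the `join_key`, `chrom`, `pos`, `ref`, and `alt` fields from the sample record. The function `audit_disjoint` and the `HYPOTHETICAL_ALIAS` identifier are illustrative only; the author's audit scripts are not shown in this README.

```python
# Hypothetical audit: checks identifier-level and coordinate-level overlap
# between two splits. Not the author's actual verification script.
def audit_disjoint(train_rows, test_rows):
    """Return the identifiers and coordinates shared by both splits."""
    def hgvs_keys(rows):
        return {r["join_key"] for r in rows}

    def coord_keys(rows):
        return {(r["chrom"], r["pos"], r["ref"], r["alt"]) for r in rows}

    return {
        "hgvs_overlap": hgvs_keys(train_rows) & hgvs_keys(test_rows),
        "coord_overlap": coord_keys(train_rows) & coord_keys(test_rows),
    }


train = [{"join_key": "NC_000011.10:g.75561608C>T",
          "chrom": "11", "pos": "75561608", "ref": "C", "alt": "T"}]
clean_test = [{"join_key": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
               "chrom": "13", "pos": "26385259", "ref": "C", "alt": "G"}]
# Same coordinates as the training row under a different (made-up) name.
leaky_test = clean_test + [{"join_key": "HYPOTHETICAL_ALIAS",
                            "chrom": "11", "pos": "75561608", "ref": "C", "alt": "T"}]

clean_report = audit_disjoint(train, clean_test)
leaky_report = audit_disjoint(train, leaky_test)
print(clean_report)
print(leaky_report)
```

Checking coordinates as well as identifiers matters because the same variant can appear under more than one HGVS name.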

	---

	## 📦 Dataset Availability

	<div>Dataset Construction and Availability:</div>
	The datasets used to train and evaluate PathoPreter (including large-scale ClinVar-derived SNV corpora, controlled ablation test suites, and robustness evaluation datasets) were fully constructed in-house using publicly available, permissively licensed genomic resources such as ClinVar and gnomAD. All upstream sources are properly credited and explicitly permit commercial use and redistribution.
	Significant original engineering and curation effort was applied beyond raw data usage. This included large-scale extraction, normalization, schema unification, quality control, deduplication, and strict train–test disjointness enforcement. Approximately 8 million raw ClinVar variants and ~250 GB of gnomAD VCF data were processed and merged into production-grade Parquet datasets optimized for large-scale analytics, machine learning training, and downstream integration.
	The end-to-end data construction process required approximately 150 hours of compute time and over 250 hours of expert engineering and curation work. The resulting datasets constitute a high-value derived data asset, distinct from the original source distributions.
	These datasets are available for licensed distribution to startups, enterprises, and research organizations for use in applied genomics, AI/ML model development, benchmarking, variant prioritization workflows, and internal research. Commercial licensing, redistribution terms, and support options are available upon request.


	Available components (in both Parquet and CSV) include:
	- Large-scale ClinVar-style SNV training dataset
	- Held-out test set with identical variants across ablations
	- Controlled ablation datasets (signal removal studies)
	- Fake-variant robustness evaluation dataset (see the Fake Variant Robustness Test section below for why it matters)
	- Balanced CSV subsets suitable for classical ML training
	- Data audit and leakage-verification scripts

	If you are interested in:
	- dataset licensing
	- research or industry use
	- collaboration or benchmarking
	- reproducing or extending this work

	please contact:

	Rohit Yadav

	<div>
	<a href="mailto:yrohit1825@gmail.com">yrohit1825@gmail.com</a> | <a href="https://github.com/YADAV1825/PathoPreter">GitHub: https://github.com/YADAV1825/PathoPreter</a>
	</div>

	Requests are evaluated on a case-by-case basis.


	---

	## 📊 Evaluation Protocol

	- Held-out test set: 55,055 (55k) unseen SNVs
	- Total evaluation variants across ablations: ~600k
	- One full-context baseline + ten controlled ablation datasets
	- All ablations use identical variants; only input signals are removed
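One way to picture a signal-removal ablation is as dropping the corresponding `Field: value` line from the prompt while leaving everything else untouched. The sketch below does that. The helper `ablate_prompt` is hypothetical, and the README does not say whether the actual ablation datasets were built by deleting lines or by some other masking scheme.

```python
# Hypothetical ablation helper: removes selected "Field: value" lines from a
# prompt. The real ablation datasets may have been constructed differently.
def ablate_prompt(text, drop_fields):
    """Return the prompt without the lines for the given field names."""
    kept = [
        line
        for line in text.split("\n")
        if not any(line.startswith(f"{name}:") for name in drop_fields)
    ]
    return "\n".join(kept)


full_prompt = (
    "Below is a biological context regarding a genetic variant. "
    "Determine if it is Pathogenic or Benign.\n\n### Input:\n"
    "Gene: CDK8\n"
    "Variant: NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)\n"
    "Location: chr13:26385259 C>G\n"
    "gnomAD Frequency: 0.000000\n\n### Response:"
)

# Corresponds to the "Remove Gene + gnomAD AF" row of the ablation table.
ablated = ablate_prompt(full_prompt, {"Gene", "gnomAD Frequency"})
print(ablated)
```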


	![Screenshot 2026-01-09 095943](https://cdn-uploads.huggingface.co/production/uploads/68bf07a31d80a360f1405b72/nDk8ajy51JNTwYZ0X_9Sy.png)


	---

	## 📉 Ablation Study Results

	The table below summarizes performance when specific biological signals are removed.

	| Ablation Dataset | Accuracy | Pathogenic Precision | Pathogenic Recall | Pathogenic F1 | Key Interpretation |
	|------------------|----------|----------------------|-------------------|---------------|-------------------|
	| Baseline (All Signals) | ~0.98 | ~0.91 | ~0.96 | ~0.93 | Full biological context |
	| Remove Gene | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Gene name alone is not a shortcut |
	| Remove Variant ID | ~0.60 | ~0.22 | ~0.81 | ~0.34 | Variant identity is critical |
	| Remove Location | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Coordinates alone are not memorized |
	| Remove gnomAD AF | ~0.97 | ~0.82 | ~0.92 | ~0.87 | Population frequency is a strong signal |
	| Remove Variant ID + Location | ~0.59 | ~0.21 | ~0.81 | ~0.34 | Identity removal causes collapse |
	| Remove Gene + Location | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Gene names are contextual, not labels |
	| Remove Gene + Variant ID | ~0.60 | ~0.22 | ~0.81 | ~0.34 | Identity loss dominates |
	| Remove Location + gnomAD AF | ~0.97 | ~0.86 | ~0.90 | ~0.88 | Frequency + position matter jointly |
	| Remove Variant ID + gnomAD AF | ~0.51 | ~0.17 | ~0.72 | ~0.27 | Severe degradation |
	| Remove Gene + gnomAD AF | ~0.97 | ~0.87 | ~0.89 | ~0.88 | Frequency more important than gene name |

	---

	## 🔍 Detailed Result: Remove Gene + gnomAD AF

	- Precision (Pathogenic): 0.87
	- Recall (Pathogenic): 0.89
	- F1-score (Pathogenic): 0.88
	- Accuracy: 0.97

	Counts:
	- Pathogenic Correct: 6206
	- Pathogenic Missed: 785
	- Benign Correct: 47102
	- Benign Missed: 962

	Interpretation:
	Even without gene identity and population frequency, the model retains strong
	performance by leveraging remaining genomic structure.
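The reported metrics can be recomputed from the four counts in this section. The sketch below assumes that "Benign Missed" means benign variants predicted as Pathogenic (false positives for the Pathogenic class); the function name `binary_metrics` is illustrative.

```python
# Recompute Pathogenic-class metrics from confusion counts.
# Assumption: "Benign Missed" = benign variants predicted Pathogenic (fp).
def binary_metrics(tp, fn, tn, fp):
    """Return precision, recall, F1, and accuracy for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


metrics = binary_metrics(tp=6206, fn=785, tn=47102, fp=962)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these values match the figures listed above (0.87, 0.89, 0.88, 0.97).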

	---

	## 🧠 Key Ablation Insights

	- Variant identity is the dominant signal for pathogenicity classification
	- gnomAD allele frequency improves both pathogenic precision (~0.91 with it vs ~0.82 without) and recall (~0.96 vs ~0.92) but does not fully determine predictions
	- Gene names alone do not cause memorization or leakage
	- Performance degradation is predictable and biologically consistent
	- The model generalizes within a ClinVar-like SNV distribution
	- It is robust to literal string obfuscation

	---



	🔬 FAKE VARIANT ROBUSTNESS TEST (IDENTITY MASKING)
	------------------------------------------------

	Purpose:
	--------
	Validate that the model does NOT rely on memorized variant
	identifiers by testing on synthetically altered (fake) IDs.

	Test Setup:
	-----------
	- All variant identifiers (variant_id, join_key, HGVS Name)
	were prefixed with 'FAKE_'
	- The same replacement was applied inside the prompt text
	- No real or known variant IDs were visible to the model
	- Gene, genomic location, ref/alt alleles, and gnomAD AF
	were preserved
	- Dataset size unchanged (55,055 SNVs)
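The setup above can be sketched as a small transformation over one dataset row. This is an illustration under assumptions: `mask_identity` is a hypothetical helper, rows are treated as plain dictionaries with the fields from the sample record, and the script that actually produced `fake_variant.parquet` is not shown here.

```python
# Hypothetical masking helper: prefixes identifier fields with 'FAKE_' and
# applies the same replacement inside the prompt text.
ID_FIELDS = ("variant_id", "join_key", "Name")


def mask_identity(record, prefix="FAKE_"):
    """Return a copy of the record with variant identifiers masked."""
    masked = dict(record)
    for field in ID_FIELDS:
        masked[field] = prefix + record[field]
    text = record["text"]
    # Replace each distinct identifier once; in the sample record all three
    # identifier fields hold the same string.
    for original in {record[field] for field in ID_FIELDS}:
        text = text.replace(original, prefix + original)
    masked["text"] = text
    return masked


row = {
    "text": "Below is a biological context regarding a genetic variant. "
            "Determine if it is Pathogenic or Benign.\n\n### Input:\n"
            "Gene: SERPINH1\nVariant: NC_000011.10:g.75561608C>T\n\n### Response:",
    "variant_id": "NC_000011.10:g.75561608C>T",
    "join_key": "NC_000011.10:g.75561608C>T",
    "Name": "NC_000011.10:g.75561608C>T",
    "chrom": "11", "pos": "75561608", "ref": "C", "alt": "T", "gnomad_af": 0,
}
fake_row = mask_identity(row)
print(fake_row["text"])
```

Gene, coordinates, alleles, and allele frequency pass through unchanged, so any drop in performance can be attributed to the identifier strings.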

	Evaluation Results: fake_variant.parquet
	---------------------------------------

	The table below summarizes performance under full variant
	identity masking.

	| Class | Precision | Recall | F1-score | Support |
	|--------------|-----------|--------|----------|---------|
	| Pathogenic | 0.99 | 0.62 | 0.77 | 6,991 |
	| Benign | 0.95 | 1.00 | 0.97 | 48,064 |
	| Accuracy | | | 0.95 | 55,055 |
	| Macro Avg | 0.97 | 0.81 | 0.87 | 55,055 |
	| Weighted Avg | 0.95 | 0.95 | 0.95 | 55,055 |

	Breakdown:
	----------
	- Pathogenic → Correct: 4,366 | Missed: 2,625
	- Benign → Correct: 48,008 | Missed: 56

	Interpretation:
	---------------
	- Benign recall remains near-perfect, indicating strong
	generalization even without variant identity.

	- Pathogenic recall drops when identity information is
	removed, which is expected and biologically consistent.

	- Very high pathogenic precision (0.99) shows that when
	the model predicts 'Pathogenic' under identity masking,
	the prediction is almost always correct.

	Conclusion:
	-----------
	This robustness test confirms that:
	- The model is NOT solely dependent on memorized IDs
	- Variant identity is a strong but non-exclusive signal
	- Meaningful pathogenicity patterns are learned from
	genomic and population-level context



	---

	## 🚀 Intended Use

	PathoPreter is purpose-built for ClinVar-style SNV interpretation, where decisions must be made from limited but widely available variant-level genomic context. Within this niche, the model performs strongly on the held-out ClinVar-derived test set (about 0.98 accuracy and 0.93 pathogenic F1 with full context) by combining large-scale curated ClinVar data, gnomAD population signals, and robustness and ablation testing.

	Optimized for fast, large-batch inference on accessible hardware, PathoPreter offers a practical alternative to heavyweight foundation models. Its predictable degradation under ablation and its resilience to identifier masking suggest that it learns meaningful biological patterns rather than relying purely on memorization, which makes it suited to research variant prioritization, curation support, and benchmarking workflows.

	Appropriate:
	- Research variant prioritization
	- Benchmarking pathogenicity models
	- Educational genomics experiments
	- Early risk flagging and ordering of variants for follow-up testing

	Inappropriate:
	- Clinical diagnosis
	- Medical decision-making
	- Patient-facing applications

	---
	<div style="text-align: center;">
	<h2>
	TO USE THIS MODEL</h2> </div>

	```python
	# Install dependencies first (in a notebook, prefix the command with "!"):
	#   pip install transformers accelerate peft bitsandbytes   # bitsandbytes-0.49.1
	from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
	from peft import PeftModel

	base_model = "unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit"  # same base the adapter was trained on
	adapter_repo = "YADAV0206/PathoPreter-4B-SNV-Pathogen-ClinVar-gnomAD"

	# Load the tokenizer from the adapter repo (it contains added tokens + merges)
	tok = AutoTokenizer.from_pretrained(adapter_repo)

	# The base checkpoint is already stored in bitsandbytes 4-bit format, so its
	# quantization config is picked up automatically when loading.
	model = AutoModelForCausalLM.from_pretrained(
	    base_model,
	    device_map="auto",
	)

	# Load LoRA adapter weights on top of the base model
	model = PeftModel.from_pretrained(model, adapter_repo)
	model.eval()

	# The model is already placed on devices above, so no device_map is passed here
	pipe = pipeline("text-generation", model=model, tokenizer=tok)

	# This sample should output "Pathogenic"
	sample = {
	    "text": "Below is a biological context regarding a genetic variant. Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: CDK8\nVariant: NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)\nLocation: chr13:26385259 C>G\ngnomAD Frequency: 0.000000\n\n### Response:",
	    "variant_id": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
	    "join_key": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
	    "Name": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
	    "Assembly": "GRCh38",
	    "chrom": "13",
	    "pos": "26385259",
	    "ref": "C",
	    "alt": "G",
	    "gnomad_af": 0,
	}

	# Only the "text" field is sent to the model; the other fields are metadata
	prompt = sample["text"]
	print(prompt)

	out = pipe(
	    prompt,
	    max_new_tokens=50,
	    do_sample=False,         # greedy decoding keeps the predicted label deterministic
	    return_full_text=False,  # return only the completion, not the prompt
	)[0]["generated_text"]

	print("\n------ OUTPUT ------")
	print(out)  # expected: Pathogenic
	```
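For batch evaluation it helps to reduce the generated text to a single label. The sketch below is one possible post-processing step, not part of the released code. The helper `extract_label` is hypothetical, and it strips the prompt first because the instruction text itself contains the words "Pathogenic" and "Benign".

```python
import re


# Hypothetical post-processing helper: maps generated text to a label.
def extract_label(generated, prompt=""):
    """Return 'Pathogenic', 'Benign', or None if no label is found."""
    # If the output still starts with the prompt, look only at the completion.
    if prompt and generated.startswith(prompt):
        generated = generated[len(prompt):]
    match = re.search(r"\b(Pathogenic|Benign)\b", generated, flags=re.IGNORECASE)
    return match.group(1).capitalize() if match else None


example_prompt = (
    "Below is a biological context regarding a genetic variant. "
    "Determine if it is Pathogenic or Benign.\n\n### Response:"
)
print(extract_label(example_prompt + " Benign", example_prompt))
print(extract_label(" pathogenic"))
print(extract_label("unclear"))
```

Returning `None` for unparseable outputs keeps them visible in downstream counts instead of silently assigning them to a class.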

	---

	## 👨‍💻 Author

	Rohit Yadav
	National Institute of Technology, Jalandhar
	<div>
	<a href="mailto:yrohit1825@gmail.com">yrohit1825@gmail.com</a> | <a href="https://github.com/YADAV1825/PathoPreter">GitHub: https://github.com/YADAV1825/PathoPreter</a>
	</div>

	---

	## 🔗 References

	- ClinVar (NCBI)
	- gnomAD
	- unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
	- Unsloth
	- PEFT 0.18.0