---
model_name: PathoPreter-4B-SNV
base_model:
- unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
library_name: peft
pipeline_tag: text-generation
language:
- en
tags:
- genomics
- variant-classification
- pathogenicity
- clinvar
- gnomad
- lora
- unsloth
- gguf
- research
datasets:
- clinvar
- gnomad
metrics:
- accuracy
- precision
- recall
- f1
license: apache-2.0
---

<h1 style="font-size: 56px; text-align: center;">
PathoPreter-4B-SNV
</h1>

<div style="text-align: center;">
(Base: unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit)
</div>
<div style="text-align: center;">
An LLM for SNV Pathogenicity Classification
</div>

<div style="font-size: 25px;text-decoration-color:blue;text-align: center;">
<a href="https://github.com/YADAV1825/PathoPreter">GitHub: https://github.com/YADAV1825/PathoPreter</a>
</div>

---

## ⚠️ IMPORTANT DISCLAIMER

PathoPreter is a research-only predictive model.
It is NOT a clinical diagnostic system or a diagnostic device. It is a research tool for risk prioritization that outputs "Pathogenic Indication" signals to assist in workflow prioritization.
It should never be used to confirm a diagnosis or determine medical treatment without expert ACMG review.

---

## 🔬 Model Summary

PathoPreter-4B-SNV is a parameter-efficient large language model fine-tuned to classify
Single Nucleotide Variants (SNVs) as Pathogenic or Benign using structured genomic context.

- Base model: unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
- Fine-tuning: LoRA via Unsloth
- **Training data: 1.2 million SNVs**
- Genome assembly: GRCh38
- Scope: SNVs only
- Deployment: safetensors + GGUF (Q4/Q8)

Strict HGVS-level and coordinate-level train–test disjointness was enforced.

---

## 🧬 Dataset Description

The dataset consists exclusively of **Single Nucleotide Variants (SNVs)** curated from **ClinVar** and augmented with **gnomAD population allele frequency**.

<div><h3 align="center">The training dataset contains approximately 144k pathogenic variants and 1.05 million benign variants, for a total of about 1.2 million samples. The test set has 55k distinct samples, evaluated with 11 separate ablation tests on the same 55k rows, for around 12 × 55k = 660k evaluated rows in total.</h3></div>

<h3 align="center">A sample of 12k training rows and 1.2k test rows has been uploaded to give an idea of the dataset format.</h3>

<div>GitHub links: <a href="https://github.com/YADAV1825/PathoPreter/blob/main/sample_train.parquet">Train</a> | <a href="https://github.com/YADAV1825/PathoPreter/blob/main/sample_test.parquet">Test</a>.</div>
<div>Hugging Face link: <a href="https://huggingface.co/datasets/YADAV0206/Clinvar_SNV_GnomAD_Pathogen_Variants">https://huggingface.co/datasets/YADAV0206/Clinvar_SNV_GnomAD_Pathogen_Variants</a></div>

A single row of the Parquet data, shown as JSON:

```json
{"text":"Below is a biological context regarding a genetic variant. Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: SERPINH1\nVariant: NC_000011.10:g.75561608C>T\n\n### Response:","clean_label":"Pathogenic","variant_id":"NC_000011.10:g.75561608C>T","join_key":"NC_000011.10:g.75561608C>T","Name":"NC_000011.10:g.75561608C>T","Assembly":"GRCh38","chrom":"11","pos":"75561608","ref":"C","alt":"T","gnomad_af":0}
```
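
The `text` field of each row is a fixed prompt template filled with the variant's fields. As a minimal sketch, the full-context prompt used in the usage example of this card can be assembled as follows. The function name `build_prompt` is hypothetical, and the six-decimal formatting of the allele frequency is inferred from the single `0.000000` example, so treat both as assumptions rather than the author's actual preprocessing code.

```python
def build_prompt(gene, variant, chrom, pos, ref, alt, gnomad_af):
    """Assemble a full-context prompt in the layout of this card's usage example.

    Hypothetical helper: the six-decimal allele-frequency format is an
    assumption inferred from the example value "0.000000".
    """
    return (
        "Below is a biological context regarding a genetic variant. "
        "Determine if it is Pathogenic or Benign.\n\n"
        "### Input:\n"
        f"Gene: {gene}\n"
        f"Variant: {variant}\n"
        f"Location: chr{chrom}:{pos} {ref}>{alt}\n"
        f"gnomAD Frequency: {gnomad_af:.6f}\n\n"
        "### Response:"
    )


# Rebuild the prompt of the CDK8 sample from its structured fields.
prompt = build_prompt(
    gene="CDK8",
    variant="NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    chrom="13",
    pos="26385259",
    ref="C",
    alt="G",
    gnomad_af=0,
)
print(prompt)
```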

Included signals:
- Gene symbol
- HGVS transcript-level variant (join_key)
- Genomic coordinates (chrom, pos, ref, alt)
- gnomAD allele frequency

Excluded:
- Variants of Uncertain Significance (VUS)
- Conflicting interpretations
- Indels, CNVs, structural variants

Dataset integrity guarantees:
- No issues detected under automated audits: zero duplicate variants, zero label leakage, and zero train–test overlap
- Allele frequency sanity checks passed
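
The overlap part of such an audit can be illustrated with plain Python sets. This is only a sketch of the idea, not the audit scripts used for PathoPreter: the helper names `coordinate_key` and `audit_disjoint` are hypothetical, and the two toy rows are the example records shown in this card, arbitrarily assigned to train and test.

```python
def coordinate_key(row):
    """Coordinate-level identity of a variant: (chrom, pos, ref, alt)."""
    return (row["chrom"], row["pos"], row["ref"], row["alt"])


def audit_disjoint(train_rows, test_rows):
    """Return (HGVS overlaps, coordinate overlaps) between train and test."""
    train_hgvs = {row["join_key"] for row in train_rows}
    train_coords = {coordinate_key(row) for row in train_rows}
    hgvs_overlap = {
        row["join_key"] for row in test_rows if row["join_key"] in train_hgvs
    }
    coord_overlap = {
        coordinate_key(row)
        for row in test_rows
        if coordinate_key(row) in train_coords
    }
    return hgvs_overlap, coord_overlap


# Toy rows for illustration only (hypothetical train/test assignment).
train = [{"join_key": "NC_000011.10:g.75561608C>T",
          "chrom": "11", "pos": "75561608", "ref": "C", "alt": "T"}]
test = [{"join_key": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
         "chrom": "13", "pos": "26385259", "ref": "C", "alt": "G"}]

hgvs_overlap, coord_overlap = audit_disjoint(train, test)
print(len(hgvs_overlap), len(coord_overlap))
```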

---

## 📦 Dataset Availability

<div>Dataset Construction and Availability:</div>

The datasets used to train and evaluate PathoPreter (including large-scale ClinVar-derived SNV corpora, controlled ablation test suites, and robustness evaluation datasets) were fully constructed in-house using publicly available, permissively licensed genomic resources such as ClinVar and gnomAD. All upstream sources are properly credited and explicitly permit commercial use and redistribution.

Significant original engineering and curation effort was applied beyond raw data usage. This included large-scale extraction, normalization, schema unification, quality control, deduplication, and strict train–test disjointness enforcement. Approximately 8 million raw ClinVar variants and ~250 GB of gnomAD VCF data were processed and merged into production-grade Parquet datasets optimized for large-scale analytics, machine learning training, and downstream integration.

The end-to-end data construction process required approximately 150 hours of compute time and over 250 hours of expert engineering and curation work. The resulting datasets constitute a high-value derived data asset, distinct from the original source distributions.

These datasets are available for licensed distribution to startups, enterprises, and research organizations for use in applied genomics, AI/ML model development, benchmarking, variant prioritization workflows, and internal research. Commercial licensing, redistribution terms, and support options are available upon request.

Available components (in both Parquet and CSV):
- Large-scale ClinVar-style SNV training dataset
- Held-out test set with identical variants across ablations
- Controlled ablation datasets (signal removal studies)
- Fake-variant robustness evaluation dataset (the FAKE VARIANT ROBUSTNESS TEST section explains why it matters)
- Balanced CSV subsets suitable for classical ML training
- Data audit and leakage-verification scripts

If you are interested in:
- dataset licensing
- research or industry use
- collaboration or benchmarking
- reproducing or extending this work

please contact:

**Rohit Yadav**

<div>
<a href="mailto:yrohit1825@gmail.com">yrohit1825@gmail.com</a> | <a href="https://github.com/YADAV1825/PathoPreter">GitHub: https://github.com/YADAV1825/PathoPreter</a>
</div>

Requests are evaluated on a case-by-case basis.

---

## 📊 Evaluation Protocol

- Held-out test set: 55,055 (55k) unseen SNVs
- Total evaluation variants across ablations: ~600k (11 datasets × 55,055 = 605,605)
- One full-context baseline + ten controlled ablation datasets
- All ablations use identical variants; only input signals are removed
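
To make "only input signals are removed" concrete, the sketch below drops selected field lines from a full-context prompt while leaving the variant itself unchanged. It assumes the prompt layout shown in this card's usage example; `ablate_prompt` is a hypothetical helper, not the script that produced the ablation datasets.

```python
def ablate_prompt(text, drop):
    """Remove every prompt line whose field label (text before ':') is in `drop`."""
    kept = []
    for line in text.split("\n"):
        label = line.split(":", 1)[0]
        if label in drop:
            continue
        kept.append(line)
    return "\n".join(kept)


full_prompt = (
    "Below is a biological context regarding a genetic variant. "
    "Determine if it is Pathogenic or Benign.\n\n"
    "### Input:\n"
    "Gene: CDK8\n"
    "Variant: NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)\n"
    "Location: chr13:26385259 C>G\n"
    "gnomAD Frequency: 0.000000\n\n"
    "### Response:"
)

# "Remove Gene + gnomAD AF" ablation: same variant, two signals removed.
ablated = ablate_prompt(full_prompt, drop={"Gene", "gnomAD Frequency"})
print(ablated)
```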

![image](image.png)

---

## 📉 Ablation Study Results

The table below summarizes performance when specific biological signals are removed.

| Ablation Dataset | Accuracy | Pathogenic Precision | Pathogenic Recall | Pathogenic F1 | Key Interpretation |
|------------------|----------|----------------------|-------------------|---------------|-------------------|
| Baseline (All Signals) | ~0.98 | ~0.91 | ~0.96 | ~0.93 | Full biological context |
| Remove Gene | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Gene name alone is not a shortcut |
| Remove Variant ID | ~0.60 | ~0.22 | ~0.81 | ~0.34 | Variant identity is critical |
| Remove Location | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Coordinates alone are not memorized |
| Remove gnomAD AF | ~0.97 | ~0.82 | ~0.92 | ~0.87 | Population frequency is a strong signal |
| Remove Variant ID + Location | ~0.59 | ~0.21 | ~0.81 | ~0.34 | Identity removal causes collapse |
| Remove Gene + Location | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Gene names are contextual, not labels |
| Remove Gene + Variant ID | ~0.60 | ~0.22 | ~0.81 | ~0.34 | Identity loss dominates |
| Remove Location + gnomAD AF | ~0.97 | ~0.86 | ~0.90 | ~0.88 | Frequency + position matter jointly |
| Remove Variant ID + gnomAD AF | ~0.51 | ~0.17 | ~0.72 | ~0.27 | Severe degradation |
| Remove Gene + gnomAD AF | ~0.97 | ~0.87 | ~0.89 | ~0.88 | Frequency more important than gene name |


---

## 🔍 Detailed Result: Remove Gene + gnomAD AF

- Precision (Pathogenic): 0.87
- Recall (Pathogenic): 0.89
- F1-score (Pathogenic): 0.88
- Accuracy: 0.97

- Pathogenic Correct: 6206
- Pathogenic Missed (predicted Benign): 785
- Benign Correct: 47102
- Benign Missed (predicted Pathogenic): 962

Interpretation:
Even without gene identity and population frequency, the model retains strong
performance by leveraging remaining genomic structure.
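
The reported metrics follow directly from the four counts above. The short check below recomputes them, assuming that "Benign Missed" means benign variants predicted as Pathogenic (false positives for the Pathogenic class); the function name is illustrative.

```python
def pathogenic_metrics(tp, fn, tn, fp):
    """Precision, recall, F1 for the Pathogenic class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return precision, recall, f1, accuracy


# Counts from the "Remove Gene + gnomAD AF" ablation.
precision, recall, f1, accuracy = pathogenic_metrics(
    tp=6206,   # Pathogenic Correct
    fn=785,    # Pathogenic Missed
    tn=47102,  # Benign Correct
    fp=962,    # Benign Missed (assumed: predicted Pathogenic)
)
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.87 recall=0.89 f1=0.88 accuracy=0.97
```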

---

## 🧠 Key Ablation Insights

- Variant identity is the dominant signal for pathogenicity classification
- gnomAD allele frequency improves both precision and recall but does not fully determine predictions
- Gene names alone do not cause memorization or leakage
- Performance degradation is predictable and biologically consistent
- The model generalizes within a ClinVar-like SNV distribution
- The model is robust to literal string obfuscation

---

## 🔬 FAKE VARIANT ROBUSTNESS TEST (IDENTITY MASKING)

Purpose:

Validate that the model does NOT rely on memorized variant identifiers by testing on synthetically altered (fake) IDs.

Test Setup:
- All variant identifiers (variant_id, join_key, HGVS Name) were prefixed with 'FAKE_'
- The same replacement was applied inside the prompt text
- No real or known variant IDs were visible to the model
- Gene, genomic location, ref/alt alleles, and gnomAD AF were preserved
- Dataset size unchanged (55,055 SNVs)

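
The masking step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the script that produced `fake_variant.parquet`: `mask_identity` is a hypothetical helper, and it assumes the three identifier fields hold the same string (as in the example rows of this card) or at least do not contain one another.

```python
ID_FIELDS = ("variant_id", "join_key", "Name")


def mask_identity(record, prefix="FAKE_"):
    """Return a copy of `record` with identifiers prefixed, also inside the prompt."""
    masked = dict(record)
    # Assumption: distinct identifiers are not substrings of each other.
    for original in {record[field] for field in ID_FIELDS}:
        masked["text"] = masked["text"].replace(original, prefix + original)
    for field in ID_FIELDS:
        masked[field] = prefix + record[field]
    return masked


hgvs = "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)"
record = {
    "text": (
        "### Input:\nGene: CDK8\n"
        f"Variant: {hgvs}\n"
        "Location: chr13:26385259 C>G\n"
        "gnomAD Frequency: 0.000000\n\n### Response:"
    ),
    "variant_id": hgvs,
    "join_key": hgvs,
    "Name": hgvs,
}

masked = mask_identity(record)
print(masked["text"])
```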

Evaluation Results: fake_variant.parquet

The table below summarizes performance under full variant identity masking.

| Class | Precision | Recall | F1-score | Support |
|--------------|-----------|--------|----------|---------|
| Pathogenic | 0.99 | **0.62** | 0.77 | 6,991 |
| Benign | 0.95 | 1.00 | 0.97 | 48,064 |
| **Accuracy** | | | **0.95** | 55,055 |
| Macro Avg | 0.97 | 0.81 | 0.87 | 55,055 |
| Weighted Avg | 0.95 | 0.95 | 0.95 | 55,055 |

Breakdown:
- Pathogenic → Correct: 4,366 | Missed: 2,625
- Benign → Correct: 48,008 | Missed: 56

Interpretation:
- Benign recall remains near-perfect, indicating strong generalization even without variant identity.
- Pathogenic recall **drops** (to 0.62) when identity information is removed, which is expected and biologically consistent.
- Very high pathogenic precision (0.99) shows that when the model predicts 'Pathogenic' under identity masking, the prediction is almost always correct.

Conclusion:

This robustness test indicates that:
- The model is NOT solely dependent on memorized IDs
- Variant identity is a strong but non-exclusive signal
- Meaningful pathogenicity patterns are learned from genomic and population-level context

---

## 🚀 Intended Use

PathoPreter is purpose-built for ClinVar-style SNV interpretation, where decisions must be made from limited but widely available variant-level genomic context. Within this niche, the model reaches about 0.98 accuracy and 0.93 pathogenic F1 on the 55k-variant held-out test set, building on large-scale curated ClinVar data, gnomAD population signals, and robustness and ablation testing.

Optimized for fast, large-batch inference on accessible hardware, PathoPreter offers a practical alternative to heavyweight foundation models. Its predictable degradation under ablation and its behavior under identifier masking indicate that it learns meaningful biological patterns rather than relying solely on memorization, which makes it well-suited for research variant prioritization, curation support, and benchmarking workflows.

Appropriate:
- Research variant prioritization
- Benchmarking pathogenicity models
- Educational genomics experiments
- Early risk flagging and ordering of variants for follow-up testing

Inappropriate:
- Clinical diagnosis
- Medical decision-making
- Patient-facing applications

---

<div style="text-align: center;">
<h2>
TO USE THIS MODEL</h2> </div>

```python
# Install dependencies first (in a notebook, prefix the command with "!"):
#   pip install transformers accelerate peft bitsandbytes  # bitsandbytes-0.49.1
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel

base_model = "unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit"  # same base the adapter was trained on
adapter_repo = "YADAV0206/PathoPreter-4B-SNV-Pathogen-ClinVar-gnomAD"

# Load tokenizer from the adapter repo (it contains added tokens + merges)
tok = AutoTokenizer.from_pretrained(adapter_repo)

# Load the base model in 4-bit to save RAM. This checkpoint is already
# quantized with bitsandbytes, so its stored quantization config is used
# and the deprecated load_in_4bit=True argument is not needed.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
)

# Load LoRA adapter weights
model = PeftModel.from_pretrained(model, adapter_repo)

# The model was already placed on devices above, so the pipeline
# does not take its own device_map.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tok,
)

# This sample should be classified as Pathogenic
sample = {
    "text": "Below is a biological context regarding a genetic variant. Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: CDK8\nVariant: NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)\nLocation: chr13:26385259 C>G\ngnomAD Frequency: 0.000000\n\n### Response:",
    "variant_id": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    "join_key": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    "Name": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    "Assembly": "GRCh38",
    "chrom": "13",
    "pos": "26385259",
    "ref": "C",
    "alt": "G",
    "gnomad_af": 0,
}

# Only the "text" field is sent to the model
prompt = sample["text"]
print(prompt)

# Expected output: Pathogenic
out = pipe(
    prompt,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.2,
)[0]["generated_text"]

print("\n------ OUTPUT ------")
print(out)
```
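
By default the text-generation pipeline returns the prompt followed by the completion, and the prompt itself contains both words "Pathogenic" and "Benign". A small post-processing step is therefore useful before comparing predictions with labels. The helper below is a sketch: `extract_label` is a hypothetical name, and it assumes the model's answer appears after the last `### Response:` marker, which this card does not document explicitly.

```python
def extract_label(generated_text):
    """Return 'Pathogenic', 'Benign', or None from pipeline output.

    Assumption: the answer follows the last '### Response:' marker.
    """
    completion = generated_text.rsplit("### Response:", 1)[-1].lower()
    hits = [
        (completion.find(name.lower()), name)
        for name in ("Pathogenic", "Benign")
        if name.lower() in completion
    ]
    # Pick whichever label is mentioned first in the completion.
    return min(hits)[1] if hits else None


example_output = (
    "Determine if it is Pathogenic or Benign.\n\n"
    "### Input:\nGene: CDK8\n\n### Response: Pathogenic"
)
print(extract_label(example_output))
```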

---

## 👨‍💻 Author

Rohit Yadav
National Institute of Technology, Jalandhar
<div>
<a href="mailto:yrohit1825@gmail.com">yrohit1825@gmail.com</a> | <a href="https://github.com/YADAV1825/PathoPreter">GitHub: https://github.com/YADAV1825/PathoPreter</a>
</div>

---

## 🔗 References

- ClinVar (NCBI)
- gnomAD
- unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
- Unsloth
- PEFT 0.18.0