PathoPreter-4B-SNV
⚠️ IMPORTANT DISCLAIMER
PathoPreter is a research-only predictive model for risk prioritization. It is NOT a clinical diagnostic system or diagnostic device. It outputs "Pathogenic Indication" signals to assist in workflow prioritization. It should never be used to confirm a diagnosis or determine medical treatment without expert ACMG review.
🔬 Model Summary
PathoPreter-4B-SNV is a parameter-efficient large language model fine-tuned to classify Single Nucleotide Variants (SNVs) as Pathogenic or Benign using structured genomic context.
- Base model: unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
- Fine-tuning: LoRA via Unsloth
- Training data: 1.2 million SNVs
- Genome assembly: GRCh38
- Scope: SNVs only
- Deployment: safetensors + GGUF (Q4/Q8)
Strict HGVS-level and coordinate-level train-test disjointness was enforced.
🧬 Dataset Description
The dataset consists exclusively of Single Nucleotide Variants (SNVs) curated from ClinVar and augmented with gnomAD population allele frequency.
The training dataset contains approximately 144k pathogenic variants and 1.05 million benign variants, for a total of about 1.2 million samples. The test set contains 55k distinct samples, with 11 separate ablation tests run on the same 55k rows, for roughly 12 × 55k = 660k evaluation rows.
I have uploaded a sample of 12k training rows and 1.2k test rows to give an idea of the dataset format. An example record:
{"text":"Below is a biological context regarding a genetic variant. Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: SERPINH1\nVariant: NC_000011.10:g.75561608C>T\n\n### Response:","clean_label":"Pathogenic","variant_id":"NC_000011.10:g.75561608C>T","join_key":"NC_000011.10:g.75561608C>T","Name":"NC_000011.10:g.75561608C>T","Assembly":"GRCh38","chrom":"11","pos":"75561608","ref":"C","alt":"T","gnomad_af":0}
Included signals:
- Gene symbol
- HGVS transcript-level variant (join_key)
- Genomic coordinates (chrom, pos, ref, alt)
- gnomAD allele frequency
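The example record above and the usage prompt later in this card suggest a fixed instruction-style template. The following is a minimal sketch of assembling that prompt from the listed signals. `build_prompt` and its parameter names are hypothetical, and the `Location` and `gnomAD Frequency` line formats are copied from the usage prompt in this card rather than from the actual training pipeline.

```python
# Hypothetical helper: rebuilds the prompt layout seen in this card's examples.
INSTRUCTION = (
    "Below is a biological context regarding a genetic variant. "
    "Determine if it is Pathogenic or Benign."
)

def build_prompt(gene, variant, chrom=None, pos=None, ref=None, alt=None, gnomad_af=None):
    """Assemble the prompt; the Location and frequency lines are optional."""
    lines = [f"Gene: {gene}", f"Variant: {variant}"]
    if None not in (chrom, pos, ref, alt):
        lines.append(f"Location: chr{chrom}:{pos} {ref}>{alt}")
    if gnomad_af is not None:
        lines.append(f"gnomAD Frequency: {gnomad_af:.6f}")
    return f"{INSTRUCTION}\n\n### Input:\n" + "\n".join(lines) + "\n\n### Response:"

prompt = build_prompt(
    "CDK8", "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    chrom="13", pos="26385259", ref="C", alt="G", gnomad_af=0.0,
)
print(prompt)
```

Calling `build_prompt` with only a gene and a variant reproduces the shorter prompt shown in the example record.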
Excluded:
- Variants of Uncertain Significance (VUS)
- Conflicting interpretations
- Indels, CNVs, structural variants
Dataset integrity guarantees:
- No detectable label leakage under automated audits (i.e., zero duplicate variants, zero label leakage, and zero train-test overlap)
- Allele frequency sanity checks passed
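The audit scripts themselves are not shown in this card. As a rough sketch of what an HGVS-level and coordinate-level disjointness check could look like, the function below counts duplicates and train-test overlap using the field names from the example record (`join_key`, `chrom`, `pos`, `ref`, `alt`). `audit_split` is a hypothetical name and the toy rows are made up for illustration.

```python
def audit_split(train_rows, test_rows):
    """Count duplicates and train/test overlap for two lists of record dicts."""
    def hgvs_key(row):
        return row["join_key"]

    def coord_key(row):
        return (row["chrom"], row["pos"], row["ref"], row["alt"])

    train_hgvs = {hgvs_key(r) for r in train_rows}
    train_coords = {coord_key(r) for r in train_rows}
    test_hgvs = {hgvs_key(r) for r in test_rows}
    return {
        "train_duplicates": len(train_rows) - len(train_hgvs),
        "test_duplicates": len(test_rows) - len(test_hgvs),
        "hgvs_overlap": sum(hgvs_key(r) in train_hgvs for r in test_rows),
        "coord_overlap": sum(coord_key(r) in train_coords for r in test_rows),
    }

# Toy rows (made up): the second test row reuses a training coordinate under a different name.
train = [
    {"join_key": "NC_000011.10:g.75561608C>T", "chrom": "11", "pos": "75561608", "ref": "C", "alt": "T"},
    {"join_key": "TOY:g.100A>G", "chrom": "1", "pos": "100", "ref": "A", "alt": "G"},
]
test = [
    {"join_key": "TOY:g.200C>A", "chrom": "2", "pos": "200", "ref": "C", "alt": "A"},
    {"join_key": "TOY_ALIAS:g.100A>G", "chrom": "1", "pos": "100", "ref": "A", "alt": "G"},
]
report = audit_split(train, test)
print(report)
```

In the toy data the name-level check finds no overlap while the coordinate-level check flags one row, which is why checking both levels is useful.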
📦 Dataset Availability
Available components (in both Parquet and CSV) include:
- Large-scale ClinVar-style SNV training dataset
- Held-out test set with identical variants across ablations
- Controlled ablation datasets (signal removal studies)
- Fake-variant robustness evaluation dataset (see FAKE VARIANT ROBUSTNESS TEST below for why it matters)
- Balanced CSV subsets suitable for classical ML training
- Data audit and leakage-verification scripts
If you are interested in:
- dataset licensing
- research or industry use
- collaboration or benchmarking
- reproducing or extending this work
please contact:
Rohit Yadav
Requests are evaluated on a case-by-case basis.
Evaluation Protocol
- Held-out test set: 55,055 (55k) unseen SNVs
- Total evaluation variants across ablations: ~600k
- One full-context baseline + ten controlled ablation datasets
- All ablations use identical variants; only input signals are removed
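The card does not say how a signal is removed from the input. One plausible reading is that the corresponding `Field: value` line is dropped from the prompt while everything else stays identical. The sketch below works under that assumption; `ablate_prompt` is a hypothetical helper, and the field labels are taken from the usage prompt in this card.

```python
FULL_PROMPT = (
    "Below is a biological context regarding a genetic variant. "
    "Determine if it is Pathogenic or Benign.\n\n"
    "### Input:\n"
    "Gene: CDK8\n"
    "Variant: NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)\n"
    "Location: chr13:26385259 C>G\n"
    "gnomAD Frequency: 0.000000\n\n"
    "### Response:"
)

def ablate_prompt(prompt, remove):
    """Drop every 'Field: value' line whose field label is in `remove`."""
    kept = [
        line for line in prompt.split("\n")
        if line.split(":", 1)[0] not in remove
    ]
    return "\n".join(kept)

# Assumed equivalent of the "Remove Gene + gnomAD AF" ablation.
ablated = ablate_prompt(FULL_PROMPT, {"Gene", "gnomAD Frequency"})
print(ablated)
```

Because the variant rows are unchanged, any difference in metrics between two ablations can be attributed to the removed lines alone.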
Ablation Study Results
The table below summarizes performance when specific biological signals are removed.
| Ablation Dataset | Accuracy | Pathogenic Precision | Pathogenic Recall | Pathogenic F1 | Key Interpretation |
|---|---|---|---|---|---|
| Baseline (All Signals) | ~0.98 | ~0.91 | ~0.96 | ~0.93 | Full biological context |
| Remove Gene | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Gene name alone is not a shortcut |
| Remove Variant ID | ~0.60 | ~0.22 | ~0.81 | ~0.34 | Variant identity is critical |
| Remove Location | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Coordinates alone are not memorized |
| Remove gnomAD AF | ~0.97 | ~0.82 | ~0.92 | ~0.87 | Population frequency is a strong signal |
| Remove Variant ID + Location | ~0.59 | ~0.21 | ~0.81 | ~0.34 | Identity removal causes collapse |
| Remove Gene + Location | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Gene names are contextual, not labels |
| Remove Gene + Variant ID | ~0.60 | ~0.22 | ~0.81 | ~0.34 | Identity loss dominates |
| Remove Location + gnomAD AF | ~0.97 | ~0.86 | ~0.90 | ~0.88 | Frequency + position matter jointly |
| Remove Variant ID + gnomAD AF | ~0.51 | ~0.17 | ~0.72 | ~0.27 | Severe degradation |
| Remove Gene + gnomAD AF | ~0.97 | ~0.87 | ~0.89 | ~0.88 | Frequency more important than gene name |
Detailed Result: Remove Gene + gnomAD AF
Precision (Pathogenic): 0.87
Recall (Pathogenic): 0.89
F1-score (Pathogenic): 0.88
Accuracy: 0.97
Pathogenic Correct: 6206
Pathogenic Missed: 785
Benign Correct: 47102
Benign Missed: 962
Interpretation: Even without gene identity and population frequency, the model retains strong performance by leveraging remaining genomic structure.
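The reported metrics can be recomputed from the four counts above. This sketch assumes that "Benign Missed" means benign variants predicted as Pathogenic (false positives for the Pathogenic class), a reading that reproduces the reported precision. The function name is illustrative.

```python
def pathogenic_metrics(tp, fn, tn, fp):
    """Precision, recall, F1 (Pathogenic class) and overall accuracy from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Counts from the "Remove Gene + gnomAD AF" breakdown.
m = pathogenic_metrics(tp=6206, fn=785, tn=47102, fp=962)
print({name: round(value, 2) for name, value in m.items()})
# {'precision': 0.87, 'recall': 0.89, 'f1': 0.88, 'accuracy': 0.97}
```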
🧠 Key Ablation Insights
- Variant identity is the dominant signal for pathogenicity classification
- gnomAD allele frequency strongly improves recall but does not fully determine predictions
- Gene names alone do not cause memorization or leakage
- Performance degradation is predictable and biologically consistent
- The model generalizes within a ClinVar-like SNV distribution
- The model is robust to literal string obfuscation
🔬 FAKE VARIANT ROBUSTNESS TEST (IDENTITY MASKING)
Purpose:
Validate that the model does NOT rely on memorized variant identifiers by testing on synthetically altered (fake) IDs.
Test Setup:
- All variant identifiers (variant_id, join_key, HGVS Name) were prefixed with 'FAKE_'
- The same replacement was applied inside the prompt text
- No real or known variant IDs were visible to the model
- Gene, genomic location, ref/alt alleles, and gnomAD AF were preserved
- Dataset size unchanged (55,055 SNVs)
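A minimal sketch of the masking step described in this setup is shown below. It assumes, as in the example record in this card, that `variant_id`, `join_key`, and `Name` hold the same string and that this string appears verbatim in the prompt text. `mask_identity` is a hypothetical helper, not the author's script.

```python
ID_FIELDS = ("variant_id", "join_key", "Name")

def mask_identity(record, prefix="FAKE_"):
    """Prefix all identifier fields and the same identifier inside the prompt text."""
    masked = dict(record)
    original = record["variant_id"]  # assumed identical across the three ID fields
    for field in ID_FIELDS:
        masked[field] = prefix + record[field]
    masked["text"] = record["text"].replace(original, prefix + original)
    return masked

record = {
    "text": "Below is a biological context regarding a genetic variant. "
            "Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: SERPINH1\n"
            "Variant: NC_000011.10:g.75561608C>T\n\n### Response:",
    "variant_id": "NC_000011.10:g.75561608C>T",
    "join_key": "NC_000011.10:g.75561608C>T",
    "Name": "NC_000011.10:g.75561608C>T",
    "chrom": "11", "pos": "75561608", "ref": "C", "alt": "T", "gnomad_af": 0,
}
fake = mask_identity(record)
print(fake["text"])
```

Gene, coordinates, alleles, and allele frequency are copied through untouched, matching the setup in which only identifiers change.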
Evaluation Results: fake_variant.parquet
The table below summarizes performance under full variant identity masking.
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Pathogenic | 0.99 | 0.62 | 0.77 | 6,991 |
| Benign | 0.95 | 1.00 | 0.97 | 48,064 |
| Accuracy | | | 0.95 | 55,055 |
| Macro Avg | 0.97 | 0.81 | 0.87 | 55,055 |
| Weighted Avg | 0.95 | 0.95 | 0.95 | 55,055 |
Breakdown:
- Pathogenic → Correct: 4,366 | Missed: 2,625
- Benign → Correct: 48,008 | Missed: 56
Interpretation:
Benign recall remains near-perfect, indicating strong generalization even without variant identity.
Pathogenic recall drops when identity information is removed, which is expected and biologically consistent.
Very high pathogenic precision (0.99) shows that when the model predicts 'Pathogenic' under identity masking, the prediction is almost always correct (56 false positives among 4,422 Pathogenic calls).
Conclusion:
This robustness test confirms that:
- The model is NOT solely dependent on memorized IDs
- Variant identity is a strong but non-exclusive signal
- Meaningful pathogenicity patterns are learned from genomic and population-level context
FAKE VARIANT ROBUSTNESS TEST COMPLETE
Intended Use
PathoPreter is purpose-built for ClinVar-style SNV interpretation, where decisions must be made from limited but widely available variant-level genomic context. Within this niche, the model delivers strong held-out performance (about 0.98 accuracy and 0.93 pathogenic F1 with full context) by combining large-scale curated ClinVar data and gnomAD population signals, backed by robustness and ablation testing.
Optimized for fast, large-batch inference on accessible hardware, PathoPreter offers a practical alternative to heavyweight foundation models. Its predictable degradation under ablation and resilience to identifier masking indicate that it learns meaningful biological patterns rather than relying on memorization, making it well suited for real-world variant prioritization, curation support, and benchmarking workflows.
Appropriate:
- Research variant prioritization
- Benchmarking pathogenicity models
- Educational genomics experiments
- Early risk flagging and ordering of variants for follow-up testing
Inappropriate:
- Clinical diagnosis
- Medical decision-making
- Patient-facing applications
TO USE THIS MODEL
```
!pip install transformers accelerate peft bitsandbytes  # bitsandbytes-0.49.1
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel

base_model = "unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit"  # same base the adapter was trained on
adapter_repo = "YADAV0206/PathoPreter-4B-SNV-Pathogen-ClinVar-gnomAD"

# Load the tokenizer from the adapter repo (it contains added tokens + merges)
tok = AutoTokenizer.from_pretrained(adapter_repo)

# Load the base model. The checkpoint is already stored in bitsandbytes 4-bit,
# so its bundled quantization config is used and no load_in_4bit flag is needed.
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")

# Attach the LoRA adapter weights
model = PeftModel.from_pretrained(model, adapter_repo)

# The model is already placed on its devices, so the pipeline needs no device_map
pipe = pipeline("text-generation", model=model, tokenizer=tok)

# This sample should output Pathogenic
sample = {
    "text": "Below is a biological context regarding a genetic variant. Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: CDK8\nVariant: NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)\nLocation: chr13:26385259 C>G\ngnomAD Frequency: 0.000000\n\n### Response:",
    "variant_id": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    "join_key": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    "Name": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    "Assembly": "GRCh38",
    "chrom": "13",
    "pos": "26385259",
    "ref": "C",
    "alt": "G",
    "gnomad_af": 0,
}

prompt = sample["text"]
print(prompt)

# Expected output: Pathogenic
out = pipe(
    prompt,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.2,
)[0]["generated_text"]

print("\n------ OUTPUT ------")
print(out)
```
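By default the text-generation pipeline returns the prompt followed by the completion, and the prompt itself contains both label words. A small post-processing step can therefore help when scoring outputs in bulk. The sketch below is an assumption about how one might do this: `parse_label` is a hypothetical helper, and the card does not document the exact completion format, so it simply looks for the first label word after the last `### Response:` marker.

```python
LABELS = ("Pathogenic", "Benign")

def parse_label(generated_text):
    """Return the first label word found after the last '### Response:' marker, or None."""
    completion = generated_text.rsplit("### Response:", 1)[-1].lower()
    positions = {
        label: completion.find(label.lower())
        for label in LABELS
        if label.lower() in completion
    }
    if not positions:
        return None
    # Pick whichever label appears earliest in the completion.
    return min(positions, key=positions.get)

example = (
    "Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: CDK8\n\n"
    "### Response: Pathogenic"
)
print(parse_label(example))
```

Returning `None` for completions without a recognizable label keeps malformed generations visible instead of silently counting them as one class.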
👨‍💻 Author
Rohit Yadav
National Institute of Technology, Jalandhar
References
- ClinVar (NCBI)
- gnomAD
- unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
- Unsloth
- PEFT 0.18.0
