PathoPreter-4B-SNV

(Base: unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit)

An LLM for SNV Pathogenicity Classification

Github: https://github.com/YADAV1825/PathoPreter

⚠️ IMPORTANT DISCLAIMER

PathoPreter is a research-only predictive model for risk prioritization. It is NOT a clinical diagnostic system or a diagnostic device. It outputs "Pathogenic Indication" signals to assist in workflow prioritization. It should never be used to confirm a diagnosis or determine medical treatment without expert ACMG review.

🔬 Model Summary

PathoPreter-4B-SNV is a parameter-efficient large language model fine-tuned to classify Single Nucleotide Variants (SNVs) as Pathogenic or Benign using structured genomic context.

Base model: unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
Fine-tuning: LoRA via Unsloth
Training data: 1.2 million SNVs
Genome assembly: GRCh38
Scope: SNVs only
Deployment: safetensors + GGUF (Q4/Q8)

Strict HGVS-level and coordinate-level train–test disjointness was enforced.

🧬 Dataset Description

The dataset consists exclusively of Single Nucleotide Variants (SNVs) curated from ClinVar and augmented with gnomAD population allele frequency.

The training dataset contains approximately 144k pathogenic variants and 1.05 million benign variants, for a total of about 1.2 million samples. The test set contains 55k distinct samples. Eleven separate ablation tests are run on the same 55k rows, giving around 12 × 55k = 660k evaluation rows in total.

I have uploaded a sample of 12k rows for training and 1.2k rows for testing to give an idea of how the dataset is structured and used.

Github Links: Train | Test.

Hugging Face Links: https://huggingface.co/datasets/YADAV0206/Clinvar_SNV_GnomAD_Pathogen_Variants

{"text":"Below is a biological context regarding a genetic variant. Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: SERPINH1\nVariant: NC_000011.10:g.75561608C>T\n\n### Response:","clean_label":"Pathogenic","variant_id":"NC_000011.10:g.75561608C>T","join_key":"NC_000011.10:g.75561608C>T","Name":"NC_000011.10:g.75561608C>T","Assembly":"GRCh38","chrom":"11","pos":"75561608","ref":"C","alt":"T","gnomad_af":0}
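
The sample record carries the same variant twice: once as a genomic HGVS string and once as separate chrom/pos/ref/alt fields. A minimal sketch of a consistency check between the two follows. The helper name `parse_genomic_hgvs` and its regular expression are illustrative assumptions, not part of the released audit scripts, and only simple `g.` substitutions on nuclear RefSeq chromosome accessions are handled.

```python
import json
import re

# Hypothetical helper: matches only simple genomic substitutions such as
# NC_000011.10:g.75561608C>T. Mitochondrial accessions and transcript-level
# (c.) names are out of scope for this sketch.
_HGVS_G = re.compile(r"^NC_0*(\d+)\.\d+:g\.(\d+)([ACGT])>([ACGT])$")
_SEX_CHROMS = {"23": "X", "24": "Y"}  # NC_000023 is X, NC_000024 is Y


def parse_genomic_hgvs(hgvs):
    """Return (chrom, pos, ref, alt) as strings, or None if the pattern does not match."""
    m = _HGVS_G.match(hgvs)
    if m is None:
        return None
    num, pos, ref, alt = m.groups()
    return _SEX_CHROMS.get(num, num), pos, ref, alt


record = json.loads(
    '{"variant_id": "NC_000011.10:g.75561608C>T", "chrom": "11",'
    ' "pos": "75561608", "ref": "C", "alt": "T"}'
)
parsed = parse_genomic_hgvs(record["variant_id"])
consistent = parsed == (record["chrom"], record["pos"], record["ref"], record["alt"])
print(parsed, consistent)
```

Returning `None` for names the pattern does not cover lets a caller count unparsed rows separately instead of treating them as mismatches.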

Included signals:

Gene symbol
HGVS transcript-level variant (join_key)
Genomic coordinates (chrom, pos, ref, alt)
gnomAD allele frequency
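
As a rough illustration of how these four signals end up in a prompt, the sketch below rebuilds the prompt string of the CDK8 usage sample given later in this card. The function name `build_prompt` is hypothetical, and the six-decimal frequency format is inferred from that single sample, so treat the template as an assumption rather than the exact training-time code.

```python
# Hypothetical prompt builder; the template is copied from the CDK8 usage
# sample in this card, not from the training pipeline itself.
INSTRUCTION = (
    "Below is a biological context regarding a genetic variant. "
    "Determine if it is Pathogenic or Benign."
)


def build_prompt(gene, variant, chrom, pos, ref, alt, gnomad_af):
    """Assemble the instruction, the four input signals, and the response marker."""
    return (
        f"{INSTRUCTION}\n\n### Input:\n"
        f"Gene: {gene}\n"
        f"Variant: {variant}\n"
        f"Location: chr{chrom}:{pos} {ref}>{alt}\n"
        f"gnomAD Frequency: {gnomad_af:.6f}\n\n"
        "### Response:"
    )


prompt = build_prompt(
    "CDK8", "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)", "13", "26385259", "C", "G", 0
)
print(prompt)
```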

Excluded:

Variants of Uncertain Significance (VUS)
Conflicting interpretations
Indels, CNVs, structural variants

Dataset integrity guarantees:

No detectable label leakage under automated audits (i.e., zero duplicate variants, zero label leakage, and zero train–test overlap)
Allele frequency sanity checks passed
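
A minimal sketch of what such an audit could look like is shown below. It is not the released audit script: the function name `audit_split` and the toy rows are made up for illustration, and it checks only duplicates and overlap at the HGVS (`join_key`) and coordinate level.

```python
def audit_split(train, test):
    """Count duplicates inside the test split and overlaps between the two splits."""

    def coord(row):
        return (row["chrom"], row["pos"], row["ref"], row["alt"])

    train_keys = {r["join_key"] for r in train}
    train_coords = {coord(r) for r in train}
    test_keys = [r["join_key"] for r in test]
    return {
        "test_duplicates": len(test_keys) - len(set(test_keys)),
        "hgvs_overlap": len(train_keys & set(test_keys)),
        "coord_overlap": len(train_coords & {coord(r) for r in test}),
    }


# Toy rows (invented) to show the shape of the check.
train_rows = [
    {"join_key": "NC_000001.11:g.100A>G", "chrom": "1", "pos": "100", "ref": "A", "alt": "G"},
    {"join_key": "NC_000002.12:g.200C>T", "chrom": "2", "pos": "200", "ref": "C", "alt": "T"},
]
test_rows = [
    {"join_key": "NC_000003.12:g.300G>A", "chrom": "3", "pos": "300", "ref": "G", "alt": "A"},
]
report = audit_split(train_rows, test_rows)
print(report)
```

Checking coordinates as well as HGVS strings matters because the same variant can appear under different names (for example a genomic and a transcript-level description).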

📦 Dataset Availability

Dataset Construction and Availability:

The datasets used to train and evaluate PathoPreter (including large-scale ClinVar-derived SNV corpora, controlled ablation test suites, and robustness evaluation datasets) were fully constructed in-house using publicly available, permissively licensed genomic resources such as ClinVar and gnomAD. All upstream sources are properly credited and explicitly permit commercial use and redistribution.

Significant original engineering and curation effort was applied beyond raw data usage. This included large-scale extraction, normalization, schema unification, quality control, deduplication, and strict train–test disjointness enforcement. Approximately 8 million raw ClinVar variants and ~250 GB of gnomAD VCF data were processed and merged into production-grade Parquet datasets optimized for large-scale analytics, machine learning training, and downstream integration. The end-to-end data construction process required approximately 150 hours of compute time and over 250 hours of expert engineering and curation work.

The resulting datasets constitute a high-value derived data asset, distinct from the original source distributions. These datasets are available for licensed distribution to startups, enterprises, and research organizations for use in applied genomics, AI/ML model development, benchmarking, variant prioritization workflows, and internal research. Commercial licensing, redistribution terms, and support options are available upon request.

Available components (in both Parquet and CSV) include:

Large-scale ClinVar-style SNV training dataset
Held-out test set with identical variants across ablations
Controlled ablation datasets (signal removal studies)
Fake-variant robustness evaluation dataset (see the FAKE VARIANT ROBUSTNESS TEST section below for why it matters)
Balanced CSV subsets suitable for classical ML training
Data audit and leakage-verification scripts

If you are interested in:

dataset licensing
research or industry use
collaboration or benchmarking
reproducing or extending this work

please contact:

Rohit Yadav

yrohit1825@gmail.com | Github: https://github.com/YADAV1825/PathoPreter

Requests are evaluated on a case-by-case basis.

📊 Evaluation Protocol

Held-out test set: 55,055 (55k) unseen SNVs
Total evaluation variants across ablations: ~600k
One full-context baseline + ten controlled ablation datasets
All ablations use identical variants; only input signals are removed
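
One way to picture "only input signals are removed" is as deleting the corresponding lines from an otherwise unchanged prompt. The sketch below does exactly that. The mapping `SIGNAL_PREFIXES` and the function `ablate_prompt` are assumptions for illustration; the card does not state how the ablation files were actually generated.

```python
# Hypothetical mapping from ablation name to the prompt-line prefix it removes.
SIGNAL_PREFIXES = {
    "gene": "Gene:",
    "variant_id": "Variant:",
    "location": "Location:",
    "gnomad_af": "gnomAD Frequency:",
}


def ablate_prompt(prompt, remove):
    """Drop the input lines for the named signals; everything else is unchanged."""
    prefixes = tuple(SIGNAL_PREFIXES[name] for name in remove)
    kept = [line for line in prompt.split("\n") if not line.startswith(prefixes)]
    return "\n".join(kept)


full_prompt = (
    "Below is a biological context regarding a genetic variant. "
    "Determine if it is Pathogenic or Benign.\n\n### Input:\n"
    "Gene: CDK8\n"
    "Variant: NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)\n"
    "Location: chr13:26385259 C>G\n"
    "gnomAD Frequency: 0.000000\n\n### Response:"
)
no_gene_no_af = ablate_prompt(full_prompt, ["gene", "gnomad_af"])
print(no_gene_no_af)
```

Because the instruction and response marker are untouched, any change in predictions between the baseline and an ablated file can be attributed to the removed lines.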

📉 Ablation Study Results

The table below summarizes performance when specific biological signals are removed.

Ablation Dataset	Accuracy	Pathogenic Precision	Pathogenic Recall	Pathogenic F1	Key Interpretation
Baseline (All Signals)	~0.98	~0.91	~0.96	~0.93	Full biological context
Remove Gene	~0.98	~0.91	~0.95	~0.93	Gene name alone is not a shortcut
Remove Variant ID	~0.60	~0.22	~0.81	~0.34	Variant identity is critical
Remove Location	~0.98	~0.91	~0.95	~0.93	Coordinates alone are not memorized
Remove gnomAD AF	~0.97	~0.82	~0.92	~0.87	Population frequency is a strong signal
Remove Variant ID + Location	~0.59	~0.21	~0.81	~0.34	Identity removal causes collapse
Remove Gene + Location	~0.98	~0.91	~0.95	~0.93	Gene names are contextual, not labels
Remove Gene + Variant ID	~0.60	~0.22	~0.81	~0.34	Identity loss dominates
Remove Location + gnomAD AF	~0.97	~0.86	~0.90	~0.88	Frequency + position matter jointly
Remove Variant ID + gnomAD AF	~0.51	~0.17	~0.72	~0.27	Severe degradation
Remove Gene + gnomAD AF	~0.97	~0.87	~0.89	~0.88	Frequency more important than gene name

🔍 Detailed Result: Remove Gene + gnomAD AF

Precision (Pathogenic): 0.87
Recall (Pathogenic): 0.89
F1-score (Pathogenic): 0.88
Accuracy: 0.97

Pathogenic Correct: 6206
Pathogenic Missed: 785
Benign Correct: 47102
Benign Missed: 962

Interpretation: Even without gene identity and population frequency, the model retains strong performance by leveraging remaining genomic structure.
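
The four counts above are enough to recompute the reported metrics. The sketch below assumes that "Pathogenic Missed" means pathogenic variants predicted Benign (false negatives) and "Benign Missed" means benign variants predicted Pathogenic (false positives); the function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, fn, tn, fp):
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the "Remove Gene + gnomAD AF" result, Pathogenic as the positive class.
m = binary_metrics(tp=6206, fn=785, tn=47102, fp=962)
print({name: round(value, 2) for name, value in m.items()})
```

Under that reading of the counts, the rounded values agree with the precision, recall, F1, and accuracy listed above.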

🧠 Key Ablation Insights

Variant identity is the dominant signal for pathogenicity classification
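gnomAD allele frequency improves both precision and recall (removing it lowers pathogenic precision from ~0.91 to ~0.82 and recall from ~0.96 to ~0.92) but does not fully determine predictions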
Gene names alone do not cause memorization or leakage
Performance degradation is predictable and biologically consistent
The model generalizes within a ClinVar-like SNV distribution
The model remains usable under literal string obfuscation of identifiers, at the cost of reduced pathogenic recall

🔬 FAKE VARIANT ROBUSTNESS TEST (IDENTITY MASKING)

Purpose:

Validate that the model does NOT rely on memorized variant identifiers by testing on synthetically altered (fake) IDs.

Test Setup:

All variant identifiers (variant_id, join_key, HGVS Name) were prefixed with 'FAKE_'
The same replacement was applied inside the prompt text
No real or known variant IDs were visible to the model
Gene, genomic location, ref/alt alleles, and gnomAD AF were preserved
Dataset size unchanged (55,055 SNVs)
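
The masking step described above can be sketched as follows. This is an illustrative reconstruction, not the script used to build fake_variant.parquet: the function name `mask_identifiers` is hypothetical, and the toy record is shortened.

```python
import re

ID_FIELDS = ("variant_id", "join_key", "Name")


def mask_identifiers(record, prefix="FAKE_"):
    """Return a copy of the record with identifiers prefixed in the fields and the prompt."""
    # Unique identifiers, longest first, so an ID nested inside a longer one
    # cannot match before the longer one does.
    ids = sorted({record[f] for f in ID_FIELDS}, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(i) for i in ids))
    masked = dict(record)
    for field in ID_FIELDS:
        masked[field] = prefix + record[field]
    # A single regex pass prefixes each occurrence exactly once, even when
    # the three identifier fields hold the same string.
    masked["text"] = pattern.sub(lambda m: prefix + m.group(0), record["text"])
    return masked


row = {
    "text": "### Input:\nGene: SERPINH1\nVariant: NC_000011.10:g.75561608C>T\n\n### Response:",
    "variant_id": "NC_000011.10:g.75561608C>T",
    "join_key": "NC_000011.10:g.75561608C>T",
    "Name": "NC_000011.10:g.75561608C>T",
    "chrom": "11",
}
fake = mask_identifiers(row)
print(fake["text"])
```

Replacing each field in turn with plain `str.replace` would stack the prefix three times when the fields are identical, which is why the sketch uses one pass over the de-duplicated identifiers.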

Evaluation Results: fake_variant.parquet

The table below summarizes performance under full variant identity masking.

Class	Precision	Recall	F1-score	Support
Pathogenic	0.99	0.62	0.77	6,991
Benign	0.95	1.00	0.97	48,064
Accuracy			0.95	55,055
Macro Avg	0.97	0.81	0.87	55,055
Weighted Avg	0.95	0.95	0.95	55,055

Breakdown:

Pathogenic → Correct: 4,366 | Missed: 2,625
Benign → Correct: 48,008 | Missed: 56
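
The macro and weighted averages in the table follow from these four counts. The sketch below recomputes them, assuming every missed variant was assigned to the other class (so a missed Pathogenic counts as a false Benign and vice versa); the helper name `class_scores` is illustrative.

```python
def class_scores(tp, fp, fn):
    """Precision, recall, and F1 for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Counts from the breakdown above.
path_ok, path_missed = 4366, 2625
ben_ok, ben_missed = 48008, 56

per_class = {
    "Pathogenic": class_scores(path_ok, ben_missed, path_missed),
    "Benign": class_scores(ben_ok, path_missed, ben_missed),
}
support = {"Pathogenic": path_ok + path_missed, "Benign": ben_ok + ben_missed}
total = sum(support.values())

# Macro: plain mean over classes. Weighted: mean weighted by class support.
macro = [sum(s[i] for s in per_class.values()) / len(per_class) for i in range(3)]
weighted = [sum(per_class[c][i] * support[c] for c in per_class) / total for i in range(3)]
accuracy = (path_ok + ben_ok) / total

print("macro   ", [round(x, 2) for x in macro])
print("weighted", [round(x, 2) for x in weighted])
print("accuracy", round(accuracy, 2))
```

The gap between macro recall and weighted recall reflects the class imbalance: the Benign class has about seven times the support of the Pathogenic class, so it dominates the weighted figures.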

Interpretation:

Benign recall remains near-perfect, indicating strong generalization even without variant identity.
Pathogenic recall drops when identity information is removed, which is expected and biologically consistent.
Very high pathogenic precision (0.99) shows that when the model predicts 'Pathogenic' under identity masking, the prediction is almost always correct.

Conclusion:

This robustness test confirms that:

The model is NOT solely dependent on memorized IDs
Variant identity is a strong but non-exclusive signal
Meaningful pathogenicity patterns are learned from genomic and population-level context


🚀 Intended Use

PathoPreter is purpose-built for ClinVar-style SNV interpretation, where decisions must be made from limited but widely available variant-level genomic context. Within this niche, the model delivers strong held-out performance (about 0.98 accuracy and 0.93 pathogenic F1 with full context) by combining large-scale curated ClinVar data, gnomAD population signals, and rigorously validated robustness and ablation testing.

Optimized for fast, large-batch inference on accessible hardware, PathoPreter offers a practical alternative to heavyweight foundation models. Its predictable degradation under ablation and its partial resilience to identifier masking suggest that it learns meaningful biological patterns rather than relying solely on memorization, which makes it well-suited for real-world variant prioritization, curation support, and benchmarking workflows.

Appropriate:

Research variant prioritization
Benchmarking pathogenicity models
Educational genomics experiments
Early risk flagging and prioritizing the order in which variants are tested

Inappropriate:

Clinical diagnosis
Medical decision-making
Patient-facing applications

TO USE THIS MODEL

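# Notebook cell; in a shell, drop the leading "!". Tested by the author with bitsandbytes 0.49.1.
!pip install transformers accelerate peft bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import PeftModel

base_model = "unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit"   # same base the adapter was trained on
adapter_repo = "YADAV0206/PathoPreter-4B-SNV-Pathogen-ClinVar-gnomAD"

# Load the tokenizer from the adapter repo (it contains added tokens + merges)
tok = AutoTokenizer.from_pretrained(adapter_repo)

# Load the base model in 4-bit to save memory.
# The 4-bit flag goes through BitsAndBytesConfig; the bare load_in_4bit
# keyword is deprecated in recent transformers releases.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# Attach the LoRA adapter weights
model = PeftModel.from_pretrained(model, adapter_repo)

# The model is already placed on devices by device_map above,
# so the pipeline does not need a device_map of its own.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tok,
)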

# This sample should be classified as Pathogenic
sample = {
"text":"Below is a biological context regarding a genetic variant. Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: CDK8\nVariant: NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)\nLocation: chr13:26385259 C>G\ngnomAD Frequency: 0.000000\n\n### Response:",
"variant_id":"NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
"join_key":"NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
"Name":"NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
"Assembly":"GRCh38",
"chrom":"13",
"pos":"26385259",
"ref":"C",
"alt":"G",
"gnomad_af":0
}

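prompt = sample["text"]
print(prompt)

# Expected output for this sample: Pathogenic.
# A low temperature keeps sampling close to deterministic.
out = pipe(
    prompt,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.2,
)[0]["generated_text"]

# By default, generated_text contains the prompt followed by the model's continuation.
print("\n------ OUTPUT ------")
print(out)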
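
Because the pipeline output echoes the prompt, and the prompt itself contains both the words "Pathogenic" and "Benign", the label has to be read from the text after the response marker. A minimal sketch of such a parser follows; the function name `extract_label` is hypothetical, and it assumes the model starts its answer with one of the two labels.

```python
def extract_label(generated_text, marker="### Response:"):
    """Return 'Pathogenic', 'Benign', or None from text that may include the echoed prompt."""
    # Keep only what follows the last response marker (the whole text if there is none).
    completion = generated_text.rsplit(marker, 1)[-1].strip().lower()
    for label in ("Pathogenic", "Benign"):
        if completion.startswith(label.lower()):
            return label
    return None


# Invented example strings standing in for pipeline output.
echoed = (
    "Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: CDK8\n\n"
    "### Response: Pathogenic"
)
print(extract_label(echoed))
print(extract_label("### Response:\nbenign"))
print(extract_label("### Response: I am not sure"))
```

Returning `None` for anything else keeps unparseable generations visible instead of silently counting them as one of the classes.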

👨‍💻 Author

Rohit Yadav
National Institute of Technology, Jalandhar

yrohit1825@gmail.com | Github: https://github.com/YADAV1825/PathoPreter

🔗 References

ClinVar (NCBI)
gnomAD
unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
Unsloth
PEFT 0.18.0
