PathoPreter-4B-SNV

(Base: unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit)
An LLM for SNV Pathogenicity Classification

⚠️ IMPORTANT DISCLAIMER

PathoPreter is a research-only predictive model for risk prioritization. It is NOT a clinical diagnostic system or a diagnostic device. It outputs "Pathogenic Indication" signals to assist in workflow prioritization. It should never be used to confirm a diagnosis or determine medical treatment without expert ACMG review.


🔬 Model Summary

PathoPreter-4B-SNV is a parameter-efficient large language model fine-tuned to classify Single Nucleotide Variants (SNVs) as Pathogenic or Benign using structured genomic context.

  • Base model: unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
  • Fine-tuning: LoRA via Unsloth
  • Training data: 1.2 million SNVs
  • Genome assembly: GRCh38
  • Scope: SNVs only
  • Deployment: safetensors + GGUF (Q4/Q8)

Strict HGVS-level and coordinate-level train–test disjointness was enforced.


🧬 Dataset Description

The dataset consists exclusively of Single Nucleotide Variants (SNVs) curated from ClinVar and augmented with gnomAD population allele frequency.

The training dataset contains approximately 144k pathogenic variants and 1.05 million benign variants, about 1.2 million samples in total. The test set contains about 55k distinct samples. The same 55k rows are reused across the 11 ablation-suite datasets (the full-context baseline plus ten ablations) and the fake-variant robustness set, so roughly 12 × 55k ≈ 660k rows are evaluated in total.

I have uploaded a sample of 12k training rows and 1.2k test rows to give an idea of the dataset format.

GitHub links: Train | Test.
Hugging Face link: https://huggingface.co/datasets/YADAV0206/Clinvar_SNV_GnomAD_Pathogen_Variants

Sample record:
{"text":"Below is a biological context regarding a genetic variant. Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: SERPINH1\nVariant: NC_000011.10:g.75561608C>T\n\n### Response:","clean_label":"Pathogenic","variant_id":"NC_000011.10:g.75561608C>T","join_key":"NC_000011.10:g.75561608C>T","Name":"NC_000011.10:g.75561608C>T","Assembly":"GRCh38","chrom":"11","pos":"75561608","ref":"C","alt":"T","gnomad_af":0}

Included signals:

  • Gene symbol
  • HGVS transcript-level variant (join_key)
  • Genomic coordinates (chrom, pos, ref, alt)
  • gnomAD allele frequency
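
As a rough illustration of how these signals end up in the prompt, the sketch below rebuilds the layout seen in the sample records. `build_prompt` is a hypothetical helper, not the author's code, and the six-decimal frequency format is inferred from the "0.000000" shown in the usage sample.

```python
# Hypothetical helper: reproduces the prompt layout visible in the sample records.
INSTRUCTION = (
    "Below is a biological context regarding a genetic variant. "
    "Determine if it is Pathogenic or Benign."
)


def build_prompt(gene, variant, chrom=None, pos=None, ref=None, alt=None, gnomad_af=None):
    """Assemble the instruction, the available input signals, and the response marker."""
    lines = [f"Gene: {gene}", f"Variant: {variant}"]
    # Location is only emitted when all four coordinate parts are present.
    if None not in (chrom, pos, ref, alt):
        lines.append(f"Location: chr{chrom}:{pos} {ref}>{alt}")
    # Assumption: frequency is printed with six decimals, as in the sample prompt.
    if gnomad_af is not None:
        lines.append(f"gnomAD Frequency: {gnomad_af:.6f}")
    return f"{INSTRUCTION}\n\n### Input:\n" + "\n".join(lines) + "\n\n### Response:"


prompt = build_prompt(
    "CDK8",
    "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    chrom="13", pos="26385259", ref="C", alt="G", gnomad_af=0,
)
print(prompt)
```

Called with only a gene and a variant, the same helper yields the shorter two-field prompt shown in the SERPINH1 sample record.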

Excluded:

  • Variants of Uncertain Significance (VUS)
  • Conflicting interpretations
  • Indels, CNVs, structural variants
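
The exclusions above can be pictured as a simple record filter. This is a minimal sketch under stated assumptions: the `clinical_significance` field name and the exact label strings are hypothetical, and the author's actual curation pipeline is not shown in this card.

```python
# Assumed label strings for the excluded categories (not taken from the source).
EXCLUDED_LABELS = {
    "uncertain significance",
    "conflicting interpretations of pathogenicity",
}
BASES = {"A", "C", "G", "T"}


def is_snv(ref, alt):
    """True only for a single-base substitution; indels have multi-base alleles."""
    return ref in BASES and alt in BASES and ref != alt


def keep_record(rec):
    """Keep SNVs whose label is neither VUS nor conflicting."""
    label = rec["clinical_significance"].strip().lower()
    return is_snv(rec["ref"], rec["alt"]) and label not in EXCLUDED_LABELS


records = [
    {"ref": "C", "alt": "T", "clinical_significance": "Pathogenic"},
    {"ref": "C", "alt": "CT", "clinical_significance": "Benign"},  # insertion
    {"ref": "G", "alt": "A", "clinical_significance": "Uncertain significance"},
]
kept = [r for r in records if keep_record(r)]
print(len(kept))
```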

Dataset integrity guarantees:

  • Automated audits found zero duplicate variants, zero label leakage, and zero train–test overlap
  • Allele frequency sanity checks passed
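
An overlap audit of this kind can be sketched with plain set lookups. The function names below are hypothetical and the author's audit scripts may work differently; the field names (`join_key`, `chrom`, `pos`, `ref`, `alt`) follow the sample record.

```python
def coord_key(rec):
    """Coordinate-level identity of a variant."""
    return (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])


def audit_split(train, test):
    """Count test rows that collide with training rows, plus duplicates inside test."""
    train_hgvs = {r["join_key"] for r in train}
    train_coords = {coord_key(r) for r in train}
    return {
        "hgvs_overlap": sum(r["join_key"] in train_hgvs for r in test),
        "coord_overlap": sum(coord_key(r) in train_coords for r in test),
        "test_duplicates": len(test) - len({r["join_key"] for r in test}),
    }


train = [{"join_key": "NC_000011.10:g.75561608C>T",
          "chrom": "11", "pos": "75561608", "ref": "C", "alt": "T"}]
test = [{"join_key": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
         "chrom": "13", "pos": "26385259", "ref": "C", "alt": "G"}]

clean_report = audit_split(train, test)
leaky_report = audit_split(train, test + train)  # deliberately leak one training row
print(clean_report, leaky_report)
```

Checking both keys matters because the same genomic change can appear under different HGVS strings, so an HGVS-only comparison could miss a coordinate-level collision.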

📦 Dataset Availability

Dataset Construction and Availability:
The datasets used to train and evaluate PathoPreter (including large-scale ClinVar-derived SNV corpora, controlled ablation test suites, and robustness evaluation datasets) were fully constructed in-house using publicly available, permissively licensed genomic resources such as ClinVar and gnomAD. All upstream sources are properly credited and explicitly permit commercial use and redistribution.

Significant original engineering and curation effort was applied beyond raw data usage. This included large-scale extraction, normalization, schema unification, quality control, deduplication, and strict train–test disjointness enforcement. Approximately 8 million raw ClinVar variants and ~250 GB of gnomAD VCF data were processed and merged into production-grade Parquet datasets optimized for large-scale analytics, machine learning training, and downstream integration. The end-to-end data construction process required approximately 150 hours of compute time and over 250 hours of expert engineering and curation work.

The resulting datasets constitute a high-value derived data asset, distinct from the original source distributions. These datasets are available for licensed distribution to startups, enterprises, and research organizations for use in applied genomics, AI/ML model development, benchmarking, variant prioritization workflows, and internal research. Commercial licensing, redistribution terms, and support options are available upon request.

Available components (in both Parquet and CSV):

  • Large-scale ClinVar-style SNV training dataset
  • Held-out test set with identical variants across ablations
  • Controlled ablation datasets (signal removal studies)
  • Fake-variant robustness evaluation dataset (its purpose is explained in the FAKE VARIANT ROBUSTNESS TEST section below)
  • Balanced CSV subsets suitable for classical ML training
  • Data audit and leakage-verification scripts

If you are interested in:

  • dataset licensing
  • research or industry use
  • collaboration or benchmarking
  • reproducing or extending this work

please contact:

Rohit Yadav

Requests are evaluated on a case-by-case basis.


📊 Evaluation Protocol

  • Held-out test set: 55,055 (55k) unseen SNVs
  • Total evaluation variants across ablations: ~600k
  • One full-context baseline + ten controlled ablation datasets
  • All ablations use identical variants; only input signals are removed
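
One way to picture "only input signals are removed" is to drop the matching lines from an otherwise unchanged prompt. The sketch below is an assumption about the mechanics, not the author's ablation code; `ablate` and `FIELD_PREFIXES` are hypothetical names, and the line prefixes come from the sample prompt in this card.

```python
# Line prefixes as they appear in the sample prompt.
FIELD_PREFIXES = {
    "gene": "Gene:",
    "variant_id": "Variant:",
    "location": "Location:",
    "gnomad_af": "gnomAD Frequency:",
}


def ablate(prompt, drop):
    """Return the prompt with the lines for the named signals removed."""
    prefixes = tuple(FIELD_PREFIXES[name] for name in drop)
    # str.startswith with an empty tuple is always False, so drop=[] is a no-op.
    kept = [line for line in prompt.split("\n") if not line.startswith(prefixes)]
    return "\n".join(kept)


full_prompt = (
    "Below is a biological context regarding a genetic variant. "
    "Determine if it is Pathogenic or Benign.\n\n### Input:\n"
    "Gene: CDK8\nVariant: NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)\n"
    "Location: chr13:26385259 C>G\ngnomAD Frequency: 0.000000\n\n### Response:"
)
no_gene_af = ablate(full_prompt, ["gene", "gnomad_af"])
print(no_gene_af)
```

In the CDK8 sample, the transcript-level HGVS name still contains the gene symbol after the Gene line is dropped; how the original ablation datasets handled that is not shown here.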



📉 Ablation Study Results

The table below summarizes performance when specific biological signals are removed.

| Ablation Dataset | Accuracy | Pathogenic Precision | Pathogenic Recall | Pathogenic F1 | Key Interpretation |
|---|---|---|---|---|---|
| Baseline (All Signals) | ~0.98 | ~0.91 | ~0.96 | ~0.93 | Full biological context |
| Remove Gene | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Gene name alone is not a shortcut |
| Remove Variant ID | ~0.60 | ~0.22 | ~0.81 | ~0.34 | Variant identity is critical |
| Remove Location | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Coordinates alone are not memorized |
| Remove gnomAD AF | ~0.97 | ~0.82 | ~0.92 | ~0.87 | Population frequency is a strong signal |
| Remove Variant ID + Location | ~0.59 | ~0.21 | ~0.81 | ~0.34 | Identity removal causes collapse |
| Remove Gene + Location | ~0.98 | ~0.91 | ~0.95 | ~0.93 | Gene names are contextual, not labels |
| Remove Gene + Variant ID | ~0.60 | ~0.22 | ~0.81 | ~0.34 | Identity loss dominates |
| Remove Location + gnomAD AF | ~0.97 | ~0.86 | ~0.90 | ~0.88 | Frequency + position matter jointly |
| Remove Variant ID + gnomAD AF | ~0.51 | ~0.17 | ~0.72 | ~0.27 | Severe degradation |
| Remove Gene + gnomAD AF | ~0.97 | ~0.87 | ~0.89 | ~0.88 | Frequency more important than gene name |

πŸ” Detailed Result: Remove Gene + gnomAD AF

Precision (Pathogenic): 0.87
Recall (Pathogenic): 0.89
F1-score (Pathogenic): 0.88
Accuracy: 0.97

Pathogenic Correct: 6206
Pathogenic Missed: 785
Benign Correct: 47102
Benign Missed: 962

Interpretation: Even without gene identity and population frequency, the model retains strong performance by leveraging remaining genomic structure.
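
The reported scores can be re-derived from the four counts above. The sketch assumes that "Pathogenic Missed" means pathogenic variants predicted Benign (false negatives) and "Benign Missed" means benign variants predicted Pathogenic (false positives); `binary_metrics` is an illustrative helper.

```python
def binary_metrics(tp, fn, tn, fp):
    """Precision, recall, F1 for the positive (Pathogenic) class, plus accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the "Remove Gene + gnomAD AF" breakdown.
metrics = binary_metrics(tp=6206, fn=785, tn=47102, fp=962)
rounded = {name: round(value, 2) for name, value in metrics.items()}
print(rounded)
```

Under that reading, the counts sum to 55,055 variants and round to the precision, recall, F1, and accuracy listed above.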


🧠 Key Ablation Insights

  • Variant identity is the dominant signal for pathogenicity classification
  • gnomAD allele frequency strongly improves recall but does not fully determine predictions
  • Gene names alone do not cause memorization or leakage
  • Performance degradation is predictable and biologically consistent
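  • The model generalizes within a ClinVar-like SNV distribution
  • It is partly robust to literal string obfuscation: with 'FAKE_'-prefixed identifiers, accuracy stays at 0.95 while pathogenic recall drops to 0.62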

🔬 FAKE VARIANT ROBUSTNESS TEST (IDENTITY MASKING)

Purpose:

Validate that the model does NOT rely on memorized variant identifiers by testing on synthetically altered (fake) IDs.

Test Setup:

  • All variant identifiers (variant_id, join_key, HGVS Name) were prefixed with 'FAKE_'
  • The same replacement was applied inside the prompt text
  • No real or known variant IDs were visible to the model
  • Gene, genomic location, ref/alt alleles, and gnomAD AF were preserved
  • Dataset size unchanged (55,055 SNVs)
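
The setup above can be sketched as a per-record transformation. `mask_identity` is a hypothetical helper written for illustration; it assumes, as in the sample records, that the three identifier fields hold the same string that appears in the prompt text.

```python
ID_FIELDS = ("variant_id", "join_key", "Name")


def mask_identity(rec, prefix="FAKE_"):
    """Return a copy of the record with identifiers prefixed, in fields and in the prompt."""
    masked = dict(rec)
    for field in ID_FIELDS:
        masked[field] = prefix + rec[field]
    # Assumption: the identifier fields share one value, so a single replace per
    # distinct value is enough and cannot double-prefix the prompt text.
    for original in {rec[field] for field in ID_FIELDS}:
        masked["text"] = masked["text"].replace(original, prefix + original)
    return masked


record = {
    "text": "### Input:\nGene: CDK8\nVariant: NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)\n"
            "Location: chr13:26385259 C>G\ngnomAD Frequency: 0.000000\n\n### Response:",
    "variant_id": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    "join_key": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    "Name": "NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
    "chrom": "13", "pos": "26385259", "ref": "C", "alt": "G", "gnomad_af": 0,
}
fake = mask_identity(record)
print(fake["text"])
```

Gene, location, alleles, and allele frequency pass through untouched, which matches the preserved signals listed in the setup.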

Evaluation Results: fake_variant.parquet

The table below summarizes performance under full variant identity masking.

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Pathogenic | 0.99 | 0.62 | 0.77 | 6,991 |
| Benign | 0.95 | 1.00 | 0.97 | 48,064 |
| Accuracy | | | 0.95 | 55,055 |
| Macro Avg | 0.97 | 0.81 | 0.87 | 55,055 |
| Weighted Avg | 0.95 | 0.95 | 0.95 | 55,055 |

Breakdown:

  • Pathogenic → Correct: 4,366 | Missed: 2,625
  • Benign → Correct: 48,008 | Missed: 56
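
The macro and weighted averages in the results table follow from this breakdown. The sketch below recomputes them, assuming each class's "Missed" rows were predicted as the other class; `class_metrics` is an illustrative helper.

```python
def class_metrics(correct, missed, other_missed):
    """Per-class precision, recall, F1, and support.

    other_missed: rows of the other class that were predicted as this class.
    """
    precision = correct / (correct + other_missed)
    recall = correct / (correct + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, correct + missed


patho = class_metrics(correct=4366, missed=2625, other_missed=56)
benign = class_metrics(correct=48008, missed=56, other_missed=2625)

total = patho[3] + benign[3]
accuracy = (4366 + 48008) / total
# Macro average: unweighted mean over the two classes.
macro = [(p + b) / 2 for p, b in zip(patho[:3], benign[:3])]
# Weighted average: mean weighted by class support.
weighted = [(p * patho[3] + b * benign[3]) / total for p, b in zip(patho[:3], benign[:3])]

print([round(v, 2) for v in macro], [round(v, 2) for v in weighted], round(accuracy, 2))
```

The gap between macro recall and weighted recall reflects the class imbalance: the benign class holds most of the support, so its near-perfect recall dominates the weighted figure.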

Interpretation:

  • Benign recall remains near-perfect, indicating strong generalization even without variant identity.

  • Pathogenic recall drops when identity information is removed, which is expected and biologically consistent.

  • Very high pathogenic precision (0.99) shows that when the model predicts 'Pathogenic' under identity masking, the prediction is almost always correct (56 false positives among 4,422 'Pathogenic' calls).

Conclusion:

This robustness test confirms that:

  • The model is NOT solely dependent on memorized IDs
  • Variant identity is a strong but non-exclusive signal
  • Meaningful pathogenicity patterns are learned from genomic and population-level context

🏁 FAKE VARIANT ROBUSTNESS TEST COMPLETE


🚀 Intended Use

PathoPreter is purpose-built for ClinVar-style SNV interpretation, where decisions must be made from limited but widely available variant-level genomic context. Within this niche, the model reaches about 0.98 accuracy and 0.93 pathogenic F1 on the held-out test set by combining large-scale curated ClinVar data, gnomAD population signals, and validated robustness and ablation testing.

Optimized for fast, large-batch inference on accessible hardware, PathoPreter offers a practical alternative to heavyweight foundation models. Its predictable degradation under ablation and partial resilience to identifier masking indicate that it learns patterns beyond memorized identifiers, which makes it well suited to variant prioritization, curation support, and benchmarking workflows.

Appropriate:

  • Research variant prioritization
  • Benchmarking pathogenicity models
  • Educational genomics experiments
  • Early risk flagging and ordering variants by priority for follow-up testing

Inappropriate:

  • Clinical diagnosis
  • Medical decision-making
  • Patient-facing applications

TO USE THIS MODEL

# Notebook cell (drop the leading "!" when running in a shell); bitsandbytes version noted: 0.49.1
!pip install transformers accelerate peft bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel

base_model = "unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit"   # the base model the adapter was trained on
adapter_repo = "YADAV0206/PathoPreter-4B-SNV-Pathogen-ClinVar-gnomAD"

# Load the tokenizer from the adapter repo (it contains added tokens + merges)
tok = AutoTokenizer.from_pretrained(adapter_repo)

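# Load the base model in 4-bit to save RAM. The checkpoint is already quantized with
# bitsandbytes, so its stored quantization config is applied on load and no
# load_in_4bit flag is needed.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto"
)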

# Load LoRA adapter weights
model = PeftModel.from_pretrained(model, adapter_repo)

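# The model was already placed on devices by device_map="auto" above,
# so the pipeline does not need its own device_map.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tok
)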

# This sample is expected to be classified as Pathogenic
sample = {
"text":"Below is a biological context regarding a genetic variant. Determine if it is Pathogenic or Benign.\n\n### Input:\nGene: CDK8\nVariant: NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)\nLocation: chr13:26385259 C>G\ngnomAD Frequency: 0.000000\n\n### Response:",
"variant_id":"NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
"join_key":"NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
"Name":"NM_001260.3(CDK8):c.563C>G (p.Ala188Gly)",
"Assembly":"GRCh38",
"chrom":"13",
"pos":"26385259",
"ref":"C",
"alt":"G",
"gnomad_af":0
}

prompt = sample["text"]
print(prompt)

# Expected label: Pathogenic. Note that generated_text holds the prompt followed by the completion.

out = pipe(
    prompt,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.2,
)[0]["generated_text"]

print("\n------ OUTPUT ------")
print(out)
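
Because the pipeline output repeats the prompt, and the prompt itself contains the words "Pathogenic or Benign", the label has to be read from the completion only. The sketch below shows one way to do that; `parse_label` is a hypothetical helper, and the assumption that the model answers with one of the two label words is not confirmed by the source.

```python
def parse_label(prompt, generated_text):
    """Return 'Pathogenic', 'Benign', or None, looking only at the text after the prompt."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    lowered = completion.lower()
    # Pick whichever label word appears first in the completion.
    hits = [
        (lowered.find(label.lower()), label)
        for label in ("Pathogenic", "Benign")
        if label.lower() in lowered
    ]
    return min(hits)[1] if hits else None


demo_prompt = "Determine if it is Pathogenic or Benign.\n\n### Response:"
print(parse_label(demo_prompt, demo_prompt + " Pathogenic"))
```

With the objects from the usage snippet, this would be called as `parse_label(prompt, out)`.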



πŸ‘¨β€πŸ’» Author

Rohit Yadav
National Institute of Technology, Jalandhar


🔗 References

  • ClinVar (NCBI)
  • gnomAD
  • unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
  • Unsloth
  • PEFT 0.18.0