Foundation Model

🧬 PathoPreter

Clinical-Grade Genomic Variant Triage & Pathogenicity Predictor by AutonomousX

PathoPreter is a compact hybrid foundation model that predicts the pathogenicity of genetic variants. Built on InstaDeep's 500M-parameter Nucleotide Transformer backbone with a custom hybrid classification head, it processes both raw DNA sequences and clinical tabular features (conservation scores, gnomAD AF).

By acting as a calibrated ranker rather than a simple binary classifier, PathoPreter is designed for the Clinical Triage Problem: deciding which variants to test first. On the balanced benchmark of unseen ClinVar variants reported below, it outperforms widely used predictors such as CADD, REVEL, and Google DeepMind's AlphaMissense.

⚑ TRAIN IT YOURSELF FOR FREE ON LIGHTNING.AI ON AN A100 40GB IN 6 HOURS WITH THE FREE CREDITS!! (use Lambda Labs! 😉) See the reproducible script section at the end of this README!!
License: CC-BY-NC-ND-3.0

Model Architecture & Specs

  • 500M Parameters
  • Nucleotide Transformer
  • Hybrid Classification Head
  • DNA + Tabular Inputs

Author

Rohit Yadav

B.Tech 3rd Year
Dr. B.R. Ambedkar National Institute of Technology (NIT) Jalandhar, India

📧 E-mail: yrohit1825@gmail.com
🔗 LinkedIn: Rohit Yadav
💻 Github: YADAV1825

🚀 I am actively seeking internships and collaborations!

Research Interests

  • Bioinformatics
  • Large Language Models
  • MultiModal Pipelines
  • Systems Programming
  • AI Infrastructure
  • Distributed Training


🚀 Democratizing Genomic AI: Free-Tier Accessible

The current trend in raw DNA foundation models (like Evo 2) relies on massive 40-billion-parameter architectures. Running these models requires clusters of H200 GPUs costing around $40,000 each, making them inaccessible to the average clinical lab.

PathoPreter shifts this paradigm. At roughly 500M parameters, it delivers clinical-grade variant triage that can be run entirely on a free-tier Google Colab T4 GPU or a standard consumer graphics card. No massive compute budget required.

💡 The Clinical Triage Paradigm & ROI

In clinical genomics, ROC-AUC alone is insufficient. Testing a single Variant of Uncertain Significance (VUS) can cost up to $1,500 and take weeks of labor. In a real clinical setting, the vast majority of these variants turn out to be benign.

To a clinician, the ability to rank variants is what actually drives clinical value. PathoPreter acts as a prioritization tool that maximizes laboratory ROI by reducing the time, labor, and cost spent on wet-lab testing of benign variants.

The PathoPreter ROI Example: If a high-throughput lab sequences 1,000 variants where only 5% (50 variants) are actually pathogenic:

  • Top 10% Triage: Testing only the top 100 variants ranked by PathoPreter captures ~75% of all true pathogens (approx. 38 variants).
  • Top 5% Triage: Testing only the top 50 variants captures ~50% of all true pathogens (approx. 25 variants).

Labs can prioritize the highest-risk targets first, cutting the haystack down to the sharpest needles and saving tens of thousands of dollars.
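The arithmetic behind the ROI example can be checked with a few lines of Python. This is only a sketch: the function name `triage_yield` and its parameters are hypothetical, and the recall values are the approximate figures quoted above, not outputs of the model.

```python
def triage_yield(n_variants, n_pathogenic, top_fraction, recall):
    """Summarize a triage scenario.

    Returns (tests run, pathogenic variants captured, enrichment), where
    enrichment is the hit rate among tested variants divided by the hit
    rate expected when testing variants at random.
    """
    n_tested = round(n_variants * top_fraction)
    captured = recall * n_pathogenic
    # (captured / n_tested) / (n_pathogenic / n_variants), rearranged
    enrichment = captured * n_variants / (n_tested * n_pathogenic)
    return n_tested, captured, enrichment


# Scenario from the text: 1,000 variants, 50 of them pathogenic.
top10 = triage_yield(1000, 50, top_fraction=0.10, recall=0.75)
top5 = triage_yield(1000, 50, top_fraction=0.05, recall=0.50)

print(top10)  # (100, 37.5, 7.5)
print(top5)   # (50, 25.0, 10.0)
```

Under these figures, a test in the top 10% is 7.5 times more likely to hit a pathogenic variant than a randomly chosen test, and a test in the top 5% is 10 times more likely.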


πŸ† Performance Leaderboards: Beating the Giants

PathoPreter was subjected to a rigorous two-phase evaluation pipeline to prove it is not just getting lucky on easy datasets.

1. The Balanced Benchmark (14k Unseen ClinVar)

This dataset is heavily skewed toward rare and ultra-rare variants (gnomAD AF < 1e-4), where even the benign variants are notoriously difficult to classify.

Evaluated on a strict 1:1 balanced test set of 14,073 unseen variants, PathoPreter reaches 0.9186 ROC-AUC, more than 0.32 above every baseline in the table, including Google DeepMind's AlphaMissense, CADD, and REVEL.

Note: Scores for the comparison models were taken from the dbNSFP database.

ROC-AUC / PR-AUC Leaderboard

| Model | ROC-AUC | PR-AUC |
|---|---|---|
| PathoPreter | 0.9186 | 0.9284 |
| BayesDel | 0.5949 | 0.7097 |
| CADD | 0.5921 | 0.7079 |
| ClinPred | 0.5886 | 0.6095 |
| AlphaMissense (DeepMind) | 0.5879 | 0.6050 |
| REVEL | 0.5847 | 0.6026 |
| ESM1b | 0.5763 | 0.5938 |

F1-Optimized Baseline Comparison

When pushing competitor models to their absolute best F1 thresholds, they still cap out at barely better than random chance (~53% accuracy). PathoPreter scales far beyond this.

| Model | Accuracy | Best Threshold | F1 Score | AP |
|---|---|---|---|---|
| ClinPred | 53.93% | 0.2707 | 0.6837 | 0.6066 |
| BayesDel_addAF | 53.83% | 0.2284 | 0.6838 | 0.7063 |
| REVEL | 53.26% | 0.5393 | 0.6799 | 0.6004 |
| AlphaMissense | 53.02% | 0.2655 | 0.6792 | 0.6035 |
| CADD | 52.80% | 0.2114 | 0.6790 | 0.7048 |

PathoPreter Performance by Allele Frequency (14k Set)

PathoPreter maintains elite performance even on ultra-rare mutations where most models fail.

| AF Bin | N | Pathogenic % | ROC-AUC |
|---|---|---|---|
| Ultra-rare (<1e-6) | 8,027 | 67.1% | 0.9238 |
| Rare (1e-6–1e-4) | 4,665 | 34.7% | 0.9069 |
| Low-freq (1e-4–1e-2) | 967 | 5.5% | 0.8219 |
| Common (>1e-2) | 414 | 0.5% | 0.9472 |
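For readers who want to reproduce a per-bin breakdown like this on their own data, the binning step might look like the sketch below. The function name `af_bin` is hypothetical, and how values exactly on a boundary (1e-6, 1e-4, 1e-2) are assigned is an assumption, since the table does not specify it.

```python
from collections import Counter


def af_bin(af):
    """Map a gnomAD allele frequency to one of the four bins in the table.

    Boundary handling is an assumption: 1e-6 and 1e-4 go to the higher
    bin, and exactly 1e-2 counts as Low-freq.
    """
    if af < 1e-6:
        return "Ultra-rare"
    if af < 1e-4:
        return "Rare"
    if af <= 1e-2:
        return "Low-freq"
    return "Common"


# Toy allele frequencies, for illustration only.
afs = [0.0, 5e-7, 3e-5, 2e-3, 0.2]
bins = [af_bin(af) for af in afs]
counts = Counter(bins)
print(bins)
print(counts)
```

Once each variant has a bin label, the ROC-AUC can be computed separately on the variants in each bin.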

2. The "Hard" Real-World Simulation (100k Unbalanced)

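Here we flooded the 14k dataset with benign variants, keeping the same hard pathogenic variants, to build a 100k dataset that simulates real-world class imbalance. On this set PathoPreter's ROC-AUC (0.9123) trails ClinPred, REVEL, AlphaMissense, ESM1b, and BayesDel_addAF, while its PR-AUC (0.6204) ranks third, behind BayesDel_addAF and CADD_raw.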

| Model | ROC-AUC | PR-AUC |
|---|---|---|
| ClinPred | 0.9602 | 0.5921 |
| REVEL | 0.9554 | 0.5828 |
| AlphaMissense | 0.9526 | 0.5764 |
| ESM1b | 0.9440 | 0.5492 |
| BayesDel_addAF | 0.9355 | 0.6880 |
| PathoPreter | 0.9123 | 0.6204 |
| CADD_raw | 0.8980 | 0.6566 |

Key Triage Metrics (100k Set):

  • Recall @ Top 10%: 75.28%
  • Brier Score: 0.1676 (lower is better)
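Both triage metrics are simple to compute from a list of model scores and binary labels. The sketch below is a plain-Python illustration, not the evaluation code from the repository; the function names and the toy data are hypothetical.

```python
def recall_at_top_fraction(scores, labels, fraction):
    """Fraction of all positives that fall in the top-scoring `fraction`.

    Ties are broken by input order because Python's sort is stable.
    """
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    return hits / sum(labels)


def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


# Toy data: 10 variants, 3 of them pathogenic (label 1).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

recall_top30 = recall_at_top_fraction(scores, labels, 0.3)
print(recall_top30)  # 2 of the 3 positives are in the top 3 scores
print(brier_score(scores, labels))
```

Recall at a fixed testing budget answers the triage question directly: of all pathogenic variants, how many would the lab find if it only tested the top of the ranking.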

Ablation Study: What Matters Most?

We systematically removed individual inputs to see which ones drive the model's predictions. The results indicate that PathoPreter draws most of its signal directly from the DNA sequence.

| Ablation Test | AUC | Performance Drop |
|---|---|---|
| No Tabular (Pure DNA) | 0.9081 | 📉 -0.0099 |
| No GERP (Evo Blind) | 0.9226 | 📈 +0.0044 |
| No PhyloP | 0.9174 | 📉 -0.0006 |
| No gnomAD (Freq Blind) | 0.9153 | 📉 -0.0028 |
| No Conservation (All Scores) | 0.9117 | 📉 -0.0064 |
| No gnomAD + No Conservation | 0.9081 | 📉 -0.0099 |
| No PhastCons | 0.9010 | 📉 -0.0171 |
| No DNA (Sequence Blind) | 0.5583 | 📉 -0.3598 |
| No DNA + No gnomAD | 0.5579 | 📉 -0.3602 |
| No DNA + No Conservation | 0.5179 | 📉 -0.4001 |

Modality Source:

SHAP analysis and ablation indicate that 64.9% of the model's predictive signal is derived directly from the raw DNA sequence context.

Even if all tabular features (conservation scores and gnomAD AF) are removed and the model is fed pure raw DNA (No Tabular), PathoPreter still achieves 0.908 ROC-AUC, a drop of only ~0.01.


πŸ›‘οΈ Data Integrity: Ensuring Zero Leakage

To ensure the model wasn't simply memorizing data, we performed a strict permutation test on the 14k unseen test set.

  • Real AUC: 0.9182
  • Permutation AUC: 0.5044

βœ… Result: No leakage or memorization detected. The model is genuinely learning biological pathogenicity, not just exploiting dataset artifacts.
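A permutation test of this kind can be sketched in plain Python as follows. The source does not say exactly what was permuted, so shuffling the test labels is an assumption here, and the function names, the number of permutations, and the synthetic scores are all hypothetical.

```python
import random


def roc_auc(scores, labels):
    """Rank-based ROC-AUC (Mann-Whitney U), using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def permutation_auc(scores, labels, n_permutations=200, seed=0):
    """Mean AUC after repeatedly shuffling the labels (assumed procedure)."""
    rng = random.Random(seed)
    shuffled = list(labels)
    aucs = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        aucs.append(roc_auc(scores, shuffled))
    return sum(aucs) / len(aucs)


# Synthetic scores: positives are drawn from a higher-mean distribution.
rng = random.Random(1)
labels = [i % 2 for i in range(400)]
scores = [rng.gauss(1.5 * y, 1.0) for y in labels]

real_auc = roc_auc(scores, labels)
perm_auc = permutation_auc(scores, labels)
print(real_auc, perm_auc)
```

If the scores carry real information about the labels, the real AUC should sit well above 0.5 while the permuted AUC should hover around 0.5.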


πŸ—οΈ Architecture & Modality

PathoPreter achieves its elite performance without leaning entirely on pre-calculated tabular conservation scores.

  • Backbone: InstaDeepAI/nucleotide-transformer-500m-human-ref
  • Custom Head: Concatenates transformer pooled DNA embeddings with normalized tabular features.
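As a rough illustration of the concatenation step, here is a minimal, dependency-free sketch. Mean pooling, z-score normalization, the single linear layer, and all names (`mean_pool`, `normalize`, `hybrid_head`) and toy dimensions are assumptions made for illustration; the actual head in the repository may differ.

```python
import math


def mean_pool(token_embeddings):
    """Average per-token embeddings into one sequence vector (assumed pooling)."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n_tokens for d in range(dim)]


def normalize(features, means, stds):
    """Z-score each tabular feature with precomputed statistics (assumed scheme)."""
    return [(x - m) / s for x, m, s in zip(features, means, stds)]


def hybrid_head(token_embeddings, tabular, means, stds, weights, bias):
    """Concatenate pooled DNA embedding with normalized tabular features,
    then apply one linear layer and a sigmoid (hypothetical head)."""
    fused = mean_pool(token_embeddings) + normalize(tabular, means, stds)
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy inputs: 2 tokens with 2-dim embeddings, plus 2 tabular features.
token_embeddings = [[1.0, 0.0], [3.0, 2.0]]  # stand-in for backbone output
tabular = [4.0, 0.5]                         # e.g. a conservation score and AF
score = hybrid_head(
    token_embeddings,
    tabular,
    means=[2.0, 0.5],
    stds=[2.0, 1.0],
    weights=[0.5, -1.0, 0.5, 2.0],
    bias=0.0,
)
print(score)
```

Fusing the two modalities before the classifier lets a single set of weights trade off sequence evidence against tabular evidence for each variant.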

♻️ Open Science & 100% Reproducibility (Please give it a star if it helps!)

In clinical genomics, transparency is just as critical as performance. We do not believe in "black box" medicine or hidden methodologies, so the entire PathoPreter ecosystem is open-sourced to let the community verify our benchmarks.

  • Dataset Generation Pipeline: The complete end-to-end data processing, k-mer tokenization, and feature engineering pipeline is publicly available at YADAV1825/PathoPreter.
  • From-Scratch Training Scripts: We provide the exact, fully reproducible fine-tuning scripts used to create the model, also at YADAV1825/PathoPreter. Anyone can train PathoPreter from scratch, verify our claims, or adapt the architecture to build specialized models for their own private genetic cohorts.
├── data_preprocessing/
│   ├── atgc_sequence_add.py
│   ├── clinvar_clean.py
│   ├── clinvar_download.py
│   ├── dbnsfp_download.py
│   ├── dbnsfp_merge.py
│   ├── gnomAD_download.py
│   ├── gnomAD_merge.py
│   ├── grch38_download.py
│   ├── grch38_merge.py
│   └── human_genome_builder.py
├── Instadeep_NT_500M_CPT/
│   ├── 100k_testing_AUC.ipynb
│   ├── 100k_testing_recall.ipynb
│   ├── ablation_study_10_tests.png
│   ├── Neucletide_transformer.ipynb
│   ├── shap_modality_comparison.png
│   └── shap_tabular_beeswarm.png
└── README.md

Organization

AutonomousX

AutonomousX focuses on open-source contributions aimed at building Large Language Models from scratch using custom training pipelines.

Our work explores different training configurations including optimizers, datasets, and scalable TPU training using JAX and pmap. The goal is to provide transparent and reproducible implementations so that researchers, students, and developers can understand how modern LLMs are trained end-to-end.

Due to the current scarcity of complete beginner-friendly guides for training LLMs on TPUs, especially using JAX, AutonomousX aims to bridge this gap by publishing full training pipelines, scripts, and documentation for the open-source community.
