🧬 PathoPreter
Clinical-Grade Genomic Variant Triage & Pathogenicity Predictor by AutonomousX
PathoPreter is a highly efficient, hybrid foundation model engineered to predict the pathogenicity of genetic variants. Built on InstaDeep's 500M-parameter Nucleotide Transformer backbone with a custom hybrid classification head, it natively processes both raw DNA sequences and clinical tabular features (conservation scores, gnomAD AF).
By acting as a calibrated ranker rather than a simple binary classifier, PathoPreter is engineered to solve the Clinical Triage Problem. On the balanced benchmark of unseen ClinVar variants reported below, it outperforms industry standards like CADD, REVEL, and Google DeepMind's AlphaMissense.
⚡ Train it yourself for free on Lightning.ai: an A100 40GB finishes in 6 hours, within the free credits! (Use Lambda Labs!) See the reproducible script section at the end of this README.
Author
Rohit Yadav
B.Tech 3rd Year
Dr. B.R. Ambedkar National Institute of Technology (NIT) Jalandhar, India
E-mail: yrohit1825@gmail.com
LinkedIn: Rohit Yadav
GitHub: YADAV1825
I am actively seeking Internships and Collaborations!
Research Interests
- Bioinformatics
- Large Language Models
- Multimodal Pipelines
- Systems Programming
- AI Infrastructure
- Distributed Training
Democratizing Genomic AI: Free-Tier Accessible
The current trend in raw-DNA foundation models (such as Evo 2) relies on massive 40-billion-parameter architectures. Running these models requires clusters of $40,000 H200 GPUs, putting them out of reach for the average clinical lab.
PathoPreter shifts this paradigm. At a lightweight ~500M parameters, it delivers clinical-grade variant triage that runs entirely on a free-tier Google Colab T4 GPU or a standard consumer graphics card. No massive compute budget required.
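To make the size gap concrete, here is a rough back-of-the-envelope sketch of the memory needed for model weights alone. The parameter counts come from the text above; 2 bytes per parameter (fp16) and the 16 GB T4 memory figure are assumptions of this sketch, and activations and optimizer state are ignored.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Parameter counts are taken from the README; fp16 storage is an assumption.
pathopreter_gb = weight_memory_gb(500e6)
large_model_gb = weight_memory_gb(40e9)

T4_MEMORY_GB = 16  # assumed memory of a single free-tier T4

for name, gb in [("~500M params", pathopreter_gb), ("40B params", large_model_gb)]:
    print(f"{name}: {gb:.1f} GB of weights, fits on one T4: {gb < T4_MEMORY_GB}")
```

Under these assumptions the 500M-parameter weights take about 1 GB, while a 40B-parameter model needs about 80 GB for weights before any activations are counted.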
💡 The Clinical Triage Paradigm & ROI
In clinical genomics, ROC-AUC alone is insufficient. Testing a single Variant of Uncertain Significance (VUS) can cost up to $1,500 and take weeks of labor, and in a real clinical setting the vast majority of these variants turn out to be benign.
For a clinician, the ability to rank variants is what drives clinical value. PathoPreter acts as a prioritization tool that raises laboratory ROI by cutting the time, labor, and money spent on wet-lab testing of benign variants.
The PathoPreter ROI Example: If a high-throughput lab sequences 1,000 variants where only 5% (50 variants) are actually pathogenic:
- Top 10% Triage: Testing only the top 100 variants ranked by PathoPreter captures ~75% of all truly pathogenic variants (approx. 38 variants).
- Top 5% Triage: Testing only the top 50 variants captures ~50% of all truly pathogenic variants (approx. 25 variants).
Labs can prioritize the highest-risk targets first, cutting the haystack down to the sharpest needles and saving tens of thousands of dollars.
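The arithmetic behind the ROI example can be sketched as follows. The counts and recall figures are the ones quoted above; the $1,500 per test is the README's upper bound, so the cost figures are upper bounds too. The function name and the "enrichment" ratio are illustrative additions, not part of the PathoPreter code.

```python
def triage_summary(n_variants, n_pathogenic, top_k, recall_at_k, cost_per_test):
    """Summarize what a lab gets from testing only the top_k ranked variants."""
    found = n_pathogenic * recall_at_k      # expected pathogenic variants captured
    hit_rate = found / top_k                # share of tested variants that are pathogenic
    base_rate = n_pathogenic / n_variants   # hit rate when testing in random order
    return {
        "found": found,
        "hit_rate": hit_rate,
        "enrichment": hit_rate / base_rate,
        "cost_triaged": top_k * cost_per_test,
        "cost_test_all": n_variants * cost_per_test,
    }


# Figures from the example above: 1,000 variants, 50 pathogenic, up to $1,500 per test.
top10 = triage_summary(1000, 50, top_k=100, recall_at_k=0.75, cost_per_test=1500)
top5 = triage_summary(1000, 50, top_k=50, recall_at_k=0.50, cost_per_test=1500)

for name, s in [("top 10%", top10), ("top 5%", top5)]:
    print(f"{name}: {s['found']:.1f} found, hit rate {s['hit_rate']:.1%}, "
          f"{s['enrichment']:.1f}x over random order, "
          f"${s['cost_triaged']:,} vs ${s['cost_test_all']:,} to test everything")
```

The hit rate among tested variants is the number to watch: it shows how much denser in pathogenic variants the triaged batch is compared with testing in random order.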
Performance Leaderboards: Beating the Giants
PathoPreter was evaluated in two phases to check that its results are not an artifact of easy datasets.
1. The Balanced Benchmark (14k Unseen ClinVar)
This dataset is heavily skewed toward rare and ultra-rare variants (gnomAD AF < 1e-4), where even benign variants are notoriously difficult to classify.
Evaluated on a strict 1:1 balanced test set of 14,073 unseen variants, PathoPreter reaches 0.9186 ROC-AUC, beating BayesDel, CADD, Google DeepMind's AlphaMissense, and REVEL by more than 0.32 ROC-AUC.
Note: Scores for the baseline models were taken from the dbNSFP database.
ROC-AUC / PR-AUC Leaderboard
| Model | ROC-AUC | PR-AUC |
|---|---|---|
| PathoPreter | 0.9186 | 0.9284 |
| BayesDel | 0.5949 | 0.7097 |
| CADD | 0.5921 | 0.7079 |
| ClinPred | 0.5886 | 0.6095 |
| AlphaMissense (DeepMind) | 0.5879 | 0.6050 |
| REVEL | 0.5847 | 0.6026 |
| ESM1b | 0.5763 | 0.5938 |
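For readers who want to recompute the two leaderboard columns from raw scores, here is a minimal dependency-free sketch. ROC-AUC is computed as the probability that a pathogenic variant outscores a benign one (ties count half). Average precision is used as the PR-AUC estimator, which is an assumption: the README does not say which estimator produced its PR-AUC column. Labels are assumed to be 1 for pathogenic and 0 for benign, with both classes present and no tied scores for the precision calculation.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data, not taken from the benchmark.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(f"ROC-AUC {roc_auc(toy_labels, toy_scores):.4f}, "
      f"AP {average_precision(toy_labels, toy_scores):.4f}")
```

The pairwise ROC-AUC is quadratic in the number of variants, which is fine for a toy check; a library implementation is the better choice for the full test set.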
F1-Optimized Baseline Comparison
Even at their best F1 thresholds, competitor models reach only about 53% accuracy on this balanced set, barely better than random chance. PathoPreter scales far beyond this.
| Model | Accuracy | Best Threshold | F1 Score | AP |
|---|---|---|---|---|
| ClinPred | 53.93% | 0.2707 | 0.6837 | 0.6066 |
| BayesDel_addAF | 53.83% | 0.2284 | 0.6838 | 0.7063 |
| REVEL | 53.26% | 0.5393 | 0.6799 | 0.6004 |
| AlphaMissense | 53.02% | 0.2655 | 0.6792 | 0.6035 |
| CADD | 52.80% | 0.2114 | 0.6790 | 0.7048 |
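A sketch of how an F1-optimal threshold can be searched for is shown below. It is an illustration under assumptions (a plain scan over observed score values, predictions positive when score >= threshold), not the exact procedure used for the table. The toy scores are deliberately uninformative, to show one way a weak scorer on a 1:1 balanced set can end up with an F1 near 0.67 at roughly 50% accuracy: the best F1 is obtained by calling almost everything pathogenic.

```python
def best_f1_threshold(labels, scores):
    """Scan observed score values and return (best_f1, threshold, accuracy)."""
    best = (0.0, None, 0.0)
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if f1 > best[0]:  # strict '>' keeps the lowest threshold on ties
            best = (f1, t, acc)
    return best


# Toy balanced set where the score carries no information about the label.
toy_labels = [1, 0, 1, 0, 1, 0]
toy_scores = [0.3, 0.3, 0.5, 0.5, 0.4, 0.4]
f1, threshold, accuracy = best_f1_threshold(toy_labels, toy_scores)
print(f"best F1 {f1:.4f} at threshold {threshold}, accuracy {accuracy:.2%}")
```

On a balanced set, predicting every variant as pathogenic gives precision 0.5 and recall 1.0, hence F1 = 2/3, which is why F1 should be read together with accuracy here.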
PathoPreter Performance by Allele Frequency (14k Set)
PathoPreter maintains strong performance even on ultra-rare variants, where most models fail.
| AF Bin | N | Pathogenic % | ROC-AUC |
|---|---|---|---|
| Ultra-rare (<1e-6) | 8,027 | 67.1% | 0.9238 |
| Rare (1e-6–1e-4) | 4,665 | 34.7% | 0.9069 |
| Low-freq (1e-4–1e-2) | 967 | 5.5% | 0.8219 |
| Common (>1e-2) | 414 | 0.5% | 0.9472 |
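The binning behind a table like this can be sketched as below. The cut-offs are the ones shown in the table; how values that fall exactly on a boundary are assigned is an assumption of this sketch, as are the function and key names.

```python
from collections import defaultdict


def af_bin(af: float) -> str:
    """Map a gnomAD allele frequency to one of the four bins in the table."""
    if af < 1e-6:
        return "ultra_rare"
    if af < 1e-4:
        return "rare"
    if af <= 1e-2:
        return "low_freq"
    return "common"


def summarize_by_bin(afs, labels):
    """Per-bin variant count and pathogenic percentage (labels: 1 = pathogenic)."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [n_variants, n_pathogenic]
    for af, y in zip(afs, labels):
        entry = counts[af_bin(af)]
        entry[0] += 1
        entry[1] += y
    return {name: {"n": n, "pathogenic_pct": 100.0 * pos / n}
            for name, (n, pos) in counts.items()}


# Toy allele frequencies and labels, not taken from the benchmark.
toy_summary = summarize_by_bin([0.0, 5e-7, 3e-5, 2e-3, 0.05], [1, 1, 0, 0, 0])
print(toy_summary)
```

A per-bin ROC-AUC would then be computed separately on the variants of each group; bins with very few pathogenic variants give noisy estimates.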
2. The "Hard" Real-World Simulation (100k Unbalanced)
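Here we added benign variants to the 14k set, keeping the same hard pathogenic variants, to build a 100k set that simulates real-world class imbalance. On this set PathoPreter's ROC-AUC (0.9123) trails five of the six baselines, while its PR-AUC (0.6204) ranks third of seven, behind BayesDel_addAF and CADD_raw.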
| Model | ROC-AUC | PR-AUC |
|---|---|---|
| ClinPred | 0.9602 | 0.5921 |
| REVEL | 0.9554 | 0.5828 |
| AlphaMissense | 0.9526 | 0.5764 |
| ESM1b | 0.9440 | 0.5492 |
| BayesDel_addAF | 0.9355 | 0.6880 |
| PathoPreter | 0.9123 | 0.6204 |
| CADD_raw | 0.8980 | 0.6566 |
Key Triage Metrics (100k Set):
- Recall @ Top 10%: 75.28%
- Brier Score: 0.1676 (lower is better)
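Both triage metrics are simple to compute from predicted probabilities. The sketch below is a minimal illustration with assumed function names and toy data; rounding the cut-off to the nearest whole variant is a choice of this sketch. It also includes the Brier score of a constant predictor that always outputs the pathogenic base rate p, which equals p(1 - p) and is a useful reference point when reading a Brier score on imbalanced data.

```python
def recall_at_top_fraction(labels, scores, fraction=0.10):
    """Share of all pathogenic variants found in the top-ranked fraction."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    found = sum(y for _, y in ranked[:k])
    return found / sum(labels)


def brier_score(labels, probs):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def brier_baseline(labels):
    """Brier score of always predicting the base rate p, i.e. p * (1 - p)."""
    p = sum(labels) / len(labels)
    return p * (1 - p)


# Toy data: 10 variants, 2 of them pathogenic.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
toy_probs = [0.9, 0.8, 0.1, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05, 0.02]

print(f"recall @ top 10%: {recall_at_top_fraction(toy_labels, toy_probs):.2%}")
print(f"Brier: {brier_score(toy_labels, toy_probs):.5f}, "
      f"base-rate reference: {brier_baseline(toy_labels):.5f}")
```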
Ablation Study: What Matters Most?
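We systematically removed inputs to see what drives the model's predictions. The results indicate that PathoPreter relies mainly on the raw DNA sequence: removing it drops AUC to about 0.56, while removing all tabular features costs only about 0.01.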
| Ablation Test | AUC | Change in AUC |
|---|---|---|
| No Tabular (Pure DNA) | 0.9081 | -0.0099 |
| No GERP (Evo Blind) | 0.9226 | +0.0044 |
| No PhyloP | 0.9174 | -0.0006 |
| No gnomAD (Freq Blind) | 0.9153 | -0.0028 |
| No Conservation (All Scores) | 0.9117 | -0.0064 |
| No gnomAD + No Conservation | 0.9081 | -0.0099 |
| No PhastCons | 0.9010 | -0.0171 |
| No DNA (Sequence Blind) | 0.5583 | -0.3598 |
| No DNA + No gnomAD | 0.5579 | -0.3602 |
| No DNA + No Conservation | 0.5179 | -0.4001 |
Modality Source:
SHAP analysis and ablation indicate that 64.9% of the model's predictive signal comes directly from the raw DNA sequence context.
Even if all clinical conservation scores are removed and the model is fed pure raw DNA (No Tabular), PathoPreter still achieves 0.908 ROC-AUC, a drop of only ~0.01.
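The general shape of an input-ablation experiment can be sketched as follows. How the inputs were actually "removed" is not described here, so this sketch assumes the simplest option: overwrite one input with a constant (0.0, the mean of a normalized feature), re-score, and report the change in AUC. The `predict` function, the feature names `dna` and `gerp`, and the toy rows are all hypothetical stand-ins, not the PathoPreter model.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ablate(rows, feature, fill=0.0):
    """Return copies of the rows with one input overwritten by a constant."""
    return [{**r, feature: fill} for r in rows]


def ablation_delta(predict, rows, labels, feature):
    """AUC with the feature ablated minus AUC with all inputs intact."""
    base = roc_auc(labels, [predict(r) for r in rows])
    ablated = roc_auc(labels, [predict(r) for r in ablate(rows, feature)])
    return ablated - base


# Hypothetical stand-in model: mostly driven by a "dna" signal, slightly by "gerp".
def predict(row):
    return row["dna"] + 0.1 * row["gerp"]


toy_rows = [
    {"dna": 2.0, "gerp": 0.5},
    {"dna": 1.0, "gerp": -0.5},
    {"dna": -1.0, "gerp": 0.5},
    {"dna": -2.0, "gerp": -0.5},
]
toy_labels = [1, 1, 0, 0]

for feature in ("dna", "gerp"):
    print(feature, ablation_delta(predict, toy_rows, toy_labels, feature))
```

In this toy setup, ablating the dominant input collapses AUC to chance while ablating the minor one leaves it unchanged, which is the same pattern the table reports for DNA versus tabular features.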
🛡️ Data Integrity: Ensuring Zero Leakage
To ensure the model wasn't simply memorizing data, we performed a strict permutation test on the 14k unseen test set.
- Real AUC: 0.9182
- Permutation AUC: 0.5044
Result: No leakage or memorization detected. Under permutation the AUC falls to chance level, which indicates the model is learning biological pathogenicity rather than exploiting dataset artifacts.
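A permutation check of this kind can be sketched as below. The README does not state exactly what was permuted, so this sketch makes an assumption: it shuffles the test labels against fixed model scores and averages the resulting AUC, which should land near 0.5 if the real AUC reflects a genuine link between scores and labels. The function names, the number of rounds, and the toy data are illustrative.

```python
import random


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_auc(labels, scores, n_rounds=200, seed=0):
    """Mean AUC after repeatedly shuffling the labels against fixed scores."""
    rng = random.Random(seed)
    shuffled = list(labels)
    aucs = []
    for _ in range(n_rounds):
        rng.shuffle(shuffled)
        aucs.append(roc_auc(shuffled, scores))
    return sum(aucs) / len(aucs)


# Toy data: 40 variants whose scores separate the classes perfectly.
toy_scores = [i / 40 for i in range(40)]
toy_labels = [1 if i >= 20 else 0 for i in range(40)]

real_auc = roc_auc(toy_labels, toy_scores)
perm_auc = permutation_auc(toy_labels, toy_scores)
print(f"real AUC {real_auc:.4f}, permutation AUC {perm_auc:.4f}")
```

A stricter variant would retrain the model on permuted training labels and then evaluate on the real test labels; that is more expensive, but it probes memorization more directly.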
Architecture & Modality
PathoPreter achieves its elite performance without leaning entirely on pre-calculated tabular conservation scores.
- Backbone: InstaDeepAI/nucleotide-transformer-500m-human-ref
- Custom Head: Concatenates transformer pooled DNA embeddings with normalized tabular features.
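The data flow of such a hybrid head can be sketched without any framework. Everything beyond "pool the DNA embeddings, normalize the tabular features, concatenate" is an assumption of this sketch: mean pooling, z-score normalization, and a single linear layer followed by a sigmoid are illustrative choices, and all names and numbers are hypothetical. The real head may well be deeper.

```python
import math


def mean_pool(token_embeddings):
    """Average per-token vectors into one sequence-level vector."""
    n = len(token_embeddings)
    return [sum(col) / n for col in zip(*token_embeddings)]


def zscore(values, means, stds):
    """Normalize tabular features with precomputed training statistics."""
    return [(v - m) / s for v, m, s in zip(values, means, stds)]


def hybrid_logit(token_embeddings, tabular, means, stds, weights, bias):
    """Concatenate pooled DNA embedding with normalized tabular features, then apply a linear layer."""
    fused = mean_pool(token_embeddings) + zscore(tabular, means, stds)  # list concatenation
    return sum(w * x for w, x in zip(weights, fused)) + bias


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy numbers: 2 tokens with 2-dim embeddings, plus 2 tabular features.
toy_embeddings = [[1.0, 2.0], [3.0, 4.0]]
toy_tabular = [5.0, 2.0]
logit = hybrid_logit(
    toy_embeddings, toy_tabular,
    means=[4.0, 1.0], stds=[2.0, 1.0],
    weights=[0.5, -0.5, 1.0, -1.0], bias=0.5,
)
print(f"logit {logit:.2f}, pathogenicity score {sigmoid(logit):.4f}")
```

Concatenation keeps the two modalities separable, which is what makes the ablation tests above possible: either block of the fused vector can be blanked out independently.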
♻️ Open Science & 100% Reproducibility (Please give it a star if it helps!)
Train for free on Lightning.ai: a full run takes 6 hours on an A100 and fits within the free credits!
In clinical genomics, transparency is just as critical as performance, and we do not believe in "black box" medicine or hidden methodologies. To ensure total trust and let the community verify our benchmarks, the entire PathoPreter ecosystem is fully open-sourced.
- Dataset Generation Pipeline: The complete end-to-end data processing, k-mer tokenization, and feature engineering pipeline is publicly available at YADAV1825/PathoPreter.
- From-Scratch Training Scripts: We provide the exact, fully reproducible fine-tuning scripts used to create the model. Anyone can train PathoPreter from scratch, verify our claims, or adapt the architecture to build specialized models for their own private genetic cohorts. Available at YADAV1825/PathoPreter.
├── data_preprocessing/
│   ├── atgc_sequence_add.py
│   ├── clinvar_clean.py
│   ├── clinvar_download.py
│   ├── dbnsfp_download.py
│   ├── dbnsfp_merge.py
│   ├── gnomAD_download.py
│   ├── gnomAD_merge.py
│   ├── grch38_download.py
│   ├── grch38_merge.py
│   └── human_genome_builder.py
├── Instadeep_NT_500M_CPT/
│   ├── 100k_testing_AUC.ipynb
│   ├── 100k_testing_recall.ipynb
│   ├── ablation_study_10_tests.png
│   ├── Neucletide_transformer.ipynb
│   ├── shap_modality_comparison.png
│   └── shap_tabular_beeswarm.png
└── README.md
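The k-mer tokenization step mentioned in the pipeline description can be illustrated with a simplified sketch. It assumes non-overlapping 6-mers read left to right, with a fallback to single-nucleotide tokens when a full 6-mer is not available or contains an `N`. This is a teaching approximation with a hypothetical function name; in practice the tokenizer shipped with the backbone checkpoint should be used so that token IDs match the pretrained vocabulary.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA string into non-overlapping k-mers, one base at a time where that fails."""
    seq = sequence.upper()
    tokens = []
    i = 0
    while i < len(seq):
        chunk = seq[i:i + k]
        if len(chunk) == k and "N" not in chunk:
            tokens.append(chunk)   # a clean, full-length k-mer
            i += k
        else:
            tokens.append(seq[i])  # fall back to a single-nucleotide token
            i += 1
    return tokens


print(kmer_tokenize("ATGCGTACGTTA"))
print(kmer_tokenize("atgcgtac"))
```

With 6-mers, a sequence window costs roughly one sixth as many tokens as its length in bases, which is part of what keeps long DNA contexts affordable for a 500M-parameter model.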
Organization
AutonomousX
AutonomousX focuses on open-source contributions aimed at building Large Language Models from scratch using custom training pipelines.
Our work explores different training configurations including optimizers, datasets, and scalable TPU training using JAX and pmap. The goal is to provide transparent and reproducible implementations so that researchers, students, and developers can understand how modern LLMs are trained end-to-end.
Due to the current scarcity of complete beginner-friendly guides for training LLMs on TPUs, especially using JAX, AutonomousX aims to bridge this gap by publishing full training pipelines, scripts, and documentation for the open-source community.