CYP2C9 Variant Function Classifier (research artifact: honest negative)
Experimental classifier for CYP2C9 variant functional classification (no_function / decreased_function / normal_function), built by Anukriti AI.
Read this first. This repository is published as a transparent negative result. The v2 model fixes the circularity of v1 but still fails the held-out clinical test (1/6 correct). It is not a clinical predictor and must not be used for dosing decisions. It is shared so that the finding (single-assay MAVE labels do not generalize to CPIC clinical phenotype for CYP2C9) is reproducible.
Versions
- v1: MAVE-threshold scaffold. 8,050 training rows. 5-fold CV accuracy 0.996 (XGB) but circular: the click_score/vamp_score features that the labels were thresholded from drive ~77% of feature importance. Leave-anchors-out: 4/4 CPIC anchors misclassified without 500× upweighting. A MAVE-threshold reproducer, not a clinical predictor.
- v2: non-circular. click_score/vamp_score removed; AlphaMissense (genomic-coordinate-corrected) + CADD added. Trained on the 2,514-row SNV-reachable subset. 5-fold CV AUC ~0.88 (XGB 0.886): believable, not hollow. Held-out clinical test: 1/6 = 17% (only *11 predicted correctly).
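The circularity described for v1 can be illustrated with a minimal sketch on synthetic data. It is not the repository's training code: the cutoffs NO_FUNCTION_MAX and DECREASED_MAX, the function names, and the uniformly random scores are all assumptions made for illustration. The point is only that when labels are thresholded from a score, any model given that score can reproduce the labels without learning anything about the variants.

```python
import random

# Hypothetical cutoffs chosen only for illustration; the thresholds actually
# used to derive the MAVE labels are not stated in this document.
NO_FUNCTION_MAX = 0.25
DECREASED_MAX = 0.75


def label_from_score(score):
    """Derive a class label by thresholding a single assay score."""
    if score < NO_FUNCTION_MAX:
        return "no_function"
    if score < DECREASED_MAX:
        return "decreased_function"
    return "normal_function"


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


rng = random.Random(0)
scores = [rng.random() for _ in range(1000)]  # synthetic assay scores
labels = [label_from_score(s) for s in scores]

# A "model" that sees the score only has to relearn the cutoffs.
circular_acc = accuracy([label_from_score(s) for s in scores], labels)

# Baseline that never sees the score: always predict the majority class.
majority = max(set(labels), key=labels.count)
baseline_acc = accuracy([majority] * len(labels), labels)

print(f"with score feature: {circular_acc:.3f}")
print(f"majority baseline:  {baseline_acc:.3f}")
```

A near-perfect cross-validation score under these conditions says nothing about clinical validity, which is why a held-out test against independently defined labels is the more informative check.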
The finding
Removing the circular features fixed the inflated CV score, but the model still fails clinically because it was trained on MAVE-threshold labels, and MAVE assay function ≠ CPIC clinical function for CYP2C9. The bottleneck is the label definition, not feature quality or model architecture.
AlphaMissense is discriminative where available (monotonic class separation in mean score: normal 0.21 → decreased 0.44 → no_function 0.65), but covers only 31.3% of this codon-saturation MAVE library because 67.5% of variants require multi-nucleotide AA changes that AlphaMissense cannot score by design. Coverage is the blocker, not feature quality.
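The coverage gap follows from the genetic code: a single-nucleotide change to a codon can reach only a handful of other amino acids, while a codon-saturation library substitutes all of them. The sketch below, which is illustrative and not part of the repository, enumerates the amino acids reachable from one codon by a single-nucleotide variant (SNV) using the standard genetic code. The names CODON_TABLE and snv_reachable are hypothetical, and CGT is used only as an example arginine codon.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def snv_reachable(codon):
    """Return the non-synonymous outcomes ('*' = stop) reachable from
    `codon` by changing exactly one nucleotide."""
    ref_aa = CODON_TABLE[codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            alt_aa = CODON_TABLE[mutant]
            if alt_aa != ref_aa:
                reachable.add(alt_aa)
    return reachable


example = snv_reachable("CGT")  # CGT encodes arginine (R)
print(sorted(example))
print("W" in example)  # tryptophan (TGG) needs two changes from CGT
```

For this codon, six of the 19 possible amino-acid substitutions are SNV-reachable; the rest need two or three nucleotide changes and so fall outside what an SNV-based score can cover.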
Ground truth / sources
MaveDB (Click-seq + VAMP-seq CYP2C9 libraries), CPIC CYP2C9 allele-function table, PharmVar, Ensembl VEP / AlphaMissense, CADD.
Citation
Part of the Anukriti AI platform validation effort. Project-level preprint: https://doi.org/10.5281/zenodo.20727790 (This DOI covers the broader Anukriti validation study, not a CYP2C9-specific artifact.)