MuLGIT-Perturb: Perturbation Layer Extension
From Static Biomarker Ranking to Counterfactual Molecular State Prediction
Status: Design Specification → Study Plan
Repository: https://huggingface.co/vedatonuryilmaz/MuLGIT
Date: 2026-05-09
1. Motivation: What Static Models Miss
MuLGIT currently identifies causal features that correlate with survival/longevity and ranks drugs by their static properties (MOA, clinical status, drug-likeness). But correlation is not causation; a drug ranked highly by MOA relevance may actually worsen the molecular state of a given patient. What's missing is the ability to answer:
"What will this perturbation do to this patient's molecular state?"
Moving beyond static ranking to counterfactual prediction transforms MuLGIT from a descriptive tool into a prescriptive one, useful for drug repurposing, mechanism inference, target validation, and personalized treatment-response prioritization.
2. Capability Specification
Input
- Baseline omics state (the patient's or cell's molecular profile before perturbation):
  - DNA: methylation β-values, CNV copy numbers
  - RNA: mRNA expression, miRNA expression
  - (Future) Chromatin: WGBS, ATAC-seq GeneActivity
- Perturbation descriptor (what is being applied):
  - Drug: SMILES string → molecular fingerprint or ChemBERTa embedding
  - Genetic perturbation: gene ID + perturbation type (CRISPR KO, CRISPRi, OE, shRNA)
  - Drug combination: set of drug fingerprints + dose ratios
Output
- Predicted molecular state shift (Δ), as per-gene or per-pathway expression change:
  - ŷ_post = f(baseline, perturbation): predicted post-perturbation expression
  - Δ = ŷ_post - baseline: direction and magnitude of change per gene
- Uncertainty quantification (per-gene prediction intervals):
  - Epistemic uncertainty via deep ensembles (3-5 models with different seeds)
  - Aleatoric uncertainty via predicted variance (heteroscedastic output head)
- Mechanistic attribution (why this perturbation causes this shift):
  - Gene set activation scores (MSigDB pathways, Reactome, KEGG)
  - Top differentially expressed genes (DEGs) with effect sizes
  - Perturbation-conditioned subgraph of activated molecular interactions (from biological knowledge graphs)
  - Counterfactual decomposition: "If we blocked pathway X, the effect on gene Y would change by Z"
3. Architecture: MuLGIT-Perturb
3.1 Design Principles
Reuse MuLGIT encoders: the CentralDogmaFusion SNN encoders for methylation, CNV, mRNA, and miRNA are already trained on TCGA survival prediction. They become frozen feature extractors for the perturbation layer, providing biologically grounded latent representations.
Perturbation-conditioned residual prediction: the model predicts the difference between pre- and post-perturbation molecular states, not the absolute state. This is standard in the field (CPA, GEARS, and Lingshu-Cell all use delta prediction).
Central dogma as structural prior: the perturbation flows through the same biological layers, drug → chromatin state → DNA methylation → mRNA expression → miRNA → predicted phenotype shift.
Uncertainty via deep ensembles: train 3-5 identical-architecture models with different random seeds. The variance of predictions across ensemble members gives the epistemic uncertainty. Deep ensembles are a simple, well-validated approach (Lakshminarayanan et al., NeurIPS 2017).
Chemical structure encoding: use MolFormer or ChemBERTa-2 (pretrained, frozen) to encode drug SMILES into fixed-size embeddings, replacing the simple 1D-CNN in the current drug_target.py.
3.2 Architecture Diagram
PERTURBATION ENCODER
    Drug SMILES          → MolFormer (frozen)  → z_drug
    Genetic perturbation → Geneformer (frozen) → z_gene
    Fusion Gate (learned α): z_pert = α·z_drug + (1-α)·z_gene
    Output: z_pert ∈ R^768

MULGIT (frozen feature extractor)
    Baseline state:
        Methylation → SNN Encoder → z_meth
        CNV         → SNN Encoder → z_cnv
        mRNA        → SNN Encoder → z_mrna
        miRNA       → SNN Encoder → z_mirna
    CentralDogmaFusion:
        z_dna     = [z_meth, z_cnv]
        z_dna_rna = DNA→RNA(z_dna, z_mrna)
        z_fused   = FullFusion(z_dna_rna, z_mrna, z_mirna)
    Output: z_fused ∈ R^48 (baseline latent state)

            z_pert, z_fused
                  |
                  v
PERTURBATION RESPONSE PREDICTOR (trained)
    Condition Fusion:
        z_cond = FiLM(z_fused, z_pert)
        (Feature-wise Linear Modulation: scale + shift from the
         perturbation embedding, applied to the baseline state)
                  |
                  v
    Delta Decoder (SNNStack):
        z_cond → [1024] → [512] → [256] → [128] → Δ
                                                → σ² (UQ)
        Δ  ∈ R^G (per-gene logFC prediction)
        σ² ∈ R^G (per-gene predicted variance)
                  |
                  v
    ŷ_post = y_baseline + Δ
                  |
                  v
OUTPUT LAYER
    Δ per gene (logFC)
    σ² per gene (variance)
    Pathway Enrichment (GSEA on Δ)
    Mechanistic Attribution:
        Integrated Gradients on Δ (gene→pathway attribution)
        Perturbation-Conditioned Knowledge Subgraph
            (AdaPert-style: GNN over GO/Reactome/STRING + drug)
3.3 Key Components
3.3.1 PerturbationEncoder
Encodes both drug and genetic perturbations into a unified embedding:
- Drug: SMILES → MolFormer (frozen, 768-dim) or ChemBERTa-2
- Genetic: gene symbol → Geneformer gene embedding (frozen)
- Fusion: learned convex combination z_pert = α·z_drug + (1-α)·z_gene, with α ∈ [0,1]
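The convex combination can be sketched in a few lines. This is only an illustration under stated assumptions: the gate is taken to be a single scalar logit squashed by a sigmoid (so α lies strictly inside (0, 1)), and both embeddings are assumed to share a dimension already; the spec does not say how Geneformer embeddings are brought to 768-dim. The names `fuse_perturbation` and `gate_logit` are hypothetical.

```python
import math

def fuse_perturbation(z_drug, z_gene, gate_logit):
    """Convex combination z_pert = alpha*z_drug + (1-alpha)*z_gene.

    gate_logit is a hypothetical learnable scalar; the sigmoid keeps
    alpha in (0, 1), so z_pert stays between the two embeddings.
    """
    if len(z_drug) != len(z_gene):
        raise ValueError("embeddings must share a dimension before fusion")
    alpha = 1.0 / (1.0 + math.exp(-gate_logit))
    return [alpha * d + (1.0 - alpha) * g for d, g in zip(z_drug, z_gene)]

# Toy 3-dim embeddings (the diagram uses 768-dim vectors).
z_pert = fuse_perturbation([1.0, 0.0, 2.0], [0.0, 1.0, 2.0], gate_logit=0.0)
```

With `gate_logit=0.0` the gate sits at α = 0.5, so each output coordinate is the plain average of the two inputs.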
3.3.2 FiLM Conditioning Layer
Feature-wise Linear Modulation applies perturbation as scale + shift to baseline state:
z_cond = γ(z_pert) ⊙ z_fused + β(z_pert)
This is more expressive than concatenation and better suited to conditioning a state representation.
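A minimal NumPy sketch of the FiLM step follows, assuming γ and β are each a single linear map of the perturbation embedding. The class name, the weight initialisation, and the choice to start the γ bias at 1 (so an untrained layer is close to the identity) are assumptions made for illustration, not part of the spec; the 768 and 48 dimensions come from the architecture diagram.

```python
import numpy as np

class FiLM:
    """z_cond = gamma(z_pert) * z_fused + beta(z_pert), applied per feature."""

    def __init__(self, pert_dim, state_dim, rng):
        # Linear maps from the perturbation embedding to a per-feature
        # scale (gamma) and shift (beta). Gamma bias starts at 1 and beta
        # bias at 0 so the layer initially passes z_fused through.
        self.W_gamma = rng.normal(0.0, 0.01, size=(pert_dim, state_dim))
        self.b_gamma = np.ones(state_dim)
        self.W_beta = rng.normal(0.0, 0.01, size=(pert_dim, state_dim))
        self.b_beta = np.zeros(state_dim)

    def __call__(self, z_fused, z_pert):
        gamma = z_pert @ self.W_gamma + self.b_gamma  # (batch, state_dim)
        beta = z_pert @ self.W_beta + self.b_beta     # (batch, state_dim)
        return gamma * z_fused + beta

rng = np.random.default_rng(0)
film = FiLM(pert_dim=768, state_dim=48, rng=rng)
z_fused = rng.normal(size=(4, 48))   # baseline latent states
z_pert = rng.normal(size=(4, 768))   # perturbation embeddings
z_cond = film(z_fused, z_pert)
```

A zero perturbation embedding leaves the baseline state unchanged under this initialisation, which is one reason to prefer it as a starting point for a residual (delta) predictor.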
3.3.3 DeltaDecoderWithUncertainty
Predicts per-gene logFC (Δ) and per-gene prediction variance (σ²):
- Shared SNNStack: z_cond → [1024] → [512] → [256] → [128]
- Δ head: Linear(128, n_genes), the predicted expression change
- σ² head: Linear(128, n_genes) → Softplus, the predicted variance (> 0)
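The two output heads can be sketched as below, assuming the shared SNNStack has already produced a 128-dim feature vector per sample. The function names and the small variance floor `eps` are assumptions added for numerical safety; they are not named in the spec.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def delta_and_variance_heads(h, W_delta, b_delta, W_var, b_var, eps=1e-6):
    """Map shared features h (batch, 128) to per-gene delta and variance."""
    delta = h @ W_delta + b_delta
    var = softplus(h @ W_var + b_var) + eps  # eps is an assumed floor
    return delta, var

rng = np.random.default_rng(0)
n_genes = 978  # L1000 landmark count, used only to keep the toy small
h = rng.normal(size=(2, 128))
W_delta = rng.normal(0.0, 0.05, size=(128, n_genes))
W_var = rng.normal(0.0, 0.05, size=(128, n_genes))
delta, var = delta_and_variance_heads(
    h, W_delta, np.zeros(n_genes), W_var, np.zeros(n_genes)
)
```

Softplus keeps the variance strictly positive without the hard cutoff of a ReLU, so the log term in the NLL loss stays finite.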
3.3.4 MuLGITPerturbEnsemble
K=3-5 identical models trained with different seeds:
- Epistemic uncertainty = variance of Δ across the ensemble
- Aleatoric uncertainty = mean of the σ² predicted by each model
- Total uncertainty = epistemic + aleatoric
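The decomposition above can be written out directly. The sketch below assumes each ensemble member returns a per-gene Δ and σ², and uses the population variance across members for the epistemic term (the usual mixture-of-Gaussians reading of deep ensembles); `combine_ensemble` is a hypothetical helper name.

```python
from statistics import fmean, pvariance

def combine_ensemble(deltas, variances):
    """deltas[k][g] and variances[k][g]: outputs of K models for G genes."""
    n_genes = len(deltas[0])
    summary = []
    for g in range(n_genes):
        mu_k = [d[g] for d in deltas]
        var_k = [v[g] for v in variances]
        epistemic = pvariance(mu_k)  # disagreement between members
        aleatoric = fmean(var_k)     # average predicted noise
        summary.append({
            "delta": fmean(mu_k),
            "epistemic": epistemic,
            "aleatoric": aleatoric,
            "total": epistemic + aleatoric,
        })
    return summary

# K=3 members, G=2 genes (toy numbers).
summary = combine_ensemble(
    deltas=[[1.0, -0.5], [2.0, -0.5], [3.0, -0.5]],
    variances=[[0.1, 0.2], [0.2, 0.2], [0.3, 0.2]],
)
```

In the toy input the members disagree on gene 0 but agree exactly on gene 1, so only gene 0 carries epistemic uncertainty.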
4. Data Strategy
4.1 Primary: Tahoe-100M
Status: Ready on HF Hub
Path: tahoebio/Tahoe-100M
| Property | Value |
|---|---|
| Rows | 95.6M drug-gene perturbation observations |
| Drugs | ~1,100 (with canonical SMILES, PubChem CID, MOA) |
| Cell lines | 50 cancer cell lines |
| Genes | 45,134 (full transcriptome) |
| Matched pre/post | Yes (vehicle controls per cell line) |
4.2 Supplementary Datasets
| Dataset | Perturbation Type | Samples | HF Hub | Use |
|---|---|---|---|---|
| GDSC | Drug (251) | 251K IC50 | β | IC50 validation |
| DepMap | Genetic (CRISPR KO) | 1,000+ cell lines | β | Genetic pert. validation |
| LINCS L1000 | Drug + genetic | 1.3M profiles | β | Cross-dataset benchmark |
| Norman19 | Genetic combos | 91K cells | Partial | Combinatorial benchmark |
4.3 Data Loading for Training
from datasets import load_dataset

# Drug metadata
drug_meta = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train")

# Expression data (streamed, since the table has 95.6M rows)
expr_ds = load_dataset("tahoebio/Tahoe-100M", "expression_data",
                       split="train", streaming=True)

# Each row: genes (int64[]), expressions (float32[]), drug, cell_line_id,
# canonical_smiles, pubchem_cid, moa-fine, sample
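Each row stores expression sparsely as parallel `genes` and `expressions` arrays, so a training target Δ needs a dense vector and a matched control. The sketch below shows one way to do that, assuming gene IDs index directly into a vector of length `n_genes` and that a vehicle-control profile for the same cell line is available; the real dataset may need an ID-to-index mapping from its gene metadata. `densify`, `log_fold_change`, and the pseudocount are illustrative choices.

```python
import numpy as np

def densify(genes, expressions, n_genes):
    """Scatter one sparse row (gene indices + values) into a dense vector."""
    dense = np.zeros(n_genes, dtype=np.float32)
    dense[np.asarray(genes, dtype=np.int64)] = expressions
    return dense

def log_fold_change(treated, control, pseudocount=1.0):
    """Per-gene log2 fold change of treated vs control expression."""
    return np.log2(treated + pseudocount) - np.log2(control + pseudocount)

# Toy rows shaped like the fields listed above (values are made up).
treated = densify(genes=[0, 2], expressions=[3.0, 7.0], n_genes=4)
control = densify(genes=[0, 1], expressions=[1.0, 1.0], n_genes=4)
delta_true = log_fold_change(treated, control)
```

Genes absent from both rows get a Δ of exactly zero, which is the behaviour the sparsity loss in Section 5.2 expects for non-responding genes.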
5. Training Objective
5.1 Primary Loss: Negative Log-Likelihood with Uncertainty
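L_NLL = 0.5 · Σ_g [ log(σ²_g) + (Δ_true,g − Δ_pred,g)² / σ²_g ]
Heteroscedastic regression loss (Kendall & Gal, NeurIPS 2017). When σ²_g is large, the model is uncertain and the squared-error term is down-weighted; the log(σ²_g) term penalizes inflating the variance, so the model cannot lower the loss by claiming uncertainty everywhere.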
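As a worked example of this trade-off, the sketch below evaluates the loss for a single gene with a fixed squared error of 1 and three different predicted variances. `gaussian_nll` is a hypothetical helper that follows the formula above term by term.

```python
import math

def gaussian_nll(delta_true, delta_pred, var_pred):
    """0.5 * sum_g [ log(var_g) + (true_g - pred_g)^2 / var_g ]."""
    total = 0.0
    for t, p, v in zip(delta_true, delta_pred, var_pred):
        if v <= 0.0:
            raise ValueError("predicted variance must be positive")
        total += math.log(v) + (t - p) ** 2 / v
    return 0.5 * total

# Same squared error (1.0), three variance choices.
overconfident = gaussian_nll([1.0], [0.0], [0.1])
calibrated = gaussian_nll([1.0], [0.0], [1.0])
overcautious = gaussian_nll([1.0], [0.0], [10.0])
```

For one gene the loss is minimised when the predicted variance equals the squared error, so both the overconfident and the overcautious settings score worse than the calibrated one.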
5.2 Auxiliary Losses
- Pathway consistency: L_path = MSE(GSEA(Δ_pred), GSEA(Δ_true)), so that pathways shift coherently
- Sparsity: L_sparse = ||Δ_pred||₁ / (B·G), so that only genes that actually respond shift
- Risk shift (optional): L_risk = MSE(risk_post − risk_baseline, risk_shift_target), which ties the prediction back to survival
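The sparsity and risk-shift terms are simple enough to sketch directly; the pathway term is omitted here because it depends on how GSEA scores are computed. The loss weights in `total_loss` are placeholders, since the spec does not fix them, and all function names are hypothetical.

```python
def sparsity_loss(delta_pred):
    """L1 norm of a (B x G) batch of predicted deltas, divided by B*G."""
    n_rows = len(delta_pred)
    n_genes = len(delta_pred[0])
    return sum(abs(x) for row in delta_pred for x in row) / (n_rows * n_genes)

def risk_shift_loss(risk_post, risk_baseline, risk_shift_target):
    """MSE between predicted risk shifts and target risk shifts."""
    diffs = [
        (rp - rb) - target
        for rp, rb, target in zip(risk_post, risk_baseline, risk_shift_target)
    ]
    return sum(d * d for d in diffs) / len(diffs)

def total_loss(nll, path, sparse, risk=0.0,
               w_path=0.1, w_sparse=0.01, w_risk=0.1):
    # Placeholder weights: the spec lists the terms but not their scale.
    return nll + w_path * path + w_sparse * sparse + w_risk * risk

l_sparse = sparsity_loss([[0.5, -1.5], [0.0, 2.0]])  # (0.5 + 1.5 + 0 + 2) / 4
l_risk = risk_shift_loss([0.0], [0.5], [-0.5])       # shift matches target
```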
5.3 Hyperparameters
| Parameter | Value | Source |
|---|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) | Lingshu-Cell |
| LR schedule | 2e-4 → 2e-5, cosine | Lingshu-Cell |
| Batch size | 256 | Lingshu-Cell |
| Mixed precision | bf16 | Lingshu-Cell |
| MuLGIT backbone | Frozen | Transfer from TCGA |
| Ensemble size | K=3 models | Lakshminarayanan 2017 |
| Epochs | 50-100 (early stopping on validation NLL) | Standard |
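The learning-rate row can be read as a cosine decay from 2e-4 to 2e-5. The sketch below assumes no warmup and a per-step schedule, neither of which the table specifies; `cosine_lr` is a hypothetical helper.

```python
import math

def cosine_lr(step, total_steps, lr_max=2e-4, lr_min=2e-5):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

start_lr = cosine_lr(0, 1000)
mid_lr = cosine_lr(500, 1000)
end_lr = cosine_lr(1000, 1000)
```

Halfway through training the rate sits at the midpoint of the two endpoints, 1.1e-4.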
6. Benchmark Tasks & Leakage Controls
6.1 Task Suite
| Task | Data | Split | Key Metric |
|---|---|---|---|
| Drug response prediction | Tahoe-100M | Leave-drug-out (20%) | DES@50, Pearson-Δ |
| Genetic perturbation | DepMap/Norman19 | Leave-gene-out (20%) | DES@50, Direction-match |
| Combinatorial pert. | Norman19 | Leave-combo-out (70%) | PDS, compositionality |
| Drug sensitivity (IC50) | GDSC | Leave-cell-line-out | Pearson r, Spearman ρ |
| Pathway rescue | Tahoe-100M (aging) | Leave-pathway-out | AUC of rescue |
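The leakage control in a leave-drug-out split comes from partitioning by drug identity, not by row. A minimal sketch, assuming each observation carries a `drug` field as in the Tahoe-100M rows described in Section 4.3; the helper names and the toy data are illustrative.

```python
import random

def leave_drug_out_split(drugs, holdout_frac=0.2, seed=0):
    """Partition unique drug names so held-out drugs never appear in training."""
    unique = sorted(set(drugs))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(holdout_frac * len(unique)))
    return set(unique[n_test:]), set(unique[:n_test])

def assign_rows(rows, test_drugs, key="drug"):
    """Route each observation to train or test by its drug, not its row index."""
    train, test = [], []
    for row in rows:
        (test if row[key] in test_drugs else train).append(row)
    return train, test

# 10 toy drugs, each profiled in 2 cell lines.
rows = [{"drug": d, "cell_line_id": c} for d in "ABCDEFGHIJ" for c in ("c1", "c2")]
train_drugs, test_drugs = leave_drug_out_split([r["drug"] for r in rows])
train_rows, test_rows = assign_rows(rows, test_drugs)
```

A random row-level split would place the same drug in both partitions through its other cell lines, which inflates every metric in the table above. The same pattern applies to leave-gene-out and leave-cell-line-out by changing `key`.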
6.2 Metrics
| Metric | What It Measures | Range | Target |
|---|---|---|---|
| DES@K | Fraction of true DEGs in top-K predicted | [0,1] | Higher |
| Pearson-Δ | Correlation of predicted vs true Δ | [-1,1] | Higher |
| Direction-match | Fraction of genes with correct sign | [0,1] | Higher |
| PDS | Discrimination between perturbations | [0,1] | Higher |
| MMD | Distributional fidelity in PCA space | [0,∞) | Lower |
| RMSE | Raw expression reconstruction error | [0,∞) | Lower |
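Three of these metrics can be sketched in plain Python. The DES@K implementation below is one possible reading of the table's definition: it takes the true DEGs to be the top-K genes by |Δ_true| and measures their overlap with the top-K by |Δ_pred|; a significance-based DEG call would be an equally valid choice. All function names are hypothetical.

```python
import math

def des_at_k(delta_true, delta_pred, k):
    """Overlap between the top-k genes by |delta| in truth and prediction."""
    def top(values):
        order = sorted(range(len(values)), key=lambda g: abs(values[g]), reverse=True)
        return set(order[:k])
    return len(top(delta_true) & top(delta_pred)) / k

def pearson_delta(delta_true, delta_pred):
    """Pearson correlation between true and predicted per-gene deltas."""
    n = len(delta_true)
    mt, mp = sum(delta_true) / n, sum(delta_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(delta_true, delta_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in delta_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in delta_pred))
    return cov / (st * sp)

def direction_match(delta_true, delta_pred):
    """Fraction of genes whose predicted sign (+, -, or 0) matches the truth."""
    def sign(x):
        return (x > 0) - (x < 0)
    same = sum(1 for t, p in zip(delta_true, delta_pred) if sign(t) == sign(p))
    return same / len(delta_true)

true = [2.0, -1.5, 0.1, 0.0, -0.2]
pred = [1.8, -1.0, -0.3, 0.0, -0.1]
scores = (des_at_k(true, pred, 2), pearson_delta(true, pred), direction_match(true, pred))
```

In the toy vectors the two strongest responders are ranked correctly and only the third gene has the wrong sign, so DES@2 is 1.0 and direction-match is 0.8.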
7. Comparison to Existing Baselines
| Method | Gene Coverage | Drug Input | UQ | Attribution | HF-Ready |
|---|---|---|---|---|---|
| CPA | HVG (1-4K) | One-hot ID | Latent sampling | β | β |
| GEARS | All genes | One-hot ID | β | Gene graph edges | β |
| Lingshu-Cell | All 18K | One-hot ID | β | β | β |
| AdaPert | HVG only | Semantic embed | β | β KG subgraph | β |
| MuLGIT-Perturb | All 45K+ | SMILESβMolFormer | β Ensemble | β IG + KG | β |
8. Implementation Plan
8.1 New Files
mulgit/
├── perturb/
│   ├── __init__.py
│   ├── encoder.py        # PerturbationEncoder (drug + gene)
│   ├── conditioning.py   # FiLMConditioning
│   ├── decoder.py        # DeltaDecoderWithUncertainty
│   ├── model.py          # MuLGITPerturb
│   ├── ensemble.py       # MuLGITPerturbEnsemble
│   ├── attribution.py    # Integrated Gradients + KG subgraph
│   ├── losses.py         # NLL + pathway consistency + sparsity
│   ├── data.py           # Tahoe-100M perturbation data loader
│   ├── trainer.py        # Training loop with Trackio
│   ├── evaluate.py       # DES@K, PDS, Pearson-Δ, etc.
│   └── config.py         # MuLGITPerturbConfig
8.2 Five-Phase Execution
| Phase | Duration | Deliverable |
|---|---|---|
| 1. Tahoe-100M drug response | 2 weeks | Trained model + all benchmarks on leave-drug-out |
| 2. Cross-dataset generalization | 1 week | Zero-shot evaluation on LINCS, fine-tune if needed |
| 3. Genetic perturbations | 1 week | DepMap KO prediction, compare to Geneformer |
| 4. Combinatorial perturbation | 1 week | Norman19 combos, PDS evaluation |
| 5. Longevity-specific | 1 week | Aging pathway rescue, drug ranking vs static screener |
Total: 6 weeks from start to full validation.
9. Example Output
{
  "perturbation": "Rapamycin",
  "cell_line": "A549",
  "baseline_risk": 0.342,
  "predicted_post_risk": -0.187,
  "risk_shift": -0.529,
  "top_degs": [
    {"gene": "MTOR", "logFC": -0.23, "ci_95": [-0.45, -0.01]},
    {"gene": "FOXO3", "logFC": 0.87, "ci_95": [0.52, 1.22]},
    {"gene": "ATG5", "logFC": 1.34, "ci_95": [0.98, 1.70]}
  ],
  "enriched_pathways": [
    {"pathway": "Autophagy", "NES": 2.1, "pval": 0.0003},
    {"pathway": "mTOR signaling", "NES": -1.8, "pval": 0.001}
  ],
  "aging_signature_reversal": 0.72,
  "epistemic_uncertainty": 0.08,
  "aleatoric_uncertainty": 0.15
}
10. Risk Assessment
| Risk | Mitigation |
|---|---|
| Tahoe-100M streaming slow (95.6M rows) | Local cache after first pass; pre-filter to 10M most informative rows |
| MolFormer fails for some SMILES | Fallback to Morgan fingerprints (RDKit already in deps) |
| MuLGIT backbone overfits to TCGA | Include tissue type as conditioning; evaluate zero-shot on LINCS |
| GPU memory for 45K-gene output | Start with L1000 landmarks (978); scale after validation |
| Combinatorial prediction too hard | Treat as a stretch goal; single-perturbation prediction alone is impactful |
References
- Lotfollahi et al., "CPA," arXiv:2011.03086 (2021)
- Roohani et al., "GEARS," Nature Biotech 2023. arXiv:2205.13986
- Chen et al., "Lingshu-Cell," arXiv:2603.25240 (2025)
- Li et al., "AdaPert," arXiv:2602.18885 (2025)
- He et al., "PerturBench," arXiv:2408.10609 (2024)
- El Nahhas et al., "SeNMo," arXiv:2405.08226 (2024)
- Ross et al., "MolFormer," Nature Machine Intelligence 2022
- Theodoris et al., "Geneformer," Nature 2023
- Perez et al., "FiLM," AAAI 2018
- Kendall & Gal, "Bayesian UQ," NeurIPS 2017
- Lakshminarayanan et al., "Deep Ensembles," NeurIPS 2017
- Mendez-Lucio et al., "Tahoe-100M," bioRxiv 2024