| # MuLGIT-Perturb: Perturbation Layer Extension |
| ## From Static Biomarker Ranking to Counterfactual Molecular State Prediction |
|
|
| **Status:** Design Specification and Study Plan |
| **Repository:** https://huggingface.co/vedatonuryilmaz/MuLGIT |
| **Date:** 2026-05-09 |
|
|
| --- |
|
|
| ## 1. Motivation: What Static Models Miss |
|
|
| MuLGIT currently identifies candidate **causal features** associated with survival/longevity and **ranks drugs** by static properties (MOA, clinical status, drug-likeness). But correlation is not causation: a drug ranked highly by MOA relevance may actually **worsen** the molecular state of a given patient. What is missing is the ability to answer: |
|
|
| > **"What will this perturbation do to this patient's molecular state?"** |
|
|
| Moving beyond static ranking to counterfactual prediction transforms MuLGIT from a descriptive tool into a prescriptive one, useful for drug repurposing, mechanism inference, target validation, and personalized treatment-response prioritization. |
|
|
| --- |
|
|
| ## 2. Capability Specification |
|
|
| ### Input |
| 1. **Baseline omics state** (the patient's or cell's molecular profile before perturbation): |
| - DNA: methylation β-values, CNV copy numbers |
| - RNA: mRNA expression, miRNA expression |
| - (Future) Chromatin: WGBS, ATAC-seq GeneActivity |
| 2. **Perturbation descriptor** (what is being applied): |
| - Drug: SMILES string → molecular fingerprint or ChemBERTa embedding |
| - Genetic perturbation: gene ID + perturbation type (CRISPR KO, CRISPRi, OE, shRNA) |
| - Drug combination: set of drug fingerprints + dose ratios |
|
|
| ### Output |
| 1. **Predicted molecular state shift (Δ)**, the per-gene or per-pathway expression change: |
| - `ŷ_post = f(baseline, perturbation)`: predicted post-perturbation expression |
| - `Δ = ŷ_post - baseline`: direction and magnitude of change per gene |
| 2. **Uncertainty quantification** (per-gene prediction intervals): |
| - Epistemic uncertainty via deep ensembles (3-5 models with different seeds) |
| - Aleatoric uncertainty via predicted variance (heteroscedastic output head) |
| 3. **Mechanistic attribution** (why this perturbation causes this shift): |
| - Gene set activation scores (MSigDB pathways, Reactome, KEGG) |
| - Top differentially expressed genes (DEGs) with effect sizes |
| - Perturbation-conditioned subgraph of activated molecular interactions (from biological knowledge graphs) |
| - Counterfactual decomposition: "If we blocked pathway X, the effect on gene Y would change by Z" |
|
|
| --- |
|
|
| ## 3. Architecture: MuLGIT-Perturb |
|
|
| ### 3.1 Design Principles |
|
|
| 1. **Reuse MuLGIT encoders**: the CentralDogmaFusion SNN encoders for methylation, CNV, mRNA, and miRNA are already trained on TCGA survival prediction. These become **frozen feature extractors** for the perturbation layer, providing biologically grounded latent representations. |
|
|
| 2. **Perturbation-conditioned residual prediction**: the model predicts the **difference** between pre- and post-perturbation molecular states, not the absolute state. Delta prediction is common practice in the field (CPA, GEARS, and Lingshu-Cell all use it). |
|
|
| 3. **Central dogma as structural prior**: the perturbation flows through the same biological layers (drug → chromatin state → DNA methylation → mRNA expression → miRNA → predicted phenotype shift). |
|
|
| 4. **Uncertainty via deep ensembles**: train 3-5 identical-architecture models with different random seeds. The variance of the predictions across ensemble members gives the epistemic uncertainty. Deep ensembles are a simple, well-validated approach (Lakshminarayanan et al., NeurIPS 2017). |
|
|
| 5. **Chemical structure encoding**: use MolFormer or ChemBERTa-2 (pretrained, frozen) to encode drug SMILES into fixed-size embeddings, replacing the simple 1D-CNN in the current `drug_target.py`. |
|
|
| ### 3.2 Architecture Diagram |
|
|
| ``` |
|                     ┌────────────────────────────────────┐ |
|                     │        PERTURBATION ENCODER        │ |
|                     │  ┌───────────┐   ┌─────────────┐   │ |
| Drug SMILES ───────►│  │ MolFormer │   │ Geneformer  │   │◄── Genetic Pert. |
|                     │  │ (frozen)  │   │ (frozen)    │   │ |
|                     │  └─────┬─────┘   └──────┬──────┘   │ |
|                     │        │                │          │ |
|                     │  ┌─────▼────────────────▼──────┐   │ |
|                     │  │ Fusion Gate (learned α)     │   │ |
|                     │  │ z_pert = α·z_drug +         │   │ |
|                     │  │          (1-α)·z_gene       │   │ |
|                     │  └──────────────┬──────────────┘   │ |
|                     └─────────────────┬──────────────────┘ |
|                                       │ z_pert ∈ R^768 |
|                                       │ |
| ┌─────────────────────────────────────▼────────────────────────────┐ |
| │                MULGIT (frozen feature extractor)                 │ |
| │                                                                  │ |
| │  BASELINE STATE                                                  │ |
| │  Methylation ─► SNN Encoder ─► z_meth                            │ |
| │  CNV         ─► SNN Encoder ─► z_cnv                             │ |
| │  mRNA        ─► SNN Encoder ─► z_mrna                            │ |
| │  miRNA       ─► SNN Encoder ─► z_mirna                           │ |
| │                                                                  │ |
| │  CentralDogmaFusion                                              │ |
| │    z_dna     = [z_meth, z_cnv]                                   │ |
| │    z_dna_rna = DNA→RNA(z_dna, z_mrna)                            │ |
| │    z_fused   = FullFusion(z_dna_rna, z_mrna, z_mirna)            │ |
| │                                     │                            │ |
| └─────────────────────────────────────┬────────────────────────────┘ |
|                                       │ z_fused ∈ R^48 (baseline latent state) |
|                                       │ |
| ┌─────────────────────────────────────▼────────────────────────────┐ |
| │            PERTURBATION RESPONSE PREDICTOR (trained)             │ |
| │                                                                  │ |
| │  ┌────────────────────────────────────────────────────────────┐  │ |
| │  │ Condition Fusion                                           │  │ |
| │  │ z_cond = FiLM(z_fused, z_pert)                             │  │ |
| │  │ (Feature-wise Linear Modulation: scale + shift from        │  │ |
| │  │  perturbation embedding applied to baseline state)         │  │ |
| │  └──────────────────────────────────┬─────────────────────────┘  │ |
| │                                     │                            │ |
| │  ┌──────────────────────────────────▼─────────────────────────┐  │ |
| │  │ Delta Decoder (SNNStack)                                   │  │ |
| │  │ z_cond ─► [1024] ─► [512] ─► [256] ─► [128] ─► Δ           │  │ |
| │  │                                         └─► σ² (UQ)        │  │ |
| │  │ Δ  ∈ R^G (per-gene logFC prediction)                       │  │ |
| │  │ σ² ∈ R^G (per-gene predicted variance)                     │  │ |
| │  └──────────────────────────────────┬─────────────────────────┘  │ |
| │                                     │                            │ |
| │                          ŷ_post = y_baseline + Δ                 │ |
| │                                                                  │ |
| └─────────────────────────────────────┬────────────────────────────┘ |
|                                       │ |
| ┌─────────────────────────────────────▼────────────────────────────┐ |
| │                           OUTPUT LAYER                           │ |
| │                                                                  │ |
| │  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────┐  │ |
| │  │ Δ per gene   │  │ σ² per gene  │  │ Pathway Enrichment     │  │ |
| │  │ (logFC)      │  │ (variance)   │  │ (GSEA on Δ)            │  │ |
| │  └──────────────┘  └──────────────┘  └────────────────────────┘  │ |
| │                                                                  │ |
| │  ┌────────────────────────────────────────────────────────────┐  │ |
| │  │ Mechanistic Attribution                                    │  │ |
| │  │ ┌─────────────────┐  ┌──────────────────────────────────┐  │  │ |
| │  │ │ Integrated      │  │ Perturbation-Conditioned         │  │  │ |
| │  │ │ Gradients on Δ  │  │ Knowledge Subgraph               │  │  │ |
| │  │ │ (gene→pathway   │  │ (AdaPert-style: GNN over         │  │  │ |
| │  │ │  attribution)   │  │  GO/Reactome/STRING + drug)      │  │  │ |
| │  │ └─────────────────┘  └──────────────────────────────────┘  │  │ |
| │  └────────────────────────────────────────────────────────────┘  │ |
| └──────────────────────────────────────────────────────────────────┘ |
| ``` |
|
|
| ### 3.3 Key Components |
|
|
| #### 3.3.1 PerturbationEncoder |
|
|
| Encodes both drug and genetic perturbations into a unified embedding: |
| - Drug: SMILES → MolFormer (frozen, 768-dim) or ChemBERTa-2 |
| - Genetic: gene symbol → Geneformer gene embedding (frozen) |
| - Fusion: learned convex combination `z_pert = α·z_drug + (1-α)·z_gene` with `α ∈ [0,1]` |
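
The fusion gate can be sketched in a few lines of NumPy. This is an illustrative sketch, not the MuLGIT implementation: it assumes both embeddings have already been projected to a shared 768-dim space, and that α is the sigmoid of a single learnable logit so that it stays in [0, 1]. The names `fuse_perturbation` and `alpha_logit` are hypothetical.

```python
import numpy as np


def sigmoid(x: float) -> float:
    """Logistic function; maps an unconstrained logit to (0, 1)."""
    return float(1.0 / (1.0 + np.exp(-x)))


def fuse_perturbation(z_drug: np.ndarray, z_gene: np.ndarray, alpha_logit: float):
    """Convex combination of drug and gene embeddings.

    ASSUMPTION: z_drug and z_gene share the same dimensionality, and alpha is
    parameterized as sigmoid(alpha_logit). Neither detail is fixed by the spec.
    """
    if z_drug.shape != z_gene.shape:
        raise ValueError("embeddings must be projected to a shared dimension first")
    alpha = sigmoid(alpha_logit)
    z_pert = alpha * z_drug + (1.0 - alpha) * z_gene
    return z_pert, alpha


rng = np.random.default_rng(0)
z_drug = rng.normal(size=768)  # stand-in for a frozen drug embedding
z_gene = rng.normal(size=768)  # stand-in for a projected gene embedding
z_pert, alpha = fuse_perturbation(z_drug, z_gene, alpha_logit=0.0)
```

With `alpha_logit=0.0` the gate weights both sources equally; training would move the logit toward whichever source is more informative.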
|
|
| #### 3.3.2 FiLM Conditioning Layer |
|
|
| Feature-wise Linear Modulation (FiLM) applies the perturbation to the baseline state as a learned scale and shift: |
| ``` |
| z_cond = γ(z_pert) ⊙ z_fused + β(z_pert) |
| ``` |
| This is more expressive than concatenation and better suited to conditioning a state representation. |
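
A minimal NumPy sketch of this modulation, assuming γ and β are each produced by one linear layer on `z_pert` (the spec does not fix their parameterization). The class name `FiLMSketch` and the choice to initialize the scale bias at 1 and the shift bias at 0 are assumptions for illustration.

```python
import numpy as np


class FiLMSketch:
    """Feature-wise Linear Modulation: z_cond = gamma(z_pert) * z_fused + beta(z_pert)."""

    def __init__(self, d_pert: int, d_state: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # ASSUMPTION: one linear map each for gamma and beta.
        self.W_gamma = rng.normal(scale=0.01, size=(d_pert, d_state))
        self.b_gamma = np.ones(d_state)   # scale starts near 1 (identity)
        self.W_beta = rng.normal(scale=0.01, size=(d_pert, d_state))
        self.b_beta = np.zeros(d_state)   # shift starts near 0

    def __call__(self, z_fused: np.ndarray, z_pert: np.ndarray) -> np.ndarray:
        gamma = z_pert @ self.W_gamma + self.b_gamma  # (batch, d_state)
        beta = z_pert @ self.W_beta + self.b_beta     # (batch, d_state)
        return gamma * z_fused + beta                 # element-wise modulation


film = FiLMSketch(d_pert=768, d_state=48)
z_fused = np.random.default_rng(1).normal(size=(4, 48))

# A zero perturbation embedding leaves the baseline state unchanged.
z_cond_null = film(z_fused, np.zeros((4, 768)))
# A non-zero embedding rescales and shifts every latent feature.
z_cond = film(z_fused, np.ones((4, 768)))
```

Initializing near the identity means an untrained layer passes the baseline state through unchanged, which is a convenient starting point when the target is a delta.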
|
|
| #### 3.3.3 DeltaDecoderWithUncertainty |
|
|
| Predicts per-gene logFC (Δ) and per-gene predicted variance (σ²): |
| - Shared SNNStack: z_cond → [1024] → [512] → [256] → [128] |
| - Δ head: Linear(128, n_genes) → predicted expression change |
| - σ² head: Linear(128, n_genes) → Softplus → predicted variance (>0) |
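
The two heads can be sketched as a NumPy forward pass. Several details here are assumptions not confirmed by the spec: `SNNStack` is taken to mean SELU-activated dense layers with LeCun-normal initialization, the input is the 48-dim conditioned state, and `n_genes=978` is only a placeholder output size. `DeltaDecoderSketch` is a hypothetical name.

```python
import numpy as np

SELU_ALPHA, SELU_SCALE = 1.6732632423543772, 1.0507009873554805


def selu(x: np.ndarray) -> np.ndarray:
    # ASSUMPTION: "SNN" is read as a self-normalizing network, hence SELU.
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * np.expm1(np.minimum(x, 0.0)))


def softplus(x: np.ndarray) -> np.ndarray:
    # log(1 + exp(x)), computed without overflow.
    return np.logaddexp(0.0, x)


class DeltaDecoderSketch:
    def __init__(self, d_in=48, hidden=(1024, 512, 256, 128), n_genes=978, seed=0):
        rng = np.random.default_rng(seed)
        dims = (d_in, *hidden)
        # LeCun-normal init (std = 1/sqrt(fan_in)), the usual pairing with SELU.
        self.layers = [
            (rng.normal(scale=fan_in ** -0.5, size=(fan_in, fan_out)), np.zeros(fan_out))
            for fan_in, fan_out in zip(dims[:-1], dims[1:])
        ]
        self.W_delta = rng.normal(scale=hidden[-1] ** -0.5, size=(hidden[-1], n_genes))
        self.W_var = rng.normal(scale=hidden[-1] ** -0.5, size=(hidden[-1], n_genes))

    def __call__(self, z_cond: np.ndarray):
        h = z_cond
        for W, b in self.layers:                 # shared trunk
            h = selu(h @ W + b)
        delta = h @ self.W_delta                 # per-gene logFC, unconstrained
        var = softplus(h @ self.W_var) + 1e-6    # per-gene variance, strictly positive
        return delta, var


decoder = DeltaDecoderSketch()
z_cond = np.random.default_rng(1).normal(size=(2, 48))
delta, var = decoder(z_cond)
```

The small constant added after the softplus is a numerical guard so the variance never reaches zero when it is later used as a divisor in the loss.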
| |
| #### 3.3.4 MuLGITPerturbEnsemble |
| |
| K=3-5 identical models trained with different seeds: |
| - Epistemic uncertainty = variance of Δ across ensemble members |
| - Aleatoric uncertainty = mean of σ² across ensemble members |
| - Total uncertainty = epistemic + aleatoric |
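
The decomposition follows directly from the ensemble outputs. A small NumPy sketch, assuming each of the K members returns a per-gene Δ and σ²; the function name `ensemble_uncertainty` is hypothetical, and using the population variance across members is one reasonable choice of estimator.

```python
import numpy as np


def ensemble_uncertainty(deltas: np.ndarray, variances: np.ndarray):
    """Combine K ensemble members into a mean prediction and uncertainty terms.

    deltas, variances: arrays of shape (K, G), one row per ensemble member.
    """
    mean_delta = deltas.mean(axis=0)    # ensemble prediction per gene
    epistemic = deltas.var(axis=0)      # disagreement between members
    aleatoric = variances.mean(axis=0)  # average predicted noise
    total = epistemic + aleatoric
    return mean_delta, epistemic, aleatoric, total


# Toy example: K=3 members, G=2 genes. Members disagree on gene 0 only.
deltas = np.array([[0.8, -0.2],
                   [1.0, -0.2],
                   [0.9, -0.2]])
variances = np.array([[0.10, 0.05],
                      [0.12, 0.05],
                      [0.14, 0.05]])

mean_delta, epistemic, aleatoric, total = ensemble_uncertainty(deltas, variances)
```

In the toy example the members agree exactly on the second gene, so its epistemic term is zero and its total uncertainty is purely aleatoric.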
| |
| --- |
| |
| ## 4. Data Strategy |
| |
| ### 4.1 Primary: Tahoe-100M |
| |
| **Status:** ✅ Ready on HF Hub |
| **Path:** `tahoebio/Tahoe-100M` |
| |
| | Property | Value | |
| |----------|-------| |
| | Rows | 95.6M drug-gene perturbation observations | |
| | Drugs | ~1,100 (with canonical SMILES, PubChem CID, MOA) | |
| | Cell lines | 50 cancer cell lines | |
| | Genes | 45,134 (full transcriptome) | |
| | Matched pre/post | Yes (vehicle controls per cell line) | |
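
Because each cell line has matched vehicle controls, a training target Δ can be derived by comparing treated and control profiles. The sketch below is one plausible way to do that, not a method prescribed by the dataset: it assumes normalized expression matrices, pseudobulk means, a log2 scale, and a pseudocount of 1. `log_fold_change` is a hypothetical helper.

```python
import numpy as np


def log_fold_change(treated: np.ndarray, control: np.ndarray,
                    pseudocount: float = 1.0) -> np.ndarray:
    """Per-gene log2 fold change of treated vs. vehicle-control expression.

    treated: (n_treated, G) and control: (n_control, G), both from the SAME
    cell line. ASSUMPTION: values are normalized, non-negative expression.
    """
    mean_treated = treated.mean(axis=0)   # pseudobulk over treated cells
    mean_control = control.mean(axis=0)   # pseudobulk over control cells
    return np.log2(mean_treated + pseudocount) - np.log2(mean_control + pseudocount)


# Toy example with two genes: gene 0 goes up, gene 1 goes down.
control = np.array([[3.0, 7.0],
                    [3.0, 7.0]])
treated = np.array([[7.0, 3.0],
                    [7.0, 3.0]])

delta_true = log_fold_change(treated, control)
```

Pairing treated and control cells within the same cell line keeps baseline differences between lines out of the target.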
| |
| ### 4.2 Supplementary Datasets |
| |
| | Dataset | Perturbation Type | Samples | HF Hub | Use | |
| |---------|-------------------|---------|--------|-----| |
| | GDSC | Drug (251) | 251K IC50 | ❌ | IC50 validation | |
| | DepMap | Genetic (CRISPR KO) | 1,000+ cell lines | ❌ | Genetic pert. validation | |
| | LINCS L1000 | Drug + genetic | 1.3M profiles | ❌ | Cross-dataset benchmark | |
| | Norman19 | Genetic combos | 91K cells | Partial | Combinatorial benchmark | |
| |
| ### 4.3 Data Loading for Training |
| |
| ```python |
| from datasets import load_dataset |
|
|
| # Drug metadata |
| drug_meta = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train") |
| |
| # Expression data (streaming for 95.6M rows) |
| expr_ds = load_dataset("tahoebio/Tahoe-100M", "expression_data", |
| split="train", streaming=True) |
| |
| # Each row: genes (int64[]), expressions (float32[]), drug, cell_line_id, |
| # canonical_smiles, pubchem_cid, moa-fine, sample |
| ``` |
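
Each streamed row stores expression sparsely as parallel `genes` and `expressions` arrays, so a decoder with a fixed-size output needs a dense vector. A minimal sketch, assuming the entries of `genes` can be used directly as 0-based column indices below `n_genes`; in practice a mapping through the dataset's gene metadata may be needed, which is not verified here. `densify` is a hypothetical helper.

```python
import numpy as np


def densify(genes, expressions, n_genes: int) -> np.ndarray:
    """Scatter a sparse (genes, expressions) row into a dense float32 vector.

    ASSUMPTION: gene entries are 0-based indices into a vocabulary of size
    n_genes. Out-of-range entries (e.g. special tokens) are skipped.
    """
    genes = np.asarray(genes, dtype=np.int64)
    expressions = np.asarray(expressions, dtype=np.float32)
    if genes.shape != expressions.shape:
        raise ValueError("genes and expressions must be parallel arrays")
    dense = np.zeros(n_genes, dtype=np.float32)
    valid = (genes >= 0) & (genes < n_genes)
    dense[genes[valid]] = expressions[valid]
    return dense


# Toy row in the same layout as the streamed dataset rows.
row = {"genes": [2, 5, 7], "expressions": [1.0, 3.0, 2.0]}
x = densify(row["genes"], row["expressions"], n_genes=10)
```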
| |
| --- |
| |
| ## 5. Training Objective |
| |
| ### 5.1 Primary Loss: Negative Log-Likelihood with Uncertainty |
| |
| ``` |
| L_NLL = 0.5 · Σ_g [ log(σ²_g) + (Δ_true,g − Δ_pred,g)² / σ²_g ] |
| ``` |
| Heteroscedastic regression loss (Kendall & Gal, NeurIPS 2017). When σ²_g is large, the model is uncertain and the squared-error term is down-weighted, while the log(σ²_g) term penalizes inflating the variance everywhere. |
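
A NumPy sketch of this loss, assuming inputs of shape `(batch, genes)`, a sum over genes as in the formula, and a mean over the batch (the batch reduction and the `eps` floor are assumptions). The toy values show the trade-off between the two terms: for a fixed squared error of 0.25, the loss is lowest when the predicted variance matches that error.

```python
import numpy as np


def gaussian_nll(delta_true: np.ndarray, delta_pred: np.ndarray,
                 var_pred: np.ndarray, eps: float = 1e-6) -> float:
    """Heteroscedastic Gaussian negative log-likelihood (constant term dropped)."""
    var = np.maximum(var_pred, eps)  # guard against division by zero
    per_gene = 0.5 * (np.log(var) + (delta_true - delta_pred) ** 2 / var)
    return float(per_gene.sum(axis=-1).mean())  # sum over genes, mean over batch


delta_true = np.array([[1.0, 0.0]])
delta_pred = np.array([[0.5, 0.0]])  # squared error on gene 0 is 0.25

nll_matched = gaussian_nll(delta_true, delta_pred, np.array([[0.25, 1.0]]))
nll_unit = gaussian_nll(delta_true, delta_pred, np.array([[1.0, 1.0]]))
nll_wide = gaussian_nll(delta_true, delta_pred, np.array([[4.0, 1.0]]))
```

Here `nll_matched < nll_unit < nll_wide`: predicting too much variance is penalized through the log term even though it shrinks the error term.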
| |
| ### 5.2 Auxiliary Losses |
| |
| 1. **Pathway consistency:** `L_path = MSE(GSEA(Δ_pred), GSEA(Δ_true))`; pathways should shift coherently |
| 2. **Sparsity:** `L_sparse = ||Δ_pred||₁ / (B·G)`, with B the batch size and G the number of genes; only genes that actually respond should shift |
| 3. **Risk shift (optional):** `L_risk = MSE(risk_post − risk_baseline, risk_shift_target)`; ties back to survival |
| |
| ### 5.3 Hyperparameters |
| |
| | Parameter | Value | Source | |
| |-----------|-------|--------| |
| | Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) | Lingshu-Cell | |
| | LR schedule | 2e-4 → 2e-5, cosine | Lingshu-Cell | |
| | Batch size | 256 | Lingshu-Cell | |
| | Mixed precision | bf16 | Lingshu-Cell | |
| | MuLGIT backbone | Frozen | Transfer from TCGA | |
| | Ensemble size | K=3 models | Lakshminarayanan 2017 | |
| | Epochs | 50-100 (early stopping on validation NLL) | Standard | |
| |
| --- |
| |
| ## 6. Benchmark Tasks & Leakage Controls |
| |
| ### 6.1 Task Suite |
| |
| | Task | Data | Split | Key Metric | |
| |------|------|-------|------------| |
| | Drug response prediction | Tahoe-100M | Leave-drug-out (20%) | DES@50, Pearson-Δ | |
| | Genetic perturbation | DepMap/Norman19 | Leave-gene-out (20%) | DES@50, Direction-match | |
| | Combinatorial pert. | Norman19 | Leave-combo-out (70%) | PDS, compositionality | |
| | Drug sensitivity (IC50) | GDSC | Leave-cell-line-out | Pearson r, Spearman ρ | |
| | Pathway rescue | Tahoe-100M (aging) | Leave-pathway-out | AUC of rescue | |
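
For the leave-drug-out task, leakage is avoided only if the split is made over drug identities rather than over rows. A minimal standard-library sketch, assuming one drug label per row; `leave_drug_out_split` and its seed handling are illustrative choices, not the project's actual splitter.

```python
import random


def leave_drug_out_split(drugs, held_out_frac=0.2, seed=0):
    """Split row indices so that held-out drugs never appear in training.

    drugs: sequence with one drug identifier per row.
    Returns (train_idx, test_idx, test_drugs).
    """
    unique = sorted(set(drugs))        # sort first so the split is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(held_out_frac * len(unique)))
    test_drugs = set(unique[:n_test])
    train_idx = [i for i, d in enumerate(drugs) if d not in test_drugs]
    test_idx = [i for i, d in enumerate(drugs) if d in test_drugs]
    return train_idx, test_idx, test_drugs


# Toy example: 10 drugs with 3 observations each.
drugs = [f"drug_{i}" for i in range(10) for _ in range(3)]
train_idx, test_idx, test_drugs = leave_drug_out_split(drugs)

train_drugs = {drugs[i] for i in train_idx}
```

The same pattern applies to the leave-gene-out and leave-cell-line-out rows by swapping the grouping key.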
| |
| ### 6.2 Metrics |
| |
| | Metric | What It Measures | Range | Target | |
| |--------|-----------------|-------|--------| |
| | **DES@K** | Fraction of true DEGs in top-K predicted | [0,1] | Higher | |
| | **Pearson-Δ** | Correlation of predicted vs true Δ | [-1,1] | Higher | |
| | **Direction-match** | Fraction of genes with correct sign | [0,1] | Higher | |
| | **PDS** | Discrimination between perturbations | [0,1] | Higher | |
| | **MMD** | Distributional fidelity in PCA space | [0,∞) | Lower | |
| | **RMSE** | Raw expression reconstruction error | [0,∞) | Lower | |
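
Three of these metrics are simple enough to sketch directly. The definitions below are assumptions where the table is ambiguous: DES@K is computed as the fraction of the true DEG set recovered among the K genes with the largest predicted |Δ|, and direction-match ignores genes whose true Δ is exactly zero. Function names are hypothetical.

```python
import numpy as np


def des_at_k(delta_pred: np.ndarray, true_degs: set, k: int) -> float:
    """Fraction of true DEGs found among the top-k genes by predicted |delta|."""
    top_k = np.argsort(-np.abs(delta_pred))[:k]
    hits = len(set(top_k.tolist()) & set(true_degs))
    return hits / max(len(true_degs), 1)


def pearson_delta(delta_pred: np.ndarray, delta_true: np.ndarray) -> float:
    """Pearson correlation between predicted and true per-gene deltas."""
    return float(np.corrcoef(delta_pred, delta_true)[0, 1])


def direction_match(delta_pred: np.ndarray, delta_true: np.ndarray) -> float:
    """Fraction of genes (with non-zero true delta) whose predicted sign is correct."""
    mask = delta_true != 0
    return float(np.mean(np.sign(delta_pred[mask]) == np.sign(delta_true[mask])))


# Toy example with five genes.
delta_true = np.array([1.0, -0.5, 0.0, 0.2, -2.0])
delta_pred = np.array([0.8, 0.1, 0.05, 0.3, -1.5])

# ASSUMPTION: a gene counts as a true DEG when |true delta| >= 1.
true_degs = set(np.flatnonzero(np.abs(delta_true) >= 1.0).tolist())

des = des_at_k(delta_pred, true_degs, k=2)
r = pearson_delta(delta_pred, delta_true)
dm = direction_match(delta_pred, delta_true)
```

In this toy case both true DEGs rank in the top two predictions, while one of the four genes with a non-zero true Δ has the wrong predicted sign.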
| |
| --- |
| |
| ## 7. Comparison to Existing Baselines |
| |
| | Method | Gene Coverage | Drug Input | UQ | Attribution | HF-Ready | |
| |--------|--------------|------------|-----|-------------|----------| |
| | CPA | HVG (1-4K) | One-hot ID | Latent sampling | ❌ | ❌ | |
| | GEARS | All genes | One-hot ID | ❌ | Gene graph edges | ❌ | |
| | Lingshu-Cell | All 18K | One-hot ID | ❌ | ❌ | ❌ | |
| | AdaPert | HVG only | Semantic embed | ❌ | ✅ KG subgraph | ❌ | |
| | **MuLGIT-Perturb** | **All 45K+** | **SMILES→MolFormer** | **✅ Ensemble** | **✅ IG + KG** | **✅** | |
| |
| --- |
| |
| ## 8. Implementation Plan |
| |
| ### 8.1 New Files |
| |
| ``` |
| mulgit/ |
| ├── perturb/ |
| │   ├── __init__.py |
| │   ├── encoder.py       # PerturbationEncoder (drug + gene) |
| │   ├── conditioning.py  # FiLMConditioning |
| │   ├── decoder.py       # DeltaDecoderWithUncertainty |
| │   ├── model.py         # MuLGITPerturb |
| │   ├── ensemble.py      # MuLGITPerturbEnsemble |
| │   ├── attribution.py   # Integrated Gradients + KG subgraph |
| │   ├── losses.py        # NLL + pathway consistency + sparsity |
| │   ├── data.py          # Tahoe-100M perturbation data loader |
| │   ├── trainer.py       # Training loop with Trackio |
| │   ├── evaluate.py      # DES@K, PDS, Pearson-Δ, etc. |
| │   └── config.py        # MuLGITPerturbConfig |
| ``` |
| |
| ### 8.2 Five-Phase Execution |
| |
| | Phase | Duration | Deliverable | |
| |-------|----------|-------------| |
| | 1. Tahoe-100M drug response | 2 weeks | Trained model + all benchmarks on leave-drug-out | |
| | 2. Cross-dataset generalization | 1 week | Zero-shot evaluation on LINCS, fine-tune if needed | |
| | 3. Genetic perturbations | 1 week | DepMap KO prediction, compare to Geneformer | |
| | 4. Combinatorial perturbation | 1 week | Norman19 combos, PDS evaluation | |
| | 5. Longevity-specific | 1 week | Aging pathway rescue, drug ranking vs static screener | |
| |
| **Total: 6 weeks from start to full validation.** |
| |
| --- |
| |
| ## 9. Example Output |
| |
| ```json |
| { |
| "perturbation": "Rapamycin", |
| "cell_line": "A549", |
| "baseline_risk": 0.342, |
| "predicted_post_risk": -0.187, |
| "risk_shift": -0.529, |
| |
| "top_degs": [ |
| {"gene": "MTOR", "logFC": -0.23, "ci_95": [-0.45, -0.01]}, |
| {"gene": "FOXO3", "logFC": 0.87, "ci_95": [0.52, 1.22]}, |
| {"gene": "ATG5", "logFC": 1.34, "ci_95": [0.98, 1.70]} |
| ], |
| |
| "enriched_pathways": [ |
| {"pathway": "Autophagy", "NES": 2.1, "pval": 0.0003}, |
| {"pathway": "mTOR signaling", "NES": -1.8, "pval": 0.001} |
| ], |
| |
| "aging_signature_reversal": 0.72, |
| "epistemic_uncertainty": 0.08, |
| "aleatoric_uncertainty": 0.15 |
| } |
| ``` |
| |
| --- |
|
|
| ## 10. Risk Assessment |
|
|
| | Risk | Mitigation | |
| |------|------------| |
| | Tahoe-100M streaming slow (95.6M rows) | Local cache after first pass; pre-filter to 10M most informative rows | |
| | MolFormer fails for some SMILES | Fallback to Morgan fingerprints (RDKit already in deps) | |
| | MuLGIT backbone overfits to TCGA | Include tissue type as conditioning; evaluate zero-shot on LINCS | |
| | GPU memory for 45K-gene output | Start with L1000 landmarks (978); scale after validation | |
| | Combinatorial too hard | Treat Phase 4 as a stretch goal; single-perturbation prediction alone is impactful | |
|
|
| --- |
|
|
| ## References |
|
|
| 1. Lotfollahi et al., "CPA," arXiv:2011.03086 (2021) |
| 2. Roohani et al., "GEARS," Nature Biotech 2023. arXiv:2205.13986 |
| 3. Chen et al., "Lingshu-Cell," arXiv:2603.25240 (2026) |
| 4. Li et al., "AdaPert," arXiv:2602.18885 (2026) |
| 5. He et al., "PerturBench," arXiv:2408.10609 (2024) |
| 6. El Nahhas et al., "SeNMo," arXiv:2405.08226 (2024) |
| 7. Ross et al., "MolFormer," Nature Machine Intelligence 2022 |
| 8. Theodoris et al., "Geneformer," Nature 2023 |
| 9. Perez et al., "FiLM," AAAI 2018 |
| 10. Kendall & Gal, "Bayesian UQ," NeurIPS 2017 |
| 11. Lakshminarayanan et al., "Deep Ensembles," NeurIPS 2017 |
| 12. Mendez-Lucio et al., "Tahoe-100M," bioRxiv 2024 |
|
|