# MuLGIT-Perturb: Perturbation Layer Extension
## From Static Biomarker Ranking to Counterfactual Molecular State Prediction

**Status:** Design Specification (Study Plan)

**Repository:** https://huggingface.co/vedatonuryilmaz/MuLGIT

**Date:** 2026-05-09

---

## 1. Motivation: What Static Models Miss

MuLGIT currently identifies candidate **causal features** that correlate with survival/longevity and **ranks drugs** by their static properties (MOA, clinical status, drug-likeness). But correlation is not causation; a drug ranked highly by MOA relevance may actually **worsen** the molecular state of a given patient. What's missing is the ability to answer:

> **"What will this perturbation do to this patient's molecular state?"**

Moving beyond static ranking to counterfactual prediction transforms MuLGIT from a descriptive tool into a prescriptive one, useful for drug repurposing, mechanism inference, target validation, and personalized treatment-response prioritization.

---

## 2. Capability Specification

### Input

1. **Baseline omics state:** the patient/cell's molecular profile before perturbation:
   - DNA: methylation β-values, CNV copy numbers
   - RNA: mRNA expression, miRNA expression
   - (Future) Chromatin: WGBS, ATAC-seq GeneActivity
2. **Perturbation descriptor:** what is being applied:
   - Drug: SMILES string → molecular fingerprint or ChemBERTa embedding
   - Genetic perturbation: gene ID + perturbation type (CRISPR KO, CRISPRi, overexpression (OE), shRNA)
   - Drug combination: set of drug fingerprints + dose ratios

### Output

1. **Predicted molecular state shift (Δ):** per-gene or per-pathway expression change:
   - `ŷ_post = f(baseline, perturbation)` → predicted post-perturbation expression
   - `Δ = ŷ_post - baseline` → direction and magnitude of change per gene
2. **Uncertainty quantification:** per-gene prediction intervals:
   - Epistemic uncertainty via deep ensembles (3-5 models with different seeds)
   - Aleatoric uncertainty via predicted variance (heteroscedastic output head)
3. **Mechanistic attribution:** why this perturbation causes this shift:
   - Gene set activation scores (MSigDB pathways, Reactome, KEGG)
   - Top differentially expressed genes (DEGs) with effect sizes
   - Perturbation-conditioned subgraph of activated molecular interactions (from biological knowledge graphs)
   - Counterfactual decomposition: "If we blocked pathway X, the effect on gene Y would change by Z"

---

## 3. Architecture: MuLGIT-Perturb

### 3.1 Design Principles

1. **Reuse MuLGIT encoders.** The CentralDogmaFusion SNN encoders for methylation, CNV, mRNA, and miRNA are already trained on TCGA survival prediction. These become **frozen feature extractors** for the perturbation layer, providing biologically grounded latent representations.
2. **Perturbation-conditioned residual prediction.** The model predicts the **difference** between pre- and post-perturbation molecular states, not the absolute state. This is the standard in the field (CPA, GEARS, Lingshu-Cell all use delta prediction).
3. **Central dogma as structural prior.** The perturbation flows through the same biological layers: drug → chromatin state → DNA methylation → mRNA expression → miRNA → predicted phenotype shift.
4. **Uncertainty via deep ensembles.** Train 3-5 identical-architecture models with different random seeds. The variance of predictions across ensemble members gives the epistemic uncertainty. Deep ensembles are a simple, well-validated baseline for predictive uncertainty (Lakshminarayanan et al., NeurIPS 2017).
5. **Chemical structure encoding.** Use MolFormer or ChemBERTa-2 (pretrained, frozen) to encode drug SMILES into fixed-size embeddings, replacing the simple 1D-CNN in the current `drug_target.py`.

### 3.2 Architecture Diagram

```
                    ┌──────────────────────────────────┐
                    │       PERTURBATION ENCODER       │
                    │  ┌──────────┐  ┌─────────────┐   │
Drug SMILES ───────►│  │ MolFormer│  │ Geneformer  │   │◄── Genetic Pert.
                    │  │ (frozen) │  │  (frozen)   │   │
                    │  └────┬─────┘  └──────┬──────┘   │
                    │       │               │          │
                    │  ┌────▼───────────────▼───────┐  │
                    │  │ Fusion Gate (learned α)    │  │
                    │  │ z_pert = α·z_drug +        │  │
                    │  │          (1-α)·z_gene      │  │
                    │  └─────────────┬──────────────┘  │
                    └────────────────┼─────────────────┘
                                     │ z_pert ∈ R^768
                                     │
┌────────────────────────────────────┼─────────────────────────────┐
│ MULGIT (frozen feature extractor)  │                             │
│ BASELINE STATE                                                   │
│   Methylation ─► SNN Encoder ─► z_meth                           │
│   CNV         ─► SNN Encoder ─► z_cnv                            │
│   mRNA        ─► SNN Encoder ─► z_mrna                           │
│   miRNA       ─► SNN Encoder ─► z_mirna                          │
│                                                                  │
│ CentralDogmaFusion                                               │
│   z_dna     = [z_meth, z_cnv]                                    │
│   z_dna_rna = DNA→RNA(z_dna, z_mrna)                             │
│   z_fused   = FullFusion(z_dna_rna, z_mrna, z_mirna)             │
│                    │                                             │
└────────────────────┼─────────────────────────────────────────────┘
                     │ z_fused ∈ R^48 (baseline latent state)
                     │
┌────────────────────▼─────────────────────────────────────────────┐
│ PERTURBATION RESPONSE PREDICTOR (trained)                        │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │ Condition Fusion                                         │    │
│  │ z_cond = FiLM(z_fused, z_pert)                           │    │
│  │ (Feature-wise Linear Modulation: scale + shift from      │    │
│  │  perturbation embedding applied to baseline state)       │    │
│  └────────────────────────────┬─────────────────────────────┘    │
│                               │                                  │
│  ┌────────────────────────────▼─────────────────────────────┐    │
│  │ Delta Decoder (SNNStack)                                 │    │
│  │ z_cond ─► [1024] ─► [512] ─► [256] ─► [128] ─► Δ         │    │
│  │                                            └─► σ² (UQ)   │    │
│  │ Δ  ∈ R^G (per-gene logFC prediction)                     │    │
│  │ σ² ∈ R^G (per-gene predicted variance)                   │    │
│  └────────────────────────────┬─────────────────────────────┘    │
│                               │                                  │
│                    ŷ_post = y_baseline + Δ                       │
│                                                                  │
└───────────────────────────────┬──────────────────────────────────┘
                                │
┌───────────────────────────────▼──────────────────────────────────┐
│ OUTPUT LAYER                                                     │
│                                                                  │
│ ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐ │
│ │  Δ per gene  │  │ σ² per gene  │  │    Pathway Enrichment    │ │
│ │   (logFC)    │  │  (variance)  │  │       (GSEA on Δ)        │ │
│ └──────────────┘  └──────────────┘  └──────────────────────────┘ │
│                                                                  │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Mechanistic Attribution                                      │ │
│ │ ┌─────────────────┐  ┌──────────────────────────────────┐    │ │
│ │ │ Integrated      │  │ Perturbation-Conditioned         │    │ │
│ │ │ Gradients on Δ  │  │ Knowledge Subgraph               │    │ │
│ │ │ (gene→pathway   │  │ (AdaPert-style: GNN over         │    │ │
│ │ │  attribution)   │  │  GO/Reactome/STRING + drug)      │    │ │
│ │ └─────────────────┘  └──────────────────────────────────┘    │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
```

### 3.3 Key Components

#### 3.3.1 PerturbationEncoder

Encodes both drug and genetic perturbations into a unified embedding:

- Drug: SMILES → MolFormer (frozen, 768-dim) or ChemBERTa-2
- Genetic: Gene symbol → Geneformer gene embedding (frozen)
- Fusion: Learned convex combination `z_pert = α·z_drug + (1-α)·z_gene` with `α ∈ [0,1]`

#### 3.3.2 FiLM Conditioning Layer

Feature-wise Linear Modulation applies the perturbation as a scale and shift to the baseline state:

```
z_cond = γ(z_pert) ⊙ z_fused + β(z_pert)
```

This is more expressive than concatenation and better suited for conditioning a state representation.

#### 3.3.3 DeltaDecoderWithUncertainty

Predicts per-gene logFC (Δ) and per-gene prediction variance (σ²):

- Shared SNNStack: z_cond → [1024] → [512] → [256] → [128]
- Δ head: Linear(128, n_genes) → predicted expression change
- σ² head: Linear(128, n_genes) → Softplus → predicted variance (>0)

#### 3.3.4 MuLGITPerturbEnsemble

K=3-5 identical models trained with different seeds:

- Epistemic uncertainty = variance of Δ across ensemble
- Aleatoric uncertainty = mean σ² from each model
- Total uncertainty = epistemic + aleatoric

---

## 4. Data Strategy

### 4.1 Primary: Tahoe-100M

**Status:** ✅ Ready on HF Hub

**Path:** `tahoebio/Tahoe-100M`

| Property | Value |
|----------|-------|
| Rows | 95.6M drug-gene perturbation observations |
| Drugs | ~1,100 (with canonical SMILES, PubChem CID, MOA) |
| Cell lines | 50 cancer cell lines |
| Genes | 45,134 (full transcriptome) |
| Matched pre/post | Yes (vehicle controls per cell line) |

### 4.2 Supplementary Datasets

| Dataset | Perturbation Type | Samples | HF Hub | Use |
|---------|-------------------|---------|--------|-----|
| GDSC | Drug (251) | 251K IC50 | ❌ | IC50 validation |
| DepMap | Genetic (CRISPR KO) | 1,000+ cell lines | ❌ | Genetic pert. validation |
| LINCS L1000 | Drug + genetic | 1.3M profiles | ❌ | Cross-dataset benchmark |
| Norman19 | Genetic combos | 91K cells | Partial | Combinatorial benchmark |

### 4.3 Data Loading for Training

```python
from datasets import load_dataset

# Drug metadata
drug_meta = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train")

# Expression data (streaming for 95.6M rows)
expr_ds = load_dataset("tahoebio/Tahoe-100M", "expression_data",
                       split="train", streaming=True)

# Each row: genes (int64[]), expressions (float32[]), drug, cell_line_id,
#           canonical_smiles, pubchem_cid, moa-fine, sample
```

---

## 5. Training Objective

### 5.1 Primary Loss: Negative Log-Likelihood with Uncertainty

```
L_NLL = 0.5 · Σ_g [ log(σ²_g) + (Δ_true,g − Δ_pred,g)² / σ²_g ]
```

This is the heteroscedastic regression loss (Kendall & Gal, NeurIPS 2017). When σ²_g is large, the model is signaling uncertainty for gene g, so that gene's squared-error term is down-weighted; the log(σ²_g) term in turn penalizes inflating the variance everywhere.

### 5.2 Auxiliary Losses

1. **Pathway consistency:** `L_path = MSE(GSEA(Δ_pred), GSEA(Δ_true))` (pathways should shift coherently)
2. **Sparsity:** `L_sparse = ||Δ_pred||₁ / (B·G)` (only genes that actually respond should shift)
3. **Risk shift (optional):** `L_risk = MSE(risk_post − risk_baseline, risk_shift_target)` (ties the predicted shift back to survival)

### 5.3 Hyperparameters

| Parameter | Value | Source |
|-----------|-------|--------|
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) | Lingshu-Cell |
| LR schedule | 2e-4 → 2e-5, cosine | Lingshu-Cell |
| Batch size | 256 | Lingshu-Cell |
| Mixed precision | bf16 | Lingshu-Cell |
| MuLGIT backbone | Frozen | Transfer from TCGA |
| Ensemble size | K=3 models | Lakshminarayanan 2017 |
| Epochs | 50-100 (early stopping on validation NLL) | Standard |

---

## 6. Benchmark Tasks & Leakage Controls

### 6.1 Task Suite

| Task | Data | Split | Key Metric |
|------|------|-------|------------|
| Drug response prediction | Tahoe-100M | Leave-drug-out (20%) | DES@50, Pearson-Δ |
| Genetic perturbation | DepMap/Norman19 | Leave-gene-out (20%) | DES@50, Direction-match |
| Combinatorial pert. | Norman19 | Leave-combo-out (70%) | PDS, compositionality |
| Drug sensitivity (IC50) | GDSC | Leave-cell-line-out | Pearson r, Spearman ρ |
| Pathway rescue | Tahoe-100M (aging) | Leave-pathway-out | AUC of rescue |

### 6.2 Metrics

| Metric | What It Measures | Range | Target |
|--------|-----------------|-------|--------|
| **DES@K** | Fraction of true DEGs in top-K predicted | [0,1] | Higher |
| **Pearson-Δ** | Correlation of predicted vs true Δ | [-1,1] | Higher |
| **Direction-match** | Fraction of genes with correct sign | [0,1] | Higher |
| **PDS** | Discrimination between perturbations | [0,1] | Higher |
| **MMD** | Distributional fidelity in PCA space | [0,∞) | Lower |
| **RMSE** | Raw expression reconstruction error | [0,∞) | Lower |

---

## 7. Comparison to Existing Baselines

| Method | Gene Coverage | Drug Input | UQ | Attribution | HF-Ready |
|--------|--------------|------------|-----|-------------|----------|
| CPA | HVG (1-4K) | One-hot ID | Latent sampling | ❌ | ❌ |
| GEARS | All genes | One-hot ID | ❌ | Gene graph edges | ❌ |
| Lingshu-Cell | All 18K | One-hot ID | ❌ | ❌ | ❌ |
| AdaPert | HVG only | Semantic embed | ❌ | ✅ KG subgraph | ❌ |
| **MuLGIT-Perturb** | **All 45K+** | **SMILES→MolFormer** | **✅ Ensemble** | **✅ IG + KG** | **✅** |

---

## 8. Implementation Plan

### 8.1 New Files

```
mulgit/
├── perturb/
│   ├── __init__.py
│   ├── encoder.py          # PerturbationEncoder (drug + gene)
│   ├── conditioning.py     # FiLMConditioning
│   ├── decoder.py          # DeltaDecoderWithUncertainty
│   ├── model.py            # MuLGITPerturb
│   ├── ensemble.py         # MuLGITPerturbEnsemble
│   ├── attribution.py      # Integrated Gradients + KG subgraph
│   ├── losses.py           # NLL + pathway consistency + sparsity
│   ├── data.py             # Tahoe-100M perturbation data loader
│   ├── trainer.py          # Training loop with Trackio
│   ├── evaluate.py         # DES@K, PDS, Pearson-Δ, etc.
│   └── config.py           # MuLGITPerturbConfig
```

### 8.2 Five-Phase Execution

| Phase | Duration | Deliverable |
|-------|----------|-------------|
| 1. Tahoe-100M drug response | 2 weeks | Trained model + all benchmarks on leave-drug-out |
| 2. Cross-dataset generalization | 1 week | Zero-shot evaluation on LINCS, fine-tune if needed |
| 3. Genetic perturbations | 1 week | DepMap KO prediction, compare to Geneformer |
| 4. Combinatorial perturbation | 1 week | Norman19 combos, PDS evaluation |
| 5. Longevity-specific | 1 week | Aging pathway rescue, drug ranking vs static screener |

**Total: 6 weeks from start to full validation.**

---

## 9. Example Output

```json
{
  "perturbation": "Rapamycin",
  "cell_line": "A549",
  "baseline_risk": 0.342,
  "predicted_post_risk": -0.187,
  "risk_shift": -0.529,
  "top_degs": [
    {"gene": "MTOR", "logFC": -0.23, "ci_95": [-0.45, -0.01]},
    {"gene": "FOXO3", "logFC": 0.87, "ci_95": [0.52, 1.22]},
    {"gene": "ATG5", "logFC": 1.34, "ci_95": [0.98, 1.70]}
  ],
  "enriched_pathways": [
    {"pathway": "Autophagy", "NES": 2.1, "pval": 0.0003},
    {"pathway": "mTOR signaling", "NES": -1.8, "pval": 0.001}
  ],
  "aging_signature_reversal": 0.72,
  "epistemic_uncertainty": 0.08,
  "aleatoric_uncertainty": 0.15
}
```

---

## 10. Risk Assessment

| Risk | Mitigation |
|------|------------|
| Tahoe-100M streaming slow (95.6M rows) | Local cache after first pass; pre-filter to 10M most informative rows |
| MolFormer fails for some SMILES | Fallback to Morgan fingerprints (RDKit already in deps) |
| MuLGIT backbone overfits to TCGA | Include tissue type as conditioning; evaluate zero-shot on LINCS |
| GPU memory for 45K-gene output | Start with L1000 landmarks (978); scale after validation |
| Combinatorial too hard | Treat as a stretch goal; single-perturbation prediction alone is impactful |

---

## References

1. Lotfollahi et al., "CPA," arXiv:2011.03086 (2021)
2. Roohani et al., "GEARS," Nature Biotech 2023. arXiv:2205.13986
3. Chen et al., "Lingshu-Cell," arXiv:2603.25240 (2025)
4. Li et al., "AdaPert," arXiv:2602.18885 (2025)
5. He et al., "PerturBench," arXiv:2408.10609 (2024)
6. El Nahhas et al., "SeNMo," arXiv:2405.08226 (2024)
7. Ross et al., "MolFormer," Nature Machine Intelligence 2022
8. Theodoris et al., "Geneformer," Nature 2023
9. Perez et al., "FiLM," AAAI 2018
10. Kendall & Gal, "Bayesian UQ," NeurIPS 2017
11. Lakshminarayanan et al., "Deep Ensembles," NeurIPS 2017
12. Mendez-Lucio et al., "Tahoe-100M," bioRxiv 2024
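
## Appendix: FiLM Conditioning Sketch

Section 3.3.2 describes FiLM as a scale and shift of the baseline latent, computed from the perturbation embedding. The following is a minimal, dependency-free sketch of that forward pass, together with the Softplus used by the variance head in Section 3.3.3. The helper names (`linear`, `film`, `softplus`), the choice of a single linear map each for γ and β, and the toy dimensions are assumptions made for illustration; they are not taken from the MuLGIT codebase, where this would presumably be a PyTorch module operating on 48-dim and 768-dim tensors.

```python
import math


def linear(weights, bias, x):
    """Dense layer y = W x + b on plain Python lists (W is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(z_fused, z_pert, w_gamma, b_gamma, w_beta, b_beta):
    """z_cond = gamma(z_pert) * z_fused + beta(z_pert), element-wise.

    Assumption: gamma and beta are each one linear map from the
    perturbation embedding to the dimension of z_fused.
    """
    gamma = linear(w_gamma, b_gamma, z_pert)
    beta = linear(w_beta, b_beta, z_pert)
    if not (len(gamma) == len(beta) == len(z_fused)):
        raise ValueError("gamma, beta and z_fused must share one dimension")
    return [g * z + b for g, z, b in zip(gamma, z_fused, beta)]


def softplus(x):
    """Numerically stable log(1 + exp(x)).

    Positive in exact arithmetic; may underflow to 0.0 for very
    negative inputs in floating point.
    """
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Toy example: 2-dim baseline latent, 3-dim perturbation embedding.
z_fused = [1.0, 2.0]
z_pert = [1.0, 0.0, -1.0]
w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
w_beta = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.0]]
z_cond = film(z_fused, z_pert, w_gamma, [0.0, 0.0], w_beta, [0.1, 0.0])
print(z_cond)
```

With γ fixed to 1 and β to 0 the layer reduces to the identity, which is one reason FiLM is a convenient way to condition a frozen representation: an uninformative perturbation embedding can leave the baseline state untouched.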
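
## Appendix: Heteroscedastic NLL and Ensemble Uncertainty Sketch

The loss in Section 5.1 and the uncertainty decomposition in Section 3.3.4 can be written out directly. The sketch below evaluates `L_NLL` for one sample and combines K ensemble members into per-gene epistemic, aleatoric, and total uncertainty. The function names and the use of the population variance (dividing by K rather than K-1) for the epistemic term are assumptions; the specification only says "variance of Δ across ensemble".

```python
import math
from statistics import fmean, pvariance


def gaussian_nll(delta_true, delta_pred, var_pred):
    """0.5 * sum_g [ log(var_g) + (true_g - pred_g)^2 / var_g ] for one sample."""
    if any(v <= 0.0 for v in var_pred):
        raise ValueError("predicted variances must be strictly positive")
    return 0.5 * sum(
        math.log(v) + (t - p) ** 2 / v
        for t, p, v in zip(delta_true, delta_pred, var_pred)
    )


def ensemble_uncertainty(member_deltas, member_vars):
    """Combine K members, each giving per-gene delta and variance predictions.

    member_deltas[k][g] is the predicted delta of member k for gene g;
    member_vars[k][g] is that member's predicted variance.
    Returns (mean_delta, epistemic, aleatoric, total), each a per-gene list.
    """
    n_genes = len(member_deltas[0])
    mean_delta, epistemic, aleatoric = [], [], []
    for g in range(n_genes):
        preds = [member[g] for member in member_deltas]
        mean_delta.append(fmean(preds))
        # Epistemic: disagreement between members (population variance,
        # an assumption of this sketch).
        epistemic.append(pvariance(preds))
        # Aleatoric: average of the members' own predicted variances.
        aleatoric.append(fmean([member[g] for member in member_vars]))
    total = [e + a for e, a in zip(epistemic, aleatoric)]
    return mean_delta, epistemic, aleatoric, total


# Toy example: K=3 members, 2 genes.
deltas = [[0.5, -1.0], [0.7, -1.0], [0.6, -1.0]]
variances = [[0.1, 0.2], [0.1, 0.2], [0.1, 0.2]]
mean_delta, epi, ale, tot = ensemble_uncertainty(deltas, variances)
print(mean_delta, epi, ale, tot)
```

In the toy example the members agree exactly on the second gene, so its epistemic term is zero and the total uncertainty is purely aleatoric; for the first gene the members disagree slightly, which adds a small epistemic contribution on top of the predicted variance.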
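
## Appendix: Evaluation Metric Sketch

Three of the metrics in Section 6.2 (DES@K, Direction-match, Pearson-Δ) are simple enough to sketch on per-gene Δ vectors. The exact definitions are not fixed by this specification, so the choices below are assumptions: "true DEGs" are taken to be the K genes with the largest absolute true Δ, DES@K is the overlap of that set with the K genes ranked highest by absolute predicted Δ, and Direction-match ignores genes whose true Δ is within `eps` of zero. All function names are hypothetical.

```python
import math


def top_k_genes(delta, k):
    """Indices of the k genes with the largest absolute change."""
    order = sorted(range(len(delta)), key=lambda g: abs(delta[g]), reverse=True)
    return set(order[:k])


def des_at_k(delta_true, delta_pred, k):
    """Fraction of the top-k true DEGs recovered among the top-k predicted."""
    return len(top_k_genes(delta_true, k) & top_k_genes(delta_pred, k)) / k


def direction_match(delta_true, delta_pred, eps=0.0):
    """Fraction of genes with |true delta| > eps whose predicted sign matches."""
    pairs = [(t, p) for t, p in zip(delta_true, delta_pred) if abs(t) > eps]
    if not pairs:
        return float("nan")
    return sum(1 for t, p in pairs if t * p > 0) / len(pairs)


def pearson_delta(delta_true, delta_pred):
    """Pearson correlation between true and predicted per-gene deltas."""
    n = len(delta_true)
    mean_t = sum(delta_true) / n
    mean_p = sum(delta_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(delta_true, delta_pred))
    var_t = sum((t - mean_t) ** 2 for t in delta_true)
    var_p = sum((p - mean_p) ** 2 for p in delta_pred)
    if var_t == 0.0 or var_p == 0.0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_t * var_p)


# Toy example with five genes.
true_delta = [2.0, -1.5, 0.1, 0.0, -0.2]
pred_delta = [1.8, -0.3, 0.2, 0.05, 1.0]
print(des_at_k(true_delta, pred_delta, k=2))
print(direction_match(true_delta, pred_delta))
print(pearson_delta(true_delta, pred_delta))
```

For leave-drug-out evaluation these would be computed per held-out perturbation and then averaged, so that drugs with many profiled cells do not dominate the score; that aggregation choice is also an assumption rather than something stated in the task suite.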