
MuLGIT-Perturb: Perturbation Layer Extension

From Static Biomarker Ranking to Counterfactual Molecular State Prediction

Status: Design Specification (Study Plan)
Repository: https://huggingface.co/vedatonuryilmaz/MuLGIT
Date: 2026-05-09


1. Motivation: What Static Models Miss

MuLGIT currently identifies features that correlate with survival/longevity and ranks drugs by their static properties (MOA, clinical status, drug-likeness). But correlation is not causation: a drug ranked highly by MOA relevance may actually worsen the molecular state of a given patient. What is missing is the ability to answer:

"What will this perturbation do to this patient's molecular state?"

Moving beyond static ranking to counterfactual prediction turns MuLGIT from a descriptive tool into a prescriptive one, useful for drug repurposing, mechanism inference, target validation, and personalized treatment-response prioritization.


2. Capability Specification

Input

  1. Baseline omics state, i.e. the patient's or cell's molecular profile before perturbation:
    • DNA: methylation β-values, CNV copy numbers
    • RNA: mRNA expression, miRNA expression
    • (Future) Chromatin: WGBS, ATAC-seq GeneActivity
  2. Perturbation descriptor, i.e. what is being applied:
    • Drug: SMILES string → molecular fingerprint or ChemBERTa embedding
    • Genetic perturbation: gene ID + perturbation type (CRISPR KO, CRISPRi, OE, shRNA)
    • Drug combination: set of drug fingerprints + dose ratios

Output

  1. Predicted molecular state shift (Δ), as per-gene or per-pathway expression change:
    • ŷ_post = f(baseline, perturbation) → predicted post-perturbation expression
    • Δ = ŷ_post - baseline → direction and magnitude of change per gene
  2. Uncertainty quantification, as per-gene prediction intervals:
    • Epistemic uncertainty via deep ensembles (3-5 models with different seeds)
    • Aleatoric uncertainty via predicted variance (heteroscedastic output head)
  3. Mechanistic attribution, i.e. why this perturbation causes this shift:
    • Gene set activation scores (MSigDB pathways, Reactome, KEGG)
    • Top differentially expressed genes (DEGs) with effect sizes
    • Perturbation-conditioned subgraph of activated molecular interactions (from biological knowledge graphs)
    • Counterfactual decomposition: "If we blocked pathway X, the effect on gene Y would change by Z"

3. Architecture: MuLGIT-Perturb

3.1 Design Principles

  1. Reuse MuLGIT encoders: the CentralDogmaFusion SNN encoders for methylation, CNV, mRNA, and miRNA are already trained on TCGA survival prediction. They become frozen feature extractors for the perturbation layer, providing biologically grounded latent representations.

  2. Perturbation-conditioned residual prediction: the model predicts the difference between pre- and post-perturbation molecular states, not the absolute state. This follows common practice in the field (CPA, GEARS, and Lingshu-Cell all use delta prediction).

  3. Central dogma as structural prior: the perturbation flows through the same biological layers, drug → chromatin state → DNA methylation → mRNA expression → miRNA → predicted phenotype shift.

  4. Uncertainty via deep ensembles: train 3-5 identical-architecture models with different random seeds. The variance of the predictions across ensemble members gives the epistemic uncertainty. Deep ensembles are a simple, widely validated approach (Lakshminarayanan et al., NeurIPS 2017).

  5. Chemical structure encoding: use MolFormer or ChemBERTa-2 (pretrained, frozen) to encode drug SMILES into fixed-size embeddings, replacing the simple 1D-CNN in the current drug_target.py.

3.2 Architecture Diagram

                        ┌──────────────────────────────────────┐
                        │   PERTURBATION ENCODER               │
                        │  ┌───────────┐   ┌─────────────┐     │
     Drug SMILES ──────►│  │ MolFormer │   │ Geneformer  │     │◄── Genetic Pert.
                        │  │  (frozen) │   │  (frozen)   │     │
                        │  └─────┬─────┘   └──────┬──────┘     │
                        │        │                │            │
                        │  ┌─────▼────────────────▼────────┐   │
                        │  │  Fusion Gate (learned α)      │   │
                        │  │  z_pert = α·z_drug +          │   │
                        │  │           (1-α)·z_gene        │   │
                        │  └───────────────┬───────────────┘   │
                        └──────────────────┼───────────────────┘
                                           │ z_pert ∈ R^768
                                           │
    ┌──────────────────────────────────────┼─────────────────────────────┐
    │                 MULGIT (frozen feature extractor)                  │
    │                                                                    │
    │  BASELINE STATE                                                    │
    │  Methylation ─► SNN Encoder ─► z_meth                              │
    │  CNV         ─► SNN Encoder ─► z_cnv                               │
    │  mRNA        ─► SNN Encoder ─► z_mrna                              │
    │  miRNA       ─► SNN Encoder ─► z_mirna                             │
    │                                                                    │
    │  CentralDogmaFusion                                                │
    │  z_dna = [z_meth, z_cnv]                                           │
    │  z_dna_rna = DNA→RNA(z_dna, z_mrna)                                │
    │  z_fused = FullFusion(z_dna_rna, z_mrna, z_mirna)                  │
    │                    │                                               │
    └────────────────────┼───────────────────────────────────────────────┘
                         │ z_fused ∈ R^48 (baseline latent state)
                         │
    ┌────────────────────▼───────────────────────────────────────────────┐
    │             PERTURBATION RESPONSE PREDICTOR (trained)              │
    │                                                                    │
    │  ┌──────────────────────────────────────────────────────────┐      │
    │  │  Condition Fusion                                        │      │
    │  │  z_cond = FiLM(z_fused, z_pert)                          │      │
    │  │  (Feature-wise Linear Modulation: scale + shift from     │      │
    │  │   perturbation embedding applied to baseline state)      │      │
    │  └────────────────────────────┬─────────────────────────────┘      │
    │                               │                                    │
    │  ┌────────────────────────────▼─────────────────────────────┐      │
    │  │  Delta Decoder (SNNStack)                                │      │
    │  │  z_cond ─► [1024] ─► [512] ─► [256] ─► [128] ─► Δ        │      │
    │  │                                              └─► σ² (UQ) │      │
    │  │  Δ ∈ R^G (per-gene logFC prediction)                     │      │
    │  │  σ² ∈ R^G (per-gene predicted variance)                  │      │
    │  └────────────────────────────┬─────────────────────────────┘      │
    │                               │                                    │
    │  ŷ_post = y_baseline + Δ      │                                    │
    │                               │                                    │
    └───────────────────────────────┬────────────────────────────────────┘
                                    │
    ┌───────────────────────────────▼────────────────────────────────────┐
    │                            OUTPUT LAYER                            │
    │                                                                    │
    │  ┌───────────────┐  ┌───────────────┐  ┌────────────────────────┐  │
    │  │ Δ per gene    │  │ σ² per gene   │  │ Pathway Enrichment     │  │
    │  │ (logFC)       │  │ (variance)    │  │ (GSEA on Δ)            │  │
    │  └───────────────┘  └───────────────┘  └────────────────────────┘  │
    │                                                                    │
    │  ┌──────────────────────────────────────────────────────────────┐  │
    │  │  Mechanistic Attribution                                     │  │
    │  │  ┌───────────────────┐  ┌─────────────────────────────────┐  │  │
    │  │  │ Integrated        │  │ Perturbation-Conditioned        │  │  │
    │  │  │ Gradients on Δ    │  │ Knowledge Subgraph              │  │  │
    │  │  │ (gene→pathway     │  │ (AdaPert-style: GNN over        │  │  │
    │  │  │  attribution)     │  │  GO/Reactome/STRING + drug)     │  │  │
    │  │  └───────────────────┘  └─────────────────────────────────┘  │  │
    │  └──────────────────────────────────────────────────────────────┘  │
    └────────────────────────────────────────────────────────────────────┘

3.3 Key Components

3.3.1 PerturbationEncoder

Encodes both drug and genetic perturbations into a unified embedding:

  • Drug: SMILES → MolFormer (frozen, 768-dim) or ChemBERTa-2
  • Genetic: Gene symbol → Geneformer gene embedding (frozen)
  • Fusion: Learned convex combination z_pert = α·z_drug + (1-α)·z_gene with α ∈ [0,1]
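
The convex combination above could be sketched in PyTorch as follows. This is a minimal illustration, not the planned encoder.py: the class name `FusionGate`, the 256-dim gene embedding size, the two projection layers that bring both embeddings to a common width, and the choice of a single scalar α are all assumptions made here.

```python
import torch
import torch.nn as nn


class FusionGate(nn.Module):
    """z_pert = alpha * z_drug + (1 - alpha) * z_gene with a learned scalar alpha."""

    def __init__(self, drug_dim=768, gene_dim=256, out_dim=768):
        super().__init__()
        # Assumed projections: the two frozen encoders need not share a width.
        self.drug_proj = nn.Linear(drug_dim, out_dim)
        self.gene_proj = nn.Linear(gene_dim, out_dim)
        # Unconstrained parameter; sigmoid(0) = 0.5 gives an even initial mix.
        self.alpha_logit = nn.Parameter(torch.zeros(1))

    def forward(self, z_drug, z_gene):
        alpha = torch.sigmoid(self.alpha_logit)  # keeps alpha strictly in (0, 1)
        return alpha * self.drug_proj(z_drug) + (1 - alpha) * self.gene_proj(z_gene)


gate = FusionGate()
z_pert = gate(torch.randn(4, 768), torch.randn(4, 256))  # shape (4, 768)
```

How a drug-only or gene-only perturbation should be handled (for example, by masking the missing branch) is not specified here and is left open in this sketch.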

3.3.2 FiLM Conditioning Layer

Feature-wise Linear Modulation applies perturbation as scale + shift to baseline state:

z_cond = γ(z_pert) ⊙ z_fused + β(z_pert)

Unlike plain concatenation, FiLM lets the perturbation rescale and shift each latent feature directly, which makes it well suited to conditioning a state representation.
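
A minimal PyTorch sketch of this modulation is given below, using the dimensions from the diagram (z_fused ∈ R^48, z_pert ∈ R^768). The single linear generator for γ and β and the identity initialization are assumptions of this sketch, not details fixed by the specification.

```python
import torch
import torch.nn as nn


class FiLMConditioning(nn.Module):
    """z_cond = gamma(z_pert) * z_fused + beta(z_pert)."""

    def __init__(self, state_dim: int = 48, pert_dim: int = 768):
        super().__init__()
        # One linear layer emits both gamma and beta (assumed form).
        self.to_gamma_beta = nn.Linear(pert_dim, 2 * state_dim)
        # Zero init: gamma = 1 and beta = 0 at the start of training,
        # so the layer begins as the identity on z_fused.
        nn.init.zeros_(self.to_gamma_beta.weight)
        nn.init.zeros_(self.to_gamma_beta.bias)

    def forward(self, z_fused, z_pert):
        d_gamma, beta = self.to_gamma_beta(z_pert).chunk(2, dim=-1)
        return (1.0 + d_gamma) * z_fused + beta


film = FiLMConditioning()
z_fused = torch.randn(4, 48)
z_pert = torch.randn(4, 768)
z_cond = film(z_fused, z_pert)  # same shape as z_fused
```

Parameterizing the scale as 1 + d_gamma is one way to make "no perturbation effect" the starting point, which fits a model that predicts a delta on top of a frozen baseline representation.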

3.3.3 DeltaDecoderWithUncertainty

Predicts per-gene logFC (Δ) and per-gene prediction variance (σ²):

  • Shared SNNStack: z_cond → [1024] → [512] → [256] → [128]
  • Δ head: Linear(128, n_genes) → predicted expression change
  • σ² head: Linear(128, n_genes) → Softplus → predicted variance (>0)
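
The layer list above might look like this in PyTorch. The SELU + AlphaDropout block is an assumed stand-in for MuLGIT's SNNStack (whose exact definition is not shown here), and n_genes=978 follows the L1000-landmark starting point mentioned in the risk table; the dropout rate and variance floor are illustrative.

```python
import torch
import torch.nn as nn


class DeltaDecoderWithUncertainty(nn.Module):
    def __init__(self, in_dim=48, n_genes=978, hidden=(1024, 512, 256, 128)):
        super().__init__()
        layers, prev = [], in_dim
        for width in hidden:
            # Self-normalizing block (assumed equivalent of SNNStack).
            layers += [nn.Linear(prev, width), nn.SELU(), nn.AlphaDropout(0.1)]
            prev = width
        self.trunk = nn.Sequential(*layers)
        self.delta_head = nn.Linear(prev, n_genes)
        # Softplus keeps the predicted variance strictly positive.
        self.var_head = nn.Sequential(nn.Linear(prev, n_genes), nn.Softplus())

    def forward(self, z_cond):
        h = self.trunk(z_cond)
        delta = self.delta_head(h)
        var = self.var_head(h) + 1e-6  # small floor keeps log(var) finite
        return delta, var


decoder = DeltaDecoderWithUncertainty().eval()  # eval() disables dropout
delta, var = decoder(torch.randn(4, 48))
```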

3.3.4 MuLGITPerturbEnsemble

K=3-5 identical models trained with different seeds:

  • Epistemic uncertainty = variance of Δ across ensemble
  • Aleatoric uncertainty = mean σ² from each model
  • Total uncertainty = epistemic + aleatoric
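
As a small numerical illustration of this decomposition, the NumPy sketch below combines the outputs of K ensemble members. The function name and the toy numbers are made up for the example.

```python
import numpy as np


def ensemble_uncertainty(deltas, variances):
    """deltas, variances: arrays of shape (K, G) from K ensemble members."""
    deltas = np.asarray(deltas, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mean_delta = deltas.mean(axis=0)
    epistemic = deltas.var(axis=0)       # disagreement between members
    aleatoric = variances.mean(axis=0)   # average predicted noise
    return mean_delta, epistemic, aleatoric, epistemic + aleatoric


# K = 3 members, G = 2 genes (toy values).
deltas = [[0.9, -0.2], [1.1, -0.2], [1.0, -0.2]]
variances = [[0.10, 0.05], [0.20, 0.05], [0.30, 0.05]]
mean_delta, epi, ale, total = ensemble_uncertainty(deltas, variances)
```

For the first gene the members disagree, so both terms contribute; for the second gene all members agree and the total uncertainty is purely aleatoric.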

4. Data Strategy

4.1 Primary: Tahoe-100M

Status: ✅ Ready on HF Hub
Path: tahoebio/Tahoe-100M

| Property | Value |
|---|---|
| Rows | 95.6M drug-gene perturbation observations |
| Drugs | ~1,100 (with canonical SMILES, PubChem CID, MOA) |
| Cell lines | 50 cancer cell lines |
| Genes | 45,134 (full transcriptome) |
| Matched pre/post | Yes (vehicle controls per cell line) |

4.2 Supplementary Datasets

| Dataset | Perturbation Type | Samples | HF Hub | Use |
|---|---|---|---|---|
| GDSC | Drug (251) | 251K IC50 | ❌ | IC50 validation |
| DepMap | Genetic (CRISPR KO) | 1,000+ cell lines | ❌ | Genetic pert. validation |
| LINCS L1000 | Drug + genetic | 1.3M profiles | ❌ | Cross-dataset benchmark |
| Norman19 | Genetic combos | 91K cells | Partial | Combinatorial benchmark |

4.3 Data Loading for Training

from datasets import load_dataset

# Drug metadata
drug_meta = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train")

# Expression data (streaming for 95.6M rows)
expr_ds = load_dataset("tahoebio/Tahoe-100M", "expression_data", 
                        split="train", streaming=True)

# Each row: genes (int64[]), expressions (float32[]), drug, cell_line_id, 
#           canonical_smiles, pubchem_cid, moa-fine, sample
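
Since each row stores expression sparsely as parallel `genes` / `expressions` arrays, a training loader needs to scatter it into a fixed-length vector. The sketch below shows one way to do that. Whether `genes` holds column indices or vocabulary tokens is not stated here, so a hypothetical `token_to_col` mapping onto a chosen gene panel is assumed, and the toy row is invented.

```python
import numpy as np


def densify(genes, expressions, token_to_col, n_genes):
    """Scatter a sparse (genes, expressions) row into a dense vector."""
    dense = np.zeros(n_genes, dtype=np.float32)
    for token, value in zip(genes, expressions):
        col = token_to_col.get(int(token))
        if col is not None:          # skip tokens outside the chosen gene panel
            dense[col] = value
    return dense


# Toy row in the documented shape; real token IDs come from the dataset.
row = {"genes": [101, 205, 999], "expressions": [3.0, 1.0, 7.0]}
token_to_col = {101: 0, 205: 2}      # hypothetical gene panel of size 4
x = densify(row["genes"], row["expressions"], token_to_col, n_genes=4)
```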

5. Training Objective

5.1 Primary Loss: Negative Log-Likelihood with Uncertainty

L_NLL = 0.5 · Σ_g [ log(σ²_g) + (Δ_true,g − Δ_pred,g)² / σ²_g ]

Heteroscedastic regression loss (Kendall & Gal, NeurIPS 2017). When σ²_g is large, the model is uncertain and the squared-error term is down-weighted; the log(σ²_g) term in turn penalizes the model for inflating the variance everywhere.
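
A direct PyTorch transcription of this loss is sketched below, with a toy comparison showing the trade-off between the two terms. Averaging over the batch dimension is an assumption of this sketch. PyTorch also provides `torch.nn.GaussianNLLLoss`, which computes a closely related quantity and may be preferable in practice.

```python
import torch


def heteroscedastic_nll(delta_pred, var_pred, delta_true):
    """0.5 * sum_g [log(var_g) + (true_g - pred_g)^2 / var_g], batch-averaged."""
    per_gene = torch.log(var_pred) + (delta_true - delta_pred) ** 2 / var_pred
    return 0.5 * per_gene.sum(dim=-1).mean()


# One sample, two genes; the prediction is wrong by 1.0 on the first gene.
delta_true = torch.tensor([[1.0, 0.0]])
delta_pred = torch.tensor([[0.0, 0.0]])
loss_confident = heteroscedastic_nll(delta_pred, torch.tensor([[0.1, 1.0]]), delta_true)
loss_uncertain = heteroscedastic_nll(delta_pred, torch.tensor([[1.0, 1.0]]), delta_true)
```

Being confidently wrong (small variance on the mispredicted gene) costs more than admitting uncertainty, so `loss_confident` exceeds `loss_uncertain`.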

5.2 Auxiliary Losses

  1. Pathway consistency: L_path = MSE(GSEA(Δ_pred), GSEA(Δ_true)), so that pathways shift coherently
  2. Sparsity: L_sparse = ||Δ_pred||₁ / (B·G), so that only genes that actually respond shift
  3. Risk shift (optional): L_risk = MSE(risk_post − risk_baseline, risk_shift_target), which ties back to survival

5.3 Hyperparameters

| Parameter | Value | Source |
|---|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) | Lingshu-Cell |
| LR schedule | 2e-4 → 2e-5, cosine | Lingshu-Cell |
| Batch size | 256 | Lingshu-Cell |
| Mixed precision | bf16 | Lingshu-Cell |
| MuLGIT backbone | Frozen | Transfer from TCGA |
| Ensemble size | K=3 models | Lakshminarayanan 2017 |
| Epochs | 50-100 (early stopping on val NLL) | Standard |

6. Benchmark Tasks & Leakage Controls

6.1 Task Suite

| Task | Data | Split | Key Metric |
|---|---|---|---|
| Drug response prediction | Tahoe-100M | Leave-drug-out (20%) | DES@50, Pearson-Δ |
| Genetic perturbation | DepMap/Norman19 | Leave-gene-out (20%) | DES@50, Direction-match |
| Combinatorial pert. | Norman19 | Leave-combo-out (70%) | PDS, compositionality |
| Drug sensitivity (IC50) | GDSC | Leave-cell-line-out | Pearson r, Spearman ρ |
| Pathway rescue | Tahoe-100M (aging) | Leave-pathway-out | AUC of rescue |
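
A leave-drug-out split has to be made at the level of drugs, not rows, so that no held-out compound leaks into training through another cell line or dose. The standard-library sketch below illustrates this. The function name, the toy rows, and the placeholder drug names are invented for the example.

```python
import random


def leave_drug_out_split(drugs, test_frac=0.2, seed=0):
    """Partition unique drug names so no test drug appears in training."""
    unique = sorted(set(drugs))           # sort first for reproducibility
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(test_frac * len(unique)))
    return set(unique[n_test:]), set(unique[:n_test])


# Toy (drug, cell_line) rows; real rows come from the expression dataset.
rows = [("Rapamycin", "A549"), ("Rapamycin", "HeLa"), ("Drug_B", "A549"),
        ("Drug_C", "A549"), ("Drug_D", "HeLa"), ("Drug_E", "A549")]
train_drugs, test_drugs = leave_drug_out_split([d for d, _ in rows])
train_rows = [r for r in rows if r[0] in train_drugs]
test_rows = [r for r in rows if r[0] in test_drugs]
```

A stricter variant would also group close structural analogs before splitting; that is a possible extension, not something specified here.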

6.2 Metrics

| Metric | What It Measures | Range | Target |
|---|---|---|---|
| DES@K | Fraction of true DEGs in top-K predicted | [0,1] | Higher |
| Pearson-Δ | Correlation of predicted vs true Δ | [-1,1] | Higher |
| Direction-match | Fraction of genes with correct sign | [0,1] | Higher |
| PDS | Discrimination between perturbations | [0,1] | Higher |
| MMD | Distributional fidelity in PCA space | [0,∞) | Lower |
| RMSE | Raw expression reconstruction error | [0,∞) | Lower |
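
The three per-perturbation metrics at the top of the table can be sketched in NumPy as follows. The DES@K definition used here (the true DEG set is taken as the top-K genes by |Δ_true|) is an assumption, since the table does not fix how true DEGs are called; the toy vectors are invented.

```python
import numpy as np


def des_at_k(delta_pred, delta_true, k):
    """Overlap between the top-k genes by |delta| in prediction and truth."""
    top_pred = set(np.argsort(-np.abs(delta_pred))[:k])
    top_true = set(np.argsort(-np.abs(delta_true))[:k])
    return len(top_pred & top_true) / k


def pearson_delta(delta_pred, delta_true):
    """Pearson correlation between predicted and true per-gene deltas."""
    return float(np.corrcoef(delta_pred, delta_true)[0, 1])


def direction_match(delta_pred, delta_true):
    """Fraction of genes whose predicted sign equals the true sign."""
    return float(np.mean(np.sign(delta_pred) == np.sign(delta_true)))


delta_true = np.array([2.0, -1.5, 0.1, 0.0, -0.2])
delta_pred = np.array([1.8, 0.3, 0.2, -0.1, -1.0])
des = des_at_k(delta_pred, delta_true, k=2)
r = pearson_delta(delta_pred, delta_true)
dm = direction_match(delta_pred, delta_true)
```

Note that a gene with a true delta of exactly zero counts as a mismatch unless the prediction is also exactly zero; a real evaluation would likely restrict direction-match to genes called as DEGs.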

7. Comparison to Existing Baselines

| Method | Gene Coverage | Drug Input | UQ | Attribution | HF-Ready |
|---|---|---|---|---|---|
| CPA | HVG (1-4K) | One-hot ID | Latent sampling | ❌ | ❌ |
| GEARS | All genes | One-hot ID | ❌ | Gene graph edges | ❌ |
| Lingshu-Cell | All 18K | One-hot ID | ❌ | ❌ | ❌ |
| AdaPert | HVG only | Semantic embed | ❌ | ✅ KG subgraph | ❌ |
| MuLGIT-Perturb | All 45K+ | SMILES→MolFormer | ✅ Ensemble | ✅ IG + KG | ✅ |

8. Implementation Plan

8.1 New Files

mulgit/
├── perturb/
│   ├── __init__.py
│   ├── encoder.py          # PerturbationEncoder (drug + gene)
│   ├── conditioning.py     # FiLMConditioning
│   ├── decoder.py          # DeltaDecoderWithUncertainty
│   ├── model.py            # MuLGITPerturb
│   ├── ensemble.py         # MuLGITPerturbEnsemble
│   ├── attribution.py      # Integrated Gradients + KG subgraph
│   ├── losses.py           # NLL + pathway consistency + sparsity
│   ├── data.py             # Tahoe-100M perturbation data loader
│   ├── trainer.py          # Training loop with Trackio
│   ├── evaluate.py         # DES@K, PDS, Pearson-Δ, etc.
│   └── config.py           # MuLGITPerturbConfig

8.2 Five-Phase Execution

| Phase | Duration | Deliverable |
|---|---|---|
| 1. Tahoe-100M drug response | 2 weeks | Trained model + all benchmarks on leave-drug-out |
| 2. Cross-dataset generalization | 1 week | Zero-shot evaluation on LINCS, fine-tune if needed |
| 3. Genetic perturbations | 1 week | DepMap KO prediction, compare to Geneformer |
| 4. Combinatorial perturbation | 1 week | Norman19 combos, PDS evaluation |
| 5. Longevity-specific | 1 week | Aging pathway rescue, drug ranking vs static screener |

Total: 6 weeks from start to full validation.


9. Example Output

{
  "perturbation": "Rapamycin",
  "cell_line": "A549",
  "baseline_risk": 0.342,
  "predicted_post_risk": -0.187,
  "risk_shift": -0.529,
  
  "top_degs": [
    {"gene": "MTOR", "logFC": -0.23, "ci_95": [-0.45, -0.01]},
    {"gene": "FOXO3", "logFC": 0.87, "ci_95": [0.52, 1.22]},
    {"gene": "ATG5", "logFC": 1.34, "ci_95": [0.98, 1.70]}
  ],
  
  "enriched_pathways": [
    {"pathway": "Autophagy", "NES": 2.1, "pval": 0.0003},
    {"pathway": "mTOR signaling", "NES": -1.8, "pval": 0.001}
  ],
  
  "aging_signature_reversal": 0.72,
  "epistemic_uncertainty": 0.08,
  "aleatoric_uncertainty": 0.15
}

10. Risk Assessment

| Risk | Mitigation |
|---|---|
| Tahoe-100M streaming slow (95.6M rows) | Local cache after first pass; pre-filter to the 10M most informative rows |
| MolFormer fails for some SMILES | Fall back to Morgan fingerprints (RDKit already in deps) |
| MuLGIT backbone overfits to TCGA | Include tissue type as conditioning; evaluate zero-shot on LINCS |
| GPU memory for 45K-gene output | Start with L1000 landmarks (978); scale after validation |
| Combinatorial prediction too hard | Treat as a stretch goal; single-perturbation prediction alone is impactful |

References

  1. Lotfollahi et al., "CPA," arxiv:2011.03086 (2021)
  2. Roohani et al., "GEARS," Nature Biotech 2023. arxiv:2205.13986
  3. Chen et al., "Lingshu-Cell," arxiv:2603.25240 (2025)
  4. Li et al., "AdaPert," arxiv:2602.18885 (2025)
  5. He et al., "PerturBench," arxiv:2408.10609 (2024)
  6. El Nahhas et al., "SeNMo," arxiv:2405.08226 (2024)
  7. Ross et al., "MolFormer," Nature Machine Intelligence 2022
  8. Theodoris et al., "Geneformer," Nature 2023
  9. Perez et al., "FiLM," AAAI 2018
  10. Kendall & Gal, "Bayesian UQ," NeurIPS 2017
  11. Lakshminarayanan et al., "Deep Ensembles," NeurIPS 2017
  12. Mendez-Lucio et al., "Tahoe-100M," bioRxiv 2024