# MuLGIT-Perturb: Perturbation Layer Extension
## From Static Biomarker Ranking to Counterfactual Molecular State Prediction
**Status:** Design Specification (Study Plan)
**Repository:** https://huggingface.co/vedatonuryilmaz/MuLGIT
**Date:** 2026-05-09
---
## 1. Motivation: What Static Models Miss
MuLGIT currently identifies **causal features** that correlate with survival/longevity and **ranks drugs** by their static properties (MOA, clinical status, drug-likeness). But correlation is not causation; a drug ranked highly by MOA relevance may actually **worsen** the molecular state of a given patient. What's missing is the ability to answer:
> **"What will this perturbation do to this patient's molecular state?"**
Moving beyond static ranking to counterfactual prediction transforms MuLGIT from a descriptive tool into a prescriptive one, useful for drug repurposing, mechanism inference, target validation, and personalized treatment-response prioritization.
---
## 2. Capability Specification
### Input
1. **Baseline omics state** (the patient's or cell's molecular profile before perturbation):
   - DNA: methylation β-values, CNV copy numbers
   - RNA: mRNA expression, miRNA expression
   - (Future) Chromatin: WGBS, ATAC-seq GeneActivity
2. **Perturbation descriptor** (what is being applied):
   - Drug: SMILES string → molecular fingerprint or ChemBERTa embedding
   - Genetic perturbation: gene ID + perturbation type (CRISPR KO, CRISPRi, OE, shRNA)
   - Drug combination: set of drug fingerprints + dose ratios
### Output
1. **Predicted molecular state shift (Δ)** (per-gene or per-pathway expression change):
   - `ŷ_post = f(baseline, perturbation)` → predicted post-perturbation expression
   - `Δ = ŷ_post - baseline` → direction and magnitude of change per gene
2. **Uncertainty quantification** (per-gene prediction intervals):
   - Epistemic uncertainty via deep ensembles (3-5 models with different seeds)
   - Aleatoric uncertainty via predicted variance (heteroscedastic output head)
3. **Mechanistic attribution** (why this perturbation causes this shift):
   - Gene set activation scores (MSigDB pathways, Reactome, KEGG)
   - Top differentially expressed genes (DEGs) with effect sizes
   - Perturbation-conditioned subgraph of activated molecular interactions (from biological knowledge graphs)
   - Counterfactual decomposition: "If we blocked pathway X, the effect on gene Y would change by Z"
---
## 3. Architecture: MuLGIT-Perturb
### 3.1 Design Principles
1. **Reuse MuLGIT encoders**: the CentralDogmaFusion SNN encoders for methylation, CNV, mRNA, and miRNA are already trained on TCGA survival prediction. These become **frozen feature extractors** for the perturbation layer, providing biologically grounded latent representations.
2. **Perturbation-conditioned residual prediction**: the model predicts the **difference** between pre- and post-perturbation molecular states, not the absolute state. This is common practice in the field (CPA, GEARS, and Lingshu-Cell all use delta prediction).
3. **Central dogma as structural prior**: the perturbation flows through the same biological layers: drug → chromatin state → DNA methylation → mRNA expression → miRNA → predicted phenotype shift.
4. **Uncertainty via deep ensembles**: train 3-5 identical-architecture models with different random seeds. The variance of the predictions across ensemble members estimates epistemic uncertainty. Deep ensembles are a simple and widely used approach for this (Lakshminarayanan et al., NeurIPS 2017).
5. **Chemical structure encoding**: use MolFormer or ChemBERTa-2 (pretrained, frozen) to encode drug SMILES into fixed-size embeddings, replacing the simple 1D-CNN in the current `drug_target.py`.
### 3.2 Architecture Diagram
```
                    ┌────────────────────────────────────┐
                    │        PERTURBATION ENCODER        │
                    │  ┌──────────┐  ┌─────────────┐     │
Drug SMILES ──────► │  │ MolFormer│  │ Geneformer  │◄─────── Genetic Pert.
                    │  │ (frozen) │  │  (frozen)   │     │
                    │  └────┬─────┘  └──────┬──────┘     │
                    │       │               │            │
                    │  ┌────▼───────────────▼──────┐     │
                    │  │ Fusion Gate (learned α)   │     │
                    │  │ z_pert = α·z_drug +       │     │
                    │  │          (1-α)·z_gene     │     │
                    │  └─────────────┬─────────────┘     │
                    └────────────────┼───────────────────┘
                                     │ z_pert ∈ R^768
                                     │
┌────────────────────────────────────┼───────────────────────────────┐
│                 MULGIT (frozen feature extractor)                  │
│                                                                    │
│  BASELINE STATE                                                    │
│  Methylation ─► SNN Encoder ─► z_meth                              │
│  CNV         ─► SNN Encoder ─► z_cnv                               │
│  mRNA        ─► SNN Encoder ─► z_mrna                              │
│  miRNA       ─► SNN Encoder ─► z_mirna                             │
│                                                                    │
│  CentralDogmaFusion                                                │
│    z_dna     = [z_meth, z_cnv]                                     │
│    z_dna_rna = DNA→RNA(z_dna, z_mrna)                              │
│    z_fused   = FullFusion(z_dna_rna, z_mrna, z_mirna)              │
│                    │                                               │
└────────────────────┼───────────────────────────────────────────────┘
                     │ z_fused ∈ R^48 (baseline latent state)
                     │
┌────────────────────▼───────────────────────────────────────────────┐
│             PERTURBATION RESPONSE PREDICTOR (trained)              │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Condition Fusion                                            │  │
│  │  z_cond = FiLM(z_fused, z_pert)                              │  │
│  │  (Feature-wise Linear Modulation: scale + shift from         │  │
│  │   perturbation embedding applied to baseline state)          │  │
│  └──────────────────────────────┬───────────────────────────────┘  │
│                                 │                                  │
│  ┌──────────────────────────────▼───────────────────────────────┐  │
│  │  Delta Decoder (SNNStack)                                    │  │
│  │  z_cond ─► [1024] ─► [512] ─► [256] ─► [128] ─► Δ            │  │
│  │                                             └─► σ² (UQ)      │  │
│  │  Δ ∈ R^G (per-gene logFC prediction)                         │  │
│  │  σ² ∈ R^G (per-gene predicted variance)                      │  │
│  └──────────────────────────────┬───────────────────────────────┘  │
│                                 │                                  │
│                      ŷ_post = y_baseline + Δ                       │
│                                                                    │
└─────────────────────────────────┬──────────────────────────────────┘
                                  │
┌─────────────────────────────────▼──────────────────────────────────┐
│                            OUTPUT LAYER                            │
│                                                                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐  │
│  │ Δ per gene   │  │ σ² per gene  │  │ Pathway Enrichment       │  │
│  │ (logFC)      │  │ (variance)   │  │ (GSEA on Δ)              │  │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘  │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Mechanistic Attribution                                     │  │
│  │  ┌─────────────────┐  ┌──────────────────────────────────┐   │  │
│  │  │ Integrated      │  │ Perturbation-Conditioned         │   │  │
│  │  │ Gradients on Δ  │  │ Knowledge Subgraph               │   │  │
│  │  │ (gene→pathway   │  │ (AdaPert-style: GNN over         │   │  │
│  │  │  attribution)   │  │  GO/Reactome/STRING + drug)      │   │  │
│  │  └─────────────────┘  └──────────────────────────────────┘   │  │
│  └──────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────┘
```
### 3.3 Key Components
#### 3.3.1 PerturbationEncoder
Encodes both drug and genetic perturbations into a unified embedding:
- Drug: SMILES → MolFormer (frozen, 768-dim) or ChemBERTa-2
- Genetic: Gene symbol → Geneformer gene embedding (frozen)
- Fusion: Learned convex combination `z_pert = α·z_drug + (1-α)·z_gene` with `α ∈ [0,1]`
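
The convex combination above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's implementation: the function name and `gate_logit` are hypothetical, and passing an unconstrained scalar through a sigmoid is just one assumed way to keep `α` between 0 and 1 (it yields the open interval). It also assumes both embeddings already share a dimension; if the two frozen encoders emit different widths, a projection would be needed first.

```python
import math

def fuse_perturbation(z_drug, z_gene, gate_logit):
    """Convex combination z_pert = alpha*z_drug + (1-alpha)*z_gene.

    gate_logit is a hypothetical unconstrained learnable scalar; the
    sigmoid maps it into (0, 1) so the mixture stays convex.
    """
    if len(z_drug) != len(z_gene):
        raise ValueError("embeddings must share a dimension before fusion")
    alpha = 1.0 / (1.0 + math.exp(-gate_logit))
    return [alpha * d + (1.0 - alpha) * g for d, g in zip(z_drug, z_gene)]

# With gate_logit = 0 the gate is exactly 0.5, i.e. an equal mix.
z_pert = fuse_perturbation([1.0, 0.0], [0.0, 1.0], gate_logit=0.0)
print(z_pert)
```

For a drug-only or gene-only perturbation, one of the two embeddings is absent; how the gate should behave in that case is left open here.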
#### 3.3.2 FiLM Conditioning Layer
Feature-wise Linear Modulation applies perturbation as scale + shift to baseline state:
```
z_cond = γ(z_pert) ⊙ z_fused + β(z_pert)
```
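Unlike plain concatenation, FiLM lets the perturbation rescale (γ) and shift (β) each latent feature of the baseline state, which makes it well suited to conditioning a state representation.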
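
To make the scale-and-shift mechanics concrete, here is a dependency-free sketch of a FiLM step. It assumes γ and β are each a single linear map from the perturbation embedding to the width of `z_fused`; the helper names and the toy weights are illustrative only, and a real layer would use a tensor library.

```python
def linear(x, weight, bias):
    """Dense layer; weight holds one row of coefficients per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def film(z_fused, z_pert, gamma_w, gamma_b, beta_w, beta_b):
    """z_cond = gamma(z_pert) * z_fused + beta(z_pert), element-wise."""
    gamma = linear(z_pert, gamma_w, gamma_b)  # per-feature scale
    beta = linear(z_pert, beta_w, beta_b)     # per-feature shift
    return [g * z + b for g, z, b in zip(gamma, z_fused, beta)]

# Toy sizes: 2-d baseline state, 3-d perturbation embedding.
z_fused = [1.0, -2.0]
z_pert = [1.0, 0.0, 1.0]
gamma_w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
gamma_b = [1.0, 1.0]                       # scale fixed at 1 (identity)
beta_w = [[0.5, 0.0, 0.5], [0.0, 0.0, 0.0]]
beta_b = [0.0, 0.0]

z_cond = film(z_fused, z_pert, gamma_w, gamma_b, beta_w, beta_b)
print(z_cond)
```

Initialising the scale bias at 1 and the shift at 0, as in the toy weights, makes the layer start out as an identity on `z_fused`; whether MuLGIT-Perturb does this is not specified in this plan.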
#### 3.3.3 DeltaDecoderWithUncertainty
Predicts per-gene logFC (Δ) and per-gene prediction variance (σ²):
- Shared SNNStack: z_cond → [1024] → [512] → [256] → [128]
- Δ head: Linear(128, n_genes) → predicted expression change
- σ² head: Linear(128, n_genes) → Softplus → predicted variance (>0)
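
The two output heads can be illustrated as follows. This is a hedged sketch in plain Python with toy sizes (a 2-unit hidden vector instead of 128). The `min_var` floor is an assumption added here because softplus can underflow to exactly 0 for very negative inputs; it is not part of the specification above.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dense(h, weight, bias):
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(weight, bias)]

def decode_heads(h, delta_w, delta_b, var_w, var_b, min_var=1e-6):
    """Return (delta, variance) from the shared hidden vector h.

    min_var is a hypothetical floor that keeps the variance strictly
    positive even when softplus underflows.
    """
    delta = dense(h, delta_w, delta_b)
    variance = [softplus(r) + min_var for r in dense(h, var_w, var_b)]
    return delta, variance

h = [1.0, 2.0]  # stand-in for the 128-d SNNStack output
delta, variance = decode_heads(
    h,
    delta_w=[[0.5, -0.25]], delta_b=[0.0],
    var_w=[[0.0, 0.0]], var_b=[0.0],
)
print(delta, variance)
```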
#### 3.3.4 MuLGITPerturbEnsemble
K=3-5 identical models trained with different seeds:
- Epistemic uncertainty = variance of Δ across ensemble
- Aleatoric uncertainty = mean σ² from each model
- Total uncertainty = epistemic + aleatoric
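
The decomposition above can be written out directly. The sketch below uses only the standard library; the function name and the nested-list layout (`deltas[k][g]` for member k and gene g) are assumptions for illustration. It uses the population variance across members, which matches treating the ensemble as a uniform mixture of Gaussians.

```python
from statistics import fmean, pvariance

def ensemble_uncertainty(deltas, variances):
    """Combine K ensemble members into per-gene mean and uncertainties.

    deltas[k][g]    : predicted logFC of member k for gene g
    variances[k][g] : predicted variance of member k for gene g
    """
    n_genes = len(deltas[0])
    summary = []
    for g in range(n_genes):
        member_deltas = [d[g] for d in deltas]
        epistemic = pvariance(member_deltas)            # spread across members
        aleatoric = fmean([v[g] for v in variances])    # mean predicted variance
        summary.append({
            "delta": fmean(member_deltas),
            "epistemic": epistemic,
            "aleatoric": aleatoric,
            "total": epistemic + aleatoric,
        })
    return summary

# Three members, one gene.
summary = ensemble_uncertainty(
    deltas=[[0.0], [1.0], [2.0]],
    variances=[[0.1], [0.2], [0.3]],
)
print(summary[0])
```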
---
## 4. Data Strategy
### 4.1 Primary: Tahoe-100M
**Status:** ✅ Ready on HF Hub
**Path:** `tahoebio/Tahoe-100M`
| Property | Value |
|----------|-------|
| Rows | 95.6M drug-gene perturbation observations |
| Drugs | ~1,100 (with canonical SMILES, PubChem CID, MOA) |
| Cell lines | 50 cancer cell lines |
| Genes | 45,134 (full transcriptome) |
| Matched pre/post | Yes (vehicle controls per cell line) |
### 4.2 Supplementary Datasets
| Dataset | Perturbation Type | Samples | HF Hub | Use |
|---------|-------------------|---------|--------|-----|
| GDSC | Drug (251) | 251K IC50 | ❌ | IC50 validation |
| DepMap | Genetic (CRISPR KO) | 1,000+ cell lines | ❌ | Genetic pert. validation |
| LINCS L1000 | Drug + genetic | 1.3M profiles | ❌ | Cross-dataset benchmark |
| Norman19 | Genetic combos | 91K cells | Partial | Combinatorial benchmark |
### 4.3 Data Loading for Training
```python
from datasets import load_dataset

# Drug metadata
drug_meta = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train")

# Expression data (streamed, because the table has 95.6M rows)
expr_ds = load_dataset(
    "tahoebio/Tahoe-100M", "expression_data", split="train", streaming=True
)

# Each row: genes (int64[]), expressions (float32[]), drug, cell_line_id,
# canonical_smiles, pubchem_cid, moa-fine, sample
```
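
As a rough illustration of how a row in this format could become a training target, the sketch below densifies a sparse `(genes, expressions)` pair and computes a log2 fold change against a vehicle-control profile. The helper names, the pseudocount of 1, and the assumption that gene ids have already been remapped to contiguous indices `0..n_genes-1` are illustrative choices, not part of the dataset schema.

```python
import math

def densify(genes, expressions, n_genes):
    """Expand a sparse (index, value) row into a dense expression vector."""
    dense = [0.0] * n_genes
    for gene_idx, value in zip(genes, expressions):
        dense[gene_idx] = float(value)
    return dense

def log_fold_change(treated, control, pseudocount=1.0):
    """Per-gene log2((treated + c) / (control + c)); c avoids division by zero."""
    return [math.log2((t + pseudocount) / (c + pseudocount))
            for t, c in zip(treated, control)]

# Toy example with three genes; the control profile stands in for the
# mean of the matched vehicle-control cells of the same cell line.
treated = densify(genes=[0, 2], expressions=[3.0, 1.0], n_genes=3)
control = [1.0, 0.0, 3.0]
delta_true = log_fold_change(treated, control)
print(delta_true)
```

How control cells are aggregated (per cell line, per plate, pseudo-bulk or single-cell) is a design decision this sketch does not settle.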
---
## 5. Training Objective
### 5.1 Primary Loss: Negative Log-Likelihood with Uncertainty
```
L_NLL = 0.5 · Σ_g [ log(σ²_g) + (Δ_true,g − Δ_pred,g)² / σ²_g ]
```
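This is the heteroscedastic regression loss of Kendall & Gal (NeurIPS 2017). When σ²_g is large, the model is signalling uncertainty and the squared-error term for gene g is down-weighted; the log(σ²_g) term in turn penalizes the model for inflating the variance everywhere.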
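
The loss can be evaluated by hand on a single gene to see the trade-off between the two terms. The function below is a plain-Python sketch for one sample (the name is illustrative); like the formula above, it omits the constant `log(2π)` term.

```python
import math

def gaussian_nll(delta_true, delta_pred, var_pred):
    """0.5 * sum_g [ log(var_g) + (true_g - pred_g)^2 / var_g ]."""
    return 0.5 * sum(
        math.log(v) + (t - p) ** 2 / v
        for t, p, v in zip(delta_true, delta_pred, var_pred)
    )

# Same prediction error (1.0) under three predicted variances.
for var in (0.25, 1.0, 4.0):
    print(var, gaussian_nll([1.0], [0.0], [var]))
```

With an error of 1.0, the loss is lowest at variance 1.0: an overconfident variance (0.25) is punished through the squared-error term, an overly cautious one (4.0) through the log term.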
### 5.2 Auxiliary Losses
1. **Pathway consistency:** `L_path = MSE(GSEA(Δ_pred), GSEA(Δ_true))`, so that pathways shift coherently
2. **Sparsity:** `L_sparse = ||Δ_pred||₁ / (B·G)`, with B the batch size and G the number of genes, so that only genes that actually respond shift
3. **Risk shift (optional):** `L_risk = MSE(risk_post − risk_baseline, risk_shift_target)`, which ties the prediction back to survival
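
A small sketch of the sparsity term and of one way the losses might be combined. The weights `w_path` and `w_sparse` are hypothetical placeholders, since this plan does not specify loss weights, and a weighted sum is only an assumed form for the total objective.

```python
def sparsity_loss(delta_pred_batch):
    """L1 norm of the predicted deltas, averaged over batch and genes."""
    batch_size = len(delta_pred_batch)
    n_genes = len(delta_pred_batch[0])
    l1 = sum(abs(x) for row in delta_pred_batch for x in row)
    return l1 / (batch_size * n_genes)

def total_loss(l_nll, l_path, l_sparse, w_path=0.1, w_sparse=0.01):
    """Weighted sum of primary and auxiliary losses (weights are assumptions)."""
    return l_nll + w_path * l_path + w_sparse * l_sparse

batch = [[1.0, -1.0], [0.0, 2.0]]   # B = 2 samples, G = 2 genes
l_sparse = sparsity_loss(batch)
print(l_sparse, total_loss(1.0, 2.0, l_sparse, w_path=0.5, w_sparse=0.25))
```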
### 5.3 Hyperparameters
| Parameter | Value | Source |
|-----------|-------|--------|
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) | Lingshu-Cell |
| LR schedule | 2e-4 → 2e-5, cosine | Lingshu-Cell |
| Batch size | 256 | Lingshu-Cell |
| Mixed precision | bf16 | Lingshu-Cell |
| MuLGIT backbone | Frozen | Transfer from TCGA |
| Ensemble size | K=3 models | Lakshminarayanan 2017 |
| Epochs | 50-100 (early stopping on validation NLL) | Standard |
---
## 6. Benchmark Tasks & Leakage Controls
### 6.1 Task Suite
| Task | Data | Split | Key Metric |
|------|------|-------|------------|
| Drug response prediction | Tahoe-100M | Leave-drug-out (20%) | DES@50, Pearson-Δ |
| Genetic perturbation | DepMap/Norman19 | Leave-gene-out (20%) | DES@50, Direction-match |
| Combinatorial pert. | Norman19 | Leave-combo-out (70%) | PDS, compositionality |
| Drug sensitivity (IC50) | GDSC | Leave-cell-line-out | Pearson r, Spearman ρ |
| Pathway rescue | Tahoe-100M (aging) | Leave-pathway-out | AUC of rescue |
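
One simple way to implement a leave-drug-out split without leakage is to assign each drug, rather than each row, to a split. The sketch below hashes the drug name so the assignment is deterministic across runs and machines. The function name, bucket count, and toy rows are hypothetical; only the `drug` and `cell_line_id` field names come from the row format in Section 4.3. A hash-based split holds out roughly, not exactly, the requested fraction of drugs.

```python
import hashlib

def assign_split(drug_name, test_fraction=0.2, n_buckets=10_000):
    """Deterministically map a drug to 'train' or 'test' by hashing its name."""
    digest = hashlib.sha256(drug_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % n_buckets
    return "test" if bucket < test_fraction * n_buckets else "train"

rows = [
    {"drug": "Rapamycin", "cell_line_id": "CL1"},
    {"drug": "Rapamycin", "cell_line_id": "CL2"},
    {"drug": "drug_A", "cell_line_id": "CL1"},
    {"drug": "drug_B", "cell_line_id": "CL2"},
]

partition = {"train": [], "test": []}
for row in rows:
    partition[assign_split(row["drug"])].append(row)

# Every row of a given drug lands in the same split, so no drug leaks.
train_drugs = {r["drug"] for r in partition["train"]}
test_drugs = {r["drug"] for r in partition["test"]}
print(sorted(train_drugs), sorted(test_drugs))
```

The same pattern applies to leave-gene-out and leave-cell-line-out by hashing the corresponding key instead.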
### 6.2 Metrics
| Metric | What It Measures | Range | Target |
|--------|-----------------|-------|--------|
| **DES@K** | Fraction of true DEGs in top-K predicted | [0,1] | Higher |
| **Pearson-Δ** | Correlation of predicted vs true Δ | [-1,1] | Higher |
| **Direction-match** | Fraction of genes with correct sign | [0,1] | Higher |
| **PDS** | Discrimination between perturbations | [0,1] | Higher |
| **MMD** | Distributional fidelity in PCA space | [0,∞) | Lower |
| **RMSE** | Raw expression reconstruction error | [0,∞) | Lower |
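
Three of these metrics are simple enough to sketch in plain Python. The table leaves the exact definition of DES@K open, so the version below makes an explicit assumption: the true DEG set is taken to be the K genes with the largest |Δ_true|, and the score is its overlap with the K genes with the largest |Δ_pred|, divided by K. Function names are illustrative.

```python
import math

def top_k_indices(values, k):
    """Indices of the k entries with the largest absolute value."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return set(order[:k])

def des_at_k(delta_true, delta_pred, k):
    """Overlap of top-k true and top-k predicted genes (assumed definition)."""
    return len(top_k_indices(delta_true, k) & top_k_indices(delta_pred, k)) / k

def pearson_delta(delta_true, delta_pred):
    n = len(delta_true)
    mt, mp = sum(delta_true) / n, sum(delta_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(delta_true, delta_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in delta_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in delta_pred))
    return cov / (st * sp)

def sign(x):
    return (x > 0) - (x < 0)

def direction_match(delta_true, delta_pred):
    """Fraction of genes whose predicted change has the correct sign."""
    hits = sum(1 for t, p in zip(delta_true, delta_pred) if sign(t) == sign(p))
    return hits / len(delta_true)

delta_true = [2.0, -1.0, 0.5, 0.1]
delta_pred = [1.5, -0.2, 0.9, -0.3]
print(des_at_k(delta_true, delta_pred, k=2))
print(pearson_delta(delta_true, delta_pred))
print(direction_match(delta_true, delta_pred))
```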
---
## 7. Comparison to Existing Baselines
| Method | Gene Coverage | Drug Input | UQ | Attribution | HF-Ready |
|--------|--------------|------------|-----|-------------|----------|
| CPA | HVG (1-4K) | One-hot ID | Latent sampling | ❌ | ❌ |
| GEARS | All genes | One-hot ID | ❌ | Gene graph edges | ❌ |
| Lingshu-Cell | All 18K | One-hot ID | ❌ | ❌ | ❌ |
| AdaPert | HVG only | Semantic embed | ❌ | ✅ KG subgraph | ❌ |
| **MuLGIT-Perturb** | **All 45K+** | **SMILES→MolFormer** | **✅ Ensemble** | **✅ IG + KG** | **✅** |
---
## 8. Implementation Plan
### 8.1 New Files
```
mulgit/
├── perturb/
│   ├── __init__.py
│   ├── encoder.py        # PerturbationEncoder (drug + gene)
│   ├── conditioning.py   # FiLMConditioning
│   ├── decoder.py        # DeltaDecoderWithUncertainty
│   ├── model.py          # MuLGITPerturb
│   ├── ensemble.py       # MuLGITPerturbEnsemble
│   ├── attribution.py    # Integrated Gradients + KG subgraph
│   ├── losses.py         # NLL + pathway consistency + sparsity
│   ├── data.py           # Tahoe-100M perturbation data loader
│   ├── trainer.py        # Training loop with Trackio
│   ├── evaluate.py       # DES@K, PDS, Pearson-Δ, etc.
│   └── config.py         # MuLGITPerturbConfig
```
### 8.2 Five-Phase Execution
| Phase | Duration | Deliverable |
|-------|----------|-------------|
| 1. Tahoe-100M drug response | 2 weeks | Trained model + all benchmarks on leave-drug-out |
| 2. Cross-dataset generalization | 1 week | Zero-shot evaluation on LINCS, fine-tune if needed |
| 3. Genetic perturbations | 1 week | DepMap KO prediction, compare to Geneformer |
| 4. Combinatorial perturbation | 1 week | Norman19 combos, PDS evaluation |
| 5. Longevity-specific | 1 week | Aging pathway rescue, drug ranking vs static screener |
**Total: 6 weeks from start to full validation.**
---
## 9. Example Output
```json
{
"perturbation": "Rapamycin",
"cell_line": "A549",
"baseline_risk": 0.342,
"predicted_post_risk": -0.187,
"risk_shift": -0.529,
"top_degs": [
{"gene": "MTOR", "logFC": -0.23, "ci_95": [-0.45, -0.01]},
{"gene": "FOXO3", "logFC": 0.87, "ci_95": [0.52, 1.22]},
{"gene": "ATG5", "logFC": 1.34, "ci_95": [0.98, 1.70]}
],
"enriched_pathways": [
{"pathway": "Autophagy", "NES": 2.1, "pval": 0.0003},
{"pathway": "mTOR signaling", "NES": -1.8, "pval": 0.001}
],
"aging_signature_reversal": 0.72,
"epistemic_uncertainty": 0.08,
"aleatoric_uncertainty": 0.15
}
```
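
A record like this can be sanity-checked before it is reported. The sketch below is a hypothetical validator, not part of the planned codebase: it assumes `risk_shift` is defined as `predicted_post_risk - baseline_risk` (consistent with the risk-shift loss in Section 5.2) and that each `ci_95` interval should contain its `logFC` point estimate. It is run on a subset of the example fields.

```python
import json

record = json.loads("""
{
  "baseline_risk": 0.342,
  "predicted_post_risk": -0.187,
  "risk_shift": -0.529,
  "top_degs": [
    {"gene": "MTOR", "logFC": -0.23, "ci_95": [-0.45, -0.01]},
    {"gene": "FOXO3", "logFC": 0.87, "ci_95": [0.52, 1.22]}
  ]
}
""")

def check_record(rec, tol=1e-9):
    """Return a list of human-readable consistency problems (empty if none)."""
    problems = []
    expected_shift = rec["predicted_post_risk"] - rec["baseline_risk"]
    if abs(expected_shift - rec["risk_shift"]) > tol:
        problems.append("risk_shift != predicted_post_risk - baseline_risk")
    for deg in rec["top_degs"]:
        low, high = deg["ci_95"]
        if not low <= deg["logFC"] <= high:
            problems.append(f"{deg['gene']}: logFC outside ci_95")
    return problems

problems = check_record(record)
print(problems)
```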
---
## 10. Risk Assessment
| Risk | Mitigation |
|------|------------|
| Tahoe-100M streaming slow (95.6M rows) | Local cache after first pass; pre-filter to 10M most informative rows |
| MolFormer fails for some SMILES | Fallback to Morgan fingerprints (RDKit already in deps) |
| MuLGIT backbone overfits to TCGA | Include tissue type as conditioning; evaluate zero-shot on LINCS |
| GPU memory for 45K-gene output | Start with L1000 landmarks (978); scale after validation |
| Combinatorial prediction proves too hard | Treat it as a stretch goal; single-perturbation prediction alone is impactful |
---
## References
1. Lotfollahi et al., "CPA," arXiv:2011.03086 (2021)
2. Roohani et al., "GEARS," Nature Biotech 2023. arXiv:2205.13986
3. Chen et al., "Lingshu-Cell," arXiv:2603.25240 (2025)
4. Li et al., "AdaPert," arXiv:2602.18885 (2025)
5. He et al., "PerturBench," arXiv:2408.10609 (2024)
6. El Nahhas et al., "SeNMo," arXiv:2405.08226 (2024)
7. Ross et al., "MolFormer," Nature Machine Intelligence 2022
8. Theodoris et al., "Geneformer," Nature 2023
9. Perez et al., "FiLM," AAAI 2018
10. Kendall & Gal, "Bayesian UQ," NeurIPS 2017
11. Lakshminarayanan et al., "Deep Ensembles," NeurIPS 2017
12. Mendez-Lucio et al., "Tahoe-100M," bioRxiv 2024