Rhaister

Back to basics: Observed statistics are sufficient to predict drug responses

📄 Read the paper

Rhaister predicts perturbation responses in unseen (cell, perturbation) combinations from aggregated summary statistics. It operates directly on the output of standard single-cell analysis pipelines — log2 fold change, Mann-Whitney U-test p-values, and expression deltas — enabling training in seconds and prediction in milliseconds.

Overview

Large-scale cellular perturbation screens measure how cells respond to drugs, cytokines, and genetic perturbations across many biological contexts. Rhaister learns from observed (context, perturbation) combinations to predict responses for unseen combinations, expressing each unseen perturbation as a linear combination of observed panel perturbations.

Key properties:

  • Operates on summary statistics (not single cells), compatible with agentic reasoning workflows
  • Trains in seconds, predicts in milliseconds
  • Outperforms virtual cell models (STATE) that are orders of magnitude more expensive
  • Performance approaches the half-sample reference ceiling (how well the data can predict itself)
  • Supports both fewshot (unseen combinations) and zeroshot (unseen cell lines) prediction
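
The half-sample reference ceiling mentioned in the list above can be illustrated with a small simulation. This is only a sketch under assumptions: the function names, the pseudocount, the simulated counts, and the use of Pearson correlation on log2 fold change are all hypothetical choices made here, not the repository's definition.

```python
import numpy as np

def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of mean expression; inputs are (cells x genes) arrays."""
    return np.log2((treated.mean(axis=0) + pseudocount) /
                   (control.mean(axis=0) + pseudocount))

def half_sample_correlation(treated, control, rng):
    """Correlate LFC estimates from two disjoint random halves of the cells."""
    t_a, t_b = np.array_split(rng.permutation(treated.shape[0]), 2)
    c_a, c_b = np.array_split(rng.permutation(control.shape[0]), 2)
    lfc_a = log2_fold_change(treated[t_a], control[c_a])
    lfc_b = log2_fold_change(treated[t_b], control[c_b])
    return np.corrcoef(lfc_a, lfc_b)[0, 1]

# Simulated counts (assumption): Poisson noise around a per-gene baseline.
rng = np.random.default_rng(0)
n_genes = 300
base = rng.gamma(shape=2.0, scale=5.0, size=n_genes)
effect = rng.normal(scale=0.5, size=n_genes)          # true log2 effect per gene
control = rng.poisson(base, size=(400, n_genes))
treated = rng.poisson(base * 2.0 ** effect, size=(400, n_genes))

ceiling = half_sample_correlation(treated, control, rng)
print(f"half-sample Pearson: {ceiling:.3f}")
```

The agreement between the two halves gives a rough upper bound on how well any model could be expected to match the same statistic, since it reflects sampling noise alone.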

Method

Fewshot model

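For each unseen (context c*, perturbation p*) combination, Rhaister combines three components: an additive baseline fit by alternating least squares (ALS), a ridge regression that expresses the unseen drug as a linear combination of panel drugs, and a calibration network for p-values: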

  1. Additive ALS: y = μ + α_c + β_p — global mean + context effect + perturbation effect
  2. Drug ridge: expresses each unseen perturbation as a weighted combination of panel perturbations, learned per held-out context via ridge regression
  3. Calibration network: a small MLP that maps (|FC|, -log₁₀p, log reference expression, gene-level statistics) to calibrated p-values
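
The additive step (1) can be sketched as block-coordinate least squares on a partially observed context × perturbation matrix. This is a minimal illustration, not the code in train.py: `fit_additive_als`, its arguments, and the toy data are assumptions, and it handles a single scalar response (for example one gene's log2 fold change) rather than the full multi-gene, multi-view setting.

```python
import numpy as np

def fit_additive_als(Y, mask, n_iters=50):
    """Fit Y[c, p] ~ mu + alpha[c] + beta[p] using observed entries only.

    Y    : (n_contexts, n_perts) array of a scalar response
    mask : boolean array, True where the (context, perturbation) pair is observed
    """
    mu = Y[mask].mean()
    alpha = np.zeros(Y.shape[0])
    beta = np.zeros(Y.shape[1])
    row_counts = np.maximum(mask.sum(axis=1), 1)
    col_counts = np.maximum(mask.sum(axis=0), 1)
    for _ in range(n_iters):
        # Least-squares update of context effects, perturbation effects fixed.
        resid = np.where(mask, Y - mu - beta[None, :], 0.0)
        alpha = resid.sum(axis=1) / row_counts
        # Least-squares update of perturbation effects, context effects fixed.
        resid = np.where(mask, Y - mu - alpha[:, None], 0.0)
        beta = resid.sum(axis=0) / col_counts
    return mu, alpha, beta

# Toy data (assumption): exactly additive responses with one held-out pair.
rng = np.random.default_rng(0)
true_alpha = rng.normal(size=5)
true_beta = rng.normal(size=6)
Y = 0.3 + true_alpha[:, None] + true_beta[None, :]
mask = np.ones_like(Y, dtype=bool)
mask[2, 4] = False  # unseen (context, perturbation) combination

mu, alpha, beta = fit_additive_als(Y, mask)
pred = mu + alpha[2] + beta[4]
print(pred, Y[2, 4])
```

Because the toy data are exactly additive, the held-out entry is recovered; on real data the additive fit only serves as a baseline that the later steps refine.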

Three views are predicted jointly: log2 fold change (Y), Mann-Whitney p-values (P), and expression deltas (D). Final p-values are an equal-weight blend of the ridge and calibration network predictions.
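
The drug ridge step can be sketched for a single view (for example log2 fold change) with closed-form ridge regression. The layout below is an assumption made for illustration: rows are (training context, gene) pairs, columns are panel perturbations, and `fit_ridge_weights`, `lam`, and the simulated arrays are hypothetical names and values, not the repository's API or settings.

```python
import numpy as np

def fit_ridge_weights(X, y, lam=1.0):
    """Closed-form ridge solution: w = (X^T X + lam * I)^-1 X^T y."""
    n_panel = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_panel), X.T @ y)

rng = np.random.default_rng(0)
n_train_contexts, n_genes, n_panel = 8, 200, 5

# Simulated panel responses: (contexts, genes, panel perturbations).
panel_train = rng.normal(size=(n_train_contexts, n_genes, n_panel))
true_w = np.array([0.6, 0.0, 0.3, 0.0, 0.1])
noise = 0.01 * rng.normal(size=(n_train_contexts, n_genes))
target_train = panel_train @ true_w + noise   # response to the "unseen" drug

# Rows are (context, gene) pairs; columns are panel perturbations.
X = panel_train.reshape(-1, n_panel)
y = target_train.reshape(-1)
w = fit_ridge_weights(X, y, lam=1.0)

# Held-out context: only the panel was measured there.
panel_star = rng.normal(size=(n_genes, n_panel))
pred_star = panel_star @ w                    # predicted response per gene
print(np.round(w, 3))
```

The learned weights say how much each panel perturbation contributes to the unseen one; the ridge penalty keeps them stable when panel responses are correlated.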

Zeroshot model (Rhaister-O)

For entirely unseen cell lines with no perturbation panel, Rhaister-O uses only baseline (DMSO) expression to predict responses:

y(c*, p, g) = β(p,g) + γ(p,g) · x(c*,g)

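where x(c*,g) is the baseline pseudobulk expression of gene g in the unseen cell line c*, and β(p,g) and γ(p,g) are a per-perturbation, per-gene intercept and slope. See docs/zeroshot_architecture.md for the full model description.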
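
One way to read the equation above is as an independent simple linear regression for every (perturbation, gene) pair, fit across training cell lines. The sketch below assumes plain ordinary least squares with no regularization; the function names and array layout are hypothetical, and the model described in docs/zeroshot_architecture.md may differ.

```python
import numpy as np

def fit_zeroshot(x_train, y_train):
    """Per-(perturbation, gene) least-squares fit across cell lines.

    x_train : (n_cells, n_genes)           baseline pseudobulk expression
    y_train : (n_cells, n_perts, n_genes)  observed responses
    Returns beta, gamma, each of shape (n_perts, n_genes).
    """
    x_mean = x_train.mean(axis=0)                    # (n_genes,)
    y_mean = y_train.mean(axis=0)                    # (n_perts, n_genes)
    x_c = x_train - x_mean                           # centered baseline
    var_x = (x_c ** 2).sum(axis=0)                   # (n_genes,)
    cov_xy = np.einsum("cg,cpg->pg", x_c, y_train - y_mean)
    gamma = cov_xy / np.maximum(var_x, 1e-12)        # slope
    beta = y_mean - gamma * x_mean                   # intercept
    return beta, gamma

def predict_zeroshot(beta, gamma, x_new):
    """y(c*, p, g) = beta(p, g) + gamma(p, g) * x(c*, g)."""
    return beta + gamma * x_new[None, :]

# Toy data (assumption): responses that are exactly linear in baseline expression.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(10, 4))
true_beta = rng.normal(size=(3, 4))
true_gamma = rng.normal(size=(3, 4))
y_train = true_beta[None] + true_gamma[None] * x_train[:, None, :]

beta, gamma = fit_zeroshot(x_train, y_train)
x_new = rng.normal(size=4)                           # unseen cell line baseline
pred = predict_zeroshot(beta, gamma, x_new)
print(pred.shape)
```

Only the baseline profile of the new cell line is needed at prediction time, which is what makes the setting zeroshot.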

Datasets

| Dataset | Contexts | Perturbations | Modality |
|---|---|---|---|
| Tahoe-100M | 50 cell lines | 384 drugs × 3 doses | Small molecule drug screen |
| Parse PBMC | 12 donors × 18 cell types | 90 cytokines | Cytokine screen |
| Replogle-Nadig | 4 cell lines | ~2023 gene knockdowns | CRISPR Perturb-seq |
| Emerald Bay | 52 cell lines | 27 drugs × multiple doses | Growth rate (scalar) |
| PRISM | 480 cell lines | 1448 drugs × 8 doses | Growth rate (scalar) |

Results

Transcriptional prediction (fewshot)

| Dataset | Splits | Pearson Δ | PR-AUC | Spearman LFC | DE overlap |
|---|---|---|---|---|---|
| Tahoe-100M | 5 | 0.87 | 0.73 | 0.81 | 0.59 |
| Parse PBMC | 1 | 0.46 | 0.40 | 0.59 | 0.28 |
| Replogle-Nadig | 4 | 0.41 | 0.08 | 0.63 | 0.14 |

Zeroshot prediction (Rhaister-O)

| Dataset | Splits | Pearson Δ | PR-AUC | Spearman LFC | DE overlap |
|---|---|---|---|---|---|
| Tahoe-100M | 5 | 0.63 | 0.53 | 0.49 | 0.34 |

Sensitivity prediction (R²)

Dataset Splits Rhaister Rhaister + features
Emerald Bay 5 0.26 0.31
PRISM 5 0.87

All numbers are reproducible from the code in this repository. See CLAUDE.md for full reproduction instructions.

Installation

Download the repository from the HuggingFace Hub, then install it. First, in Python:

from huggingface_hub import snapshot_download

# Download the full repo into ./Rhaister
snapshot_download("tahoebio/Rhaister", local_dir="Rhaister")

Then, in a shell:

cd Rhaister
uv pip install -e ".[dev]"

Alternative: install with git clone

git clone https://huggingface.co/tahoebio/Rhaister
cd Rhaister
uv pip install -e ".[dev]"

Data is loaded automatically from HuggingFace the first time a dataset is used, and cached locally thereafter. Set RHAISTER_DATA_ROOT to point to a local data directory to use existing data instead.

Usage

# Fewshot model (tahoe_5_holdout by default)
python -m rhaister.train [experiment_name]

# Other datasets
python -m rhaister.train parse_test --split parse/split_0
python -m rhaister.train replogle_test --split replogle_nadig/split_0

# Zeroshot model
HP_ZEROSHOT=1 python -m rhaister.train zs_test

# Multi-split evaluation
python -m rhaister.eval_splits [experiment_name]

# Unit tests
python -m pytest tests/ -v

Repository structure

rhaister/                  # Python package
├── train.py               # Model: ALS + drug ridge + calnet (fewshot & zeroshot)
├── prepare_combined.py    # Data loading, evaluation (6 State metrics), logging
├── state_metrics.py       # Six metric functions from the State paper
└── eval_splits.py         # Multi-split evaluation harness

splits/                    # Dataset configs (dataset.toml) and train/test split TOMLs
tests/                     # Unit tests (57 tests)
scripts/                   # Helper scripts (baselines, sweeps, data prep)
figures/                   # Paper figure generation
docs/                      # Architecture docs, zeroshot model description

Citation

If you use Rhaister in your work, please cite:

@article{rhaister2026,
  title={Back to basics: Observed statistics are sufficient to predict drug responses},
  year={2026},
  publisher={Tahoe Bio},
  url={https://tahoebio-assets.com/rhaister-manuscript.pdf}
}

License

Apache 2.0. See LICENSE.
