CellFlow trained on Norman 2019 (ESM2 3B variant)
Produced as part of the sc-interp single-cell model comparison repo.
Provenance
- Source code commit: `fdc2ae0`
- Runner: `scripts/run_cellflow.py`
- Dataset manifest: `data/norman/manifest.yaml`
Base model
Trained from scratch. CellFlow is a flow-matching-based perturbation prediction framework and does not ship a foundation checkpoint. Perturbation conditions are encoded as ESM2 embeddings of the perturbed gene(s), computed with `facebook/esm2_t36_3B_UR50D` (3B parameters, 2560-dim per-gene embeddings), which matches the default in the CellFlow reproducibility repo.
Training
- Architecture and training hyperparameters match the cellflow_reproducibility repo's `suppl_fig/norman/downstream_analysis/cellflow/configs` verbatim:
  - `condition_embedding_dim=1024`, `hidden_dims=(4096,4096,4096)`, `decoder_dims=(4096,4096,4096)`, `decoder_dropout=0.2`
  - `time_encoder_dims=(2048,2048,2048)`, `time_freqs=1024`, `cond_output_dropout=0.9`
  - `layers_before_pool.target_gene = mlp[1024,1024] dropout 0.5`, `layers_after_pool = mlp[1024,1024] dropout 0.2`
  - `match_fn = match_linear(epsilon=0.1, scale_cost='mean', tau_a=1.0, tau_b=1.0)`
  - `optimizer = optax.MultiSteps(optax.adam(5e-5), 20)`
  - `probability_path = {'constant_noise': 1.0}`
  - `pooling = 'attention_token'`
- Sample representation: 50-dim PCA (`sample_rep='X_pca'`), fit on the train split cells and projected onto val and test.
- Perturbation encoding: ESM2 3B embeddings per gene symbol, stored in `adata.uns['esm2']` and referenced via `perturbation_covariate_reps={'target_gene': 'esm2'}`.
- Split: GEARS simulation split with seed 42, not biolord (the CellFlow paper uses biolord). Deliberate divergence for internal consistency with our scGPT and scLDM runners.
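The "fit on train, project onto val and test" step for the sample representation can be sketched as follows. This is a minimal NumPy illustration, not the code in `scripts/run_cellflow.py` (which may well use a library PCA instead); the function names `fit_pca` and `project` and the toy matrices are hypothetical.

```python
import numpy as np

def fit_pca(train_x: np.ndarray, n_components: int = 50):
    """Fit PCA on training cells only; return the training mean and principal axes."""
    mean = train_x.mean(axis=0)
    # SVD of the centered training matrix; rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(train_x - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(x: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project cells using the training mean and axes, without refitting."""
    return (x - mean) @ components.T

rng = np.random.default_rng(0)
# Toy stand-ins for normalized expression matrices (cells x genes).
train_x = rng.normal(size=(300, 200))
val_x = rng.normal(size=(40, 200))
test_x = rng.normal(size=(60, 200))

mean, components = fit_pca(train_x, n_components=50)
train_pca = project(train_x, mean, components)
val_pca = project(val_x, mean, components)    # reuses train statistics
test_pca = project(test_x, mean, components)  # reuses train statistics
```

Centering val and test cells with the training mean, rather than their own, is what keeps held-out perturbations from influencing the representation the model is trained in.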
Budget and stopping
| setting | value |
|---|---|
| iterations | 200,000 |
| batch size | 1024 |
| valid_freq | 400,000 (larger than budget = no mid-training eval) |
| wall clock | 0.7 hours (H100 PCIe) |
| sample_rep | X_pca (50 dims) |
| esm model | esm2_t36_3B_UR50D |
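Because the optimizer is wrapped in `optax.MultiSteps` with an accumulation count of 20, the budget above implies fewer parameter updates than iterations. The arithmetic below is a rough sketch that assumes each reported iteration is one accumulation micro-step; if the runner counts iterations differently, the numbers change accordingly.

```python
# Values taken from the budget table and the hyperparameter list in this card.
num_iterations = 200_000
batch_size = 1024
accumulation_steps = 20  # second argument to optax.MultiSteps

# Assumption: each training iteration is one MultiSteps micro-step, so
# parameters are updated once every `accumulation_steps` iterations.
optimizer_updates = num_iterations // accumulation_steps
cells_per_update = batch_size * accumulation_steps
cell_draws_total = num_iterations * batch_size  # draws, not unique cells

print(optimizer_updates)  # 10000
print(cells_per_update)   # 20480
print(cell_draws_total)   # 204800000
```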
Test set metrics (cell-eval)
| metric | mean | median | max |
|---|---|---|---|
| pearson_delta | 0.6061 | 0.7359 | 0.9654 |
| discrimination_score_l1 | 0.7609 | 0.8687 | 1.0000 |
| discrimination_score_l2 | 0.7736 | 0.8889 | 1.0000 |
| discrimination_score_cosine | 0.7484 | 0.9091 | 1.0000 |
| pearson_edistance | 0.6883 | 0.6883 | 0.6883 |
| clustering_agreement | 0.4352 | 0.4352 | 0.4352 |
| overlap_at_N | 0.0266 | 0.0245 | 0.1076 |
| precision_at_N | 0.0939 | 0.0981 | 0.2302 |
| mse | 0.0028 | 0.0018 | 0.0132 |
| mae | 0.0146 | 0.0127 | 0.0341 |
The CellFlow paper reports Norman results in terms of R² and energy distance. Our numbers use cell-eval's metric set on the GEARS simulation split, so they are not directly comparable to the paper's Figure 4N. They do reproduce the paper's headline claim (CellFlow > scGPT on Norman) across every distributional metric.

A sibling variant using ESM2 8M instead of 3B is available at matthewshu/cellflow-norman. Relative to it, the 3B variant improves pearson_delta (+0.04 mean, +0.055 median) and clustering_agreement (+0.11), while the DE gene metrics (overlap_at_N, precision_at_N) are unchanged. This suggests that larger protein language models help CellFlow's condition encoder learn broader cell-state structure but do not help it identify specific regulatory genes.
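To make the pearson_delta numbers easier to interpret, here is a small pure-Python sketch. It assumes the metric is the Pearson correlation between predicted and observed mean expression change relative to control, computed per perturbation; it is not cell-eval's implementation, and the expression values are made up for illustration.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pearson_delta(pred_mean, true_mean, control_mean):
    """Correlate predicted vs. observed shift from control (assumed definition)."""
    pred_delta = [p - c for p, c in zip(pred_mean, control_mean)]
    true_delta = [t - c for t, c in zip(true_mean, control_mean)]
    return pearson(pred_delta, true_delta)

# Toy per-gene means for one perturbation (hypothetical values).
control = [1.0, 2.0, 0.5, 3.0]
true_pert = [1.5, 1.0, 0.5, 3.2]
pred_pert = [1.25, 1.5, 0.5, 3.1]  # right direction, half the true effect size

score = pearson_delta(pred_pert, true_pert, control)
```

In this toy case the prediction recovers only half of the true effect size yet still scores approximately 1, because correlation ignores scale. Under the assumed definition, pearson_delta rewards getting the pattern of up- and down-regulation right, which is why magnitude-sensitive metrics such as mse and mae are reported alongside it.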
Known limitations
- Uses GEARS simulation split instead of biolord's 5 random splits. Our test perturbations are a different subset of Norman than the paper's.
- Training uses `valid_freq > num_iterations` so there is no mid-training val evaluation. Convergence was not verified via a val curve; future runs should use a smaller valid_freq to plot the learning curve.
- DE gene identification metrics (`overlap_at_N`, `precision_at_N`) did not improve from the 8M ESM variant to this 3B variant, suggesting that the DE gene bottleneck is architectural/data, not gene-embedding quality.
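The effect of `valid_freq` on the learning curve can be illustrated with a short sketch. It assumes validation fires whenever the iteration count is a multiple of `valid_freq`; the helper `validation_iterations` and the proposed value of 10,000 are hypothetical, not settings from the runner.

```python
def validation_iterations(num_iterations: int, valid_freq: int) -> list[int]:
    """Iterations at which validation would run, assuming it fires every `valid_freq` steps."""
    return list(range(valid_freq, num_iterations + 1, valid_freq))

# Setting used for this run: valid_freq exceeds the budget, so nothing is scheduled.
current = validation_iterations(200_000, 400_000)

# Hypothetical alternative for a future run: 20 points on the val curve.
proposed = validation_iterations(200_000, 10_000)

print(len(current))   # 0
print(len(proposed))  # 20
```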
Files
- `CellFlow.pkl`: Trained CellFlow model, pickled via `cf.save()`. Load via `cellflow.model.CellFlow.load(path)`.
- `training_stats.json`: iterations, wall clock, wandb run URL.
Usage
```python
from huggingface_hub import hf_hub_download
from cellflow.model import CellFlow

path = hf_hub_download(
    repo_id="matthewshu/cellflow-norman-esm3b",
    filename="CellFlow.pkl",
)
cf = CellFlow.load(path)
# Then use sc-interp's run_cellflow.py with --esm-model esm2_t36_3B_UR50D
```
Citation
Dataset: Norman et al. 2019 (Science). Model: Klein, Fleck, Becker et al. 2025 bioRxiv (CellFlow). ESM2: Lin et al. 2023 Science. See the respective repos for proper BibTeX entries.