| --- |
| library_name: pytorch |
| pipeline_tag: other |
| tags: |
| - single-cell |
| - perturbation-prediction |
| - cellflow |
| - flow-matching |
| - sc-interp |
| - norman |
| datasets: |
| - norman-2019 |
| --- |
| # CellFlow trained on Norman 2019 |
|
|
| Produced as part of the [sc-interp](https://github.com/mattshu0410/sc-interp) single-cell model comparison repo. |
|
|
| ## Provenance |
|
|
| - Source code commit: [`fdc2ae0`](https://github.com/mattshu0410/sc-interp/tree/fdc2ae0413aa04efdf3d392a5a7fddfcec06b7b9) |
| - Runner: [`scripts/run_cellflow.py`](https://github.com/mattshu0410/sc-interp/blob/fdc2ae0413aa04efdf3d392a5a7fddfcec06b7b9/scripts/run_cellflow.py) |
| - Dataset manifest: [`data/norman/manifest.yaml`](https://github.com/mattshu0410/sc-interp/blob/fdc2ae0413aa04efdf3d392a5a7fddfcec06b7b9/data/norman/manifest.yaml) |
|
|
| ## Base model |
|
|
| Trained from scratch. CellFlow is a flow-matching-based perturbation prediction framework and does not ship a foundation checkpoint. Perturbation conditions are encoded as ESM2 embeddings of the perturbed gene(s) ([facebook/esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D)). |
|
|
| ## Training |
|
|
| - Architecture and training hyperparameters match the [cellflow_reproducibility](https://github.com/theislab/cellflow_reproducibility) repo's `suppl_fig/norman/downstream_analysis/cellflow/` configs verbatim: |
|   - `condition_embedding_dim=1024`, `hidden_dims=(4096,4096,4096)`, `decoder_dims=(4096,4096,4096)`, `decoder_dropout=0.2` |
|   - `time_encoder_dims=(2048,2048,2048)`, `time_freqs=1024`, `cond_output_dropout=0.9` |
|   - `layers_before_pool.target_gene = mlp[1024,1024] dropout 0.5`, `layers_after_pool = mlp[1024,1024] dropout 0.2` |
|   - `match_fn = match_linear(epsilon=0.1, scale_cost='mean', tau_a=1.0, tau_b=1.0)` |
|   - `optimizer = optax.MultiSteps(optax.adam(5e-5), 20)` |
|   - `probability_path = {'constant_noise': 1.0}` |
|   - `pooling = 'attention_token'` |
| - Sample representation: 50-dim PCA (`sample_rep='X_pca'`), fit on train-split cells only; val and test cells are projected onto the same components. |
| - Perturbation encoding: ESM2 embeddings per gene symbol, stored in `adata.uns['esm2']` and referenced via `perturbation_covariate_reps={'target_gene': 'esm2'}`. |
| - Split: **GEARS simulation split with seed 42**, not biolord (the CellFlow paper uses biolord). This is a deliberate divergence so our three-way comparison with scGPT and scLDM uses a single split definition. |
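The "fit on train, project onto val and test" step for the sample representation can be illustrated with a minimal NumPy sketch. This is not the code in `scripts/run_cellflow.py` (which may use a different PCA implementation); the function names `fit_pca` and `project` and the random stand-in data are assumptions made for illustration only.

```python
import numpy as np


def fit_pca(x_train, n_components=50):
    """Fit PCA on training cells only; return the train mean and principal axes."""
    mean = x_train.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(x, mean, components):
    """Project cells with the train-split mean and axes (no refitting)."""
    return (x - mean) @ components.T


# Hypothetical stand-in data: 80 "genes", random values instead of real expression.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 80))  # train-split cells
x_test = rng.normal(size=(30, 80))    # held-out cells

mean, components = fit_pca(x_train, n_components=50)
z_train = project(x_train, mean, components)
z_test = project(x_test, mean, components)
```

Reusing the train-split mean and axes for held-out cells keeps test information out of the representation the model is trained on.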
|
|
| ### Budget and stopping |
|
|
| setting | value |
|---|---|
| | iterations | 200,000 | |
| | batch size | 1024 | |
| valid_freq | 400,000 (exceeds the iteration budget, so no mid-training evaluation runs) |
| | wall clock | 0.7 hours (H100 PCIe) | |
| | sample_rep | X_pca (50 dims) | |
| |
| |
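As a rough sanity check on the budget, the arithmetic below combines the table values with the `optax.MultiSteps(..., 20)` wrapper from the Training section. It assumes that "iterations" counts individual mini-batches (so parameters are updated once every 20 iterations) and that validation would trigger at multiples of `valid_freq`; neither assumption is stated explicitly in this card.

```python
# Values taken from this card.
num_iterations = 200_000
batch_size = 1024
accumulate_every = 20  # second argument of optax.MultiSteps
valid_freq = 400_000

# Assumption: one parameter update per `accumulate_every` mini-batches.
optimizer_updates = num_iterations // accumulate_every
samples_per_update = batch_size * accumulate_every

# Assumption: validation fires at multiples of valid_freq.
validation_runs = num_iterations // valid_freq

print(optimizer_updates, samples_per_update, validation_runs)
```

Under these assumptions the run performs 10,000 parameter updates, each averaged over 20,480 samples, and never reaches a validation step.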
| ## Test set metrics (cell-eval) |
| |
| | metric | mean | median | max | |
| |---|---|---|---| |
| | pearson_delta | 0.5630 | 0.6814 | 0.9651 | |
| | discrimination_score_l1 | 0.7270 | 0.8182 | 1.0000 | |
| | discrimination_score_l2 | 0.7452 | 0.8586 | 1.0000 | |
| | discrimination_score_cosine | 0.7413 | 0.8788 | 1.0000 | |
| | pearson_edistance | 0.6707 | 0.6707 | 0.6707 | |
| | clustering_agreement | 0.3252 | 0.3252 | 0.3252 | |
| | overlap_at_N | 0.0264 | 0.0242 | 0.1008 | |
| | precision_at_N | 0.0936 | 0.0977 | 0.2267 | |
| | mse | 0.0032 | 0.0022 | 0.0132 | |
| | mae | 0.0156 | 0.0142 | 0.0350 | |
|
|
| The CellFlow paper reports Norman results as R² in gene space and energy distance in 10-dim PCA space (Figure 4N, Methods section 3.5). Our numbers use cell-eval's standard metric set on the GEARS simulation split, so they are not directly comparable to Figure 4N. They are nonetheless consistent with the paper's headline claim (CellFlow > scGPT on Norman): on our matched evaluation, CellFlow outperforms scGPT on `pearson_delta`, all `discrimination_score` variants, `pearson_edistance`, `clustering_agreement`, `mse`, and `mae`. The two models are tied on DE gene overlap and precision (`overlap_at_N`, `precision_at_N`), consistent with the broader observation that current perturbation models capture broad transcriptional programs better than specific regulatory effects. |
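For readers unfamiliar with `pearson_delta`, the sketch below shows one common reading of the metric: the Pearson correlation, for a single perturbation, between the predicted and observed shift in mean expression relative to control. This is an illustrative reimplementation under that assumption, not cell-eval's code, and the synthetic data is made up for the example.

```python
import numpy as np


def pearson_delta(pred_cells, true_cells, control_cells):
    """Pearson correlation between predicted and observed mean expression shifts."""
    control_mean = control_cells.mean(axis=0)
    pred_delta = pred_cells.mean(axis=0) - control_mean
    true_delta = true_cells.mean(axis=0) - control_mean
    return float(np.corrcoef(pred_delta, true_delta)[0, 1])


# Synthetic example: a known per-gene shift applied to control-like cells.
rng = np.random.default_rng(0)
n_genes = 100
control = rng.normal(size=(500, n_genes))
effect = rng.normal(size=n_genes)

true_cells = control[:250] + effect
good_pred = control[250:] + effect + 0.1 * rng.normal(size=(250, n_genes))
null_pred = control[250:]  # a model that predicts no change from control

score_good = pearson_delta(good_pred, true_cells, control)
score_null = pearson_delta(null_pred, true_cells, control)
```

Because the metric works on deltas from control, a model that simply returns control-like cells scores near zero even though its raw expression values are close to the truth. The table reports the mean, median, and max of such per-perturbation scores.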
|
|
| ## Known limitations |
|
|
| - Uses ESM2 `esm2_t6_8M_UR50D` (8M parameters) instead of the paper's `esm2_t36_3B_UR50D` (3B parameters). This speeds up research iteration, but gene embedding quality may be lower than in the paper. |
| - Uses the GEARS simulation split instead of biolord's 5 random splits, so our test perturbations are a different subset of Norman than the paper's. |
| - Training uses `valid_freq > num_iterations`, so there is no mid-training validation. Convergence was not verified with a validation curve; future runs should set a smaller `valid_freq` to plot the learning curve. |
| |
| ## Files |
| |
| - `CellFlow.pkl` — Trained CellFlow model, pickled via `cf.save()`. Load via `cellflow.model.CellFlow.load(path)`. |
| - `training_stats.json` — iterations, wall clock, wandb run URL. |
|
|
| ## Usage |
|
|
| ```python |
| from huggingface_hub import hf_hub_download |
| from cellflow.model import CellFlow |
| |
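| # Download the pickled model from the Hub (cached locally after the first call). |
| path = hf_hub_download( |
|     repo_id="matthewshu/cellflow-norman", |
|     filename="CellFlow.pkl", |
| ) |
| # Restore the trained CellFlow object. |
| cf = CellFlow.load(path) |
| # For prediction, run sc-interp's scripts/run_cellflow.py with --hf-repo matthewshu/cellflow-norman |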
| ``` |
|
|
| ## Citation |
|
|
| Dataset: Norman et al. 2019 (Science). Model: Klein, Fleck, Becker et al. 2025 bioRxiv (CellFlow). See the CellFlow repo and the Norman 2019 paper for proper BibTeX entries. |