	---
	library_name: pytorch
	pipeline_tag: other
	tags:
	- single-cell
	- perturbation-prediction
	- cellflow
	- flow-matching
	- sc-interp
	- norman
	datasets:
	- norman-2019
	---
	# CellFlow trained on Norman 2019

	Produced as part of the [sc-interp](https://github.com/mattshu0410/sc-interp) single-cell model comparison repo.

	## Provenance

	- Source code commit: [`fdc2ae0`](https://github.com/mattshu0410/sc-interp/tree/fdc2ae0413aa04efdf3d392a5a7fddfcec06b7b9)
	- Runner: [`scripts/run_cellflow.py`](https://github.com/mattshu0410/sc-interp/blob/fdc2ae0413aa04efdf3d392a5a7fddfcec06b7b9/scripts/run_cellflow.py)
	- Dataset manifest: [`data/norman/manifest.yaml`](https://github.com/mattshu0410/sc-interp/blob/fdc2ae0413aa04efdf3d392a5a7fddfcec06b7b9/data/norman/manifest.yaml)

	## Base model

	Trained from scratch. CellFlow is a flow-matching-based perturbation prediction framework and does not ship a foundation checkpoint. Perturbation conditions are encoded via ESM2 embeddings of the perturbed gene(s) ([facebook/esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D)).

	## Training

	- Architecture and training hyperparameters match the [cellflow_reproducibility](https://github.com/theislab/cellflow_reproducibility) repo's `suppl_fig/norman/downstream_analysis/cellflow/` configs verbatim:
	  - `condition_embedding_dim=1024`, `hidden_dims=(4096,4096,4096)`, `decoder_dims=(4096,4096,4096)`, `decoder_dropout=0.2`
	  - `time_encoder_dims=(2048,2048,2048)`, `time_freqs=1024`, `cond_output_dropout=0.9`
	  - `layers_before_pool.target_gene = mlp[1024,1024] dropout 0.5`, `layers_after_pool = mlp[1024,1024] dropout 0.2`
	  - `match_fn = match_linear(epsilon=0.1, scale_cost='mean', tau_a=1.0, tau_b=1.0)`
	  - `optimizer = optax.MultiSteps(optax.adam(5e-5), 20)`
	  - `probability_path = {'constant_noise': 1.0}`
	  - `pooling = 'attention_token'`
	- Sample representation: 50-dim PCA (`sample_rep='X_pca'`), fit on the train split cells and projected onto val and test.
	- Perturbation encoding: ESM2 embeddings per gene symbol, stored in `adata.uns['esm2']` and referenced via `perturbation_covariate_reps={'target_gene': 'esm2'}`.
	- Split: GEARS simulation split with seed 42, not biolord (the CellFlow paper uses biolord). This is a deliberate divergence so our three-way comparison with scGPT and scLDM uses a single split definition.
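
The sample representation step above (PCA fit on the train split only, then applied to held-out cells) can be sketched with NumPy. This is an illustrative sketch, not the code in `scripts/run_cellflow.py`: the helper names `fit_pca` and `project` are hypothetical, and the random matrices stand in for real expression data.

```python
import numpy as np


def fit_pca(x_train: np.ndarray, n_components: int):
    """Fit PCA on training cells only; return the train mean and principal axes."""
    mean = x_train.mean(axis=0)
    # SVD of the centered matrix; rows of vt are orthonormal principal axes.
    _, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(x: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project cells using statistics learned from the train split."""
    return (x - mean) @ components.T


rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 100))  # toy data: 200 cells x 100 genes
x_test = rng.normal(size=(40, 100))    # toy held-out cells

mean, components = fit_pca(x_train, n_components=50)
z_train = project(x_train, mean, components)
z_test = project(x_test, mean, components)  # reuses train statistics, no refit
```

Centering held-out cells with the train mean, rather than their own, keeps test information out of the representation the model is trained on.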

	### Budget and stopping

	| setting | value |
	|---|---|
	| iterations | 200,000 |
	| batch size | 1024 |
	| valid_freq | 400,000 (larger than budget = no mid-training eval) |
	| wall clock | 0.7 hours (H100 PCIe) |
	| sample_rep | X_pca (50 dims) |
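
As a rough reading of these settings together with the `optax.MultiSteps(optax.adam(5e-5), 20)` optimizer listed under Training, the arithmetic below derives how many parameter updates and validation passes the budget implies. It assumes that each counted iteration feeds one mini-batch gradient to the optimizer and that validation fires when the iteration count reaches a multiple of `valid_freq`; neither assumption is confirmed by the card, and `training_budget` is a hypothetical helper.

```python
def training_budget(
    num_iterations: int, batch_size: int, accumulate_every: int, valid_freq: int
) -> dict:
    """Derive update and evaluation counts from the budget settings."""
    return {
        # Gradient accumulation: one parameter update per `accumulate_every` iterations.
        "optimizer_updates": num_iterations // accumulate_every,
        # Number of samples contributing to each parameter update.
        "effective_batch_size": batch_size * accumulate_every,
        # Assumed schedule: validate at every multiple of valid_freq.
        "validation_runs": num_iterations // valid_freq,
    }


budget = training_budget(
    num_iterations=200_000, batch_size=1024, accumulate_every=20, valid_freq=400_000
)
print(budget)
```

Under these assumptions the run performs 10,000 Adam updates at an effective batch size of 20,480 and never triggers validation.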


	## Test set metrics (cell-eval)

	| metric | mean | median | max |
	|---|---|---|---|
	| pearson_delta | 0.5630 | 0.6814 | 0.9651 |
	| discrimination_score_l1 | 0.7270 | 0.8182 | 1.0000 |
	| discrimination_score_l2 | 0.7452 | 0.8586 | 1.0000 |
	| discrimination_score_cosine | 0.7413 | 0.8788 | 1.0000 |
	| pearson_edistance | 0.6707 | 0.6707 | 0.6707 |
	| clustering_agreement | 0.3252 | 0.3252 | 0.3252 |
	| overlap_at_N | 0.0264 | 0.0242 | 0.1008 |
	| precision_at_N | 0.0936 | 0.0977 | 0.2267 |
	| mse | 0.0032 | 0.0022 | 0.0132 |
	| mae | 0.0156 | 0.0142 | 0.0350 |

	The CellFlow paper reports Norman results as R² in gene space and energy distance in 10-dim PCA space (Figure 4N, Methods section 3.5). Our numbers use cell-eval's standard metric set on the GEARS simulation split, so they are not directly comparable to Figure 4N. They do agree with the paper's headline claim (CellFlow > scGPT on Norman): on our matched evaluation, CellFlow outperforms scGPT on `pearson_delta`, all `discrimination_score` variants, `pearson_edistance`, `clustering_agreement`, `mse`, and `mae`. The two models are tied on DE gene overlap and precision (`overlap_at_N`, `precision_at_N`), consistent with the broader observation that current perturbation models capture broad transcriptional programs better than specific regulatory effects.
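
For readers unfamiliar with `pearson_delta`, the sketch below shows one common way to compute it for a single perturbation: correlate the predicted and observed mean expression shifts relative to control. This is an assumption about the metric's general shape, not cell-eval's implementation, which may aggregate or normalize differently; the function and the toy data are hypothetical.

```python
import numpy as np


def pearson_delta(pred: np.ndarray, obs: np.ndarray, ctrl: np.ndarray) -> float:
    """Pearson correlation between predicted and observed mean shifts from control.

    Each argument is a cells x genes matrix (predicted, observed, control cells).
    """
    ctrl_mean = ctrl.mean(axis=0)
    pred_delta = pred.mean(axis=0) - ctrl_mean
    obs_delta = obs.mean(axis=0) - ctrl_mean
    return float(np.corrcoef(pred_delta, obs_delta)[0, 1])


rng = np.random.default_rng(0)
ctrl = rng.normal(size=(300, 50))   # toy control cells x genes
effect = rng.normal(size=50)        # toy per-gene perturbation effect
obs = ctrl[:150] + effect           # "observed" perturbed cells
# Imperfect "prediction": the true effect plus per-gene noise.
pred = ctrl[150:] + effect + rng.normal(scale=0.5, size=50)

score = pearson_delta(pred, obs, ctrl)
```

Correlating shifts from control rather than raw expression stops the score from being dominated by baseline expression levels that both profiles share.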

	## Known limitations

	- Uses ESM2 `esm2_t6_8M_UR50D` (8M parameters) instead of the paper's `esm2_t36_3B_UR50D` (3B parameters). The smaller model speeds up research iteration, but gene embedding quality may be slightly lower than in the paper.
	- Uses the GEARS simulation split instead of biolord's 5 random splits, so our test perturbations are a different subset of Norman than the paper's.
	- Training uses `valid_freq > num_iterations`, so there is no mid-training validation. Convergence was not verified via a validation curve; future runs should use a smaller `valid_freq` to plot the learning curve.

	## Files

	- `CellFlow.pkl` — Trained CellFlow model, pickled via `cf.save()`. Load via `cellflow.model.CellFlow.load(path)`.
	- `training_stats.json` — iterations, wall clock, wandb run URL.

	## Usage

	```python
	from huggingface_hub import hf_hub_download
	from cellflow.model import CellFlow

	path = hf_hub_download(
	    repo_id="matthewshu/cellflow-norman",
	    filename="CellFlow.pkl",
	)
	cf = CellFlow.load(path)
	# Then use sc-interp's run_cellflow.py --hf-repo matthewshu/cellflow-norman
	```

	## Citation

	Dataset: Norman et al. 2019 (Science). Model: Klein, Fleck, Becker et al. 2025 bioRxiv (CellFlow). See the CellFlow repo and the Norman 2019 paper for proper BibTeX entries.