	---
	tags:
	- biology
	- genomics
	- single-cell-rna-seq
	- patient-classification
	library_name: transformers
	license: apache-2.0
	---

	# Virtual Cell — Patient Model

	A patient-level disease classification model trained on single-cell RNA-seq data.
	Given a matrix of gene expression profiles (one row per cell), the model produces
	a disease-category prediction for the patient.

	## Model architecture

	```
	input [batch, num_cells, 18301 genes]
	→ MLP cell embedder → [batch, num_cells, 512]
	→ Attention aggregator → [batch, 512]
	→ Dropout + Linear head → [batch, 10 classes]
	```
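
The data flow above can be sketched in PyTorch roughly as follows. This is an illustrative stand-in, not the code in `modeling_virtual_cell.py`: the class name, the two-layer MLP, the ReLU activation, the dropout rate, and the single-score attention pooling are all assumptions chosen only to reproduce the tensor shapes in the diagram.

```python
import torch
from torch import nn


class PatientClassifierSketch(nn.Module):
    """Hypothetical stand-in; layer choices are assumptions, not the released model."""

    def __init__(self, num_genes=18_301, hidden=512, num_classes=10, dropout=0.1):
        super().__init__()
        # Per-cell MLP: applied independently to every cell's expression vector.
        self.embedder = nn.Sequential(
            nn.Linear(num_genes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        # One attention score per cell; softmax over cells gives pooling weights.
        self.attn_score = nn.Linear(hidden, 1)
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x, mask=None):
        h = self.embedder(x)                      # [batch, num_cells, hidden]
        scores = self.attn_score(h).squeeze(-1)   # [batch, num_cells]
        if mask is not None:
            # Masked-out cells receive zero attention weight after the softmax.
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = scores.softmax(dim=-1)          # [batch, num_cells]
        pooled = (weights.unsqueeze(-1) * h).sum(dim=1)  # [batch, hidden]
        return self.head(self.dropout(pooled))    # [batch, num_classes]


# Small dimensions so the sketch runs quickly; the real model uses 18,301 genes.
model = PatientClassifierSketch(num_genes=32, hidden=16, num_classes=10).eval()
x = torch.randn(2, 5, 32)
mask = torch.ones(2, 5, dtype=torch.bool)
logits = model(x, mask)
print(logits.shape)  # torch.Size([2, 10])
```

The point of the attention pooling step is that the patient-level vector does not depend on the order of the cells, so any random sample of cells can be passed in.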

	## Pretrained classification task

	The pretrained checkpoint classifies patients into 10 disease categories:
	`oncological`, `immune_inflammatory`, `neurological`, `metabolic_vascular`,
	`gastrointestinal`, `respiratory`, `epithelial_barrier`, `sensory_specialized`,
	`healthy_control`, `other`.
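
To turn a row of logits into a readable label, something like the sketch below may help. It assumes the class index order matches the order in which the categories are listed above, which this card does not confirm; check `config.json` (for example an `id2label` entry, if one is present) before relying on it. The helper names and the example logits are made up for illustration.

```python
import math

# ASSUMPTION: index order follows the listing order in this model card.
CLASS_NAMES = [
    "oncological", "immune_inflammatory", "neurological", "metabolic_vascular",
    "gastrointestinal", "respiratory", "epithelial_barrier", "sensory_specialized",
    "healthy_control", "other",
]


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, class_names=CLASS_NAMES):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Made-up logits for illustration only.
example_logits = [0.1, 2.3, -0.5, 0.0, 0.4, -1.2, 0.3, -0.8, 1.1, 0.2]
label, prob = top_prediction(example_logits)
print(label, round(prob, 3))
```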

	The pretrained embedder generalizes well to other classification tasks. Common
	fine-tuning scenarios include binary sick vs. healthy classification and
	treatment-response prediction; see [Fine-tuning](#fine-tuning) below.

	## Installation

	All repository files are required to run `train.py`. Download them all
	(or clone the repo) and install dependencies:

	```bash
	pip install -r requirements.txt
	```

	`wandb` is optional and only needed when training with `--wandb_project`.

	> Tip: `train.py` uses multiple workers for data loading. A machine with
	> at least 8 CPU cores is recommended for good throughput — set
	> `--num_workers` to match your core count.

	## Quick start

	### Verify the model loads

	```python
	import torch
	from transformers import AutoModel

	model = AutoModel.from_pretrained(
	    "ConvergeBio/virtual-cell-patient",
	    trust_remote_code=True,
	).eval()

	x = torch.randn(1, 500, 18_301)  # [batch, num_cells, num_genes]
	with torch.no_grad():
	    out = model(input_ids=x)

	print(out.logits.shape) # [1, 10]
	print(out.logits.softmax(-1))
	```

	### Inference on real data

	```python
	from datasets import load_dataset
	import torch
	from transformers import AutoModel

	ds = load_dataset("ConvergeBio/virtual-cell-patient-example", split="validation")

	model = AutoModel.from_pretrained(
	    "ConvergeBio/virtual-cell-patient",
	    trust_remote_code=True,
	).eval()

	sample = torch.tensor(ds[0]["input_ids"]).unsqueeze(0)  # [1, 500, 18_301]
	with torch.no_grad():
	    out = model(input_ids=sample)

	print(out.logits.softmax(-1))
	```

	> Note: `ConvergeBio/virtual-cell-patient-example` is a minimal sample dataset
	> intended only to verify the data format and run a quick end-to-end check. It
	> contains a small number of patients and is not representative of a real training
	> or evaluation distribution. Metrics obtained from inference or training on this
	> dataset should not be read as indicative of model performance.

	## Preparing your data

	`train.py` expects a Hugging Face dataset with a `train` split and, optionally, a
	`validation` split. Each row holds one sample of cells drawn from a single
	patient, with the following required columns:

	| Column | Shape | Type | Description |
	|---|---|---|---|
	| `input_ids` | [500, 18301] | float32 | Log-normalized gene expression matrix, aligned to `gene_names.txt` |
	| `attention_mask` | [500] | bool | Cell mask (all ones for fixed cell count) |
	| `labels` | scalar | int | Class index |
	| `entity_id` | scalar | int | Patient identifier; groups augmented views of the same patient |
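
As a rough sketch of the column layout above, the snippet below draws several random cell samples from one patient's expression matrix and emits one row per sample, all sharing the same `entity_id`. The function name `make_rows`, its parameters, and the fallback to sampling with replacement when a patient has too few cells are assumptions made for illustration; the matrix is assumed to be log-normalized and gene-aligned already.

```python
import numpy as np


def make_rows(expr, label, entity_id, num_cells=500, num_views=5, seed=0):
    """Draw `num_views` random cell samples from one patient's expression matrix.

    expr: [total_cells, num_genes], log-normalized and aligned to gene_names.txt.
    """
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(num_views):
        # ASSUMPTION: sample with replacement only if the patient has too few cells.
        replace = expr.shape[0] < num_cells
        idx = rng.choice(expr.shape[0], size=num_cells, replace=replace)
        rows.append({
            "input_ids": expr[idx].astype(np.float32),         # [num_cells, num_genes]
            "attention_mask": np.ones(num_cells, dtype=bool),  # fixed cell count
            "labels": int(label),
            "entity_id": int(entity_id),
        })
    return rows


# Toy patient: 40 cells x 12 genes instead of the real 18,301 genes.
patient = np.random.default_rng(1).random((40, 12))
rows = make_rows(patient, label=1, entity_id=7, num_cells=16, num_views=5)
print(len(rows), rows[0]["input_ids"].shape)
```

A list of such dictionaries could then be turned into a dataset with the `datasets` library; the linked example dataset remains the reference for the exact format.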

	Augmentation is strongly encouraged — multiple independent random cell samples
	from the same patient should be included as separate rows sharing the same
	`entity_id`. At inference, the model averages softmax probabilities across views
	for a more robust prediction. A factor of 5 augmentations per patient is a good
	default.
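
The view-averaging step can be sketched as below, assuming you already have one row of logits per view together with the matching `entity_id` values. The helper names and the example numbers are hypothetical, and this is not necessarily how the repository code implements the aggregation.

```python
from collections import defaultdict

import numpy as np


def softmax(logits):
    """Row-wise, numerically stable softmax for a [num_rows, num_classes] array."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)


def aggregate_by_patient(logits, entity_ids):
    """Average per-view probabilities for each patient and pick the top class."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    grouped = defaultdict(list)
    for view_probs, eid in zip(probs, entity_ids):
        grouped[int(eid)].append(view_probs)
    mean_probs = {eid: np.mean(views, axis=0) for eid, views in grouped.items()}
    return {eid: (int(m.argmax()), m) for eid, m in mean_probs.items()}


# Made-up logits: three views of patient 0, two views of patient 1, binary task.
logits = [[2.0, 0.0], [1.5, 0.5], [-0.2, 0.3], [0.0, 3.0], [0.5, 2.5]]
entity_ids = [0, 0, 0, 1, 1]
result = aggregate_by_patient(logits, entity_ids)
for eid, (pred, mean) in result.items():
    print(eid, pred, mean.round(3))
```

Note that the third view of patient 0 favors class 1 on its own, but the averaged probabilities still favor class 0, which is the robustness benefit of using several views.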

	For a guide on building this dataset from raw scRNA-seq (h5ad) files, see the
	[example dataset](https://huggingface.co/datasets/ConvergeBio/virtual-cell-patient-example).
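
Aligning an expression matrix to the gene order in `gene_names.txt` might look like the sketch below. The function name is hypothetical, and two choices here are assumptions rather than documented behavior: genes missing from your data are filled with zeros, and `gene_names.txt` is assumed to hold one symbol per line. The linked guide is the authoritative description.

```python
import numpy as np


def align_to_reference(expr, source_genes, reference_genes):
    """Reorder the columns of `expr` to follow `reference_genes`.

    ASSUMPTION: genes missing from the source are zero-filled;
    genes absent from the reference are dropped.
    """
    col = {gene: i for i, gene in enumerate(source_genes)}
    aligned = np.zeros((expr.shape[0], len(reference_genes)), dtype=np.float32)
    for j, gene in enumerate(reference_genes):
        i = col.get(gene)
        if i is not None:
            aligned[:, j] = expr[:, i]
    return aligned


# In practice the reference would be read from gene_names.txt, e.g.
#   reference_genes = open("gene_names.txt").read().split()
# A four-gene toy reference is used here instead.
reference_genes = ["A1BG", "TP53", "GAPDH", "ACTB"]
source_genes = ["GAPDH", "TP53", "MYC"]
expr = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
aligned = align_to_reference(expr, source_genes, reference_genes)
print(aligned)
```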

	## Fine-tuning

	Binary classification (e.g. sick vs. healthy):

	```bash
	python train.py \
	    --dataset_path <your_dataset> \
	    --num_classes 2 \
	    --freeze_embedder \
	    --output_dir ./my_binary_model
	```

	`--freeze_embedder` keeps the pretrained cell embedder frozen and only trains
	the new head — recommended when your dataset is small.
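
Conceptually, freezing the embedder amounts to disabling gradients for its parameters and optimizing only the head. The sketch below shows the idea on a tiny stand-in model; the `embedder` and `head` names, the layer sizes, and the choice of AdamW are assumptions, and `train.py` may implement the flag differently.

```python
import torch
from torch import nn

# Stand-in model: `embedder` and `head` are hypothetical names.
model = nn.ModuleDict({
    "embedder": nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 16)),
    "head": nn.Linear(16, 2),  # fresh head for a binary task
})

# Disable gradients for every embedder parameter so it is never updated.
for param in model["embedder"].parameters():
    param.requires_grad = False

# Hand only the remaining trainable parameters to the optimizer (AdamW is an assumption).
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_trainable} / {n_total}")
```

With few labeled patients, training only the small head reduces the number of fitted parameters considerably, which is why the flag is suggested for small datasets.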

	Multi-class fine-tuning on a different label set:

	```bash
	python train.py \
	    --dataset_path <your_dataset> \
	    --num_classes <N> \
	    --output_dir ./my_finetuned_model \
	    --num_train_epochs 15 \
	    --learning_rate 1e-4
	```

	## Training from scratch

	```bash
	python train.py \
	    --dataset_path <your_dataset> \
	    --from_scratch \
	    --output_dir ./my_scratch_model
	```

	## Repository contents

	| File | Description |
	|---|---|
	| `modeling_virtual_cell.py` | Full model implementation |
	| `config.json` | Architecture config |
	| `gene_names.txt` | Ordered list of 18,301 HGNC gene symbols |
	| `train.py` | Fine-tuning / training script |
	| `requirements.txt` | Python dependencies |
	| `model.safetensors` | Pretrained weights |

	## Citation

	If you use this model, please cite:

	```bibtex
	@article{convergecell2026,
	  author = {ConvergeBio},
	  title  = {ConvergeCELL: An end-to-end platform from patient transcriptomics to therapeutic hypotheses},
	  year   = {2026},
	  note   = {Preprint available on bioRxiv},
	}
	```

	The model architecture and data processing approach were inspired by:

	```bibtex
	@article{liu2026pascient,
	  author  = {Liu, T. and De Brouwer, E. and Verma, A. and Missarova, A. and
	             Kuo, T. and others},
	  title   = {Learning multi-cellular representations of single-cell transcriptomics
	             data enables characterization of patient-level disease states},
	  journal = {Cell Systems},
	  volume  = {17},
	  pages   = {101570},
	  year    = {2026},
	}
	```

	## License

	Apache 2.0 — see [LICENSE](LICENSE) and [NOTICE](NOTICE).