---
tags:
- biology
- genomics
- single-cell-rna-seq
- patient-classification
library_name: transformers
license: apache-2.0
---

# Virtual Cell — Patient Model

A patient-level disease classification model trained on single-cell RNA-seq data.
Given a matrix of gene expression profiles (one row per cell), the model produces
a disease-category prediction for the patient.

## Model architecture

```
input  [batch, num_cells, 18301 genes]
  → MLP cell embedder      → [batch, num_cells, 512]
  → Attention aggregator   → [batch, 512]
  → Dropout + Linear head  → [batch, 10 classes]
```
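
The aggregation step above can be illustrated with a small, dependency-free sketch. It assumes the simplest form of attention pooling: one scalar score per cell, a softmax across cells, and a weighted sum of the cell embeddings. The aggregator in `modeling_virtual_cell.py` may well differ (gating, multiple heads, and so on), and `attention_pool` and `score_vector` are hypothetical names used only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(cell_embeddings, score_vector):
    """Collapse [num_cells, dim] cell embeddings into one [dim] patient vector.

    Hypothetical helper: each cell gets a scalar score (dot product with
    score_vector), scores are softmax-normalized across cells, and the
    patient vector is the weighted sum of the cell embeddings.
    """
    scores = [
        sum(e * w for e, w in zip(cell, score_vector)) for cell in cell_embeddings
    ]
    weights = softmax(scores)
    dim = len(score_vector)
    pooled = [
        sum(weight * cell[d] for weight, cell in zip(weights, cell_embeddings))
        for d in range(dim)
    ]
    return pooled, weights


# Toy example: 3 cells with 2-dimensional embeddings (the real model uses 512).
cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patient_vector, attn = attention_pool(cells, score_vector=[0.5, 0.5])
```

Because the weights sum to one across cells, the pooled vector keeps a fixed size regardless of how many cells the patient contributes, which is what lets the head operate on `[batch, 512]`.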

## Pretrained classification task

The pretrained checkpoint classifies patients into **10 disease categories**:
`oncological`, `immune_inflammatory`, `neurological`, `metabolic_vascular`,
`gastrointestinal`, `respiratory`, `epithelial_barrier`, `sensory_specialized`,
`healthy_control`, `other`.
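
To turn the 10 logits into a readable prediction, something like the following sketch may help. It assumes that class indices follow the order in which the categories are listed above, which this card does not state explicitly; check the label mapping in `config.json` (if it defines one) before relying on it. `CLASS_NAMES` and `top_prediction` are hypothetical names.

```python
import math

# Assumption: class indices follow the order in which the categories are
# listed in this card. Verify against the repository's config before use.
CLASS_NAMES = [
    "oncological", "immune_inflammatory", "neurological", "metabolic_vascular",
    "gastrointestinal", "respiratory", "epithelial_barrier", "sensory_specialized",
    "healthy_control", "other",
]


def top_prediction(logits):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(CLASS_NAMES):
        raise ValueError(f"expected {len(CLASS_NAMES)} logits, got {len(logits)}")
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs[best]


# Made-up logits for illustration only.
label, prob = top_prediction([0.1, 2.3, -1.0, 0.0, 0.4, 0.2, -0.5, 0.0, 1.1, -2.0])
```

With a real model output, the same function could be called as `top_prediction(out.logits[0].tolist())`.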

The pretrained embedder can be fine-tuned for other classification tasks. Common
scenarios include binary sick vs. healthy classification and treatment-response
prediction; see [Fine-tuning](#fine-tuning) below.

## Installation

All repository files are required to run `train.py`. Download them all
(or clone the repo) and install the dependencies:

```bash
pip install -r requirements.txt
```

`wandb` is optional and only needed when training with `--wandb_project`.

> **Tip:** `train.py` uses multiple workers for data loading. A machine with
> at least 8 CPU cores is recommended for good throughput. Set
> `--num_workers` to match your core count.
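
If the core count of the target machine is not known in advance, a small helper such as the one below can pick a value for `--num_workers`. This is only a convenience sketch; `suggested_num_workers` and its `reserve` parameter are hypothetical and not part of `train.py`.

```python
import os


def suggested_num_workers(reserve=0):
    """Return a worker count based on the visible CPU cores (at least 1)."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cores - reserve)


# Print the flag so it can be pasted into a train.py command line.
print(f"--num_workers {suggested_num_workers()}")
```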

## Quick start

### Verify the model loads

```python
import torch
from transformers import AutoModel

# trust_remote_code=True is needed because the model class ships with the repo.
model = AutoModel.from_pretrained(
    "ConvergeBio/virtual-cell-patient",
    trust_remote_code=True,
).eval()

x = torch.randn(1, 500, 18_301)  # random input: [batch, num_cells, num_genes]
with torch.no_grad():
    out = model(input_ids=x)

print(out.logits.shape)        # torch.Size([1, 10])
print(out.logits.softmax(-1))  # class probabilities
```

### Inference on real data

```python
from datasets import load_dataset
import torch
from transformers import AutoModel

ds = load_dataset("ConvergeBio/virtual-cell-patient-example", split="validation")
model = AutoModel.from_pretrained(
    "ConvergeBio/virtual-cell-patient",
    trust_remote_code=True,
).eval()

# Add a batch dimension to the first row.
sample = torch.tensor(ds[0]["input_ids"]).unsqueeze(0)  # [1, 500, 18_301]
with torch.no_grad():
    out = model(input_ids=sample)

print(out.logits.softmax(-1))  # class probabilities
```

> **Note:** `ConvergeBio/virtual-cell-patient-example` is a minimal sample dataset
> intended only to verify the data format and run a quick end-to-end check. It
> contains a small number of patients and is not representative of a real training
> or evaluation distribution. Metrics from inference or training on this dataset
> should not be interpreted as a measure of model performance.

## Preparing your data

`train.py` expects a Hugging Face dataset with `train` (and optionally `validation`)
splits. Each row represents one random sample of cells from a patient, with the
following required columns:

| Column | Shape | Type | Description |
|---|---|---|---|
| `input_ids` | [500, 18301] | float32 | Log-normalized gene expression matrix, aligned to `gene_names.txt` |
| `attention_mask` | [500] | bool | Cell mask (all ones for fixed cell count) |
| `labels` | scalar | int | Class index |
| `entity_id` | scalar | int | Patient identifier; groups augmented views of the same patient |
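
Before training, it can be useful to check rows against the table above. The following is a minimal sketch rather than anything `train.py` provides: `validate_row` is a hypothetical helper, and it assumes rows are plain Python dicts of lists and scalars, as returned by indexing a dataset that has no tensor format set.

```python
def validate_row(row, num_cells=500, num_genes=18301, num_classes=10):
    """Return a list of problems found in one dataset row (empty if none)."""
    problems = []
    for column in ("input_ids", "attention_mask", "labels", "entity_id"):
        if column not in row:
            problems.append(f"missing column: {column}")
    if problems:
        return problems

    cells = row["input_ids"]
    if len(cells) != num_cells:
        problems.append(f"input_ids has {len(cells)} cells, expected {num_cells}")
    if any(len(cell) != num_genes for cell in cells):
        problems.append(f"every cell must have {num_genes} gene values")
    if len(row["attention_mask"]) != num_cells:
        problems.append(f"attention_mask must have length {num_cells}")
    if not 0 <= row["labels"] < num_classes:
        problems.append(f"labels must be in [0, {num_classes})")
    return problems


# Toy row with 2 cells and 3 genes to keep the example small.
toy = {
    "input_ids": [[0.0, 1.2, 0.0], [0.7, 0.0, 2.1]],
    "attention_mask": [True, True],
    "labels": 1,
    "entity_id": 42,
}
print(validate_row(toy, num_cells=2, num_genes=3, num_classes=2))  # prints []
```

For real data, the defaults (500 cells, 18,301 genes) match the shapes in the table; `num_classes` should be set to the size of your own label set.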

**Augmentation is strongly encouraged.** Include multiple independent random cell
samples from the same patient as separate rows that share the same `entity_id`.
At inference, the model averages softmax probabilities across views for a more
robust prediction. Five views per patient is a good default.
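
The view-averaging described above can be sketched as follows. This mirrors the stated behavior (mean of per-view softmax probabilities, grouped by `entity_id`), but it is an illustration and not the code used by `train.py`; `aggregate_views` is a hypothetical name.

```python
import math
from collections import defaultdict


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_views(entity_ids, view_logits):
    """Average per-view softmax probabilities for each patient.

    entity_ids: one patient identifier per view.
    view_logits: one list of class logits per view, in the same order.
    Returns {entity_id: (predicted_class_index, mean_probabilities)}.
    """
    grouped = defaultdict(list)
    for entity_id, logits in zip(entity_ids, view_logits):
        grouped[entity_id].append(softmax(logits))

    results = {}
    for entity_id, prob_rows in grouped.items():
        mean = [sum(col) / len(prob_rows) for col in zip(*prob_rows)]
        results[entity_id] = (max(range(len(mean)), key=mean.__getitem__), mean)
    return results


# Two patients, binary task: patient 7 has three views, patient 9 has one.
preds = aggregate_views(
    entity_ids=[7, 7, 7, 9],
    view_logits=[[2.0, 0.0], [1.5, 0.5], [-0.2, 0.3], [0.0, 3.0]],
)
```

In this toy case one of patient 7's views favors class 1, but the averaged probabilities still favor class 0, which is the kind of smoothing the augmentation is meant to provide.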

For a guide on building this dataset from raw scRNA-seq (h5ad) files, see the
[example dataset](https://huggingface.co/datasets/ConvergeBio/virtual-cell-patient-example).
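
As a rough picture of what the augmented rows look like, the sketch below draws several random cell subsets for one patient. It assumes sampling without replacement and a patient with at least `num_cells` cells; neither detail is specified in this card, and the linked guide remains the authoritative reference. `sample_views` is a hypothetical helper.

```python
import random


def sample_views(cells, entity_id, label, num_cells=500, num_views=5, seed=0):
    """Draw `num_views` independent random cell subsets for one patient.

    cells: list of per-cell expression vectors, already log-normalized and
    aligned to gene_names.txt. Assumes len(cells) >= num_cells.
    """
    if len(cells) < num_cells:
        raise ValueError(f"patient {entity_id} has only {len(cells)} cells")
    rng = random.Random(seed)
    rows = []
    for _ in range(num_views):
        subset = rng.sample(cells, num_cells)  # without replacement (assumption)
        rows.append({
            "input_ids": subset,
            "attention_mask": [True] * num_cells,
            "labels": label,
            "entity_id": entity_id,  # shared by all views of this patient
        })
    return rows


# Toy patient: 8 cells with 3 genes each, 4 cells per view.
toy_cells = [[float(i), 0.0, 1.0] for i in range(8)]
rows = sample_views(toy_cells, entity_id=42, label=1, num_cells=4, num_views=5)
```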

## Fine-tuning

**Binary classification (e.g. sick vs. healthy):**

```bash
python train.py \
  --dataset_path <your_dataset> \
  --num_classes 2 \
  --freeze_embedder \
  --output_dir ./my_binary_model
```

`--freeze_embedder` keeps the pretrained cell embedder frozen and trains only
the new head. This is recommended when your dataset is small.

**Multi-class fine-tuning on a different label set:**

```bash
python train.py \
  --dataset_path <your_dataset> \
  --num_classes <N> \
  --output_dir ./my_finetuned_model \
  --num_train_epochs 15 \
  --learning_rate 1e-4
```

## Training from scratch

```bash
python train.py \
  --dataset_path <your_dataset> \
  --from_scratch \
  --output_dir ./my_scratch_model
```

## Repository contents

| File | Description |
|---|---|
| `modeling_virtual_cell.py` | Full model implementation |
| `config.json` | Architecture config |
| `gene_names.txt` | Ordered list of 18,301 HGNC gene symbols |
| `train.py` | Fine-tuning / training script |
| `requirements.txt` | Python dependencies |
| `model.safetensors` | Pretrained weights |
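
Since inputs must follow the gene order in `gene_names.txt`, a sketch of the alignment step may be useful. It assumes the file holds one gene symbol per line and that genes missing from a cell are filled with zero; both are assumptions, not documented behavior. `load_gene_names` and `align_cell` are hypothetical helpers.

```python
def load_gene_names(path="gene_names.txt"):
    """Read one gene symbol per line, preserving file order (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def align_cell(expression_by_gene, gene_names, fill_value=0.0):
    """Order one cell's expression values to match `gene_names`.

    Genes missing from the cell are filled with `fill_value`; genes that are
    not in `gene_names` are dropped. Zero-filling is an assumption.
    """
    return [float(expression_by_gene.get(gene, fill_value)) for gene in gene_names]


# Toy gene list standing in for the 18,301 symbols in gene_names.txt.
genes = ["A1BG", "A2M", "TP53"]
cell = {"TP53": 1.7, "A1BG": 0.4, "XIST": 2.0}
print(align_cell(cell, genes))  # prints [0.4, 0.0, 1.7]
```

With the real file, `genes = load_gene_names()` would replace the toy list, and each aligned cell would have 18,301 values.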

## Citation

If you use this model, please cite:

```bibtex
@article{convergecell2026,
  author = {ConvergeBio},
  title  = {ConvergeCELL: An end-to-end platform from patient transcriptomics to therapeutic hypotheses},
  year   = {2026},
  note   = {Preprint available on bioRxiv},
}
```

The model architecture and data processing approach were inspired by:

```bibtex
@article{liu2026pascient,
  author  = {Liu, T. and De Brouwer, E. and Verma, A. and Missarova, A. and
             Kuo, T. and others},
  title   = {Learning multi-cellular representations of single-cell transcriptomics
             data enables characterization of patient-level disease states},
  journal = {Cell Systems},
  volume  = {17},
  pages   = {101570},
  year    = {2026},
}
```

## License

Apache 2.0. See [LICENSE](LICENSE) and [NOTICE](NOTICE).