# Virtual Cell: Distilled Bulk Encoder
A bulk RNA-seq encoder distilled from ConvergeBio/virtual-cell-patient. It maps bulk gene expression directly into the same 512-dimensional patient embedding space, making single-cell-trained representations accessible when only bulk data is available.
## Model architecture

```
input [batch, 18301 genes]
  → MLP encoder (Linear → BN → PReLU) × 2 → [batch, 512]
```
Training objective: cosine distillation loss, with teacher embeddings produced by
virtual-cell-patient on matched single-cell RNA-seq data from the same patients.
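The distillation objective can be sketched in PyTorch. This is a minimal reconstruction of the loss described above, not the actual training code; random tensors stand in for real student and teacher embeddings:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: in training, `student` would be the bulk encoder's output and
# `teacher` the frozen virtual-cell-patient embedding for the same patients.
student = torch.randn(4, 512, requires_grad=True)
teacher = torch.randn(4, 512)

# Cosine distillation loss: 1 - cosine similarity, averaged over the batch.
loss = (1.0 - F.cosine_similarity(student, teacher, dim=-1)).mean()
loss.backward()  # gradients flow into the student only
print(loss.item())  # scalar in [0, 2]
```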
## Relationship to virtual-cell-patient
| | virtual-cell-patient | virtual-cell-distil-bulk |
|---|---|---|
| Input | [batch, n_cells, 18301] single-cell matrix | [batch, 18301] bulk expression vector |
| Output | [batch, 512] patient embedding + class logits | [batch, 512] patient embedding |
| Requires single-cell data | Yes | No |
Both models use the same 18,301-gene vocabulary (`gene_names.txt`) and produce embeddings in the same 512-dimensional space.
## Installation

```bash
pip install -r requirements.txt
```
`wandb` is optional and only needed when training with `--wandb_project`.
## Quick start

### Inference: extract embeddings
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    trust_remote_code=True,
).eval()

x = torch.randn(4, 18_301)  # [batch, num_genes]
with torch.no_grad():
    out = model(input_ids=x)
print(out["embeddings"].shape)  # [4, 512]
```
Note: the model uses BatchNorm, so always call `.eval()` before running inference.
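The reason `.eval()` matters can be shown with a small standalone sketch (plain `torch.nn`, not this model): in train mode BatchNorm normalises with the current batch's statistics, so a sample's output depends on what else is in the batch, while eval mode uses the stored running statistics and is deterministic per sample.

```python
import torch
from torch import nn

bn = nn.BatchNorm1d(4)
x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [5.0, 6.0, 7.0, 8.0]])

bn.train()
y_train = bn(x)  # normalised with this batch's mean/variance
bn.eval()
y_eval = bn(x)   # normalised with the running statistics

print(torch.allclose(y_train, y_eval))  # False
```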
### Inference on real data

```python
import torch
from datasets import load_dataset
from transformers import AutoModel

ds = load_dataset("ConvergeBio/virtual-cell-distil-bulk-example", split="validation")
model = AutoModel.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    trust_remote_code=True,
).eval()

sample = torch.tensor(ds[0]["bulk_expression"]).unsqueeze(0)  # [1, 18301]
with torch.no_grad():
    out = model(input_ids=sample)
print(out["embeddings"].shape)  # [1, 512]
```
Note: `ConvergeBio/virtual-cell-distil-bulk-example` is a minimal sample dataset intended only to verify the data format and run a quick end-to-end check. Metrics produced from this dataset should not be interpreted.
## Fine-tuning for classification
The pretrained encoder can be fine-tuned on any bulk RNA-seq classification task. A linear head is added on top; the encoder weights are initialised from the distilled checkpoint and optionally frozen.
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    num_labels=2,
    ignore_mismatched_sizes=True,  # classification head is randomly initialised
    trust_remote_code=True,
)
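Freezing the encoder (what `--freeze_encoder` does in `train.py`) amounts to disabling gradients for everything except the classification head. A self-contained sketch with a toy stand-in module (hypothetical parameter names; the real model exposes its own):

```python
import torch
from torch import nn

# Toy stand-in: a pretrained-style MLP encoder plus a fresh linear head.
model = nn.ModuleDict({
    "encoder": nn.Sequential(nn.Linear(18301, 512), nn.BatchNorm1d(512), nn.PReLU()),
    "classifier": nn.Linear(512, 2),
})

# Freeze everything except the classification head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['classifier.weight', 'classifier.bias']
```

Only the head's parameters then receive gradient updates, which is the usual choice for small datasets where full fine-tuning would overfit.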
Binary classification (e.g. disease vs. healthy) with a frozen encoder:

```bash
python train.py \
    --dataset_path <your_dataset> \
    --num_classes 2 \
    --freeze_encoder \
    --output_dir ./my_binary_model
```
Multi-class fine-tuning:

```bash
python train.py \
    --dataset_path <your_dataset> \
    --num_classes <N> \
    --output_dir ./my_finetuned_model \
    --num_train_epochs 15 \
    --learning_rate 1e-4
```
## Preparing your data

`train.py` expects a HuggingFace dataset with `train` (and optionally `validation`) splits. Each row represents one patient sample:
| Column | Shape | Type | Description |
|---|---|---|---|
| `bulk_expression` | [18301] | float32 | Log-normalised bulk gene expression, aligned to `gene_names.txt` |
| `labels` | scalar | int | Class index |
Input expression should be library-size normalised (target sum 10,000) and log1p transformed. The gene axis must be aligned to the 18,301 genes in `gene_names.txt`: missing genes are zero-filled, extra genes are dropped.
For a guide on building this dataset from raw count matrices, see the example dataset.
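The normalisation and alignment steps above can be sketched in NumPy. This is an illustrative reconstruction with toy gene lists; in practice `gene_names` would be read from `gene_names.txt` and `counts` from your raw count matrix:

```python
import numpy as np

gene_names = ["TP53", "BRCA1", "EGFR"]   # model vocabulary (toy stand-in)
my_genes = ["EGFR", "TP53", "XYZ1"]      # genes present in our data (toy)
counts = np.array([10.0, 30.0, 60.0])    # raw counts for one sample

# 1. Library-size normalise to a target sum of 10,000, then log1p.
expr = np.log1p(counts / counts.sum() * 10_000)

# 2. Align to the model vocabulary: zero-fill missing genes, drop extras.
index = {g: i for i, g in enumerate(my_genes)}
aligned = np.zeros(len(gene_names), dtype=np.float32)
for j, g in enumerate(gene_names):
    if g in index:
        aligned[j] = expr[index[g]]

print(aligned.shape)  # (3,) here; (18301,) with the real gene list
```

Note that BRCA1 (absent from the toy data) ends up zero-filled, while XYZ1 (absent from the vocabulary) is dropped.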
## Repository contents
| File | Description |
|---|---|
| `modeling_virtual_cell_distil.py` | Full model implementation |
| `config.json` | Architecture config |
| `gene_names.txt` | Ordered list of 18,301 HGNC gene symbols |
| `train.py` | Classification fine-tuning script |
| `requirements.txt` | Python dependencies |
| `model.safetensors` | Pretrained encoder weights |
## Citation
If you use this model, please cite:
```bibtex
@article{convergecell2026,
  author = {ConvergeBio},
  title  = {ConvergeCELL: An end-to-end platform from patient transcriptomics to therapeutic hypotheses},
  year   = {2026},
  note   = {Preprint available on bioRxiv},
}
```
## License