Virtual Cell — Distilled Bulk Encoder

A bulk RNA-seq encoder distilled from ConvergeBio/virtual-cell-patient. It maps bulk gene expression directly into the same 512-dimensional patient embedding space, making single-cell-trained representations accessible when only bulk data is available.

Model architecture

input  [batch, 18301 genes]
  → MLP encoder (Linear → BN → PReLU)²   → [batch, 512]
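A minimal PyTorch sketch of this encoder shape. The hidden width used here (1024) is an assumption for illustration; the actual dimensions are defined in config.json:

```python
import torch
import torch.nn as nn

class BulkEncoderSketch(nn.Module):
    """Sketch of a (Linear -> BatchNorm -> PReLU) x 2 MLP encoder.
    Hidden width is assumed; see config.json for the real architecture."""
    def __init__(self, num_genes=18_301, hidden_dim=1024, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_genes, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.PReLU(),
            nn.Linear(hidden_dim, embed_dim),
            nn.BatchNorm1d(embed_dim),
            nn.PReLU(),
        )

    def forward(self, x):
        return self.net(x)

enc = BulkEncoderSketch().eval()  # eval mode: BatchNorm uses running stats
with torch.no_grad():
    emb = enc(torch.randn(4, 18_301))
print(emb.shape)  # torch.Size([4, 512])
```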

Training objective: cosine distillation loss, with teacher embeddings produced by virtual-cell-patient on matched single-cell RNA-seq data from the same patients.
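A cosine distillation loss of this kind can be sketched as one minus the per-sample cosine similarity between student and teacher embeddings, averaged over the batch. This is an illustrative sketch of the stated objective, not the repository's training code:

```python
import torch
import torch.nn.functional as F

def cosine_distillation_loss(student_emb, teacher_emb):
    """1 - cosine(student, teacher), averaged over the batch.
    Both inputs: [batch, 512] embeddings."""
    return (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()

teacher = torch.randn(4, 512)
loss = cosine_distillation_loss(teacher.clone(), teacher)
print(loss.item())  # ~0.0 for identical embeddings
```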

Relationship to virtual-cell-patient

|  | virtual-cell-patient | virtual-cell-distil-bulk |
|---|---|---|
| Input | [batch, n_cells, 18301] single-cell matrix | [batch, 18301] bulk expression vector |
| Output | [batch, 512] patient embedding + class logits | [batch, 512] patient embedding |
| Requires single-cell data | Yes | No |

Both models use the same 18,301-gene vocabulary (gene_names.txt) and produce embeddings in the same 512-dimensional space.

Installation

pip install -r requirements.txt

wandb is optional and only needed when training with --wandb_project.

Quick start

Inference — extract embeddings

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    trust_remote_code=True,
).eval()

x = torch.randn(4, 18_301)   # [batch, num_genes]
with torch.no_grad():
    out = model(input_ids=x)

print(out["embeddings"].shape)   # [4, 512]

Note: the model uses BatchNorm — always call .eval() for inference.

Inference on real data

from datasets import load_dataset
import torch
from transformers import AutoModel

ds = load_dataset("ConvergeBio/virtual-cell-distil-bulk-example", split="validation")

model = AutoModel.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    trust_remote_code=True,
).eval()

sample = torch.tensor(ds[0]["bulk_expression"]).unsqueeze(0)  # [1, 18301]
with torch.no_grad():
    out = model(input_ids=sample)

print(out["embeddings"].shape)   # [1, 512]

Note: ConvergeBio/virtual-cell-distil-bulk-example is a minimal sample dataset intended only to verify the data format and run a quick end-to-end check. Metrics produced from this dataset should not be interpreted.

Fine-tuning for classification

The pretrained encoder can be fine-tuned on any bulk RNA-seq classification task. A linear head is added on top; the encoder weights are initialised from the distilled checkpoint and optionally frozen.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    num_labels=2,
    ignore_mismatched_sizes=True,   # classification head is randomly initialised
    trust_remote_code=True,
)
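With a frozen encoder, only the classification head receives gradient updates. A minimal sketch of the mechanism, using a toy stand-in encoder rather than the actual model class (conceptually what --freeze_encoder does):

```python
import torch
import torch.nn as nn

# Toy stand-in for the pretrained encoder and the new linear head.
encoder = nn.Linear(18_301, 512)
head = nn.Linear(512, 2)

# Disable gradients for every encoder parameter.
for p in encoder.parameters():
    p.requires_grad = False

# Only the head's parameters remain trainable.
trainable = [p for p in head.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

x = torch.randn(4, 18_301)
logits = head(encoder(x))
logits.sum().backward()
print(encoder.weight.grad is None)  # True: the frozen encoder received no gradient
```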

Binary classification (e.g. disease vs. healthy) with frozen encoder:

python train.py \
  --dataset_path <your_dataset> \
  --num_classes 2 \
  --freeze_encoder \
  --output_dir ./my_binary_model

Multi-class fine-tuning:

python train.py \
  --dataset_path <your_dataset> \
  --num_classes <N> \
  --output_dir ./my_finetuned_model \
  --num_train_epochs 15 \
  --learning_rate 1e-4

Preparing your data

train.py expects a HuggingFace dataset with train (and optionally validation) splits. Each row represents one patient sample:

| Column | Shape | Type | Description |
|---|---|---|---|
| bulk_expression | [18301] | float32 | Log-normalised bulk gene expression, aligned to gene_names.txt |
| labels | scalar | int | Class index |

Input expression should be library-size normalised (target sum 10,000) and log1p transformed. The gene axis must be aligned to the 18,301 genes in gene_names.txt — missing genes are zero-filled, extra genes are dropped.
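One reasonable way to implement this preprocessing for a single sample (a sketch, not the repository's own code; the exact ordering of alignment and normalisation is an assumption):

```python
import numpy as np

def prepare_bulk_vector(counts, sample_genes, model_genes):
    """Align raw counts to the model's gene vocabulary, then normalise.

    counts       : raw counts for one sample, parallel to sample_genes
    sample_genes : gene symbols for the sample
    model_genes  : the ordered symbols from gene_names.txt

    Missing genes are zero-filled, extra genes dropped; then library-size
    normalisation to a target sum of 10,000 followed by log1p.
    """
    index = {g: i for i, g in enumerate(model_genes)}
    aligned = np.zeros(len(model_genes), dtype=np.float32)
    for gene, count in zip(sample_genes, counts):
        if gene in index:  # genes absent from the vocabulary are dropped
            aligned[index[gene]] = count
    total = aligned.sum()
    if total > 0:
        aligned = aligned / total * 10_000.0
    return np.log1p(aligned)

# Toy 4-gene vocabulary for illustration; the real one has 18,301 entries.
vec = prepare_bulk_vector(
    counts=[5, 10, 3],
    sample_genes=["TP53", "EGFR", "UNKNOWN_GENE"],
    model_genes=["BRCA1", "TP53", "EGFR", "MYC"],
)
print(vec.shape)  # (4,)
```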

For a guide on building this dataset from raw count matrices, see the example dataset.

Repository contents

| File | Description |
|---|---|
| modeling_virtual_cell_distil.py | Full model implementation |
| config.json | Architecture config |
| gene_names.txt | Ordered list of 18,301 HGNC gene symbols |
| train.py | Classification fine-tuning script |
| requirements.txt | Python dependencies |
| model.safetensors | Pretrained encoder weights |

Citation

If you use this model, please cite:

@article{convergecell2026,
  author    = {ConvergeBio},
  title     = {ConvergeCELL: An end-to-end platform from patient transcriptomics to therapeutic hypotheses},
  year      = {2026},
  note      = {Preprint available on bioRxiv},
}

License

Apache 2.0 — see LICENSE and NOTICE.
