Virtual Cell — Distilled Bulk Encoder

A bulk RNA-seq encoder distilled from ConvergeBio/virtual-cell-patient. It maps bulk gene expression directly into the same 512-dimensional patient embedding space, making single-cell-trained representations accessible when only bulk data is available.

Model architecture

input  [batch, 18301 genes]
  → MLP encoder (Linear → BN → PReLU)²   → [batch, 512]
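A minimal PyTorch sketch of this encoder shape. The hidden width used here (1024) is an assumption for illustration; the actual dimensions are defined in config.json:

```python
import torch
import torch.nn as nn

class BulkEncoderSketch(nn.Module):
    """Sketch of a (Linear -> BatchNorm -> PReLU) x 2 MLP encoder.
    Hidden width is assumed; see config.json for the real architecture."""
    def __init__(self, num_genes=18_301, hidden_dim=1024, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_genes, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.PReLU(),
            nn.Linear(hidden_dim, embed_dim),
            nn.BatchNorm1d(embed_dim),
            nn.PReLU(),
        )

    def forward(self, x):
        return self.net(x)

enc = BulkEncoderSketch().eval()  # eval mode: BatchNorm uses running stats
with torch.no_grad():
    emb = enc(torch.randn(4, 18_301))
print(emb.shape)  # torch.Size([4, 512])
```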

Training objective: cosine distillation loss, with teacher embeddings produced by virtual-cell-patient on matched single-cell RNA-seq data from the same patients.
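A cosine distillation loss of this kind can be sketched as one minus the per-sample cosine similarity between student and teacher embeddings, averaged over the batch. This is an illustrative sketch of the stated objective, not the repository's training code:

```python
import torch
import torch.nn.functional as F

def cosine_distillation_loss(student_emb, teacher_emb):
    """1 - cosine(student, teacher), averaged over the batch.
    Both inputs: [batch, 512] embeddings."""
    return (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()

teacher = torch.randn(4, 512)
loss = cosine_distillation_loss(teacher.clone(), teacher)
print(loss.item())  # ~0.0 for identical embeddings
```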

Relationship to virtual-cell-patient

|  | virtual-cell-patient | virtual-cell-distil-bulk |
|---|---|---|
| Input | [batch, n_cells, 18301] single-cell matrix | [batch, 18301] bulk expression vector |
| Output | [batch, 512] patient embedding + class logits | [batch, 512] patient embedding |
| Requires single-cell data | Yes | No |

Both models use the same 18,301-gene vocabulary (gene_names.txt) and produce embeddings in the same 512-dimensional space.

Installation

pip install -r requirements.txt

wandb is optional and only needed when training with --wandb_project.

Quick start

Inference — extract embeddings

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    trust_remote_code=True,
).eval()

x = torch.randn(4, 18_301)   # [batch, num_genes]
with torch.no_grad():
    out = model(input_ids=x)

print(out["embeddings"].shape)   # [4, 512]

Note: the model uses BatchNorm — always call .eval() for inference.

Inference on real data

from datasets import load_dataset
import torch
from transformers import AutoModel

ds = load_dataset("ConvergeBio/virtual-cell-distil-bulk-example", split="validation")

model = AutoModel.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    trust_remote_code=True,
).eval()

sample = torch.tensor(ds[0]["bulk_expression"]).unsqueeze(0)  # [1, 18301]
with torch.no_grad():
    out = model(input_ids=sample)

print(out["embeddings"].shape)   # [1, 512]

Note: ConvergeBio/virtual-cell-distil-bulk-example is a minimal sample dataset intended only to verify the data format and run a quick end-to-end check. Metrics produced from this dataset should not be interpreted.

Fine-tuning for classification

The pretrained encoder can be fine-tuned on any bulk RNA-seq classification task. A linear head is added on top; the encoder weights are initialised from the distilled checkpoint and optionally frozen.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    num_labels=2,
    ignore_mismatched_sizes=True,   # classification head is randomly initialised
    trust_remote_code=True,
)
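With a frozen encoder, only the classification head receives gradient updates. A minimal sketch of the mechanism, using a toy stand-in encoder rather than the actual model class (conceptually what --freeze_encoder does):

```python
import torch
import torch.nn as nn

# Toy stand-in for the pretrained encoder and the new linear head.
encoder = nn.Linear(18_301, 512)
head = nn.Linear(512, 2)

# Disable gradients for every encoder parameter.
for p in encoder.parameters():
    p.requires_grad = False

# Only the head's parameters remain trainable.
trainable = [p for p in head.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

x = torch.randn(4, 18_301)
logits = head(encoder(x))
logits.sum().backward()
print(encoder.weight.grad is None)  # True: the frozen encoder received no gradient
```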

Binary classification (e.g. disease vs. healthy) with frozen encoder:

python train.py \
  --dataset_path <your_dataset> \
  --num_classes 2 \
  --freeze_encoder \
  --output_dir ./my_binary_model

Multi-class fine-tuning:

python train.py \
  --dataset_path <your_dataset> \
  --num_classes <N> \
  --output_dir ./my_finetuned_model \
  --num_train_epochs 15 \
  --learning_rate 1e-4

Preparing your data

train.py expects a HuggingFace dataset with train (and optionally validation) splits. Each row represents one patient sample:

| Column | Shape | Type | Description |
|---|---|---|---|
| bulk_expression | [18301] | float32 | Log-normalised bulk gene expression, aligned to gene_names.txt |
| labels | scalar | int | Class index |

Input expression should be library-size normalised (target sum 10,000) and log1p transformed. The gene axis must be aligned to the 18,301 genes in gene_names.txt — missing genes are zero-filled, extra genes are dropped.
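One reasonable way to implement this preprocessing for a single sample (a sketch, not the repository's own code; the exact ordering of alignment and normalisation is an assumption):

```python
import numpy as np

def prepare_bulk_vector(counts, sample_genes, model_genes):
    """Align raw counts to the model's gene vocabulary, then normalise.

    counts       : raw counts for one sample, parallel to sample_genes
    sample_genes : gene symbols for the sample
    model_genes  : the ordered symbols from gene_names.txt

    Missing genes are zero-filled, extra genes dropped; then library-size
    normalisation to a target sum of 10,000 followed by log1p.
    """
    index = {g: i for i, g in enumerate(model_genes)}
    aligned = np.zeros(len(model_genes), dtype=np.float32)
    for gene, count in zip(sample_genes, counts):
        if gene in index:  # genes absent from the vocabulary are dropped
            aligned[index[gene]] = count
    total = aligned.sum()
    if total > 0:
        aligned = aligned / total * 10_000.0
    return np.log1p(aligned)

# Toy 4-gene vocabulary for illustration; the real one has 18,301 entries.
vec = prepare_bulk_vector(
    counts=[5, 10, 3],
    sample_genes=["TP53", "EGFR", "UNKNOWN_GENE"],
    model_genes=["BRCA1", "TP53", "EGFR", "MYC"],
)
print(vec.shape)  # (4,)
```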

For a guide on building this dataset from raw count matrices, see the example dataset.

Repository contents

| File | Description |
|---|---|
| modeling_virtual_cell_distil.py | Full model implementation |
| config.json | Architecture config |
| gene_names.txt | Ordered list of 18,301 HGNC gene symbols |
| train.py | Classification fine-tuning script |
| requirements.txt | Python dependencies |
| model.safetensors | Pretrained encoder weights |

Citation

If you use this model, please cite:

@article{convergecell2026,
  author    = {ConvergeBio},
  title     = {ConvergeCELL: An end-to-end platform from patient transcriptomics to therapeutic hypotheses},
  year      = {2026},
  note      = {Preprint available on bioRxiv},
}

License

Apache 2.0 — see LICENSE and NOTICE.
