---
tags:
- biology
- genomics
- bulk-rna-seq
- patient-embedding
library_name: transformers
license: apache-2.0
---

# Virtual Cell: Distilled Bulk Encoder

A bulk RNA-seq encoder distilled from
[ConvergeBio/virtual-cell-patient](https://huggingface.co/ConvergeBio/virtual-cell-patient).
It maps bulk gene expression directly into the same 512-dimensional patient embedding space,
making single-cell-trained representations accessible when only bulk data is available.
## Model architecture

```
input [batch, 18301 genes]
  → MLP encoder (Linear → BN → PReLU)² → [batch, 512]
```

Training objective: cosine distillation loss, with teacher embeddings produced by
`virtual-cell-patient` on matched single-cell RNA-seq data from the same patients.
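
The architecture and objective above can be illustrated with a minimal PyTorch sketch. It is not the released implementation (that lives in `modeling_virtual_cell_distil.py`). The class name `BulkEncoderSketch`, the hidden width `HIDDEN_DIM`, and the exact loss form (mean of one minus cosine similarity) are assumptions made for illustration; only the input size, the block structure, and the output size come from this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_GENES = 18_301  # from the model card
EMBED_DIM = 512     # from the model card
HIDDEN_DIM = 256    # assumption: the hidden width is not stated in the card


class BulkEncoderSketch(nn.Module):
    """Two (Linear -> BatchNorm -> PReLU) blocks mapping genes to an embedding."""

    def __init__(self, num_genes=NUM_GENES, hidden_dim=HIDDEN_DIM, embed_dim=EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_genes, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.PReLU(),
            nn.Linear(hidden_dim, embed_dim),
            nn.BatchNorm1d(embed_dim),
            nn.PReLU(),
        )

    def forward(self, x):
        return self.net(x)


def cosine_distillation_loss(student, teacher):
    """Assumed loss form: mean of (1 - cosine similarity) over the batch."""
    return (1.0 - F.cosine_similarity(student, teacher, dim=-1)).mean()


torch.manual_seed(0)
encoder = BulkEncoderSketch()
bulk = torch.randn(4, NUM_GENES)         # stand-in bulk expression batch
teacher_emb = torch.randn(4, EMBED_DIM)  # stand-in for teacher embeddings
student_emb = encoder(bulk)              # [4, 512]
loss = cosine_distillation_loss(student_emb, teacher_emb)
loss.backward()                          # gradients flow only into the student
```

In this sketch the teacher embeddings are treated as fixed targets, so the loss depends only on the direction of each student embedding relative to its teacher, not on its norm.
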
## Relationship to virtual-cell-patient

| | [virtual-cell-patient](https://huggingface.co/ConvergeBio/virtual-cell-patient) | virtual-cell-distil-bulk |
|---|---|---|
| Input | `[batch, n_cells, 18301]` single-cell matrix | `[batch, 18301]` bulk expression vector |
| Output | `[batch, 512]` patient embedding + class logits | `[batch, 512]` patient embedding |
| Requires single-cell data | Yes | No |

Both models use the same 18,301-gene vocabulary (`gene_names.txt`) and produce embeddings
in the same 512-dimensional space.
## Installation

```bash
pip install -r requirements.txt
```

`wandb` is optional and only needed when training with `--wandb_project`.
## Quick start

### Inference: extract embeddings

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    trust_remote_code=True,
).eval()

x = torch.randn(4, 18_301)  # [batch, num_genes]
with torch.no_grad():
    out = model(input_ids=x)
print(out["embeddings"].shape)  # [4, 512]
```

> **Note:** The model uses BatchNorm, so always call `.eval()` before inference. In training
> mode the BatchNorm layers normalise with per-batch statistics instead of the stored running
> statistics, so a sample's embedding would depend on the other samples in the batch.
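
To make the BatchNorm note concrete, the sketch below uses a small toy block (a stand-in, not the released encoder) and embeds the same sample alongside two different sets of batch companions. In eval mode the result for that sample is the same either way; in training mode it changes with the rest of the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in with the same layer pattern as the encoder; sizes are arbitrary.
block = nn.Sequential(nn.Linear(8, 4), nn.BatchNorm1d(4), nn.PReLU())

sample = torch.randn(1, 8)
companions_a = torch.randn(3, 8)
companions_b = torch.randn(3, 8) + 5.0  # deliberately shifted distribution

# Eval mode: BatchNorm uses stored running statistics.
block.eval()
with torch.no_grad():
    eval_a = block(torch.cat([sample, companions_a]))[0]
    eval_b = block(torch.cat([sample, companions_b]))[0]

# Training mode: BatchNorm uses the statistics of the current batch.
block.train()
with torch.no_grad():
    train_a = block(torch.cat([sample, companions_a]))[0]
    train_b = block(torch.cat([sample, companions_b]))[0]

same_in_eval = torch.allclose(eval_a, eval_b, atol=1e-6)
same_in_train = torch.allclose(train_a, train_b, atol=1e-6)
print(same_in_eval, same_in_train)
```
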
### Inference on real data

```python
import torch
from datasets import load_dataset
from transformers import AutoModel

ds = load_dataset("ConvergeBio/virtual-cell-distil-bulk-example", split="validation")

model = AutoModel.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    trust_remote_code=True,
).eval()

sample = torch.tensor(ds[0]["bulk_expression"]).unsqueeze(0)  # [1, 18301]
with torch.no_grad():
    out = model(input_ids=sample)
print(out["embeddings"].shape)  # [1, 512]
```

> **Note:** `ConvergeBio/virtual-cell-distil-bulk-example` is a minimal sample dataset
> intended only to verify the data format and run a quick end-to-end check.
> Metrics computed on this dataset are not meaningful and should not be interpreted
> as model performance.
## Fine-tuning for classification

The pretrained encoder can be fine-tuned on any bulk RNA-seq classification task.
A linear head is added on top; the encoder weights are initialised from the distilled
checkpoint and optionally frozen.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "ConvergeBio/virtual-cell-distil-bulk",
    num_labels=2,
    ignore_mismatched_sizes=True,  # classification head is randomly initialised
    trust_remote_code=True,
)
```
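
As an illustration of what training a linear head on a frozen encoder involves, here is a minimal sketch on toy modules. It shows one common way to freeze an encoder in PyTorch and is not necessarily what `train.py` does with `--freeze_encoder`. The `encoder` and `head` objects are hypothetical stand-ins; the attribute names inside the released model may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-ins for the pretrained encoder and the new linear head.
encoder = nn.Sequential(nn.Linear(32, 16), nn.BatchNorm1d(16), nn.PReLU())
head = nn.Linear(16, 2)  # randomly initialised classification head

# Freeze the encoder: no gradients for its weights, and eval mode so that
# the BatchNorm running statistics are not updated during training.
for param in encoder.parameters():
    param.requires_grad = False
encoder.eval()

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

x = torch.randn(8, 32)         # stand-in expression batch
y = torch.randint(0, 2, (8,))  # stand-in binary labels
encoder_weight_before = encoder[0].weight.clone()
head_weight_before = head.weight.clone()

logits = head(encoder(x))
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
optimizer.step()
```

After the step, the encoder weights are unchanged and only the head has moved, which is the behaviour a frozen-encoder setup is meant to guarantee.
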
**Binary classification (e.g. disease vs. healthy) with frozen encoder:**

```bash
python train.py \
  --dataset_path <your_dataset> \
  --num_classes 2 \
  --freeze_encoder \
  --output_dir ./my_binary_model
```

**Multi-class fine-tuning:**

```bash
python train.py \
  --dataset_path <your_dataset> \
  --num_classes <N> \
  --output_dir ./my_finetuned_model \
  --num_train_epochs 15 \
  --learning_rate 1e-4
```
## Preparing your data

`train.py` expects a Hugging Face dataset with a `train` split and, optionally, a `validation` split.
Each row represents one patient sample:

| Column | Shape | Type | Description |
|---|---|---|---|
| `bulk_expression` | [18301] | float32 | Log-normalised bulk gene expression, aligned to `gene_names.txt` |
| `labels` | scalar | int | Class index |

Input expression should be library-size normalised (target sum 10,000) and then log1p
transformed. The gene axis must be aligned to the 18,301 genes in `gene_names.txt`, in the
same order: missing genes are zero-filled and extra genes are dropped.

For a guide on building this dataset from raw count matrices, see the
[example dataset](https://huggingface.co/datasets/ConvergeBio/virtual-cell-distil-bulk-example).
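
The normalisation and alignment steps described above can be sketched in NumPy as follows. The function name `normalise_and_align` is hypothetical, and the sketch assumes that library size is computed over all measured genes before alignment; the card does not say whether normalisation happens before or after dropping extra genes, so check the example dataset guide for the authoritative procedure.

```python
import numpy as np


def normalise_and_align(counts, sample_genes, vocab_genes, target_sum=10_000.0):
    """Library-size normalise, log1p transform, and align to a gene vocabulary.

    counts:       [n_samples, n_sample_genes] raw counts
    sample_genes: gene symbols for the columns of `counts`
    vocab_genes:  ordered target vocabulary (e.g. the lines of gene_names.txt)
    Returns a float32 array of shape [n_samples, len(vocab_genes)].
    """
    counts = np.asarray(counts, dtype=np.float64)
    library_size = counts.sum(axis=1, keepdims=True)
    library_size[library_size == 0] = 1.0  # avoid dividing by zero for empty samples
    logged = np.log1p(counts / library_size * target_sum)

    column_of = {gene: i for i, gene in enumerate(sample_genes)}
    aligned = np.zeros((counts.shape[0], len(vocab_genes)), dtype=np.float32)
    for j, gene in enumerate(vocab_genes):
        i = column_of.get(gene)
        if i is not None:  # genes missing from the sample stay zero
            aligned[:, j] = logged[:, i]
    return aligned             # genes not in the vocabulary are dropped


# Tiny worked example with a 3-gene vocabulary.
vocab = ["A", "B", "C"]
measured = ["C", "X", "A"]            # "B" is missing, "X" is extra
raw = np.array([[10.0, 30.0, 60.0]])  # library size 100
bulk_expression = normalise_and_align(raw, measured, vocab)
```

With the real vocabulary, `vocab_genes` would be read from `gene_names.txt`, one symbol per line, and each row of the result would fill the `bulk_expression` column.
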
## Repository contents

| File | Description |
|---|---|
| `modeling_virtual_cell_distil.py` | Full model implementation |
| `config.json` | Architecture config |
| `gene_names.txt` | Ordered list of 18,301 HGNC gene symbols |
| `train.py` | Classification fine-tuning script |
| `requirements.txt` | Python dependencies |
| `model.safetensors` | Pretrained encoder weights |
## Citation

If you use this model, please cite:

```bibtex
@article{convergecell2026,
  author = {ConvergeBio},
  title  = {ConvergeCELL: An end-to-end platform from patient transcriptomics to therapeutic hypotheses},
  year   = {2026},
  note   = {Preprint available on bioRxiv},
}
```
## License

Apache 2.0. See [LICENSE](LICENSE) and [NOTICE](NOTICE).