---
library_name: transformers
license: mit
---

# Model Card for PULSAR-pbmc

<!-- Provide a quick summary of what the model is/does. -->

**PULSAR** (Patient Understanding Leveraging Single-cell universAl Representation) is a multi-scale, multi-cellular foundation model for human peripheral blood mononuclear cells (PBMCs). It transforms a set of single-cell transcriptomes into an interpretable **donor-level embedding** that preserves single-cell resolution while capturing multicellular composition and coordination.

This repo hosts the **zero-shot PBMC model** (`PULSAR-pbmc`) used to produce donor embeddings without task-specific fine-tuning. A disease-aligned variant is also available (see **Model Sources**).

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
PULSAR is a hierarchical, multi-scale foundation model for PBMC scRNA-seq that converts unordered sets of single cells into a 512-d donor embedding while preserving single-cell resolution. It integrates molecular priors from ESM2 protein embeddings, cellular representations from Universal Cell Embeddings (UCE, 1,280-d), and a Multicellular Transformer encoder–decoder trained with a high-masking-ratio masked cell modeling objective. Pretraining proceeds in two stages: a pan-tissue CELLxGENE corpus (≈36.2M cells; 6,807 samples), followed by continual pretraining on blood (≈8.74M cells; 2,588 samples).

The resulting donor embeddings support zero-shot and lightweight-head downstream tasks, including:

- large-scale reference mapping for disease classification (state-of-the-art accuracy with strong external generalization),
- regression of plasma proteomics from transcriptomes,
- forecasting of future outcomes (e.g., RA conversion in ACPA+ individuals and influenza vaccine responsiveness), and
- individualized cytokine perturbation modeling across donor, cellular, and gene levels.

A "virtual instrument" conditions on cytokine protein embeddings to transform baseline donor states and, with the decoder and an optional UCE→expression head, generates perturbed cell distributions and gene programs. Attention over cells provides mechanistic interpretability, highlighting disease- and severity-relevant subsets and enriching for antigen-specific clonotypes in viral infection.

PULSAR thus operationalizes the AI Virtual Cell vision by linking molecular, cellular, and multicellular organization into a unified, transferable representation for precision immunology.

- **Developed by:** Kuan Pang (Stanford University, kuanpang@stanford.edu)
- **Model type:** Transformer
- **License:** MIT
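
To make the masked cell modeling idea in the description more concrete, here is a minimal, standard-library-only sketch of the masking step on a toy set of cell embeddings. It is not the PULSAR training code: the mask ratio (0.8), the zero vector used as the mask token, the toy dimension (8 instead of 1,280), the mean-of-visible-cells placeholder predictor, and the MSE loss are all assumptions chosen for illustration.

```python
import random


def mask_cell_set(cell_embeddings, mask_ratio, rng):
    """Replace a random subset of cells with a zero mask token (hypothetical choice)."""
    n_cells = len(cell_embeddings)
    dim = len(cell_embeddings[0])
    n_masked = round(mask_ratio * n_cells)
    masked_idx = sorted(rng.sample(range(n_cells), n_masked))
    masked = set(masked_idx)
    corrupted = [
        [0.0] * dim if i in masked else list(cell)
        for i, cell in enumerate(cell_embeddings)
    ]
    return corrupted, masked_idx


def masked_mse(predictions, cell_embeddings, masked_idx):
    """Mean squared error computed only at the masked positions."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predictions[i], cell_embeddings[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
n_cells, dim = 100, 8  # toy sizes; real UCE cell embeddings are 1,280-d
cells = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_cells)]

# Hide most of the cells; the mask ratio here is an assumed value.
corrupted, masked_idx = mask_cell_set(cells, mask_ratio=0.8, rng=rng)
masked_set = set(masked_idx)

# Placeholder "model": predict the mean of the visible cells everywhere.
# A real encoder-decoder would predict each masked cell from the visible ones.
visible = [cells[i] for i in range(n_cells) if i not in masked_set]
mean_cell = [sum(column) / len(visible) for column in zip(*visible)]
predictions = [mean_cell for _ in range(n_cells)]

loss = masked_mse(predictions, cells, masked_idx)
print(len(masked_idx), "cells masked; reconstruction loss:", round(loss, 3))
```

The loss is evaluated only on the hidden cells, so the model is rewarded for inferring missing cells from the rest of the donor's sample rather than for copying its input.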

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/snap-stanford/PULSAR
- **Paper:** https://www.biorxiv.org/content/10.1101/2025.11.24.685470v1
- **Aligned version:** https://huggingface.co/KuanP/PULSAR-aligned

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

- Generate 512-d **donor embeddings** from PBMC scRNA-seq to:
  - Perform **reference mapping/retrieval** (kNN) for disease phenotypes
  - Build **lightweight predictors** for clinical variables (e.g., plasma proteomics, vaccine response)
  - Support **in-silico perturbation** pipelines (with the provided virtual-instrument and decoders)
  - Enable **interpretability** via attention over single cells and cell types
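
As an illustration of the kNN reference mapping use case listed above, the sketch below classifies a query donor by majority vote over its nearest reference donors under cosine similarity. The embeddings are synthetic 512-d vectors generated on the spot, not PULSAR outputs, and the function names, labels, and `k=5` are hypothetical choices for this example.

```python
import math
import random
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_predict(query, reference_embeddings, reference_labels, k=5):
    """Label a query donor by majority vote among its k most similar references."""
    sims = [cosine_similarity(query, ref) for ref in reference_embeddings]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    votes = Counter(reference_labels[i] for i in top)
    return votes.most_common(1)[0][0]


def make_donors(center, n, rng, noise=0.5):
    """Synthetic donor embeddings scattered around a class center."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


rng = random.Random(0)
dim = 512  # matches the donor embedding size stated in this card

# Two synthetic phenotype clusters standing in for a labeled reference atlas.
healthy_center = [rng.gauss(0.0, 1.0) for _ in range(dim)]
disease_center = [rng.gauss(0.0, 1.0) for _ in range(dim)]
reference = make_donors(healthy_center, 20, rng) + make_donors(disease_center, 20, rng)
labels = ["healthy"] * 20 + ["disease"] * 20

# A new donor drawn from the disease cluster.
query = make_donors(disease_center, 1, rng)[0]
predicted = knn_predict(query, reference, labels, k=5)
print(predicted)
```

With real data, `reference` would hold donor embeddings from a labeled cohort and `query` the embedding of a new donor; no model weights are updated, which is what makes this a zero-shot use.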

### Downstream Use

- Fine-tune/align the embedding space for a labeled task (e.g., contrastive alignment by disease label).
- Integrate with perturbation modules to predict donor-, cell-, and gene-level responses to cytokines.
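
One common way to express contrastive alignment by label is a supervised contrastive loss, which pulls same-label embeddings together and pushes different-label embeddings apart. The sketch below is a generic pure-Python version of that loss on toy 2-d vectors; it is an assumption for illustration and is not confirmed to be the objective used for the aligned PULSAR variant. The temperature value and the labels are likewise hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Average supervised contrastive loss over anchors with at least one positive."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp over all other samples, shifted for numerical stability.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits.values())
        )
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy donor embeddings forming two tight groups.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Labels that agree with the geometry versus labels that cut across it.
loss_aligned = supervised_contrastive_loss(embeddings, ["RA", "RA", "healthy", "healthy"])
loss_mixed = supervised_contrastive_loss(embeddings, ["RA", "healthy", "RA", "healthy"])
print(loss_aligned < loss_mixed)
```

The loss is low when donors sharing a label already sit close together, so minimizing it over a trainable projection of the embeddings would reshape the space around the chosen labels.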

### Out-of-Scope Use

The model may not work for sample types other than PBMCs; this includes cell-sorted samples.

## How to Get Started with the Model

See the GitHub repository (https://github.com/snap-stanford/PULSAR) for code and usage details.

## Training Details

### Training Data

Stage-1 pretraining corpus: CZ CELLxGENE Census (LTS 2023-07-25), 36.2M cells, 6,807 samples across 53 tissues and 69 conditions.

Stage-2 continual pretraining (blood focus): 8.736M cells, 2,588 blood/PBMC samples (balanced sexes; broad ages).

More details can be found in the Paper and GitHub.

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```
@article{pang2025pulsar,
  title={PULSAR: a Foundation Model for Multi-scale and Multicellular Biology},
  author={Pang, Kuan and Rosen, Yanay and Kedzierska, Kasia and He, Ziyuan and Rajagopal, Abhe and Gustafson, Claire E and Huynh, Grace and Leskovec, Jure},
  journal={bioRxiv},
  pages={2025--11},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
```