---
library_name: transformers
license: mit
---

# Model Card for PULSAR-pbmc 

<!-- Provide a quick summary of what the model is/does. -->

**PULSAR** (Patient Understanding Leveraging Single-cell universAl Representation) is a multi-scale, multi-cellular foundation model for human peripheral blood mononuclear cells (PBMCs). It transforms a set of single-cell transcriptomes into an interpretable **donor-level embedding** that preserves single-cell resolution while capturing multicellular composition and coordination.

This repo hosts the **zero-shot PBMC model** (`PULSAR-pbmc`) used to produce donor embeddings without task-specific fine-tuning. A disease-aligned variant is also available (see **Model Sources**).


## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
PULSAR (Patient Understanding Leveraging Single-cell universAl Representation) is a hierarchical, multi-scale foundation model for PBMC scRNA-seq that converts unordered sets of single cells into a 512-d donor embedding while preserving single-cell resolution. It integrates three components: molecular priors from ESM2 protein embeddings, cellular representations from Universal Cell Embeddings (UCE, 1,280-d), and a Multicellular Transformer encoder–decoder trained with a Masked Cell Modeling objective at a high masking ratio. Pretraining proceeds in two stages: a pan-tissue CELLxGENE corpus (≈36.2M cells; 6,807 samples), followed by continual pretraining on blood (≈8.74M cells; 2,588 samples).

The resulting donor embeddings support zero-shot and lightweight-head downstream tasks, including large-scale reference mapping for disease classification (state-of-the-art accuracy with strong external generalization), regression of plasma proteomics from transcriptomes, forecasting of future outcomes (e.g., RA conversion in ACPA+ individuals and influenza vaccine responsiveness), and individualized cytokine perturbation modeling across donor, cellular, and gene levels.

A "virtual instrument" conditions on cytokine protein embeddings to transform baseline donor states and, together with the decoder and an optional UCE→expression head, generates perturbed cell distributions and gene programs. Attention over cells provides mechanistic interpretability, highlighting disease- and severity-relevant subsets and enriching for antigen-specific clonotypes in viral infection. PULSAR thus operationalizes the AI Virtual Cell vision by linking molecular, cellular, and multicellular organization into a unified, transferable representation for precision immunology.
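
To make the masking idea concrete, the sketch below shows one generic way to hide a large fraction of an unordered cell set so that a model can be trained to reconstruct the hidden cells from the visible ones. It is an illustration only, not the PULSAR training code: the function name `mask_cells`, the `mask_ratio` default of 0.9, and the visible/target split are all assumptions, since this card does not state the actual ratio or implementation.

```python
import random


def mask_cells(cell_embeddings, mask_ratio=0.9, seed=None):
    """Randomly split a set of cell embeddings into visible and masked cells.

    Hypothetical helper: `mask_ratio=0.9` is an assumed value chosen only to
    illustrate "high masking"; it is not taken from the PULSAR paper or code.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")

    rng = random.Random(seed)
    n_cells = len(cell_embeddings)
    n_masked = int(round(n_cells * mask_ratio))

    # Cells form an unordered set, so masking is a uniform random subset.
    masked_idx = set(rng.sample(range(n_cells), n_masked))

    visible = [c for i, c in enumerate(cell_embeddings) if i not in masked_idx]
    targets = [c for i, c in enumerate(cell_embeddings) if i in masked_idx]
    return visible, targets, sorted(masked_idx)


# Toy donor with 20 cells; 4-d vectors stand in for 1,280-d UCE embeddings.
toy_cells = [[float(i)] * 4 for i in range(20)]
visible, targets, masked_idx = mask_cells(toy_cells, mask_ratio=0.9, seed=0)
print(len(visible), "visible cells,", len(targets), "masked cells")
```

In a setup like this, an encoder would see only `visible`, and a decoder would be scored on how well it recovers `targets`.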

- **Developed by:** Kuan Pang (Stanford University, kuanpang@stanford.edu)
- **Model type:** Transformer
- **License:** MIT

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/snap-stanford/PULSAR
- **Paper:** https://www.biorxiv.org/content/10.1101/2025.11.24.685470v1
- **Aligned version:** https://huggingface.co/KuanP/PULSAR-aligned

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use


- Generate 512-d **donor embeddings** from PBMC scRNA-seq to:
  - Perform **reference mapping/retrieval** (kNN) for disease phenotypes
  - Build **lightweight predictors** for clinical variables (e.g., plasma proteomics, vaccine response)
  - Support **in-silico perturbation** pipelines (with the provided virtual-instrument and decoders)
  - Enable **interpretability** via attention over single cells and cell types
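
As an illustration of the reference-mapping use case, the following minimal sketch classifies a query donor by majority vote among its nearest reference donors in embedding space. It assumes donor embeddings have already been computed and are available as plain lists of floats. The function names, the choice of cosine similarity, and `k` are illustrative assumptions, not part of the PULSAR API.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(query, reference_embeddings, reference_labels, k=5):
    """Return the majority label among the k most similar reference donors.

    Hypothetical helper for illustration; ties go to the label that appears
    first among the nearest neighbours.
    """
    if len(reference_embeddings) != len(reference_labels):
        raise ValueError("embeddings and labels must have the same length")

    ranked = sorted(
        zip(reference_embeddings, reference_labels),
        key=lambda pair: cosine_similarity(query, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy example: 4-d vectors stand in for 512-d donor embeddings.
reference = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.9, 0.1, 0.0],
]
labels = ["healthy", "healthy", "disease", "disease"]
print(knn_predict([1.0, 0.05, 0.0, 0.0], reference, labels, k=3))  # healthy
```

For large reference atlases, an approximate nearest-neighbour index would likely be preferable to this exhaustive scan.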

### Downstream Use


- Fine-tune/align the embedding space for a labeled task (e.g., contrastive alignment by disease label).
- Integrate with perturbation modules to predict donor-, cell-, and gene-level responses to cytokines.
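
One common way to align an embedding space by label is a supervised contrastive loss, which pulls together donors that share a label and pushes apart those that do not. The sketch below computes such a loss on a small batch in pure Python. It follows the general SupCon formulation and is an assumption for illustration; the objective actually used for the aligned PULSAR variant is not specified in this card, and the function names and `temperature` value are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean SupCon-style loss over anchors that have at least one positive.

    Illustrative sketch only; `temperature=0.1` is an assumed default.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0

    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a same-label partner are skipped

        # Scaled cosine similarity of the anchor to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1

    return total / n_anchors if n_anchors else 0.0


# Toy batch: 2-d vectors stand in for 512-d donor embeddings.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supervised_contrastive_loss(batch, ["a", "a", "b", "b"]))
print(supervised_contrastive_loss(batch, ["a", "b", "a", "b"]))
```

The first call, where labels match the geometric clusters, yields a much smaller loss than the second, where labels cut across them. In practice this would be written with an autodiff framework so the loss can drive gradient updates.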


### Out-of-Scope Use

The model may not perform well on tissue types other than PBMCs. This limitation also applies to samples obtained by cell sorting.


## How to Get Started with the Model

For installation and usage instructions, see the GitHub repository: https://github.com/snap-stanford/PULSAR


## Training Details

### Training Data

Stage-1 pretraining corpus: CZ CELLxGENE Census (LTS 2023-07-25), 36.2M cells, 6,807 samples across 53 tissues and 69 conditions.

Stage-2 continual pretraining (blood focus): 8.736M cells, 2,588 blood/PBMC samples (balanced sexes; broad ages).


More details can be found in the paper and the GitHub repository.

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@article{pang2025pulsar,
  title={PULSAR: a Foundation Model for Multi-scale and Multicellular Biology},
  author={Pang, Kuan and Rosen, Yanay and Kedzierska, Kasia and He, Ziyuan and Rajagopal, Abhe and Gustafson, Claire E and Huynh, Grace and Leskovec, Jure},
  journal={bioRxiv},
  pages={2025--11},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
```