---
library_name: transformers
license: mit
base_model:
- KuanP/PULSAR-pbmc
---

# Model Card for PULSAR-pbmc 

<!-- Provide a quick summary of what the model is/does. -->

**PULSAR** (Patient Understanding Leveraging Single-cell universAl Representation) is a multi-scale, multi-cellular foundation model for human peripheral blood mononuclear cells (PBMCs). It transforms a set of single-cell transcriptomes into an interpretable **donor-level embedding** that preserves single-cell resolution while capturing multicellular composition and coordination.

This repo hosts the **aligned PBMC model** (`PULSAR-aligned`), which produces donor embeddings aligned for disease classification. A base model is also available (see **Model Sources**).


## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
PULSAR is a hierarchical, multi-scale foundation model for PBMC scRNA-seq. It converts an unordered set of single cells into a 512-d donor embedding while preserving single-cell resolution. The model integrates three levels of representation: molecular priors from ESM2 protein embeddings, cellular representations from Universal Cell Embeddings (UCE, 1,280-d), and a Multicellular Transformer encoder–decoder trained with a Masked Cell Modeling objective at a high masking ratio.

Pretraining proceeds in two stages: first on a pan-tissue CELLxGENE corpus (≈36.2M cells; 6,807 samples), then continual pretraining on blood (≈8.74M cells; 2,588 samples).

The resulting donor embeddings support zero-shot and lightweight-head downstream tasks:

- Large-scale reference mapping for disease classification (state-of-the-art accuracy with strong external generalization)
- Regression of plasma proteomics from transcriptomes
- Forecasting of future outcomes (e.g., conversion to rheumatoid arthritis (RA) in ACPA+ individuals and influenza vaccine responsiveness)
- Individualized cytokine perturbation modeling across donor, cellular, and gene levels

A "virtual instrument" conditions on cytokine protein embeddings to transform baseline donor states. Combined with the decoder and an optional UCE→expression head, it generates perturbed cell distributions and gene programs. Attention over cells provides mechanistic interpretability: it highlights disease- and severity-relevant cell subsets and enriches for antigen-specific clonotypes in viral infection. PULSAR thus operationalizes the AI Virtual Cell vision by linking molecular, cellular, and multicellular organization into a unified, transferable representation for precision immunology.
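To make the Masked Cell Modeling idea concrete, here is a minimal sketch of masking a set of cell embeddings and scoring a reconstruction only on the masked cells. It is an illustration under stated assumptions, not the PULSAR training code: the 0.8 mask ratio, the all-zero mask token, the mean-squared-error loss, and the helper names `mask_cells` and `masked_mse` are all hypothetical, and the toy embeddings are 4-d rather than 1,280-d.

```python
import random


def mask_cells(cell_embeddings, mask_ratio, mask_token, rng):
    """Replace a random fraction of cells with a mask token.

    Returns the corrupted set and the sorted indices of the masked cells.
    """
    n_cells = len(cell_embeddings)
    n_masked = int(round(mask_ratio * n_cells))
    masked_idx = set(rng.sample(range(n_cells), n_masked))
    corrupted = [
        mask_token if i in masked_idx else cell
        for i, cell in enumerate(cell_embeddings)
    ]
    return corrupted, sorted(masked_idx)


def masked_mse(predicted, target, masked_idx):
    """Mean squared error over masked cells only (assumed loss, for illustration)."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)
dim = 4  # toy stand-in for the 1,280-d UCE cell embeddings
cells = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(10)]
mask_token = [0.0] * dim  # hypothetical mask token

# ASSUMPTION: 0.8 is an arbitrary "high" ratio; the card does not state the value.
corrupted, masked_idx = mask_cells(cells, 0.8, mask_token, rng)

# A model would predict the masked cells from the visible ones; here the
# corrupted input itself stands in for an (uninformative) prediction.
loss = masked_mse(corrupted, cells, masked_idx)
```

Restricting the loss to masked positions means the model is rewarded only for inferring hidden cells from the visible ones, which is what pushes the representation to capture multicellular context.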

- **Developed by:** Kuan Pang (Stanford University, kuanpang@stanford.edu)
- **Model type:** Transformer
- **License:** MIT

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/snap-stanford/PULSAR
- **Paper:** https://www.biorxiv.org/content/10.1101/2025.11.24.685470v1
- **Aligned version:** https://huggingface.co/KuanP/PULSAR-pbmc

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use


- Generate 512-d **donor embeddings** from PBMC scRNA-seq and use them for **reference mapping/retrieval** (kNN) of disease phenotypes
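The kNN reference mapping mentioned above can be sketched as follows. This is a minimal, dependency-free illustration, assuming donor embeddings have already been computed; the cosine-similarity metric, the majority vote, `k=3`, the helper names, and the toy 4-d vectors and labels are assumptions made for the example and are not taken from the PULSAR codebase. In practice a library such as NumPy or scikit-learn would be the natural choice for 512-d embeddings.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, reference, labels, k=3):
    """Majority vote over the k reference donors most similar to the query."""
    ranked = sorted(
        range(len(reference)),
        key=lambda i: cosine_similarity(query, reference[i]),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy stand-ins for 512-d donor embeddings (4-d here for readability).
reference = [
    [1.0, 0.1, 0.0, 0.0],
    [0.9, 0.2, 0.1, 0.0],
    [0.8, 0.0, 0.1, 0.1],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
    [0.0, 0.2, 0.8, 0.9],
]
labels = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]

# A new donor whose embedding lies close to the "disease" reference donors.
query = [0.05, 0.1, 0.95, 0.9]
predicted = knn_predict(query, reference, labels, k=3)
```

Because the prediction depends only on distances in embedding space, new reference donors can be added without retraining anything.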



### Out-of-Scope Use

The model may not perform well on samples other than unsorted PBMCs. This includes other tissue types as well as cell-sorted samples.


## How to Get Started with the Model

Use the code below to get started with the model.
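A minimal loading sketch, assuming the checkpoint can be loaded through the `transformers` `AutoModel` class with `trust_remote_code=True`. This card lists `transformers` as the library but does not confirm that loading path, so treat it as an assumption; the helper name `load_pulsar` is hypothetical, and the repository ID is the one linked in this card. The GitHub repository remains the reference for the supported preprocessing and inference pipeline.

```python
REPO_ID = "KuanP/PULSAR-pbmc"  # Hugging Face repo ID linked in this card


def load_pulsar(repo_id=REPO_ID):
    """Download and instantiate the checkpoint from the Hugging Face Hub.

    ASSUMPTION: the repo ships custom model code that AutoModel can load;
    this is not confirmed by the model card.
    """
    # Imported lazily so this module can be read without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(repo_id, trust_remote_code=True)


# Usage (requires network access and the `transformers` package):
#   model = load_pulsar()
# How cells are converted to UCE embeddings and passed to the model is
# defined in the GitHub repository and is not reproduced here.
```

`trust_remote_code=True` executes Python code shipped in the model repository, so it is worth reviewing that code before enabling the flag.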


## Training Details

### Training Data

Stage-1 pretraining corpus: CZ CELLxGENE Census (LTS 2023-07-25), 36.2M cells, 6,807 samples across 53 tissues and 69 conditions.

Stage-2 continual pretraining (blood focus): 8.736M cells, 2,588 blood/PBMC samples (balanced sexes; broad ages).


More details can be found in the paper and the GitHub repository.

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```
@article{pang2025pulsar,
  title={PULSAR: a Foundation Model for Multi-scale and Multicellular Biology},
  author={Pang, Kuan and Rosen, Yanay and Kedzierska, Kasia and He, Ziyuan and Rajagopal, Abhe and Gustafson, Claire E and Huynh, Grace and Leskovec, Jure},
  journal={bioRxiv},
  pages={2025--11},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
```