---
license: cc-by-nc-4.0
tags:
- histopathology
- spatial-transcriptomics
- multimodal
- vision-language
- clip
- cell-type-annotation
library_name: pytorch
pipeline_tag: zero-shot-image-classification
---
# SpatialWhisperer
SpatialWhisperer is a trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions into a shared 2048-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.
This repository hosts the main checkpoint (seed=0) from the ICML 2026 paper *[Transitive Representation Learning Enhances Histopathology Annotation](https://openreview.net/forum?id=Ze7U293Zw4)* (Schaefer et al., PMLR vol. 306).
## Model architecture
Three encoders project into a shared embedding space:
| Modality | Encoder | Freezing |
|----------------|------------------|----------|
| Image (H&E) | UNI2 | locked |
| Transcriptome | Geneformer (12L) | locked |
| Text | BioBERT v1.1 | unfrozen |
Following the LiT convention, the freezing pattern is **LUL** (image locked, text unlocked, transcriptome locked). Only the text tower and the three projection heads are trained. The projection dimension is 2048.
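The freezing pattern can be written down as a small configuration sketch. The `Tower` dataclass, its field names, and the helper functions are illustrative and are not part of the cellwhisperer code base; only the encoder names, the locked/unlocked flags, and the projection dimension come from this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tower:
    """Illustrative description of one modality tower (hypothetical helper)."""
    modality: str
    encoder: str
    locked: bool  # True: encoder weights stay frozen during training


# Order follows the LUL naming: image, text, transcriptome.
TOWERS = (
    Tower("image", "UNI2", locked=True),
    Tower("text", "BioBERT v1.1", locked=False),
    Tower("transcriptome", "Geneformer (12L)", locked=True),
)

PROJECTION_DIM = 2048  # every tower projects into this shared space


def lit_pattern(towers):
    """Return the LiT-style pattern string: 'L' for locked, 'U' for unlocked."""
    return "".join("L" if t.locked else "U" for t in towers)


def trainable_components(towers):
    """List what receives gradients: unlocked encoders plus every projection head."""
    encoders = [f"{t.modality} encoder" for t in towers if not t.locked]
    heads = [f"{t.modality} projection head" for t in towers]
    return encoders + heads


print(lit_pattern(TOWERS))  # LUL
print(trainable_components(TOWERS))
```

Under this reading, four components are trained: the text encoder and one projection head per modality.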
## Training data
Three paired datasets cover the three modality pairs:
- **HEST-1K** — H&E ↔ spatial gene expression (Visium-style spots)
- **cellxgene_census** — gene expression ↔ free-text cell/sample metadata
- **ARCHS4/GEO** — gene expression ↔ free-text sample descriptions
Training ran for 4 epochs with AdamW (learning rate 1e-5, cosine schedule with 3% warmup) at batch size 512 on a single H100 GPU. This checkpoint reflects epoch 3, global step 14624.
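For readers unfamiliar with the schedule, the following is a minimal sketch of a cosine learning-rate schedule with warmup. It assumes a linear warmup from zero and a decay to zero at the final step, neither of which this card states. The function name `lr_at` and the step count `TOTAL` are hypothetical and chosen only for illustration.

```python
import math


def lr_at(step, total_steps, base_lr=1e-5, warmup_frac=0.03):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Assumptions (not stated in the card): warmup is linear from 0,
    and the cosine decays to 0 at `total_steps`.
    """
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


TOTAL = 1000  # hypothetical step count, for illustration only
for s in (0, 15, 30, 515, 1000):
    print(s, f"{lr_at(s, TOTAL):.2e}")
```

With these assumptions the rate peaks at 1e-5 once warmup ends (step 30 of 1000 here) and is halved midway through the cosine phase.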
## Evaluation
Reported AUROC on cell-type benchmarks (mean across cell types):
| Benchmark | SpatialWhisperer | Best published baseline | Δ rel. |
|-----------|------------------|-------------------------|---------|
| PathoCell | **0.630** | 0.554 | +13.7% |
| Lizard | (see paper) | — | +15.9% |
| PanNuke | (see paper) | — | +13.7% |
Modality-pair benchmarks (Tabula Sapiens, HEST-1K, Skin Conditions) confirm the trimodal model retains per-pair performance under low-n subsampling. See the paper for full numbers.
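As a reading aid for the cell-type table, the sketch below shows one common way to obtain a mean-across-cell-types AUROC and the relative delta. It assumes one-vs-rest scoring with equal weight per cell type; the paper's exact protocol may differ. All function names are illustrative.

```python
import numpy as np


def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    diff = pos[:, None] - neg[None, :]  # all positive/negative score pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def mean_auroc(score_matrix, true_class):
    """One-vs-rest AUROC per cell type, averaged with equal weight per type."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    true_class = np.asarray(true_class)
    per_class = [
        auroc(score_matrix[:, c], true_class == c)
        for c in range(score_matrix.shape[1])
    ]
    return float(np.mean(per_class))


def relative_gain(model, baseline):
    """Relative improvement of `model` over `baseline`, in percent."""
    return 100.0 * (model - baseline) / baseline


# PathoCell row: (0.630 - 0.554) / 0.554
print(f"{relative_gain(0.630, 0.554):+.1f}%")  # +13.7%
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a sketch but slow on large benchmarks.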
## How to use
The checkpoint is a stripped Lightning state-dict (~505 MB, 236 tensors covering the trained BioBERT text tower and the three 2048-d projection heads) plus its `hyper_parameters` block. **Foundation model weights are NOT included**: the locked UNI2 image encoder and the locked Geneformer transcriptome encoder are re-instantiated at load time from their original providers and remain under their respective licenses. The checkpoint sets `hyper_parameters.model_config.use_cache = True`, which triggers the `FrozenCachedModel` wrapping that excludes the locked towers from `state_dict` during load.
Loading requires the cellwhisperer model code at <https://github.com/Good-Lab/spatialwhisperer> and the foundation models (UNI2, Geneformer, BioBERT v1.1), which the cellwhisperer setup scripts download.
```python
from cellwhisperer.utils.model_io import load_cellwhisperer_model
model, tokenizer, transcriptome_proc, image_proc = load_cellwhisperer_model(
    model_path="hf://Good-Lab/spatialwhisperer"
)
# model is a TranscriptomeTextDualEncoderLightning in eval mode
```
While the repo is private, export a token first:
```bash
export HUGGINGFACE_TOKEN=$(pass api_keys/huggingface_write) # or any read token with access
```
To compute image–text similarities for zero-shot cell-type annotation, encode patches and class-name strings, then take cosine similarity in the shared 2048-d space. See `examples/zero_shot_celltype.py` in the model code repository.
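The similarity step itself is independent of the model code and can be sketched with NumPy. The example below uses random vectors as stand-ins for real patch and class-name embeddings. The function names and the class labels are illustrative and do not come from `examples/zero_shot_celltype.py`.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def zero_shot_annotate(image_emb, text_emb, class_names):
    """Assign each patch the class whose text embedding is most similar.

    image_emb: (n_patches, d) patch embeddings in the shared space
    text_emb:  (n_classes, d) embeddings of the class-name strings
    """
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T  # (n_patches, n_classes)
    best = sims.argmax(axis=1)
    return [class_names[i] for i in best], sims


# Random stand-ins for real embeddings; d=2048 matches the projection dimension.
rng = np.random.default_rng(0)
class_names = ["lymphocyte", "epithelial cell", "fibroblast"]  # example labels
text_emb = rng.normal(size=(3, 2048))
# Patches built as noisy copies of class 2 and class 0, so the answer is known.
image_emb = text_emb[[2, 0]] + 0.1 * rng.normal(size=(2, 2048))

labels, sims = zero_shot_annotate(image_emb, text_emb, class_names)
print(labels)  # ['fibroblast', 'lymphocyte']
```

In practice `image_emb` and `text_emb` would come from the loaded model's image and text towers. Applying a softmax over each row of `sims` gives per-class scores that can feed an AUROC evaluation.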
## Intended use & limitations
**Intended.** Research on multimodal histopathology, cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.
**Not intended.** Clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.
**Known limitations.**
- Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
- BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
- The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining will be lower.
## File contents
- `spatialwhisperer.ckpt` — Lightning checkpoint (state_dict + hyper_parameters; optimizer/scheduler state stripped).
- `README.md` — this card.
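To make "stripped" concrete, here is a minimal sketch of reducing a Lightning-style checkpoint dictionary to the two blocks listed above. It is not the authors' export script: `strip_checkpoint`, the key prefixes, and the toy tensor names are hypothetical, and plain strings stand in for tensors. The top-level keys `state_dict`, `hyper_parameters`, `optimizer_states`, and `lr_schedulers` follow the usual Lightning checkpoint layout.

```python
def strip_checkpoint(ckpt, locked_prefixes):
    """Keep weights and hyperparameters; drop training state and locked-tower tensors.

    ckpt: dict shaped like a Lightning checkpoint
    locked_prefixes: state_dict key prefixes of frozen towers (hypothetical names)
    """
    state_dict = {
        name: tensor
        for name, tensor in ckpt["state_dict"].items()
        if not name.startswith(tuple(locked_prefixes))
    }
    return {
        "state_dict": state_dict,
        "hyper_parameters": ckpt.get("hyper_parameters", {}),
    }


# Toy checkpoint: strings stand in for tensors, key names are made up.
full = {
    "state_dict": {
        "image_model.encoder.weight": "frozen UNI2 weight",
        "text_model.encoder.weight": "trained BioBERT weight",
        "text_model.projection.weight": "trained projection weight",
    },
    "hyper_parameters": {"model_config": {"use_cache": True}},
    "optimizer_states": ["AdamW state"],
    "lr_schedulers": ["cosine scheduler state"],
}

slim = strip_checkpoint(full, locked_prefixes=["image_model.encoder."])
print(sorted(slim))                # ['hyper_parameters', 'state_dict']
print(sorted(slim["state_dict"]))  # ['text_model.encoder.weight', 'text_model.projection.weight']
```

Dropping the optimizer and scheduler state means the file can be used for inference but not for resuming training.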
## Citation
```bibtex
@inproceedings{schaefer2026spatialwhisperer,
title = {Transitive Representation Learning Enhances Histopathology Annotation},
author = {Schaefer, Moritz and Piran, Zoe and Walter, Nils Philipp and Awasthi, Animesh and Bock, Christoph and Leskovec, Jure and Good, Zinaida},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
series = {Proceedings of Machine Learning Research},
volume = {306},
publisher = {PMLR},
address = {Seoul, South Korea},
month = jul,
year = {2026},
url = {https://openreview.net/forum?id=Ze7U293Zw4}
}
```
## License
CC BY-NC 4.0 (research use). Foundation model weights (UNI2, Geneformer, BioBERT) carry their own licenses; please consult upstream repositories.