| --- |
| license: cc-by-nc-4.0 |
| tags: |
| - histopathology |
| - spatial-transcriptomics |
| - multimodal |
| - vision-language |
| - clip |
| - cell-type-annotation |
| library_name: pytorch |
| pipeline_tag: zero-shot-image-classification |
| --- |
| |
| # SpatialWhisperer |
|
|
| SpatialWhisperer is a trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions in a shared 2048-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying of histopathology and spatial transcriptomics data. |
|
|
| This repository hosts the main checkpoint (seed=0) from the ICML 2026 paper *[Transitive Representation Learning Enhances Histopathology Annotation](https://openreview.net/forum?id=Ze7U293Zw4)* (Schaefer et al., PMLR vol. 306). |
|
|
| ## Model architecture |
|
|
| Three encoders project into a shared embedding space: |
|
|
| | Modality | Encoder | Freezing | |
| |----------------|------------------|----------| |
| | Image (H&E) | UNI2 | locked | |
| | Transcriptome | Geneformer (12L) | locked | |
| | Text | BioBERT v1.1 | unfrozen | |
|
|
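| Following the LiT convention (L = locked, U = unlocked), the freezing pattern is **LUL**, read in the order image, text, transcriptome: the image tower is locked, the text tower is unlocked, and the transcriptome tower is locked. Only the text tower and the three projection heads are trained. Each projection head outputs a 2048-dimensional embedding. |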
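
As a minimal sketch of how the pattern string can be read, the helper below maps each letter to a trainable flag per tower. The function name `parse_freezing_pattern` and the `TOWER_ORDER` constant are hypothetical and not part of the cellwhisperer code; the tower order is taken from the reading of LUL given in this card. Projection heads are not covered by the pattern, since they are always trained.

```python
# Tower order assumed from this card's reading of "LUL":
# image, text, transcriptome.
TOWER_ORDER = ("image", "text", "transcriptome")


def parse_freezing_pattern(pattern: str) -> dict[str, bool]:
    """Map a LiT-style pattern such as "LUL" to {tower: is_trainable}."""
    if len(pattern) != len(TOWER_ORDER):
        raise ValueError(f"expected {len(TOWER_ORDER)} letters, got {pattern!r}")
    flags = {}
    for tower, letter in zip(TOWER_ORDER, pattern.upper()):
        if letter not in ("L", "U"):
            raise ValueError(f"unknown letter {letter!r}; use 'L' or 'U'")
        flags[tower] = letter == "U"  # U = unlocked, i.e. trainable
    return flags


trainable = parse_freezing_pattern("LUL")
print(trainable)
```

For this checkpoint only the text tower comes out as trainable, which matches the Freezing column of the table.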
|
|
| ## Training data |
|
|
| Three paired datasets cover the three modality pairs: |
|
|
| - **HEST-1K** — H&E ↔ spatial gene expression (Visium-style spots) |
| - **cellxgene_census** — gene expression ↔ free-text cell/sample metadata |
| - **ARCHS4/GEO** — gene expression ↔ free-text sample descriptions |
| |
| Training ran for 4 epochs with AdamW (learning rate 1e-5, cosine schedule with 3% warmup, batch size 512) on a single H100 GPU. This checkpoint reflects epoch 3, global step 14624. |
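
The learning-rate schedule can be sketched as follows. This is an illustration under stated assumptions, not the training code: the card gives only the peak rate (1e-5) and the warmup fraction (3%), so linear warmup, decay to zero, and the total step count are all assumed here, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-5,
               warmup_frac: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (both assumed)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


TOTAL_STEPS = 10_000  # hypothetical; the card does not state the total step count
for s in (0, 299, 300, 5_150, 9_999):
    print(s, f"{lr_at_step(s, TOTAL_STEPS):.2e}")
```

With these assumptions the rate reaches its peak at the end of warmup and falls to half the peak midway through the decay phase.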
| |
| ## Evaluation |
| |
| Reported AUROC on cell-type benchmarks (mean across cell types): |
| |
| | Benchmark | SpatialWhisperer | Best published baseline | Relative Δ | |
| |-----------|------------------|-------------------------|------------| |
| | PathoCell | **0.630** | 0.554 | +13.7% | |
| | Lizard | (see paper) | — | +15.9% | |
| | PanNuke | (see paper) | — | +13.7% | |
| |
| Modality-pair benchmarks (Tabula Sapiens, HEST-1K, Skin Conditions) confirm the trimodal model retains per-pair performance under low-n subsampling. See the paper for full numbers. |
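
The relative improvement in the PathoCell row of the table above can be reproduced from the two AUROC values. The snippet assumes the column is computed as (model − baseline) / baseline, which is consistent with the PathoCell numbers; `relative_improvement` is a hypothetical helper name.

```python
def relative_improvement(model_score: float, baseline_score: float) -> float:
    """Relative change of model_score over baseline_score, as a fraction."""
    return (model_score - baseline_score) / baseline_score


# PathoCell row: SpatialWhisperer 0.630 vs. best published baseline 0.554.
pathocell = relative_improvement(0.630, 0.554)
print(f"PathoCell relative improvement: {pathocell:+.1%}")
```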
| |
| ## How to use |
| |
| The checkpoint is a stripped Lightning state dict (~505 MB, 236 tensors) containing the trained BioBERT text tower and the three 2048-d projection heads, plus its `hyper_parameters` block. **Foundation model weights are NOT included.** The locked UNI2 image encoder and the locked Geneformer transcriptome encoder are re-instantiated at load time from their original providers and remain under their respective licenses. The checkpoint's `hyper_parameters.model_config.use_cache = True` flag triggers the `FrozenCachedModel` wrapping, which excludes the locked towers from `state_dict` during load. |
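
To check what the stripped checkpoint actually contains, one option is to group its `state_dict` keys by name prefix. The sketch below runs on toy key names, since the real parameter names are not listed in this card; `count_by_prefix` is a hypothetical helper, and the commented lines show how it might be applied to the downloaded file.

```python
from collections import Counter


def count_by_prefix(keys, depth: int = 1) -> Counter:
    """Count state-dict entries grouped by the first `depth` dotted name parts."""
    return Counter(".".join(k.split(".")[:depth]) for k in keys)


# Toy key names for illustration only; the real checkpoint's names may differ.
toy_keys = [
    "text_model.encoder.layer.0.weight",
    "text_model.encoder.layer.0.bias",
    "text_projection.weight",
    "image_projection.weight",
    "transcriptome_projection.weight",
]
summary = count_by_prefix(toy_keys)
print(summary)

# On the real file (not run here; needs PyTorch and the downloaded checkpoint):
#   import torch
#   ckpt = torch.load("spatialwhisperer.ckpt", map_location="cpu", weights_only=False)
#   print(len(ckpt["state_dict"]), count_by_prefix(ckpt["state_dict"]))
```

If the card's description holds, the real file should report 236 entries and no prefixes belonging to the UNI2 or Geneformer towers. Because `weights_only=False` unpickles arbitrary objects, use it only on checkpoints from a trusted source.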
|
|
| Loading requires the `cellwhisperer` package, whose code lives at <https://github.com/Good-Lab/spatialwhisperer>, and the foundation models (UNI2, Geneformer, BioBERT v1.1), which the cellwhisperer setup scripts download. |
|
|
| ```python |
| from cellwhisperer.utils.model_io import load_cellwhisperer_model |
| |
| model, tokenizer, transcriptome_proc, image_proc = load_cellwhisperer_model( |
| model_path="hf://Good-Lab/spatialwhisperer" |
| ) |
| # model is a TranscriptomeTextDualEncoderLightning in eval mode |
| ``` |
|
|
| While the repository is private, export a Hugging Face access token before loading: |
|
|
| ```bash |
| export HUGGINGFACE_TOKEN=$(pass api_keys/huggingface_write) # or any read token with access |
| ``` |
|
|
| To compute image–text similarities for zero-shot cell-type annotation, encode patches and class-name strings, then take cosine similarity in the shared 2048-d space. See `examples/zero_shot_celltype.py` in the model code repository. |
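
The similarity step described above can be sketched without the model itself. The example uses toy 3-d vectors in place of the 2048-d embeddings and plain Python in place of tensors; `cosine_similarity` and `zero_shot_label` are hypothetical helpers and are not taken from `examples/zero_shot_celltype.py`.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_label(patch_embedding, class_embeddings: dict) -> tuple:
    """Return (best_class_name, similarity) for one patch embedding."""
    scores = {name: cosine_similarity(patch_embedding, emb)
              for name, emb in class_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy 3-d vectors stand in for the 2048-d embeddings the model would produce.
class_embeddings = {
    "lymphocyte": [1.0, 0.0, 0.0],
    "epithelial cell": [0.0, 1.0, 0.0],
}
patch_embedding = [0.9, 0.1, 0.0]
label, score = zero_shot_label(patch_embedding, class_embeddings)
print(label, round(score, 3))
```

In practice the class embeddings would come from encoding class-name strings with the text tower and the patch embedding from the image tower, with the same argmax over cosine similarities.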
|
|
| ## Intended use & limitations |
|
|
| **Intended.** Research on multimodal histopathology, cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data. |
|
|
| **Not intended.** Clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use. |
|
|
| **Known limitations.** |
| - Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed. |
| - BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution. |
| - The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining data is likely to be lower. |
|
|
| ## File contents |
|
|
| - `spatialwhisperer.ckpt` — Lightning checkpoint (state_dict + hyper_parameters; optimizer/scheduler state stripped). |
| - `README.md` — this card. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @inproceedings{schaefer2026spatialwhisperer, |
| title = {Transitive Representation Learning Enhances Histopathology Annotation}, |
| author = {Schaefer, Moritz and Piran, Zoe and Walter, Nils Philipp and Awasthi, Animesh and Bock, Christoph and Leskovec, Jure and Good, Zinaida}, |
| booktitle = {Proceedings of the 43rd International Conference on Machine Learning}, |
| series = {Proceedings of Machine Learning Research}, |
| volume = {306}, |
| publisher = {PMLR}, |
| address = {Seoul, South Korea}, |
| month = jul, |
| year = {2026}, |
| url = {https://openreview.net/forum?id=Ze7U293Zw4} |
| } |
| ``` |
|
|
| ## License |
|
|
| CC BY-NC 4.0 (research use). Foundation model weights (UNI2, Geneformer, BioBERT) carry their own licenses; please consult upstream repositories. |
|
|