Good-Lab/spatialwhisperer

+---
+license: cc-by-nc-4.0
+tags:
+  - histopathology
+  - spatial-transcriptomics
+  - multimodal
+  - vision-language
+  - clip
+  - cell-type-annotation
+library_name: pytorch
+pipeline_tag: zero-shot-image-classification
+---
+# SpatialWhisperer
+SpatialWhisperer is a trimodal embedding model that aligns hematoxylin and eosin (H&E) image patches, gene-expression profiles, and free-text descriptions in a shared 2048-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.
+This repository hosts the main checkpoint (seed=0) from the ICML 2026 paper *Trimodal Learning Enhances Zero-Shot Histopathology Annotation*, in which the model appears under the anonymized placeholder `\ourmethod`.
+## Model architecture
+Three encoders project into a shared embedding space:
+| Modality       | Encoder          | Freezing |
+|----------------|------------------|----------|
+| Image (H&E)    | UNI2             | locked   |
+| Transcriptome  | Geneformer (12L) | locked   |
+| Text           | BioBERT v1.1     | unfrozen |
+Following the LiT convention, the freezing pattern is **LUL**, read in the order image, text, transcriptome: image locked, text unlocked, transcriptome locked. Only the text tower and the three projection heads are trained. The projection dimension is 2048.
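The `LUL` string is easy to misread because the encoder table lists the towers in a different order (image, transcriptome, text) than the pattern does (image, text, transcriptome). As a minimal sketch, the pattern can be decoded into per-tower trainability flags. `decode_freezing` and the tower names are hypothetical and not part of the cellwhisperer code; the ordering is the one stated in this card.

```python
# Illustrative only: decode_freezing is a hypothetical helper, not a cellwhisperer API.
PATTERN_ORDER = ("image", "text", "transcriptome")  # order stated in this card

def decode_freezing(pattern: str) -> dict[str, bool]:
    """Map a pattern such as "LUL" to {tower: is_trainable}."""
    if len(pattern) != len(PATTERN_ORDER) or set(pattern) - {"L", "U"}:
        raise ValueError(f"expected 3 characters from 'L'/'U', got {pattern!r}")
    # "U" (unlocked) means the tower receives gradient updates; "L" means frozen.
    return {tower: flag == "U" for tower, flag in zip(PATTERN_ORDER, pattern)}

trainable = decode_freezing("LUL")
print(trainable)
```

Only the text tower comes out as trainable, which matches the Freezing column of the table.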
+## Training data
+Three paired datasets cover the three modality pairs:
+- **HEST-1K** — H&E ↔ spatial gene expression (Visium-style spots)
+- **cellxgene_census** — gene expression ↔ free-text cell/sample metadata
+- **ARCHS4/GEO** — gene expression ↔ free-text sample descriptions
+Training ran for 4 epochs with AdamW (learning rate 1e-5, cosine schedule with 3% warmup) at batch size 512 on a single H100 GPU. This checkpoint was saved at epoch 3, global step 14624.
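For readers who want to reproduce the shape of the schedule, here is a minimal sketch of warmup followed by cosine decay. The card does not state the warmup shape, the final learning rate, or the total number of optimizer steps, so linear warmup, decay to zero, and the `total_steps` value below are assumptions. `lr_at_step` is a hypothetical helper, not the training code.

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-5,
               warmup_frac: float = 0.03) -> float:
    """Learning rate at a 0-indexed optimizer step."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step (assumed shape).
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr towards 0 over the remaining steps (assumed floor).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

total_steps = 10_000  # placeholder; the true total is not stated in this card
for s in (0, 299, 300, 5_150, 9_999):
    print(s, f"{lr_at_step(s, total_steps):.3e}")
```

With these placeholder values the rate peaks at 1e-5 at the end of warmup, passes through 5e-6 halfway through the decay phase, and is close to zero at the final step.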
+## Evaluation
+Reported AUROC on cell-type benchmarks (mean across cell types):
+| Benchmark | SpatialWhisperer | Best published baseline | Δ rel.  |
+|-----------|------------------|-------------------------|---------|
+| PathoCell | **0.630**        | 0.554                   | +13.7%  |
+| Lizard    | (see paper)      | —                       | +15.9%  |
+| PanNuke   | (see paper)      | —                       | +13.7%  |
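The Δ rel. column is consistent with a relative, not absolute, AUROC gain over the baseline. As a quick check on the PathoCell row only, assuming the usual definition (model - baseline) / baseline:

```python
def relative_gain(model_auroc: float, baseline_auroc: float) -> float:
    """Relative improvement over the baseline, as a fraction."""
    return (model_auroc - baseline_auroc) / baseline_auroc

pathocell = relative_gain(0.630, 0.554)
print(f"{pathocell:+.1%}")  # prints +13.7%
```

The absolute difference for the same row is 0.076 AUROC. The Lizard and PanNuke rows cannot be checked this way from the card alone, since their AUROC values are given only in the paper.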
+Modality-pair benchmarks (Tabula Sapiens, HEST-1K, Skin Conditions) indicate that the trimodal model retains per-pair performance under low-n (small-sample) subsampling. See the paper for full numbers.
+## How to use
+The checkpoint is a stripped Lightning state dict (~505 MiB, 529,993,226 bytes; 236 tensors covering the trained BioBERT text tower and the three 2048-d projection heads) plus its `hyper_parameters` block. **Foundation model weights are NOT included.** The locked UNI2 image encoder and the locked Geneformer transcriptome encoder are re-instantiated at load time from their original providers and remain under their respective licenses. The checkpoint's `hyper_parameters.model_config.use_cache = True` flag triggers the `FrozenCachedModel` wrapping, which excludes the locked towers from `state_dict` during load.
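To look inside the file without the cellwhisperer code, a minimal sketch follows. It assumes the standard Lightning layout with top-level `state_dict` and `hyper_parameters` keys, and it assumes `model_config` behaves like a mapping, which may not hold if it is stored as a config object. `summarize_checkpoint` is a hypothetical helper, and the tensor names in the toy example are made up.

```python
from collections import Counter

def summarize_checkpoint(ckpt: dict) -> dict:
    """Count tensors per top-level module prefix and report the use_cache flag."""
    state = ckpt["state_dict"]
    prefixes = Counter(name.split(".", 1)[0] for name in state)
    hparams = ckpt.get("hyper_parameters", {})
    # Follows the hyper_parameters.model_config.use_cache path named in this card;
    # model_config is assumed to be dict-like here.
    use_cache = hparams.get("model_config", {}).get("use_cache")
    return {"n_tensors": len(state), "by_prefix": dict(prefixes), "use_cache": use_cache}

# With PyTorch installed and a trusted local copy of the file:
# import torch
# ckpt = torch.load("spatialwhisperer.ckpt", map_location="cpu", weights_only=False)
# print(summarize_checkpoint(ckpt))

# Toy stand-in so the sketch runs without the real checkpoint (key names are made up):
toy = {
    "state_dict": {"text_model.layer.weight": 0, "text_projection.weight": 0},
    "hyper_parameters": {"model_config": {"use_cache": True}},
}
print(summarize_checkpoint(toy))
```

On the real checkpoint, `n_tensors` should come out as 236 if the description above is accurate, and no UNI2 or Geneformer weights should appear among the prefixes.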
+Loading requires the `cellwhisperer` package from the model code repository at <https://github.com/Good-Lab/spatialwhisperer>, plus the foundation models (UNI2, Geneformer, BioBERT v1.1), which the cellwhisperer setup scripts download.
+```python
+from cellwhisperer.utils.model_io import load_cellwhisperer_model
+model, tokenizer, transcriptome_proc, image_proc = load_cellwhisperer_model(
+    model_path="hf://Good-Lab/spatialwhisperer"
+)
+# model is a TranscriptomeTextDualEncoderLightning in eval mode
+```
+While the repo is private, export a token first:
+```bash
+export HUGGINGFACE_TOKEN=$(pass api_keys/huggingface_write)  # or any read token with access
+```
+To compute image–text similarities for zero-shot cell-type annotation, encode patches and class-name strings, then take cosine similarity in the shared 2048-d space. See `examples/zero_shot_celltype.py` in the model code repository.
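The scoring step described above can be sketched with NumPy. Random arrays stand in for the embeddings that the image and text towers would produce, so the snippet runs without the model. `zero_shot_labels` is a hypothetical helper and does not reproduce `examples/zero_shot_celltype.py`.

```python
import numpy as np

def zero_shot_labels(patch_emb: np.ndarray, class_emb: np.ndarray) -> np.ndarray:
    """Return, for each patch, the index of the class with highest cosine similarity."""
    # L2-normalise rows so the dot product equals cosine similarity.
    patches = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    classes = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    similarity = patches @ classes.T  # shape: (n_patches, n_classes)
    return similarity.argmax(axis=1)

rng = np.random.default_rng(0)
class_emb = rng.normal(size=(3, 2048))  # stand-in for 3 encoded class-name strings
# Stand-in patches: noisy copies of classes 2, 0, 1, so the expected labels are known.
patch_emb = class_emb[[2, 0, 1]] + 0.1 * rng.normal(size=(3, 2048))
print(zero_shot_labels(patch_emb, class_emb))  # prints [2 0 1]
```

In real use, `patch_emb` would hold projected H&E patch embeddings and `class_emb` the projected embeddings of the class-name strings, both in the shared 2048-d space.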
+## Intended use & limitations
+**Intended.** Research on multimodal histopathology, cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.
+**Not intended.** Clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.
+**Known limitations.**
+- Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
+- BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
+- The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining data is likely to be lower.
+## File contents
+- `spatialwhisperer.ckpt` — Lightning checkpoint (state_dict + hyper_parameters; optimizer/scheduler state stripped).
+- `README.md` — this card.
+## Citation
+```bibtex
+@inproceedings{schaefer2026spatialwhisperer,
+  title  = {Trimodal Learning Enhances Zero-Shot Histopathology Annotation},
+  author = {Schaefer, Moritz and others},
+  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
+  year   = {2026},
+}
+```
+## License
+CC BY-NC 4.0 (research use). Foundation model weights (UNI2, Geneformer, BioBERT) carry their own licenses; please consult upstream repositories.

spatialwhisperer.ckpt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ee4e5afb0d8b6b8776f66b8b7623660ecf8864cb0cbbbc70ed69326322b4ed48
+size 529993226