---
license: cc-by-nc-4.0
tags:
- histopathology
- spatial-transcriptomics
- multimodal
- vision-language
- clip
- cell-type-annotation
library_name: pytorch
pipeline_tag: zero-shot-image-classification
---

# SpatialWhisperer

SpatialWhisperer is a trimodal embedding model that aligns hematoxylin and eosin (H&E) image patches, gene-expression profiles, and free-text descriptions in a shared 2048-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.

This repository hosts the main checkpoint (seed=0) from the ICML 2026 paper *Trimodal Learning Enhances Zero-Shot Histopathology Annotation* (anonymized name `\ourmethod`).

## Model architecture

Three encoders project into a shared embedding space:

| Modality      | Encoder          | Freezing |
|---------------|------------------|----------|
| Image (H&E)   | UNI2             | locked   |
| Transcriptome | Geneformer (12L) | locked   |
| Text          | BioBERT v1.1     | unlocked |

Following the LiT convention, the freezing pattern is **LUL** (image locked, text unlocked, transcriptome locked). Only the text tower and the three projection heads are trained. The projection dimension is 2048.

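As a reading aid for the LUL notation, the sketch below derives the pattern string and the list of trained components from the table above. The variable and component names are illustrative and are not taken from the cellwhisperer code; in a PyTorch model, locking a tower would typically mean calling `requires_grad_(False)` on its parameters.

```python
# Illustrative only: these names are not taken from the cellwhisperer code.
# Order follows the card's pattern string: image, text, transcriptome.
TOWERS = [
    ("image", "UNI2", True),
    ("text", "BioBERT v1.1", False),
    ("transcriptome", "Geneformer (12L)", True),
]


def freezing_pattern(towers):
    """Return one letter per tower: 'L' if locked, 'U' if unlocked."""
    return "".join("L" if locked else "U" for _, _, locked in towers)


def trainable_components(towers):
    """Unlocked towers plus one projection head per modality."""
    unlocked = [f"{modality}_tower" for modality, _, locked in towers if not locked]
    heads = [f"{modality}_projection" for modality, _, _ in towers]
    return unlocked + heads


print(freezing_pattern(TOWERS))  # LUL
print(trainable_components(TOWERS))
```

The second function returns four entries, matching the statement that only the text tower and the three projection heads receive gradient updates.
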
## Training data

Three paired datasets link the three modalities; each pairs gene expression with one other modality:

- **HEST-1K**: H&E ↔ spatial gene expression (Visium-style spots)
- **cellxgene_census**: gene expression ↔ free-text cell/sample metadata
- **ARCHS4/GEO**: gene expression ↔ free-text sample descriptions

Training ran for 4 epochs with AdamW (learning rate 1e-5, cosine schedule with 3% warmup, batch size 512) on a single H100 GPU. This checkpoint was saved at epoch 3, global step 14624.

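The learning-rate schedule described above can be sketched as follows. This is a minimal illustration, not the training code: it assumes linear warmup over the first 3% of steps and cosine decay to zero afterwards, and `total_steps` is a placeholder because the card does not state the planned number of optimizer steps.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_frac=0.03):
    """Linear warmup to base_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# total_steps is a placeholder; the card does not state the planned step count.
total_steps = 15_000
for step in (0, 225, 450, 7_725, 15_000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The rate rises linearly to 1e-5 at the end of warmup and then falls smoothly; whether the real schedule decays to exactly zero or to a floor value is not stated in this card.
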
## Evaluation

Reported AUROC on cell-type benchmarks (mean across cell types):

| Benchmark | SpatialWhisperer | Best published baseline | Relative Δ |
|-----------|------------------|-------------------------|------------|
| PathoCell | **0.630**        | 0.554                   | +13.7%     |
| Lizard    | (see paper)      | not listed              | +15.9%     |
| PanNuke   | (see paper)      | not listed              | +13.7%     |

Modality-pair benchmarks (Tabula Sapiens, HEST-1K, Skin Conditions) show that the trimodal model retains per-pair performance under low-n subsampling. See the paper for full numbers.

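To make the metric concrete, here is a small sketch of how a mean AUROC across cell types and a relative gain could be computed. It assumes that "mean across cell types" means an unweighted average of one-vs-rest AUROC values; the paper's exact protocol may differ. The toy scores and cell-type names are invented for illustration, and only the last line uses numbers from the table (the PathoCell row).

```python
def auroc(scores, labels):
    """AUROC for one class: probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count pairwise wins; ties contribute half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(scores_by_class, true_classes):
    """Unweighted mean of one-vs-rest AUROC over cell types."""
    per_class = []
    for cell_type, scores in scores_by_class.items():
        labels = [int(t == cell_type) for t in true_classes]
        per_class.append(auroc(scores, labels))
    return sum(per_class) / len(per_class)


def relative_gain(model, baseline):
    """Improvement of the model over the baseline, as a fraction of the baseline."""
    return (model - baseline) / baseline


# Toy similarity scores for four patches and two hypothetical cell types.
true_classes = ["tumor", "immune", "tumor", "immune"]
scores_by_class = {
    "tumor": [0.9, 0.8, 0.3, 0.1],
    "immune": [0.2, 0.7, 0.4, 0.6],
}
print(macro_auroc(scores_by_class, true_classes))  # 0.875
print(f"{relative_gain(0.630, 0.554):+.1%}")  # PathoCell row: +13.7%
```
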
## How to use

The checkpoint is a stripped Lightning state dict (~505 MB, 236 tensors covering the trained BioBERT text tower and the three 2048-d projection heads) plus its `hyper_parameters` block.

**Foundation model weights are NOT included.** The locked UNI2 image encoder and the locked Geneformer transcriptome encoder are re-instantiated at load time from their original providers and remain under their respective licenses. The checkpoint's `hyper_parameters.model_config.use_cache = True` flag triggers the `FrozenCachedModel` wrapping, which excludes the locked towers from the `state_dict` during load.

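Before loading the full model, it can be useful to check what the file actually contains. The helper below is a sketch that works on an already-loaded checkpoint dictionary; it relies only on the `state_dict` and `hyper_parameters` keys described above. `summarize_checkpoint` is a hypothetical helper, and the parameter names in the stand-in dictionary are placeholders, not the real key names. Obtaining the dictionary with `torch.load` is shown as a comment only, and the exact arguments needed may depend on the installed PyTorch version.

```python
from collections import Counter


def summarize_checkpoint(ckpt):
    """Summarize a Lightning-style checkpoint dict without the model code."""
    state_dict = ckpt["state_dict"]
    hparams = ckpt.get("hyper_parameters", {})
    model_config = hparams.get("model_config", {})
    # model_config may be a dict or an object, depending on how it was saved.
    if isinstance(model_config, dict):
        use_cache = model_config.get("use_cache")
    else:
        use_cache = getattr(model_config, "use_cache", None)
    # Group parameter names by their first dotted component.
    prefixes = Counter(name.split(".", 1)[0] for name in state_dict)
    return {
        "n_tensors": len(state_dict),
        "by_prefix": dict(prefixes),
        "use_cache": use_cache,
    }


# Stand-in for something like (assumption, only for files you trust):
#   ckpt = torch.load("spatialwhisperer.ckpt", map_location="cpu", weights_only=False)
# The key names below are hypothetical placeholders, not the real parameter names.
ckpt = {
    "state_dict": {"text_model.layer0.weight": None, "text_projection.weight": None},
    "hyper_parameters": {"model_config": {"use_cache": True}},
}
print(summarize_checkpoint(ckpt))
```

On the real file, `n_tensors` should come out as 236 and none of the prefixes should belong to the UNI2 or Geneformer towers.
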
Loading requires the cellwhisperer code from <https://github.com/moritzschaefer/spatialwhisperer> and the foundation models (UNI2, Geneformer, BioBERT v1.1), which are downloaded by the cellwhisperer setup scripts.

```python
from cellwhisperer.utils.model_io import load_cellwhisperer_model

model, tokenizer, transcriptome_proc, image_proc = load_cellwhisperer_model(
    model_path="hf://moritzschaefer/spatialwhisperer"
)
# model is a TranscriptomeTextDualEncoderLightning in eval mode
```

While the repository is private, export a token first:

```bash
export HUGGINGFACE_TOKEN=$(pass api_keys/huggingface_write)  # or any read token with access
```

To compute image–text similarities for zero-shot cell-type annotation, encode the patches and the class-name strings, then take the cosine similarity in the shared 2048-d space. See `examples/zero_shot_celltype.py` in the model code repository.

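The similarity step described above can be illustrated without the model itself. The sketch below is not the code in `examples/zero_shot_celltype.py`; it assumes the patch and class-name embeddings have already been computed and shows only the cosine-similarity and argmax logic, using toy 3-d vectors in place of the 2048-d embeddings. The function names and class names are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def zero_shot_annotate(patch_embeddings, class_embeddings):
    """Assign each patch the class whose text embedding is most similar."""
    predictions = []
    for patch in patch_embeddings:
        sims = {name: cosine(patch, emb) for name, emb in class_embeddings.items()}
        predictions.append(max(sims, key=sims.get))
    return predictions


# Toy 3-d vectors stand in for the 2048-d embeddings; class names are made up.
class_embeddings = {
    "lymphocyte": [1.0, 0.0, 0.0],
    "epithelial cell": [0.0, 1.0, 0.0],
}
patch_embeddings = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
print(zero_shot_annotate(patch_embeddings, class_embeddings))
# ['lymphocyte', 'epithelial cell']
```

With real embeddings the same logic is usually run as a single matrix product over L2-normalized tensors, which gives identical rankings.
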
## Intended use & limitations

**Intended.** Research on multimodal histopathology, cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.

**Not intended.** Clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.

**Known limitations.**

- Trained on Visium-scale spots (~55 μm in diameter); finer-grained image–expression alignment is not guaranteed.
- The BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
- The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining data is likely to be lower.

## File contents

- `spatialwhisperer.ckpt`: Lightning checkpoint (`state_dict` + `hyper_parameters`; optimizer and scheduler state stripped).
- `README.md`: this card.

## Citation

```bibtex
@inproceedings{schaefer2026spatialwhisperer,
  title     = {Trimodal Learning Enhances Zero-Shot Histopathology Annotation},
  author    = {Schaefer, Moritz and others},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
}
```

## License

CC BY-NC 4.0 (research use). Foundation model weights (UNI2, Geneformer, BioBERT) carry their own licenses; please consult the upstream repositories.