	---
	license: cc-by-nc-4.0
	tags:
	- histopathology
	- spatial-transcriptomics
	- multimodal
	- vision-language
	- clip
	- cell-type-annotation
	library_name: pytorch
	pipeline_tag: zero-shot-image-classification
	---

	# SpatialWhisperer

	SpatialWhisperer is a trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions into a shared 2048-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.

	This repository hosts the main checkpoint (seed=0) from the ICML 2026 paper "Trimodal Learning Enhances Zero-Shot Histopathology Annotation", in which the model appears under the anonymized placeholder `\ourmethod`.

	## Model architecture

	Three encoders project into a shared embedding space:

	| Modality | Encoder | Freezing |
	|----------------|------------------|----------|
	| Image (H&E) | UNI2 | locked |
	| Transcriptome | Geneformer (12L) | locked |
	| Text | BioBERT v1.1 | unfrozen |

	Following LiT convention, the freezing pattern is LUL (image locked, text unlocked, transcriptome locked). Only the text tower and the three projection heads are trained. Projection dimension is 2048.
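
To make the freezing pattern concrete, here is a minimal sketch in plain Python that derives the LUL code and the list of trained components from the tower table. The `Tower` dataclass and both helper functions are hypothetical illustrations and are not taken from the cellwhisperer code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tower:
    """One encoder tower; illustrative only, not a class from the model repo."""
    modality: str
    encoder: str
    locked: bool


# Order follows the card: image, text, transcriptome.
TOWERS = (
    Tower("image", "UNI2", locked=True),
    Tower("text", "BioBERT v1.1", locked=False),
    Tower("transcriptome", "Geneformer (12L)", locked=True),
)


def freezing_code(towers):
    """Return the LiT-style code: 'L' for a locked tower, 'U' for an unlocked one."""
    return "".join("L" if t.locked else "U" for t in towers)


def trained_components(towers):
    """Unlocked towers plus one projection head per modality receive gradients."""
    unlocked = [f"{t.modality} tower" for t in towers if not t.locked]
    heads = [f"{t.modality} projection head" for t in towers]
    return unlocked + heads


print(freezing_code(TOWERS))  # LUL
print(trained_components(TOWERS))
```

In a real training setup the same table would typically drive which parameters have gradients enabled; that wiring is left out here because it depends on the model code.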

	## Training data

	Three paired datasets cover the three modality pairs:

	- HEST-1K — H&E ↔ spatial gene expression (Visium-style spots)
	- cellxgene_census — gene expression ↔ free-text cell/sample metadata
	- ARCHS4/GEO — gene expression ↔ free-text sample descriptions

	Training ran for 4 epochs with AdamW (learning rate 1e-5, cosine schedule with 3% warmup, batch size 512) on a single H100 GPU. This checkpoint corresponds to epoch 3, global step 14624.
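
The learning-rate schedule described above can be sketched as a small function. This assumes a linear warmup from zero followed by cosine decay to zero, which is a common form but is not confirmed by the card. The `total_steps` value in the example is a hypothetical placeholder, since the card does not state the total number of optimizer steps.

```python
import math


def lr_at(step, total_steps, base_lr=1e-5, warmup_frac=0.03):
    """Learning rate at a given step: linear warmup, then cosine decay (assumed form)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, chosen only to make the example easy to read.
TOTAL_STEPS = 10_000
for step in (0, 300, 5_150, 10_000):
    print(step, lr_at(step, TOTAL_STEPS))
```

With these assumptions the rate peaks at 1e-5 when warmup ends (step 300 of 10,000) and falls to half of that at the midpoint of the decay phase.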

	## Evaluation

	Reported AUROC on cell-type benchmarks (mean across cell types):

	| Benchmark | SpatialWhisperer | Best published baseline | Δ rel. |
	|-----------|------------------|-------------------------|---------|
	| PathoCell | 0.630 | 0.554 | +13.7% |
	| Lizard | (see paper) | - | +15.9% |
	| PanNuke | (see paper) | - | +13.7% |

	Modality-pair benchmarks (Tabula Sapiens, HEST-1K, Skin Conditions) confirm the trimodal model retains per-pair performance under low-n subsampling. See the paper for full numbers.
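
As a quick check on the Δ rel. column, the PathoCell row is consistent with a relative improvement computed against the baseline score. The sketch below assumes that definition; the helper name `relative_gain` is hypothetical.

```python
def relative_gain(model_score, baseline_score):
    """Relative improvement over the baseline, in percent (assumed definition)."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (model_score - baseline_score) / baseline_score


# PathoCell row from the table: 0.630 vs. 0.554 mean AUROC.
print(f"{relative_gain(0.630, 0.554):+.1f}%")  # +13.7%
```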

	## How to use

	The checkpoint is a stripped Lightning state-dict (~505 MB, 236 tensors covering the trained BioBERT text tower and the three 2048-d projection heads) plus its `hyper_parameters` block. Foundation model weights are NOT included: the locked UNI2 image encoder and the locked Geneformer transcriptome encoder are re-instantiated at load time from their original providers and remain under their respective licenses. The checkpoint's `hyper_parameters.model_config.use_cache = True` flag triggers the `FrozenCachedModel` wrapping, which excludes the locked towers from `state_dict` during load.
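
For readers who want to inspect the file before loading the full model, the following sketch shows how the layout described above could be summarized. It operates on a plain dictionary so it runs without PyTorch. The key names in `toy_ckpt` are hypothetical placeholders and the real parameter names may differ.

```python
from collections import Counter


def summarize_state_dict(state_dict):
    """Count entries per top-level prefix (the text before the first dot)."""
    return Counter(key.split(".", 1)[0] for key in state_dict)


# Toy stand-in for a loaded checkpoint; key names are hypothetical.
toy_ckpt = {
    "state_dict": {
        "text_model.encoder.layer.0.weight": None,
        "text_model.encoder.layer.0.bias": None,
        "image_projection.weight": None,
        "transcriptome_projection.weight": None,
        "text_projection.weight": None,
    },
    "hyper_parameters": {"model_config": {"use_cache": True}},
}

counts = summarize_state_dict(toy_ckpt["state_dict"])
print(sum(counts.values()), dict(counts))
print(toy_ckpt["hyper_parameters"]["model_config"]["use_cache"])
```

With the real file, the dictionary would presumably come from `torch.load("spatialwhisperer.ckpt", map_location="cpu", weights_only=False)`, and the total should match the 236 tensors stated above. Only pass `weights_only=False` for checkpoints from a source you trust, since it permits arbitrary unpickling.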

	Loading requires the cellwhisperer model code from <https://github.com/moritzschaefer/spatialwhisperer> and the foundation models (UNI2, Geneformer, BioBERT v1.1), which the cellwhisperer setup scripts download.

	```python
	from cellwhisperer.utils.model_io import load_cellwhisperer_model

	model, tokenizer, transcriptome_proc, image_proc = load_cellwhisperer_model(
	    model_path="hf://moritzschaefer/spatialwhisperer"
	)
	# model is a TranscriptomeTextDualEncoderLightning in eval mode
	```

	While the repo is private, export a token first:

	```bash
	export HUGGINGFACE_TOKEN=$(pass api_keys/huggingface_write) # or any read token with access
	```

	To compute image–text similarities for zero-shot cell-type annotation, encode patches and class-name strings, then take cosine similarity in the shared 2048-d space. See `examples/zero_shot_celltype.py` in the model code repository.
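
The scoring step itself is simple enough to sketch without the model. The example below is not the `examples/zero_shot_celltype.py` script; it is a plain-Python illustration that assumes the patch and class-name embeddings have already been computed. Toy 3-d vectors stand in for the real 2048-d embeddings, and all function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two unit-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_annotate(patch_embeddings, class_embeddings):
    """Assign each patch the class whose text embedding is most similar.

    class_embeddings maps a class-name string to its text embedding.
    """
    labels = []
    for patch in patch_embeddings:
        scores = {name: cosine_similarity(patch, emb)
                  for name, emb in class_embeddings.items()}
        labels.append(max(scores, key=scores.get))
    return labels


# Toy vectors; real inputs would come from the image and text towers.
classes = {"lymphocyte": [1.0, 0.0, 0.0], "epithelial cell": [0.0, 1.0, 0.0]}
patches = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
print(zero_shot_annotate(patches, classes))  # ['lymphocyte', 'epithelial cell']
```

In practice the same computation would be done on batched tensors, with a single matrix product between the normalized patch and class matrices.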

	## Intended use & limitations

	Intended. Research on multimodal histopathology, cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.

	Not intended. Clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.

	Known limitations.
	- Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
	- BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
	- The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining data is likely to be lower.
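
To give a sense of what the ~55 μm spot scale means in image terms, the sketch below converts the spot diameter to pixels for a few scan resolutions. The microns-per-pixel values are hypothetical examples, since the card does not state the resolution of the training slides.

```python
def spot_diameter_px(spot_diameter_um, microns_per_pixel):
    """Convert a physical spot diameter to pixels at a given scan resolution."""
    if microns_per_pixel <= 0:
        raise ValueError("microns_per_pixel must be positive")
    return spot_diameter_um / microns_per_pixel


# Hypothetical resolutions; the card does not specify the actual values.
for mpp in (0.25, 0.5, 1.0):
    print(f"{mpp} um/px -> {spot_diameter_px(55.0, mpp):.0f} px")
```

Structures much smaller than this footprint, such as individual nuclei, fall below the scale at which image and expression were paired during training.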

	## File contents

	- `spatialwhisperer.ckpt` — Lightning checkpoint (state_dict + hyper_parameters; optimizer/scheduler state stripped).
	- `README.md` — this card.

	## Citation

	```bibtex
	@inproceedings{schaefer2026spatialwhisperer,
	  title = {Trimodal Learning Enhances Zero-Shot Histopathology Annotation},
	  author = {Schaefer, Moritz and others},
	  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
	  year = {2026},
	}
	```

	## License

	CC BY-NC 4.0 (research use). Foundation model weights (UNI2, Geneformer, BioBERT) carry their own licenses; please consult upstream repositories.