
Model Card for CARE

A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis

CVPR 2026

[Arxiv] | [Github] | [Cite]

What is CARE?

CARE (Cross-modal Adaptive Region Encoder) is a slide-level foundation model designed specifically for computational pathology to better capture the heterogeneous and non-uniform tissue organization in whole-slide images (WSIs). Unlike existing pathology foundation models that largely inherit natural image backbones and focus on isolated patches, CARE explicitly models coherent, morphologically meaningful regions, which improves both pathological interpretability and clinical relevance.

Its training follows a two-stage strategy: first, a unimodal self-supervised pretraining stage on 34,277 WSIs to learn morphology-aware visual representations without requiring segmentation annotations, and second, a cross-modal alignment stage that incorporates RNA and protein profiles to refine region construction and representation.

With this molecular guidance, CARE can identify biologically relevant and irregular yet coherent regions of interest (ROIs), and it supports downstream tasks using either ROI-level features or slide-level features aggregated from adaptive regions. Notably, despite using only about one-tenth of the pretraining data of many mainstream foundation models, CARE achieves stronger average performance across 33 downstream benchmarks, including morphological classification, molecular prediction, and survival analysis.

Why use CARE?

CARE offers a practical and high-performing solution for slide-level foundation modeling, with several notable advantages:

  • Data-efficient learning: CARE achieves strong results with only a fraction of the pretraining data used in prior work, demonstrating the effectiveness of its two-stage pretraining pipeline.
  • Annotation-free region partitioning: CARE can perform coarse morphological region partitioning on WSIs without segmentation annotations, enabling structured slide understanding at low annotation cost.
  • Molecularly guided multimodal modeling: CARE is, to the best of our knowledge, the first slide-level foundation model guided by protein information, helping bridge pathology image modeling and molecular representation learning.
  • Efficient training and inference: By using an Intra-Regional Information Interaction Strategy, CARE restricts self-attention to patches within the same region, avoiding expensive global attention over massive patch sets and reducing both computation and memory usage.
  • Consistently strong performance: Across 33 downstream tasks, CARE consistently outperforms prior approaches and demonstrates strong generalization ability.
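The intra-regional interaction idea above can be illustrated with a toy masked-attention function. This is a minimal sketch, not CARE's actual implementation: the single-head formulation, the `region_ids` labels, and the masking scheme are all assumptions chosen to show how restricting attention to same-region patches avoids global attention over the full patch set.

```python
import torch

def region_masked_attention(q, k, v, region_ids):
    # q, k, v: (N, d) patch queries/keys/values; region_ids: (N,) integer labels.
    # Attention weights are computed only between patches that share a region id,
    # so each patch never attends outside its own region.
    scores = q @ k.T / q.shape[-1] ** 0.5              # (N, N) raw scores
    same_region = region_ids[:, None] == region_ids[None, :]
    scores = scores.masked_fill(~same_region, float("-inf"))
    attn = torch.softmax(scores, dim=-1)               # rows sum to 1 within a region
    return attn @ v
```

Because cross-region weights are exactly zero, changing the features of one region leaves the outputs of every other region untouched, which is what makes a block-wise (per-region) computation possible in practice.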

Introduction

  • Research group: Chen Li Group
  • Model type: Pretrained vision-molecular encoder
  • Core design: Adaptive region modeling
  • Pretraining strategy: Two-stage pretraining with self-supervised WSI pretraining followed by cross-modal molecular alignment
  • Pretraining dataset: TCGA and GTEx
  • Pretraining scale: 34,277 WSIs in total, including 11,463 TCGA slides and 22,814 GTEx slides
  • Paired multimodal data: 13,289 WSI–RNA pairs and 8,225 WSI–protein pairs

Requirements

torch==2.3.0
timm==1.0.19
transformers==4.57.6

WSI Feature

CARE is a vision-molecular model trained on CONCH v1.5 patch features with a patch size of 512×512 pixels at 20× magnification.

You should first request access to CONCH v1.5, and then extract patch features by following the instructions provided for CONCH v1.5. You can use CLAM to extract CONCH v1.5 features and save them as .h5 files.

You can directly use CARE for slide-level feature extraction. CARE builds a feature grid from CONCH v1.5 patch features using patch coordinates and spatial relationships between patches.
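As a rough illustration of this grid construction, the sketch below scatters patch features into a dense 2-D grid using their down-scaled coordinates. The `(x, y)` column ordering (CLAM-style) and the function name are assumptions for illustration, not CARE internals:

```python
import torch

def to_feature_grid(features, coords):
    # features: (N, d) patch features; coords: (N, 2) integer grid indices,
    # assumed to be (x, y) pairs already divided by the patch size.
    H = int(coords[:, 1].max()) + 1
    W = int(coords[:, 0].max()) + 1
    grid = torch.zeros(H, W, features.shape[-1], dtype=features.dtype)
    grid[coords[:, 1], coords[:, 0]] = features  # empty grid cells stay zero
    return grid
```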

Slide-level feature extraction can be done in the following way:

import h5py
import torch
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
model = AutoModel.from_pretrained("Zipper-1/CARE", trust_remote_code=True).to(device)
model.eval()

# Load CONCH v1.5 features from .h5 files produced with CLAM
h5_path = 'data.h5'
with h5py.File(h5_path, 'r') as file:
    features = torch.from_numpy(file['features'][:]).to(device)
    coords = torch.from_numpy(file['coords'][:]).to(device)
    patch_size = file['coords'].attrs['patch_size_level0']
    coords = coords // patch_size  # convert level-0 pixel coordinates to grid indices

# Add a batch dimension if needed
if features.dim() == 2:
    features = features.unsqueeze(0)
if coords.dim() == 2:
    coords = coords.unsqueeze(0)
N_values = torch.tensor([coords.shape[1]], dtype=torch.long, device=coords.device)

# Extract the CARE slide embedding
with torch.inference_mode():
    out = model(features, N_values, coords)
    slide_embedding = out.wsi_embedding
    aux_loss = out.aux_loss

These extracted slide features can be used for downstream tasks such as slide-level classification (e.g., linear probing or kNN) and fine-tuning.
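As one example of such downstream use, a linear probe can be fit on the extracted slide embeddings with scikit-learn. The array shapes, the embedding dimension, and the random stand-in data below are purely illustrative; in practice each row would be one CARE slide embedding with its slide-level label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: one slide embedding per WSI plus a binary label.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 768))   # embedding dimension is illustrative
y_train = rng.integers(0, 2, size=100)
X_test = rng.normal(size=(20, 768))

# Linear probing: a logistic-regression head on frozen slide features.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
preds = probe.predict(X_test)
```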

If you fine-tune CARE without including aux_loss in the overall training objective, the adaptive region partition remains fixed. To adapt the adaptive region partitioning to a downstream task, include aux_loss in the total training loss.
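A minimal fine-tuning step might combine the two objectives as follows. This is a sketch under stated assumptions: the classifier head, the embedding dimension, and the `lambda_aux` weight are not part of CARE's released recipe, only illustrations of adding `aux_loss` to the total loss so the region partitioning adapts:

```python
import torch
import torch.nn as nn

classifier = nn.Linear(768, 2)  # embedding dimension and class count are illustrative
lambda_aux = 0.1                # weight on aux_loss; an assumption, tune per task

def training_step(model, classifier, optimizer, features, n_values, coords, labels):
    out = model(features, n_values, coords)
    logits = classifier(out.wsi_embedding)
    task_loss = nn.functional.cross_entropy(logits, labels)
    # Including aux_loss lets the adaptive region partitioning update;
    # dropping it would keep the partitioning fixed.
    loss = task_loss + lambda_aux * out.aux_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```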

Contact

For any additional questions or comments, contact
Di Zhang (dzhang@stu.xjtu.edu.cn),
Zhangpeng Gong (gongzhangpeng@stu.xjtu.edu.cn),
Zeyu Gao (zg323@cam.ac.uk)

Acknowledgements

The project was built on top of amazing repositories such as ViT, iBOT, and OpenCLIP. We thank the authors for their remarkable contributions and released code. Our work is built upon CONCH v1.5, and we also thank the Mahmood Lab AI for Pathology team for their contributions to the computational pathology community.

BibTeX

If you find our work useful in your research, please consider citing CARE and CONCH v1.5:

@article{zhang2026care,
  title={CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis},
  author={Zhang, Di and Gong, Zhangpeng and Pang, Xiaobo and Liu, Jiashuai and Lu, Junbo and Cui, Hao and Ge, Jiusong and Zeng, Zhi and Yi, Kai and Li, Yinghua and others},
  journal={arXiv preprint arXiv:2602.21637},
  year={2026}
}
@article{ding2025multimodal,
  title={A multimodal whole-slide foundation model for pathology},
  author={Ding, Tong and Wagner, Sophia J and Song, Andrew H and Chen, Richard J and Lu, Ming Y and Zhang, Andrew and Vaidya, Anurag J and Jaume, Guillaume and Shaban, Muhammad and Kim, Ahrong and others},
  journal={Nature Medicine},
  pages={1--13},
  year={2025},
  publisher={Nature Publishing Group US New York}
}

license: cc-by-nc-4.0
