
Model Card for CARE

A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis

CVPR 2026

[Arxiv] | [Github] | [Cite]

What is CARE?

CARE (Cross-modal Adaptive Region Encoder) is a slide-level foundation model designed specifically for computational pathology to better capture the heterogeneous and non-uniform tissue organization in whole-slide images (WSIs). Unlike existing pathology foundation models that largely inherit natural image backbones and focus on isolated patches, CARE explicitly models coherent, morphologically meaningful regions, which improves both pathological interpretability and clinical relevance.

Its training follows a two-stage strategy: first, a unimodal self-supervised pretraining stage on 34,277 WSIs to learn morphology-aware visual representations without requiring segmentation annotations, and second, a cross-modal alignment stage that incorporates RNA and protein profiles to refine region construction and representation.

With this molecular guidance, CARE can identify biologically relevant and irregular yet coherent regions of interest (ROIs), and it supports downstream tasks using either ROI-level features or slide-level features aggregated from adaptive regions. Notably, despite using only about one-tenth of the pretraining data of many mainstream foundation models, CARE achieves stronger average performance across 33 downstream benchmarks, including morphological classification, molecular prediction, and survival analysis.

Why use CARE?

CARE offers a practical and high-performing solution for slide-level foundation modeling, with several notable advantages:

  • Data-efficient learning: CARE achieves strong results with only a fraction of the pretraining data used in prior work, demonstrating the effectiveness of its two-stage pretraining pipeline.
  • Annotation-free region partitioning: CARE can perform coarse morphological region partitioning on WSIs without segmentation annotations, enabling structured slide understanding at low annotation cost.
  • Molecularly guided multimodal modeling: CARE is, to the best of our knowledge, the first slide-level foundation model guided by protein information, helping bridge pathology image modeling and molecular representation learning.
  • Efficient training and inference: By using an Intra-Regional Information Interaction Strategy, CARE restricts self-attention to patches within the same region, avoiding expensive global attention over massive patch sets and reducing both computation and memory usage.
  • Consistently strong performance: Across 33 downstream tasks, CARE consistently outperforms prior approaches and demonstrates strong generalization ability.
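The intra-regional interaction idea above can be illustrated with a toy masked-attention function. This is a minimal sketch, not CARE's actual implementation: the single-head formulation, the `region_ids` labels, and the masking scheme are all assumptions chosen to show how restricting attention to same-region patches avoids global attention over the full patch set.

```python
import torch

def region_masked_attention(q, k, v, region_ids):
    # q, k, v: (N, d) patch queries/keys/values; region_ids: (N,) integer labels.
    # Attention weights are computed only between patches that share a region id,
    # so each patch never attends outside its own region.
    scores = q @ k.T / q.shape[-1] ** 0.5              # (N, N) raw scores
    same_region = region_ids[:, None] == region_ids[None, :]
    scores = scores.masked_fill(~same_region, float("-inf"))
    attn = torch.softmax(scores, dim=-1)               # rows sum to 1 within a region
    return attn @ v
```

Because cross-region weights are exactly zero, changing the features of one region leaves the outputs of every other region untouched, which is what makes a block-wise (per-region) computation possible in practice.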

Introduction

  • Research group: Chen Li Group
  • Model type: Pretrained vision-molecular encoder
  • Core design: Adaptive region modeling
  • Pretraining strategy: Two-stage pretraining with self-supervised WSI pretraining followed by cross-modal molecular alignment
  • Pretraining dataset: TCGA and GTEx
  • Pretraining scale: 34,277 WSIs in total, including 11,463 TCGA slides and 22,814 GTEx slides
  • Paired multimodal data: 13,289 WSI–RNA pairs and 8,225 WSI–protein pairs

Requirements

torch==2.3.0
timm==1.0.19
transformers==4.57.6

WSI Feature

CARE is a vision-molecular model trained on CONCH v1.5 patch features with a patch size of 512×512 pixels at 20× magnification.

You should first request access to CONCH v1.5, and then extract patch features by following the instructions provided for CONCH v1.5. You can use CLAM to extract CONCH v1.5 features and save them as .h5 files.

You can directly use CARE for slide-level feature extraction. CARE builds a feature grid from CONCH v1.5 patch features using patch coordinates and spatial relationships between patches.
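As a rough illustration of this grid construction, the sketch below scatters patch features into a dense 2-D grid using their down-scaled coordinates. The `(x, y)` column ordering (CLAM-style) and the function name are assumptions for illustration, not CARE internals:

```python
import torch

def to_feature_grid(features, coords):
    # features: (N, d) patch features; coords: (N, 2) integer grid indices,
    # assumed to be (x, y) pairs already divided by the patch size.
    H = int(coords[:, 1].max()) + 1
    W = int(coords[:, 0].max()) + 1
    grid = torch.zeros(H, W, features.shape[-1], dtype=features.dtype)
    grid[coords[:, 1], coords[:, 0]] = features  # empty grid cells stay zero
    return grid
```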

Slide-level feature extraction can be done in the following way:

import h5py
import torch
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
model = AutoModel.from_pretrained("Zipper-1/CARE", trust_remote_code=True).to(device)
model.eval()

# Load CONCH v1.5 features from .h5 files produced with CLAM
h5_path = 'data.h5'
with h5py.File(h5_path, 'r') as file:
    features = torch.from_numpy(file['features'][:]).to(device)
    coords = torch.from_numpy(file['coords'][:]).to(device)
    patch_size = file['coords'].attrs['patch_size_level0']
    coords = coords // patch_size  # convert level-0 pixel coordinates to grid indices

# Add a batch dimension if needed
if features.dim() == 2:
    features = features.unsqueeze(0)
if coords.dim() == 2:
    coords = coords.unsqueeze(0)
N_values = torch.tensor([coords.shape[1]], dtype=torch.long, device=coords.device)

# Extract the CARE slide embedding
with torch.inference_mode():
    out = model(features, N_values, coords)
    slide_embedding = out.wsi_embedding
    aux_loss = out.aux_loss

These extracted slide features can be used for downstream tasks such as slide-level classification (e.g., linear probing or kNN) and fine-tuning.
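As one example of such downstream use, a linear probe can be fit on the extracted slide embeddings with scikit-learn. The array shapes, the embedding dimension, and the random stand-in data below are purely illustrative; in practice each row would be one CARE slide embedding with its slide-level label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: one slide embedding per WSI plus a binary label.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 768))   # embedding dimension is illustrative
y_train = rng.integers(0, 2, size=100)
X_test = rng.normal(size=(20, 768))

# Linear probing: a logistic-regression head on frozen slide features.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
preds = probe.predict(X_test)
```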

If you fine-tune CARE without including aux_loss in the overall training objective, the adaptive region partition remains fixed. To adapt the adaptive region partitioning to a downstream task, include aux_loss in the total training loss.
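A minimal fine-tuning step might combine the two objectives as follows. This is a sketch under stated assumptions: the classifier head, the embedding dimension, and the `lambda_aux` weight are not part of CARE's released recipe, only illustrations of adding `aux_loss` to the total loss so the region partitioning adapts:

```python
import torch
import torch.nn as nn

classifier = nn.Linear(768, 2)  # embedding dimension and class count are illustrative
lambda_aux = 0.1                # weight on aux_loss; an assumption, tune per task

def training_step(model, classifier, optimizer, features, n_values, coords, labels):
    out = model(features, n_values, coords)
    logits = classifier(out.wsi_embedding)
    task_loss = nn.functional.cross_entropy(logits, labels)
    # Including aux_loss lets the adaptive region partitioning update;
    # dropping it would keep the partitioning fixed.
    loss = task_loss + lambda_aux * out.aux_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```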

Contact

For any additional questions or comments, contact
Di Zhang (dzhang@stu.xjtu.edu.cn),
Zhangpeng Gong (gongzhangpeng@stu.xjtu.edu.cn),
Zeyu Gao (zg323@cam.ac.uk)

Acknowledgements

The project was built on top of amazing repositories such as ViT, iBOT, and OpenCLIP. We thank the authors for their remarkable contributions and released code. Our work is built upon CONCH v1.5, and we also thank the Mahmood Lab AI for Pathology team for their contributions to the computational pathology community.

BibTeX

If you find our work useful in your research, please consider citing CARE and CONCH v1.5:

@article{zhang2026care,
  title={CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis},
  author={Zhang, Di and Gong, Zhangpeng and Pang, Xiaobo and Liu, Jiashuai and Lu, Junbo and Cui, Hao and Ge, Jiusong and Zeng, Zhi and Yi, Kai and Li, Yinghua and others},
  journal={arXiv preprint arXiv:2602.21637},
  year={2026}
}
@article{ding2025multimodal,
  title={A multimodal whole-slide foundation model for pathology},
  author={Ding, Tong and Wagner, Sophia J and Song, Andrew H and Chen, Richard J and Lu, Ming Y and Zhang, Andrew and Vaidya, Anurag J and Jaume, Guillaume and Shaban, Muhammad and Kim, Ahrong and others},
  journal={Nature Medicine},
  pages={1--13},
  year={2025},
  publisher={Nature Publishing Group US New York}
}

license: cc-by-nc-4.0
