A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling
Paper: 2602.22843
CheXficient is a vision-language foundation model for chest X-ray (CXR) interpretation, designed to improve both data efficiency and computational efficiency during pretraining.
Instead of scaling indiscriminately to ever-larger datasets, CheXficient adopts a principled data curation strategy to selectively prioritize informative training samples. This approach demonstrates that active, structured data selection can serve as a cost-effective alternative to brute-force dataset enlargement.
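The exact curation criterion is not detailed here; as a generic illustration of prioritizing informative samples, one common approach scores each training example (for instance by its current loss) and keeps only the hardest fraction. The helper below is hypothetical, not CheXficient's actual method:

```python
import torch

def select_informative(losses: torch.Tensor, keep_frac: float = 0.5) -> torch.Tensor:
    """Return indices of the highest-loss (most informative) samples.

    `losses` holds one scalar per training sample. Hypothetical helper
    illustrating loss-based selection, not the paper's criterion.
    """
    k = max(1, int(keep_frac * losses.numel()))
    # torch.topk returns the k largest values and their indices
    return torch.topk(losses, k).indices

# Example: keep the 2 hardest of 4 samples
losses = torch.tensor([0.1, 2.3, 0.7, 1.5])
print(select_informative(losses, 0.5).tolist())  # [1, 3]
```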
The model follows a dual-encoder architecture and supports prompt-based zero-shot classification via joint image-text representation learning.
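Conceptually, zero-shot classification with a dual encoder reduces to comparing the image embedding against one embedding per text prompt, typically via cosine similarity. The sketch below assumes precomputed embeddings rather than CheXficient's actual forward signature:

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(image_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of one image embedding (D,) to N prompt embeddings (N, D)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    return image_emb @ text_embs.T  # shape (N,), one score per prompt

# Toy example with random embeddings (dimension 128 is arbitrary)
img = torch.randn(128)
prompts = torch.randn(2, 128)  # e.g. "Pneumonia" vs "no Pneumonia"
scores = zero_shot_scores(img, prompts)
print(scores.shape)  # torch.Size([2])
```

The prompt with the highest score is taken as the prediction, which is what makes the classifier "prompt-based": new classes only require new text prompts, no retraining.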
pip install torch torchvision transformers pillow
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
repo_id = "StanfordAIMI/CheXficient"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(repo_id, trust_remote_code=True)
model.eval()
image = Image.open("./CXR/images/5AF3BB6C1BCC83C.png").convert("RGB")
text = ["Pneumonia", "no Pneumonia"]
image_inputs = image_processor(images=image, return_tensors="pt").to(device)
text_inputs = tokenizer(text, padding=True, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(
        pixel_values=image_inputs["pixel_values"],
        text_tokens=text_inputs,
    )
print(outputs)
Optional probability conversion:
import torch.nn.functional as F
logits = outputs["logits_per_image"]
probs = F.softmax(logits, dim=-1)
print(probs)
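To turn the probabilities into a label, take the argmax over the prompt list. The snippet below uses stand-in logits with the shape assumed for `logits_per_image` (one row per image, one column per prompt):

```python
import torch
import torch.nn.functional as F

text = ["Pneumonia", "no Pneumonia"]
# Stand-in logits; in practice use outputs["logits_per_image"]
logits = torch.tensor([[2.0, 0.5]])
probs = F.softmax(logits, dim=-1)
# Index of the highest-probability prompt for the first (only) image
pred = text[probs[0].argmax().item()]
print(pred)  # Pneumonia
```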
@article{chexficient2026,
  title={A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling},
  author={...},
  journal={...},
  year={2026}
}
Base model
emilyalsentzer/Bio_ClinicalBERT