A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling
Paper: 2602.22843
CheXficient is a vision-language foundation model for chest X-ray (CXR) interpretation, designed to improve both data efficiency and computational efficiency during pretraining.
Instead of scaling indiscriminately to ever-larger datasets, CheXficient adopts a principled data curation strategy to selectively prioritize informative training samples. This approach demonstrates that active, structured data selection can serve as a cost-effective alternative to brute-force dataset enlargement.
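The exact curation criterion is not detailed here; as a generic illustration of prioritizing informative samples, one common approach scores each training example (for instance by its current loss) and keeps only the hardest fraction. The helper below is hypothetical, not CheXficient's actual method:

```python
import torch

def select_informative(losses: torch.Tensor, keep_frac: float = 0.5) -> torch.Tensor:
    """Return indices of the highest-loss (most informative) samples.

    `losses` holds one scalar per training sample. Hypothetical helper
    illustrating loss-based selection, not the paper's criterion.
    """
    k = max(1, int(keep_frac * losses.numel()))
    # torch.topk returns the k largest values and their indices
    return torch.topk(losses, k).indices

# Example: keep the 2 hardest of 4 samples
losses = torch.tensor([0.1, 2.3, 0.7, 1.5])
print(select_informative(losses, 0.5).tolist())  # [1, 3]
```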
The model follows a dual-encoder architecture and supports prompt-based zero-shot classification via joint image-text representation learning.
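Conceptually, zero-shot classification with a dual encoder reduces to comparing the image embedding against one embedding per text prompt, typically via cosine similarity. The sketch below assumes precomputed embeddings rather than CheXficient's actual forward signature:

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(image_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of one image embedding (D,) to N prompt embeddings (N, D)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    return image_emb @ text_embs.T  # shape (N,), one score per prompt

# Toy example with random embeddings (dimension 128 is arbitrary)
img = torch.randn(128)
prompts = torch.randn(2, 128)  # e.g. "Pneumonia" vs "no Pneumonia"
scores = zero_shot_scores(img, prompts)
print(scores.shape)  # torch.Size([2])
```

The prompt with the highest score is taken as the prediction, which is what makes the classifier "prompt-based": new classes only require new text prompts, no retraining.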
pip install torch torchvision transformers pillow
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
repo_id = "StanfordAIMI/CheXficient"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(repo_id, trust_remote_code=True)
model.eval()
image = Image.open("./CXR/images/5AF3BB6C1BCC83C.png").convert("RGB")
text = ["Pneumonia", "no Pneumonia"]
image_inputs = image_processor(images=image, return_tensors="pt").to(device)
text_inputs = tokenizer(text, padding=True, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(
        pixel_values=image_inputs["pixel_values"],
        text_tokens=text_inputs,
    )
print(outputs)
Optional probability conversion:
import torch.nn.functional as F
logits = outputs["logits_per_image"]
probs = F.softmax(logits, dim=-1)
print(probs)
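To turn the probabilities into a label, take the argmax over the prompt list. The snippet below uses stand-in logits with the shape assumed for `logits_per_image` (one row per image, one column per prompt):

```python
import torch
import torch.nn.functional as F

text = ["Pneumonia", "no Pneumonia"]
# Stand-in logits; in practice use outputs["logits_per_image"]
logits = torch.tensor([[2.0, 0.5]])
probs = F.softmax(logits, dim=-1)
# Index of the highest-probability prompt for the first (only) image
pred = text[probs[0].argmax().item()]
print(pred)  # Pneumonia
```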
@article{chexficient2026,
  title={A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling},
  author={...},
  journal={...},
  year={2026}
}
Base model
emilyalsentzer/Bio_ClinicalBERT