---
license: mit
language:
- en
base_model:
- facebook/dinov2-small
- emilyalsentzer/Bio_ClinicalBERT
pipeline_tag: zero-shot-image-classification
tags:
- medical
datasets:
- simwit/mimic-cxr
- danjacobellis/chexpert
- rajpurkarlab/ReXGradient-160K
- BahaaEldin0/NIH-Chest-Xray-14
- SampadKar/vindr-cxr
metrics:
- accuracy
- bleu
---
# CheXficient
[Paper](https://arxiv.org/abs/2602.22843) | [GitHub](https://github.com/cwangrun/CheXficient)
CheXficient is a vision-language foundation model for chest X-ray (CXR) interpretation, designed to improve both **data efficiency** and **computational efficiency** during pretraining.
Instead of scaling indiscriminately to ever-larger datasets, CheXficient adopts a principled data curation strategy to selectively prioritize informative training samples.
This approach demonstrates that active, structured data selection can serve as a cost-effective alternative to brute-force dataset enlargement.
The model follows a dual-encoder architecture and supports prompt-based zero-shot classification via joint image-text representation learning.
------------------------------------------------------------------------
## Model Overview
- **Architecture:** Vision-language dual encoder
- **Image Backbone:** DINOv2 (small)
- **Text Backbone:** BioClinicalBERT
- **Input:** Chest X-ray image + text prompts
- **Output:** Image-text similarity logits and embeddings
- **Framework:** PyTorch + Hugging Face Transformers
- **Intended Use:** Research in medical AI and multimodal learning
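In a dual encoder of this kind, both encoders project their inputs into a shared embedding space, and the image-text similarity logits are scaled cosine similarities between the two sets of embeddings. The following is a minimal sketch of that scoring step only, not the model's actual forward pass; the embedding dimension and temperature value are illustrative assumptions.

``` python
import torch
import torch.nn.functional as F

# Toy embeddings standing in for the encoders' outputs
# (dimensions are arbitrary for illustration).
image_emb = torch.randn(1, 512)   # one image embedding
text_emb = torch.randn(2, 512)    # two prompt embeddings
logit_scale = torch.tensor(100.0) # learned temperature (value assumed)

# L2-normalize so the dot product equals cosine similarity.
image_emb = F.normalize(image_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)

# One similarity score per (image, prompt) pair.
logits_per_image = logit_scale * image_emb @ text_emb.T
print(logits_per_image.shape)  # (1, 2): one row of scores per image
```

Softmax over the prompt axis of `logits_per_image` then yields the zero-shot class probabilities, as in the example below.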
------------------------------------------------------------------------
## Installation
``` bash
pip install torch torchvision transformers pillow
```
------------------------------------------------------------------------
## Load the Model
``` python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
repo_id = "StanfordAIMI/CheXficient"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(repo_id, trust_remote_code=True)
model.eval()
```
------------------------------------------------------------------------
## Zero-Shot Classification Example
``` python
image = Image.open("./CXR/images/5AF3BB6C1BCC83C.png").convert("RGB")
text = ["Pneumonia", "no Pneumonia"]
image_inputs = image_processor(images=image, return_tensors="pt").to(device)
text_inputs = tokenizer(text, padding=True, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(
        pixel_values=image_inputs["pixel_values"],
        text_tokens=text_inputs,
    )
print(outputs)
```
Optional probability conversion:
``` python
import torch.nn.functional as F
logits = outputs["logits_per_image"]
probs = F.softmax(logits, dim=-1)
print(probs)
```
------------------------------------------------------------------------
## Citation
``` bibtex
@article{chexficient2026,
title={A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling},
author={...},
journal={...},
year={2026}
}
```