---
license: mit
language:
- en
base_model:
- facebook/dinov2-small
- emilyalsentzer/Bio_ClinicalBERT
pipeline_tag: zero-shot-image-classification
tags:
- medical
datasets:
- simwit/mimic-cxr
- danjacobellis/chexpert
- rajpurkarlab/ReXGradient-160K
- BahaaEldin0/NIH-Chest-Xray-14
- SampadKar/vindr-cxr
metrics:
- accuracy
- bleu
---
# CheXficient

[Paper](https://arxiv.org/abs/2602.22843) | [GitHub](https://github.com/cwangrun/CheXficient)

CheXficient is a vision-language foundation model for chest X-ray (CXR) interpretation, designed to improve both **data efficiency** and **computational efficiency** during pretraining.

Instead of scaling indiscriminately to ever-larger datasets, CheXficient adopts a principled data curation strategy to selectively prioritize informative training samples. 
This approach demonstrates that active, structured data selection can serve as a cost-effective alternative to brute-force dataset enlargement.

The model follows a dual-encoder architecture and supports prompt-based zero-shot classification via joint image-text representation learning.


------------------------------------------------------------------------

## Model Overview

- **Architecture:** Vision-language dual encoder  
- **Image Backbone:** DINOv2 (small)  
- **Text Backbone:** BioClinicalBERT  
- **Input:** Chest X-ray image + text prompts  
- **Output:** Image-text similarity logits and embeddings  
- **Framework:** PyTorch + Hugging Face Transformers  
- **Intended Use:** Research in medical AI and multimodal learning 
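
Dual-encoder models of this kind score an image against text prompts by comparing their embeddings in a shared space, typically via temperature-scaled cosine similarity in the CLIP style. The snippet below illustrates only that generic similarity computation with random tensors; the embedding dimension and temperature are illustrative and are not taken from this model's internals.

``` python
import torch
import torch.nn.functional as F

# Illustrative shapes: one image embedding, two prompt embeddings,
# both L2-normalized so the dot product is a cosine similarity.
image_emb = F.normalize(torch.randn(1, 512), dim=-1)
text_emb = F.normalize(torch.randn(2, 512), dim=-1)

temperature = 0.07  # illustrative CLIP-style temperature
logits_per_image = image_emb @ text_emb.T / temperature  # shape [1, 2]
```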

------------------------------------------------------------------------

## Installation

``` bash
pip install torch torchvision transformers pillow
```

------------------------------------------------------------------------

## Load the Model

``` python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

repo_id = "StanfordAIMI/CheXficient"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True
).to(device)

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(repo_id, trust_remote_code=True)

model.eval()
```

------------------------------------------------------------------------

## Zero-Shot Classification Example

``` python
image = Image.open("./CXR/images/5AF3BB6C1BCC83C.png").convert("RGB")
text = ["Pneumonia", "no Pneumonia"]

image_inputs = image_processor(images=image, return_tensors="pt").to(device)
text_inputs = tokenizer(text, padding=True, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(
        pixel_values=image_inputs["pixel_values"],
        text_tokens=text_inputs,
    )

print(outputs)
```

Optional probability conversion:

``` python
import torch.nn.functional as F

logits = outputs["logits_per_image"]
probs = F.softmax(logits, dim=-1)
print(probs)
```
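
For multi-label screening, a common pattern is to score each finding independently with a positive/negative prompt pair and take the softmax over that pair. The helper below is a minimal sketch of that post-processing step, assuming per-pair logits shaped `[1, 2]` as in the example above; the input values are illustrative.

``` python
import torch
import torch.nn.functional as F

def positive_probability(logits_per_image: torch.Tensor) -> float:
    """Return P(finding present) from a [1, 2] logit tensor whose
    columns correspond to (positive prompt, negative prompt)."""
    probs = F.softmax(logits_per_image, dim=-1)
    return probs[0, 0].item()

# Illustrative logits favoring the positive prompt.
logits = torch.tensor([[2.0, 0.0]])
p = positive_probability(logits)
```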


------------------------------------------------------------------------

## Citation

``` bibtex
@article{chexficient2026,
  title={A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling},
  author={...},
  journal={...},
  year={2026}
}
```