---
license: mit
language:
- en
base_model:
- facebook/dinov2-small
- emilyalsentzer/Bio_ClinicalBERT
pipeline_tag: zero-shot-image-classification
tags:
- medical
datasets:
- simwit/mimic-cxr
- danjacobellis/chexpert
- rajpurkarlab/ReXGradient-160K
- BahaaEldin0/NIH-Chest-Xray-14
- SampadKar/vindr-cxr
metrics:
- accuracy
- bleu
---
# CheXficient

[Paper](https://arxiv.org/abs/2602.22843) | [GitHub](https://github.com/cwangrun/CheXficient)

CheXficient is a vision-language foundation model for chest X-ray (CXR) interpretation, designed to improve both **data efficiency** and **computational efficiency** during pretraining.

Instead of scaling indiscriminately to ever-larger datasets, CheXficient adopts a principled data curation strategy that selectively prioritizes informative training samples.
This approach demonstrates that active, structured data selection can serve as a cost-effective alternative to brute-force dataset enlargement.

The model follows a dual-encoder architecture and supports prompt-based zero-shot classification via joint image-text representation learning.

---
## Model Overview

- **Architecture:** Vision-language dual encoder
- **Image Backbone:** DINOv2 (small)
- **Text Backbone:** Bio_ClinicalBERT
- **Input:** Chest X-ray image + text prompts
- **Output:** Image-text similarity logits and embeddings
- **Framework:** PyTorch + Hugging Face Transformers
- **Intended Use:** Research in medical AI and multimodal learning
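In CLIP-style dual encoders, the similarity logits are typically a temperature-scaled dot product of L2-normalized image and text embeddings. A minimal sketch in plain PyTorch, with an illustrative embedding size and temperature (not values taken from the CheXficient paper):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for the two encoder outputs: 1 image embedding, 2 text embeddings.
# The 512-dim projection size and the temperature are illustrative assumptions.
image_emb = torch.randn(1, 512)
text_emb = torch.randn(2, 512)

# L2-normalize so the dot product equals cosine similarity.
image_emb = F.normalize(image_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)

# Temperature-scaled similarity: one row per image, one column per prompt.
logit_scale = 100.0
logits_per_image = logit_scale * image_emb @ text_emb.t()
print(logits_per_image.shape)  # torch.Size([1, 2])
```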
---

## Installation

```bash
pip install torch torchvision transformers pillow
```
---

## Load the Model

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

repo_id = "StanfordAIMI/CheXficient"
device = "cuda" if torch.cuda.is_available() else "cpu"

# The model ships custom modeling code, so trust_remote_code=True is required.
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True
).to(device)

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(repo_id, trust_remote_code=True)

# Switch to inference mode (disables dropout, etc.).
model.eval()
```
---

## Zero-Shot Classification Example

```python
# Load a chest X-ray and define the candidate class prompts.
image = Image.open("./CXR/images/5AF3BB6C1BCC83C.png").convert("RGB")
text = ["Pneumonia", "no Pneumonia"]

# Preprocess the image and tokenize the prompts.
image_inputs = image_processor(images=image, return_tensors="pt").to(device)
text_inputs = tokenizer(text, padding=True, return_tensors="pt").to(device)

# Forward pass: returns image-text similarity logits and embeddings.
with torch.no_grad():
    outputs = model(
        pixel_values=image_inputs["pixel_values"],
        text_tokens=text_inputs,
    )

print(outputs)
```
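The plain class names above can also be wrapped in natural-language prompt templates, a common practice in CLIP-style zero-shot classification. The template below is an illustrative assumption, not a phrasing prescribed by CheXficient:

```python
def build_prompts(labels, template="A chest X-ray showing {}."):
    """Wrap each class name in a natural-language prompt template.

    The template string is a hypothetical example; in practice it is
    worth tuning on a validation set for the model at hand.
    """
    return [template.format(label.lower()) for label in labels]

prompts = build_prompts(["Pneumonia", "no Pneumonia"])
print(prompts)  # ['A chest X-ray showing pneumonia.', 'A chest X-ray showing no pneumonia.']
```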
Optional probability conversion:

```python
import torch.nn.functional as F

# Softmax over the prompt axis turns similarity logits into class probabilities.
logits = outputs["logits_per_image"]
probs = F.softmax(logits, dim=-1)
print(probs)
```
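To map the probabilities back to a predicted label, take the argmax over the prompt axis. A self-contained sketch with dummy logits standing in for `outputs["logits_per_image"]`:

```python
import torch
import torch.nn.functional as F

# Dummy similarity logits: 1 image scored against 2 prompts.
logits = torch.tensor([[4.2, 1.3]])
labels = ["Pneumonia", "no Pneumonia"]

probs = F.softmax(logits, dim=-1)
pred = labels[probs.argmax(dim=-1).item()]
print(pred)  # Pneumonia
```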
---

## Citation

```bibtex
@article{chexficient2026,
  title={A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling},
  author={...},
  journal={...},
  year={2026}
}
```