# CheXficient

CheXficient is a vision-language foundation model for efficient and robust chest X-ray understanding. It learns joint image-text representations and supports prompt-based zero-shot classification.

This repository provides a Hugging Face-compatible implementation that loads through the standard `transformers` Auto classes (`AutoModel`, `AutoTokenizer`, `AutoImageProcessor`), so it can be dropped into existing research workflows.

------------------------------------------------------------------------

## Model Overview

-   Architecture: Vision-language dual encoder (an image encoder and a text encoder that embed into a shared space)
-   Input: A chest X-ray image and one or more text prompts
-   Output: Image-text similarity logits, plus image and text embeddings
-   Framework: PyTorch + Hugging Face Transformers
-   Intended Use: Research in medical AI and multimodal learning
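
A dual encoder scores each image-prompt pair by comparing the two embeddings. The sketch below uses the common CLIP-style formulation (L2-normalized embeddings, cosine similarity multiplied by a scale factor). This is an assumption for illustration: CheXficient's exact scoring is not described here, and `similarity_logits`, the `logit_scale` value, and the embedding size are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F


def similarity_logits(image_embeds: torch.Tensor,
                      text_embeds: torch.Tensor,
                      logit_scale: float = 100.0) -> torch.Tensor:
    """Hypothetical CLIP-style scoring (not CheXficient's confirmed code).

    image_embeds: (num_images, dim); text_embeds: (num_prompts, dim).
    Returns a (num_images, num_prompts) matrix of scaled cosine similarities.
    """
    image_embeds = F.normalize(image_embeds, dim=-1)  # unit-length rows
    text_embeds = F.normalize(text_embeds, dim=-1)
    return logit_scale * (image_embeds @ text_embeds.T)


# Toy example with random embeddings (placeholder dimension 512).
torch.manual_seed(0)
img = torch.randn(1, 512)
txt = torch.randn(2, 512)  # stand-ins for "Pneumonia", "no Pneumonia"
logits = similarity_logits(img, txt)
print(logits.shape)  # torch.Size([1, 2])
```

Normalizing first makes each score depend only on the angle between the two embeddings; the scale factor then sharpens the softmax over prompts.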

------------------------------------------------------------------------

## Installation

```bash
pip install torch torchvision transformers pillow
```
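
------------------------------------------------------------------------

## Load the Model

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

repo_id = "StanfordAIMI/CheXficient"  # assumed from the repository path
# trust_remote_code=True is needed to load CheXficient's custom model code
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(repo_id, trust_remote_code=True)
model.eval()  # inference mode (disables dropout)
```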
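
------------------------------------------------------------------------

## Zero-Shot Classification Example

```python
image = Image.open("./CXR/images/5AF3BB6C1BCC83C.png").convert("RGB")
text = ["Pneumonia", "no Pneumonia"]  # one prompt per candidate label

# Preprocessing arguments are assumed (standard transformers usage).
image_inputs = image_processor(images=image, return_tensors="pt")
text_inputs = tokenizer(text, padding=True, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients
    outputs = model(
        pixel_values=image_inputs["pixel_values"],
        text_tokens=text_inputs,
    )
print(outputs)
```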
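
Optional: convert the logits into probabilities over the prompts.

```python
import torch.nn.functional as F

# Assumes logits has shape (num_images, num_prompts), so dim=-1 normalizes over prompts.
logits = outputs["logits"]
probs = F.softmax(logits, dim=-1)
print(probs)
```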
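
The "Pneumonia" / "no Pneumonia" prompt pair extends to several findings by scoring each positive/negative pair separately. This sketch assumes `logits` has shape `(num_images, num_prompts)` with prompts in the order produced by `build_prompt_pairs`; both helpers are hypothetical and not part of the CheXficient API, and the dummy tensor stands in for real model output.

```python
import torch
import torch.nn.functional as F


def build_prompt_pairs(findings):
    """Hypothetical helper: ["A", "B"] -> ["A", "no A", "B", "no B"]."""
    prompts = []
    for finding in findings:
        prompts.extend([finding, f"no {finding}"])
    return prompts


def positive_probs(logits, findings):
    """Hypothetical helper: softmax each (finding, "no finding") pair separately.

    Assumes logits has shape (num_images, 2 * len(findings)) and its columns
    follow the order produced by build_prompt_pairs.
    """
    pairs = logits.reshape(logits.shape[0], len(findings), 2)  # (N, K, 2)
    probs = F.softmax(pairs, dim=-1)[..., 0]  # P(positive prompt) for each pair
    return {finding: probs[:, i] for i, finding in enumerate(findings)}


# With the loaded model, the prompts would be tokenized and scored as above, e.g.
#   text = build_prompt_pairs(findings)
#   text_inputs = tokenizer(text, padding=True, return_tensors="pt")
# Here a dummy tensor stands in for outputs["logits"].
findings = ["Pneumonia", "Pleural Effusion"]
dummy_logits = torch.tensor([[2.0, 0.0, 0.0, 0.0]])
print(positive_probs(dummy_logits, findings))
```

Normalizing within each pair treats every finding as its own yes/no question, so the probabilities need not sum to one across findings, which suits multi-label reading of a single X-ray.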
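
------------------------------------------------------------------------

## Model Interface

```python
outputs = model(
    pixel_values=pixel_values,  # torch.Tensor from the image processor, typically (batch, 3, H, W)
    text_tokens=text_inputs,    # dict of tokenizer outputs, e.g. input_ids and attention_mask
)
```

Returns the following, accessible by key (for example `outputs["logits"]`):

-   `logits`: image-text similarity scores
-   `image_embeds`: image embeddings
-   `text_embeds`: text prompt embeddings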
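
The returned embeddings can also be used without the logits, for example to rank a collection of images by how well they match a prompt. This sketch assumes the `image_embeds` from several model calls have been stacked into a `(num_images, dim)` tensor and that cosine similarity is a suitable metric for the embedding space; `rank_images` is a hypothetical helper, and random tensors stand in for real embeddings.

```python
import torch
import torch.nn.functional as F


def rank_images(image_embeds: torch.Tensor, text_embed: torch.Tensor, k: int = 3):
    """Hypothetical helper: indices and scores of the k images closest to one prompt.

    image_embeds: (num_images, dim); text_embed: (dim,).
    Cosine similarity is an assumption about the embedding space.
    """
    sims = F.cosine_similarity(image_embeds, text_embed.unsqueeze(0), dim=-1)  # (num_images,)
    k = min(k, sims.numel())
    scores, indices = torch.topk(sims, k)  # sorted, highest similarity first
    return indices.tolist(), scores.tolist()


# Random stand-ins for stacked image embeddings and one text embedding.
torch.manual_seed(0)
gallery = torch.randn(10, 256)
query = gallery[4] + 0.01 * torch.randn(256)  # nearly identical to image 4
print(rank_images(gallery, query, k=3))
```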

------------------------------------------------------------------------

## Intended Use

-   Zero-shot chest X-ray classification
-   Vision-language representation learning
-   Prompt-based disease detection
-   Medical AI research

------------------------------------------------------------------------

## Limitations

-   For research use only
-   Not approved for clinical deployment
-   Performance may vary across institutions and patient demographics
-   Loading requires `trust_remote_code=True`, which executes Python code from the model repository on your machine

------------------------------------------------------------------------

## Citation

```bibtex
@article{chexficient2024,
  title={CheXficient: Efficient Vision-Language Learning for Chest X-ray Understanding},
  author={...},
  journal={...},
  year={2024}
}
```