---
language: en
tags:
- clip
- medical-imaging
- radiology
- roco
- vision-language
base_model: openai/clip-vit-base-patch32
datasets:
- eltorio/ROCO-radiology
metrics:
- recall
license: mit
---

# ROCO-Radiology-CLIP (ViT-B/32)

> **A specialized vision-language model for radiology, fine-tuned on the ROCO dataset.**

This model aligns medical images (X-rays, CTs, MRIs) with their textual descriptions, enabling **zero-shot classification** and **semantic search** for radiology concepts.

## Performance (Test Set)

- **Batch-wise Recall@1:** 70.83%
- **Batch-wise Recall@5:** 96.99%
- **Global Retrieval Recall@1:** ~6% (roughly 500x better than random chance)
- **Global Retrieval Recall@5:** ~16%

Global retrieval recall is still quite low, so there is substantial room for improvement; updated versions of this model will be released.

## Usage

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("spicy03/CLIP-ROCO-v1")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot classification: score an image against candidate labels
image = Image.open("chest_xray.jpg")
labels = ["Pneumonia", "Normal", "Edema"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```
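The Recall@k numbers above can be computed from an image-text similarity matrix, where the matching caption for image `i` sits on the diagonal. A minimal sketch of that evaluation follows; the random embeddings here are stand-ins for real `model.get_image_features` / `model.get_text_features` outputs, and the helper name `recall_at_k` is illustrative, not part of the released code.

```python
import torch


def recall_at_k(sim: torch.Tensor, k: int) -> float:
    # sim[i, j] = similarity between image i and caption j;
    # the ground-truth caption for image i is at index i (diagonal).
    topk = sim.topk(k, dim=1).indices                 # (N, k) retrieved caption indices
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # (N, 1) ground-truth indices
    hits = (topk == targets).any(dim=1).float()       # 1.0 if the true caption is in top-k
    return hits.mean().item()


# Stand-in embeddings (in practice: L2-normalized CLIP image/text features).
torch.manual_seed(0)
img = torch.nn.functional.normalize(torch.randn(32, 512), dim=1)
txt = torch.nn.functional.normalize(img + 0.5 * torch.randn(32, 512), dim=1)

sim = img @ txt.T  # cosine similarity matrix, shape (32, 32)
print(f"Recall@1: {recall_at_k(sim, 1):.2%}, Recall@5: {recall_at_k(sim, 5):.2%}")
```

"Batch-wise" recall ranks each image against only the captions in its batch, while "global" recall ranks it against the entire test set, which is why the global numbers are much lower.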