---
language: en
tags:
  - clip
  - medical-imaging
  - radiology
  - roco
  - vision-language
base_model: openai/clip-vit-base-patch32
datasets:
  - eltorio/ROCO-radiology
metrics:
  - recall
license: mit
---

# ROCO-Radiology-CLIP (ViT-B/32)

A specialized vision-language model for radiology, fine-tuned on the ROCO dataset.

This model aligns medical images (X-rays, CTs, MRIs) with their textual descriptions, enabling zero-shot classification and semantic search for radiology concepts.

## Performance (Test Set)

- Batch-wise Recall@1: 70.83% (state-of-the-art for fine-tuning on a single T4 GPU)
- Batch-wise Recall@5: 96.99%
- Global Retrieval Recall@1: ~6% (roughly 500x better than random chance)
- Global Retrieval Recall@5: ~16%

Global retrieval recall is still low and remains an area of active work; these numbers will be updated in a future version.
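Batch-wise Recall@K counts an image as retrieved when its paired caption (the diagonal entry of the batch's image-text similarity matrix) appears among that image's top-K scores. A minimal sketch of the metric (the helper name `batch_recall_at_k` and the toy similarity matrix are illustrative, not taken from the training code):

```python
import numpy as np

def batch_recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of images whose matching caption (diagonal index i
    for row i) ranks in the top-k of that row's similarities."""
    topk = np.argsort(-sim, axis=1)[:, :k]  # caption indices, best first
    hits = sum(i in topk[i] for i in range(sim.shape[0]))
    return hits / sim.shape[0]

# Toy 3-image x 3-caption similarity matrix
sim = np.array([
    [0.9, 0.1, 0.2],  # image 0: correct caption ranked 1st
    [0.3, 0.7, 0.8],  # image 1: correct caption ranked 2nd
    [0.1, 0.4, 0.7],  # image 2: correct caption ranked 1st
])
print(batch_recall_at_k(sim, 1))  # 2 of 3 hits -> ~0.667
print(batch_recall_at_k(sim, 2))  # all 3 hit -> 1.0
```

Note that batch-wise recall is measured against a small in-batch gallery, which is why it is much higher than the global retrieval numbers, where every test caption competes.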

## Usage

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# The fine-tuned weights; the processor (tokenizer + image transforms)
# is unchanged from the base checkpoint.
model = CLIPModel.from_pretrained("spicy03/CLIP-ROCO-v1")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot classification: score an image against candidate labels.
image = Image.open("chest_xray.jpg")
labels = ["Pneumonia", "Normal", "Edema"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # one probability per label
print(probs)
```
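Beyond zero-shot classification, the same embeddings support semantic search: encode a text query and a gallery of images, then rank images by cosine similarity. A sketch of the ranking step (the `rank_by_cosine` helper is illustrative; the commented calls use the standard transformers CLIP APIs `get_image_features` / `get_text_features`):

```python
import numpy as np

def rank_by_cosine(image_embs: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Rank image embeddings by cosine similarity to one text embedding,
    best match first."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    return np.argsort(-(img @ txt))

# In practice the embeddings come from the model, e.g.:
#   image_embs = model.get_image_features(
#       **processor(images=images, return_tensors="pt")).detach().numpy()
#   text_emb = model.get_text_features(
#       **processor(text=["pulmonary nodule on CT"],
#                   return_tensors="pt")).detach().numpy()[0]

# Toy demonstration with 2-D embeddings:
image_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
text_emb = np.array([1.0, 0.1])
print(rank_by_cosine(image_embs, text_emb))  # [0 2 1]
```

Normalizing both sides makes the dot product equal to cosine similarity, so image embeddings for a large gallery can be precomputed once and reused across queries.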