---
language: en
tags:
- clip
- medical-imaging
- radiology
- roco
- vision-language
base_model: openai/clip-vit-base-patch32
datasets:
- eltorio/ROCO-radiology
metrics:
- recall
license: mit
---

# ROCO-Radiology-CLIP (ViT-B/32)

> **A specialized vision-language model for radiology, fine-tuned on the ROCO dataset.**

This model aligns medical images (X-rays, CTs, MRIs) with their textual descriptions, enabling **zero-shot classification** and **semantic search** for radiology concepts.

## Performance (Test Set)

- **Batch-wise Recall@1:** 70.83% (state-of-the-art for T4 fine-tuning)
- **Batch-wise Recall@5:** 96.99%
- **Global Retrieval Recall@1:** ~6% (500x better than random chance)
- **Global Retrieval Recall@5:** ~16%

A lot of work still needs to be done here, as global retrieval recall remains quite low; an updated version will be released.
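
For reference, here is a minimal sketch of how batch-wise Recall@K can be computed from paired image/text embeddings. This is an illustration of the metric, not the evaluation script behind the numbers above; the function name and shapes are assumptions.

```python
import torch

def batch_recall_at_k(image_embeds: torch.Tensor, text_embeds: torch.Tensor, k: int) -> float:
    """Fraction of images whose paired caption (same row index) appears in the
    top-k most similar captions within the batch. Both inputs are (N, D)."""
    # Normalize so the dot product is cosine similarity
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    sims = image_embeds @ text_embeds.T                 # (N, N) image-to-text similarities
    topk = sims.topk(k, dim=-1).indices                 # top-k caption indices per image
    targets = torch.arange(sims.size(0)).unsqueeze(-1)  # correct caption index per image
    return (topk == targets).any(dim=-1).float().mean().item()
```

Global retrieval is the same computation with the similarity matrix built over the entire test set rather than a single batch, which is why those numbers are much lower.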

## Usage

```python
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# Load the fine-tuned weights; the processor is the unmodified base CLIP processor
model = CLIPModel.from_pretrained("spicy03/CLIP-ROCO-v1")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot classification: score an image against candidate labels
image = Image.open("chest_xray.jpg")
labels = ["Pneumonia", "Normal", "Edema"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=1)
print(probs)
```
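
For semantic search, a minimal retrieval sketch is shown below, reusing the `model` and `processor` loaded above. The file names and query text are placeholders for illustration; `get_image_features` and `get_text_features` are standard `CLIPModel` methods.

```python
import torch
from PIL import Image

# Hypothetical file names for illustration; use your own image collection
image_paths = ["scan_001.jpg", "scan_002.jpg", "scan_003.jpg"]
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    # Embed all images once and L2-normalize so dot products are cosine similarities
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

    # Embed a free-text radiology query the same way
    query = "axial CT showing a pulmonary nodule"
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# Rank images by cosine similarity to the query
scores = (text_embeds @ image_embeds.T).squeeze(0)
for idx in scores.argsort(descending=True).tolist():
    print(image_paths[idx], round(scores[idx].item(), 3))
```

For larger collections, the image embeddings can be precomputed once and stored, so each query only requires a single text forward pass and a matrix multiply.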