---
language: en
tags:
- clip
- medical-imaging
- radiology
- roco
- vision-language
base_model: openai/clip-vit-base-patch32
datasets:
- eltorio/ROCO-radiology
metrics:
- recall
license: mit
---

# ROCO-Radiology-CLIP (ViT-B/32)

> **A specialized vision-language model for radiology, fine-tuned on the ROCO dataset.**

This model aligns medical images (X-rays, CTs, MRIs) with their textual descriptions, enabling **zero-shot classification** and **semantic search** for radiology concepts.

## Performance (Test Set)

- **Batch-wise Recall@1:** 70.83%
- **Batch-wise Recall@5:** 96.99%
- **Global Retrieval Recall@1:** ~6% (roughly 500x better than random chance)
- **Global Retrieval Recall@5:** ~16%

Global retrieval recall is still quite low, so there is substantial room for improvement; updated versions of this model will be released.

## Usage

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("spicy03/CLIP-ROCO-v1")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot classification: score an image against candidate labels
image = Image.open("chest_xray.jpg")
labels = ["Pneumonia", "Normal", "Edema"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```
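The Recall@k numbers above can be computed from an image-text similarity matrix, where the matching caption for image `i` sits on the diagonal. A minimal sketch of that evaluation follows; the random embeddings here are stand-ins for real `model.get_image_features` / `model.get_text_features` outputs, and the helper name `recall_at_k` is illustrative, not part of the released code.

```python
import torch


def recall_at_k(sim: torch.Tensor, k: int) -> float:
    # sim[i, j] = similarity between image i and caption j;
    # the ground-truth caption for image i is at index i (diagonal).
    topk = sim.topk(k, dim=1).indices                 # (N, k) retrieved caption indices
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # (N, 1) ground-truth indices
    hits = (topk == targets).any(dim=1).float()       # 1.0 if the true caption is in top-k
    return hits.mean().item()


# Stand-in embeddings (in practice: L2-normalized CLIP image/text features).
torch.manual_seed(0)
img = torch.nn.functional.normalize(torch.randn(32, 512), dim=1)
txt = torch.nn.functional.normalize(img + 0.5 * torch.randn(32, 512), dim=1)

sim = img @ txt.T  # cosine similarity matrix, shape (32, 32)
print(f"Recall@1: {recall_at_k(sim, 1):.2%}, Recall@5: {recall_at_k(sim, 5):.2%}")
```

"Batch-wise" recall ranks each image against only the captions in its batch, while "global" recall ranks it against the entire test set, which is why the global numbers are much lower.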