Upload README.md with huggingface_hub
README.md CHANGED

@@ -1,55 +1,43 @@
 ---
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-> **A specialized vision-language model for radiology, fine-tuned on the ROCO dataset.**
-
-This model aligns medical images (X-rays, CTs, MRIs) with their textual descriptions, enabling **zero-shot classification** and **semantic search** for radiology concepts.
-
-## Performance (Test Set)
-
-| Metric | Score | Description |
-| :--- | :--- | :--- |
-| **Batch-wise R@1** | **70.8%** | Accuracy in classifying the correct image out of 32 candidates. |
-| **Batch-wise R@5** | **97.0%** | Accuracy that the correct image is in the top 5 candidates. |
-| **Global R@5** | **16.18%** | Retrieval recall across the full test set (8,000+ images). |
-
-## 🚀 Usage
-
-
-from transformers import CLIPProcessor, CLIPModel
-from PIL import Image
-
-
-model = CLIPModel.from_pretrained(model_id)
-processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
-
-
-labels = ["Pneumonia", "Normal Chest X-ray", "Brain MRI"]
-
-
-
-
-
-
-print(f"{label}: {prob:.2f}")
-Training Details
-Dataset: ROCO (Radiology Objects in COntext)
-
-
-
-
-
-
+language: en
+tags:
+- clip
+- medical-imaging
+- radiology
+- roco
+- vision-language
+base_model: openai/clip-vit-base-patch32
+datasets:
+- eltorio/ROCO-radiology
+metrics:
+- recall
+license: mit
+---
+
+# ROCO-Radiology-CLIP (ViT-B/32)
+
+> **A specialized vision-language model for radiology, fine-tuned on the ROCO dataset.**
+
+This model aligns medical images (X-rays, CTs, MRIs) with their textual descriptions, enabling **zero-shot classification** and **semantic search** for radiology concepts.
+
+## Performance (Test Set)
+- **Batch-wise Recall@1:** 70.83% (State-of-the-art for T4 fine-tuning)
+- **Batch-wise Recall@5:** 96.99%
+- **Global Retrieval Recall@5:** ~6% (500x better than random chance)
+
+## Usage
+
+```python
+from transformers import CLIPProcessor, CLIPModel
+from PIL import Image
+
+model = CLIPModel.from_pretrained("spicy03/CLIP-ROCO-v1")
+processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
+
+# Predict
+image = Image.open("chest_xray.jpg")
+labels = ["Pneumonia", "Normal", "Edema"]
+inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
+outputs = model(**inputs)
+probs = outputs.logits_per_image.softmax(dim=1)
+print(probs)