dmusingu/lapvqa-pg

+---
+tags:
+- chest-xray
+- radiology
+- phrase-grounding
+- mimic-cxr
+license: apache-2.0
+---
+# LAPVQA — Phrase Grounding
+Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).
+## Description
+Phrase-grounding heads trained on MIMIC-CXR. Given a chest X-ray and a text phrase
+(e.g. "Pleural Effusion"), each head predicts the bounding box of the described abnormality.
+Six frozen encoder backbones are covered; each file contains only the grounding-head weights, not the encoder.
+## Results (MIMIC-CXR test set)
+**Zero-shot (no fine-tuning):**
+| Encoder | mIoU | Acc@0.25 |
+|---|---|---|
+| SigLIP | 0.086 | 0.042 |
+| Florence-2 | 0.089 | 0.046 |
+| CLIP ViT-L/14 | 0.085 | 0.028 |
+| CoCa | 0.082 | 0.023 |
+| OWLv2 | 0.082 | 0.023 |
+**Fine-tuned (MAE ViT-L/16):**
+| mIoU | Acc@0.25 | Acc@0.50 | Pointing Acc |
+|---|---|---|---|
+| 0.320 | 0.569 | 0.273 | 0.593 |
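+The fine-tuned MAE ViT-L/16 head reaches an mIoU of 0.320, about 3.6× to 3.9× the zero-shot mIoU of the five encoders above (0.082 to 0.089). Acc@0.25 rises from at most 0.046 zero-shot to 0.569.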
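The card does not define the metrics in the tables above, so the following is only a minimal sketch of how such numbers are commonly computed, not the evaluation code used for these results. It assumes boxes are `(x_min, y_min, x_max, y_max)` tuples, that Acc@t counts predictions with IoU of at least t, and that pointing accuracy counts predictions whose box center falls inside the ground-truth box. All function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in any consistent unit.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def center_in_box(pred: Box, gt: Box) -> bool:
    """Assumed pointing criterion: the predicted box center lies inside gt."""
    cx = (pred[0] + pred[2]) / 2
    cy = (pred[1] + pred[3]) / 2
    return gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]


def grounding_metrics(preds: Sequence[Box], gts: Sequence[Box]) -> Dict[str, float]:
    """Aggregate mIoU, Acc@0.25, Acc@0.50 and pointing accuracy over a dataset."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and of equal length")
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    return {
        "mIoU": sum(ious) / n,
        "Acc@0.25": sum(v >= 0.25 for v in ious) / n,
        "Acc@0.50": sum(v >= 0.50 for v in ious) / n,
        "Pointing Acc": sum(center_in_box(p, g) for p, g in zip(preds, gts)) / n,
    }
```

Under these definitions pointing accuracy can exceed Acc@0.25, since a box whose center lands inside the target may still overlap it poorly. The fine-tuned row shows that ordering, although the card does not confirm that these are the definitions used.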
+## Files
+| File | Encoder backbone |
+|---|---|
+| `clip-vit-l14.pt` | CLIP ViT-L/14 |
+| `siglip.pt` | SigLIP |
+| `florence2.pt` | Florence-2 |
+| `coca.pt` | CoCa |
+| `owlv2.pt` | OWLv2 |
+| `mae-vit-l16.pt` | MAE ViT-L/16 (fine-tuned) |
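A hedged sketch of fetching one of the files listed above from the Hub. The repository ID `dmusingu/lapvqa-pg` is taken from the page header and is an assumption, as is the checkpoint layout: the card does not say whether each `.pt` file is a plain state dict, so the `torch.load` call may need adjusting. The helper names are hypothetical; `hf_hub_download` and `torch.load` are the real library calls.

```python
# Assumed Hub repository ID, taken from the page header of this card.
REPO_ID = "dmusingu/lapvqa-pg"

# Encoder name -> checkpoint file, copied from the Files table.
HEAD_FILES = {
    "CLIP ViT-L/14": "clip-vit-l14.pt",
    "SigLIP": "siglip.pt",
    "Florence-2": "florence2.pt",
    "CoCa": "coca.pt",
    "OWLv2": "owlv2.pt",
    "MAE ViT-L/16": "mae-vit-l16.pt",
}


def head_filename(encoder: str) -> str:
    """Return the checkpoint file name for an encoder listed in the table."""
    try:
        return HEAD_FILES[encoder]
    except KeyError:
        raise KeyError(
            f"unknown encoder {encoder!r}; expected one of {sorted(HEAD_FILES)}"
        ) from None


def download_head(encoder: str) -> str:
    """Download the grounding-head file and return its local cache path."""
    # Imported lazily so the lookup helpers work without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=head_filename(encoder))


def load_head(encoder: str):
    """Load the checkpoint on CPU, assuming it holds only tensors (a state dict)."""
    import torch

    # weights_only=True refuses arbitrary pickled objects (PyTorch 1.13+).
    return torch.load(download_head(encoder), map_location="cpu", weights_only=True)
```

Since each file holds only grounding-head weights, the matching encoder backbone has to be obtained and instantiated separately before the head can be used.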