---
tags:
- chest-xray
- radiology
- phrase-grounding
- mimic-cxr
license: apache-2.0
---
# LAPVQA: Phrase Grounding
Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).
## Description
TransVG-style phrase grounding heads trained on MIMIC-CXR. Given a chest X-ray and
a text phrase, each head predicts the bounding box of the described abnormality.
Each checkpoint is a dict: `{state_dict, vis_dim, txt_dim, d_model, num_layers, encoder, epoch, val_miou, val_acc50}`.
## Architecture: `VisualGroundingHead`
```
vis_proj    : Linear(vis_dim → 256)
txt_proj    : Linear(txt_dim → 256)
reg_token   : Parameter [1, 1, 256]
sequence    : [REG | vis_tokens | txt_token]
transformer : 3 × TransformerEncoderLayer (self-attn, pre-norm)
box_head    : MLP(256 → 256 → 4)   # sigmoid → (cx, cy, w, h) ∈ [0, 1]
```
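For readers without the `lapvqa` package at hand, the layout above can be approximated in plain PyTorch. This is a minimal sketch reconstructed only from the summary, not the released `lapvqa.pg.heads.VisualGroundingHead`. The class name `VisualGroundingHeadSketch`, the number of attention heads, the feed-forward width, the ReLU in the box MLP, and the absence of positional embeddings are all assumptions. Its parameter names are not guaranteed to match the checkpoints' `state_dict`.

```python
import torch
import torch.nn as nn


class VisualGroundingHeadSketch(nn.Module):
    """Illustrative re-creation of the layout above; NOT the released code."""

    def __init__(self, vis_dim, txt_dim, d_model=256, num_layers=3, nhead=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        # Learnable [REG] token whose output state is decoded into a box.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,                  # assumption: not stated in the card
            dim_feedforward=4 * d_model,  # assumption: not stated in the card
            batch_first=True,
            norm_first=True,              # pre-norm, as in the summary
        )
        self.transformer = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False
        )
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),                    # assumption: activation not stated
            nn.Linear(d_model, 4),
        )

    def forward(self, vis_tokens, txt_vec):
        vis = self.vis_proj(vis_tokens)                    # [B, HW, d_model]
        txt = self.txt_proj(txt_vec).unsqueeze(1)          # [B, 1, d_model]
        reg = self.reg_token.expand(vis.size(0), -1, -1)   # [B, 1, d_model]
        seq = torch.cat([reg, vis, txt], dim=1)            # [REG | vis | txt]
        out = self.transformer(seq)
        # Decode the [REG] position; sigmoid keeps (cx, cy, w, h) in [0, 1].
        return torch.sigmoid(self.box_head(out[:, 0]))
```

Reading the box from the `[REG]` position lets one token attend jointly over image patches and the phrase, which is the core idea borrowed from TransVG.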
## Results (MIMIC-CXR test set)
**Zero-shot:** mIoU ≈ 0.082–0.089 across all encoders.
**Fine-tuned (MAE-ViT-L/16):** mIoU 0.320, Acc@0.25 0.569, Pointing Acc 0.593.
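
The card does not spell out how these metrics are computed, so the following is a sketch under common conventions rather than the evaluation code behind the numbers above. It assumes mIoU is the mean box IoU over samples, Acc@0.25 is the fraction of samples with IoU ≥ 0.25, and Pointing Acc counts a hit when the predicted box centre falls inside the ground-truth box. All function names are hypothetical.

```python
def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def box_iou(pred, gt):
    """IoU of two boxes given as (cx, cy, w, h)."""
    px1, py1, px2, py2 = cxcywh_to_xyxy(pred)
    gx1, gy1, gx2, gy2 = cxcywh_to_xyxy(gt)
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def pointing_hit(pred, gt):
    """Assumed definition: predicted centre lies inside the ground-truth box."""
    gx1, gy1, gx2, gy2 = cxcywh_to_xyxy(gt)
    return gx1 <= pred[0] <= gx2 and gy1 <= pred[1] <= gy2


def summarize(preds, gts, thresh=0.25):
    """Aggregate mIoU, Acc@thresh and pointing accuracy over paired boxes."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    hits = [pointing_hit(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    return {
        "miou": sum(ious) / n,
        "acc": sum(i >= thresh for i in ious) / n,
        "pointing_acc": sum(hits) / n,
    }
```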
| File | Encoder | vis_dim | txt_dim |
|---|---|---|---|
| `clip-vit-l14.pt` | CLIP ViT-L/14 | 1024 | 768 |
| `siglip.pt` | SigLIP | 1152 | 1152 |
| `florence2.pt` | Florence-2 | 1024 | 768 |
| `coca.pt` | CoCa | 768 | 768 |
| `owlv2.pt` | OWLv2 | 1024 | 768 |
| `mae-vit-l16.pt` | MAE ViT-L/16 | 1024 | 768 |
## Loading
```python
import torch
from lapvqa.pg.heads import VisualGroundingHead

ckpt = torch.load("mae-vit-l16.pt", map_location="cpu")
head = VisualGroundingHead(
    vis_dim=ckpt["vis_dim"],
    txt_dim=ckpt["txt_dim"],
    d_model=ckpt["d_model"],
    num_layers=ckpt["num_layers"],
)
head.load_state_dict(ckpt["state_dict"])
head.eval()

with torch.no_grad():
    # vis_tokens: [B, HW, vis_dim], spatial patch tokens from the frozen encoder
    # txt_vec:    [B, txt_dim], pooled text representation from the frozen encoder
    pred_boxes = head(vis_tokens, txt_vec)  # [B, 4], (cx, cy, w, h) in [0, 1]
```
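
Because the head returns normalized `(cx, cy, w, h)` boxes, drawing or cropping usually needs pixel corner coordinates. The helper below is a small sketch of that conversion. `box_to_pixels` is a hypothetical name, and it assumes the normalization is relative to the width and height of the image as it was passed to the encoder, which the card does not state.

```python
def box_to_pixels(box, width, height):
    """Map a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2).

    Assumes coordinates are fractions of the given image width and height.
    Corners are clipped so the box stays inside the image.
    """
    cx, cy, w, h = box
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height

    def clip(value, upper):
        return min(max(value, 0.0), float(upper))

    return (clip(x1, width), clip(y1, height), clip(x2, width), clip(y2, height))
```

With the tensors from the loading snippet, a single prediction could be converted with `box_to_pixels(pred_boxes[0].tolist(), width, height)`.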