---
tags:
- chest-xray
- radiology
- phrase-grounding
- mimic-cxr
license: apache-2.0
---

# LAPVQA: Phrase Grounding

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

TransVG-style phrase grounding heads trained on MIMIC-CXR. Given a chest X-ray and a text phrase describing an abnormality, each head predicts the bounding box of that abnormality.
Each checkpoint is a dict with the keys `state_dict`, `vis_dim`, `txt_dim`, `d_model`, `num_layers`, `encoder`, `epoch`, `val_miou`, and `val_acc50`.

## Architecture: `VisualGroundingHead`

```
vis_proj   : Linear(vis_dim → 256)
txt_proj   : Linear(txt_dim → 256)
reg_token  : Parameter [1, 1, 256]
sequence   : [REG | vis_tokens | txt_token]
transformer: 3 × TransformerEncoderLayer (self-attn, pre-norm)
box_head   : MLP(256 → 256 → 4)   # sigmoid → (cx,cy,w,h) ∈ [0,1]
```
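
To make the layout above concrete, here is a minimal PyTorch sketch of a head with the same shape. It is not the released `VisualGroundingHead`. The class name `GroundingHeadSketch`, the number of attention heads (`nhead=8`), the feed-forward width, the ReLU inside the MLP, and the absence of positional embeddings are all assumptions made for illustration, so its parameters will not necessarily match the published `state_dict`.

```python
import torch
import torch.nn as nn


class GroundingHeadSketch(nn.Module):
    """Illustrative re-creation of the layout above (hypothetical, not the released code)."""

    def __init__(self, vis_dim, txt_dim, d_model=256, num_layers=3, nhead=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        # Learnable [REG] token whose output state is regressed to a box.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,                   # assumption: not stated in the card
            dim_feedforward=4 * d_model,   # assumption: not stated in the card
            batch_first=True,
            norm_first=True,               # pre-norm, as listed above
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),                     # assumption: activation not stated
            nn.Linear(d_model, 4),
        )

    def forward(self, vis_tokens, txt_vec):
        batch = vis_tokens.size(0)
        vis = self.vis_proj(vis_tokens)                # [B, HW, d_model]
        txt = self.txt_proj(txt_vec).unsqueeze(1)      # [B, 1, d_model]
        reg = self.reg_token.expand(batch, -1, -1)     # [B, 1, d_model]
        seq = torch.cat([reg, vis, txt], dim=1)        # [REG | vis_tokens | txt_token]
        out = self.transformer(seq)
        # Read the box from the [REG] position; sigmoid keeps it in [0, 1].
        return self.box_head(out[:, 0]).sigmoid()      # [B, 4] as (cx, cy, w, h)
```

Calling it with random inputs of shape `[B, HW, vis_dim]` and `[B, txt_dim]` returns a `[B, 4]` tensor, which is a quick way to check tensor shapes before wiring in a real encoder.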

## Results (MIMIC-CXR test set)

**Zero-shot:** mIoU ≈ 0.082 to 0.089 across all encoders.

**Fine-tuned (MAE-ViT-L/16):** mIoU 0.320, Acc@0.25 0.569, Pointing Acc 0.593.

| File | Encoder | vis_dim | txt_dim |
|---|---|---|---|
| `clip-vit-l14.pt` | CLIP ViT-L/14 | 1024 | 768 |
| `siglip.pt` | SigLIP | 1152 | 1152 |
| `florence2.pt` | Florence-2 | 1024 | 768 |
| `coca.pt` | CoCa | 768 | 768 |
| `owlv2.pt` | OWLv2 | 1024 | 768 |
| `mae-vit-l16.pt` | MAE ViT-L/16 | 1024 | 768 |
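
For readers who want to reproduce numbers of the kind reported in this section, the sketch below shows one common way to compute mIoU, Acc@threshold, and pointing accuracy from normalized `(cx, cy, w, h)` boxes. The card does not define these metrics, so the definitions here are assumptions: Acc@0.25 is taken as the fraction of samples with IoU ≥ 0.25, and a pointing hit is counted when the predicted box center falls inside the ground-truth box. All function names are hypothetical.

```python
def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(pred, gt):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    px1, py1, px2, py2 = cxcywh_to_xyxy(pred)
    gx1, gy1, gx2, gy2 = cxcywh_to_xyxy(gt)
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def pointing_hit(pred, gt):
    """Assumed definition: the predicted center lies inside the ground-truth box."""
    gx1, gy1, gx2, gy2 = cxcywh_to_xyxy(gt)
    return gx1 <= pred[0] <= gx2 and gy1 <= pred[1] <= gy2


def grounding_metrics(preds, gts, iou_threshold=0.25):
    """Aggregate mIoU, Acc@iou_threshold, and pointing accuracy over a dataset."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    return {
        "miou": sum(ious) / n,
        "acc": sum(v >= iou_threshold for v in ious) / n,
        "pointing_acc": sum(pointing_hit(p, g) for p, g in zip(preds, gts)) / n,
    }
```

If the evaluation protocol used for the reported results differs (for example, a strict `>` threshold or a different pointing rule), the numbers from this sketch will differ slightly.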

## Loading

```python
import torch
from lapvqa.pg.heads import VisualGroundingHead

ckpt = torch.load("mae-vit-l16.pt", map_location="cpu")
head = VisualGroundingHead(
    vis_dim    = ckpt["vis_dim"],
    txt_dim    = ckpt["txt_dim"],
    d_model    = ckpt["d_model"],
    num_layers = ckpt["num_layers"],
)
head.load_state_dict(ckpt["state_dict"])
head.eval()

with torch.no_grad():
    # vis_tokens: [B, HW, vis_dim] - spatial patch tokens from frozen encoder
    # txt_vec:    [B, txt_dim]     - pooled text representation from frozen encoder
    pred_boxes = head(vis_tokens, txt_vec)   # [B, 4]  (cx,cy,w,h) in [0,1]
```
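
Because the predicted boxes are normalized `(cx, cy, w, h)` values, drawing them on an image requires converting to pixel corners. The helper below is a small sketch of that conversion. The name `to_pixel_xyxy` is hypothetical, and it assumes the coordinates are normalized by the width and height of the image passed in, which should be checked against the preprocessing used by the encoder.

```python
def to_pixel_xyxy(box, image_width, image_height):
    """Map a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2), clamped to the image."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * image_width
    y1 = (cy - h / 2) * image_height
    x2 = (cx + w / 2) * image_width
    y2 = (cy + h / 2) * image_height

    def clamp(value, upper):
        # Keep coordinates inside [0, upper] in case the box spills past the border.
        return min(max(value, 0), upper)

    return (
        clamp(x1, image_width),
        clamp(y1, image_height),
        clamp(x2, image_width),
        clamp(y2, image_height),
    )
```

With the output of the loading snippet, one way to use it is `to_pixel_xyxy(pred_boxes[0].tolist(), width, height)`, where `width` and `height` are the dimensions of the image being annotated.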