---
tags:
- chest-xray
- radiology
- phrase-grounding
- mimic-cxr
license: apache-2.0
---

# LAPVQA: Phrase Grounding

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

TransVG-style phrase grounding heads trained on MIMIC-CXR, predicting the bounding box of a described abnormality given the chest X-ray and a text phrase.

Each checkpoint is a dict: `{state_dict, vis_dim, txt_dim, d_model, num_layers, encoder, epoch, val_miou, val_acc50}`.

## Architecture: `VisualGroundingHead`

```
vis_proj   : Linear(vis_dim → 256)
txt_proj   : Linear(txt_dim → 256)
reg_token  : Parameter [1, 1, 256]
sequence   : [REG | vis_tokens | txt_token]
transformer: 3 × TransformerEncoderLayer (self-attn, pre-norm)
box_head   : MLP(256 → 256 → 4)   # sigmoid → (cx,cy,w,h) ∈ [0,1]
```

## Results (MIMIC-CXR test set)

**Zero-shot:** mIoU ≈ 0.082–0.089 across all encoders.

**Fine-tuned (MAE-ViT-L/16):** mIoU 0.320, Acc@0.25 0.569, Pointing Acc 0.593.

| File | Encoder | vis_dim | txt_dim |
|---|---|---|---|
| `clip-vit-l14.pt` | CLIP ViT-L/14 | 1024 | 768 |
| `siglip.pt` | SigLIP | 1152 | 1152 |
| `florence2.pt` | Florence-2 | 1024 | 768 |
| `coca.pt` | CoCa | 768 | 768 |
| `owlv2.pt` | OWLv2 | 1024 | 768 |
| `mae-vit-l16.pt` | MAE ViT-L/16 | 1024 | 768 |

## Loading

```python
import torch
from lapvqa.pg.heads import VisualGroundingHead

ckpt = torch.load("mae-vit-l16.pt", map_location="cpu")
head = VisualGroundingHead(
    vis_dim=ckpt["vis_dim"],
    txt_dim=ckpt["txt_dim"],
    d_model=ckpt["d_model"],
    num_layers=ckpt["num_layers"],
)
head.load_state_dict(ckpt["state_dict"])
head.eval()

with torch.no_grad():
    # vis_tokens: [B, HW, vis_dim], spatial patch tokens from frozen encoder
    # txt_vec:    [B, txt_dim], pooled text representation from frozen encoder
    pred_boxes = head(vis_tokens, txt_vec)  # [B, 4] (cx,cy,w,h) in [0,1]
```
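
The head outputs normalized `(cx, cy, w, h)` boxes, while the reported metrics (mIoU, Acc@0.25) are IoU-based. The sketch below shows one way to convert a prediction to pixel corner coordinates and score it against a ground-truth box. The helper names `cxcywh_to_xyxy`, `iou`, and `summarize` are hypothetical and not part of the `lapvqa` package, and the metric definitions follow the common convention (mean IoU, and the fraction of samples with IoU at or above a threshold). The exact evaluation protocol behind the numbers above is not described here, so treat this as an approximation.

```python
def cxcywh_to_xyxy(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height
    # The sigmoid bounds each of the four values, but the corners can still
    # fall outside the image, so clamp them.
    return (
        max(0.0, x1),
        max(0.0, y1),
        min(float(width), x2),
        min(float(height), y2),
    )


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def summarize(ious, threshold=0.25):
    """Return (mean IoU, fraction of samples with IoU >= threshold)."""
    if not ious:
        return 0.0, 0.0
    mean_iou = sum(ious) / len(ious)
    acc = sum(1 for v in ious if v >= threshold) / len(ious)
    return mean_iou, acc


# Example with made-up boxes on a 200 x 100 image (illustrative values only).
pred = cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.5), width=200, height=100)
gt = (50.0, 25.0, 150.0, 75.0)
scores = [iou(pred, gt), iou((0, 0, 10, 10), (5, 0, 15, 10))]
mean_iou, acc_at_025 = summarize(scores, threshold=0.25)
```

For a batch, `pred_boxes.tolist()` yields one `(cx, cy, w, h)` row per sample that can be passed to `cxcywh_to_xyxy` along with that image's width and height.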