dmusingu/lapvqa-pg

	---
	tags:
	- chest-xray
	- radiology
	- phrase-grounding
	- mimic-cxr
	license: apache-2.0
	---

	# LAPVQA — Phrase Grounding

	Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

	## Description

	TransVG-style phrase grounding heads trained on MIMIC-CXR, predicting the bounding
	box of a described abnormality given the chest X-ray and a text phrase.
	Each checkpoint is a dict: `{state_dict, vis_dim, txt_dim, d_model, num_layers, encoder, epoch, val_miou, val_acc50}`.
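
As a quick sanity check before building a head, the metadata stored next to the weights can be inspected. The helper below is a minimal sketch: `describe_checkpoint` and `EXPECTED_KEYS` are hypothetical names introduced here, and the key list is simply the one given in the description.

```python
# Keys listed in the description of the checkpoint dict.
EXPECTED_KEYS = {
    "state_dict", "vis_dim", "txt_dim", "d_model",
    "num_layers", "encoder", "epoch", "val_miou", "val_acc50",
}


def describe_checkpoint(ckpt: dict) -> dict:
    """Return the checkpoint metadata (everything except the weights).

    Raises KeyError if any of the documented keys is absent.
    """
    missing = EXPECTED_KEYS - set(ckpt)
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")
    return {k: v for k, v in ckpt.items() if k != "state_dict"}


# Usage, assuming one of the checkpoint files has been downloaded locally:
#   import torch
#   ckpt = torch.load("mae-vit-l16.pt", map_location="cpu")
#   print(describe_checkpoint(ckpt))
```

The returned dict carries the constructor arguments (`vis_dim`, `txt_dim`, `d_model`, `num_layers`) together with the training metadata, so it can be logged or compared across encoders without touching the tensors.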

	## Architecture — `VisualGroundingHead`

	```
	vis_proj : Linear(vis_dim → 256)
	txt_proj : Linear(txt_dim → 256)
	reg_token : Parameter [1, 1, 256]
	sequence : [REG | vis_tokens | txt_token]
	transformer: 3 × TransformerEncoderLayer (self-attn, pre-norm)
	box_head : MLP(256 → 256 → 4) # sigmoid → (cx,cy,w,h) ∈ [0,1]
	```
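
The layout above can be sketched in PyTorch roughly as follows. This is an illustrative reconstruction from the summary, not the code in `lapvqa.pg.heads`: the class name `VisualGroundingHeadSketch`, the number of attention heads, the feed-forward width, the ReLU in the box MLP, and the absence of positional embeddings are all assumptions, and its parameter names are not expected to match the released `state_dict`.

```python
import torch
import torch.nn as nn


class VisualGroundingHeadSketch(nn.Module):
    """Illustrative TransVG-style head; hyperparameters not in the card are guesses."""

    def __init__(self, vis_dim, txt_dim, d_model=256, num_layers=3, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        # Learnable [REG] token whose output state is decoded into a box.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=num_heads,               # assumption: not stated in the card
            dim_feedforward=4 * d_model,   # assumption: not stated in the card
            batch_first=True,
            norm_first=True,               # pre-norm, as in the summary
        )
        self.transformer = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False
        )
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),                     # assumption: activation not stated
            nn.Linear(d_model, 4),
        )

    def forward(self, vis_tokens, txt_vec):
        # vis_tokens: [B, HW, vis_dim], txt_vec: [B, txt_dim]
        batch = vis_tokens.shape[0]
        reg = self.reg_token.expand(batch, -1, -1)      # [B, 1, d_model]
        vis = self.vis_proj(vis_tokens)                 # [B, HW, d_model]
        txt = self.txt_proj(txt_vec).unsqueeze(1)       # [B, 1, d_model]
        seq = torch.cat([reg, vis, txt], dim=1)         # [REG | vis | txt]
        out = self.transformer(seq)
        # Decode the [REG] position; sigmoid keeps (cx, cy, w, h) in [0, 1].
        return torch.sigmoid(self.box_head(out[:, 0]))
```

Reading the box from a dedicated token, rather than pooling over patches, lets self-attention decide which visual and text positions contribute to the regression.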

	## Results (MIMIC-CXR test set)

	Zero-shot: mIoU ≈ 0.082–0.089 across all encoders.

	Fine-tuned (MAE-ViT-L/16): mIoU 0.320, Acc@0.25 0.569, Pointing Acc 0.593.
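
For readers who want to reproduce numbers of this kind, the sketch below shows one common way to compute the three metrics from normalized `(cx, cy, w, h)` boxes. The evaluation code behind the reported figures is not part of this card, so the definitions here are assumptions: Acc@0.25 is taken as the fraction of samples with IoU ≥ 0.25, and pointing accuracy as the fraction whose predicted box center falls inside the ground-truth box. All function names are hypothetical.

```python
def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def box_iou(pred, gt):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    px1, py1, px2, py2 = cxcywh_to_xyxy(pred)
    gx1, gy1, gx2, gy2 = cxcywh_to_xyxy(gt)
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def center_in_box(pred, gt):
    """True if the predicted box center lies inside the ground-truth box."""
    gx1, gy1, gx2, gy2 = cxcywh_to_xyxy(gt)
    return gx1 <= pred[0] <= gx2 and gy1 <= pred[1] <= gy2


def grounding_metrics(preds, gts, iou_thresh=0.25):
    """Mean IoU, Acc@iou_thresh and pointing accuracy over paired boxes."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    return {
        "miou": sum(ious) / n,
        "acc": sum(i >= iou_thresh for i in ious) / n,
        "pointing_acc": sum(center_in_box(p, g) for p, g in zip(preds, gts)) / n,
    }
```

Pointing accuracy is more forgiving than IoU-based accuracy because it only asks whether the prediction is centered on the finding, not whether its extent is right.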

	Available checkpoints and the feature dimensions each head expects:

	| File | Encoder | vis_dim | txt_dim |
	|---|---|---|---|
	| `clip-vit-l14.pt` | CLIP ViT-L/14 | 1024 | 768 |
	| `siglip.pt` | SigLIP | 1152 | 1152 |
	| `florence2.pt` | Florence-2 | 1024 | 768 |
	| `coca.pt` | CoCa | 768 | 768 |
	| `owlv2.pt` | OWLv2 | 1024 | 768 |
	| `mae-vit-l16.pt` | MAE ViT-L/16 | 1024 | 768 |

	## Loading

	```python
	import torch
	from lapvqa.pg.heads import VisualGroundingHead

	ckpt = torch.load("mae-vit-l16.pt", map_location="cpu")
	head = VisualGroundingHead(
	    vis_dim=ckpt["vis_dim"],
	    txt_dim=ckpt["txt_dim"],
	    d_model=ckpt["d_model"],
	    num_layers=ckpt["num_layers"],
	)
	head.load_state_dict(ckpt["state_dict"])
	head.eval()

	with torch.no_grad():
	    # vis_tokens: [B, HW, vis_dim], spatial patch tokens from the frozen encoder
	    # txt_vec: [B, txt_dim], pooled text representation from the frozen encoder
	    pred_boxes = head(vis_tokens, txt_vec)  # [B, 4], (cx, cy, w, h) in [0, 1]
	```
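
Because the head returns normalized center-format boxes, drawing them on an image usually needs a conversion to pixel corners. The helper below is a minimal sketch of that step. `boxes_to_pixels` is a hypothetical name, and it assumes the boxes are normalized against the full image width and height, which this card does not state explicitly.

```python
import torch


def boxes_to_pixels(boxes: torch.Tensor, width: int, height: int) -> torch.Tensor:
    """Map [..., 4] boxes from (cx, cy, w, h) in [0, 1] to (x1, y1, x2, y2) in pixels."""
    cx, cy, w, h = boxes.unbind(dim=-1)
    xyxy = torch.stack(
        [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1
    )
    scale = boxes.new_tensor([width, height, width, height])
    # Clip to the image, since a center near the border can push corners outside.
    return (xyxy * scale).clamp(min=0).minimum(scale)


# Usage with the output of the loading snippet (the image size is an example value):
#   pixel_boxes = boxes_to_pixels(pred_boxes, width=1024, height=1024)
```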