
	---
	tags:
	- chest-xray
	- radiology
	- visual-question-answering
	- mimic-cxr
	license: apache-2.0
	---

	# LAPVQA — VQA (Frozen Off-the-shelf Encoders)

	Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

	## Description

	Lightweight task heads for Visual Question Answering on MIMIC-Diff-VQA,
	trained on top of five frozen off-the-shelf vision encoders.
	Each `.pt` file contains only the task head weights; load the encoder separately.
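
As an illustration of loading an encoder separately, the sketch below pulls token features from a CLIP ViT-L/14 vision tower with Hugging Face `transformers`. The checkpoint id `openai/clip-vit-large-patch14`, the helper name `extract_patch_tokens`, and the use of `last_hidden_state` (class token included) are all assumptions. This card does not state which checkpoint or feature layer the heads were trained on, so check the training code before relying on it.

```python
# Hypothetical helper, not part of the lapvqa package.
ENCODER_ID = "openai/clip-vit-large-patch14"  # assumed checkpoint id
VIS_DIM = 1024  # hidden size of CLIP ViT-L/14, matching the head's vis_dim


def extract_patch_tokens(images, encoder_id=ENCODER_ID):
    """Return [B, N, 1024] token features for a list of PIL images."""
    # Heavy imports stay local so this module can be imported without torch.
    import torch
    from transformers import CLIPImageProcessor, CLIPVisionModel

    processor = CLIPImageProcessor.from_pretrained(encoder_id)
    encoder = CLIPVisionModel.from_pretrained(encoder_id).eval()

    pixel_values = processor(images=images, return_tensors="pt")["pixel_values"]
    with torch.no_grad():  # the encoder stays frozen
        out = encoder(pixel_values=pixel_values)
    # last_hidden_state holds the class token followed by the patch tokens
    return out.last_hidden_state
```

The returned tensor can be passed as `vis_tokens` to the head. If the heads were trained without the class token, slice it off with `[:, 1:, :]`.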

	## Architecture — `VQAHead`

	```
	vis_proj : Linear(vis_dim → 512)
	tok_emb : Embedding(50257, 512) # GPT-2 vocab, weight-tied with lm_head
	pos_emb : Embedding(150, 512)
	decoder : 6 × TransformerDecoderLayer (pre-norm, cross-attn to visual tokens)
	lm_head : Linear(512 → 50257, bias=False)
	```

	| File | Encoder | vis_dim |
	|---|---|---|
	| `clip-vit-l14_best.pt` | CLIP ViT-L/14 | 1024 |
	| `siglip_best.pt` | SigLIP ViT-SO400M-14-384 | 1152 |
	| `florence2_best.pt` | Florence-2 | 1024 |
	| `coca_best.pt` | CoCa | 768 |
	| `owlv2_best.pt` | OWLv2 | 1024 |
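
The file-to-`vis_dim` mapping above is easy to get wrong when switching checkpoints, so a small lookup can help. The sketch below also counts parameters for the layers whose shapes this card fully specifies. The names `VIS_DIM_BY_FILE` and `fixed_shape_params` are hypothetical, and the bias term on `vis_proj` is an assumption (it is the `nn.Linear` default). The decoder is left out because the head count and feed-forward width are not stated here.

```python
# vis_dim per checkpoint file, copied from the table above
VIS_DIM_BY_FILE = {
    "clip-vit-l14_best.pt": 1024,
    "siglip_best.pt": 1152,
    "florence2_best.pt": 1024,
    "coca_best.pt": 768,
    "owlv2_best.pt": 1024,
}

D_MODEL = 512
VOCAB = 50257  # GPT-2 vocabulary size
MAX_POS = 150  # rows in pos_emb


def fixed_shape_params(vis_dim, bias=True):
    """Count parameters of the layers whose shapes the card specifies."""
    vis_proj = vis_dim * D_MODEL + (D_MODEL if bias else 0)
    tok_emb = VOCAB * D_MODEL  # shared with lm_head through weight tying
    pos_emb = MAX_POS * D_MODEL
    return {"vis_proj": vis_proj, "tok_emb": tok_emb, "pos_emb": pos_emb}


for name, dim in VIS_DIM_BY_FILE.items():
    counts = fixed_shape_params(dim)
    print(f"{name}: vis_dim={dim}, params={sum(counts.values()):,}")
```

Because `lm_head` is tied to `tok_emb`, the 50257 × 512 matrix is counted once.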

	## Results (test set, overall)

	| Encoder | BLEU-1 | BLEU-4 | ROUGE-L | RadGraph-s |
	|---|---|---|---|---|
	| CLIP ViT-L/14 | 0.602 | 0.243 | 0.725 | 0.222 |
	| SigLIP | 0.586 | 0.253 | 0.717 | 0.214 |
	| Florence-2 | 0.575 | 0.207 | 0.700 | 0.217 |
	| CoCa | 0.532 | 0.173 | 0.642 | 0.170 |
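
For readers unfamiliar with the BLEU-1 column, the sketch below computes a sentence-level BLEU-1 score as clipped unigram precision multiplied by a brevity penalty. It is not the evaluation code behind the table. Whitespace tokenisation, lowercasing, and a single reference per answer are assumptions, and published scores are normally aggregated at corpus level.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate token is credited at most as often as it occurs in the reference.
    matches = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = matches / len(cand)
    # The brevity penalty lowers the score of candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


print(bleu1("no pleural effusion is seen", "no pleural effusion"))  # 0.6
```

In the example, three of the five candidate tokens match the reference and the candidate is longer than the reference, so no brevity penalty applies.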

	## Loading

	```python
	import torch
	import tiktoken
	from lapvqa.vqa.model import VQAHead

	# checkpoint is a plain state dict
	ckpt = torch.load("clip-vit-l14_best.pt", map_location="cpu")
	head = VQAHead(vis_dim=1024)
	head.load_state_dict(ckpt)
	head.eval()

	# Inputs the caller must prepare before this point:
	#   vis_tokens: [B, N, vis_dim], patch tokens from the frozen encoder
	#   prompt_ids: [B, Q], tokenised question (GPT-2 tokeniser)
	enc = tiktoken.get_encoding("gpt2")
	# GPT-2 has one special token, <|endoftext|>, used here as both BOS and EOS
	bos_id, eos_id = enc.eot_token, enc.eot_token

	answers = head.generate(
	    vis_tokens=vis_tokens,
	    prompt_ids=prompt_ids,
	    bos_id=bos_id,
	    eos_id=eos_id,
	    max_new_tokens=64,
	)
	# Decode each generated id sequence back to text
	decoded = [enc.decode(ids) for ids in answers]
	```
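
Two details of the call above are worth checking in practice. `bos_id` and `eos_id` are the same id, and `pos_emb` has only 150 rows. The sketch below shows one way to strip BOS/EOS ids before decoding and to check the length budget up front. Both helpers are hypothetical. Whether `generate` returns BOS/EOS ids at all, and whether the prompt, BOS, and answer share one positional table, are assumptions this card does not confirm.

```python
EOT_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary
MAX_POS = 150   # rows in pos_emb, from the architecture listing


def trim_at_eos(ids, eos_id=EOT_ID):
    """Drop a leading BOS (same id as EOS here) and everything from the first EOS on."""
    ids = list(ids)
    if ids and ids[0] == eos_id:
        ids = ids[1:]
    if eos_id in ids:
        ids = ids[: ids.index(eos_id)]
    return ids


def fits_position_table(prompt_len, max_new_tokens, max_pos=MAX_POS):
    """True if prompt, BOS, and generated tokens fit in the positional table."""
    return prompt_len + 1 + max_new_tokens <= max_pos


print(trim_at_eos([50256, 2949, 13, 50256, 50256]))  # [2949, 13]
print(fits_position_table(prompt_len=40, max_new_tokens=64))  # True
```

With `max_new_tokens = 64`, this budget leaves room for prompts of up to 85 tokens under the stated assumption.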