---
tags:
- chest-xray
- radiology
- visual-question-answering
- mimic-cxr
license: apache-2.0
---
# LAPVQA — VQA (Frozen Off-the-shelf Encoders)
Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).
## Description
Lightweight task heads for **Visual Question Answering** on MIMIC-Diff-VQA,
trained on top of five **frozen** off-the-shelf vision encoders.
Each `.pt` file contains only the task head weights; load the encoder separately.
## Architecture — `VQAHead`
```
vis_proj : Linear(vis_dim → 512)
tok_emb : Embedding(50257, 512) # GPT-2 vocab, weight-tied with lm_head
pos_emb : Embedding(150, 512)
decoder : 6 × TransformerDecoderLayer (pre-norm, cross-attn to visual tokens)
lm_head : Linear(512 → 50257, bias=False)
```
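
The layout above can be approximated in plain PyTorch. What follows is a minimal sketch, not the `lapvqa` implementation. The class name `VQAHeadSketch`, the `forward` signature, the head count (`n_heads=8`), the feed-forward width (PyTorch's default) and the absence of a final layer norm are all assumptions, so its parameter names will not necessarily match the released state dicts. It is meant only to show how the visual projection, the tied embeddings and the causal pre-norm decoder fit together.

```python
import torch
import torch.nn as nn


class VQAHeadSketch(nn.Module):
    """Illustrative re-creation of the listed layout; NOT the lapvqa code."""

    def __init__(self, vis_dim, d_model=512, vocab_size=50257, max_len=150,
                 n_layers=6, n_heads=8):  # n_heads is an assumption
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads,
            batch_first=True, norm_first=True,  # pre-norm variant
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output projection shares the token embedding matrix.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, vis_tokens, input_ids):
        # vis_tokens: [B, N, vis_dim]; input_ids: [B, T] with T <= max_len
        memory = self.vis_proj(vis_tokens)
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        # Boolean mask: True marks future positions a token may not attend to.
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool,
                       device=input_ids.device),
            diagonal=1,
        )
        hidden = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(hidden)  # [B, T, vocab_size]
```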
| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 | 1024 |
| `siglip_best.pt` | SigLIP ViT-SO400M-14-384 | 1152 |
| `florence2_best.pt` | Florence-2 | 1024 |
| `coca_best.pt` | CoCa | 768 |
| `owlv2_best.pt` | OWLv2 | 1024 |
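
Because `vis_dim` has to match the encoder a checkpoint was trained with, the table can be kept as a small lookup. This is a sketch: the helper name `vis_dim_for` is hypothetical, and the values are copied from the table above.

```python
import os

# vis_dim for each checkpoint file, copied from the table above.
VIS_DIM_BY_CHECKPOINT = {
    "clip-vit-l14_best.pt": 1024,
    "siglip_best.pt": 1152,
    "florence2_best.pt": 1024,
    "coca_best.pt": 768,
    "owlv2_best.pt": 1024,
}


def vis_dim_for(checkpoint_path):
    """Return the encoder feature width for a checkpoint path or file name."""
    name = os.path.basename(checkpoint_path)
    if name not in VIS_DIM_BY_CHECKPOINT:
        raise ValueError(f"unknown checkpoint: {name!r}")
    return VIS_DIM_BY_CHECKPOINT[name]
```

The result can then be passed straight to the head constructor, for example `VQAHead(vis_dim=vis_dim_for("siglip_best.pt"))`.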
## Results (test set, overall)
| Encoder | BLEU-1 | BLEU-4 | ROUGE-L | RadGraph-s |
|---|---|---|---|---|
| CLIP ViT-L/14 | 0.602 | 0.243 | 0.725 | 0.222 |
| SigLIP | 0.586 | 0.253 | 0.717 | 0.214 |
| Florence-2 | 0.575 | 0.207 | 0.700 | 0.217 |
| CoCa | 0.532 | 0.173 | 0.642 | 0.170 |
## Loading
```python
import torch
import tiktoken
from lapvqa.vqa.model import VQAHead
# checkpoint is a plain state dict, so weights_only=True is sufficient
ckpt = torch.load("clip-vit-l14_best.pt", map_location="cpu", weights_only=True)
head = VQAHead(vis_dim=1024)
head.load_state_dict(ckpt)
head.eval()
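# Inputs to prepare before calling generate():
#   vis_tokens: [B, N, vis_dim] patch tokens from the frozen encoder
#   prompt_ids: [B, Q] tokenised question (GPT-2 tokeniser)
enc = tiktoken.get_encoding("gpt2")
# GPT-2 has a single special token, <|endoftext|>, used here as both BOS and EOS
bos_id, eos_id = enc.eot_token, enc.eot_token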
answers = head.generate(
    vis_tokens=vis_tokens,
    prompt_ids=prompt_ids,
    bos_id=bos_id,
    eos_id=eos_id,
    max_new_tokens=64,
)
decoded = [enc.decode(ids) for ids in answers]
```
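
This card does not document how `generate` decodes. As a rough mental model of what `eos_id` and `max_new_tokens` control, here is a greedy decoding loop with per-sequence early stopping. It is a sketch under assumptions: `greedy_decode` and `next_logits` are hypothetical names, `next_logits` stands in for one forward pass of the head, and the real method may sample, use beam search, or place `bos_id` differently.

```python
import torch


@torch.no_grad()
def greedy_decode(next_logits, prompt_ids, eos_id, max_new_tokens=64):
    """Greedy decoding; returns one list of new token ids per batch row.

    next_logits: hypothetical callable mapping ids [B, T] to next-token
                 logits [B, V].
    Each returned list stops before that row's first eos_id.
    """
    ids = prompt_ids.clone()
    batch = ids.size(0)
    outputs = [[] for _ in range(batch)]
    finished = torch.zeros(batch, dtype=torch.bool, device=ids.device)
    for _ in range(max_new_tokens):
        next_id = next_logits(ids).argmax(dim=-1)  # [B]
        for b in range(batch):
            if finished[b]:
                continue  # this row already emitted EOS
            token = next_id[b].item()
            if token == eos_id:
                finished[b] = True
            else:
                outputs[b].append(token)
        if finished.all():
            break
        ids = torch.cat([ids, next_id.unsqueeze(1)], dim=1)
    return outputs
```

Under this reading, `max_new_tokens=64` caps the answer length for rows that never emit the end-of-text token, and the returned lists can be passed directly to `enc.decode`.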