---
tags:
- chest-xray
- radiology
- visual-question-answering
- mimic-cxr
license: apache-2.0
---

# LAPVQA — VQA (Frozen Off-the-shelf Encoders)

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

Lightweight task heads for **Visual Question Answering** on MIMIC-Diff-VQA, trained on top of five **frozen** off-the-shelf vision encoders. Each `.pt` file contains only the task head weights; load the encoder separately.

## Architecture — `VQAHead`

```
vis_proj : Linear(vis_dim → 512)
tok_emb  : Embedding(50257, 512)   # GPT-2 vocab, weight-tied with lm_head
pos_emb  : Embedding(150, 512)
decoder  : 6 × TransformerDecoderLayer (pre-norm, cross-attn to visual tokens)
lm_head  : Linear(512 → 50257, bias=False)
```

| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 | 1024 |
| `siglip_best.pt` | SigLIP ViT-SO400M-14-384 | 1152 |
| `florence2_best.pt` | Florence-2 | 1024 |
| `coca_best.pt` | CoCa | 768 |
| `owlv2_best.pt` | OWLv2 | 1024 |

## Results (test set, overall)

| Encoder | BLEU-1 | BLEU-4 | ROUGE-L | RadGraph-s |
|---|---|---|---|---|
| CLIP ViT-L/14 | 0.602 | 0.243 | 0.725 | 0.222 |
| SigLIP | 0.586 | 0.253 | 0.717 | 0.214 |
| Florence-2 | 0.575 | 0.207 | 0.700 | 0.217 |
| CoCa | 0.532 | 0.173 | 0.642 | 0.170 |

## Loading

```python
import torch
import tiktoken
from lapvqa.vqa.model import VQAHead

# checkpoint is a plain state dict
ckpt = torch.load("clip-vit-l14_best.pt", map_location="cpu")
head = VQAHead(vis_dim=1024)
head.load_state_dict(ckpt)
head.eval()

# vis_tokens: [B, N, vis_dim] — patch tokens from the frozen encoder
# prompt_ids: [B, Q]          — tokenised question (GPT-2 tokeniser)
enc = tiktoken.get_encoding("gpt2")
bos_id, eos_id = enc.eot_token, enc.eot_token

answers = head.generate(
    vis_tokens     = vis_tokens,
    prompt_ids     = prompt_ids,
    bos_id         = bos_id,
    eos_id         = eos_id,
    max_new_tokens = 64,
)
decoded = [enc.decode(ids) for ids in answers]
```
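
Each head's `vis_proj` is sized to its encoder, so the `vis_dim` passed to `VQAHead` has to match the checkpoint being loaded. The following is a minimal sketch of a lookup built from the file table above. `HEAD_SPECS`, `vis_dim_for`, and `check_vis_tokens_shape` are hypothetical helpers written for illustration; they are not part of the `lapvqa` package.

```python
from pathlib import Path

# Checkpoint file name -> (encoder name, vis_dim), copied from the file table.
# HEAD_SPECS is a hypothetical name used only in this sketch.
HEAD_SPECS = {
    "clip-vit-l14_best.pt": ("CLIP ViT-L/14", 1024),
    "siglip_best.pt": ("SigLIP ViT-SO400M-14-384", 1152),
    "florence2_best.pt": ("Florence-2", 1024),
    "coca_best.pt": ("CoCa", 768),
    "owlv2_best.pt": ("OWLv2", 1024),
}


def vis_dim_for(checkpoint_path):
    """Return the vis_dim listed for a checkpoint, looked up by file name."""
    name = Path(checkpoint_path).name
    if name not in HEAD_SPECS:
        known = ", ".join(sorted(HEAD_SPECS))
        raise KeyError(f"unknown checkpoint {name!r}; expected one of: {known}")
    return HEAD_SPECS[name][1]


def check_vis_tokens_shape(shape, checkpoint_path):
    """Raise ValueError unless shape looks like [B, N, vis_dim] for this head."""
    expected = vis_dim_for(checkpoint_path)
    if len(shape) != 3 or shape[-1] != expected:
        raise ValueError(
            f"expected visual tokens shaped [B, N, {expected}], got {tuple(shape)}"
        )
```

With these helpers, `VQAHead(vis_dim=vis_dim_for(path))` keeps the constructor argument in sync with the file name, and `check_vis_tokens_shape(tuple(vis_tokens.shape), path)` can flag a mismatched encoder before `generate` is called.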