dmusingu/lapvqa-vqa

@@ -13,31 +13,61 @@ Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapv
 ## Description
-Lightweight task heads for **closed-domain Visual Question Answering** on MIMIC-Diff-VQA,
 trained on top of five **frozen** off-the-shelf vision encoders.
-Each `.pt` file contains the task head weights for one encoder variant;
-the underlying encoder weights are not included and must be loaded separately.
-## Setup
-The head takes the frozen encoder's patch tokens as input and is trained with cross-entropy
-over answer vocabulary. The encoder is kept frozen throughout training.
-## Results (test set, overall BLEU-4)
-| Encoder (frozen) | BLEU-1 | BLEU-4 | ROUGE-L | RadGraph-s |
 |---|---|---|---|---|
 | CLIP ViT-L/14 | 0.602 | 0.243 | 0.725 | 0.222 |
 | SigLIP | 0.586 | 0.253 | 0.717 | 0.214 |
 | Florence-2 | 0.575 | 0.207 | 0.700 | 0.217 |
 | CoCa | 0.532 | 0.173 | 0.642 | 0.170 |
-## Files
-| File | Encoder backbone |
-|---|---|
-| `clip-vit-l14_best.pt` | CLIP ViT-L/14 |
-| `siglip_best.pt` | SigLIP (ViT-SO400M-14-384) |
-| `florence2_best.pt` | Florence-2 |
-| `coca_best.pt` | CoCa |
-| `owlv2_best.pt` | OWLv2 |

 ## Description
+Lightweight task heads for **Visual Question Answering** on MIMIC-Diff-VQA,
 trained on top of five **frozen** off-the-shelf vision encoders.
+Each `.pt` file contains only the task head weights; load the encoder separately.
+## Architecture — `VQAHead`
+```
+vis_proj   : Linear(vis_dim → 512)
+tok_emb    : Embedding(50257, 512)   # GPT-2 vocab, weight-tied with lm_head
+pos_emb    : Embedding(150, 512)
+decoder    : 6 × TransformerDecoderLayer (pre-norm, cross-attn to visual tokens)
+lm_head    : Linear(512 → 50257, bias=False)
+```
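To make the layer listing above concrete, here is a minimal PyTorch sketch of a head with the same shape. It is not the author's `VQAHead` from `lapvqa.vqa.model`: the class name `VQAHeadSketch`, the head count (`n_heads=8`), the feed-forward width (`4 * d_model`), and the `forward` signature are assumptions chosen for illustration, and its state-dict keys are not expected to match the released checkpoints.

```python
import torch
import torch.nn as nn


class VQAHeadSketch(nn.Module):
    """Illustrative decoder head; hyperparameters not listed in the card are assumed."""

    def __init__(self, vis_dim, d_model=512, vocab_size=50257,
                 max_len=150, n_layers=6, n_heads=8):
        super().__init__()
        # Project encoder patch tokens to the decoder width.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Pre-norm decoder layers with cross-attention to the visual tokens.
        layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=n_heads,                 # assumption: not stated in the card
            dim_feedforward=4 * d_model,   # assumption: not stated in the card
            batch_first=True,
            norm_first=True,
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output projection shares the embedding matrix.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, vis_tokens, token_ids):
        # vis_tokens: [B, N, vis_dim], token_ids: [B, T] with T <= max_len
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        memory = self.vis_proj(vis_tokens)
        # Causal mask: -inf above the diagonal blocks attention to later tokens.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=token_ids.device),
            diagonal=1,
        )
        hidden = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(hidden)  # [B, T, vocab_size]
```

Instantiating it with a small vocabulary, for example `VQAHeadSketch(vis_dim=16, d_model=32, vocab_size=100, max_len=10, n_layers=2, n_heads=4)`, is enough to check tensor shapes without allocating the full GPT-2 embedding table.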
+| File | Encoder | vis_dim |
+|---|---|---|
+| `clip-vit-l14_best.pt` | CLIP ViT-L/14 | 1024 |
+| `siglip_best.pt` | SigLIP ViT-SO400M-14-384 | 1152 |
+| `florence2_best.pt` | Florence-2 | 1024 |
+| `coca_best.pt` | CoCa | 768 |
+| `owlv2_best.pt` | OWLv2 | 1024 |
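Because `vis_dim` has to match the checkpoint being loaded, it can help to keep the table above as a lookup. The snippet below is a small sketch of that idea; the names `VIS_DIM_BY_CHECKPOINT` and `vis_dim_for` are hypothetical helpers, not part of the `lapvqa` package.

```python
from pathlib import Path

# vis_dim per checkpoint file, copied from the table above.
VIS_DIM_BY_CHECKPOINT = {
    "clip-vit-l14_best.pt": 1024,
    "siglip_best.pt": 1152,
    "florence2_best.pt": 1024,
    "coca_best.pt": 768,
    "owlv2_best.pt": 1024,
}


def vis_dim_for(checkpoint_path):
    """Return the encoder feature width expected by a given checkpoint file."""
    name = Path(checkpoint_path).name  # ignore any directory prefix
    try:
        return VIS_DIM_BY_CHECKPOINT[name]
    except KeyError:
        known = ", ".join(sorted(VIS_DIM_BY_CHECKPOINT))
        raise ValueError(f"Unknown checkpoint {name!r}; expected one of: {known}") from None
```

With this in place, a head could be built as `VQAHead(vis_dim=vis_dim_for(path))` instead of hard-coding the width next to each filename.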
+## Results (test set, overall)
+| Encoder | BLEU-1 | BLEU-4 | ROUGE-L | RadGraph-s |
 |---|---|---|---|---|
 | CLIP ViT-L/14 | 0.602 | 0.243 | 0.725 | 0.222 |
 | SigLIP | 0.586 | 0.253 | 0.717 | 0.214 |
 | Florence-2 | 0.575 | 0.207 | 0.700 | 0.217 |
 | CoCa | 0.532 | 0.173 | 0.642 | 0.170 |
+## Loading
+```python
+import torch
+import tiktoken
+from lapvqa.vqa.model import VQAHead
+# The checkpoint is a plain state dict, so it can be loaded with weights_only=True.
+ckpt = torch.load("clip-vit-l14_best.pt", map_location="cpu", weights_only=True)
+head = VQAHead(vis_dim=1024)
+head.load_state_dict(ckpt)
+head.eval()
+# Inputs the caller must supply (not defined in this snippet):
+# vis_tokens: [B, N, vis_dim], patch tokens from the frozen encoder
+# prompt_ids: [B, Q], tokenised question (GPT-2 tokeniser)
+enc = tiktoken.get_encoding("gpt2")
+# GPT-2 has a single special token, <|endoftext|>, used here as both BOS and EOS.
+bos_id, eos_id = enc.eot_token, enc.eot_token
+answers = head.generate(
+    vis_tokens=vis_tokens,
+    prompt_ids=prompt_ids,
+    bos_id=bos_id,
+    eos_id=eos_id,
+    max_new_tokens=64,
+)
+decoded = [enc.decode(ids) for ids in answers]
+```