dmusingu/lapvqa-vqa-native

@@ -13,16 +13,32 @@ Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapv
 ## Description
-VQA task heads trained with **end-to-end fine-tuning** — the encoder weights are
-updated jointly with the task head, providing a baseline for how much improvement
-domain adaptation yields over the frozen-encoder setup in [`lapvqa-vqa`](https://huggingface.co/dmusingu/lapvqa-vqa).
-## Files
-| File | Encoder backbone |
-|---|---|
-| `clip-vit-l14_best.pt` | CLIP ViT-L/14 (fine-tuned) |
-| `siglip_best.pt` | SigLIP (fine-tuned) |
-| `florence2_best.pt` | Florence-2 (fine-tuned) |
-| `coca_best.pt` | CoCa (fine-tuned) |
-| `mae-vit-l16_best.pt` | MAE ViT-L/16 (fine-tuned) |

+VQA task heads trained with **end-to-end fine-tuning**: the encoder and the head are updated jointly.
+These checkpoints provide a baseline for comparison with the frozen-encoder variant,
+[`lapvqa-vqa`](https://huggingface.co/dmusingu/lapvqa-vqa).
+Each `.pt` file is a plain state dict of `VQAHead`.
+| File | Encoder | vis_dim |
+|---|---|---|
+| `clip-vit-l14_best.pt` | CLIP ViT-L/14 (fine-tuned) | 1024 |
+| `siglip_best.pt` | SigLIP (fine-tuned) | 1152 |
+| `florence2_best.pt` | Florence-2 (fine-tuned) | 1024 |
+| `coca_best.pt` | CoCa (fine-tuned) | 768 |
+| `mae-vit-l16_best.pt` | MAE ViT-L/16 (fine-tuned) | 1024 |
+## Loading
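+```python
+import torch
+from lapvqa.vqa.model import VQAHead
+
+# Visual feature dimension expected by VQAHead for each encoder (see the table above).
+VIS_DIMS = {
+    "clip-vit-l14": 1024, "siglip": 1152,
+    "florence2": 1024, "coca": 768, "mae-vit-l16": 1024,
+}
+
+encoder = "siglip"
+# Each checkpoint is a plain state dict, so weights_only=True is sufficient.
+ckpt = torch.load(f"{encoder}_best.pt", map_location="cpu", weights_only=True)
+head = VQAHead(vis_dim=VIS_DIMS[encoder])
+head.load_state_dict(ckpt)
+head.eval()  # switch to inference mode
+```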
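
The encoder lookup in the loading snippet could be wrapped in a small helper that fails early on an unknown encoder name. The following is a minimal sketch and not part of the repository: `checkpoint_spec` is a hypothetical name, and the filename pattern and dimensions are copied from the table in the diff above.

```python
# Hypothetical helper (not part of lapvqa): maps an encoder name to the
# checkpoint filename and vis_dim listed in the README table.
VIS_DIMS = {
    "clip-vit-l14": 1024,
    "siglip": 1152,
    "florence2": 1024,
    "coca": 768,
    "mae-vit-l16": 1024,
}


def checkpoint_spec(encoder: str) -> tuple[str, int]:
    """Return (filename, vis_dim) for a known encoder, else raise ValueError."""
    if encoder not in VIS_DIMS:
        known = ", ".join(sorted(VIS_DIMS))
        raise ValueError(f"unknown encoder {encoder!r}; expected one of: {known}")
    return f"{encoder}_best.pt", VIS_DIMS[encoder]


if __name__ == "__main__":
    filename, vis_dim = checkpoint_spec("siglip")
    print(filename, vis_dim)
```

The returned filename can be passed to `torch.load` and the dimension to `VQAHead(vis_dim=...)`, as in the loading snippet.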