---
tags:
- chest-xray
- radiology
- visual-question-answering
- mimic-cxr
license: apache-2.0
---

# LAPVQA — VQA (Native / End-to-end)

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

This repository contains VQA task heads trained with **end-to-end fine-tuning** (encoder and head optimized jointly).
They serve as a baseline for comparison with the frozen-encoder variant
[`lapvqa-vqa`](https://huggingface.co/dmusingu/lapvqa-vqa).
Each `.pt` file is a plain state dict of `VQAHead`.

| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 (fine-tuned) | 1024 |
| `siglip_best.pt` | SigLIP (fine-tuned) | 1152 |
| `florence2_best.pt` | Florence-2 (fine-tuned) | 1024 |
| `coca_best.pt` | CoCa (fine-tuned) | 768 |
| `mae-vit-l16_best.pt` | MAE ViT-L/16 (fine-tuned) | 1024 |

## Loading

```python
import torch
from lapvqa.vqa.model import VQAHead

VIS_DIMS = {
    "clip-vit-l14": 1024, "siglip": 1152,
    "florence2": 1024, "coca": 768, "mae-vit-l16": 1024,
}
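encoder = "siglip"  # any key of VIS_DIMS
# The checkpoint is a plain state dict, so weights_only=True (PyTorch >= 1.13)
# is enough and avoids unpickling arbitrary objects.
ckpt = torch.load(f"{encoder}_best.pt", map_location="cpu", weights_only=True)
head = VQAHead(vis_dim=VIS_DIMS[encoder])  # vis_dim must match the encoder
head.load_state_dict(ckpt)
head.eval()  # switch to evaluation mode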
```
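
A mistyped encoder name in the snippet above surfaces only as a `KeyError` or a missing-file error. The sketch below shows one way to resolve the checkpoint path and `vis_dim` together and fail early with a readable message. `checkpoint_spec` and its `root` parameter are hypothetical names introduced here for illustration and are not part of the `lapvqa` package; the file-name pattern and dimensions are copied from the table above.

```python
from pathlib import Path

# Encoder name -> visual feature dimension, copied from the table above.
VIS_DIMS = {
    "clip-vit-l14": 1024,
    "siglip": 1152,
    "florence2": 1024,
    "coca": 768,
    "mae-vit-l16": 1024,
}


def checkpoint_spec(encoder: str, root: str = ".") -> tuple[Path, int]:
    """Return (checkpoint path, vis_dim) for a known encoder name.

    Hypothetical helper, not part of the lapvqa package. It only builds
    the path; it does not check that the file exists on disk.
    """
    if encoder not in VIS_DIMS:
        known = ", ".join(sorted(VIS_DIMS))
        raise ValueError(f"unknown encoder {encoder!r}; expected one of: {known}")
    return Path(root) / f"{encoder}_best.pt", VIS_DIMS[encoder]


# Example: resolve the SigLIP checkpoint in the current directory.
path, vis_dim = checkpoint_spec("siglip")
print(path, vis_dim)
```

The returned path can be passed to `torch.load` and the dimension to `VQAHead(vis_dim=...)`, which keeps the two values from drifting apart.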