Update README with model loading code
README.md
CHANGED
@@ -13,16 +13,32 @@ Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapv
 
 ## Description
 
-VQA task heads trained with **end-to-end fine-tuning**
-
-
-
-
-
-
-|--
-| `
-| `
-| `
-| `
-
+VQA task heads trained with **end-to-end fine-tuning** (encoder + head jointly).
+Provides a baseline for comparison with the frozen-encoder variant
+[`lapvqa-vqa`](https://huggingface.co/dmusingu/lapvqa-vqa).
+Each `.pt` file is a plain state dict of `VQAHead`.
+
+| File | Encoder | vis_dim |
+|---|---|---|
+| `clip-vit-l14_best.pt` | CLIP ViT-L/14 (fine-tuned) | 1024 |
+| `siglip_best.pt` | SigLIP (fine-tuned) | 1152 |
+| `florence2_best.pt` | Florence-2 (fine-tuned) | 1024 |
+| `coca_best.pt` | CoCa (fine-tuned) | 768 |
+| `mae-vit-l16_best.pt` | MAE ViT-L/16 (fine-tuned) | 1024 |
+
+## Loading
+
+```python
+import torch
+from lapvqa.vqa.model import VQAHead
+
+VIS_DIMS = {
+    "clip-vit-l14": 1024, "siglip": 1152,
+    "florence2": 1024, "coca": 768, "mae-vit-l16": 1024,
+}
+encoder = "siglip"
+ckpt = torch.load(f"{encoder}_best.pt", map_location="cpu")
+head = VQAHead(vis_dim=VIS_DIMS[encoder])
+head.load_state_dict(ckpt)
+head.eval()
+```
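
The loading snippet in this change picks the encoder by hand and looks up its `vis_dim` in `VIS_DIMS`. Below is a minimal sketch of deriving `vis_dim` from the checkpoint filename instead, assuming every file follows the `<encoder>_best.pt` naming shown in the table. `vis_dim_for` and `SUFFIX` are hypothetical names introduced for illustration and are not part of the `lapvqa` package.

```python
from pathlib import Path

# vis_dim per encoder, copied from the README table in this change.
VIS_DIMS = {
    "clip-vit-l14": 1024,
    "siglip": 1152,
    "florence2": 1024,
    "coca": 768,
    "mae-vit-l16": 1024,
}

# Assumed naming convention: "<encoder>_best.pt".
SUFFIX = "_best.pt"


def vis_dim_for(checkpoint_path: str) -> int:
    """Return the vis_dim for a checkpoint named '<encoder>_best.pt'.

    Hypothetical helper for illustration; not part of lapvqa.
    """
    name = Path(checkpoint_path).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"expected a '*{SUFFIX}' file, got {name!r}")
    encoder = name[: -len(SUFFIX)]
    try:
        return VIS_DIMS[encoder]
    except KeyError:
        # Fail loudly rather than build a head with the wrong input width.
        raise ValueError(f"unknown encoder {encoder!r}") from None
```

The returned value could then be passed as `VQAHead(vis_dim=vis_dim_for(path))` before calling `load_state_dict`. An unknown filename then raises an error before a head is built with a mismatched dimension.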