---
tags:
- chest-xray
- radiology
- visual-question-answering
- mimic-cxr
license: apache-2.0
---

# LAPVQA — VQA (Native / End-to-end)

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

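This repository contains VQA task heads trained with **end-to-end fine-tuning**: the vision encoder and the head are updated jointly.
They serve as a baseline for comparison with the frozen-encoder variant
[`lapvqa-vqa`](https://huggingface.co/dmusingu/lapvqa-vqa).
Each `.pt` file is a plain state dict for `VQAHead`, and `vis_dim` in the table below is the visual feature dimension to pass when constructing the head.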

| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 (fine-tuned) | 1024 |
| `siglip_best.pt` | SigLIP (fine-tuned) | 1152 |
| `florence2_best.pt` | Florence-2 (fine-tuned) | 1024 |
| `coca_best.pt` | CoCa (fine-tuned) | 768 |
| `mae-vit-l16_best.pt` | MAE ViT-L/16 (fine-tuned) | 1024 |

## Loading

```python
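import torch
from lapvqa.vqa.model import VQAHead

# Visual feature dimension expected by the head, per encoder (see the table above).
VIS_DIMS = {
    "clip-vit-l14": 1024,
    "siglip": 1152,
    "florence2": 1024,
    "coca": 768,
    "mae-vit-l16": 1024,
}

encoder = "siglip"

# Each checkpoint is a plain state dict, so weights_only=True is sufficient
# (PyTorch >= 1.13) and avoids unpickling arbitrary objects.
state_dict = torch.load(f"{encoder}_best.pt", map_location="cpu", weights_only=True)

head = VQAHead(vis_dim=VIS_DIMS[encoder])
head.load_state_dict(state_dict)
head.eval()  # evaluation mode (affects dropout / batch norm, if the head uses them)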
```
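
The loading snippet expects the checkpoint file to already be in the working directory. The following is a minimal sketch of fetching it from the Hub first, assuming `huggingface_hub` is installed. `checkpoint_filename` and `fetch_checkpoint` are hypothetical helper names, not part of `lapvqa`, and `repo_id` is left as a required argument because this card does not state the repository's Hub id.

```python
from pathlib import Path

# Encoder keys from the table above; files follow the pattern "<encoder>_best.pt".
ENCODERS = ("clip-vit-l14", "siglip", "florence2", "coca", "mae-vit-l16")


def checkpoint_filename(encoder: str) -> str:
    """Map an encoder key to its checkpoint filename, rejecting unknown keys."""
    if encoder not in ENCODERS:
        raise ValueError(f"unknown encoder {encoder!r}; expected one of {ENCODERS}")
    return f"{encoder}_best.pt"


def fetch_checkpoint(encoder: str, repo_id: str) -> Path:
    """Download one checkpoint into the local Hugging Face cache and return its path.

    `repo_id` is the Hub id of this repository ("<user>/<name>"); it is an
    assumption of this sketch that the caller supplies it.
    """
    # Imported lazily so the filename helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(repo_id=repo_id, filename=checkpoint_filename(encoder))
    return Path(local_path)
```

The returned path can be passed to `torch.load` in place of the relative filename used above.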