---
tags:
- chest-xray
- radiology
- visual-question-answering
- mimic-cxr
license: apache-2.0
---

# LAPVQA: VQA (Native / End-to-end)

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

VQA task heads trained with **end-to-end fine-tuning**: the encoder and the head are updated jointly.
These checkpoints serve as a baseline for comparison with the frozen-encoder variant,
[`lapvqa-vqa`](https://huggingface.co/dmusingu/lapvqa-vqa).
Each `.pt` file is a plain state dict of `VQAHead`.

| File | Encoder | `vis_dim` |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 (fine-tuned) | 1024 |
| `siglip_best.pt` | SigLIP (fine-tuned) | 1152 |
| `florence2_best.pt` | Florence-2 (fine-tuned) | 1024 |
| `coca_best.pt` | CoCa (fine-tuned) | 768 |
| `mae-vit-l16_best.pt` | MAE ViT-L/16 (fine-tuned) | 1024 |

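The file names in the table follow the pattern `<encoder>_best.pt`, so a checkpoint can be fetched by encoder key. The sketch below assumes the files are hosted in a Hugging Face Hub repository and uses `hf_hub_download` from `huggingface_hub`. The helper names `checkpoint_filename` and `download_checkpoint` are hypothetical, and this card does not state its own repository id, so the caller must supply it.

```python
# Encoder keys taken from the checkpoint file names listed in the table.
ENCODERS = ("clip-vit-l14", "siglip", "florence2", "coca", "mae-vit-l16")


def checkpoint_filename(encoder: str) -> str:
    """Map an encoder key to its checkpoint file name."""
    if encoder not in ENCODERS:
        raise ValueError(f"unknown encoder {encoder!r}; expected one of {ENCODERS}")
    return f"{encoder}_best.pt"


def download_checkpoint(repo_id: str, encoder: str) -> str:
    """Download one checkpoint from the Hub and return its local path.

    `repo_id` is the id of the repository hosting these files. It is not given
    in the card, so it is a required argument rather than a default.
    """
    # Imported lazily so checkpoint_filename works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=checkpoint_filename(encoder))
```

The returned path can be passed to `torch.load` in place of a local file name.
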
## Loading

```python
import torch
from lapvqa.vqa.model import VQAHead

# Visual feature dimension expected by the head for each encoder.
VIS_DIMS = {
    "clip-vit-l14": 1024,
    "siglip": 1152,
    "florence2": 1024,
    "coca": 768,
    "mae-vit-l16": 1024,
}

encoder = "siglip"

# Each checkpoint is a plain state dict, so it loads directly into the head.
ckpt = torch.load(f"{encoder}_best.pt", map_location="cpu")
head = VQAHead(vis_dim=VIS_DIMS[encoder])
head.load_state_dict(ckpt)
head.eval()  # inference mode
```

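Because each file is a plain state dict, its contents can be inspected before a `VQAHead` is constructed. The following is a minimal sketch, not part of `lapvqa`: `summarize_state_dict` and `load_summary` are hypothetical helper names, and passing `weights_only=True` to `torch.load` assumes a PyTorch version that supports that argument and a checkpoint that contains only tensors.

```python
from typing import Any, List, Mapping, Tuple


def summarize_state_dict(state: Mapping[str, Any]) -> List[Tuple[str, Tuple[int, ...]]]:
    """Return (parameter name, shape) pairs for every entry in a state dict."""
    return [(name, tuple(tensor.shape)) for name, tensor in state.items()]


def load_summary(path: str) -> List[Tuple[str, Tuple[int, ...]]]:
    """Load a checkpoint on CPU and summarize its parameter shapes."""
    # Imported lazily so summarize_state_dict has no hard dependency on torch.
    import torch

    # weights_only=True restricts unpickling to tensors and basic containers.
    state = torch.load(path, map_location="cpu", weights_only=True)
    return summarize_state_dict(state)


# Example usage (requires torch and a downloaded checkpoint):
# for name, shape in load_summary("siglip_best.pt"):
#     print(name, shape)
```

Since `load_state_dict` raises an error on shape mismatches, listing the shapes first can help diagnose a checkpoint that was paired with the wrong `vis_dim`.
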