---
tags:
- chest-xray
- radiology
- visual-question-answering
- mimic-cxr
license: apache-2.0
---

# LAPVQA — VQA (Native / End-to-end)

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

VQA task heads trained with **end-to-end fine-tuning** (encoder + head jointly). Provides a baseline for comparison with the frozen-encoder variant [`lapvqa-vqa`](https://huggingface.co/dmusingu/lapvqa-vqa).

Each `.pt` file is a plain state dict of `VQAHead`.

| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 (fine-tuned) | 1024 |
| `siglip_best.pt` | SigLIP (fine-tuned) | 1152 |
| `florence2_best.pt` | Florence-2 (fine-tuned) | 1024 |
| `coca_best.pt` | CoCa (fine-tuned) | 768 |
| `mae-vit-l16_best.pt` | MAE ViT-L/16 (fine-tuned) | 1024 |

## Loading

```python
import torch
from lapvqa.vqa.model import VQAHead

VIS_DIMS = {
    "clip-vit-l14": 1024,
    "siglip": 1152,
    "florence2": 1024,
    "coca": 768,
    "mae-vit-l16": 1024,
}

encoder = "siglip"
ckpt = torch.load(f"{encoder}_best.pt", map_location="cpu")
head = VQAHead(vis_dim=VIS_DIMS[encoder])
head.load_state_dict(ckpt)
head.eval()
```
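
Because each checkpoint filename must be paired with the matching `vis_dim`, it can help to resolve both from the encoder name in one step. The following is a minimal sketch, not part of the `lapvqa` package: `checkpoint_spec` and its `root` parameter are hypothetical names introduced here, and the sketch assumes the files follow the `{encoder}_best.pt` naming shown in the table.

```python
from pathlib import Path

# vis_dim per encoder, copied from the table in this card.
VIS_DIMS = {
    "clip-vit-l14": 1024,
    "siglip": 1152,
    "florence2": 1024,
    "coca": 768,
    "mae-vit-l16": 1024,
}


def checkpoint_spec(encoder, root="."):
    """Return (checkpoint path, vis_dim) for a known encoder name.

    Hypothetical helper: `root` is the directory assumed to hold the
    downloaded `.pt` files. Nothing is read from disk here.
    """
    if encoder not in VIS_DIMS:
        known = ", ".join(sorted(VIS_DIMS))
        raise ValueError(f"Unknown encoder {encoder!r}; expected one of: {known}")
    return Path(root) / f"{encoder}_best.pt", VIS_DIMS[encoder]


path, vis_dim = checkpoint_spec("siglip")
print(path.name, vis_dim)
```

The returned path and dimension can then be passed to `torch.load` and `VQAHead(vis_dim=...)` as in the loading snippet, so a mistyped encoder name fails early with a clear message instead of a shape mismatch in `load_state_dict`.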