dmusingu/lapvqa-vqa-native


	---
	tags:
	- chest-xray
	- radiology
	- visual-question-answering
	- mimic-cxr
	license: apache-2.0
	---

	# LAPVQA — VQA (Native / End-to-end)

	Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

	## Description

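	VQA task heads trained end to end: the encoder and the head are fine-tuned jointly.
	They serve as a baseline for comparison with the frozen-encoder variant
	[`lapvqa-vqa`](https://huggingface.co/dmusingu/lapvqa-vqa).
	Each `.pt` file is a plain state dict of `VQAHead`. The table below lists each file,
	the encoder it was trained with, and the `vis_dim` to pass when constructing the head.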

	| File | Encoder | vis_dim |
	|---|---|---|
	| `clip-vit-l14_best.pt` | CLIP ViT-L/14 (fine-tuned) | 1024 |
	| `siglip_best.pt` | SigLIP (fine-tuned) | 1152 |
	| `florence2_best.pt` | Florence-2 (fine-tuned) | 1024 |
	| `coca_best.pt` | CoCa (fine-tuned) | 768 |
	| `mae-vit-l16_best.pt` | MAE ViT-L/16 (fine-tuned) | 1024 |
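
Because each file is a plain state dict, its contents can be inspected without importing `lapvqa`. The sketch below is one possible way to do that. The helper names `summarize_state_dict` and `layers_with_input_dim` are hypothetical, and the second helper assumes the head contains a linear layer that consumes the visual features directly, which this card does not confirm.

```python
from collections import OrderedDict
from typing import Any, List, Mapping


def summarize_state_dict(state_dict: Mapping[str, Any]) -> "OrderedDict[str, tuple]":
    """Map each parameter name to its shape, preserving checkpoint order."""
    summary = OrderedDict()
    for name, tensor in state_dict.items():
        # Works for anything exposing `.shape`, e.g. torch.Tensor.
        summary[name] = tuple(tensor.shape)
    return summary


def layers_with_input_dim(state_dict: Mapping[str, Any], dim: int) -> List[str]:
    """Return names of 2-D weights whose last (input) dimension equals `dim`.

    Assumption: linear weights are stored as (out_features, in_features),
    as torch.nn.Linear does. Whether VQAHead has such a layer is not
    stated in this card.
    """
    return [
        name
        for name, shape in summarize_state_dict(state_dict).items()
        if len(shape) == 2 and shape[-1] == dim
    ]
```

For example, after `sd = torch.load("siglip_best.pt", map_location="cpu")`, calling `layers_with_input_dim(sd, 1152)` would list any weights that match the `vis_dim` given for SigLIP in the table. An empty result does not necessarily mean the checkpoint is wrong; it may only mean the assumption about the head's layout does not hold.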

	## Loading

	```python
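	import torch
	from lapvqa.vqa.model import VQAHead

	# Visual feature width the head expects for each encoder (see the table above).
	VIS_DIMS = {
	    "clip-vit-l14": 1024,
	    "siglip": 1152,
	    "florence2": 1024,
	    "coca": 768,
	    "mae-vit-l16": 1024,
	}

	encoder = "siglip"
	# Each checkpoint is a plain state dict, so weights_only=True is sufficient
	# and avoids unpickling arbitrary objects (requires PyTorch >= 1.13).
	ckpt = torch.load(f"{encoder}_best.pt", map_location="cpu", weights_only=True)
	head = VQAHead(vis_dim=VIS_DIMS[encoder])
	head.load_state_dict(ckpt)
	head.eval()  # switch to inference mode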
	```
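
The loading snippet expects the `.pt` file to be present locally. A minimal sketch for fetching it from the Hub with `huggingface_hub.hf_hub_download` follows. The repository ID is taken from this page. The helper names `checkpoint_filename` and `fetch_checkpoint` are hypothetical, and the sketch assumes the checkpoints sit at the repository root under the file names listed in the table.

```python
REPO_ID = "dmusingu/lapvqa-vqa-native"

# Encoder keys as used in the checkpoint file names on this card.
ENCODERS = ("clip-vit-l14", "siglip", "florence2", "coca", "mae-vit-l16")


def checkpoint_filename(encoder: str) -> str:
    """Build the checkpoint file name for a known encoder key."""
    if encoder not in ENCODERS:
        raise ValueError(f"unknown encoder {encoder!r}; expected one of {ENCODERS}")
    return f"{encoder}_best.pt"


def fetch_checkpoint(encoder: str) -> str:
    """Download the checkpoint (or reuse the local cache) and return its path."""
    # Imported lazily so the name helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_filename(encoder))
```

The returned path can replace the f-string in the snippet above, for example `torch.load(fetch_checkpoint("siglip"), map_location="cpu")`.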