	---
	tags:
	- chest-xray
	- radiology
	- report-generation
	- mimic-cxr
	- vision-encoder
	license: apache-2.0
	---

	# LAPVQA — Pretrain (Captioning)

	Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

	## Description

	A ViT-L/14 encoder with a 6-layer causal decoder, trained from scratch on [MIMIC-CXR](https://physionet.org/content/mimic-cxr)
	to generate full radiology reports from chest X-ray images.
	Unlike the contrastively pretrained variants, the generative objective pushes the encoder
	to retain fine-grained spatial information sufficient for region-level text generation.
	The encoder weights (`encoder_final.pt`) are the strongest feature extractor
	in the LAPVQA downstream tasks.

	## Architecture

	| Component | Detail |
	|---|---|
	| Vision backbone | ViT-L/14, 24-layer, 1024-dim, 16-head, patch 14, 384 px |
	| Captioning decoder | 6-layer causal transformer, 512-dim, GPT-2 vocab (50,257) |
	| Loss | Cross-entropy over report tokens |
	| Training data | MIMIC-CXR (physionet.org/content/mimic-cxr) |
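
The loss row above can be illustrated with a small, framework-free sketch of next-token cross-entropy under teacher forcing. This is not the repository's training code: the function name `next_token_cross_entropy`, the optional `pad_id` argument, and the toy inputs are all assumptions made for illustration, and a real run would use batched tensors rather than Python lists.

```python
import math


def next_token_cross_entropy(logits, token_ids, pad_id=None):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits:    list of length T; logits[t] holds one score per vocabulary
               entry, produced after the decoder has seen tokens 0..t.
    token_ids: list of length T with the report's token ids.
    pad_id:    hypothetical padding id to exclude from the average.
    """
    total, count = 0.0, 0
    # Position t predicts token t+1, so the last position has no target.
    for step_logits, target in zip(logits[:-1], token_ids[1:]):
        if pad_id is not None and target == pad_id:
            continue
        # Stable log-sum-exp: subtract the max before exponentiating.
        peak = max(step_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 4 tokens, report of 3 tokens.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],   # uniform guess for the second token
    [0.0, 0.0, 10.0, 0.0],  # confident guess of id 2 for the third token
    [0.0, 0.0, 0.0, 0.0],   # unused: nothing follows the last token
]
toy_tokens = [1, 3, 2]
loss = next_token_cross_entropy(toy_logits, toy_tokens)
```

The uniform step contributes `log(4)` and the confident, correct step contributes almost nothing, so the mean shows how the objective rewards the decoder for placing probability on the actual next report token.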

	## Downstream Evaluation (frozen encoder + linear probe)

	| Dataset | Mean AUC |
	|---|---|
	| NIH CXR-14 (14-class) | 0.686 |
	| CheXpert-5 (5-class) | 0.808 |

	The captioning-pretrained encoder matches or exceeds the contrastive variants on both
	classification benchmarks, and is the best-performing encoder on DiffVQA when used downstream.
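
The card does not state how "Mean AUC" is aggregated. A common convention, assumed here, is the unweighted (macro) mean of per-class ROC AUCs computed on the linear probe's scores. The sketch below implements that convention with the standard library only; `roc_auc`, `mean_auc`, and the toy data are illustrative names and values, not part of the LAPVQA code or its results.

```python
def roc_auc(labels, scores):
    """ROC AUC for one binary label via the rank-sum (Mann-Whitney U) identity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Tied scores share the average of the ranks they span.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # AUC is undefined when only one class is present
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mean_auc(label_matrix, score_matrix):
    """Unweighted mean of per-class AUCs; rows are images, columns are findings."""
    n_classes = len(label_matrix[0])
    aucs = []
    for c in range(n_classes):
        auc = roc_auc([row[c] for row in label_matrix],
                      [row[c] for row in score_matrix])
        if auc is not None:  # skip classes with no positives or no negatives
            aucs.append(auc)
    return sum(aucs) / len(aucs) if aucs else None


# Toy data: 4 images, 2 findings (binary labels, probe scores in [0, 1]).
toy_labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
toy_scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.6], [0.5, 0.3]]
toy_mean_auc = mean_auc(toy_labels, toy_scores)
```

Under this convention every finding counts equally regardless of prevalence, so rare classes influence the reported mean as much as common ones.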

	## Files

	| File | Description |
	|---|---|
	| `encoder_final.pt` | Vision encoder weights (used as frozen feature extractor downstream) |
	| `model_best.pt` | Full encoder + decoder at best validation loss |

	## Usage

	```python
	import torch
	from lapvqa.pretrain.model import CaptioningModel

	# Load the full encoder + decoder checkpoint (best validation loss).
	ckpt = torch.load("model_best.pt", map_location="cpu")
	model = CaptioningModel()
	model.load_state_dict(ckpt)
	model.eval()  # inference mode: disables dropout

	# To use only the encoder as a feature extractor:
	enc_weights = torch.load("encoder_final.pt", map_location="cpu")
	model.vision_encoder.load_state_dict(enc_weights)
	# with torch.no_grad():
	#     vis_tokens = model.vision_encoder(images)  # [B, 256, 1024]
	```

	## Citation

	If you use these weights, please cite MIMIC-CXR:

	```bibtex
	@article{johnson2019mimic,
	title = {MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports},
	author = {Johnson, Alistair EW and others},
	journal = {Scientific Data},
	volume = {6},
	pages = {317},
	year = {2019}
	}
	```