---
tags:
  - chest-xray
  - radiology
  - report-generation
  - mimic-cxr
  - vision-encoder
license: apache-2.0
---

LAPVQA — Pretrain (Captioning)

Part of the LAPVQA collection.

Description

A ViT-L/14 vision encoder and a 6-layer causal decoder, trained from scratch on MIMIC-CXR to generate full radiology reports from chest X-ray images. Unlike the contrastive pretrain variants, the generative objective pushes the encoder to retain fine-grained spatial information, enough to support region-level text generation. The encoder weights (encoder_final.pt) are the strongest feature extractor among the LAPVQA pretrain variants on the downstream tasks (see Downstream Evaluation).

Architecture

| Component | Detail |
|---|---|
| Vision backbone | ViT-L/14: 24 layers, 1024-dim, 16 heads, patch size 14, 384 px input |
| Captioning decoder | 6-layer causal transformer, 512-dim, GPT-2 vocabulary (50,257 tokens) |
| Loss | Cross-entropy over report tokens |
| Training data | MIMIC-CXR (physionet.org/content/mimic-cxr) |
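The Loss row refers to the standard captioning objective: the decoder reads the report prefix (conditioned on image features) and is scored on each next token. A dependency-free sketch of that objective follows. It is illustrative rather than LAPVQA's training code; the helper names and the use of -100 as the "ignore this position" label are assumptions.

```python
import math
from typing import List, Sequence

# Assumption: -100 marks positions excluded from the loss (e.g. padding),
# mirroring PyTorch's default `ignore_index`; LAPVQA's actual masking is not documented.
IGNORE_INDEX = -100


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized score vector."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def report_cross_entropy(
    logits: Sequence[Sequence[float]],
    tokens: Sequence[int],
    ignore_index: int = IGNORE_INDEX,
) -> float:
    """Mean next-token cross-entropy for one report under teacher forcing.

    logits[t] is the decoder output after reading tokens[0..t] (plus the image),
    so it is scored against the next token, tokens[t + 1].
    """
    if len(logits) != len(tokens) - 1:
        raise ValueError("expected one logit vector per predicted token")
    total, count = 0.0, 0
    for step_logits, target in zip(logits, tokens[1:]):
        if target == ignore_index:
            continue  # masked position contributes nothing to the loss
        total -= log_softmax(step_logits)[target]
        count += 1
    return total / count if count else 0.0


# Toy check: uniform scores over a 4-token vocabulary give log(4), about 1.386.
print(report_cross_entropy([[0.0] * 4, [0.0] * 4], [0, 1, 2]))
```

In PyTorch the same computation is usually a single `torch.nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten(), ignore_index=-100)` call on the shifted `[B, T, V]` logits and `[B, T]` targets.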

Downstream Evaluation (frozen encoder + linear probe)

| Dataset | Mean AUC |
|---|---|
| NIH CXR-14 (14 classes) | 0.686 |
| CheXpert-5 (5 classes) | 0.808 |

The captioning-pretrained encoder matches or exceeds the contrastive variants on both classification benchmarks, and is the best-performing encoder on DiffVQA when used downstream.
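The card does not state how Mean AUC is averaged or how the linear probe is trained. Assuming the usual convention for NIH CXR-14 and CheXpert (an unweighted mean of per-class ROC AUCs computed from the probe's scores), the sketch below computes the metric using the pairwise (Mann-Whitney) definition of AUC. It is a dependency-free illustration; the function names are hypothetical.

```python
from typing import Sequence


def binary_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as P(random positive outscores random negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both labels are present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(scores: Sequence[Sequence[float]], labels: Sequence[Sequence[int]]) -> float:
    """Unweighted (macro) mean of per-class AUCs; inputs are [num_images][num_classes]."""
    num_classes = len(labels[0])
    per_class = [
        binary_auc([row[c] for row in scores], [row[c] for row in labels])
        for c in range(num_classes)
    ]
    return sum(per_class) / num_classes


# Toy example: class 0 is perfectly ranked (AUC 1.0), class 1 is at chance (AUC 0.5).
probe_scores = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.6], [0.1, 0.3]]
true_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(mean_auc(probe_scores, true_labels))  # 0.75
```

For real evaluations, `sklearn.metrics.roc_auc_score(y_true, y_score, average="macro")` computes the same macro-averaged quantity for multilabel indicator arrays.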

Files

| File | Description |
|---|---|
| encoder_final.pt | Vision encoder weights (used as a frozen feature extractor downstream) |
| model_best.pt | Full encoder + decoder at best validation loss |
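model_best.pt also contains encoder weights, since it holds the full model. Assuming it is a flat state dict of CaptioningModel with the encoder registered under the `vision_encoder` attribute (the name used in the Usage snippet), the encoder-only weights could be sliced out by key prefix as sketched below. The helper is hypothetical, and the card does not say how encoder_final.pt was produced; "final" and "best" may well be different training steps, so the two encoders need not be identical.

```python
from typing import Dict, TypeVar

V = TypeVar("V")


def extract_submodule_state(state: Dict[str, V], prefix: str = "vision_encoder.") -> Dict[str, V]:
    """Return the entries under `prefix` with the prefix stripped.

    Hypothetical helper: assumes `state` is a flat state dict whose encoder
    keys start with "vision_encoder." (as they would if CaptioningModel
    registers its encoder under that attribute name).
    """
    return {key[len(prefix):]: value for key, value in state.items() if key.startswith(prefix)}


# Example with torch (not run here):
#   full = torch.load("model_best.pt", map_location="cpu")
#   encoder_state = extract_submodule_state(full)
#   model.vision_encoder.load_state_dict(encoder_state)
```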

Usage

```python
import torch

from lapvqa.pretrain.model import CaptioningModel

# Full captioning model (encoder + decoder) at best validation loss.
ckpt = torch.load("model_best.pt", map_location="cpu")
model = CaptioningModel()
model.load_state_dict(ckpt)
model.eval()

# To use only the encoder as a frozen feature extractor:
enc_weights = torch.load("encoder_final.pt", map_location="cpu")
model.vision_encoder.load_state_dict(enc_weights)
model.vision_encoder.requires_grad_(False)

# with torch.no_grad():
#     vis_tokens = model.vision_encoder(images)  # [B, 256, 1024]
```
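The shape comment in the snippet implies 256 patch tokens per image. With patch size 14 that is a 16 × 16 grid, which corresponds to 224 px input. At the 384 px listed under Architecture, a stride-14 patch embedding would give 27 × 27 = 729 tokens (384 is not a multiple of 14, so the last 6 px per side are dropped). The helper below does this arithmetic. It assumes a standard non-overlapping patch embedding and does not count any class token. Check which resolution your copy of the encoder expects, for example from the size of its positional-embedding tensor, before relying on either figure.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens from a square image under a non-overlapping patch embedding.

    A Conv2d with kernel_size == stride == patch_size and no padding yields
    image_size // patch_size patches per side; leftover border pixels are dropped.
    Any class token is not counted.
    """
    per_side = image_size // patch_size
    return per_side * per_side


print(num_patch_tokens(224, 14))  # 256, matching the [B, 256, 1024] comment
print(num_patch_tokens(384, 14))  # 729
```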

Citation

If you use these weights, please cite MIMIC-CXR:

```bibtex
@article{johnson2019mimic,
  title   = {MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports},
  author  = {Johnson, Alistair EW and others},
  journal = {Scientific Data},
  volume  = {6},
  pages   = {317},
  year    = {2019}
}
```