---
tags:
- chest-xray
- radiology
- report-generation
- mimic-cxr
- vision-encoder
license: apache-2.0
---

# LAPVQA — Pretrain (Captioning)

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

A **ViT-L/14 encoder + 6-layer causal decoder** trained from scratch on [MIMIC-CXR](https://physionet.org/content/mimic-cxr)
to generate full radiology reports from chest X-ray images.
Unlike the contrastive pretrain variants, the generative objective forces the encoder
to retain fine-grained spatial information sufficient for region-level text generation.
The encoder weights (`encoder_final.pt`) serve as the strongest feature extractor
in the LAPVQA downstream tasks.

## Architecture

| Component | Detail |
|---|---|
| Vision backbone | ViT-L/14, 24-layer, 1024-dim, 16-head, patch 14, 384 px |
| Captioning decoder | 6-layer causal transformer, 512-dim, GPT-2 vocab (50 257) |
| Loss | Cross-entropy over report tokens |
| Training data | MIMIC-CXR (physionet.org/content/mimic-cxr) |

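The loss row above can be read as ordinary next-token cross-entropy under teacher forcing. The sketch below is framework-free and only illustrates that idea; it is not the LAPVQA training code. The function name `next_token_cross_entropy`, the one-position target shift, and the `pad_id` masking are all assumptions made for illustration (GPT-2's tokenizer, for instance, defines no padding token by default).

```python
import math


def next_token_cross_entropy(logits, tokens, pad_id):
    """Mean cross-entropy of predicting tokens[t + 1] from logits[t].

    logits: list of T rows, each a list of V unnormalised scores.
    tokens: list of T token ids for one tokenised report.
    pad_id: hypothetical padding id; targets equal to it are skipped.
    """
    total, count = 0.0, 0
    # A causal decoder's output at position t is scored against token t + 1.
    for row, target in zip(logits[:-1], tokens[1:]):
        if target == pad_id:
            continue
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    return total / max(count, 1)


# Toy input: 4 positions, vocabulary of 4, id 0 used as padding.
toy_logits = [[0.0] * 4 for _ in range(4)]
toy_tokens = [1, 2, 3, 0]
loss = next_token_cross_entropy(toy_logits, toy_tokens, pad_id=0)
print(loss)  # uniform scores over 4 tokens give ln(4)
```

In a PyTorch training loop the same computation is usually a single call to `torch.nn.functional.cross_entropy` on shifted logits and targets, with `ignore_index` set to the padding id.
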
## Downstream Evaluation (frozen encoder + linear probe)

| Dataset | Mean AUC |
|---|---|
| NIH CXR-14 (14-class) | 0.686 |
| CheXpert-5 (5-class) | 0.808 |

The captioning-pretrained encoder matches or exceeds the contrastive variants on both
classification benchmarks, and is the best-performing encoder on DiffVQA when used downstream.

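For readers unfamiliar with the metric, the sketch below shows one common reading of "Mean AUC": the unweighted average of per-class ROC AUC over a multi-label problem. Whether the numbers above were computed exactly this way is an assumption. The helper names `binary_auc` and `mean_auc` are hypothetical, and the labels and scores are toy values, not outputs of this model.

```python
def binary_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns None when AUC is undefined
    because only one label value is present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(label_rows, score_rows):
    """Unweighted mean of per-class AUCs; rows are samples, columns are classes."""
    per_class = []
    for c in range(len(label_rows[0])):
        auc = binary_auc([r[c] for r in label_rows], [r[c] for r in score_rows])
        if auc is not None:  # skip classes where AUC is undefined
            per_class.append(auc)
    if not per_class:
        raise ValueError("no class has both positive and negative samples")
    return sum(per_class) / len(per_class)


# Toy multi-label data: 4 samples, 2 classes (illustrative values only).
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.6]]
print(mean_auc(toy_labels, toy_scores))  # class AUCs are 1.0 and 0.75, mean 0.875
```

The pairwise formulation is quadratic in the number of samples, so it suits small illustrations; for full benchmark sets a rank-based implementation such as scikit-learn's `roc_auc_score` is the usual choice.
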
## Files

| File | Description |
|---|---|
| `encoder_final.pt` | Vision encoder weights (used as frozen feature extractor downstream) |
| `model_best.pt` | Full encoder + decoder at best validation loss |

## Usage

```python
import torch
from lapvqa.pretrain.model import CaptioningModel

# Load the full encoder + decoder checkpoint (best validation loss).
ckpt = torch.load("model_best.pt", map_location="cpu")
model = CaptioningModel()
model.load_state_dict(ckpt)
model.eval()  # disable dropout and other training-only behaviour

# To use only the encoder as a feature extractor,
# overwrite its weights with the final encoder checkpoint:
enc_weights = torch.load("encoder_final.pt", map_location="cpu")
model.vision_encoder.load_state_dict(enc_weights)
# vis_tokens = model.vision_encoder(images)  # [B, 256, 1024]
```

## Citation

If you use these weights, please cite MIMIC-CXR:

```bibtex
@article{johnson2019mimic,
  title   = {MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports},
  author  = {Johnson, Alistair EW and others},
  journal = {Scientific Data},
  volume  = {6}, pages = {317}, year = {2019}
}
```