---
tags:
- chest-xray
- radiology
- report-generation
- mimic-cxr
- vision-encoder
license: apache-2.0
---

# LAPVQA — Pretrain (Captioning)

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

A **ViT-L/14 encoder + 6-layer causal decoder** trained from scratch on [MIMIC-CXR](https://physionet.org/content/mimic-cxr)
to generate full radiology reports from chest X-ray images.
Unlike the contrastive objective used by the other pretrain variants, the generative objective
forces the encoder to retain the fine-grained spatial information needed for region-level text generation.
The encoder weights (`encoder_final.pt`) are the strongest feature extractor
across the LAPVQA downstream tasks.

## Architecture

| Component | Detail |
|---|---|
| Vision backbone | ViT-L/14, 24-layer, 1024-dim, 16-head, patch 14, 384 px |
| Captioning decoder | 6-layer causal transformer, 512-dim, GPT-2 vocabulary (50,257 tokens) |
| Loss | Cross-entropy over report tokens |
| Training data | MIMIC-CXR (physionet.org/content/mimic-cxr) |
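
The loss row above can be illustrated with a minimal NumPy sketch of next-token cross-entropy for a single report. This is not the training code from the LAPVQA repository: the function name `causal_lm_loss`, the `pad_id` argument, and the one-position shift between logits and targets are assumptions based on how causal decoders are commonly trained.

```python
import numpy as np

def causal_lm_loss(logits, token_ids, pad_id=None):
    """Mean next-token cross-entropy for one report (hypothetical helper).

    logits:    [T, V] decoder outputs; logits[t] scores the token at position t + 1.
    token_ids: [T] integer report tokens.
    pad_id:    optional token id to exclude from the average (assumption).
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(token_ids)[1:]   # token t + 1 is the target for position t
    preds = logits[:-1]                   # the last position has no target

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = preds - preds.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Negative log-likelihood of each target token.
    nll = -log_probs[np.arange(len(targets)), targets]
    if pad_id is not None:
        nll = nll[targets != pad_id]
    return float(nll.mean())

# Uniform logits over a 4-token vocabulary give a loss of ln(4).
loss = causal_lm_loss(np.zeros((3, 4)), [0, 1, 2])
```

With the GPT-2 vocabulary the same computation runs over 50,257 classes per position; an untrained decoder emitting uniform logits would start near ln(50257) ≈ 10.8.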

## Downstream Evaluation (frozen encoder + linear probe)

| Dataset | Mean AUC |
|---|---|
| NIH CXR-14 (14-class) | 0.686 |
| CheXpert-5 (5-class) | 0.808 |

The captioning-pretrained encoder matches or exceeds the contrastive variants on both
classification benchmarks, and is the best-performing encoder on DiffVQA when used downstream.
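
For reference, a "Mean AUC" figure like those in the table can be computed as the unweighted average of per-class ROC AUCs over the probe's outputs. The sketch below assumes that macro-averaging convention, which the card does not state explicitly; `binary_auc` and `mean_auc` are hypothetical helper names, not part of the LAPVQA code.

```python
import numpy as np

def binary_auc(y_true, y_score):
    """ROC AUC via the Mann-Whitney rank statistic; tied scores share an average rank."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=np.float64)
    n_pos, n_neg = int(y_true.sum()), int((~y_true).sum())
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUC is undefined when only one label value is present

    order = np.argsort(y_score, kind="mergesort")
    sorted_scores = y_score[order]
    ranks = np.empty(len(y_score))
    i = 0
    while i < len(sorted_scores):
        j = i
        while j + 1 < len(sorted_scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1  # average 1-based rank of the tie group
        i = j + 1

    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))

def mean_auc(labels, scores):
    """Macro-average AUC over classes; labels and scores are [N, C] arrays."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    per_class = [binary_auc(labels[:, c], scores[:, c]) for c in range(labels.shape[1])]
    return float(np.nanmean(per_class))  # classes with undefined AUC are skipped

# Toy example: 4 images, 2 classes (illustrative values, not real results).
labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
scores = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.4], [0.4, 0.35]]
result = mean_auc(labels, scores)
```

In the toy example the first class is ranked perfectly (AUC 1.0) and the second has one misordered pair out of four (AUC 0.75), so the mean is 0.875.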

## Files

| File | Description |
|---|---|
| `encoder_final.pt` | Vision encoder weights (used as frozen feature extractor downstream) |
| `model_best.pt` | Full encoder + decoder at best validation loss |

## Usage

```python
import torch
from lapvqa.pretrain.model import CaptioningModel

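# Load the full captioning model (encoder + decoder) from the best-validation checkpoint.
ckpt = torch.load("model_best.pt", map_location="cpu")
model = CaptioningModel()
model.load_state_dict(ckpt)
model.eval()

# To use only the encoder as a feature extractor, load the final encoder weights
# into the vision encoder (this overwrites the encoder weights loaded above):
enc_weights = torch.load("encoder_final.pt", map_location="cpu")
model.vision_encoder.load_state_dict(enc_weights)
# vis_tokens = model.vision_encoder(images)  # [B, 256, 1024]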
```

## Citation

If you use these weights, please cite MIMIC-CXR:

```bibtex
@article{johnson2019mimic,
  title   = {MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports},
  author  = {Johnson, Alistair EW and others},
  journal = {Scientific Data},
  volume  = {6},
  pages   = {317},
  year    = {2019}
}
```