---
tags:
- chest-xray
- radiology
- report-generation
- mimic-cxr
license: apache-2.0
---

# LAPVQA — Radiology Report Generation (Frozen Off-the-shelf Encoders)

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

Autoregressive decoder heads for **Radiology Report Generation** on MIMIC-CXR.
There is one head per encoder, each trained on top of one of five **frozen** off-the-shelf encoders.
Each checkpoint is a dict: `{state_dict, vis_dim, d_model, num_layers, nhead, encoder, epoch, val_bleu4}`.
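
A minimal sketch for sanity-checking a checkpoint after `torch.load`, assuming only the key names listed above. The helpers `missing_checkpoint_keys` and `describe_checkpoint` are hypothetical and not part of the `lapvqa` package; they work on any plain dict, so they can be tried without downloading weights.

```python
# Keys documented in this model card; the helpers below are illustrative only.
EXPECTED_KEYS = {
    "state_dict", "vis_dim", "d_model", "num_layers",
    "nhead", "encoder", "epoch", "val_bleu4",
}


def missing_checkpoint_keys(ckpt: dict) -> list[str]:
    """Return the documented keys that are absent from a loaded checkpoint."""
    return sorted(EXPECTED_KEYS - set(ckpt))


def describe_checkpoint(ckpt: dict) -> str:
    """Summarise the metadata stored next to the weights, skipping the weights."""
    meta = {k: v for k, v in ckpt.items() if k != "state_dict"}
    return ", ".join(f"{k}={meta[k]}" for k in sorted(meta))
```

For a real file, pass the result of `torch.load("siglip.pt", map_location="cpu")` to either helper before building the head.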

## Architecture — `ReportGenerationHead`

```
vis_proj : Linear(vis_dim → 512)
tok_emb  : Embedding(50257, 512)   # GPT-2 vocab, weight-tied with lm_head
pos_emb  : Embedding(150, 512)
decoder  : 6 × TransformerDecoderLayer (pre-norm)
lm_head  : Linear(512 → 50257, bias=False)
```
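
To make the effect of weight tying concrete, the sketch below counts the parameters outside the decoder stack from the shapes listed above. It assumes `vis_proj` has a bias term (the `nn.Linear` default, not confirmed by this card) and leaves out the six decoder layers, whose feed-forward width is not stated here.

```python
VOCAB_SIZE = 50257    # GPT-2 vocabulary size, from the layout above
MAX_POSITIONS = 150   # rows in pos_emb
D_MODEL = 512


def non_decoder_params(vis_dim: int, d_model: int = D_MODEL) -> dict[str, int]:
    """Parameter counts for every listed module except the decoder layers."""
    return {
        "vis_proj": vis_dim * d_model + d_model,  # weight + bias (bias is an assumption)
        "tok_emb": VOCAB_SIZE * d_model,
        "pos_emb": MAX_POSITIONS * d_model,
        "lm_head": 0,  # reuses tok_emb's weight matrix and has no bias
    }


counts = non_decoder_params(vis_dim=1152)  # 1152 is the SigLIP vis_dim
total = sum(counts.values())
print(counts, total)
```

Because `lm_head` shares its matrix with `tok_emb`, the 50257 × 512 vocabulary matrix is stored once rather than twice.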

## Results (MIMIC-CXR test set)

| Encoder | BLEU-4 | ROUGE-L | RadGraph-s |
|---|---|---|---|
| SigLIP | 0.036 | 0.168 | 0.211 |
| Florence-2 | 0.035 | 0.169 | 0.205 |
| CLIP ViT-L/14 | 0.034 | 0.168 | 0.197 |
| OWLv2 | 0.034 | 0.169 | 0.197 |
| CoCa | 0.030 | 0.160 | 0.193 |

## Checkpoints

| File | Encoder | vis_dim |
|---|---|---|
| `siglip.pt` | SigLIP | 1152 |
| `florence2.pt` | Florence-2 | 1024 |
| `clip-vit-l14.pt` | CLIP ViT-L/14 | 1024 |
| `owlv2.pt` | OWLv2 | 1024 |
| `coca.pt` | CoCa | 768 |
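
Each head expects visual tokens whose last dimension matches its `vis_dim`, so pairing a checkpoint with the wrong encoder would fail at `vis_proj`. A small sketch of a pre-flight shape check follows; the dict mirrors the table above, and `check_vis_tokens_shape` is a hypothetical helper, not part of `lapvqa`.

```python
# File name -> vis_dim, copied from the table above.
CHECKPOINT_VIS_DIM = {
    "siglip.pt": 1152,
    "florence2.pt": 1024,
    "clip-vit-l14.pt": 1024,
    "owlv2.pt": 1024,
    "coca.pt": 768,
}


def check_vis_tokens_shape(filename: str, shape: tuple) -> int:
    """Return the expected vis_dim, or raise if shape is not [B, N, vis_dim]."""
    expected = CHECKPOINT_VIS_DIM[filename]
    if len(shape) != 3 or shape[-1] != expected:
        raise ValueError(
            f"{filename} expects tokens shaped [B, N, {expected}], got {tuple(shape)}"
        )
    return expected
```

With PyTorch tensors, call it as `check_vis_tokens_shape("siglip.pt", tuple(vis_tokens.shape))` before `head.generate`.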

## Loading

```python
import torch
import tiktoken
from lapvqa.rrg.heads import ReportGenerationHead

ckpt = torch.load("siglip.pt", map_location="cpu")
head = ReportGenerationHead(
    vis_dim    = ckpt["vis_dim"],
    d_model    = ckpt["d_model"],
    num_layers = ckpt["num_layers"],
    nhead      = ckpt["nhead"],
)
head.load_state_dict(ckpt["state_dict"])
head.eval()

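# GPT-2 has a single special token, <|endoftext|>, used here as both BOS and EOS.
enc = tiktoken.get_encoding("gpt2")
bos_id = eos_id = enc.eot_token

# vis_tokens: [B, N, vis_dim] patch tokens from the frozen encoder (SigLIP here).
# Computing them is not shown; the last dimension must equal ckpt["vis_dim"].
# max_len=150 matches the size of the positional embedding table.
with torch.no_grad():
    token_ids = head.generate(vis_tokens, bos_id=bos_id, eos_id=eos_id, max_len=150)
reports = [enc.decode(ids) for ids in token_ids]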
```