	---
	tags:
	- chest-xray
	- radiology
	- report-generation
	- mimic-cxr
	license: apache-2.0
	---

	# LAPVQA — Radiology Report Generation (Frozen Off-the-shelf Encoders)

	Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

	## Description

	Autoregressive decoder heads for Radiology Report Generation on MIMIC-CXR,
	trained on top of five frozen off-the-shelf encoders.
	Each checkpoint is a dict: `{state_dict, vis_dim, d_model, num_layers, nhead, encoder, epoch, val_bleu4}`.
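
Before rebuilding a head, it can help to confirm that a loaded checkpoint has the layout described above. The following is a minimal sketch, assuming all listed keys are present. `EXPECTED_KEYS`, `describe_checkpoint`, and the example values are hypothetical illustrations and are not part of the LAPVQA code.

```python
EXPECTED_KEYS = {
    "state_dict", "vis_dim", "d_model", "num_layers",
    "nhead", "encoder", "epoch", "val_bleu4",
}


def describe_checkpoint(ckpt):
    """Return the checkpoint's metadata (everything except the weights).

    Raises KeyError if any of the documented keys is missing.
    """
    missing = EXPECTED_KEYS - set(ckpt)
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")
    return {k: v for k, v in ckpt.items() if k != "state_dict"}


# Stand-in dict with made-up values; a real one would come from torch.load(...).
example = {
    "state_dict": {}, "vis_dim": 1152, "d_model": 512, "num_layers": 6,
    "nhead": 8, "encoder": "siglip", "epoch": 0, "val_bleu4": 0.0,
}
meta = describe_checkpoint(example)
```

Printing `meta` for each file gives a quick view of the stored hyperparameters and validation score without touching the weights.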

	## Architecture — `ReportGenerationHead`

	```
	vis_proj : Linear(vis_dim → 512)
	tok_emb : Embedding(50257, 512) # GPT-2 vocab, weight-tied with lm_head
	pos_emb : Embedding(150, 512)
	decoder : 6 × TransformerDecoderLayer (pre-norm)
	lm_head : Linear(512 → 50257, bias=False)
	```
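
As a rough size check, the parameters outside the decoder layers follow directly from the shapes above. The sketch below counts only `vis_proj`, `tok_emb` (shared with `lm_head`, so counted once), and `pos_emb`. The decoder layers are left out because their feed-forward width is not stated in this card. It assumes `vis_proj` carries a bias term, which the card does not specify. `non_decoder_params` is an illustrative helper, not part of the released code.

```python
VOCAB_SIZE = 50257   # GPT-2 vocabulary
D_MODEL = 512
MAX_LEN = 150        # positions covered by pos_emb


def non_decoder_params(vis_dim, vis_proj_bias=True):
    """Count parameters outside the 6 decoder layers."""
    vis_proj = vis_dim * D_MODEL + (D_MODEL if vis_proj_bias else 0)
    tok_emb = VOCAB_SIZE * D_MODEL      # tied with lm_head, so counted once
    pos_emb = MAX_LEN * D_MODEL
    return vis_proj + tok_emb + pos_emb


for name, vis_dim in [("SigLIP", 1152), ("CLIP ViT-L/14", 1024), ("CoCa", 768)]:
    print(f"{name}: {non_decoder_params(vis_dim):,}")
```

The token embedding alone accounts for 50257 × 512 = 25,731,584 of these parameters, so the choice of encoder changes the total only slightly.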

	## Results (MIMIC-CXR test set)

	| Encoder | BLEU-4 | ROUGE-L | RadGraph-s |
	|---|---|---|---|
	| SigLIP | 0.036 | 0.168 | 0.211 |
	| Florence-2 | 0.035 | 0.169 | 0.205 |
	| CLIP ViT-L/14 | 0.034 | 0.168 | 0.197 |
	| OWLv2 | 0.034 | 0.169 | 0.197 |
	| CoCa | 0.030 | 0.160 | 0.193 |

	Each encoder has a separate checkpoint file:

	| File | Encoder | vis_dim |
	|---|---|---|
	| `siglip.pt` | SigLIP | 1152 |
	| `florence2.pt` | Florence-2 | 1024 |
	| `clip-vit-l14.pt` | CLIP ViT-L/14 | 1024 |
	| `owlv2.pt` | OWLv2 | 1024 |
	| `coca.pt` | CoCa | 768 |
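
Because each head projects from a fixed `vis_dim`, features from one encoder generally cannot be fed to a head trained on another. A small guard can catch the obvious mismatches. This is a sketch using the values from the table and assuming the encoder output is laid out as `[B, N, vis_dim]`. `VIS_DIM_BY_FILE` and `check_features` are hypothetical names.

```python
VIS_DIM_BY_FILE = {
    "siglip.pt": 1152,
    "florence2.pt": 1024,
    "clip-vit-l14.pt": 1024,
    "owlv2.pt": 1024,
    "coca.pt": 768,
}


def check_features(filename, shape):
    """Validate a [B, N, vis_dim] feature shape against a checkpoint file."""
    expected = VIS_DIM_BY_FILE[filename]
    if len(shape) != 3:
        raise ValueError(f"expected a 3-D [B, N, vis_dim] shape, got {tuple(shape)}")
    if shape[-1] != expected:
        raise ValueError(f"{filename} expects vis_dim={expected}, got {shape[-1]}")
    return expected


# 729 patch tokens is an arbitrary example value.
check_features("siglip.pt", (2, 729, 1152))
```

A matching dimension is necessary but not sufficient: three of the encoders share `vis_dim = 1024`, so this check cannot tell their features apart.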

	## Loading

	```python
	import torch
	import tiktoken
	from lapvqa.rrg.heads import ReportGenerationHead

	# Load the checkpoint dict and rebuild the head from its stored hyperparameters.
	ckpt = torch.load("siglip.pt", map_location="cpu")
	head = ReportGenerationHead(
	    vis_dim=ckpt["vis_dim"],
	    d_model=ckpt["d_model"],
	    num_layers=ckpt["num_layers"],
	    nhead=ckpt["nhead"],
	)
	head.load_state_dict(ckpt["state_dict"])
	head.eval()

	# GPT-2 tokenizer; <|endoftext|> serves as both BOS and EOS.
	enc = tiktoken.get_encoding("gpt2")
	bos_id = eos_id = enc.eot_token

	# vis_tokens: [B, N, vis_dim] patch tokens from the frozen encoder (not computed here).
	# max_len=150 matches the 150 positions in pos_emb.
	token_ids = head.generate(vis_tokens, bos_id=bos_id, eos_id=eos_id, max_len=150)
	reports = [enc.decode(ids) for ids in token_ids]
	```
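
Since `bos_id` and `eos_id` are the same GPT-2 token here, generated sequences may need trimming before they are decoded. The helper below is a sketch only. It assumes that `generate` returns, per sample, a sequence of integer ids that can start with BOS and can contain EOS followed by filler tokens. The actual return format of `ReportGenerationHead.generate` is not documented in this card, and `trim_ids` is a hypothetical name.

```python
def trim_ids(ids, bos_id, eos_id):
    """Drop a leading BOS and everything from the first EOS after it."""
    ids = [int(i) for i in ids]          # also accepts 1-D tensors or arrays of ids
    if ids and ids[0] == bos_id:
        ids = ids[1:]
    if eos_id in ids:
        ids = ids[: ids.index(eos_id)]
    return ids


EOT = 50256  # id of <|endoftext|> in the GPT-2 vocabulary
# Arbitrary ids standing in for a generated sequence.
sample = [EOT, 464, 21726, 389, 1598, EOT, EOT, EOT]
trimmed = trim_ids(sample, bos_id=EOT, eos_id=EOT)
```

Under those assumptions, the decoding step would become `enc.decode(trim_ids(ids, bos_id, eos_id))` for each sequence in `token_ids`.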