File size: 2,041 Bytes
7c82f29
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
fdbe183
 
 
7c82f29
fdbe183
7c82f29
fdbe183
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7c82f29
fdbe183
 
 
 
 
 
7c82f29
fdbe183
7c82f29
fdbe183
 
 
 
7c82f29
fdbe183
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
---
tags:
- chest-xray
- radiology
- object-detection
- abnormality-detection
- vindr-cxr
license: apache-2.0
---

# LAPVQA — Abnormality Detection

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

DETR-style detection heads for 14-class chest abnormality detection on VinDr-CXR,
trained on top of six **frozen** vision encoders.
Each checkpoint is a dict: `{state_dict, vis_dim, d_model, num_queries, num_enc, num_dec, encoder, epoch, val_map40, val_map50}`.

## Architecture — `DetectionHead`

```
vis_proj        : Linear(vis_dim → 256)
encoder         : 2 × TransformerEncoderLayer (self-attn, pre-norm)
object_queries  : Parameter [1, 20, 256]
decoder         : 3 × TransformerDecoderLayer (cross-attn to encoder output)
class_head      : Linear(256 → 15)   # 14 classes + background
box_head        : MLP(256 → 256 → 4) # (cx,cy,w,h) ∈ [0,1]
```

## Results (VinDr-CXR test, mAP@IoU=0.4)

| Encoder | mAP@0.4 (test) |
|---|---|
| OWLv2 | 0.048 |
| SigLIP | ~0.045 |
| CLIP ViT-L/14 | ~0.040 |

| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14.pt` | CLIP ViT-L/14 | 1024 |
| `siglip.pt` | SigLIP | 1152 |
| `florence2.pt` | Florence-2 | 1024 |
| `coca.pt` | CoCa | 768 |
| `owlv2.pt` | OWLv2 | 1024 |
| `mae-vit-l16.pt` | MAE ViT-L/16 | 1024 |

## Loading

```python
import torch
from lapvqa.ad.heads import DetectionHead
from lapvqa.ad.heads import predict

ckpt = torch.load("owlv2.pt", map_location="cpu")
head = DetectionHead(
    vis_dim     = ckpt["vis_dim"],
    d_model     = ckpt["d_model"],
    num_queries = ckpt["num_queries"],
    num_enc_layers = ckpt["num_enc"],
    num_dec_layers = ckpt["num_dec"],
)
head.load_state_dict(ckpt["state_dict"])
head.eval()

with torch.no_grad():
    # vis_tokens: [B, HW, vis_dim] — spatial patch tokens from the frozen encoder
    outputs    = head(vis_tokens)
    detections = predict(outputs, score_threshold=0.1, nms_iou=0.5)
    # detections[i]: {'boxes': [K,4] xyxy, 'labels': [K], 'scores': [K]}
```