---
license: apache-2.0
language:
  - en
tags:
  - medical-imaging
  - chest-x-ray
  - radiology
  - clip
  - dinov2
  - vision-language-model
  - contrastive-learning
datasets:
  - mimic-cxr
  - chexpert-plus
  - rexgradient
pipeline_tag: zero-shot-image-classification
---

# CLEAR — Vision-Language Backbone

This repository hosts the pretrained vision-language backbone for CLEAR (*Concept-Level Embeddings for Auditable Radiology*). The checkpoint contains a contrastive image–text encoder that maps chest X-rays and radiological text into a shared 768-dimensional embedding space. Given a chest X-ray, the image encoder produces a feature vector whose cosine similarity with each of 368,294 text-encoded radiological observations yields a concept score vector. This concept score vector is then projected through LLM-derived semantic embeddings to produce the final CLEAR image embedding used for zero-shot classification, supervised linear probing, and concept bottleneck models. The concept bank, LLM embeddings, and downstream inference code are available in the [GitHub repository](https://github.com/peterhan91/CLEAR).
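The two-stage mapping described above can be sketched in a few lines of tensor algebra. This is an illustrative sketch, not the repository's actual inference code: shapes are shrunk for readability (the real concept bank has 368,294 entries), and all tensors here are random placeholders standing in for the encoder outputs and precomputed embeddings.

```python
import torch

# Illustrative dimensions (real: n_concepts = 368_294)
n_concepts, clip_dim, llm_dim = 1000, 768, 4096

image_emb = torch.randn(1, clip_dim)                   # image encoder output
concept_text_embs = torch.randn(n_concepts, clip_dim)  # text-encoded concepts
llm_concept_embs = torch.randn(n_concepts, llm_dim)    # LLM-derived semantic embeddings

# Stage 1 — concept scores: cosine similarity between the image and every concept
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
concept_text_embs = concept_text_embs / concept_text_embs.norm(dim=-1, keepdim=True)
concept_scores = image_emb @ concept_text_embs.T       # (1, n_concepts)

# Stage 2 — final CLEAR embedding: project the score vector through the LLM embeddings
clear_emb = concept_scores @ llm_concept_embs          # (1, llm_dim)
```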

## Files in This Repository

| File | Description |
|------|-------------|
| `best_model.pt` | CLEAR vision-language backbone checkpoint (DINOv2 ViT-B/14 + text encoder) |
| `mimic_concepts.csv` | Full concept vocabulary (368,294 radiological observations extracted from reports) |
| `concept_embeddings_368294.pt` | Precomputed SFR-Embedding-Mistral embeddings for all 368,294 concepts (4096-dim) |

## Quick Start

```python
import torch
from PIL import Image
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize, InterpolationMode
from huggingface_hub import hf_hub_download

# Download the checkpoint
ckpt_path = hf_hub_download(repo_id="peterhan91/CLEAR", filename="best_model.pt")

# Clone the CLEAR repo for model code: git clone https://github.com/peterhan91/CLEAR
from clear import tokenize
from examples.train import load_clip

# Load the CLEAR checkpoint (DINOv2 ViT-B/14 + text encoder)
model = load_clip(
    model_path=ckpt_path,
    use_dinov2=True,
    dinov2_model_name="dinov2_vitb14",
)
model.eval()

# Preprocess a chest X-ray
preprocess = Compose([
    Resize(448, interpolation=InterpolationMode.BICUBIC),
    CenterCrop(448),
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
device = next(model.parameters()).device
image = preprocess(Image.open("path/to/cxr.jpg").convert("RGB")).unsqueeze(0).to(device)

# Encode image and text
with torch.no_grad():
    image_features = model.encode_image(image)
    text_tokens = tokenize(["pleural effusion", "no pleural effusion"]).to(device)
    text_features = model.encode_text(text_tokens)

    # Cosine similarity → softmax probability
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = (image_features @ text_features.T).softmax(dim=-1)

print(logits)  # [[prob_positive, prob_negative]]
```

The full CLEAR pipeline projects these concept similarity scores through LLM-derived semantic embeddings (SFR-Embedding-Mistral) for auditable zero-shot classification. See the [GitHub repository](https://github.com/peterhan91/CLEAR) for benchmarking, concept bottleneck models, and model auditing scripts.
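The auditability comes from the fact that concept scores are human-readable: you can rank which concepts drive a given prediction. A toy illustration, with made-up concept names and score values (the real pipeline ranks against the full 368,294-concept bank):

```python
import torch

# Hypothetical concepts and cosine-similarity scores for one image
concepts = ["pleural effusion", "cardiomegaly", "clear lungs", "pneumothorax"]
scores = torch.tensor([0.71, 0.12, -0.33, 0.05])

# Rank the concepts most similar to the image embedding
topk = torch.topk(scores, k=2)
top_concepts = [concepts[i] for i in topk.indices.tolist()]
print(top_concepts)  # ['pleural effusion', 'cardiomegaly']
```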

## Checkpoint Details

| Attribute | Value |
|-----------|-------|
| **Image Encoder** | DINOv2 ViT-B/14 with registers |
| **Text Encoder** | 12-layer Transformer (GPT-2 style, 512-dim, 8 heads) |
| **Embedding Dimension** | 768 |
| **Input Resolution** | 448 × 448 pixels |
| **File** | `best_model.pt` |

## Training Data

The model was trained on **873,342 CXR–report pairs** (239,091 patients) using symmetric contrastive learning (InfoNCE):

- MIMIC-CXR (377,110 pairs)
- CheXpert-Plus (223,228 pairs)
- ReXGradient (273,004 pairs)

Training details: 3 × NVIDIA A6000 GPUs, effective batch size 768, AdamW (lr = 1e-4, weight decay 0.2), cosine annealing with 500 warmup steps, 40 epochs.
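A minimal sketch of the symmetric InfoNCE objective used for training. The actual training code lives in the GitHub repository; the temperature value and batch size below are illustrative, not CLEAR's trained hyperparameters.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # (B, B) similarity matrix; matching pairs sit on the diagonal
    logits = image_feats @ text_feats.T / temperature
    targets = torch.arange(logits.size(0))
    # Average the image-to-text and text-to-image cross-entropy terms
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = symmetric_infonce(torch.randn(8, 768), torch.randn(8, 768))
```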

## Citation

```bibtex
@article{han2026clear,
  title={CLEAR: An Auditable Foundation Model for Radiology Grounded in Clinical Concepts},
  author={Han, Tianyu and Wu, Riga and Tian, Yu and Khader, Firas and Adams, Lisa and Bressem, Keno and Davatzikos, Christos and Kather, Jakob Nikolas and Shen, Li and Mankoff, David and others},
  journal={medRxiv},
  pages={2026--01},
  year={2026},
  publisher={Cold Spring Harbor Laboratory Press}
}
```

## License

Apache-2.0