alyrraza committed on
Commit e4eec8a · verified · 1 Parent(s): 927bd20

Create readme.md

Files changed (1)
  1. README.md +345 -0
README.md ADDED
@@ -0,0 +1,345 @@
+ ---
+ language:
+   - en
+ license: mit
+ tags:
+   - medical
+   - radiology
+   - chest-xray
+   - multimodal
+   - vision-language
+   - error-detection
+   - pytorch
+   - biovil-t
+   - cxr-bert
+   - mimic-cxr
+ datasets:
+   - StanfordAIMI/mimic-cxr-jpg
+ library_name: pytorch
+ pipeline_tag: image-to-text
+ metrics:
+   - f1
+
+ model-index:
+   - name: RadGuard V11
+     results:
+       - task:
+           type: radiology-report-error-detection
+           name: Radiology Report Error Detection
+         dataset:
+           name: MIMIC-CXR
+           type: StanfordAIMI/mimic-cxr-jpg
+           split: validation
+         metrics:
+           - type: f1
+             value: 0.66
+             name: Validation F1
+           - type: f1_weighted
+             value: 0.63
+             name: Validation F1 (weighted)
+ ---
+
+ # RadGuard V11: AI Radiology Report Error Detector
+
+ RadGuard detects errors in AI-generated chest X-ray radiology reports by cross-referencing the report text against the X-ray image itself. Given an X-ray and an AI-generated report, it classifies each condition as **SUPPORTED**, **HALLUCINATED**, **MISSING**, or **INACCURATE**, and computes an overall **ELRRs** (Error-Labelled Radiology Report Score).
+
+ This is the final V11 model from the RadGuard final-year project (FYP) thesis, trained on MIMIC-CXR with a BioViL-T image encoder and a CXR-BERT text encoder coupled via bidirectional cross-attention.
+
+ ---
+
+ ## Model Description
+
+ | Property | Value |
+ |---|---|
+ | **Task** | Radiology report error detection (multimodal classification) |
+ | **Image encoder** | BioViL-T (Microsoft, MIMIC-CXR pretrained) |
+ | **Text encoder** | CXR-BERT / BiomedVLP-BioViL-T tokenizer |
+ | **Fusion** | Bidirectional cross-attention + MLP-Mixer |
+ | **Output** | 14 conditions × 4 error classes + X-ray presence scores |
+ | **Training data** | MIMIC-CXR (74,060 samples) |
+ | **Val F1** | 0.66 |
+ | **Parameters** | ~110 M (including frozen encoders) |
+ | **Input image** | 448 × 448 RGB, ImageNet normalization |
+ | **Max text length** | 128 tokens |
+
+ ---
+
+ ## Architecture
+
+ ```
+   Chest X-Ray (448×448)                AI Report Sentence
+             │                                   │
+     ┌───────▼────────┐               ┌──────────▼──────────┐
+     │    BioViL-T    │               │      CXR-BERT       │
+     │ Image Encoder  │               │    Text Encoder     │
+     │  (MIMIC-CXR)   │               │     (MIMIC-CXR)     │
+     └───────┬────────┘               └──────────┬──────────┘
+             │ [B, 512, 14, 14]                  │ [B, 768]
+             │ 196 spatial patches               │ CLS token + token sequence
+             └──────────────────┬────────────────┘
+                                │
+                ┌───────────────▼──────────────────┐
+                │  Bidirectional Cross-Attention   │
+                │  (14 condition-specific heads)   │
+                │                                  │
+                │  Dir 1: Text CLS → Image patches │  ← WHERE is it in the image?
+                │  Dir 2: Image GAP → Text tokens  │  ← WHAT does the text say?
+                │                                  │
+                │  + Condition Type Embedding (×5) │
+                └───────────────┬──────────────────┘
+                                │
+                ┌───────────────▼──────────────────┐
+                │  MLP-Mixer Fusion                │
+                │  (4 blocks, 512-dim)             │
+                │                                  │
+                │  + CheXbert Label Encoder        │
+                │    (14 AI labels → 64-dim)       │
+                └───────────────┬──────────────────┘
+                                │
+                   ┌────────────▼────────────┐
+                   │  Shared MLP (256-dim)   │
+                   └────────────┬────────────┘
+                                │
+              ┌─────────────────┼─────────────────┐
+              │                                   │
+     ┌────────▼──────────┐             ┌──────────▼────────┐
+     │ Task 1 Heads      │             │ Task 2 Heads      │
+     │ 14 × Linear(256→4)│             │ 14 × Linear(256→1)│
+     │ Error class/cond  │             │ X-ray presence    │
+     └────────┬──────────┘             └──────────┬────────┘
+              │                                   │
+  SUPPORTED / HALLUCINATED                Present / Absent
+    MISSING / INACCURATE                   (per condition)
+ ```
+
+ **Why BioViL-T + CXR-BERT?**
+ Both encoders are jointly pretrained on MIMIC-CXR, the same domain as this task. Their feature spaces are already aligned, so cross-attention is semantically meaningful without a separate contrastive alignment stage. Earlier versions paired DenseNet (ImageNet-pretrained) with ClinicalBERT; their mismatched feature spaces created a performance ceiling.
+
+ **Why bidirectional cross-attention?**
+ Unidirectional attention (text → image only) finds *where* a condition appears but misses cases where the image is ambiguous and the text provides disambiguating context. The reverse direction (image → text) lets the model attend to the specific words describing each condition, catching inaccurate descriptions even when the finding is visually present.
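To make the two attention directions concrete, here is a minimal pure-Python sketch of single-query scaled dot-product attention applied in both directions. It is an illustration under strong simplifying assumptions, not the RadGuard implementation: it omits the learned query/key/value projections and the 14 condition-specific heads, and it assumes image and text vectors already share one dimension (in the real model the 512-dim patches and 768-dim text features would need projecting first). The function names `attend` and `bidirectional_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention.

    Returns (context, weights): the attention-weighted sum of `values`
    and the attention weight assigned to each key.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


def bidirectional_attention(text_cls, text_tokens, image_patches):
    """Run both directions and concatenate the two context vectors."""
    # Direction 1: the text summary queries the image patches (WHERE is it?).
    image_context, patch_weights = attend(text_cls, image_patches, image_patches)
    # Direction 2: the globally average-pooled image queries the text tokens (WHAT is said?).
    n = len(image_patches)
    dim = len(image_patches[0])
    image_gap = [sum(p[i] for p in image_patches) / n for i in range(dim)]
    text_context, token_weights = attend(image_gap, text_tokens, text_tokens)
    return image_context + text_context, patch_weights, token_weights


# Toy example: 3 image patches and 2 text tokens in a 2-dimensional space.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
fused, patch_w, token_w = bidirectional_attention([1.0, 0.0], tokens, patches)
```

In the toy example the text query `[1.0, 0.0]` puts more weight on the patches that share its direction, while the pooled image vector weights both tokens equally.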
+
+ ---
+
+ ## Error Classes
+
+ The model assigns each chest condition one of four verdict classes:
+
+ | Label | Meaning | Clinical Risk |
+ |---|---|---|
+ | `SUPPORTED` | Report correctly describes what is visible on the X-ray | ✅ Safe |
+ | `HALLUCINATED` | Report mentions a finding that is **not** visible on the X-ray | 🔴 High: false positive diagnosis |
+ | `MISSING` | A finding **is** visible on the X-ray but the report omits it | 🟠 High: missed diagnosis |
+ | `INACCURATE` | Finding is present but described incorrectly (wrong severity, location, etc.) | 🟡 Moderate |
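The table above can be read as a small decision rule. The sketch below encodes that reading; it is not the model's labeling code, and the `Verdict` enum, the `expected_verdict` function, and its parameter names are all hypothetical. In particular, `reported_present` is assumed to mean that the report asserts the finding is present, and how correctly negated findings (such as "No pleural effusion is seen") are scored is not specified here, so that case returns `None`.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    SUPPORTED = "SUPPORTED"
    HALLUCINATED = "HALLUCINATED"
    MISSING = "MISSING"
    INACCURATE = "INACCURATE"


def expected_verdict(reported_present: bool, visible: bool,
                     described_correctly: bool = True) -> Optional[Verdict]:
    """Map the three facts about one condition to the verdict in the table."""
    if reported_present and not visible:
        return Verdict.HALLUCINATED   # report claims a finding the X-ray lacks
    if visible and not reported_present:
        return Verdict.MISSING        # X-ray shows a finding the report omits
    if reported_present and visible:
        return Verdict.SUPPORTED if described_correctly else Verdict.INACCURATE
    return None                       # neither reported nor visible: nothing to judge
```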
+
+ ---
+
+ ## 14 Chest Conditions
+
+ ```
+ Enlarged Cardiomediastinum    Cardiomegaly     Lung Opacity
+ Lung Lesion                   Edema            Consolidation
+ Pneumonia                     Atelectasis      Pneumothorax
+ Pleural Effusion              Pleural Other    Fracture
+ Support Devices               No Finding
+ ```
+
+ Conditions are grouped into 5 anatomical/semantic types (encoded as type embeddings):
+ - **Cardiac** (0): Enlarged Cardiomediastinum, Cardiomegaly
+ - **Parenchymal** (1): Lung Opacity, Lung Lesion, Edema, Consolidation, Pneumonia, Atelectasis
+ - **Pleural** (2): Pneumothorax, Pleural Effusion, Pleural Other, Fracture
+ - **Device** (3): Support Devices
+ - **Normal** (4): No Finding
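The grouping above can be written as a lookup table from condition name to type index, which is the kind of index a type-embedding layer would consume. This is a sketch built only from the list above; the variable names are hypothetical, and the ordering of conditions inside the actual model is not stated in this card.

```python
# Type index per group, as listed above.
TYPE_IDS = {"Cardiac": 0, "Parenchymal": 1, "Pleural": 2, "Device": 3, "Normal": 4}

# Conditions per group, as listed above.
CONDITIONS_BY_TYPE = {
    "Cardiac": ["Enlarged Cardiomediastinum", "Cardiomegaly"],
    "Parenchymal": ["Lung Opacity", "Lung Lesion", "Edema",
                    "Consolidation", "Pneumonia", "Atelectasis"],
    "Pleural": ["Pneumothorax", "Pleural Effusion", "Pleural Other", "Fracture"],
    "Device": ["Support Devices"],
    "Normal": ["No Finding"],
}

# Flattened lookup: condition name -> type index.
TYPE_ID_BY_CONDITION = {
    condition: TYPE_IDS[type_name]
    for type_name, conditions in CONDITIONS_BY_TYPE.items()
    for condition in conditions
}
```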
+
+ ---
+
+ ## ELRRs Score
+
+ The model outputs an **ELRRs** (Error-Labelled Radiology Report Score) inspired by [Yu et al. 2023 (RadCliQ)](https://doi.org/10.1016/j.patter.2023.100802):
+
+ ```
+ ELRRs = (Σ weights) / N_active × 100
+
+ Weights: SUPPORTED=+1.0, INACCURATE=−0.3, MISSING=−0.5, HALLUCINATED=−0.7
+ ```
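As a worked example of the formula, here is a minimal sketch that computes the score from a list of per-condition verdicts. It assumes that `N_active` is simply the number of conditions that received a verdict and that the result is not clamped; the repository's implementation may differ on both points, and the function name `elrrs_score` is hypothetical.

```python
# Weights taken from the formula above.
ELRRS_WEIGHTS = {
    "SUPPORTED": 1.0,
    "INACCURATE": -0.3,
    "MISSING": -0.5,
    "HALLUCINATED": -0.7,
}


def elrrs_score(verdicts):
    """Average the verdict weights and scale to a percentage."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    total = sum(ELRRS_WEIGHTS[v] for v in verdicts)
    return total / len(verdicts) * 100


# Three supported findings and one hallucinated finding:
# (1.0 + 1.0 + 1.0 - 0.7) / 4 * 100, which is about 57.5.
score = elrrs_score(["SUPPORTED", "SUPPORTED", "SUPPORTED", "HALLUCINATED"])
```

Under these assumptions the score ranges from -70 (every condition hallucinated) to 100 (every condition supported).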
+
+ | Score | Grade | Description |
+ |---|---|---|
+ | ≥ 80 | Excellent | Clinically safe; minimal errors |
+ | ≥ 60 | Good | Minor errors; clinically acceptable |
+ | ≥ 40 | Fair | Moderate errors; review advised |
+ | ≥ 20 | Poor | Significant errors; high risk |
+ | < 20 | Critical | Severe errors; unsafe for clinical use |
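The grade bands can be expressed as a threshold lookup evaluated from the highest band down. This is a sketch of the table only; the function name `elrrs_grade` is hypothetical, and scores below zero are assumed to fall into the lowest band.

```python
# (lower bound, grade) pairs from the table, highest band first.
GRADE_BANDS = [(80, "Excellent"), (60, "Good"), (40, "Fair"), (20, "Poor")]


def elrrs_grade(score):
    """Return the grade whose lower bound the score meets first."""
    for lower_bound, grade in GRADE_BANDS:
        if score >= lower_bound:
            return grade
    return "Critical"  # anything below 20, including negative scores
```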
+
+ ---
+
+ ## Training Details
+
+ | Parameter | Value |
+ |---|---|
+ | **Dataset** | MIMIC-CXR (PhysioNet, v2.0.0) |
+ | **Train samples** | ~67,000 |
+ | **Val samples** | ~7,060 |
+ | **Total** | 74,060 |
+ | **Optimizer** | AdamW |
+ | **Scheduler** | Cosine annealing with warmup |
+ | **Image augmentation** | RandomHorizontalFlip, RandomAffine, ColorJitter |
+ | **Dropout** | 0.4 |
+ | **Batch size** | 16 |
+ | **Mixed precision** | AMP (fp16) |
+ | **Hardware** | NVIDIA A100 (Vast.ai) |
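For readers unfamiliar with the scheduler row, the sketch below shows one common form of cosine annealing with linear warmup. The table does not give the learning rate, warmup length, or step count, so every numeric value here is a placeholder, and the function `lr_at_step` is hypothetical rather than the training script's code.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first step is small but non-zero.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Placeholder values for illustration only.
schedule = [lr_at_step(s, total_steps=1000, warmup_steps=100, base_lr=1e-4)
            for s in range(1000)]
```

The rate peaks at the end of warmup and then falls smoothly, which tends to stabilise the early updates of newly initialised fusion layers sitting on top of pretrained encoders.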
+
+ ### Training Evolution (V2 → V11)
+
+ | Version | Val F1 | Key Change |
+ |---|---|---|
+ | V2 | 0.31 | Baseline: DenseNet + ClinicalBERT |
+ | V3 | 0.38 | Added CheXbert labels |
+ | V4 | 0.41 | Cross-attention introduced |
+ | V5 | 0.44 | Pseudo-label generation |
+ | V6 | 0.48 | Bidirectional cross-attention |
+ | V7 | 0.51 | Type embeddings |
+ | V8 | 0.55 | MLP-Mixer fusion |
+ | V9 | 0.58 | Dataset expansion + cleaning |
+ | V10 | 0.61 | BioViL-T + CXR-BERT encoders |
+ | **V11** | **0.66** | Hyperparameter tuning + augmentation |
+
+ ---
+
+ ## How to Use
+
+ ### Requirements
+
+ ```bash
+ pip install torch torchvision transformers hi-ml-multimodal pillow
+ ```
+
+ ### Load and Run Inference
+
+ ```python
+ import torch
+ from PIL import Image
+ from torchvision import transforms
+
+ # 1. Load the model weights
+ model_path = "best_model_v11.pth"  # downloaded from this repo
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+ # 2. The full inference pipeline is in RadGuard-AI-Engine
+ #    Clone: https://github.com/alyrraza/RadGuard-Medical-AI
+ #    Then:
+ from inference.model import get_model, get_tokenizer, run_inference_on_sentence
+ from inference.pipeline import run_full_pipeline
+
+ # 3. Run inference
+ image = Image.open("chest_xray.jpg").convert("RGB")
+ ai_report = "The heart is mildly enlarged. No pleural effusion is seen. Lungs are clear."
+
+ result = run_full_pipeline(image, ai_report)
+
+ print(f"ELRRs Score: {result['elrrs']['score']} - {result['elrrs']['grade']}")
+ for cond in result['conditions']:
+     print(f"  {cond['name']}: {cond['verdict']} ({cond['confidence']:.0%})")
+ ```
+
+ ### REST API (Docker)
+
+ ```bash
+ # Pull and run the full stack
+ git clone https://github.com/alyrraza/RadGuard-Medical-AI
+ cd RadGuard-Medical-AI
+
+ # Set model path and start
+ MODEL_PATH=/path/to/best_model_v11.pth docker-compose up
+
+ # Call the API
+ curl -X POST http://localhost:8000/analyze \
+   -F "file=@chest_xray.jpg" \
+   -F "ai_report=The heart is mildly enlarged. Lungs are clear."
+ ```
+
+ ### API Response Schema
+
+ ```json
+ {
+   "task1_elrrs": {
+     "score": 71.4,
+     "grade": "Good",
+     "supported_count": 5,
+     "hallucinated_count": 1,
+     "missing_count": 0,
+     "inaccurate_count": 1
+   },
+   "task1_conditions": [
+     {
+       "name": "Cardiomegaly",
+       "verdict": "SUPPORTED",
+       "confidence": 0.87,
+       "meaning": "AI report is correct - X-ray confirms it",
+       "source_text": "The heart is mildly enlarged.",
+       "xray_present": true
+     }
+   ],
+   "task2_xray_findings": { "Cardiomegaly": { "xray_present": true, "confidence": 0.91 } },
+   "task3_heatmaps": { "Cardiomegaly": "http://.../results/abc_Cardiomegaly.png" },
+   "not_mentioned": ["Pneumothorax", "Fracture"],
+   "sentences_analyzed": 3
+ }
+ ```
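A client will usually want to surface only the conditions that need review. The sketch below parses a response shaped like the schema above and lists every non-`SUPPORTED` verdict. It relies only on the `task1_conditions`, `name`, `verdict`, and `confidence` keys shown in the schema; the helper name `flagged_conditions` and the sample values (including the `Edema` entry) are made up for illustration.

```python
import json


def flagged_conditions(response):
    """Return (name, verdict, confidence) for every condition that is not SUPPORTED."""
    return [
        (cond["name"], cond["verdict"], cond["confidence"])
        for cond in response["task1_conditions"]
        if cond["verdict"] != "SUPPORTED"
    ]


# Illustrative payload with made-up values, following the schema above.
sample = json.loads("""
{
  "task1_conditions": [
    {"name": "Cardiomegaly", "verdict": "SUPPORTED", "confidence": 0.87},
    {"name": "Edema", "verdict": "HALLUCINATED", "confidence": 0.74}
  ]
}
""")

flagged = flagged_conditions(sample)
for name, verdict, confidence in flagged:
    print(f"{name}: {verdict} ({confidence:.0%})")
```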
+
+ ---
+
+ ## Limitations
+
+ - Trained exclusively on **MIMIC-CXR** (adult patients, US hospital system). Performance may degrade on pediatric, non-PA view, or non-US population X-rays.
+ - Runs on **individual sentences**; inter-sentence context is not modeled.
+ - CheXbert label extraction (used as auxiliary input) requires a separate model and adds latency. A keyword fallback is included but reduces accuracy.
+ - **Not validated for clinical deployment.** This is a research/thesis prototype.
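To illustrate what sentence-level processing implies, here is a naive splitter that breaks a report on sentence-ending punctuation. It is an assumption about the general approach, not the repository's splitter: the function name `split_sentences` is hypothetical, and this regex would wrongly split on abbreviations such as "approx. 3 cm".

```python
import re
from typing import List


def split_sentences(report: str) -> List[str]:
    """Split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", report.strip())
    return [part for part in parts if part]


sentences = split_sentences(
    "The heart is mildly enlarged. No pleural effusion is seen. Lungs are clear."
)
```

Because each sentence is scored on its own, a statement that depends on an earlier one (for example "It has increased since the prior study.") reaches the model without its referent.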
+
+ ---
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @misc{raza2025radguard,
+   title  = {RadGuard: Detecting Errors in AI-Generated Radiology Reports
+             via Bidirectional Cross-Modal Attention},
+   author = {Raza, Ali},
+   year   = {2025},
+   note   = {Final Year Project, Department of Computer Science,
+             National University of Computer and Emerging Sciences (FAST-NUCES)},
+   url    = {https://github.com/alyrraza/RadGuard-Medical-AI}
+ }
+ ```
+
+ This work builds on:
+
+ ```bibtex
+ @article{yu2023evaluating,
+   title   = {Evaluating progress in automatic chest X-ray radiology report generation},
+   author  = {Yu, Feiyang and others},
+   journal = {Patterns},
+   volume  = {4},
+   number  = {9},
+   year    = {2023},
+   doi     = {10.1016/j.patter.2023.100802}
+ }
+
+ @inproceedings{bannur2023learning,
+   title     = {Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing},
+   author    = {Bannur, Shruthi and others},
+   booktitle = {CVPR},
+   year      = {2023}
+ }
+ ```
+
+ ---
+
+ ## License
+
+ MIT License. Model weights are derived from MIMIC-CXR data; usage requires a valid [PhysioNet credentialed account](https://physionet.org/settings/credentialing/) and agreement to the MIMIC-CXR data use agreement.
+
+ ---
+
+ *⚕️ Medical Disclaimer: This model is a research prototype and has not been validated for clinical use. Do not use for diagnostic decisions.*