+---
+language:
+  - en
+license: mit
+tags:
+  - medical
+  - radiology
+  - chest-xray
+  - multimodal
+  - vision-language
+  - error-detection
+  - pytorch
+  - biovil-t
+  - cxr-bert
+  - mimic-cxr
+datasets:
+  - StanfordAIMI/mimic-cxr-jpg
+library_name: pytorch
+pipeline_tag: image-to-text
+metrics:
+  - f1
+model-index:
+  - name: RadGuard V11
+    results:
+      - task:
+          type: radiology-report-error-detection
+          name: Radiology Report Error Detection
+        dataset:
+          name: MIMIC-CXR
+          type: StanfordAIMI/mimic-cxr-jpg
+          split: validation
+        metrics:
+          - type: f1
+            value: 0.66
+            name: Validation F1
+          - type: f1_weighted
+            value: 0.63
+            name: Validation F1 (weighted)
+---
+# RadGuard V11 — AI Radiology Report Error Detector
+RadGuard detects errors in AI-generated chest X-ray radiology reports by cross-referencing the report text against the X-ray image itself. Given an X-ray and an AI-generated report, it labels each chest condition that is mentioned in the report or visible on the X-ray as **SUPPORTED**, **HALLUCINATED**, **MISSING**, or **INACCURATE**, and computes an overall **ELRRs** (Error-Labelled Radiology Report Score).
+This is the final (V11) model from the RadGuard Final Year Project (FYP) thesis, trained on MIMIC-CXR with a BioViL-T image encoder and a CXR-BERT text encoder coupled via bidirectional cross-attention.
+---
+## Model Description
+| Property | Value |
+|---|---|
+| **Task** | Radiology report error detection (multimodal classification) |
+| **Image encoder** | BioViL-T (Microsoft, MIMIC-CXR pretrained) |
+| **Text encoder** | CXR-BERT / BiomedVLP-BioViL-T tokenizer |
+| **Fusion** | Bidirectional cross-attention + MLP-Mixer |
+| **Output** | 14 conditions × 4 error classes + X-ray presence scores |
+| **Training data** | MIMIC-CXR (74,060 samples) |
+| **Val F1** | 0.66 (weighted: 0.63) |
+| **Parameters** | ~110 M (including frozen encoders) |
+| **Input image** | 448 × 448 RGB, ImageNet normalization |
+| **Max text length** | 128 tokens |
+---
+## Architecture
+```
+Chest X-Ray (448×448)              AI Report Sentence
+        │                                   │
+┌───────▼────────┐               ┌──────────▼───────────┐
+│   BioViL-T     │               │      CXR-BERT        │
+│ Image Encoder  │               │    Text Encoder      │
+│ (MIMIC-CXR)    │               │    (MIMIC-CXR)       │
+└───────┬────────┘               └──────────┬───────────┘
+        │ [B, 512, 14, 14]                  │ [B, 768]
+        │ 196 spatial patches               │ CLS token + token sequence
+        └──────────────────┬────────────────┘
+                           │
+           ┌───────────────▼──────────────────┐
+           │   Bidirectional Cross-Attention  │
+           │   (14 condition-specific heads)  │
+           │                                  │
+           │  Dir 1: Text CLS → Image patches │  ← WHERE is it in the image?
+           │  Dir 2: Image GAP → Text tokens  │  ← WHAT does the text say?
+           │                                  │
+           │  + Condition Type Embedding (×5) │
+           └───────────────┬──────────────────┘
+                           │
+           ┌───────────────▼──────────────────┐
+           │         MLP-Mixer Fusion         │
+           │         (4 blocks, 512-dim)      │
+           │                                  │
+           │  + CheXbert Label Encoder        │
+           │    (14 AI labels → 64-dim)       │
+           └───────────────┬──────────────────┘
+                           │
+              ┌────────────▼────────────┐
+              │   Shared MLP (256-dim)  │
+              └────────────┬────────────┘
+                           │
+         ┌─────────────────┼─────────────────┐
+         │                                   │
+┌────────▼──────────┐             ┌──────────▼────────┐
+│  Task 1 Heads     │             │  Task 2 Heads     │
+│ 14 × Linear(256→4)│             │ 14 × Linear(256→1)│
+│  Error class/cond │             │  X-ray presence   │
+└────────┬──────────┘             └──────────┬────────┘
+         │                                   │
+  SUPPORTED / HALLUCINATED          Present / Absent
+  MISSING  / INACCURATE             (per condition)
+```
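To make the two output branches in the diagram concrete, here is a minimal sketch of how the raw head outputs for one condition could be turned into a verdict and a presence flag. The class index order in `ERROR_CLASSES`, the sigmoid on the Task 2 output, the 0.5 threshold, and the function name `decode_condition` are all assumptions for illustration; the real conventions live in the RadGuard inference code.

```python
import math

# Assumed index order of the 4-way Task 1 output; check the repo for the real one.
ERROR_CLASSES = ["SUPPORTED", "HALLUCINATED", "MISSING", "INACCURATE"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_condition(task1_logits, task2_logit, threshold=0.5):
    """Map one condition's raw head outputs to a verdict and a presence flag."""
    probs = softmax(task1_logits)                    # Task 1 head: 4 logits
    best = max(range(len(probs)), key=probs.__getitem__)
    presence = 1.0 / (1.0 + math.exp(-task2_logit))  # Task 2 head: 1 logit
    return {
        "verdict": ERROR_CLASSES[best],
        "confidence": probs[best],
        "xray_present": presence >= threshold,
    }


print(decode_condition([2.0, 0.1, -1.0, 0.3], 1.5))
```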
+**Why BioViL-T + CXR-BERT?**
+Both encoders are jointly pretrained on MIMIC-CXR — the same domain as this task. Their feature spaces are already aligned, making cross-attention semantically meaningful without requiring a contrastive alignment stage. Earlier versions using DenseNet (ImageNet) + ClinicalBERT had mismatched feature spaces which created a performance ceiling.
+**Why bidirectional cross-attention?**
+Unidirectional attention (text → image only) finds *where* a condition appears but misses cases where the image is ambiguous and the text provides disambiguating context. The reverse direction (image → text) allows the model to attend to the specific words describing each condition, catching inaccurate descriptions even when the finding is visually present.
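The two attention directions can be sketched in PyTorch as follows. This is a deliberately simplified illustration, not the RadGuard module: it uses one shared block instead of 14 condition-specific heads, omits the condition type embedding and any padding mask, and the shared width of 256, the 4 attention heads, and the class name are assumed values.

```python
import torch
from torch import nn


class BidirectionalCrossAttention(nn.Module):
    """Simplified two-way cross-attention between image patches and text tokens."""

    def __init__(self, img_dim=512, txt_dim=768, dim=256, heads=4):
        super().__init__()
        # Project both modalities to a shared width (assumed value: 256).
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patches, tokens):
        # patches: [B, 512, H, W] feature map; tokens: [B, T, 768], CLS at index 0.
        b, _, h, w = patches.shape
        img = self.img_proj(patches.flatten(2).transpose(1, 2))  # [B, H*W, dim]
        txt = self.txt_proj(tokens)                               # [B, T, dim]

        # Direction 1: the text CLS token queries the image patches ("where").
        where, attn = self.txt_to_img(txt[:, :1], img, img)       # attn: [B, 1, H*W]
        # Direction 2: the average-pooled image queries the text tokens ("what").
        pooled = img.mean(dim=1, keepdim=True)
        what, _ = self.img_to_txt(pooled, txt, txt)

        fused = torch.cat([where, what], dim=-1).squeeze(1)       # [B, 2*dim]
        return fused, attn.reshape(b, h, w)
```

With a `[2, 512, 14, 14]` feature map and `[2, 128, 768]` token embeddings, the module returns a `[2, 512]` fused vector and a `[2, 14, 14]` attention map over the patch grid.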
+---
+## Error Classes
+The model classifies each chest condition into one of four error types:
+| Label | Meaning | Clinical Risk |
+|---|---|---|
+| `SUPPORTED` | Report correctly describes what is visible on the X-ray | ✅ Safe |
+| `HALLUCINATED` | Report mentions a finding that is **not** visible on the X-ray | 🔴 High — false positive diagnosis |
+| `MISSING` | A finding **is** visible on the X-ray but the report omits it | 🟠 High — missed diagnosis |
+| `INACCURATE` | Finding is present but described incorrectly (wrong severity, location, etc.) | 🟡 Moderate |
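The definitions in the table can be read as a small decision rule over three facts per condition: whether the report asserts the finding, whether it is visible on the X-ray, and whether the description is accurate. The sketch below encodes that reading. It illustrates the label semantics only and is not how the model arrives at its predictions; the function name and the `None` return for conditions that are neither reported nor visible are assumptions.

```python
from typing import Optional


def error_class(mentioned: bool, present: bool, accurate: bool = True) -> Optional[str]:
    """Return the error label implied by the table's definitions.

    mentioned: the report asserts the finding (a negated mention does not count).
    present:   the finding is visible on the X-ray.
    accurate:  severity, location, etc. are described correctly.
    """
    if mentioned and not present:
        return "HALLUCINATED"
    if present and not mentioned:
        return "MISSING"
    if mentioned and present:
        return "SUPPORTED" if accurate else "INACCURATE"
    return None  # neither reported nor visible: nothing to judge
```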
+---
+## 14 Chest Conditions
+```
+Enlarged Cardiomediastinum  Cardiomegaly        Lung Opacity
+Lung Lesion                 Edema               Consolidation
+Pneumonia                   Atelectasis         Pneumothorax
+Pleural Effusion            Pleural Other       Fracture
+Support Devices             No Finding
+```
+Conditions are grouped into 5 anatomical/semantic types (encoded as type embeddings):
+- **Cardiac** (0): Enlarged Cardiomediastinum, Cardiomegaly
+- **Parenchymal** (1): Lung Opacity, Lung Lesion, Edema, Consolidation, Pneumonia, Atelectasis
+- **Pleural** (2): Pneumothorax, Pleural Effusion, Pleural Other, Fracture
+- **Device** (3): Support Devices
+- **Normal** (4): No Finding
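As a lookup table, the grouping above might look like the following. The variable names are hypothetical; the integer IDs are the ones given in parentheses in the list.

```python
from collections import Counter

TYPE_NAMES = ["Cardiac", "Parenchymal", "Pleural", "Device", "Normal"]

# Condition name -> type ID, following the grouping listed above.
CONDITION_TYPES = {
    "Enlarged Cardiomediastinum": 0,
    "Cardiomegaly": 0,
    "Lung Opacity": 1,
    "Lung Lesion": 1,
    "Edema": 1,
    "Consolidation": 1,
    "Pneumonia": 1,
    "Atelectasis": 1,
    "Pneumothorax": 2,
    "Pleural Effusion": 2,
    "Pleural Other": 2,
    "Fracture": 2,
    "Support Devices": 3,
    "No Finding": 4,
}

# Number of conditions per type, keyed by type name.
counts = Counter(TYPE_NAMES[t] for t in CONDITION_TYPES.values())
print(dict(counts))
```

A type embedding layer would then be indexed by these IDs, one lookup per condition.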
+---
+## ELRRs Score
+The model outputs an **ELRRs** (Error-Labelled Radiology Report Score) inspired by [Yu et al. 2023 (RadCliQ)](https://doi.org/10.1016/j.patter.2023.100802):
+```
+ELRRs = (Σ weights) / N_active × 100
+Weights: SUPPORTED=+1.0, INACCURATE=−0.3, MISSING=−0.5, HALLUCINATED=−0.7
+```
+| Score | Grade | Description |
+|---|---|---|
+| ≥ 80 | Excellent | Clinically safe — minimal errors |
+| ≥ 60 | Good | Minor errors — clinically acceptable |
+| ≥ 40 | Fair | Moderate errors — review advised |
+| ≥ 20 | Poor | Significant errors — high risk |
+| < 20 | Critical | Severe errors — unsafe for clinical use |
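The formula and grade table can be worked through in a few lines of Python. This sketch assumes that `N_active` is the number of conditions that received a verdict (conditions that are neither reported nor visible are excluded); the function and constant names are illustrative and not taken from the RadGuard code.

```python
WEIGHTS = {
    "SUPPORTED": 1.0,
    "INACCURATE": -0.3,
    "MISSING": -0.5,
    "HALLUCINATED": -0.7,
}

# Lower bounds for each grade, checked from best to worst.
GRADES = [(80, "Excellent"), (60, "Good"), (40, "Fair"), (20, "Poor")]


def elrrs(verdicts):
    """Return (score, grade) for a list of per-condition verdicts."""
    if not verdicts:
        raise ValueError("at least one active condition is required")
    # Assumption: N_active is the number of verdicts passed in.
    score = sum(WEIGHTS[v] for v in verdicts) / len(verdicts) * 100
    grade = next((name for cutoff, name in GRADES if score >= cutoff), "Critical")
    return round(score, 1), grade


# Three supported findings and one missed finding: (3 - 0.5) / 4 * 100 = 62.5
print(elrrs(["SUPPORTED", "SUPPORTED", "SUPPORTED", "MISSING"]))
```

Under these weights the score ranges from -70 (every active condition hallucinated) to 100 (every active condition supported), so negative scores fall into the Critical band.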
+---
+## Training Details
+| Parameter | Value |
+|---|---|
+| **Dataset** | MIMIC-CXR (PhysioNet, v2.0.0) |
+| **Train samples** | ~67,000 |
+| **Val samples** | ~7,060 |
+| **Total** | 74,060 |
+| **Optimizer** | AdamW |
+| **Scheduler** | Cosine annealing with warmup |
+| **Image augmentation** | RandomHorizontalFlip, RandomAffine, ColorJitter |
+| **Dropout** | 0.4 |
+| **Batch size** | 16 |
+| **Mixed precision** | AMP (fp16) |
+| **Hardware** | NVIDIA A100 (Vast.ai) |
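The table names the scheduler but not its hyperparameters. As an illustration of what cosine annealing with warmup usually means, the sketch below computes a learning-rate multiplier per step: a linear ramp followed by a half-cosine decay to zero. The warmup length, total step count, and function name are placeholders, not values from the RadGuard training run.

```python
import math


def lr_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup: reaches 1.0 on the last warmup step.
        return (step + 1) / warmup_steps
    # Cosine decay from 1.0 down to 0.0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


for step in (0, 99, 550, 1000):
    print(step, round(lr_multiplier(step, warmup_steps=100, total_steps=1000), 4))
```

A function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR` (with the step counts bound via a lambda or `functools.partial`) on top of an AdamW optimizer.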
+### Training Evolution (V2 → V11)
+| Version | Val F1 | Key Change |
+|---|---|---|
+| V2 | 0.31 | Baseline: DenseNet + ClinicalBERT |
+| V3 | 0.38 | Added CheXbert labels |
+| V4 | 0.41 | Cross-attention introduced |
+| V5 | 0.44 | Pseudo-label generation |
+| V6 | 0.48 | Bidirectional cross-attention |
+| V7 | 0.51 | Type embeddings |
+| V8 | 0.55 | MLP-Mixer fusion |
+| V9 | 0.58 | Dataset expansion + cleaning |
+| V10 | 0.61 | BioViL-T + CXR-BERT encoders |
+| **V11** | **0.66** | Hyperparameter tuning + augmentation |
+---
+## How to Use
+### Requirements
+```bash
+pip install torch torchvision transformers hi-ml-multimodal pillow
+```
+### Load and Run Inference
+```python
+import torch
+from PIL import Image
+from torchvision import transforms
+# 1. Load the model weights
+model_path = "best_model_v11.pth"  # downloaded from this repo
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# 2. The full inference pipeline is in RadGuard-AI-Engine
+#    Clone: https://github.com/alyrraza/RadGuard-Medical-AI
+#    Then:
+from inference.model import get_model, get_tokenizer, run_inference_on_sentence
+from inference.pipeline import run_full_pipeline
+# 3. Run inference
+image = Image.open("chest_xray.jpg").convert("RGB")
+ai_report = "The heart is mildly enlarged. No pleural effusion is seen. Lungs are clear."
+result = run_full_pipeline(image, ai_report)
+print(f"ELRRs Score: {result['elrrs']['score']} — {result['elrrs']['grade']}")
+for cond in result['conditions']:
+    print(f"  {cond['name']}: {cond['verdict']} ({cond['confidence']:.0%})")
+```
+### REST API (Docker)
+```bash
+# Clone the repository that contains the full stack
+git clone https://github.com/alyrraza/RadGuard-Medical-AI
+cd RadGuard-Medical-AI
+# Set model path and start
+MODEL_PATH=/path/to/best_model_v11.pth docker-compose up
+# Call the API
+curl -X POST http://localhost:8000/analyze \
+  -F "file=@chest_xray.jpg" \
+  -F "ai_report=The heart is mildly enlarged. Lungs are clear."
+```
+### API Response Schema
+```json
+{
+  "task1_elrrs": {
+    "score": 71.4,
+    "grade": "Good",
+    "supported_count": 5,
+    "hallucinated_count": 1,
+    "missing_count": 0,
+    "inaccurate_count": 1
+  },
+  "task1_conditions": [
+    {
+      "name": "Cardiomegaly",
+      "verdict": "SUPPORTED",
+      "confidence": 0.87,
+      "meaning": "AI report is correct — X-ray confirms it",
+      "source_text": "The heart is mildly enlarged.",
+      "xray_present": true
+    }
+  ],
+  "task2_xray_findings": { "Cardiomegaly": { "xray_present": true, "confidence": 0.91 } },
+  "task3_heatmaps": { "Cardiomegaly": "http://.../results/abc_Cardiomegaly.png" },
+  "not_mentioned": ["Pneumothorax", "Fracture"],
+  "sentences_analyzed": 3
+}
+```
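A client consuming this response might pull out only the conditions that need review. The sketch below assumes the field names shown in the schema above; the function name `flagged_conditions` and the sample payload are made up for illustration.

```python
import json


def flagged_conditions(response):
    """Return (name, verdict, confidence) for every condition not marked SUPPORTED."""
    return [
        (c["name"], c["verdict"], c["confidence"])
        for c in response["task1_conditions"]
        if c["verdict"] != "SUPPORTED"
    ]


# Made-up payload that follows the schema above (values are illustrative only).
sample = json.loads("""
{
  "task1_elrrs": {"score": 15.0, "grade": "Critical"},
  "task1_conditions": [
    {"name": "Cardiomegaly", "verdict": "SUPPORTED", "confidence": 0.87},
    {"name": "Pleural Effusion", "verdict": "HALLUCINATED", "confidence": 0.74}
  ],
  "not_mentioned": ["Pneumothorax", "Fracture"]
}
""")

for name, verdict, confidence in flagged_conditions(sample):
    print(f"{name}: {verdict} ({confidence:.0%})")
```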
+---
+## Limitations
+- Trained exclusively on **MIMIC-CXR** (adult patients, US hospital system). Performance may degrade on pediatric, non-PA view, or non-US population X-rays.
+- Runs on **individual sentences** — inter-sentence context is not modeled.
+- CheXbert label extraction (used as auxiliary input) requires a separate model and adds latency. A keyword fallback is included but reduces accuracy.
+- **Not validated for clinical deployment.** This is a research/thesis prototype.
+---
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{raza2025radguard,
+  title     = {RadGuard: Detecting Errors in AI-Generated Radiology Reports
+               via Bidirectional Cross-Modal Attention},
+  author    = {Raza, Ali},
+  year      = {2025},
+  note      = {Final Year Project, Department of Computer Science,
+               National University of Computer and Emerging Sciences (FAST-NUCES)},
+  url       = {https://github.com/alyrraza/RadGuard-Medical-AI}
+}
+```
+This work builds on:
+```bibtex
+@article{yu2023evaluating,
+  title   = {Evaluating progress in automatic chest X-ray radiology report generation},
+  author  = {Yu, Feiyang and others},
+  journal = {Patterns},
+  volume  = {4},
+  number  = {9},
+  year    = {2023},
+  doi     = {10.1016/j.patter.2023.100802}
+}
+@inproceedings{bannur2023learning,
+  title     = {Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing},
+  author    = {Bannur, Shruthi and others},
+  booktitle = {CVPR},
+  year      = {2023}
+}
+```
+---
+## License
+MIT License. Model weights are derived from MIMIC-CXR data — usage requires a valid [PhysioNet credentialed account](https://physionet.org/settings/credentialing/) and agreement to the MIMIC-CXR data use agreement.
+---
+*⚕️ Medical Disclaimer: This model is a research prototype and has not been validated for clinical use. Do not use for diagnostic decisions.*