# RadGuard V11: AI Radiology Report Error Detector
RadGuard detects errors in AI-generated chest X-ray radiology reports by cross-referencing the report text against the actual X-ray image. Given an X-ray and an AI-generated report, it classifies each mentioned condition as SUPPORTED, HALLUCINATED, MISSING, or INACCURATE, and computes an overall ELRRs (Error-Labelled Radiology Report Score).
This is the final V11 model from the RadGuard FYP thesis project, trained on MIMIC-CXR with a BioViL-T image encoder and CXR-BERT text encoder coupled via bidirectional cross-attention.
## Model Description
| Property | Value |
|---|---|
| Task | Radiology report error detection (multimodal classification) |
| Image encoder | BioViL-T (Microsoft, MIMIC-CXR pretrained) |
| Text encoder | CXR-BERT / BiomedVLP-BioViL-T tokenizer |
| Fusion | Bidirectional cross-attention + MLP-Mixer |
| Output | 14 conditions × 4 error classes + X-ray presence scores |
| Training data | MIMIC-CXR (74,060 samples) |
| Val F1 | 0.66 |
| Parameters | ~110 M (including frozen encoders) |
| Input image | 448 × 448 RGB, ImageNet normalization |
| Max text length | 128 tokens |
## Architecture
```
Chest X-Ray (448×448)                 AI Report Sentence
        │                                     │
┌───────┴────────┐               ┌────────────┴───────────┐
│    BioViL-T    │               │        CXR-BERT        │
│ Image Encoder  │               │      Text Encoder      │
│  (MIMIC-CXR)   │               │      (MIMIC-CXR)       │
└───────┬────────┘               └────────────┬───────────┘
        │ [B, 512, 14, 14]                    │ [B, 768]
        │ 196 spatial patches                 │ CLS token + token sequence
        └──────────────────┬──────────────────┘
                           │
         ┌─────────────────┴──────────────────┐
         │   Bidirectional Cross-Attention    │
         │    (14 condition-specific heads)   │
         │                                    │
         │  Dir 1: Text CLS → Image patches   │ ← WHERE is it in the image?
         │  Dir 2: Image GAP → Text tokens    │ ← WHAT does the text say?
         │                                    │
         │  + Condition Type Embedding (×5)   │
         └─────────────────┬──────────────────┘
                           │
         ┌─────────────────┴──────────────────┐
         │          MLP-Mixer Fusion          │
         │         (4 blocks, 512-dim)        │
         │                                    │
         │      + CheXbert Label Encoder      │
         │       (14 AI labels → 64-dim)      │
         └─────────────────┬──────────────────┘
                           │
             ┌─────────────┴─────────────┐
             │   Shared MLP (256-dim)    │
             └─────────────┬─────────────┘
                           │
            ┌──────────────┴───────────────┐
            │                              │
┌───────────┴────────┐          ┌──────────┴─────────┐
│    Task 1 Heads    │          │    Task 2 Heads    │
│ 14 × Linear(256→4) │          │ 14 × Linear(256→1) │
│  Error class/cond  │          │   X-ray presence   │
└───────────┬────────┘          └──────────┬─────────┘
            │                              │
 SUPPORTED / HALLUCINATED          Present / Absent
 MISSING / INACCURATE              (per condition)
```
**Why BioViL-T + CXR-BERT?** Both encoders are jointly pretrained on MIMIC-CXR, the same domain as this task. Their feature spaces are already aligned, making cross-attention semantically meaningful without requiring a contrastive alignment stage. Earlier versions using DenseNet (ImageNet) + ClinicalBERT had mismatched feature spaces, which created a performance ceiling.
**Why bidirectional cross-attention?** Unidirectional attention (text → image only) finds where a condition appears but misses cases where the image is ambiguous and the text provides disambiguating context. The reverse direction (image → text) allows the model to attend to the specific words describing each condition, catching inaccurate descriptions even when the finding is visually present.
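The two attention directions can be sketched with `torch.nn.MultiheadAttention`. This is a minimal illustration, not the trained model: dimensions, head count, and the projection of the 768-dim text features into a shared 512-dim space are assumptions, and the real model uses 14 condition-specific heads rather than one shared module:

```python
import torch
import torch.nn as nn

class BiCrossAttention(nn.Module):
    """Sketch of the two cross-attention directions described above."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_cls, text_tokens, img_patches):
        # Dir 1: text CLS queries image patches -> WHERE is the finding?
        loc, _ = self.txt2img(text_cls, img_patches, img_patches)
        # Dir 2: globally pooled image queries text tokens -> WHAT does the text say?
        img_gap = img_patches.mean(dim=1, keepdim=True)  # global average pool
        desc, _ = self.img2txt(img_gap, text_tokens, text_tokens)
        return torch.cat([loc, desc], dim=-1)  # [B, 1, 2*dim]

xattn = BiCrossAttention()
out = xattn(torch.randn(2, 1, 512),    # text CLS
            torch.randn(2, 128, 512),  # text token sequence
            torch.randn(2, 196, 512))  # 14×14 image patches, flattened
print(out.shape)  # [2, 1, 1024]
```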
## Error Classes
The model classifies each chest condition into one of four error types:
| Label | Meaning | Clinical Risk |
|---|---|---|
| `SUPPORTED` | Report correctly describes what is visible on the X-ray | ✅ Safe |
| `HALLUCINATED` | Report mentions a finding that is not visible on the X-ray | 🔴 High – false positive diagnosis |
| `MISSING` | A finding is visible on the X-ray but the report omits it | 🟠 High – missed diagnosis |
| `INACCURATE` | Finding is present but described incorrectly (wrong severity, location, etc.) | 🟡 Moderate |
## 14 Chest Conditions
```
Enlarged Cardiomediastinum   Cardiomegaly      Lung Opacity
Lung Lesion                  Edema             Consolidation
Pneumonia                    Atelectasis       Pneumothorax
Pleural Effusion             Pleural Other     Fracture
Support Devices              No Finding
```
Conditions are grouped into 5 anatomical/semantic types (encoded as type embeddings):
- Cardiac (0): Enlarged Cardiomediastinum, Cardiomegaly
- Parenchymal (1): Lung Opacity, Lesion, Edema, Consolidation, Pneumonia, Atelectasis
- Pleural (2): Pneumothorax, Pleural Effusion, Pleural Other, Fracture
- Device (3): Support Devices
- Normal (4): No Finding
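The grouping above maps each condition to a type ID that indexes a learned embedding. A sketch of that lookup; the embedding width (64) is illustrative, not a value stated in this card:

```python
import torch
import torch.nn as nn

# The 5 anatomical/semantic groups above, as a condition -> type-ID mapping
CONDITION_TYPES = {
    "Enlarged Cardiomediastinum": 0, "Cardiomegaly": 0,
    "Lung Opacity": 1, "Lung Lesion": 1, "Edema": 1,
    "Consolidation": 1, "Pneumonia": 1, "Atelectasis": 1,
    "Pneumothorax": 2, "Pleural Effusion": 2, "Pleural Other": 2, "Fracture": 2,
    "Support Devices": 3,
    "No Finding": 4,
}

# One learned vector per type; added to the fused features per condition
type_embed = nn.Embedding(num_embeddings=5, embedding_dim=64)
ids = torch.tensor(list(CONDITION_TYPES.values()))  # [14]
vecs = type_embed(ids)                              # [14, 64]
print(vecs.shape)
```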
## ELRRs Score
The model outputs an ELRRs (Error-Labelled Radiology Report Score) inspired by Yu et al. 2023 (RadCliQ):

```
ELRRs = (Σ weights) / N_active × 100
```

Weights: SUPPORTED = +1.0, INACCURATE = −0.3, MISSING = −0.5, HALLUCINATED = −0.7
| Score | Grade | Description |
|---|---|---|
| ≥ 80 | Excellent | Clinically safe – minimal errors |
| ≥ 60 | Good | Minor errors – clinically acceptable |
| ≥ 40 | Fair | Moderate errors – review advised |
| ≥ 20 | Poor | Significant errors – high risk |
| < 20 | Critical | Severe errors – unsafe for clinical use |
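The formula and grade bands above can be written out directly. The verdict weights and thresholds below are taken from this card; the function names are illustrative:

```python
# Verdict weights from this card
WEIGHTS = {"SUPPORTED": 1.0, "INACCURATE": -0.3,
           "MISSING": -0.5, "HALLUCINATED": -0.7}

def elrrs(verdicts):
    """ELRRs = (sum of weights) / N_active * 100, over the active conditions."""
    if not verdicts:
        return 0.0
    return sum(WEIGHTS[v] for v in verdicts) / len(verdicts) * 100

def grade(score):
    for cutoff, name in [(80, "Excellent"), (60, "Good"),
                         (40, "Fair"), (20, "Poor")]:
        if score >= cutoff:
            return name
    return "Critical"

# 5 supported conditions plus one hallucination:
score = elrrs(["SUPPORTED"] * 5 + ["HALLUCINATED"])
print(round(score, 1), grade(score))  # (5.0 - 0.7) / 6 * 100 = 71.7 -> Good
```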
## Training Details
| Parameter | Value |
|---|---|
| Dataset | MIMIC-CXR (PhysioNet, v2.0.0) |
| Train samples | ~67,000 |
| Val samples | ~7,060 |
| Total | 74,060 |
| Optimizer | AdamW |
| Scheduler | Cosine annealing with warmup |
| Image augmentation | RandomHorizontalFlip, RandomAffine, ColorJitter |
| Dropout | 0.4 |
| Batch size | 16 |
| Mixed precision | AMP (fp16) |
| Hardware | NVIDIA A100 (Vast.ai) |
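The optimizer/scheduler recipe in the table can be sketched with stock PyTorch components. Learning rate, warmup length, and horizon are placeholders, since the card does not state them; only AdamW, cosine annealing with warmup, and fp16 AMP come from the table:

```python
import torch

model = torch.nn.Linear(256, 4)  # stand-in for the trainable heads

# AdamW, per the table; lr and weight decay are assumed values
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)

# Cosine annealing with a linear warmup phase (warmup length assumed)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=500)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[500])

# fp16 mixed precision; disabled automatically when CUDA is unavailable
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
```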
## Training Evolution (V2 → V11)
| Version | Val F1 | Key Change |
|---|---|---|
| V2 | 0.31 | Baseline: DenseNet + ClinicalBERT |
| V3 | 0.38 | Added CheXbert labels |
| V4 | 0.41 | Cross-attention introduced |
| V5 | 0.44 | Pseudo-label generation |
| V6 | 0.48 | Bidirectional cross-attention |
| V7 | 0.51 | Type embeddings |
| V8 | 0.55 | MLP-Mixer fusion |
| V9 | 0.58 | Dataset expansion + cleaning |
| V10 | 0.61 | BioViL-T + CXR-BERT encoders |
| V11 | 0.66 | Hyperparameter tuning + augmentation |
## How to Use
### Requirements
```bash
pip install torch torchvision transformers hi-ml-multimodal pillow
```
### Load and Run Inference
```python
import torch
from PIL import Image

# 1. Point at the model weights (downloaded from this repo)
model_path = "best_model_v11.pth"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 2. The full inference pipeline lives in RadGuard-AI-Engine
#    Clone: https://github.com/alyrraza/RadGuard-Medical-AI
from inference.model import get_model, get_tokenizer, run_inference_on_sentence
from inference.pipeline import run_full_pipeline

# 3. Run inference
image = Image.open("chest_xray.jpg").convert("RGB")
ai_report = "The heart is mildly enlarged. No pleural effusion is seen. Lungs are clear."

result = run_full_pipeline(image, ai_report)
print(f"ELRRs Score: {result['elrrs']['score']} – {result['elrrs']['grade']}")
for cond in result['conditions']:
    print(f"  {cond['name']}: {cond['verdict']} ({cond['confidence']:.0%})")
```
### REST API (Docker)
```bash
# Clone and run the full stack
git clone https://github.com/alyrraza/RadGuard-Medical-AI
cd RadGuard-Medical-AI

# Set the model path and start the services
MODEL_PATH=/path/to/best_model_v11.pth docker-compose up

# Call the API
curl -X POST http://localhost:8000/analyze \
  -F "file=@chest_xray.jpg" \
  -F "ai_report=The heart is mildly enlarged. Lungs are clear."
```
### API Response Schema
```json
{
  "task1_elrrs": {
    "score": 71.4,
    "grade": "Good",
    "supported_count": 5,
    "hallucinated_count": 1,
    "missing_count": 0,
    "inaccurate_count": 1
  },
  "task1_conditions": [
    {
      "name": "Cardiomegaly",
      "verdict": "SUPPORTED",
      "confidence": 0.87,
      "meaning": "AI report is correct – X-ray confirms it",
      "source_text": "The heart is mildly enlarged.",
      "xray_present": true
    }
  ],
  "task2_xray_findings": { "Cardiomegaly": { "xray_present": true, "confidence": 0.91 } },
  "task3_heatmaps": { "Cardiomegaly": "http://.../results/abc_Cardiomegaly.png" },
  "not_mentioned": ["Pneumothorax", "Fracture"],
  "sentences_analyzed": 3
}
```
## Limitations
- Trained exclusively on MIMIC-CXR (adult patients, US hospital system). Performance may degrade on pediatric, non-PA view, or non-US population X-rays.
- Runs on individual sentences; inter-sentence context is not modeled.
- CheXbert label extraction (used as auxiliary input) requires a separate model and adds latency. A keyword fallback is included but reduces accuracy.
- Not validated for clinical deployment. This is a research/thesis prototype.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{raza2025radguard,
  title  = {RadGuard: Detecting Errors in AI-Generated Radiology Reports
            via Bidirectional Cross-Modal Attention},
  author = {Raza, Ali},
  year   = {2025},
  note   = {Final Year Project, Department of Computer Science,
            National University of Computer and Emerging Sciences (FAST-NUCES)},
  url    = {https://github.com/alyrraza/RadGuard-Medical-AI}
}
```
This work builds on:
```bibtex
@article{yu2023evaluating,
  title   = {Evaluating progress in automatic chest X-ray radiology report generation},
  author  = {Yu, Feiyang and others},
  journal = {Patterns},
  volume  = {4},
  number  = {9},
  year    = {2023},
  doi     = {10.1016/j.patter.2023.100802}
}

@inproceedings{bannur2023learning,
  title     = {Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing},
  author    = {Bannur, Shruthi and others},
  booktitle = {CVPR},
  year      = {2023}
}
```
## License
MIT License. Model weights are derived from MIMIC-CXR data; usage requires a valid PhysioNet credentialed account and agreement to the MIMIC-CXR data use agreement.
**⚠️ Medical Disclaimer:** This model is a research prototype and has not been validated for clinical use. Do not use it for diagnostic decisions.
## Evaluation Results

| Metric | Dataset | Value |
|---|---|---|
| Validation F1 | MIMIC-CXR validation set (self-reported) | 0.660 |
| Validation F1 (weighted) | MIMIC-CXR validation set (self-reported) | 0.630 |