RadGuard V11 — AI Radiology Report Error Detector

RadGuard detects errors in AI-generated chest X-ray radiology reports by cross-referencing the report text against the actual X-ray image. Given an X-ray and an AI-generated report, it classifies each of 14 chest conditions as SUPPORTED, HALLUCINATED, MISSING, or INACCURATE, and computes an overall ELRRs (Error-Labelled Radiology Report Score).

This is the final V11 model from the RadGuard FYP thesis project, trained on MIMIC-CXR with a BioViL-T image encoder and CXR-BERT text encoder coupled via bidirectional cross-attention.


Model Description

| Property | Value |
|---|---|
| Task | Radiology report error detection (multimodal classification) |
| Image encoder | BioViL-T (Microsoft, MIMIC-CXR pretrained) |
| Text encoder | CXR-BERT / BiomedVLP-BioViL-T tokenizer |
| Fusion | Bidirectional cross-attention + MLP-Mixer |
| Output | 14 conditions × 4 error classes + X-ray presence scores |
| Training data | MIMIC-CXR (74,060 samples) |
| Val F1 | 0.66 |
| Parameters | ~110 M (including frozen encoders) |
| Input image | 448 × 448 RGB, ImageNet normalization |
| Max text length | 128 tokens |

Architecture

Chest X-Ray (448×448)              AI Report Sentence
        │                                   │
┌───────▼────────┐              ┌──────────▼──────────┐
│   BioViL-T     │              │      CXR-BERT       │
│ Image Encoder  │              │    Text Encoder     │
│ (MIMIC-CXR)    │              │    (MIMIC-CXR)      │
└───────┬────────┘              └──────────┬──────────┘
        │ [B, 512, 14, 14]                 │ [B, 768]
        │ 196 spatial patches              │ CLS token + token sequence
        └──────────────────┬────────────────┘
                           │
           ┌──────────────▼────────────────────┐
           │   Bidirectional Cross-Attention   │
           │   (14 condition-specific heads)   │
           │                                   │
           │  Dir 1: Text CLS → Image patches  │  ← WHERE is it in the image?
           │  Dir 2: Image GAP → Text tokens   │  ← WHAT does the text say?
           │                                   │
           │  + Condition Type Embedding (×5)  │
           └──────────────┬────────────────────┘
                          │
           ┌──────────────▼────────────────────┐
           │         MLP-Mixer Fusion          │
           │        (4 blocks, 512-dim)        │
           │                                   │
           │  + CheXbert Label Encoder         │
           │    (14 AI labels → 64-dim)        │
           └──────────────┬────────────────────┘
                          │
              ┌───────────▼─────────────┐
              │   Shared MLP (256-dim)  │
              └───────────┬─────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                                   │
┌───────▼───────────┐            ┌─────────▼─────────┐
│  Task 1 Heads     │            │  Task 2 Heads     │
│ 14 × Linear(256→4)│            │ 14 × Linear(256→1)│
│  Error class/cond │            │  X-ray presence   │
└───────┬───────────┘            └─────────┬─────────┘
        │                                   │
  SUPPORTED / HALLUCINATED          Present / Absent
  MISSING / INACCURATE              (per condition)
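The two head banks at the bottom of the diagram amount to 14 independent linear classifiers per task, applied to the shared MLP output. A minimal sketch; the class and attribute names are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

NUM_CONDITIONS = 14

class RadGuardHeads(nn.Module):
    """Sketch of the two output head banks above (names are illustrative)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Task 1: per-condition 4-way error classification (Linear 256 -> 4)
        self.error_heads = nn.ModuleList(
            [nn.Linear(dim, 4) for _ in range(NUM_CONDITIONS)])
        # Task 2: per-condition binary X-ray presence score (Linear 256 -> 1)
        self.presence_heads = nn.ModuleList(
            [nn.Linear(dim, 1) for _ in range(NUM_CONDITIONS)])

    def forward(self, fused: torch.Tensor):
        # fused: [B, 14, 256] -- one shared-MLP output vector per condition
        error_logits = torch.stack(
            [h(fused[:, i]) for i, h in enumerate(self.error_heads)], dim=1)
        presence_logits = torch.cat(
            [h(fused[:, i]) for i, h in enumerate(self.presence_heads)], dim=1)
        return error_logits, presence_logits  # [B, 14, 4] and [B, 14]

heads = RadGuardHeads()
error_logits, presence_logits = heads(torch.randn(2, NUM_CONDITIONS, 256))
```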

Why BioViL-T + CXR-BERT? Both encoders are jointly pretrained on MIMIC-CXR, the same domain as this task. Their feature spaces are already aligned, making cross-attention semantically meaningful without requiring a contrastive alignment stage. Earlier versions using DenseNet (ImageNet) + ClinicalBERT had mismatched feature spaces, which created a performance ceiling.

Why bidirectional cross-attention? Unidirectional attention (text → image only) finds where a condition appears but misses cases where the image is ambiguous and the text provides disambiguating context. The reverse direction (image → text) allows the model to attend to the specific words describing each condition, catching inaccurate descriptions even when the finding is visually present.
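The two attention directions can be sketched with `torch.nn.MultiheadAttention`. Dimensions, head count, and the concatenation-style fusion below are assumptions for illustration, not the repo's exact design:

```python
import torch
import torch.nn as nn

class BiCrossAttention(nn.Module):
    """Sketch of the two directions described above (sizes are assumptions)."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        # Dir 1: text CLS queries image patches ("WHERE is it in the image?")
        self.txt2img = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Dir 2: pooled image feature queries text tokens ("WHAT does the text say?")
        self.img2txt = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, text_cls, image_patches, text_tokens, image_gap):
        # text_cls: [B, 1, D], image_patches: [B, 196, D]
        # image_gap: [B, 1, D], text_tokens:  [B, T, D]
        where, _ = self.txt2img(text_cls, image_patches, image_patches)
        what, _ = self.img2txt(image_gap, text_tokens, text_tokens)
        # Fuse both directions into one vector (simple concat for the sketch)
        return torch.cat([where, what], dim=-1)  # [B, 1, 2D]

attn = BiCrossAttention()
B, D = 2, 512
fused = attn(torch.randn(B, 1, D), torch.randn(B, 196, D),
             torch.randn(B, 16, D), torch.randn(B, 1, D))
```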


Error Classes

The model classifies each chest condition into one of four error types:

| Label | Meaning | Clinical Risk |
|---|---|---|
| SUPPORTED | Report correctly describes what is visible on the X-ray | ✅ Safe |
| HALLUCINATED | Report mentions a finding that is not visible on the X-ray | 🔴 High — false-positive diagnosis |
| MISSING | A finding is visible on the X-ray but the report omits it | 🟠 High — missed diagnosis |
| INACCURATE | Finding is present but described incorrectly (wrong severity, location, etc.) | 🟡 Moderate |

14 Chest Conditions

Enlarged Cardiomediastinum  Cardiomegaly        Lung Opacity
Lung Lesion                 Edema               Consolidation
Pneumonia                   Atelectasis         Pneumothorax
Pleural Effusion            Pleural Other       Fracture
Support Devices             No Finding

Conditions are grouped into 5 anatomical/semantic types (encoded as type embeddings):

  • Cardiac (0): Enlarged Cardiomediastinum, Cardiomegaly
  • Parenchymal (1): Lung Opacity, Lesion, Edema, Consolidation, Pneumonia, Atelectasis
  • Pleural (2): Pneumothorax, Pleural Effusion, Pleural Other, Fracture
  • Device (3): Support Devices
  • Normal (4): No Finding
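For reference, the grouping above expressed as a lookup table suitable for a 5-entry type embedding (a sketch; the repo's actual mapping structure may differ):

```python
# Condition -> type index, following the grouping listed above.
# Usable as input to an embedding such as nn.Embedding(5, embed_dim).
CONDITION_TYPE = {
    # Cardiac (0)
    "Enlarged Cardiomediastinum": 0, "Cardiomegaly": 0,
    # Parenchymal (1)
    "Lung Opacity": 1, "Lung Lesion": 1, "Edema": 1,
    "Consolidation": 1, "Pneumonia": 1, "Atelectasis": 1,
    # Pleural (2)
    "Pneumothorax": 2, "Pleural Effusion": 2, "Pleural Other": 2, "Fracture": 2,
    # Device (3)
    "Support Devices": 3,
    # Normal (4)
    "No Finding": 4,
}

assert len(CONDITION_TYPE) == 14  # one entry per chest condition
```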

ELRRs Score

The model outputs an ELRRs (Error-Labelled Radiology Report Score) inspired by Yu et al. 2023 (RadCliQ):

ELRRs = (Σ weights) / N_active × 100

where N_active is the number of conditions assigned an error class, and the per-class weights are: SUPPORTED = +1.0, INACCURATE = −0.3, MISSING = −0.5, HALLUCINATED = −0.7.
| Score | Grade | Description |
|---|---|---|
| ≥ 80 | Excellent | Clinically safe — minimal errors |
| ≥ 60 | Good | Minor errors — clinically acceptable |
| ≥ 40 | Fair | Moderate errors — review advised |
| ≥ 20 | Poor | Significant errors — high risk |
| < 20 | Critical | Severe errors — unsafe for clinical use |
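Putting the formula, weights, and grade cutoffs together, a self-contained sketch of the scoring logic (the function name and the handling of an empty verdict list are assumptions):

```python
# Per-class weights from the ELRRs definition above.
WEIGHTS = {"SUPPORTED": 1.0, "INACCURATE": -0.3,
           "MISSING": -0.5, "HALLUCINATED": -0.7}

# Grade cutoffs from the table above, checked highest-first.
GRADES = [(80, "Excellent"), (60, "Good"), (40, "Fair"),
          (20, "Poor"), (float("-inf"), "Critical")]

def elrrs(verdicts):
    """verdicts: error-class labels for the active conditions.

    Returns (score, grade). Empty input is treated as Critical
    (an assumption; the original behavior is not specified).
    """
    if not verdicts:
        return 0.0, "Critical"
    score = sum(WEIGHTS[v] for v in verdicts) / len(verdicts) * 100
    grade = next(g for cutoff, g in GRADES if score >= cutoff)
    return round(score, 1), grade
```

For example, a report with four SUPPORTED conditions scores 100.0 (Excellent), while a single HALLUCINATED condition scores −70.0 (Critical).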

Training Details

| Parameter | Value |
|---|---|
| Dataset | MIMIC-CXR (PhysioNet, v2.0.0) |
| Train samples | ~67,000 |
| Val samples | ~7,060 |
| Total | 74,060 |
| Optimizer | AdamW |
| Scheduler | Cosine annealing with warmup |
| Image augmentation | RandomHorizontalFlip, RandomAffine, ColorJitter |
| Dropout | 0.4 |
| Batch size | 16 |
| Mixed precision | AMP (fp16) |
| Hardware | NVIDIA A100 (Vast.ai) |

Training Evolution (V2 → V11)

| Version | Val F1 | Key Change |
|---|---|---|
| V2 | 0.31 | Baseline: DenseNet + ClinicalBERT |
| V3 | 0.38 | Added CheXbert labels |
| V4 | 0.41 | Cross-attention introduced |
| V5 | 0.44 | Pseudo-label generation |
| V6 | 0.48 | Bidirectional cross-attention |
| V7 | 0.51 | Type embeddings |
| V8 | 0.55 | MLP-Mixer fusion |
| V9 | 0.58 | Dataset expansion + cleaning |
| V10 | 0.61 | BioViL-T + CXR-BERT encoders |
| V11 | 0.66 | Hyperparameter tuning + augmentation |

How to Use

Requirements

pip install torch torchvision transformers hi-ml-multimodal pillow

Load and Run Inference

import torch
from PIL import Image
from torchvision import transforms

# 1. Load the model weights
model_path = "best_model_v11.pth"  # downloaded from this repo
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 2. The full inference pipeline is in RadGuard-AI-Engine
#    Clone: https://github.com/alyrraza/RadGuard-Medical-AI
#    Then:
from inference.model import get_model, get_tokenizer, run_inference_on_sentence
from inference.pipeline import run_full_pipeline

# 3. Run inference
image = Image.open("chest_xray.jpg").convert("RGB")
ai_report = "The heart is mildly enlarged. No pleural effusion is seen. Lungs are clear."

result = run_full_pipeline(image, ai_report)

print(f"ELRRs Score: {result['elrrs']['score']} — {result['elrrs']['grade']}")
for cond in result['conditions']:
    print(f"  {cond['name']}: {cond['verdict']} ({cond['confidence']:.0%})")

REST API (Docker)

# Pull and run the full stack
git clone https://github.com/alyrraza/RadGuard-Medical-AI
cd RadGuard-Medical-AI

# Set model path and start
MODEL_PATH=/path/to/best_model_v11.pth docker-compose up

# Call the API
curl -X POST http://localhost:8000/analyze \
  -F "file=@chest_xray.jpg" \
  -F "ai_report=The heart is mildly enlarged. Lungs are clear."

API Response Schema

{
  "task1_elrrs": {
    "score": 71.4,
    "grade": "Good",
    "supported_count": 5,
    "hallucinated_count": 1,
    "missing_count": 0,
    "inaccurate_count": 1
  },
  "task1_conditions": [
    {
      "name": "Cardiomegaly",
      "verdict": "SUPPORTED",
      "confidence": 0.87,
      "meaning": "AI report is correct — X-ray confirms it",
      "source_text": "The heart is mildly enlarged.",
      "xray_present": true
    }
  ],
  "task2_xray_findings": { "Cardiomegaly": { "xray_present": true, "confidence": 0.91 } },
  "task3_heatmaps": { "Cardiomegaly": "http://.../results/abc_Cardiomegaly.png" },
  "not_mentioned": ["Pneumothorax", "Fracture"],
  "sentences_analyzed": 3
}
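A response of this shape can be consumed with the standard library alone. The abbreviated payload below is hard-coded for illustration rather than fetched from the API:

```python
import json

# Abbreviated sample matching the schema above (not a live API call).
payload = json.loads("""
{
  "task1_elrrs": {"score": 71.4, "grade": "Good"},
  "task1_conditions": [
    {"name": "Cardiomegaly", "verdict": "SUPPORTED", "confidence": 0.87}
  ],
  "not_mentioned": ["Pneumothorax", "Fracture"]
}
""")

e = payload["task1_elrrs"]
print(f"ELRRs {e['score']} ({e['grade']})")
for cond in payload["task1_conditions"]:
    print(f"  {cond['name']}: {cond['verdict']} ({cond['confidence']:.0%})")
```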

Limitations

  • Trained exclusively on MIMIC-CXR (adult patients, US hospital system). Performance may degrade on pediatric, non-PA view, or non-US population X-rays.
  • Runs on individual sentences — inter-sentence context is not modeled.
  • CheXbert label extraction (used as auxiliary input) requires a separate model and adds latency. A keyword fallback is included but reduces accuracy.
  • Not validated for clinical deployment. This is a research/thesis prototype.

Citation

If you use this model in your research, please cite:

@misc{raza2025radguard,
  title     = {RadGuard: Detecting Errors in AI-Generated Radiology Reports
               via Bidirectional Cross-Modal Attention},
  author    = {Raza, Ali},
  year      = {2025},
  note      = {Final Year Project, Department of Computer Science,
               National University of Computer and Emerging Sciences (FAST-NUCES)},
  url       = {https://github.com/alyrraza/RadGuard-Medical-AI}
}

This work builds on:

@article{yu2023evaluating,
  title   = {Evaluating progress in automatic chest X-ray radiology report generation},
  author  = {Yu, Feiyang and others},
  journal = {Patterns},
  volume  = {4},
  number  = {9},
  year    = {2023},
  doi     = {10.1016/j.patter.2023.100802}
}

@inproceedings{bannur2023learning,
  title     = {Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing},
  author    = {Bannur, Shruthi and others},
  booktitle = {CVPR},
  year      = {2023}
}

License

MIT License. Model weights are derived from MIMIC-CXR data — usage requires a valid PhysioNet credentialed account and agreement to the MIMIC-CXR data use agreement.


βš•οΈ Medical Disclaimer: This model is a research prototype and has not been validated for clinical use. Do not use for diagnostic decisions.
