# RadGuard V11: AI Radiology Report Error Detector
RadGuard detects errors in AI-generated chest X-ray radiology reports by cross-referencing the report text against the actual X-ray image. Given an X-ray and an AI-generated report, it classifies each mentioned condition as SUPPORTED, HALLUCINATED, MISSING, or INACCURATE, and computes an overall ELRRs (Error-Labelled Radiology Report Score).
This is the final V11 model from the RadGuard FYP thesis project, trained on MIMIC-CXR with a BioViL-T image encoder and CXR-BERT text encoder coupled via bidirectional cross-attention.
## Model Description
| Property | Value |
|---|---|
| Task | Radiology report error detection (multimodal classification) |
| Image encoder | BioViL-T (Microsoft, MIMIC-CXR pretrained) |
| Text encoder | CXR-BERT / BiomedVLP-BioViL-T tokenizer |
| Fusion | Bidirectional cross-attention + MLP-Mixer |
| Output | 14 conditions × 4 error classes + X-ray presence scores |
| Training data | MIMIC-CXR (74,060 samples) |
| Val F1 | 0.66 |
| Parameters | ~110 M (including frozen encoders) |
| Input image | 448 × 448 RGB, ImageNet normalization |
| Max text length | 128 tokens |
## Architecture
```
Chest X-Ray (448×448)                 AI Report Sentence
        │                                     │
┌───────┴────────┐               ┌────────────┴───────────┐
│    BioViL-T    │               │        CXR-BERT        │
│ Image Encoder  │               │      Text Encoder      │
│  (MIMIC-CXR)   │               │      (MIMIC-CXR)       │
└───────┬────────┘               └────────────┬───────────┘
        │ [B, 512, 14, 14]                    │ [B, 768]
        │ 196 spatial patches                 │ CLS token + token sequence
        └──────────────────┬──────────────────┘
                           │
         ┌─────────────────┴──────────────────┐
         │   Bidirectional Cross-Attention    │
         │    (14 condition-specific heads)   │
         │                                    │
         │  Dir 1: Text CLS → Image patches   │ ← WHERE is it in the image?
         │  Dir 2: Image GAP → Text tokens    │ ← WHAT does the text say?
         │                                    │
         │  + Condition Type Embedding (×5)   │
         └─────────────────┬──────────────────┘
                           │
         ┌─────────────────┴──────────────────┐
         │          MLP-Mixer Fusion          │
         │         (4 blocks, 512-dim)        │
         │                                    │
         │      + CheXbert Label Encoder      │
         │       (14 AI labels → 64-dim)      │
         └─────────────────┬──────────────────┘
                           │
             ┌─────────────┴─────────────┐
             │   Shared MLP (256-dim)    │
             └─────────────┬─────────────┘
                           │
            ┌──────────────┴───────────────┐
            │                              │
┌───────────┴────────┐          ┌──────────┴─────────┐
│    Task 1 Heads    │          │    Task 2 Heads    │
│ 14 × Linear(256→4) │          │ 14 × Linear(256→1) │
│  Error class/cond  │          │   X-ray presence   │
└───────────┬────────┘          └──────────┬─────────┘
            │                              │
 SUPPORTED / HALLUCINATED          Present / Absent
 MISSING / INACCURATE              (per condition)
```
**Why BioViL-T + CXR-BERT?** Both encoders are jointly pretrained on MIMIC-CXR, the same domain as this task. Their feature spaces are already aligned, making cross-attention semantically meaningful without requiring a contrastive alignment stage. Earlier versions using DenseNet (ImageNet) + ClinicalBERT had mismatched feature spaces, which created a performance ceiling.
**Why bidirectional cross-attention?** Unidirectional attention (text → image only) finds where a condition appears but misses cases where the image is ambiguous and the text provides disambiguating context. The reverse direction (image → text) allows the model to attend to the specific words describing each condition, catching inaccurate descriptions even when the finding is visually present.
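The two attention directions can be sketched with `torch.nn.MultiheadAttention`. This is a minimal illustration, not the trained model: dimensions, head count, and the projection of the 768-dim text features into a shared 512-dim space are assumptions, and the real model uses 14 condition-specific heads rather than one shared module:

```python
import torch
import torch.nn as nn

class BiCrossAttention(nn.Module):
    """Sketch of the two cross-attention directions described above."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_cls, text_tokens, img_patches):
        # Dir 1: text CLS queries image patches -> WHERE is the finding?
        loc, _ = self.txt2img(text_cls, img_patches, img_patches)
        # Dir 2: globally pooled image queries text tokens -> WHAT does the text say?
        img_gap = img_patches.mean(dim=1, keepdim=True)  # global average pool
        desc, _ = self.img2txt(img_gap, text_tokens, text_tokens)
        return torch.cat([loc, desc], dim=-1)  # [B, 1, 2*dim]

xattn = BiCrossAttention()
out = xattn(torch.randn(2, 1, 512),    # text CLS
            torch.randn(2, 128, 512),  # text token sequence
            torch.randn(2, 196, 512))  # 14×14 image patches, flattened
print(out.shape)  # [2, 1, 1024]
```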
## Error Classes
The model classifies each chest condition into one of four error types:
| Label | Meaning | Clinical Risk |
|---|---|---|
| `SUPPORTED` | Report correctly describes what is visible on the X-ray | ✅ Safe |
| `HALLUCINATED` | Report mentions a finding that is not visible on the X-ray | 🔴 High – false positive diagnosis |
| `MISSING` | A finding is visible on the X-ray but the report omits it | 🟠 High – missed diagnosis |
| `INACCURATE` | Finding is present but described incorrectly (wrong severity, location, etc.) | 🟡 Moderate |
## 14 Chest Conditions
```
Enlarged Cardiomediastinum   Cardiomegaly      Lung Opacity
Lung Lesion                  Edema             Consolidation
Pneumonia                    Atelectasis       Pneumothorax
Pleural Effusion             Pleural Other     Fracture
Support Devices              No Finding
```
Conditions are grouped into 5 anatomical/semantic types (encoded as type embeddings):
- Cardiac (0): Enlarged Cardiomediastinum, Cardiomegaly
- Parenchymal (1): Lung Opacity, Lesion, Edema, Consolidation, Pneumonia, Atelectasis
- Pleural (2): Pneumothorax, Pleural Effusion, Pleural Other, Fracture
- Device (3): Support Devices
- Normal (4): No Finding
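The grouping above maps each condition to a type ID that indexes a learned embedding. A sketch of that lookup; the embedding width (64) is illustrative, not a value stated in this card:

```python
import torch
import torch.nn as nn

# The 5 anatomical/semantic groups above, as a condition -> type-ID mapping
CONDITION_TYPES = {
    "Enlarged Cardiomediastinum": 0, "Cardiomegaly": 0,
    "Lung Opacity": 1, "Lung Lesion": 1, "Edema": 1,
    "Consolidation": 1, "Pneumonia": 1, "Atelectasis": 1,
    "Pneumothorax": 2, "Pleural Effusion": 2, "Pleural Other": 2, "Fracture": 2,
    "Support Devices": 3,
    "No Finding": 4,
}

# One learned vector per type; added to the fused features per condition
type_embed = nn.Embedding(num_embeddings=5, embedding_dim=64)
ids = torch.tensor(list(CONDITION_TYPES.values()))  # [14]
vecs = type_embed(ids)                              # [14, 64]
print(vecs.shape)
```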
## ELRRs Score
The model outputs an ELRRs (Error-Labelled Radiology Report Score) inspired by Yu et al. 2023 (RadCliQ):

```
ELRRs = (Σ weights) / N_active × 100
```

Weights: SUPPORTED = +1.0, INACCURATE = −0.3, MISSING = −0.5, HALLUCINATED = −0.7
| Score | Grade | Description |
|---|---|---|
| ≥ 80 | Excellent | Clinically safe – minimal errors |
| ≥ 60 | Good | Minor errors – clinically acceptable |
| ≥ 40 | Fair | Moderate errors – review advised |
| ≥ 20 | Poor | Significant errors – high risk |
| < 20 | Critical | Severe errors – unsafe for clinical use |
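The formula and grade bands above can be written out directly. The verdict weights and thresholds below are taken from this card; the function names are illustrative:

```python
# Verdict weights from this card
WEIGHTS = {"SUPPORTED": 1.0, "INACCURATE": -0.3,
           "MISSING": -0.5, "HALLUCINATED": -0.7}

def elrrs(verdicts):
    """ELRRs = (sum of weights) / N_active * 100, over the active conditions."""
    if not verdicts:
        return 0.0
    return sum(WEIGHTS[v] for v in verdicts) / len(verdicts) * 100

def grade(score):
    for cutoff, name in [(80, "Excellent"), (60, "Good"),
                         (40, "Fair"), (20, "Poor")]:
        if score >= cutoff:
            return name
    return "Critical"

# 5 supported conditions plus one hallucination:
score = elrrs(["SUPPORTED"] * 5 + ["HALLUCINATED"])
print(round(score, 1), grade(score))  # (5.0 - 0.7) / 6 * 100 = 71.7 -> Good
```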
## Training Details
| Parameter | Value |
|---|---|
| Dataset | MIMIC-CXR (PhysioNet, v2.0.0) |
| Train samples | ~67,000 |
| Val samples | ~7,060 |
| Total | 74,060 |
| Optimizer | AdamW |
| Scheduler | Cosine annealing with warmup |
| Image augmentation | RandomHorizontalFlip, RandomAffine, ColorJitter |
| Dropout | 0.4 |
| Batch size | 16 |
| Mixed precision | AMP (fp16) |
| Hardware | NVIDIA A100 (Vast.ai) |
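The optimizer/scheduler recipe in the table can be sketched with stock PyTorch components. Learning rate, warmup length, and horizon are placeholders, since the card does not state them; only AdamW, cosine annealing with warmup, and fp16 AMP come from the table:

```python
import torch

model = torch.nn.Linear(256, 4)  # stand-in for the trainable heads

# AdamW, per the table; lr and weight decay are assumed values
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)

# Cosine annealing with a linear warmup phase (warmup length assumed)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=500)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[500])

# fp16 mixed precision; disabled automatically when CUDA is unavailable
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
```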
## Training Evolution (V2 → V11)
| Version | Val F1 | Key Change |
|---|---|---|
| V2 | 0.31 | Baseline: DenseNet + ClinicalBERT |
| V3 | 0.38 | Added CheXbert labels |
| V4 | 0.41 | Cross-attention introduced |
| V5 | 0.44 | Pseudo-label generation |
| V6 | 0.48 | Bidirectional cross-attention |
| V7 | 0.51 | Type embeddings |
| V8 | 0.55 | MLP-Mixer fusion |
| V9 | 0.58 | Dataset expansion + cleaning |
| V10 | 0.61 | BioViL-T + CXR-BERT encoders |
| V11 | 0.66 | Hyperparameter tuning + augmentation |
## How to Use
### Requirements
```bash
pip install torch torchvision transformers hi-ml-multimodal pillow
```
### Load and Run Inference
```python
import torch
from PIL import Image

# 1. Point at the model weights (downloaded from this repo)
model_path = "best_model_v11.pth"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 2. The full inference pipeline lives in RadGuard-AI-Engine
#    Clone: https://github.com/alyrraza/RadGuard-Medical-AI
from inference.model import get_model, get_tokenizer, run_inference_on_sentence
from inference.pipeline import run_full_pipeline

# 3. Run inference
image = Image.open("chest_xray.jpg").convert("RGB")
ai_report = "The heart is mildly enlarged. No pleural effusion is seen. Lungs are clear."

result = run_full_pipeline(image, ai_report)
print(f"ELRRs Score: {result['elrrs']['score']} – {result['elrrs']['grade']}")
for cond in result['conditions']:
    print(f"  {cond['name']}: {cond['verdict']} ({cond['confidence']:.0%})")
```
### REST API (Docker)
```bash
# Clone and run the full stack
git clone https://github.com/alyrraza/RadGuard-Medical-AI
cd RadGuard-Medical-AI

# Set the model path and start the services
MODEL_PATH=/path/to/best_model_v11.pth docker-compose up

# Call the API
curl -X POST http://localhost:8000/analyze \
  -F "file=@chest_xray.jpg" \
  -F "ai_report=The heart is mildly enlarged. Lungs are clear."
```
### API Response Schema
```json
{
  "task1_elrrs": {
    "score": 71.4,
    "grade": "Good",
    "supported_count": 5,
    "hallucinated_count": 1,
    "missing_count": 0,
    "inaccurate_count": 1
  },
  "task1_conditions": [
    {
      "name": "Cardiomegaly",
      "verdict": "SUPPORTED",
      "confidence": 0.87,
      "meaning": "AI report is correct – X-ray confirms it",
      "source_text": "The heart is mildly enlarged.",
      "xray_present": true
    }
  ],
  "task2_xray_findings": { "Cardiomegaly": { "xray_present": true, "confidence": 0.91 } },
  "task3_heatmaps": { "Cardiomegaly": "http://.../results/abc_Cardiomegaly.png" },
  "not_mentioned": ["Pneumothorax", "Fracture"],
  "sentences_analyzed": 3
}
```
## Limitations
- Trained exclusively on MIMIC-CXR (adult patients, US hospital system). Performance may degrade on pediatric, non-PA view, or non-US population X-rays.
- Runs on individual sentences; inter-sentence context is not modeled.
- CheXbert label extraction (used as auxiliary input) requires a separate model and adds latency. A keyword fallback is included but reduces accuracy.
- Not validated for clinical deployment. This is a research/thesis prototype.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{raza2025radguard,
  title  = {RadGuard: Detecting Errors in AI-Generated Radiology Reports
            via Bidirectional Cross-Modal Attention},
  author = {Raza, Ali},
  year   = {2025},
  note   = {Final Year Project, Department of Computer Science,
            National University of Computer and Emerging Sciences (FAST-NUCES)},
  url    = {https://github.com/alyrraza/RadGuard-Medical-AI}
}
```
This work builds on:
```bibtex
@article{yu2023evaluating,
  title   = {Evaluating progress in automatic chest X-ray radiology report generation},
  author  = {Yu, Feiyang and others},
  journal = {Patterns},
  volume  = {4},
  number  = {9},
  year    = {2023},
  doi     = {10.1016/j.patter.2023.100802}
}

@inproceedings{bannur2023learning,
  title     = {Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing},
  author    = {Bannur, Shruthi and others},
  booktitle = {CVPR},
  year      = {2023}
}
```
## License
MIT License. Model weights are derived from MIMIC-CXR data; usage requires a valid PhysioNet credentialed account and agreement to the MIMIC-CXR data use agreement.
**⚠️ Medical Disclaimer:** This model is a research prototype and has not been validated for clinical use. Do not use it for diagnostic decisions.
## Evaluation Results

| Metric | Dataset | Value |
|---|---|---|
| Validation F1 | MIMIC-CXR validation set (self-reported) | 0.660 |
| Validation F1 (weighted) | MIMIC-CXR validation set (self-reported) | 0.630 |