🩺 ExplainableVLM-Rad

Interpretable Vision-Language Generative Framework for Radiology Report Generation


πŸ”¬ Abstract

ExplainableVLM-Rad is a unified Vision-Language generative framework for automated radiology report generation with interpretable visual grounding. The model integrates a transformer-based vision encoder with a biomedical language decoder under cross-modal alignment supervision, producing clinically coherent reports while enabling traceable reasoning between image regions and generated findings.


πŸ—οΈ Model Architecture

Encoder: ViT-Base (Patch16-224, ImageNet21k pretrained)
Decoder: BioGPT (Biomedical Language Model)
Framework: Hugging Face Transformers VisionEncoderDecoderModel

Core Components

  • Patch Embedding + Positional Encoding
  • Multi-Head Self-Attention
  • Cross-Modal Attention
  • Clinical Knowledge Conditioning
  • Token-Patch Relevance Mapping
  • Attention- & Gradient-based Explainability

πŸ“Š Training Details

  • Dataset: MIMIC-CXR (cleaned subset)
  • Optimizer: AdamW
  • Scheduler: Linear Warmup
  • Mixed Precision Training (AMP)
  • Batch Size: 4
  • Device: CUDA (T4 GPU)
  • Max Steps: 500 (prototype training)
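The optimizer, scheduler, and mixed-precision settings above can be sketched as a minimal training step. This is an illustrative stand-in, not the actual training script: the `nn.Linear` model, the warmup length of 50 steps, and the learning rate of 5e-5 are assumptions (the card only specifies AdamW, linear warmup, AMP, batch size 4, and 500 max steps).

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Tiny stand-in model; the real one is the ViT + BioGPT encoder-decoder.
model = nn.Linear(16, 16)

max_steps = 500     # prototype training budget from this card
warmup_steps = 50   # assumed warmup length (not specified in the card)

optimizer = AdamW(model.parameters(), lr=5e-5)  # assumed base LR

def linear_warmup(step):
    # Linear ramp to the base LR, then linear decay to zero at max_steps.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup)

# AMP: autocast + GradScaler are no-ops on CPU, active on CUDA (e.g. a T4).
use_cuda = torch.cuda.is_available()
device_type = "cuda" if use_cuda else "cpu"
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

for step in range(3):  # a few dummy steps
    x = torch.randn(4, 16)  # batch size 4, as in the card
    with torch.autocast(device_type=device_type, enabled=use_cuda):
        loss = model(x).pow(2).mean()  # placeholder loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    scheduler.step()
```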

πŸ“ˆ Objective Function

Total Loss:

L_total = L_ce + Ξ»1 L_align + Ξ»2 L_explain

Where:

  • L_ce: Cross-Entropy Loss (Report Generation)
  • L_align: Cross-Modal Alignment Contrastive Loss
  • L_explain: Explanation Consistency Loss
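The combined objective can be sketched in PyTorch as follows. The card only names the three terms, so the exact forms below are assumptions: a symmetric InfoNCE-style contrastive loss for L_align and an MSE between attention and gradient-saliency maps for L_explain; the λ values and temperature are likewise illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def total_loss(logits, labels, img_emb, txt_emb, attn_map, saliency_map,
               lambda1=0.5, lambda2=0.1):
    """Sketch of L_total = L_ce + λ1·L_align + λ2·L_explain (forms assumed)."""
    # L_ce: token-level cross-entropy on the generated report.
    l_ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

    # L_align: symmetric contrastive loss over paired image/text embeddings
    # (a common choice for cross-modal alignment; assumed here).
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / 0.07                  # temperature-scaled similarity
    targets = torch.arange(sim.size(0))
    l_align = 0.5 * (F.cross_entropy(sim, targets)
                     + F.cross_entropy(sim.t(), targets))

    # L_explain: consistency between attention maps and gradient saliency,
    # modeled here as a simple MSE (assumed for illustration).
    l_explain = F.mse_loss(attn_map, saliency_map)

    return l_ce + lambda1 * l_align + lambda2 * l_explain

# Dummy shapes: batch 4, sequence 8, vocab 100, embed dim 32, 196 ViT patches.
logits = torch.randn(4, 8, 100)
labels = torch.randint(0, 100, (4, 8))
loss = total_loss(logits, labels,
                  torch.randn(4, 32), torch.randn(4, 32),
                  torch.rand(4, 196), torch.rand(4, 196))
```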


πŸš€ Usage

```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image
import torch

# Load the model, image processor, and tokenizer from the Hub
model = VisionEncoderDecoderModel.from_pretrained("Vikhram-S/mimic-vit-biogpt")
processor = ViTImageProcessor.from_pretrained("Vikhram-S/mimic-vit-biogpt")
tokenizer = AutoTokenizer.from_pretrained("Vikhram-S/mimic-vit-biogpt")

# Preprocess a chest X-ray and generate a report
image = Image.open("sample_xray.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

output_ids = model.generate(pixel_values, max_length=128)
report = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(report)
```

🧠 Explainability Features

  • Attention Heatmaps
  • Gradient-based Saliency
  • Token-Region Relevance Mapping
  • Confidence Score Estimation

These components improve interpretability and enhance trust in AI-assisted clinical decision support systems.
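Turning patch-level attention into an image-sized heatmap can be sketched as below. The shape of the attention tensor is a hypothetical example (12 heads over 196 patches, i.e. one generated token attending to ViT-Base/16 patch tokens with the CLS token removed); with `transformers`, `generate(..., output_attentions=True)` exposes per-layer cross-attentions from which such a slice could be taken.

```python
import torch
import torch.nn.functional as F

def attention_heatmap(cross_attn, image_size=224, patch=16):
    """Upsample cross-attention over ViT patches to an image-sized heatmap.

    cross_attn: (num_heads, num_patches) attention from one generated token
    to the encoder's patch tokens (CLS token already removed).
    """
    grid = image_size // patch                          # 14 for ViT-Base/16 at 224px
    heat = cross_attn.mean(0).reshape(1, 1, grid, grid) # average attention heads
    heat = F.interpolate(heat, size=(image_size, image_size),
                         mode="bilinear", align_corners=False)
    heat = heat.squeeze()
    # Normalize to [0, 1] so the map can be overlaid on the radiograph.
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

heat = attention_heatmap(torch.rand(12, 196))  # 12 heads, 196 patches (assumed)
```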


🎯 Intended Use

This model is intended for:

  • Academic research
  • AI for healthcare experimentation
  • Educational demonstrations of explainable AI


⚠️ Limitations

  • Trained on a limited subset of MIMIC-CXR
  • Single-view radiographs only
  • Not clinically validated
  • Not approved for medical deployment

πŸ“Œ Ethical Considerations

  • May reflect dataset bias
  • Should not replace professional radiological diagnosis
  • Research use only

πŸ‘¨β€πŸ’» Author

Vikhram S
Final Year B.E. Electronics and Communication Engineering
AI for Healthcare | Vision-Language Models | Explainable AI


πŸ“œ Citation

If you use this model, please cite:

Vikhram S. (2026).
ExplainableVLM-Rad: Interpretable Vision-Language Framework for Radiology Report Generation.
