# ExplainableVLM-Rad

**Interpretable Vision–Language Generative Framework for Radiology Report Generation**
## Abstract

ExplainableVLM-Rad is a unified vision–language generative framework for automated radiology report generation with interpretable visual grounding. The model integrates a transformer-based vision encoder and a biomedical language decoder under cross-modal alignment supervision to produce clinically coherent reports while enabling traceable reasoning between image regions and generated findings.
## Model Architecture

- **Encoder:** ViT-Base (patch16, 224×224, ImageNet-21k pretrained)
- **Decoder:** BioGPT (biomedical language model)
- **Framework:** Hugging Face Transformers `VisionEncoderDecoderModel`
### Core Components

- Patch embedding + positional encoding
- Multi-head self-attention
- Cross-modal attention
- Clinical knowledge conditioning
- Token–patch relevance mapping
- Attention- and gradient-based explainability
## Training Details

- Dataset: MIMIC-CXR (cleaned subset)
- Optimizer: AdamW
- Scheduler: linear warmup
- Mixed-precision training (AMP)
- Batch size: 4
- Device: CUDA (NVIDIA T4 GPU)
- Max steps: 500 (prototype run)
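The warmup schedule above can be sketched as a pure function. The `warmup_steps` and `base_lr` values here are illustrative assumptions, not hyperparameters stated in this card:

```python
def linear_warmup_lr(step: int, warmup_steps: int = 50,
                     max_steps: int = 500, base_lr: float = 5e-5) -> float:
    """Ramp the learning rate linearly up to base_lr over warmup_steps,
    then decay it linearly to zero by max_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))
```

In practice, `transformers.get_linear_schedule_with_warmup` computes this per optimizer step.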
## Objective Function

Total loss:

L_total = L_ce + λ₁ · L_align + λ₂ · L_explain

where:

- **L_ce**: cross-entropy loss (report generation)
- **L_align**: cross-modal alignment contrastive loss
- **L_explain**: explanation consistency loss
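The composite objective can be expressed directly in code. The λ weights below are placeholders, since the card does not state the values used in training:

```python
def total_loss(l_ce: float, l_align: float, l_explain: float,
               lambda1: float = 0.5, lambda2: float = 0.1) -> float:
    """L_total = L_ce + lambda1 * L_align + lambda2 * L_explain.

    lambda1/lambda2 trade report fluency (cross-entropy) against
    cross-modal alignment and explanation consistency; the defaults
    here are illustrative, not the trained configuration.
    """
    return l_ce + lambda1 * l_align + lambda2 * l_explain
```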
## Usage

```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image

# Load the model, image processor, and tokenizer from the Hub
model = VisionEncoderDecoderModel.from_pretrained("Vikhram-S/mimic-vit-biogpt")
processor = ViTImageProcessor.from_pretrained("Vikhram-S/mimic-vit-biogpt")
tokenizer = AutoTokenizer.from_pretrained("Vikhram-S/mimic-vit-biogpt")

# Preprocess a single chest X-ray (the ViT encoder expects 3-channel RGB)
image = Image.open("sample_xray.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate and decode the report
output_ids = model.generate(pixel_values, max_length=128)
report = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(report)
```
## Explainability Features

- Attention heatmaps
- Gradient-based saliency
- Token–region relevance mapping
- Confidence score estimation
These components improve interpretability and enhance trust in AI-assisted clinical decision support systems.
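As a minimal illustration of token–region relevance mapping: one generated token's cross-attention row over image patches can be reshaped into a 2-D heatmap. This sketch assumes ViT-Base/16 at 224×224 (14×14 = 196 patches) and plain Python lists; a real pipeline would pass `output_attentions=True` and `return_dict_in_generate=True` to `generate` and work with tensors:

```python
def attention_to_heatmap(attn_row, grid=14):
    """Min-max normalise one token's attention over grid*grid image
    patches and reshape it into a grid x grid heatmap in [0, 1].
    Assumes the CLS token has already been stripped from attn_row."""
    assert len(attn_row) == grid * grid, "expected one weight per patch"
    lo, hi = min(attn_row), max(attn_row)
    scale = (hi - lo) or 1.0  # avoid division by zero for uniform attention
    norm = [(a - lo) / scale for a in attn_row]
    return [norm[r * grid:(r + 1) * grid] for r in range(grid)]
```

Upsampling the 14×14 grid back to 224×224 and overlaying it on the radiograph yields the attention heatmaps listed above.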
## Intended Use

This model is intended for:

- Academic research
- AI-for-healthcare experimentation
- Educational demonstrations of explainable AI
## Limitations

- Trained on a limited subset of MIMIC-CXR
- Handles single-view radiographs only
- Not clinically validated
- Not approved for medical deployment
## Ethical Considerations

- May reflect biases present in the training data
- Must not replace professional radiological diagnosis
- Research use only
## Author

**Vikhram S**
Final-year B.E. student, Electronics and Communication Engineering
AI for Healthcare | Vision–Language Models | Explainable AI
## Citation

If you use this model, please cite:

> Vikhram S. (2026). *ExplainableVLM-Rad: Interpretable Vision–Language Framework for Radiology Report Generation.*