# MedGemma-TI: Temporal Intelligence for Chest X-Ray Progression
MedGemma-TI is a QLoRA adapter for google/medgemma-4b-it that adds temporal progression assessment to a model that already speaks the language of medicine. Given two or more chronologically ordered chest X-rays, it compares PRIOR to CURRENT anatomy and outputs a structured report concluding with a single assessment: IMPROVED / STABLE / WORSENED / MIXED.
> ⚠️ **Research prototype. Not validated for clinical use.** See Limitations.
## What It Does
Base MedGemma-4B-IT was not trained to reason across time. When we tested it on sequential chest X-ray comparison:
| Metric | Base MedGemma | MedGemma-TI |
|---|---|---|
| Accuracy (17,802 samples) | 23.7% | 44.0% (+20.3pp) |
| Macro F1 | 0.140 | 0.343 (2.45×) |
| Worsening Recall | 32.8% | 54.8% (+22.0pp) |
| Missed Worsening | 67.2% | 45.2% (1.49× fewer misses) |
| Temporal Coherence (flip test) | 1.3% | 26.4% (19.7×) |
*Temporal coherence*: the rate at which swapping the PRIOR and CURRENT images flips the model's assessment (IMPROVED ↔ WORSENED) as expected.
## Loading the Model
This repository contains LoRA adapter weights only. You must load the base model first, then apply the adapter.
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel

# 1. Load base model (4-bit quantization keeps inference under ~4 GB VRAM)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForImageTextToText.from_pretrained(
    "google/medgemma-4b-it",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 2. Apply the LoRA adapter
model = PeftModel.from_pretrained(base_model, "sh3hryarkhan/MedGemma-TI")
model.eval()

# 3. Load the processor from the base model
processor = AutoProcessor.from_pretrained("google/medgemma-4b-it")
```
> **Note:** Use `AutoModelForImageTextToText`, not `AutoModelForCausalLM`; the latter lacks the vision encoder.
## Inference - Structured Temporal Report
The model was trained with a specific prompt format. Use this format exactly to get structured temporal reports. Deviating from it (e.g. using lowercase headers or a different section order) will reduce output quality.
### Prompt Structure
Images must be passed as actual image inputs. The `<image_N>` tags inside IMAGING TIMELINE are literal text markers the model associates with image positions; they are not image tokens.

```text
PATIENT CONTEXT:
Age: {age} years | Sex: {sex}
IMAGING TIMELINE:
[IMAGE_1 | Date: {YYYY-MM-DD} | Role: Baseline]
<image_1>
[IMAGE_2 | Date: {YYYY-MM-DD} | Role: Current]
<image_2>
CLINICAL REQUEST:
{physician's question or "Compare these studies and assess interval change."}
TASK: Analyze the current study compared to the prior study and identify any interval changes. Conclude your analysis with a clear overall assessment using one of these terms: IMPROVED, STABLE, WORSENED, or MIXED.
```
Role labels follow this convention:

- 2 images: `Baseline`, `Current`
- 3+ images: `Baseline`, `Intermediate`, …, `Current`
- 1 image: `Current` (falls back to single-image description; temporal comparison is not meaningful)
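The convention above can be sketched as a small helper (`role_labels` is a hypothetical name, not part of this repository):

```python
def role_labels(n_images: int) -> list[str]:
    """Return role labels for a chronologically ordered image list,
    following the Baseline / Intermediate / Current convention."""
    if n_images < 1:
        raise ValueError("need at least one image")
    if n_images == 1:
        # Single image: temporal comparison is not meaningful
        return ["Current"]
    return ["Baseline"] + ["Intermediate"] * (n_images - 2) + ["Current"]
```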
Optional sections (insert after IMAGING TIMELINE, before CLINICAL REQUEST):

```text
CLINICAL ALERT:
{free-text alert, e.g. "Patient on anticoagulation therapy"}
PATIENT NOTES:
({date})
{note text}
```
Previous findings (insert after CLINICAL REQUEST, before TASK):

```text
PREVIOUS FINDINGS:
{prior radiology read for the baseline image}
```
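Putting the section ordering together, a minimal prompt builder might look like this (hypothetical helper; only the section headers and their ordering come from the format above):

```python
TASK_LINE = (
    "TASK: Analyze the current study compared to the prior study and identify any "
    "interval changes. Conclude your analysis with a clear overall assessment using "
    "one of these terms: IMPROVED, STABLE, WORSENED, or MIXED."
)

def build_prompt(age, sex, timeline, request,
                 alert=None, notes=None, previous_findings=None):
    """Assemble the prompt sections in the order the model was trained on.
    timeline: list of (date, role) tuples, chronologically ordered."""
    lines = ["PATIENT CONTEXT:", f"Age: {age} years | Sex: {sex}", "IMAGING TIMELINE:"]
    for i, (date, role) in enumerate(timeline, start=1):
        lines.append(f"[IMAGE_{i} | Date: {date} | Role: {role}]")
        lines.append(f"<image_{i}>")
    if alert:
        lines += ["CLINICAL ALERT:", alert]
    if notes:
        lines.append("PATIENT NOTES:")
        for date, text in notes:
            lines += [f"({date})", text]
    lines += ["CLINICAL REQUEST:", request]
    if previous_findings:
        lines += ["PREVIOUS FINDINGS:", previous_findings]
    lines.append(TASK_LINE)
    return "\n".join(lines)
```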
### Python Example (Two Images)
```python
from PIL import Image
import torch

prior_image = Image.open("prior.jpg").convert("RGB")
current_image = Image.open("current.jpg").convert("RGB")

prompt_text = """PATIENT CONTEXT:
Age: 58 years | Sex: Female
IMAGING TIMELINE:
[IMAGE_1 | Date: 2024-11-01 | Role: Baseline]
<image_1>
[IMAGE_2 | Date: 2025-01-15 | Role: Current]
<image_2>
CLINICAL REQUEST:
Compare these two chest X-rays and assess interval change.
TASK: Analyze the current study compared to the prior study and identify any interval changes. Conclude your analysis with a clear overall assessment using one of these terms: IMPROVED, STABLE, WORSENED, or MIXED."""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": prior_image},
            {"type": "image", "image": current_image},
            {"type": "text", "text": prompt_text},
        ],
    }
]

# Two-step processing (required for multi-image inputs)
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    text=[text],
    images=[prior_image, current_image],
    return_tensors="pt",
    padding=True,
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        do_sample=True,
    )

# Decode only the newly generated tokens
input_length = inputs["input_ids"].shape[1]
generated = output_ids[0][input_length:]
report = processor.tokenizer.decode(generated, skip_special_tokens=True)
print(report)
```
### Expected Output Format
The model produces a structured report in four sections:
```text
PRIOR:
Moderate left pleural effusion. Mild cardiomegaly. Patchy bilateral opacities
consistent with pulmonary edema. No pneumothorax.

CURRENT:
Small residual left pleural effusion, markedly reduced from prior. Cardiac
silhouette improved. Bilateral opacities substantially decreased. Lung fields
otherwise clearer.

CHANGES:
- Left pleural effusion: markedly decreased
- Pulmonary edema: substantially improved
- Cardiomegaly: mild improvement
- No new findings

IMPRESSION: IMPROVED
```
The final line of the IMPRESSION section will contain exactly one of: IMPROVED, STABLE, WORSENED, or MIXED.
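Since the final assessment is constrained to four labels, downstream code can pull it out with a simple regex (illustrative sketch; `extract_assessment` is not part of the adapter):

```python
import re
from typing import Optional

def extract_assessment(report: str) -> Optional[str]:
    """Return the last IMPRESSION label found in a generated report, or None."""
    matches = re.findall(r"IMPRESSION:\s*(IMPROVED|STABLE|WORSENED|MIXED)\b", report)
    return matches[-1] if matches else None
```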
## Training Details
| Component | Detail |
|---|---|
| Base model | google/medgemma-4b-it |
| Method | QLoRA - 4-bit quantization, r=16, α=16 |
| Target modules | All linear layers + lm_head, embed_tokens |
| Loss | Response-only cross-entropy (prompt tokens masked to −100) |
| Epochs | 4 |
| Learning rate | 2e-4 with 5% warmup |
| Gradient clip | 0.3 |
| Adapter size | ~2.6 GB |
| Inference VRAM | <4 GB (with 4-bit quantization) |
| Hardware | Virginia Tech ARC cluster (multi-GPU, torchrun) |
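The response-only loss in the table works by masking prompt positions in the label sequence so cross-entropy is computed only on the model's answer. A minimal sketch over plain lists (real training code operates on batched tensors):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from cross-entropy

def mask_prompt_tokens(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, masking the first prompt_len positions
    so the loss covers only the response tokens."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```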
## Training Data
| Source | Training Examples | Findings / Population |
|---|---|---|
| CheXpert (Stanford) | ~56,320 temporal pairs | Effusion, pneumonia, cardiomegaly, nodules, atelectasis |
| RICORD-1C | 963 (321 pairs × 3) | COVID-19 ICU, viral pneumonia, ARDS |
| Total | 57,283 | - |
- Patient-level train/val/test splits with audited zero cross-split leakage
- Test set (17,802 samples): evaluated without oversampling or augmentation
- IMPROVED class underrepresentation corrected by 1.61× upsampling in train split only
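The 1.61× upsampling can be implemented as one whole copy of the class plus a 0.61 fractional random sample (illustrative sketch only; the actual pipeline may differ):

```python
import random

def upsample(examples: list, factor: float, seed: int = 0) -> list:
    """Repeat a minority class's examples by a fractional factor, e.g. 1.61x:
    whole copies plus a random sample covering the fractional remainder."""
    rng = random.Random(seed)
    whole, frac = int(factor), factor - int(factor)
    return examples * whole + rng.sample(examples, round(frac * len(examples)))
```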
## Evaluation
Test set: 17,802 samples (17,679 CheXpert + 123 RICORD), natural class distribution.
### Per-Class Results (MedGemma-TI)
| Class | Precision | Recall | F1 |
|---|---|---|---|
| IMPROVED | 0.380 | 0.326 | 0.351 |
| STABLE | 0.328 | 0.653 | 0.436 |
| WORSENED | 0.484 | 0.548 | 0.514 |
| MIXED | 0.688 | 0.295 | 0.413 |
| Macro avg | - | - | 0.343 |
WORSENED recall (54.8%) is the clinically critical metric: a missed WORSENED case is a deteriorating patient who goes unflagged.
### Temporal Coherence - Flip Test
A novel evaluation measuring whether the model genuinely reasons about image order. Paired test cases swap PRIOR ↔ CURRENT images while updating all text metadata to match. A model reading only text labels would produce identical outputs (0% coherence). Only genuine visual comparison yields non-zero coherence.
| Model | Coherence Rate |
|---|---|
| Base MedGemma-4B-IT | 1.3% |
| MedGemma-TI | 26.4% (19.7×) |
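One plausible way to score the flip test is below. Note this is our reconstruction: the exact scoring rule (in particular how STABLE and MIXED originals are handled) is an assumption of this sketch, which scores only IMPROVED/WORSENED originals.

```python
FLIP = {"IMPROVED": "WORSENED", "WORSENED": "IMPROVED"}

def coherence_rate(pairs: list[tuple[str, str]]) -> float:
    """pairs: (prediction on original order, prediction after swapping
    PRIOR/CURRENT). A pair is coherent when the flipped prediction is the
    mirror of the original; only IMPROVED/WORSENED originals are scored."""
    scored = [(a, b) for a, b in pairs if a in FLIP]
    if not scored:
        return 0.0
    return sum(b == FLIP[a] for a, b in scored) / len(scored)
```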
## Limitations
- Not validated externally. Results are from same-distribution test splits (CheXpert/RICORD). Performance on other hospital systems (MIMIC-CXR, UK Biobank, etc.) is unknown.
- Chest X-rays only. CT, MRI, ultrasound, and other modalities are not supported.
- Temporal coherence is a proof-of-concept. 26.4% is not a deployable threshold.
- MIXED class is inherently ambiguous. Simultaneous regional improvement and worsening is harder to label consistently.
- Not for clinical use. This is a research prototype and has not been validated for clinical decision-making.
## Citation
```bibtex
@misc{khan2025medgemma_ti,
  title        = {MedGemma-TI: Teaching Temporal Reasoning to Medical Vision-Language Models},
  author       = {Muhammad Shehryar Khan and Abdullah Al Muhit},
  year         = {2025},
  institution  = {Virginia Tech},
  howpublished = {\url{https://huggingface.co/sh3hryarkhan/MedGemma-TI}},
}
```
## Acknowledgments
Computational resources provided by Advanced Research Computing at Virginia Tech.
Base model: `google/medgemma-4b-it`. Training datasets: CheXpert (Stanford ML Group; the compressed release was used for training) and RICORD-1C (RSNA). The original CheXpert dataset is distributed by the Stanford ML Group.