MedGemma-TI - Temporal Intelligence for Chest X-Ray Progression

MedGemma-TI is a QLoRA adapter for google/medgemma-4b-it that adds temporal progression assessment to a model that already speaks the language of medicine. Given two or more chronologically ordered chest X-rays, it compares PRIOR to CURRENT anatomy and outputs a structured report concluding with a single assessment: IMPROVED / STABLE / WORSENED / MIXED.

⚠️ Research prototype. Not validated for clinical use. See Limitations.


What It Does

Base MedGemma-4B-IT was not trained to reason across time. When we tested it on sequential chest X-ray comparison:

| Metric | Base MedGemma | MedGemma-TI |
|---|---|---|
| Accuracy (17,802 samples) | 23.7% | 44.0% (+20.3 pp) |
| Macro F1 | 0.140 | 0.343 (2.45×) |
| Worsening recall | 32.8% | 54.8% (+22.0 pp) |
| Missed worsening | 67.2% | 45.2% (1.49× fewer missed cases) |
| Temporal coherence (flip test) | 1.3% | 26.4% (19.7×) |

Temporal coherence: the fraction of paired test cases in which swapping the PRIOR ↔ CURRENT images flips the model's assessment (IMPROVED ↔ WORSENED) as expected.


Loading the Model

This repository contains LoRA adapter weights only. You must load the base model first, then apply the adapter.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
from peft import PeftModel

# 1. Load base model (4-bit quantization keeps inference under ~4 GB VRAM)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForImageTextToText.from_pretrained(
    "google/medgemma-4b-it",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 2. Apply the LoRA adapter
model = PeftModel.from_pretrained(base_model, "sh3hryarkhan/MedGemma-TI")
model.eval()

# 3. Load the processor from the base model
processor = AutoProcessor.from_pretrained("google/medgemma-4b-it")
```

Note: Use AutoModelForImageTextToText, NOT AutoModelForCausalLM. The latter lacks the vision encoder.


Inference - Structured Temporal Report

The model was trained with a specific prompt format. Use this format exactly to get structured temporal reports. Deviating from it (e.g. using lowercase headers or a different section order) will reduce output quality.

Prompt Structure

Images must be passed as actual image inputs. The <image_N> tags inside IMAGING TIMELINE are literal text markers the model associates with image positions - they are not image tokens.

```
PATIENT CONTEXT:
Age: {age} years | Sex: {sex}

IMAGING TIMELINE:
[IMAGE_1 | Date: {YYYY-MM-DD} | Role: Baseline]
<image_1>

[IMAGE_2 | Date: {YYYY-MM-DD} | Role: Current]
<image_2>

CLINICAL REQUEST:
{physician's question or "Compare these studies and assess interval change."}

TASK: Analyze the current study compared to the prior study and identify any interval changes. Conclude your analysis with a clear overall assessment using one of these terms: IMPROVED, STABLE, WORSENED, or MIXED.
```

Role labels follow this convention:

  • 2 images: Baseline, Current
  • 3+ images: Baseline, Intermediate, …, Current
  • 1 image: Current (falls back to single-image description; temporal comparison is not meaningful)
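The role-assignment convention above can be sketched as a small helper. This is illustrative only; `assign_roles` is not part of this repository:

```python
def assign_roles(n_images):
    """Return role labels for n chronologically ordered images."""
    if n_images <= 1:
        # Single-image fallback: temporal comparison is not meaningful
        return ["Current"]
    return ["Baseline"] + ["Intermediate"] * (n_images - 2) + ["Current"]
```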

Optional sections (insert after IMAGING TIMELINE, before CLINICAL REQUEST):

```
CLINICAL ALERT:
{free-text alert, e.g. "Patient on anticoagulation therapy"}

PATIENT NOTES:
({date})
{note text}
```

Previous findings (insert after CLINICAL REQUEST, before TASK):

```
PREVIOUS FINDINGS:
{prior radiology read for the baseline image}
```
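The full prompt, including the optional sections, can be assembled programmatically. A minimal sketch; the `build_prompt` helper and its argument names are our own invention, not part of this repository:

```python
TASK = ("TASK: Analyze the current study compared to the prior study and identify "
        "any interval changes. Conclude your analysis with a clear overall assessment "
        "using one of these terms: IMPROVED, STABLE, WORSENED, or MIXED.")

def build_prompt(age, sex, timeline, request,
                 alert=None, notes=None, previous_findings=None):
    """Assemble the training prompt format.

    timeline: list of (date, role) tuples, one per image, in chronological order.
    notes: optional (date, text) tuple.
    """
    sections = [f"PATIENT CONTEXT:\nAge: {age} years | Sex: {sex}"]
    images = "\n\n".join(
        f"[IMAGE_{i} | Date: {date} | Role: {role}]\n<image_{i}>"
        for i, (date, role) in enumerate(timeline, start=1)
    )
    sections.append(f"IMAGING TIMELINE:\n{images}")
    if alert:  # optional sections go after the timeline, before the request
        sections.append(f"CLINICAL ALERT:\n{alert}")
    if notes:
        note_date, note_text = notes
        sections.append(f"PATIENT NOTES:\n({note_date})\n{note_text}")
    sections.append(f"CLINICAL REQUEST:\n{request}")
    if previous_findings:  # goes after the request, before the task
        sections.append(f"PREVIOUS FINDINGS:\n{previous_findings}")
    sections.append(TASK)
    return "\n\n".join(sections)
```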

Python Example (Two Images)

```python
from PIL import Image
import torch

prior_image = Image.open("prior.jpg").convert("RGB")
current_image = Image.open("current.jpg").convert("RGB")

prompt_text = """PATIENT CONTEXT:
Age: 58 years | Sex: Female

IMAGING TIMELINE:
[IMAGE_1 | Date: 2024-11-01 | Role: Baseline]
<image_1>

[IMAGE_2 | Date: 2025-01-15 | Role: Current]
<image_2>

CLINICAL REQUEST:
Compare these two chest X-rays and assess interval change.

TASK: Analyze the current study compared to the prior study and identify any interval changes. Conclude your analysis with a clear overall assessment using one of these terms: IMPROVED, STABLE, WORSENED, or MIXED."""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": prior_image},
            {"type": "image", "image": current_image},
            {"type": "text", "text": prompt_text},
        ],
    }
]

# Two-step processing (required for multi-image inputs)
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    text=[text],
    images=[prior_image, current_image],
    return_tensors="pt",
    padding=True,
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        do_sample=True,
    )

# Decode only the newly generated tokens
input_length = inputs["input_ids"].shape[1]
generated = output_ids[0][input_length:]
report = processor.tokenizer.decode(generated, skip_special_tokens=True)
print(report)
```

Expected Output Format

The model produces a structured report in four sections:

```
PRIOR:
Moderate left pleural effusion. Mild cardiomegaly. Patchy bilateral opacities
consistent with pulmonary edema. No pneumothorax.

CURRENT:
Small residual left pleural effusion, markedly reduced from prior. Cardiac
silhouette improved. Bilateral opacities substantially decreased. Lung fields
otherwise clearer.

CHANGES:
- Left pleural effusion: markedly decreased
- Pulmonary edema: substantially improved
- Cardiomegaly: mild improvement
- No new findings

IMPRESSION: IMPROVED
```

The final line of the IMPRESSION section will contain exactly one of: IMPROVED, STABLE, WORSENED, or MIXED.
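Because the label set is closed, the categorical assessment can be pulled out of a generated report with a simple regex. Illustrative only; `extract_impression` is not provided by this repository:

```python
import re

def extract_impression(report):
    """Return the final IMPRESSION label from a report, or None if absent."""
    matches = re.findall(r"IMPRESSION:\s*(IMPROVED|STABLE|WORSENED|MIXED)", report)
    # Take the last match in case the model restates earlier labels
    return matches[-1] if matches else None
```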


Training Details

| Component | Detail |
|---|---|
| Base model | google/medgemma-4b-it |
| Method | QLoRA (4-bit quantization, r=16, α=16) |
| Target modules | All linear layers + lm_head, embed_tokens |
| Loss | Response-only cross-entropy (prompt tokens masked to −100) |
| Epochs | 4 |
| Learning rate | 2e-4 with 5% warmup |
| Gradient clip | 0.3 |
| Adapter size | ~2.6 GB |
| Inference VRAM | <4 GB (with 4-bit quantization) |
| Hardware | Virginia Tech ARC cluster (multi-GPU, torchrun) |
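The response-only loss above masks every prompt token so cross-entropy is computed only on the model's reply. A minimal pure-Python sketch of that masking (token IDs are illustrative):

```python
IGNORE_INDEX = -100  # positions with this label are ignored by cross-entropy

def mask_prompt_tokens(input_ids, prompt_length):
    """Build labels for response-only training: prompt positions get -100."""
    labels = list(input_ids)
    for i in range(min(prompt_length, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```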

Training Data

| Source | Training Examples | Findings / Population |
|---|---|---|
| CheXpert (Stanford) | ~56,320 temporal pairs | Effusion, pneumonia, cardiomegaly, nodules, atelectasis |
| RICORD-1C | 963 (321 pairs × 3 augmented copies) | COVID-19 ICU, viral pneumonia, ARDS |
| Total | 57,283 | - |
  • Patient-level train/val/test splits with audited zero cross-split leakage
  • Test set (17,802 samples): evaluated without oversampling or augmentation
  • IMPROVED class underrepresentation corrected by 1.61× upsampling in train split only
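Patient-level splitting means leakage can be audited by verifying that no patient ID appears in more than one split. A minimal sketch; the `audit_patient_leakage` helper is illustrative, not the project's actual audit code:

```python
def audit_patient_leakage(splits):
    """splits: dict mapping split name -> set of patient IDs.

    Returns a dict of (split_a, split_b) -> overlapping IDs; empty means clean.
    """
    leaks = {}
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = splits[a] & splits[b]
            if overlap:
                leaks[(a, b)] = overlap
    return leaks
```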

Evaluation

Test set: 17,802 samples (17,679 CheXpert + 123 RICORD), natural class distribution.

Per-Class Results (MedGemma-TI)

| Class | Precision | Recall | F1 |
|---|---|---|---|
| IMPROVED | 0.380 | 0.326 | 0.351 |
| STABLE | 0.328 | 0.653 | 0.436 |
| WORSENED | 0.484 | 0.548 | 0.514 |
| MIXED | 0.688 | 0.295 | 0.413 |
| Macro avg | - | - | 0.343 |

WORSENED recall (54.8%) is the clinically critical metric: it measures how often deteriorating patients are caught rather than missed.

Temporal Coherence - Flip Test

A novel evaluation measuring whether the model genuinely reasons about image order. Paired test cases swap PRIOR ↔ CURRENT images while updating all text metadata to match. A model reading only text labels would produce identical outputs (0% coherence). Only genuine visual comparison yields non-zero coherence.
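One plausible scoring rule for this test can be sketched as follows. Assumption on our part: STABLE and MIXED are treated as invariant under the swap, while IMPROVED and WORSENED must flip; the source only states the IMPROVED ↔ WORSENED behavior explicitly:

```python
# Expected label after swapping PRIOR and CURRENT images
FLIP = {"IMPROVED": "WORSENED", "WORSENED": "IMPROVED",
        "STABLE": "STABLE", "MIXED": "MIXED"}

def is_coherent(forward_label, reversed_label):
    """True if the reversed-order prediction matches the expected flip."""
    return FLIP.get(forward_label) == reversed_label

def coherence_rate(pairs):
    """pairs: iterable of (forward_label, reversed_label) predictions."""
    pairs = list(pairs)
    return sum(is_coherent(f, r) for f, r in pairs) / len(pairs)
```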

| Model | Coherence Rate |
|---|---|
| Base MedGemma-4B-IT | 1.3% |
| MedGemma-TI | 26.4% (19.7×) |

Limitations

  • Not validated externally. Results are from same-distribution test splits (CheXpert/RICORD). Performance on other hospital systems (MIMIC-CXR, UK Biobank, etc.) is unknown.
  • Chest X-rays only. CT, MRI, ultrasound, and other modalities are not supported.
  • Temporal coherence is a proof-of-concept. 26.4% is not a deployable threshold.
  • MIXED class is inherently ambiguous. Simultaneous regional improvement and worsening is harder to label consistently.
  • Not for clinical use. This is a research prototype and has not been validated for clinical decision-making.

Citation

```bibtex
@misc{khan2025medgemma_ti,
  title        = {MedGemma-TI: Teaching Temporal Reasoning to Medical Vision-Language Models},
  author       = {Muhammad Shehryar Khan and Abdullah Al Muhit},
  year         = {2025},
  institution  = {Virginia Tech},
  howpublished = {\url{https://huggingface.co/sh3hryarkhan/MedGemma-TI}},
}
```

Acknowledgments

Computational resources provided by Advanced Research Computing at Virginia Tech.

Base model: google/medgemma-4b-it. Training datasets: CheXpert (Stanford ML Group; the compressed release was used for training) and RICORD-1C (RSNA).
