---
license: mit
tags:
- medical-imaging
- chest-x-ray
- temporal-analysis
- interval-change
- radiology
- vision-language
language:
- en
library_name: pytorch
pipeline_tag: image-feature-extraction
---

# TILA — Temporal Inversion-aware Learning and Alignment

**[Temporal Inversion for Learning Interval Change in Chest X-Rays](http://arxiv.org/abs/2604.04563)**

*Accepted at CVPR 2026*

TILA is a vision-language framework that uses *temporal inversion* — reversing the order of an image pair — as a supervisory signal to enhance the sensitivity of temporal vision-language models to directional change in chest X-rays. Given a current and a prior radiograph, TILA can:

1. **Extract temporal-aware image embeddings** (128-dim) that capture both the static anatomy and the interval change between the two images.
2. **Encode radiology text** into the same 128-dim space for zero-shot classification via image-text similarity.
3. **Predict interval change** (binary: change vs. no change) using a lightweight classification head.

The image encoder is based on the [BioViL-T](https://huggingface.co/microsoft/BiomedVLP-BioViL-T) architecture (ResNet-50 + Vision Transformer temporal pooler), and the text encoder is CXR-BERT; both are fine-tuned with temporal inversion-aware alignment.
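The role of temporal inversion can be sketched with a toy example. Everything below is hypothetical: `pair_embedding` is a two-dimensional stand-in for TILA's temporal image encoder, and the text anchors are made-up direction vectors, not real CXR-BERT outputs. The point it illustrates is the supervisory signal itself: swapping the order of the image pair should flip the predicted direction of change.

```python
import torch
import torch.nn.functional as F

# Toy 1-D "severity" features for two studies (hypothetical values).
prior = torch.tensor([0.8])    # more edema in the prior study
current = torch.tensor([0.3])  # less edema now, i.e. interval improvement

def pair_embedding(cur, prev):
    """Toy stand-in for a temporal pair embedding: encodes the
    signed difference between current and prior severity."""
    d = cur - prev
    return F.normalize(torch.cat([d, torch.ones_like(d)]), dim=0)

# Made-up text anchors for the three change directions.
t_improve = F.normalize(torch.tensor([-1.0, 1.0]), dim=0)
t_stable  = F.normalize(torch.tensor([ 0.0, 1.0]), dim=0)
t_worsen  = F.normalize(torch.tensor([ 1.0, 1.0]), dim=0)
texts = torch.stack([t_improve, t_stable, t_worsen])

fwd = texts @ pair_embedding(current, prior)  # (current, prior) order
rev = texts @ pair_embedding(prior, current)  # temporally inverted pair

print(fwd.argmax().item())  # 0, "improving"
print(rev.argmax().item())  # 2, "worsening": inversion flips the direction
```

A temporal encoder that is insensitive to pair order would score both orderings identically; penalizing that failure is what makes inversion a useful training signal.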
## Quick Start

### Installation

```bash
pip install "torch>=2.0" "torchvision>=0.15" "timm>=0.9" "transformers>=4.30" "safetensors>=0.4" pillow opencv-python numpy
```

(The version specifiers are quoted so the shell does not treat `>` as a redirection.)

### Load Model and Processor

```python
import torch
from transformers import AutoModel

# Load from the Hugging Face Hub
model = AutoModel.from_pretrained("lukeingawesome/TILA", trust_remote_code=True)
model = model.to("cuda", dtype=torch.bfloat16)
```

Or load locally:

```python
from model import TILAModel

model = TILAModel.from_pretrained("model.safetensors")
model = model.to("cuda", dtype=torch.bfloat16)
```

```python
from processor import TILAProcessor

# The processor handles everything: raw image → model-ready tensor
processor = TILAProcessor(dtype=torch.bfloat16, device="cuda")
```

### Extract Embeddings

```python
current = processor("current_cxr.png")    # accepts file paths, numpy arrays, or PIL images
previous = processor("previous_cxr.png")

# 128-dim L2-normalized embeddings
embeddings = model.get_embeddings(current, previous)
```

The processor automatically applies medical image preprocessing (windowing, black padding removal, resize) followed by model transforms (center crop to 448x448, expand to 3 channels). If your images are already preprocessed, skip the medical preprocessing:

```python
processor = TILAProcessor(raw_preprocess=False, dtype=torch.bfloat16, device="cuda")
```

The embeddings encode both the current image state and the temporal difference from the prior. They can be used for retrieval, similarity search, or as features for downstream tasks.
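Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product, so similarity search over a gallery of studies is a single matrix multiply. A minimal sketch, using random unit vectors as stand-ins for real `get_embeddings` outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for get_embeddings outputs:
# one query embedding and a gallery of 1000 studies, each 128-dim and unit-norm.
query = F.normalize(torch.randn(1, 128), dim=-1)
gallery = F.normalize(torch.randn(1000, 128), dim=-1)

# Unit-norm vectors: dot product == cosine similarity, in [-1, 1].
scores = (query @ gallery.T).squeeze(0)  # [1000]

# Indices of the 5 most similar studies, best first.
topk = scores.topk(k=5)
print(topk.indices.tolist())
```

For large galleries the same dot-product scoring drops directly into an approximate-nearest-neighbor index, since no extra normalization step is needed at query time.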
### Encode Text

```python
text_emb = model.encode_text([
    "Improved pulmonary edema.",
    "Stable pulmonary edema.",
    "Worsening pulmonary edema.",
])

# Zero-shot classification via image-text similarity
similarities = embeddings @ text_emb.T   # [1, 3]
prediction = similarities.argmax(dim=1)  # 0=improving, 1=stable, 2=worsening
```

### Predict Interval Change

```python
result = model.get_interval_change_prediction(current, previous, mode="bestf1")
print(result["probabilities"])  # Raw change probability
print(result["predictions"])    # Binary: 0 = no change, 1 = change
print(result["threshold"])      # Threshold used
```

Three threshold modes are available:

| Mode | Threshold | Description |
|------|-----------|-------------|
| `"bestf1"` | 0.29 | Maximizes F1 score (balanced sensitivity/specificity) |
| `"default"` | 0.50 | Standard sigmoid cutoff |
| `"spec95"` | 0.64 | Targets 95% specificity (conservative, fewer false positives) |

### CLI Example

```bash
python inference.py \
    --checkpoint model.safetensors \
    --current_image /path/to/current.png \
    --previous_image /path/to/previous.png
```

## Preprocessing Raw Images

> **Note:** The model itself does not apply this preprocessing. `TILAProcessor` runs it by default; if you are not using the processor, run it as a separate step before inference.

If your chest X-rays are raw (e.g., DICOM-derived PNGs with black borders or 16-bit depth), preprocess them first:

```python
import cv2
from preprocess import preprocess_image

img = preprocess_image("raw_cxr.png")
cv2.imwrite("preprocessed.png", img)
```

The pipeline applies:

1. **Read as-is** — preserves the original bit depth (supports 8-bit and 16-bit PNGs)
2. **Windowing** — clips to `mean +/- 2*std`, normalizes to [0, 1]
3. **Black padding removal** — contour-based crop
4. **Aspect-ratio-preserving resize** — longest side to 512px (configurable)

```bash
# CLI usage
python preprocess.py --input raw.png --output preprocessed.png
```

If your images are already preprocessed (contrast-normalized, cropped, resized grayscale PNGs), you can skip this step and feed them directly to the model.

## Input Format

- **Image format**: Grayscale chest X-ray (PNG, JPEG)
- **Model input**: Resize to 512px (shorter side), center crop to 448x448, repeat to 3 channels (handled by the transform in `inference.py`)
- **Pair**: Current (follow-up) image + previous (baseline) image of the same patient
- **Dtype**: `torch.bfloat16` recommended on GPU, `torch.float32` on CPU

## Files

| File | Description |
|------|-------------|
| `model.safetensors` | Model weights (613 MB, image + text + classifier) |
| `config.json` | Model configuration (for AutoModel support) |
| `configuration_tila.py` | TILAConfig class |
| `model.py` | Self-contained model architecture |
| `processor.py` | Image processor (raw image → model-ready tensor) |
| `preprocess.py` | Medical image preprocessing utilities |
| `inference.py` | Example inference script |

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{ko2026tila,
  title={Temporal Inversion for Learning Interval Change in Chest X-Rays},
  author={Ko, Hanbin and Jeon, Kyeongmin and Choi, Doowoong and Park, Chang Min},
  booktitle={CVPR},
  year={2026},
  url={http://arxiv.org/abs/2604.04563}
}
```

## Acknowledgements

This model builds upon [BioViL-T](https://huggingface.co/microsoft/BiomedVLP-BioViL-T) by Microsoft Research:

```bibtex
@inproceedings{bannur2023biovilt,
  title={Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing},
  author={Bannur, Shruthi and Hyland, Stephanie and Liu, Qianchu and Perez-Garcia, Fernando and Oktay, Ozan and Naumann, Tristan and Nori, Aditya and Alvarez-Valle, Javier},
  booktitle={CVPR},
  year={2023}
}
```

## License

This model is released under the [MIT License](LICENSE).