Upload folder using huggingface_hub
- README.md +213 -0
- inference.py +60 -0
- model.py +547 -0
- model.safetensors +3 -0
- preprocess.py +111 -0
- processor.py +117 -0
README.md
ADDED
@@ -0,0 +1,213 @@
---
license: mit
tags:
- medical-imaging
- chest-x-ray
- temporal-analysis
- interval-change
- radiology
language:
- en
library_name: pytorch
pipeline_tag: image-feature-extraction
---

# TILA — Temporal Image-Language Alignment for Chest X-rays

TILA is a vision-language model for analyzing **temporal changes** between pairs of chest X-rays. Given a current and a prior radiograph, TILA can:

1. **Extract temporally aware image embeddings** (128-dim) that capture both the static anatomy and the interval change between the two images.
2. **Encode radiology text** into the same 128-dim space for zero-shot classification via image-text similarity.
3. **Predict interval change** (binary: change vs. no change) using a lightweight classification head.

The image encoder is based on the [BioViL-T](https://huggingface.co/microsoft/BiomedVLP-BioViL-T) architecture (ResNet-50 + Vision Transformer temporal pooler), and the text encoder is CXR-BERT; both are fine-tuned with temporal image-language alignment.

## Quick Start

### Installation

```bash
pip install "torch>=2.0" "torchvision>=0.15" "timm>=0.9" "transformers>=4.30" "safetensors>=0.4" pillow opencv-python numpy
```

### Load Model and Processor

```python
import torch
from model import TILAModel
from processor import TILAProcessor

model = TILAModel.from_pretrained("model.safetensors")
model = model.to("cuda", dtype=torch.bfloat16)

# The processor handles everything: raw image → model-ready tensor
processor = TILAProcessor(dtype=torch.bfloat16, device="cuda")
```

### Extract Embeddings

```python
current = processor("current_cxr.png")  # accepts file paths, numpy arrays, or PIL images
previous = processor("previous_cxr.png")

# 128-dim L2-normalized embeddings
embeddings = model.get_embeddings(current, previous)
```

The processor automatically applies medical image preprocessing (windowing, black padding removal, resize) followed by model transforms (center crop to 448x448, expand to 3 channels). If your images are already preprocessed, skip the medical preprocessing:

```python
processor = TILAProcessor(raw_preprocess=False, dtype=torch.bfloat16, device="cuda")
```

The embeddings encode both the current image state and the temporal difference from the prior.
They can be used for retrieval, similarity search, or as features for downstream tasks.
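
For example, a small retrieval sketch (the filenames here are hypothetical placeholders; `model` and `processor` are from the Quick Start above). Because the embeddings are L2-normalized, the dot product equals cosine similarity:

```python
import torch

# Hypothetical bank of (current, previous) study pairs
study_pairs = [("cur_a.png", "prev_a.png"), ("cur_b.png", "prev_b.png")]
currents = torch.cat([processor(c) for c, _ in study_pairs])
previouses = torch.cat([processor(p) for _, p in study_pairs])
bank = model.get_embeddings(currents, previouses)          # [N, 128]

query = model.get_embeddings(processor("query_cur.png"),
                             processor("query_prev.png"))  # [1, 128]

scores = (query @ bank.T).squeeze(0)  # cosine similarities, [N]
ranking = scores.argsort(descending=True)
```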

### Encode Text

```python
text_emb = model.encode_text([
    "Improved pulmonary edema.",
    "Stable pulmonary edema.",
    "Worsening pulmonary edema.",
])

# Zero-shot classification via image-text similarity
similarities = embeddings @ text_emb.T   # [1, 3]
prediction = similarities.argmax(dim=1)  # 0=improved, 1=stable, 2=worsening
```

### Predict Interval Change

```python
result = model.get_interval_change_prediction(current, previous, mode="bestf1")

print(result["probabilities"])  # Raw change probability
print(result["predictions"])    # Binary: 0 = no change, 1 = change
print(result["threshold"])      # Threshold used
```

Three threshold modes are available:

| Mode | Threshold | Description |
|------|-----------|-------------|
| `"bestf1"` | 0.29 | Maximizes F1 score (balanced sensitivity/specificity) |
| `"default"` | 0.50 | Standard sigmoid cutoff |
| `"spec95"` | 0.64 | Targets 95% specificity (conservative, fewer false positives) |

### CLI Example

```bash
python inference.py \
    --checkpoint model.safetensors \
    --current_image /path/to/current.png \
    --previous_image /path/to/previous.png
```

## Model Architecture

```
IMAGE ENCODER:
  Input: current CXR [B, 3, 448, 448] + previous CXR [B, 3, 448, 448]
    |
    +-- ResNet-50 backbone (shared weights, processes both images)
    |     -> patch features [B, 2048, 14, 14]
    |
    +-- 1x1 Conv projection (2048 -> 256)
    |
    +-- Vision Transformer Pooler (3 blocks, 8 heads)
    |     -> temporal difference features [B, 256, 14, 14]
    |
    +-- Concatenate [static, temporal] -> [B, 512, 14, 14]
    |
    +-- MLP Projector (512 -> 128)
          -> image embedding [B, 128]    <-- get_embeddings()

TEXT ENCODER:
  Input: tokenized text
    |
    +-- CXR-BERT (12 layers, 768-dim)
    |     -> CLS token [B, 768]
    |
    +-- LayerNorm + Linear (768 -> 128)
          -> text embedding [B, 128]     <-- encode_text()

CLASSIFIER:
  image embedding [B, 128]
    |
    +-- Linear (128 -> 64) -> ReLU -> Linear (64 -> 1)
          -> change probability [B]      <-- get_interval_change_prediction()
```
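
To sanity-check these shapes, you can run the image encoder from `model.py` on random inputs (an illustrative sketch with randomly initialized weights; no checkpoint required):

```python
import torch
from model import TILAImageEncoder

encoder = TILAImageEncoder().eval()
current = torch.randn(1, 3, 448, 448)
previous = torch.randn(1, 3, 448, 448)

with torch.no_grad():
    out = encoder(current, previous)

print(out.patch_embeddings.shape)            # torch.Size([1, 512, 14, 14])
print(out.projected_patch_embeddings.shape)  # torch.Size([1, 128, 14, 14])
print(out.projected_global_embedding.shape)  # torch.Size([1, 128])
```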

## Preprocessing Raw Images

> **Note:** `TILAProcessor` applies this preprocessing automatically when `raw_preprocess=True` (the default). Use the standalone pipeline below only if you want to preprocess and save images ahead of time.

If your chest X-rays are raw (e.g., DICOM-derived PNGs with black borders or 16-bit depth), preprocess them first:

```python
import cv2
from preprocess import preprocess_image

img = preprocess_image("raw_cxr.png")
cv2.imwrite("preprocessed.png", img)
```

The pipeline applies:

1. **Read as-is** — preserves original bit depth (supports 8-bit and 16-bit PNGs)
2. **Windowing** — clips to `mean +/- 2*std`, normalizes to [0, 1]
3. **Black padding removal** — contour-based crop
4. **Aspect-ratio-preserving resize** — longest side to 512px (configurable)

```bash
# CLI usage
python preprocess.py --input raw.png --output preprocessed.png
```
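
For intuition about the windowing step, here is its effect on a synthetic 16-bit array (the values are illustrative):

```python
import numpy as np
from preprocess import apply_windowing

# Mostly mid-gray with two extreme outliers
img = np.full((4, 4), 30000, dtype=np.uint16)
img[0, 0], img[3, 3] = 0, 65535

windowed = apply_windowing(img, width_param=4.0)  # clip to mean +/- 2*std, scale to [0, 1]
print(windowed.min(), windowed.max())             # outliers clipped to ~0.0 and ~1.0
```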

If your images are already preprocessed (contrast-normalized, cropped, resized grayscale PNGs), you can skip this step and feed them directly to the model.

## Input Format

- **Image format**: Grayscale chest X-ray (PNG, JPEG)
- **Model input**: Resize to 512px (shorter side), center crop to 448x448, repeat to 3 channels (handled by `TILAProcessor` in `processor.py`; see the sketch after this list)
- **Pair**: Current (follow-up) image + previous (baseline) image of the same patient
- **Dtype**: `torch.bfloat16` recommended on GPU, `torch.float32` on CPU
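
The model transform referenced above amounts to this torchvision pipeline (the same one `TILAProcessor` builds internally; shown here for reference):

```python
import torch
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(512),                          # shorter side -> 512
    transforms.CenterCrop(448),
    transforms.ToTensor(),                           # [1, 448, 448], float in [0, 1]
    transforms.Lambda(lambda x: x.repeat(3, 1, 1)),  # grayscale -> 3 channels
])

tensor = transform(Image.open("preprocessed.png").convert("L")).unsqueeze(0)
print(tensor.shape)  # torch.Size([1, 3, 448, 448])
```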

## Files

| File | Description |
|------|-------------|
| `model.safetensors` | Model weights (613 MB, image + text + classifier) |
| `model.py` | Self-contained model architecture |
| `processor.py` | Image processor (raw image → model-ready tensor) |
| `preprocess.py` | Medical image preprocessing utilities |
| `inference.py` | Example inference script |

## Citation

If you use this model, please cite:

```bibtex
@article{tila2026,
  title={TILA: Temporal Image-Language Alignment for Chest X-rays},
  year={2026}
}
```

## Acknowledgements

This model builds upon [BioViL-T](https://huggingface.co/microsoft/BiomedVLP-BioViL-T) by Microsoft Research:

```bibtex
@inproceedings{bannur2023biovilt,
  title={Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing},
  author={Bannur, Shruthi and Hyland, Stephanie and Liu, Qianchu and Perez-Garcia, Fernando and Oktay, Ozan and Naumann, Tristan and Nori, Aditya and Alvarez-Valle, Javier},
  booktitle={CVPR},
  year={2023}
}
```

## License

This model is released under the [MIT License](LICENSE).
The BioViL-T architecture and CXR-BERT text encoder are by Microsoft Research, also released under MIT.
inference.py
ADDED
@@ -0,0 +1,60 @@
"""
TILA — Inference Example

Usage:
    # From raw images (full preprocessing applied automatically):
    python inference.py --current_image raw_current.png --previous_image raw_previous.png

    # From already-preprocessed images:
    python inference.py --current_image prep_current.png --previous_image prep_previous.png --no-preprocess
"""

import argparse

import torch

from model import TILAModel
from processor import TILAProcessor


def main():
    parser = argparse.ArgumentParser(description="TILA Inference")
    parser.add_argument("--checkpoint", type=str, default="model.safetensors")
    parser.add_argument("--current_image", type=str, required=True)
    parser.add_argument("--previous_image", type=str, required=True)
    parser.add_argument("--no-preprocess", action="store_true",
                        help="Skip medical preprocessing (use if images are already preprocessed)")
    parser.add_argument("--device", type=str, default="cuda" if torch.cuda.is_available() else "cpu")
    args = parser.parse_args()

    device = args.device
    dtype = torch.bfloat16 if "cuda" in device else torch.float32

    # Load model
    model = TILAModel.from_pretrained(args.checkpoint, device=device)
    model = model.to(dtype=dtype)

    # Load and process images (preprocessing is built into the processor)
    processor = TILAProcessor(
        raw_preprocess=not args.no_preprocess,
        dtype=dtype,
        device=device,
    )
    current = processor(args.current_image)
    previous = processor(args.previous_image)

    # 1. Get embeddings (128-dim, L2-normalized)
    embeddings = model.get_embeddings(current, previous)
    print(f"Embedding shape: {embeddings.shape}")
    print(f"Embedding (first 8 dims): {embeddings[0, :8].float().tolist()}")

    # 2. Get interval change prediction (3 threshold modes available)
    for mode in ["default", "bestf1", "spec95"]:
        result = model.get_interval_change_prediction(current, previous, mode=mode)
        prob = result["probabilities"].item()
        pred = result["predictions"].item()
        thresh = result["threshold"]
        label = "CHANGE" if pred == 1 else "NO CHANGE"
        print(f"[{mode}] threshold={thresh:.4f}, prob={prob:.4f} -> {label}")


if __name__ == "__main__":
    main()
model.py
ADDED
@@ -0,0 +1,547 @@
"""
TILA (Temporal Image-Language Alignment) — Model Architecture

This module contains the full model architecture for the TILA image encoder,
built on top of the BioViL-T (ResNet-50 + Vision Transformer pooler) backbone.

Dependencies:
    pip install torch torchvision timm
"""

from __future__ import annotations

import math
from dataclasses import dataclass
from functools import partial
from typing import Optional, Sequence, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F
from timm.layers import DropPath, Mlp, trunc_normal_
from torchvision.models.resnet import Bottleneck, conv1x1


# ──────────────────────────────────────────────────────────────────────────────
# Output types
# ──────────────────────────────────────────────────────────────────────────────


@dataclass
class ImageModelOutput:
    img_embedding: torch.Tensor
    patch_embeddings: torch.Tensor
    projected_global_embedding: torch.Tensor
    class_logits: Optional[torch.Tensor]
    projected_patch_embeddings: torch.Tensor


# ──────────────────────────────────────────────────────────────────────────────
# ResNet-50 backbone
# ──────────────────────────────────────────────────────────────────────────────


class ResNet(nn.Module):
    """Standard ResNet-50 (torchvision-compatible) without the final FC layer in forward."""

    def __init__(
        self,
        layers: Sequence[int] = (3, 4, 6, 3),
        num_classes: int = 1000,
        zero_init_residual: bool = False,
        replace_stride_with_dilation: Optional[Sequence[bool]] = None,
    ):
        super().__init__()
        block = Bottleneck
        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            replace_stride_with_dilation = [False, False, False]

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2, dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2, dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2, dilate=replace_stride_with_dilation[2])

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # Weight init
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck) and m.bn3.weight is not None:
                    nn.init.constant_(m.bn3.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                nn.BatchNorm2d(planes * block.expansion),
            )
        layers = [block(self.inplanes, planes, stride, downsample, 1, 64, previous_dilation, nn.BatchNorm2d)]
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, dilation=self.dilation, norm_layer=nn.BatchNorm2d))
        return nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        return x  # patch features [B, 2048, H, W]


# ──────────────────────────────────────────────────────────────────────────────
# Vision Transformer Pooler (temporal attention)
# ──────────────────────────────────────────────────────────────────────────────


class SinePositionEmbedding:
    def __init__(self, embedding_dim: int = 64, temperature: int = 10000,
                 normalize: bool = False, scale: Optional[float] = None):
        self.embedding_dim = embedding_dim
        self.temperature = temperature
        self.normalize = normalize
        self.scale = scale if scale is not None else 2 * math.pi

    def __call__(self, mask: torch.Tensor) -> torch.Tensor:
        B, H, W = mask.shape
        y_embed = mask.cumsum(1, dtype=torch.float32)
        x_embed = mask.cumsum(2, dtype=torch.float32)
        if self.normalize:
            y_embed = y_embed / (y_embed[:, -1:, :] + 1e-6) * self.scale
            x_embed = x_embed / (x_embed[:, :, -1:] + 1e-6) * self.scale
        dim_t = torch.arange(self.embedding_dim, dtype=torch.float32)
        dim_t = self.temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / self.embedding_dim)
        pos_x = x_embed[:, :, :, None] / dim_t
        pos_y = y_embed[:, :, :, None] / dim_t
        pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
        return torch.cat((pos_y, pos_x), dim=3).view(B, H * W, self.embedding_dim * 2)


class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, qkv_bias: bool = False,
                 attn_drop: float = 0.0, proj_drop: float = 0.0):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.proj_q = nn.Linear(dim, dim, bias=qkv_bias)
        self.proj_k = nn.Linear(dim, dim, bias=qkv_bias)
        self.proj_v = nn.Linear(dim, dim, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, k, q, v):
        B, N, C = v.shape
        h = self.num_heads
        wq = self.proj_q(q).reshape(B, N, h, C // h).permute(0, 2, 1, 3)
        wk = self.proj_k(k).reshape(B, N, h, C // h).permute(0, 2, 1, 3)
        wv = self.proj_v(v).reshape(B, N, h, C // h).permute(0, 2, 1, 3)
        attn = (wq @ wk.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        o = (attn @ wv).transpose(1, 2).reshape(B, N, C)
        return self.proj_drop(self.proj(o))


class Block(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=1.0, qkv_bias=False,
                 drop=0.0, attn_drop=0.0, drop_path=0.0,
                 act_layer=nn.GELU, norm_layer=nn.LayerNorm):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = MultiHeadAttentionLayer(dim, num_heads, qkv_bias, attn_drop, drop)
        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
        self.norm2 = norm_layer(dim)
        self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio), act_layer=act_layer, drop=drop)

    def forward(self, x, pos_and_type_embed=None):
        x_norm = self.norm1(x)
        if pos_and_type_embed is not None:
            x_norm = x_norm + pos_and_type_embed
        x = x + self.drop_path(self.attn(x_norm, x_norm, x_norm))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x


class VisionTransformerPooler(nn.Module):
    def __init__(self, input_dim: int, grid_shape: Tuple[int, int],
                 num_heads: int = 8, num_blocks: int = 3,
                 norm_layer=partial(nn.LayerNorm, eps=1e-6)):
        super().__init__()
        block_kwargs = dict(dim=input_dim, num_heads=num_heads, mlp_ratio=1.0,
                            drop=0.10, attn_drop=0.10, drop_path=0.25,
                            act_layer=nn.GELU, norm_layer=norm_layer)
        self.blocks = nn.ModuleList([Block(**block_kwargs) for _ in range(num_blocks)])
        self.norm_post = norm_layer(input_dim)
        self.grid_shape = grid_shape
        self.num_patches = grid_shape[0] * grid_shape[1]

        self.type_embed = nn.Parameter(torch.zeros(2, 1, input_dim))
        trunc_normal_(self.type_embed, std=0.02)

        self.pos_drop = nn.Dropout(p=0.10)
        pos_embed = SinePositionEmbedding(input_dim // 2, normalize=True)(
            torch.ones([1, grid_shape[0], grid_shape[1]]))
        self.register_buffer("pos_embed", pos_embed, persistent=False)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=0.02)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    def forward(self, current_image, previous_image=None):
        B, C, H, W = current_image.shape
        if previous_image is not None:
            prev = previous_image.view(B, C, H * W).transpose(1, 2)
        else:
            prev = None
        cur = current_image.view(B, C, H * W).transpose(1, 2)
        pos = self.pos_embed.repeat(B, 1, 1)

        L = cur.shape[1]
        type_emb = self.type_embed[0].expand(B, L, -1)
        if prev is not None:
            x = torch.cat((cur, prev), dim=1)
            pos = torch.cat((pos, pos), dim=1)
            type_emb = torch.cat((type_emb, self.type_embed[1].expand(B, L, -1)), dim=1)
        else:
            x = cur

        pos_type = pos + type_emb
        x = self.pos_drop(x)
        for blk in self.blocks:
            x = blk(x, pos_type)
        x = self.norm_post(x)

        # Keep only the current-image tokens and restore the spatial grid
        return x[:, :self.num_patches].transpose(1, 2).view(B, C, H, W)


# ──────────────────────────────────────────────────────────────────────────────
# Multi-image encoder (temporal)
# ──────────────────────────────────────────────────────────────────────────────


class MLP(nn.Module):
    """Projection MLP (optionally 1x1-conv based)."""

    def __init__(self, input_dim, output_dim, hidden_dim=None, use_1x1_convs=False):
        super().__init__()
        if use_1x1_convs and hidden_dim is not None:
            self.model = nn.Sequential(
                nn.Conv2d(input_dim, hidden_dim, 1, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden_dim, output_dim, 1, bias=True),
            )
        elif hidden_dim is not None:
            self.model = nn.Sequential(
                nn.Linear(input_dim, hidden_dim, bias=False),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(inplace=True),
                nn.Linear(hidden_dim, output_dim, bias=True),
            )
        else:
            self.model = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.model(x)


class MultiImageEncoder(nn.Module):
    """BioViL-T style multi-image encoder: ResNet-50 backbone + ViT temporal pooler."""

    def __init__(self):
        super().__init__()
        self.encoder = ResNet()
        backbone_out_dim = 2048  # ResNet-50 output channels
        output_dim = 256

        self.backbone_to_vit = nn.Conv2d(backbone_out_dim, output_dim, 1, bias=False)
        self.vit_pooler = VisionTransformerPooler(input_dim=output_dim, grid_shape=(14, 14))
        self.missing_previous_emb = nn.Parameter(torch.zeros(1, output_dim, 1, 1))
        trunc_normal_(self.missing_previous_emb, std=0.02)

    def forward(self, current_image, previous_image=None, return_patch_embeddings=False):
        B = current_image.shape[0]
        if previous_image is not None:
            # Run both images through the shared backbone in a single batch
            x = torch.cat([current_image, previous_image], dim=0)
            x = self.encoder(x)
            x = self.backbone_to_vit(x)
            patch_x, patch_prev = x[:B], x[B:]
            diff_x = self.vit_pooler(current_image=patch_x, previous_image=patch_prev)
        else:
            x = self.encoder(current_image)
            patch_x = self.backbone_to_vit(x)
            _, _, H, W = patch_x.shape
            diff_x = self.missing_previous_emb.repeat(B, 1, H, W)

        patch_fused = torch.cat([patch_x, diff_x], dim=1)  # [B, 512, H, W]
        avg_pooled = torch.flatten(F.adaptive_avg_pool2d(patch_fused, (1, 1)), 1)

        if return_patch_embeddings:
            return patch_fused, avg_pooled
        return avg_pooled


class TILAImageEncoder(nn.Module):
    """Full TILA image encoder: MultiImageEncoder + projection head.

    Outputs 128-dim embeddings (L2-normalized in `TILAModel.get_embeddings`)
    suitable for CLIP-style retrieval.
    """

    JOINT_FEATURE_SIZE = 128

    def __init__(self):
        super().__init__()
        self.encoder = MultiImageEncoder()
        self.projector = MLP(
            input_dim=512,  # patch_x (256) + diff_x (256)
            output_dim=self.JOINT_FEATURE_SIZE,
            hidden_dim=self.JOINT_FEATURE_SIZE,
            use_1x1_convs=True,
        )

    def forward(self, current_image, previous_image=None):
        patch_fused, pooled = self.encoder(current_image, previous_image, return_patch_embeddings=True)
        projected_patch = self.projector(patch_fused)
        projected_global = torch.mean(projected_patch, dim=(2, 3))
        return ImageModelOutput(
            img_embedding=pooled,
            patch_embeddings=patch_fused,
            class_logits=None,
            projected_patch_embeddings=projected_patch,
            projected_global_embedding=projected_global,
        )


# ──────────────────────────────────────────────────────────────────────────────
# Text encoder (BioViL-T CXR-BERT + projection)
# ──────────────────────────────────────────────────────────────────────────────


TEXT_MODEL_NAME = "microsoft/BiomedVLP-BioViL-T"


class TextEncoder(nn.Module):
    """CXR-BERT text encoder with a projection head to 128-dim.

    Loads the pretrained BioViL-T text model and adds a LayerNorm + Linear
    projection from 768-dim CLS embeddings to the 128-dim joint space.
    """

    def __init__(self):
        super().__init__()
        from transformers import AutoConfig, AutoModel

        config = AutoConfig.from_pretrained(TEXT_MODEL_NAME, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            TEXT_MODEL_NAME, config=config, trust_remote_code=True,
        )
        self.projection = nn.Sequential(
            nn.LayerNorm(config.hidden_size),
            nn.Linear(config.hidden_size, 128),
        )

    def forward(self, text_inputs: dict) -> torch.Tensor:
        """Encode tokenized text to 128-dim embeddings.

        Args:
            text_inputs: Dict from the tokenizer (input_ids, attention_mask, etc.)

        Returns:
            Projected CLS embeddings [B, 128]
        """
        outputs = self.model(**text_inputs)
        cls_emb = outputs.last_hidden_state[:, 0, :]
        proj_dtype = next(self.projection.parameters()).dtype
        if cls_emb.dtype != proj_dtype:
            cls_emb = cls_emb.to(proj_dtype)
        return self.projection(cls_emb)


# ──────────────────────────────────────────────────────────────────────────────
# Interval change classifier head
# ──────────────────────────────────────────────────────────────────────────────


class IntervalChangeClassifier(nn.Module):
    """Binary classifier head for interval change detection.

    Takes 128-dim projected embeddings and outputs a change logit.
    """

    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        """Returns the logit (pre-sigmoid). Apply torch.sigmoid() to get a probability."""
        return self.head(embedding).squeeze(-1)


# ──────────────────────────────────────────────────────────────────────────────
# Full model wrapper
# ──────────────────────────────────────────────────────────────────────────────


class TILAModel(nn.Module):
    """TILA model with image encoder, text encoder, and interval change classifier.

    Usage:
        model = TILAModel.from_pretrained("model.safetensors")

        # Get 128-dim image embeddings
        emb = model.get_embeddings(current_img, previous_img)

        # Get 128-dim text embeddings
        text_emb = model.encode_text(["Improved pulmonary edema."])

        # Predict interval change
        result = model.get_interval_change_prediction(current_img, previous_img)
    """

    # Thresholds calibrated on the validation set (AUC=0.7558)
    THRESHOLDS = {
        "default": 0.5000,  # Standard sigmoid midpoint
        "bestf1": 0.2886,   # Maximizes F1 — F1=0.7210, sens=0.7798, spec=0.6166
        "spec95": 0.6370,   # Specificity ~0.95 — sens=0.1752, spec=0.9502
    }

    def __init__(self):
        super().__init__()
        self.image_encoder = TILAImageEncoder()
        self.text_encoder = TextEncoder()
        self.change_classifier = IntervalChangeClassifier()

    @torch.no_grad()
    def encode_text(self, texts: list) -> torch.Tensor:
        """Encode text prompts to 128-dim normalized embeddings.

        Args:
            texts: List of text strings

        Returns:
            Normalized text embeddings [N, 128]
        """
        from transformers import AutoTokenizer
        if not hasattr(self, "_tokenizer"):
            self._tokenizer = AutoTokenizer.from_pretrained(TEXT_MODEL_NAME, padding_side="right")
        device = next(self.parameters()).device
        tokens = self._tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
        tokens = {k: v.to(device) for k, v in tokens.items()}
        self.eval()
        # Run the text encoder in float32 for numerical stability
        with torch.autocast(device_type=device.type, enabled=False):
            self.text_encoder.float()
            emb = self.text_encoder(tokens)
        self.text_encoder.to(next(self.image_encoder.parameters()).dtype)
        return F.normalize(emb.float(), p=2, dim=1)

    @torch.no_grad()
    def get_embeddings(
        self, current_image: torch.Tensor, previous_image: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """Extract 128-dim projected global embeddings from a pair of chest X-rays.

        Args:
            current_image: Current CXR tensor [B, 3, 448, 448]
            previous_image: Previous CXR tensor [B, 3, 448, 448] (optional)

        Returns:
            Normalized 128-dim embeddings [B, 128]
        """
        self.eval()
        out = self.image_encoder(current_image, previous_image)
        return F.normalize(out.projected_global_embedding.float(), p=2, dim=1)

    @torch.no_grad()
    def get_interval_change_prediction(
        self,
        current_image: torch.Tensor,
        previous_image: torch.Tensor,
        mode: str = "bestf1",
    ) -> dict:
        """Predict interval change between two chest X-rays.

        Args:
            current_image: Current CXR tensor [B, 3, 448, 448]
            previous_image: Previous CXR tensor [B, 3, 448, 448]
            mode: Threshold mode — one of:
                "default" : threshold=0.50 (standard sigmoid cutoff)
                "bestf1"  : threshold=0.29 (maximizes F1, balanced sens/spec)
                "spec95"  : threshold=0.64 (targets 95% specificity, conservative)

        Returns:
            Dict with keys:
                "probabilities": raw change probabilities [B]
                "predictions": binary predictions [B] (0=no change, 1=change)
                "threshold": threshold used (float)
        """
        if mode not in self.THRESHOLDS:
            raise ValueError(f"mode must be one of {list(self.THRESHOLDS.keys())}, got '{mode}'")

        self.eval()
        out = self.image_encoder(current_image, previous_image)
        logits = self.change_classifier(out.projected_global_embedding)
        probs = torch.sigmoid(logits.float())

        threshold = self.THRESHOLDS[mode]
        preds = (probs >= threshold).long()

        return {"probabilities": probs, "predictions": preds, "threshold": threshold}

    @classmethod
    def from_pretrained(cls, checkpoint_path: str, device: str = "cpu") -> "TILAModel":
        """Load the model from a checkpoint file.

        Args:
            checkpoint_path: Path to model.safetensors (or pytorch_model.bin)
            device: Device to load onto
        """
        model = cls()

        if checkpoint_path.endswith(".safetensors"):
            from safetensors.torch import load_file
            state_dict = load_file(checkpoint_path, device=device)
            # safetensors stores scalar tensors as 1-d; squeeze them back
            for k, v in state_dict.items():
                if v.dim() == 1 and v.shape[0] == 1 and "num_batches_tracked" in k:
                    state_dict[k] = v.squeeze(0)
        else:
            state_dict = torch.load(checkpoint_path, map_location=device, weights_only=True)

        model.load_state_dict(state_dict)
        model.to(device)  # honor the requested device (load_state_dict alone keeps params on CPU)
        model.eval()
        return model
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b16b6bcf47ac6e4e79c4d9da2db88055b297adca22715935e4522184f87ce101
size 642508642
preprocess.py
ADDED
@@ -0,0 +1,111 @@
"""
TILA — Image Preprocessing

Converts raw chest X-ray images (DICOM-derived PNGs or raw PNGs) into the
normalized format expected by the TILA model.

Pipeline:
    1. Read image as-is (preserving bit depth)
    2. Windowing: clip to mean +/- 2*std, normalize to [0, 1]
    3. Convert to uint8
    4. Remove black padding (contour-based crop)
    5. Resize preserving aspect ratio (longest side = 512)

Usage:
    from preprocess import preprocess_image

    img = preprocess_image("raw_cxr.png")
    cv2.imwrite("preprocessed.png", img)
"""

from pathlib import Path
from typing import Union

import cv2
import numpy as np


def apply_windowing(image: np.ndarray, width_param: float = 4.0) -> np.ndarray:
    """Apply intensity windowing based on image statistics.

    Clips intensities to [mean - width_param/2 * std, mean + width_param/2 * std]
    and normalizes to [0, 1].
    """
    image = image.astype(np.float64)
    mean = np.mean(image)
    std = np.std(image)
    window_center = mean
    window_width = width_param * std
    img_min = window_center - window_width / 2
    img_max = window_center + window_width / 2
    image = np.clip(image, img_min, img_max)
    image = (image - img_min) / (img_max - img_min + 1e-8)
    return image


def remove_black_padding(image: np.ndarray) -> np.ndarray:
    """Remove black padded borders by cropping to the largest bright contour."""
    _, thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return image
    largest = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(largest)
    return image[y:y + h, x:x + w]


def resize_preserve_aspect_ratio(image: np.ndarray, max_size: int = 512) -> np.ndarray:
    """Resize so the longest side equals max_size, preserving aspect ratio."""
    if len(image.shape) == 3:
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    h, w = image.shape
    # Scale the longest side to max_size (the original condition was inverted
    # relative to this docstring and the README, which scaled the shorter side)
    if w >= h:
        new_w = max_size
        new_h = int(new_w * (h / w))
    else:
        new_h = max_size
        new_w = int(new_h * (w / h))
    return cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_AREA)


def preprocess_image(
    image_path: Union[str, Path],
    width_param: float = 4.0,
    max_size: int = 512,
) -> np.ndarray:
    """Full preprocessing pipeline for a chest X-ray image.

    Args:
        image_path: Path to raw image (PNG, JPEG, etc.)
        width_param: Windowing width in multiples of std (default: 4.0)
        max_size: Target size for the longest dimension (default: 512)

    Returns:
        Preprocessed uint8 grayscale image
    """
    # IMREAD_UNCHANGED preserves bit depth (important for 16-bit DICOM-derived PNGs)
    image = cv2.imread(str(image_path), cv2.IMREAD_UNCHANGED)
    if image is None:
        raise ValueError(f"Could not read image: {image_path}")
    # Convert color to grayscale if needed
    if len(image.shape) == 3:
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    image = apply_windowing(image, width_param)
    image = (image * 255.0).astype(np.uint8)
    image = remove_black_padding(image)
    image = resize_preserve_aspect_ratio(image, max_size)
    return image


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Preprocess chest X-ray images for TILA")
    parser.add_argument("--input", required=True, help="Input image path")
    parser.add_argument("--output", required=True, help="Output image path")
    parser.add_argument("--width-param", type=float, default=4.0)
    parser.add_argument("--max-size", type=int, default=512)
    args = parser.parse_args()

    img = preprocess_image(args.input, args.width_param, args.max_size)
    cv2.imwrite(args.output, img)
    print(f"Saved preprocessed image to {args.output} ({img.shape[1]}x{img.shape[0]})")
processor.py
ADDED
@@ -0,0 +1,117 @@
"""
TILA — Image Processor

Single processor that handles the full pipeline:
    raw image (path, numpy, or PIL) → model-ready tensor [1, 3, 448, 448]

Combines:
    1. Medical image preprocessing (windowing, padding removal, resize)
    2. Model transforms (resize, center crop, to tensor, expand channels)

Usage:
    from processor import TILAProcessor

    processor = TILAProcessor()

    # From file path (applies full preprocessing)
    tensor = processor("raw_cxr.png")

    # From PIL image (skips medical preprocessing, applies model transforms only)
    tensor = processor(Image.open("preprocessed.png"))

    # Pair of images for the model
    current = processor("current.png")
    previous = processor("previous.png")
    result = model.get_interval_change_prediction(current, previous)
"""

from pathlib import Path
from typing import Union

import cv2
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

from preprocess import (
    apply_windowing,
    preprocess_image,
    remove_black_padding,
    resize_preserve_aspect_ratio,
)


class TILAProcessor:
    """End-to-end image processor for the TILA model.

    Accepts file paths (str/Path), numpy arrays, or PIL Images.
    - File paths: full pipeline (windowing → crop → resize → model transform)
    - Numpy arrays: treated as raw, full pipeline applied
    - PIL Images: assumed already preprocessed, only model transforms applied

    Args:
        raw_preprocess: Apply medical preprocessing (windowing, padding removal).
            Set False if images are already preprocessed PNGs.
        width_param: Windowing width parameter (default: 4.0)
        max_size: Resize longest side to this before model transforms (default: 512)
        crop_size: Center crop size for model input (default: 448)
        dtype: Output tensor dtype (default: torch.bfloat16)
        device: Output tensor device (default: "cpu")
    """

    def __init__(
        self,
        raw_preprocess: bool = True,
        width_param: float = 4.0,
        max_size: int = 512,
        crop_size: int = 448,
        dtype: torch.dtype = torch.bfloat16,
        device: str = "cpu",
    ):
        self.raw_preprocess = raw_preprocess
        self.width_param = width_param
        self.max_size = max_size
        self.dtype = dtype
        self.device = device

        self.model_transform = transforms.Compose([
            transforms.Resize(max_size),
            transforms.CenterCrop(crop_size),
            transforms.ToTensor(),
            _ExpandChannels(),
        ])

    def __call__(self, image: Union[str, Path, np.ndarray, Image.Image]) -> torch.Tensor:
        """Process a single image into a model-ready tensor.

        Args:
            image: File path (str/Path), numpy array, or PIL Image

        Returns:
            Tensor of shape [1, 3, 448, 448]
        """
        if isinstance(image, (str, Path)):
            if self.raw_preprocess:
                img_np = preprocess_image(image, self.width_param, self.max_size)
                pil_img = Image.fromarray(img_np)
            else:
                pil_img = Image.open(image).convert("L")
        elif isinstance(image, np.ndarray):
            if self.raw_preprocess:
                if len(image.shape) == 3:
                    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
                image = apply_windowing(image, self.width_param)
                image = (image * 255.0).astype(np.uint8)
                image = remove_black_padding(image)
                image = resize_preserve_aspect_ratio(image, self.max_size)
            pil_img = Image.fromarray(image)
        elif isinstance(image, Image.Image):
            pil_img = image.convert("L")
        else:
            raise TypeError(f"Expected str, Path, np.ndarray, or PIL.Image, got {type(image)}")

        tensor = self.model_transform(pil_img).unsqueeze(0)
        return tensor.to(dtype=self.dtype, device=self.device)


class _ExpandChannels:
    """Expand a single-channel tensor to 3 channels."""

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        if x.shape[0] == 1:
            return x.repeat(3, 1, 1)
        return x