Safetensors · tapct · custom_code
Files changed (4)
  1. README.md +5 -91
  2. modeling_tapct.py +5 -10
  3. preprocessor_config.json +0 -13
  4. tapct_processor.py +0 -179
README.md CHANGED
@@ -3,41 +3,23 @@ license: cc-by-nc-4.0
 ---
 
 # TAP-CT: 3D Task-Agnostic Pretraining of CT Foundation Models
-[![arXiv](https://img.shields.io/badge/arXiv-TAP--CT-b31b1b.svg)](https://arxiv.org/abs/2512.00872)
 
 TAP-CT is a suite of foundation models for computed tomography (CT) imaging, pretrained in a task-agnostic manner through an adaptation of DINOv2 for volumetric data. These models learn robust 3D representations from CT scans without requiring task-specific annotations.
 
-This repository provides TAP-CT-B-3D, a Vision Transformer (ViT-Base) architecture pretrained on volumetric inputs with a spatial resolution of (12, 224, 224) and a patch size of (4, 8, 8). For inference on full-resolution CT volumes, a sliding window approach can be employed to extract features across the entire scan.
+This repository provides TAP-CT-B-3D, a Vision Transformer (ViT-Base) architecture pretrained on volumetric inputs with a spatial resolution of (12, 224, 224) and a patch size of (4, 8, 8). For inference on full-resolution CT volumes, a sliding window approach can be employed to extract features across the entire scan. Additional TAP-CT model variants, as well as the image processor, will be released in future updates.
 
 ## Preprocessing
 
-### Using dedicated image processor
-
-Each TAP-CT model repository provides its own dedicated image processor and configuration file. To ensure proper preprocessing, it is recommended to instantiate the corresponding image processor using the `AutoImageProcessor` class from Hugging Face Transformers. This can be accomplished as follows:
-
-```python
-from transformers import AutoImageProcessor
-
-preprocessor = AutoImageProcessor.from_pretrained(
-    'fomofo/tap-ct-b-3d',
-    trust_remote_code=True
-)
-```
-
-This approach automatically loads the appropriate processor and configuration for the selected TAP-CT model.
-
-### Preprocessing without pipeline
+While a dedicated image processor will be released in future updates, optimal feature extraction requires the following preprocessing pipeline:
 
 1. **Orientation**: Convert the volume to LPS (Left-Posterior-Superior) orientation. While the model is likely orientation-invariant, all evaluations were conducted using LPS orientation.
 2. **Spatial Resizing**: Resize the volume to a spatial resolution of \(z, 224, 224\) or \(z, 512, 512\), where \(z\) represents the number of slices along the axial dimension.
-3. **Axial Padding**: Apply -1024 padding along the \(z\)-axis to ensure divisibility by 4, accommodating the model's patch size of (4, 8, 8).
+3. **Axial Padding**: Apply zero-padding along the \(z\)-axis to ensure divisibility by 4, accommodating the model's patch size of (4, 8, 8).
 4. **Intensity Clipping**: Clip voxel intensities to the range \([-1008, 822]\) HU (Hounsfield Units).
 5. **Normalization**: Apply z-score normalization using \(mean = -86.8086\) and \(std = 322.6347\).
 
 ## Usage
 
-### Default Usage
-
 ```python
 import torch
 from transformers import AutoModel
@@ -49,62 +31,7 @@ model = AutoModel.from_pretrained('fomofo/tap-ct-b-3d', trust_remote_code=True)
 x = torch.randn((16, 1, 12, 224, 224))
 
 # Forward pass
-with torch.no_grad():
-    output = model.forward(x)
-```
-
-### Usage with Preprocessor, loading CT volumes & sliding window inference
-
-**Recommended environment:**
-- Python >= 3.11
-- torch >= 2.8
-- numpy >= 2.35
-- SimpleITK >= 2.52
-- monai >= 1.4.0
-- xformers >= 0.0.32 (optional, recommended for CUDA)
-
-```python
-import numpy as np
-import SimpleITK as sitk
-import torch
-from transformers import AutoModel, AutoImageProcessor
-
-# Load the model
-model = AutoModel.from_pretrained('fomofo/tap-ct-b-3d', trust_remote_code=True)
-preprocessor = AutoImageProcessor.from_pretrained('fomofo/tap-ct-b-3d', trust_remote_code=True)
-
-# Load image & set orientation to LPS
-volume = sitk.ReadImage('/path/to/ct-scan.nii.gz')
-volume = sitk.DICOMOrient(volume, 'LPS')
-
-# Get array, expand to (B, C, D, H, W) and preprocess
-array = sitk.GetArrayFromImage(volume)
-array = np.expand_dims(array, axis=(0, 1))
-x = preprocessor(array)['pixel_values']
-
-# Forward pass
-with torch.no_grad():
-    output = model.forward(x)
-
-# OR
-
-# Forward pass with sliding window
-from monai.inferers import SlidingWindowInferer
-
-def predictor_fn(x):
-    # Reshape the patch tokens to resemble a 3D feature map
-    out = model(x, reshape=True)
-    return out.last_hidden_state
-
-inferer = SlidingWindowInferer(
-    roi_size=[12, 224, 224],
-    sw_batch_size=1,
-    overlap=0.75,
-    mode='gaussian'
-)
-
-with torch.no_grad():
-    output = inferer(x, predictor_fn)
+output = model.forward(x)
 ```
 
 The model returns a `BaseModelOutputWithPooling` object from the transformers library. The `output.pooler_output` contains the pooled `[CLS]` token representation, while `output.last_hidden_state` contains the spatial patch token embeddings. To extract features from all intermediate transformer layers, pass `output_hidden_states=True` to the forward method.
@@ -114,17 +41,4 @@ The model returns a `BaseModelOutputWithPooling` object from the transformers li
 - **Model Type**: 3D CT Vision Foundation Model
 - **Input Shape**: `(batch_size, 1, depth, height, width)`
 - **Example Input**: `(16, 1, 12, 224, 224)` - batch of 16 CT crops with 12 slices at 224×224 resolution
-- **License**: CC-BY-NC-4.0
-
-## Citation
-
-If you find this work useful, please cite:
-
-```bibtex
-@article{veenboer2025tapct,
-  title={TAP-CT: 3D Task-Agnostic Pretraining of Computed Tomography Foundation Models},
-  author={Veenboer, Tim and Yiasemis, George and Marcus, Eric and Van Veldhuizen, Vivien and Snoek, Cees G. M. and Teuwen, Jonas and Groot Lipman, Kevin B. W.},
-  journal={arXiv preprint arXiv:2512.00872},
-  year={2025}
-}
-```
+- **License**: CC-BY-NC-4.0
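With the dedicated image processor removed in this PR, the README's five preprocessing steps have to be applied by hand. A minimal sketch in plain PyTorch, assuming an LPS-oriented `(B, C, D, H, W)` float tensor as input; the function name `preprocess_ct` and the 224×224 target are illustrative, while the constants are taken from the README:

```python
import torch
import torch.nn.functional as F

def preprocess_ct(volume: torch.Tensor) -> torch.Tensor:
    # Assumes step 1 (LPS orientation) was already applied to the image.
    depth = volume.shape[2]
    # Step 2: resize H, W to 224x224 while keeping the slice count.
    volume = F.interpolate(volume, size=(depth, 224, 224),
                           mode='trilinear', align_corners=False)
    # Step 3: zero-pad the z-axis until the depth is divisible by the
    # patch depth of 4; F.pad's 6-tuple pads the last three dims.
    pad_z = (-depth) % 4
    volume = F.pad(volume, (0, 0, 0, 0, 0, pad_z))
    # Step 4: clip intensities to the stated HU window.
    volume = volume.clamp(-1008.0, 822.0)
    # Step 5: z-score normalization with the README statistics.
    return (volume - (-86.8086)) / 322.6347

x = preprocess_ct(torch.randn(1, 1, 10, 160, 160))
print(tuple(x.shape))  # (1, 1, 12, 224, 224)
```

Note that the updated README specifies zero-padding, whereas the deleted processor padded with -1024 HU before clipping; the sketch follows the README.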
modeling_tapct.py CHANGED
@@ -94,7 +94,6 @@ class TAPCTModel(TAPCTPreTrainedModel):
         pixel_values: torch.Tensor,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        reshape: bool = False
     ) -> BaseModelOutputWithPooling:
         """
         Forward pass of the TAP-CT model.
@@ -102,15 +101,11 @@ class TAPCTModel(TAPCTPreTrainedModel):
         Parameters
         ----------
         pixel_values : torch.Tensor
-            Input images. Shape (B, C, H, W) for 2D or (B, C, D, H, W) for 3D.
+            Input images. Shape (B, C, H, W) for 2D or (B, C, D, H, W) for 3D
         output_hidden_states : Optional[bool], optional
-            Whether to return hidden states from all layers.
+            Whether to return hidden states from all layers
         return_dict : Optional[bool], optional
-            Whether to return a ModelOutput instead of a plain tuple.
-        reshape : bool, default=False
-            Whether to reshape output features to spatial dimensions. If True,
-            returns shape (B, H, W, C) for 2D or (B, D, H, W, C) for 3D instead
-            of flattened (B, N, C) where N is the number of patches.
+            Whether to return a ModelOutput instead of a plain tuple
 
         Returns
         -------
@@ -128,7 +123,7 @@ class TAPCTModel(TAPCTPreTrainedModel):
             pixel_values,
             n=self.model.n_blocks,
             return_class_token=True,
-            reshape=reshape
+            reshape=False
        )
         outputs = tuple(o[0] for o in outputs_tuple)
         class_tokens = tuple(o[1] for o in outputs_tuple)
@@ -141,7 +136,7 @@ class TAPCTModel(TAPCTPreTrainedModel):
             pixel_values,
             n=1,
             return_class_token=True,
-            reshape=reshape
+            reshape=False
        )
         last_hidden_state = outputs_tuple[0][0]
         pooler_output = outputs_tuple[0][1]
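This change drops the `reshape` forward argument, so `last_hidden_state` is always the flattened `(B, N, C)` token sequence. Callers that want a spatial feature map can rebuild it themselves; a sketch assuming the README's input size (12, 224, 224), patch size (4, 8, 8), and depth-major token ordering (the variable names are illustrative, not part of the model API):

```python
import torch

batch, embed_dim = 2, 768                   # ViT-Base embedding width
d, h, w = 12 // 4, 224 // 8, 224 // 8       # patch grid: 3 x 28 x 28
tokens = torch.randn(batch, d * h * w, embed_dim)  # stand-in for last_hidden_state

# Fold the token axis back into a (B, D', H', W', C) feature map.
feature_map = tokens.reshape(batch, d, h, w, embed_dim)
print(tuple(feature_map.shape))  # (2, 3, 28, 28, 768)
```
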
preprocessor_config.json DELETED
@@ -1,13 +0,0 @@
-{
-  "image_processor_type": "TAPCTProcessor",
-  "use_fast": false,
-  "resize_dims": [224, 224],
-  "divisible_pad_z": 4,
-  "clip_range": [-1008.0, 822.0],
-  "norm_mean": -86.80862426757812,
-  "norm_std": 322.63470458984375,
-  "auto_map": {
-    "AutoImageProcessor": "tapct_processor.TAPCTProcessor"
-  }
-}
-
tapct_processor.py DELETED
@@ -1,179 +0,0 @@
-from typing import Union
-
-import numpy as np
-import torch
-import torch.nn.functional as F
-from transformers.image_processing_utils import BaseImageProcessor
-
-
-class TAPCTProcessor(BaseImageProcessor):
-    """
-    Image processor for TAP-CT 3D volumes.
-
-    Processes CT volumes with the following pipeline:
-
-    1. Spatial Resizing: Resize to (z, H', W') where H', W' are resize_dims
-    2. Axial Padding: Pad z-axis with -1024 HU for divisibility by patch size
-    3. Intensity Clipping: Clip to HU range
-    4. Normalization: Z-score normalization
-
-    Parameters
-    ----------
-    resize_dims : tuple[int, int], default=(224, 224)
-        Target spatial dimensions (H, W) for resizing.
-    divisible_pad_z : int, default=4
-        Pad the z-axis to be divisible by this value.
-    clip_range : tuple[float, float], default=(-1008.0, 822.0)
-        HU intensity clipping range (min, max).
-    norm_mean : float, default=-86.80862426757812
-        Mean for z-score normalization.
-    norm_std : float, default=322.63470458984375
-        Standard deviation for z-score normalization.
-    **kwargs
-        Additional arguments passed to BaseImageProcessor.
-    """
-
-    model_input_names = ["pixel_values"]
-
-    def __init__(
-        self,
-        resize_dims: tuple[int, int] = (224, 224),
-        divisible_pad_z: int = 4,
-        clip_range: tuple[float, float] = (-1008.0, 822.0),
-        norm_mean: float = -86.80862426757812,
-        norm_std: float = 322.63470458984375,
-        **kwargs
-    ) -> None:
-        super().__init__(**kwargs)
-        self.resize_dims = resize_dims
-        self.divisible_pad_z = divisible_pad_z
-        self.clip_range = clip_range
-        self.norm_mean = norm_mean
-        self.norm_std = norm_std
-
-    def preprocess(
-        self,
-        images: Union[torch.Tensor, np.ndarray],
-        return_tensors: str = "pt",
-        **kwargs
-    ) -> dict[str, torch.Tensor]:
-        """
-        Preprocess CT volumes.
-
-        Parameters
-        ----------
-        images : torch.Tensor or np.ndarray
-            Input tensor or numpy array of shape (B, C, D, H, W) where
-            B=batch, C=channels, D=depth/slices, H=height, W=width.
-        return_tensors : str, default="pt"
-            Return format. Only "pt" (PyTorch) is supported.
-        **kwargs
-            Additional keyword arguments (unused).
-
-        Returns
-        -------
-        dict[str, torch.Tensor]
-            Dictionary with "pixel_values" containing processed tensor of shape
-            (B, C, D', H', W') where D' may be padded for divisibility.
-
-        Raises
-        ------
-        ValueError
-            If return_tensors is not "pt" or input is not 5D.
-        """
-        if return_tensors != "pt":
-            raise ValueError(f"Only 'pt' return_tensors is supported, got {return_tensors}")
-
-        # Convert numpy to tensor if needed
-        if isinstance(images, np.ndarray):
-            images = torch.from_numpy(images)
-
-        # Ensure float32 dtype for processing
-        images = images.float()
-
-        # Validate input shape
-        if images.ndim != 5:
-            raise ValueError(f"Expected 5D input (B, C, D, H, W), got shape {images.shape}")
-
-        B, C, D, H, W = images.shape
-
-        # Step 1: Spatial Resizing - resize H, W dimensions to resize_dims
-        target_h, target_w = self.resize_dims
-        if H != target_h or W != target_w:
-            images = self._resize_spatial(images, target_h, target_w)
-
-        # Step 2: Axial Padding - pad z-axis with -1024 for divisibility
-        images = self._pad_axial(images)
-
-        # Step 3: Intensity Clipping - clip to HU range
-        images = torch.clamp(images, min=self.clip_range[0], max=self.clip_range[1])
-
-        # Step 4: Z-score Normalization
-        images = (images - self.norm_mean) / self.norm_std
-
-        return {"pixel_values": images}
-
-    def _resize_spatial(
-        self,
-        images: torch.Tensor,
-        target_h: int,
-        target_w: int
-    ) -> torch.Tensor:
-        """
-        Resize spatial dimensions (H, W) using trilinear interpolation.
-
-        Parameters
-        ----------
-        images : torch.Tensor
-            Tensor of shape (B, C, D, H, W).
-        target_h : int
-            Target height.
-        target_w : int
-            Target width.
-
-        Returns
-        -------
-        torch.Tensor
-            Resized tensor of shape (B, C, D, target_h, target_w).
-        """
-        D = images.shape[2]
-
-        # Apply trilinear interpolation, keeping depth unchanged
-        images = F.interpolate(
-            images,
-            size=(D, target_h, target_w),
-            mode='trilinear',
-            align_corners=False
-        )
-
-        return images
-
-    def _pad_axial(self, images: torch.Tensor) -> torch.Tensor:
-        """
-        Pad the axial (z/depth) dimension with -1024 HU for divisibility.
-
-        Parameters
-        ----------
-        images : torch.Tensor
-            Tensor of shape (B, C, D, H, W).
-
-        Returns
-        -------
-        torch.Tensor
-            Padded tensor of shape (B, C, D', H, W) where D' is divisible
-            by divisible_pad_z.
-        """
-        D = images.shape[2]
-        remainder = D % self.divisible_pad_z
-
-        if remainder == 0:
-            return images
-
-        pad_z = self.divisible_pad_z - remainder
-
-        # F.pad expects padding in reverse dimension order: (W_l, W_r, H_l, H_r, D_l, D_r, ...)
-        # To pad depth at the end: (0, 0, 0, 0, 0, pad_z)
-        padding = (0, 0, 0, 0, 0, pad_z)
-        images = F.pad(images, padding, mode='constant', value=-1024.0)
-
-        return images
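The deleted `_pad_axial` relies on `F.pad`'s reverse-dimension tuple convention, which is easy to get wrong. A standalone check confirming that a 6-tuple pads the last three dimensions of a 5D tensor, innermost first, and fills the new slices with the pad value:

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 10, 224, 224)

# The 6-tuple pads the LAST three dims in reverse order:
# (W_left, W_right, H_left, H_right, D_left, D_right)
padded = F.pad(x, (0, 0, 0, 0, 0, 2), mode='constant', value=-1024.0)

print(tuple(padded.shape))  # (1, 1, 12, 224, 224)
# The two appended slices hold the pad value, the original slices are untouched.
print(padded[0, 0, -1, 0, 0].item())  # -1024.0
```
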