TimVeenboer committed · Commit eb26bbb · Parent(s): 9be891b

docs(tap-hf): readme and modeling docs

Files changed: README.md +68 -4, modeling_tapct.py +10 -5
README.md
CHANGED
TAP-CT is a suite of foundation models for computed tomography (CT) imaging, pretrained in a task-agnostic manner through an adaptation of DINOv2 for volumetric data. These models learn robust 3D representations from CT scans without requiring task-specific annotations.

This repository provides TAP-CT-B-2.5D, a Vision Transformer (ViT-Base) architecture pretrained on volumetric inputs with a spatial resolution of (6, 224, 224) and a patch size of (1, 16, 16). For inference on full-resolution CT volumes, a sliding window approach can be employed to extract features across the entire scan.
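
As a rough sanity check of what this implies (an assumption based on standard non-overlapping ViT patching, with the `[CLS]` token counted separately), the crop and patch sizes above yield 6 × 14 × 14 patch tokens:

```python
# Hypothetical back-of-the-envelope check of the patch-token count
# implied by the stated crop size and patch size.
crop_size = (6, 224, 224)   # (depth, height, width) of one input crop
patch_size = (1, 16, 16)    # (depth, height, width) of one patch

tokens_per_axis = [c // p for c, p in zip(crop_size, patch_size)]
num_patch_tokens = tokens_per_axis[0] * tokens_per_axis[1] * tokens_per_axis[2]
print(tokens_per_axis, num_patch_tokens)  # [6, 14, 14] 1176
```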
## Preprocessing

### Using dedicated image processor

Each TAP-CT model repository provides its own dedicated image processor and configuration file. To ensure proper preprocessing, it is recommended to instantiate the corresponding image processor using the `AutoImageProcessor` class from Hugging Face Transformers. This can be accomplished as follows:

```python
from transformers import AutoImageProcessor

preprocessor = AutoImageProcessor.from_pretrained(
    'fomofo/tap-ct-b-2-5d',
    trust_remote_code=True
)
```

This approach automatically loads the appropriate processor and configuration for the selected TAP-CT model.
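
As a quick, hypothetical smoke test (mirroring the call pattern of the sliding-window example further below), the loaded processor can be applied to a dummy volume laid out as `(B, C, D, H, W)`:

```python
import numpy as np

# Dummy CT-like volume in (B, C, D, H, W) layout; real inputs come from a scan as shown below.
dummy = np.random.randn(1, 1, 6, 224, 224).astype(np.float32)

# The processor returns a dict-like object; 'pixel_values' holds the model-ready input.
pixel_values = preprocessor(dummy)['pixel_values']
print(type(pixel_values))
```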
### Preprocessing without pipeline

1. **Orientation**: Convert the volume to LPS (Left-Posterior-Superior) orientation. While the model is likely orientation-invariant, all evaluations were conducted using LPS orientation.
2. **Spatial Resizing**: Resize the volume to a spatial resolution of (z, 224, 224) or (z, 512, 512), where z represents the number of slices along the axial dimension (see the sketch after this list).
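
A minimal sketch of these two manual steps with SimpleITK is given below; the linear interpolator and the 224×224 in-plane target are assumptions, and the dedicated image processor above may handle resizing differently:

```python
# Minimal sketch of manual preprocessing (assumed, not the official pipeline).
import SimpleITK as sitk

volume = sitk.ReadImage('/path/to/ct-scan.nii.gz')

# 1. Orientation: reorient to LPS.
volume = sitk.DICOMOrient(volume, 'LPS')

# 2. Spatial resizing: resample in-plane to 224x224 while keeping every axial slice.
#    SimpleITK sizes and spacings are ordered (x, y, z).
old_size = volume.GetSize()
new_size = (224, 224, old_size[2])
new_spacing = [
    sp * osz / nsz
    for sp, osz, nsz in zip(volume.GetSpacing(), old_size, new_size)
]
volume = sitk.Resample(
    volume,
    new_size,
    sitk.Transform(),        # identity transform
    sitk.sitkLinear,
    volume.GetOrigin(),
    new_spacing,
    volume.GetDirection(),
)
```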
## Usage

### Default Usage

```python
import torch
from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained('fomofo/tap-ct-b-2-5d', trust_remote_code=True)

x = torch.randn((16, 1, 6, 224, 224))

# Forward pass
with torch.no_grad():
    output = model.forward(x)
```
### Usage with Preprocessor, loading CT volumes & sliding window inference

```python
import numpy as np
import SimpleITK as sitk
import torch
from transformers import AutoModel, AutoImageProcessor

# Load the model
model = AutoModel.from_pretrained('fomofo/tap-ct-b-2-5d', trust_remote_code=True)
preprocessor = AutoImageProcessor.from_pretrained('fomofo/tap-ct-b-2-5d', trust_remote_code=True)

# Load image & set orientation to LPS
volume = sitk.ReadImage('/path/to/ct-scan.nii.gz')
volume = sitk.DICOMOrient(volume, 'LPS')

# Get array, expand to (B, C, D, H, W) and preprocess
array = sitk.GetArrayFromImage(volume)
array = np.expand_dims(array, axis=(0, 1))
x = preprocessor(array)['pixel_values']

# Forward pass
with torch.no_grad():
    output = model.forward(x)

# OR

# Forward pass with sliding window
from monai.inferers import SlidingWindowInferer

def predictor_fn(x):
    # Reshape the patch tokens to resemble a 3D feature map
    out = model(x, reshape=True)
    return out.last_hidden_state

inferer = SlidingWindowInferer(
    roi_size=[6, 224, 224],
    sw_batch_size=1,
    overlap=0.75,
    mode='gaussian'
)

with torch.no_grad():
    output = inferer(x, predictor_fn)
```
The model returns a `BaseModelOutputWithPooling` object from the transformers library. The `output.pooler_output` contains the pooled `[CLS]` token representation, while `output.last_hidden_state` contains the spatial patch token embeddings. To extract features from all intermediate transformer layers, pass `output_hidden_states=True` to the forward method.
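
As a small, hypothetical illustration of unpacking these fields (shapes assume the `(16, 1, 6, 224, 224)` example input, a ViT-Base hidden width of 768, and the 6 × 14 × 14 = 1176 patch tokens estimated earlier):

```python
# Hypothetical inspection of the returned fields; exact shapes depend on the checkpoint.
with torch.no_grad():
    output = model(x, output_hidden_states=True)

cls_embedding = output.pooler_output      # e.g. (16, 768): pooled [CLS] representation
patch_tokens = output.last_hidden_state   # e.g. (16, 1176, 768): flattened patch tokens
all_layers = output.hidden_states         # tuple with one entry per returned layer

# With reshape=True (see the modeling_tapct.py change below), the patch tokens
# come back as a spatial map instead, e.g. (16, 6, 14, 14, 768) for this input.
spatial_map = model(x, reshape=True).last_hidden_state
```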
- **Model Type**: 3D CT Vision Foundation Model
- **Input Shape**: `(batch_size, 1, depth, height, width)`
- **Example Input**: `(16, 1, 6, 224, 224)` - batch of 16 CT crops with 6 slices at 224×224 resolution
- **License**: CC-BY-NC-4.0

modeling_tapct.py
CHANGED
@@ -94,6 +94,7 @@ class TAPCTModel(TAPCTPreTrainedModel):
         pixel_values: torch.Tensor,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
+        reshape: bool = False
     ) -> BaseModelOutputWithPooling:
         """
         Forward pass of the TAP-CT model.

@@ -101,11 +102,15 @@ class TAPCTModel(TAPCTPreTrainedModel):
         Parameters
         ----------
         pixel_values : torch.Tensor
-            Input images. Shape (B, C, H, W) for 2D or (B, C, D, H, W) for 3D
+            Input images. Shape (B, C, H, W) for 2D or (B, C, D, H, W) for 3D.
         output_hidden_states : Optional[bool], optional
-            Whether to return hidden states from all layers
+            Whether to return hidden states from all layers.
         return_dict : Optional[bool], optional
-            Whether to return a ModelOutput instead of a plain tuple
+            Whether to return a ModelOutput instead of a plain tuple.
+        reshape : bool, default=False
+            Whether to reshape output features to spatial dimensions. If True,
+            returns shape (B, H, W, C) for 2D or (B, D, H, W, C) for 3D instead
+            of flattened (B, N, C) where N is the number of patches.

         Returns
         -------

@@ -123,7 +128,7 @@ class TAPCTModel(TAPCTPreTrainedModel):
             pixel_values,
             n=self.model.n_blocks,
             return_class_token=True,
-            reshape=
+            reshape=reshape
         )
         outputs = tuple(o[0] for o in outputs_tuple)
         class_tokens = tuple(o[1] for o in outputs_tuple)

@@ -136,7 +141,7 @@ class TAPCTModel(TAPCTPreTrainedModel):
             pixel_values,
             n=1,
             return_class_token=True,
-            reshape=
+            reshape=reshape
         )
         last_hidden_state = outputs_tuple[0][0]
         pooler_output = outputs_tuple[0][1]