Safetensors · tapct · custom_code

TimVeenboer committed
Commit 3eca297 · 1 Parent(s): 7fb44d5

docs(tap-hf): README & docs editing

Files changed (2):
  1. README.md +67 -3
  2. modeling_tapct.py +7 -3
README.md CHANGED
@@ -6,11 +6,26 @@ license: cc-by-nc-4.0
 
 TAP-CT is a suite of foundation models for computed tomography (CT) imaging, pretrained in a task-agnostic manner through an adaptation of DINOv2 for volumetric data. These models learn robust 3D representations from CT scans without requiring task-specific annotations.
 
- This repository provides TAP-CT-S-3D, a Vision Transformer (ViT-Small) architecture pretrained on volumetric inputs with a spatial resolution of (12, 224, 224) and a patch size of (4, 8, 8). For inference on full-resolution CT volumes, a sliding window approach can be employed to extract features across the entire scan. Additional TAP-CT model variants, as well as the image processor, will be released in future updates.
+ This repository provides TAP-CT-S-3D, a Vision Transformer (ViT-Small) architecture pretrained on volumetric inputs with a spatial resolution of (12, 224, 224) and a patch size of (4, 8, 8). For inference on full-resolution CT volumes, a sliding window approach can be employed to extract features across the entire scan.
 
 ## Preprocessing
 
- While a dedicated image processor will be released in future updates, optimal feature extraction requires the following preprocessing pipeline:
+ ### Using the dedicated image processor
+
+ Each TAP-CT model repository provides its own dedicated image processor and configuration file. To ensure proper preprocessing, it is recommended to instantiate the corresponding image processor using the `AutoImageProcessor` class from Hugging Face Transformers. This can be accomplished as follows:
+
+ ```python
+ from transformers import AutoImageProcessor
+
+ preprocessor = AutoImageProcessor.from_pretrained(
+     'fomofo/tap-ct-s-3d',
+     trust_remote_code=True
+ )
+ ```
+
+ This approach automatically loads the appropriate processor and configuration for the selected TAP-CT model.
+
+ ### Preprocessing without the image processor
 
 1. **Orientation**: Convert the volume to LPS (Left-Posterior-Superior) orientation. While the model is likely orientation-invariant, all evaluations were conducted using LPS orientation.
  2. **Spatial Resizing**: Resize the volume to a spatial resolution of \(z, 224, 224\) or \(z, 512, 512\), where \(z\) represents the number of slices along the axial dimension.
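
For readers who preprocess volumes manually, a minimal sketch of the two steps listed above (LPS reorientation and in-plane resizing) could look as follows. SimpleITK is used for loading and reorientation, matching the repository's own example further down; the torch-based resize and the trilinear interpolation mode are assumptions, not part of the released code.

```python
import SimpleITK as sitk
import torch
import torch.nn.functional as F

# Step 1: load the CT scan and reorient it to LPS
volume = sitk.ReadImage('/path/to/ct-scan.nii.gz')
volume = sitk.DICOMOrient(volume, 'LPS')

# Step 2: resize the in-plane resolution to 224 x 224 while keeping all z slices
array = sitk.GetArrayFromImage(volume)            # (z, H, W), axial slices first
x = torch.from_numpy(array).float()[None, None]   # (1, 1, z, H, W)
x = F.interpolate(
    x,
    size=(x.shape[2], 224, 224),
    mode='trilinear',          # interpolation choice is an assumption
    align_corners=False,
)
```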
 
@@ -20,6 +35,8 @@ While a dedicated image processor will be released in future updates, optimal fe
 
 ## Usage
 
+ ### Default Usage
+
 ```python
 import torch
 from transformers import AutoModel
@@ -31,7 +48,54 @@ model = AutoModel.from_pretrained('fomofo/tap-ct-s-3d', trust_remote_code=True)
 x = torch.randn((16, 1, 12, 224, 224))
 
 # Forward pass
- output = model.forward(x)
+ with torch.no_grad():
+     output = model.forward(x)
+ ```
+
+ ### Usage with the preprocessor, CT volume loading & sliding-window inference
+
+ ```python
+ import numpy as np
+ import SimpleITK as sitk
+ import torch
+ from transformers import AutoModel, AutoImageProcessor
+
+ # Load the model and its image processor
+ model = AutoModel.from_pretrained('fomofo/tap-ct-s-3d', trust_remote_code=True)
+ preprocessor = AutoImageProcessor.from_pretrained('fomofo/tap-ct-s-3d', trust_remote_code=True)
+
+ # Load image & set orientation to LPS
+ volume = sitk.ReadImage('/path/to/ct-scan.nii.gz')
+ volume = sitk.DICOMOrient(volume, 'LPS')
+
+ # Get array, expand to (B, C, D, H, W) and preprocess
+ array = sitk.GetArrayFromImage(volume)
+ array = np.expand_dims(array, axis=(0, 1))
+ x = preprocessor(array)['pixel_values']
+
+ # Forward pass
+ with torch.no_grad():
+     output = model.forward(x)
+
+ # OR
+
+ # Forward pass with sliding window
+ from monai.inferers import SlidingWindowInferer
+
+ def predictor_fn(x):
+     # Reshape the patch tokens to resemble a 3D feature map
+     out = model(x, reshape=True)
+     return out.last_hidden_state
+
+ inferer = SlidingWindowInferer(
+     roi_size=[12, 224, 224],
+     sw_batch_size=1,
+     overlap=0.75,
+     mode='gaussian'
+ )
+
+ with torch.no_grad():
+     output = inferer(x, predictor_fn)
 ```
 
  The model returns a `BaseModelOutputWithPooling` object from the transformers library. The `output.pooler_output` contains the pooled `[CLS]` token representation, while `output.last_hidden_state` contains the spatial patch token embeddings. To extract features from all intermediate transformer layers, pass `output_hidden_states=True` to the forward method.
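
As a brief illustration of that output structure, here is a hedged sketch of inspecting the returned fields; exact token counts and shapes depend on the model configuration and are only indicated symbolically in the comments.

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('fomofo/tap-ct-s-3d', trust_remote_code=True)
x = torch.randn((1, 1, 12, 224, 224))

with torch.no_grad():
    output = model(x, output_hidden_states=True)

print(output.pooler_output.shape)      # pooled [CLS] representation: (batch, hidden_size)
print(output.last_hidden_state.shape)  # patch token embeddings: (batch, num_tokens, hidden_size)
print(len(output.hidden_states))       # per-layer hidden states, returned because output_hidden_states=True
```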
modeling_tapct.py CHANGED
@@ -102,11 +102,15 @@ class TAPCTModel(TAPCTPreTrainedModel):
         Parameters
         ----------
         pixel_values : torch.Tensor
-            Input images. Shape (B, C, H, W) for 2D or (B, C, D, H, W) for 3D
+            Input images. Shape (B, C, H, W) for 2D or (B, C, D, H, W) for 3D.
         output_hidden_states : Optional[bool], optional
-            Whether to return hidden states from all layers
+            Whether to return hidden states from all layers.
         return_dict : Optional[bool], optional
-            Whether to return a ModelOutput instead of a plain tuple
+            Whether to return a ModelOutput instead of a plain tuple.
+        reshape : bool, default=False
+            Whether to reshape output features to spatial dimensions. If True,
+            returns shape (B, H, W, C) for 2D or (B, D, H, W, C) for 3D instead
+            of flattened (B, N, C) where N is the number of patches.
 
         Returns
         -------