Update README.md

8b81431 verified 4 days ago

8.3 kB

	---
	license: cc-by-nc-sa-4.0
	datasets:
	- ibrahimhamamci/CT-RATE
	pipeline_tag: feature-extraction
	tags:
	- university-of-kentucky
	- medical
	- radiology
	- chest-ct
	- vision
	- lejepa
	language:
	- en
	---
	# Model Card for DALE-CT-0

	This repository hosts the backbone weights for DALE-CT-0 (Depth-Aware Latent-Euclidean Computed Tomography), a foundational Vision Transformer (ViT-Large) trained on Chest CT scans using the Latent-Euclidean Joint-Embedding Predictive Architecture ([LeJEPA](https://github.com/galilai-group/lejepa)) framework.

	Unlike its 1S and 2S counterparts, this model was trained purely using self-supervised LeJEPA objectives without any auxiliary supervision, allowing it to learn general-purpose representations of CT volumes in a completely unsupervised manner.

	This model was developed by the Institute for Biomedical Informatics Center for Applied AI (IBI-CAAI) at the University of Kentucky to serve as a robust feature extractor for downstream medical imaging tasks, including segmentation, multi-instance learning (MIL), and anomaly detection.

	## Model Details
	* Model Type: Vision Transformer (ViT-Large) for Chest CT analysis.
	* Developed by: Institute for Biomedical Informatics Center for Applied AI (IBI-CAAI)
	* Model Date: 04/2026
	* Base Model Architecture: `vit_large_patch14_dinov2` (via `timm`). Note: This model was randomly initialized and trained entirely from scratch with modified architecture arguments.
	* Input: 1-channel Grayscale CT Image.
	* Output: Class token and patch tokens. These can be used for various downstream tasks (e.g., classification, anomaly detection, multi-instance learning).
	* Embedding Dimension: 1024
	* Patch Size: 16
	* Image Size Compatibility: Native `512x512` resolution. The architecture supports variable input sizes dynamically, provided the height and width are divisible by the patch size (16).
	* License: CC BY-NC-SA 4.0 (Inherited from the CT-RATE dataset terms).

	## Intended Uses
	This model is intended for research purposes in the field of medical imaging and radiology.
	* Primary Intended Uses:
	* Feature extraction for quantitative analysis of Chest CT scans.
	* Foundational backbone for downstream models predicting organ anomalies, segmentation, or volume-level analysis via MIL.

	## Training Data
	* Dataset(s): The model was trained exclusively on the train split of the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) dataset.
	* Preprocessing: Hounsfield Units (HU) were strictly clipped between `[-997.0, 888.0]`. These values correspond to the 0.5% and 99.5% pixel intensities of the foreground voxels calculated on a subset of the CT-RATE dataset. The clipped values were mapped to a `[0, 1]` range, followed by Z-score normalization utilizing a dataset mean of `-142.39` and standard deviation of `360.97`.

	## Training Procedure
	* Training System/Framework: Distributed Data Parallel (DDP) utilizing `bf16` mixed precision.
	* Hardware & Scale: The model was trained for a total of 50,000 iterations. The configuration utilized a batch size of 64 per GPU, a peak learning rate of `3.0e-04` (decaying to `3.0e-05`), and a 5,000-step warmup.
	* Training Strategy: Global and local crops were sampled from within a 12mm physical slab rather than a single 2D plane to ensure anatomical awareness during the self-supervised matching task.

	### Self-Supervised Objective: LeJEPA
	The model was trained using solely the self-supervised LeJEPA objective, which is a strictly non-predictive architecture that combines a spatial invariance loss with Sketched Isotropic Gaussian Regularization (SIGReg):
	$$\mathcal{L}_{\text{LeJEPA}}=(1-\lambda)\mathcal{L}_{\text{invariance}}+\lambda\mathcal{L}_{\text{SIGReg}}$$
	where \$\lambda=0.02\$. The SIGReg formulation projects embeddings onto a set of random 1D directions to enforce normality via empirical characteristic functions, completely avoiding traditional architectural heuristics like predictor networks.

	### Data Augmentation Pipeline
	A specialized, GPU-accelerated augmentation pipeline generated the multi-crop views required for the LeJEPA architecture.

	1. Spatial Cropping
	* Global Crops: 2 global crops generated per volume, sized at 256x256 pixels, with a random scale between 60% and 100% of the original image dimensions.
	* Local Crops: 8 local crops generated per volume, sized at 144x144 pixels, with a random scale between 30% and 60%.
	* Flip & Resize: Resized using nearest-neighbor interpolation to prevent artificial averaging of HU values, followed by a random horizontal flip (50% probability).

	2. Intensity & Noise Augmentations
	* Random Gamma Correction: A random gamma shift (range: 0.9 to 1.1) was applied independently to all crops with an 80% probability.

	## How to Get Started with the Model

	Because CT scans require strict Hounsfield Unit (HU) windowing and normalization to match the training distribution, you must apply the specific preprocessing logic below. Note: With the native 512px architecture, standard CT slices (512x512) do not require dynamic padding.

	```python
	import torch
	import numpy as np
	import timm
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file

	class CTInferenceTransform:
	"""
	Applies the exact HU windowing and Z-score normalization used during LeJEPA training.
	Assumes standard 512x512 CT slice inputs.
	"""
	def __init__(self):
	self.clip_min = -997.0
	self.clip_max = 888.0
	self.mean_hu = -142.39
	self.std_hu = 360.97

	# Calculate 0-1 scaled mean and std
	range_val = self.clip_max - self.clip_min
	self.norm_mean = (self.mean_hu - self.clip_min) / range_val
	self.norm_std = self.std_hu / range_val

	def __call__(self, volume):
	# Expects a 2D numpy array or torch tensor (H, W) in Hounsfield Units
	if isinstance(volume, np.ndarray):
	volume = torch.from_numpy(volume).float()
	if volume.ndim == 2:
	volume = volume.unsqueeze(0) # Add channel dim: (1, H, W)

	# 1. Clamp HU values and map strictly to [0, 1]
	volume = torch.clamp(volume, self.clip_min, self.clip_max)
	range_val = self.clip_max - self.clip_min
	volume = (volume - self.clip_min) / range_val

	# 2. Z-score standardization
	volume = (volume - self.norm_mean) / self.norm_std

	# Returns (1, 1, H, W). For batched inference, stack these along dim=0.
	return volume.unsqueeze(0)

	def load_ct_model(repo_id="IBI-CAAI/DALE-CT-0"):
	"""
	Downloads and initializes the ViT-Large backbone using timm and safetensors.
	"""
	# 1. Initialize the base architecture with overrides (patch_size=16, img_size=512)
	model = timm.create_model(
	"vit_large_patch14_dinov2",
	pretrained=False,
	num_classes=0,
	in_chans=1,
	patch_size=16,
	img_size=512,
	dynamic_img_size=True
	)

	# 2. Download and load the custom safetensors weights
	model_path = hf_hub_download(repo_id=repo_id, filename="model.safetensors")
	state_dict = load_file(model_path)
	model.load_state_dict(state_dict, strict=False)
	model.eval()

	return model

	if __name__ == "__main__":
	# Initialize the transform and the model
	transform = CTInferenceTransform()
	model = load_ct_model()

	# Simulate a raw CT slice (Replace this with an actual NIfTI/DICOM load in Hounsfield Units)
	raw_ct_slice = np.random.uniform(-1000, 1000, size=(512, 512))

	# Process the image to ensure correct normalization
	input_tensor = transform(raw_ct_slice)

	# Extract embeddings
	with torch.no_grad():
	# Option A: Get the single pooled global feature for the entire slice
	global_feature = model(input_tensor)

	# Option B: Get the unpooled, dense spatial patch tokens (for fine-grained tasks like Segmentation)
	patch_tokens = model.forward_features(input_tensor)

	print(f"Input tensor shape: {input_tensor.shape}")
	print(f"Extracted features shape: {global_feature.shape}")
	print(f"Dense patch tokens shape: {patch_tokens.shape}")