Add files using upload-large-folder tool

Files changed (14)

README.md +156 -0
__init__.py +19 -0
common.py +36 -0
config.json +181 -0
configuration_brain_mri_siglip.py +112 -0
model.safetensors +3 -0
modeling_brain_mri_siglip.py +615 -0
offline_aligned_preprocessing.py +286 -0
preprocessor_config.json +52 -0
processing_brain_mri_siglip.py +680 -0
processor_config.json +25 -0
special_tokens_map.json +23 -0
spiece.model +3 -0
tokenizer_config.json +34 -0

README.md ADDED

	@@ -0,0 +1,156 @@

+---
+library_name: transformers
+pipeline_tag: feature-extraction
+base_model: google/medsiglip-448
+tags:
+  - medical-imaging
+  - mri
+  - brain-mri
+  - siglip
+  - vision-language
+  - contrastive-learning
+  - feature-extraction
+  - custom-code
+  - pytorch
+---
+# Brain MRI SigLIP
+Brain MRI SigLIP is a 3D MRI vision-language representation model trained with a SigLIP-style image-text contrastive objective. This repository publishes the final saved `stage2_joint_finetune` checkpoint from the `brain_mri_siglip_run_0509` experiment.
+This checkpoint is intended as a research visual encoder for brain MRI downstream tasks and as a warm-start encoder for building a medical VLM. It is not a clinical diagnostic device.
+## Model Summary
+- Base text tower: `google/medsiglip-448`
+- Model class: `BrainMRISiglipModel`
+- Vision input: single-channel 3D MRI volumes
+- Expected volume shape: `[1, 128, 192, 192]`
+- Projection dimension: `1152`
+- Patch size: `[8, 16, 16]`
+- Training precision: `bf16`
+- Training input format: preprocessed `.pt` tensors, `float16`, value range `[-1, 1]`
+## Training Context
+This model was initialized from the `brain_mri_siglip_run_0509/stage1_freeze_text` checkpoint and then jointly fine-tuned with both vision and text towers trainable.
+Training summary:
+- Training samples: `950,720`
+- Validation samples: `67,450`
+- Validation samples with `metadata_text`: `32,278`
+- Stage 1: frozen text tower, vision-heavy training
+- Stage 2: joint vision-text fine-tuning
+- Stage 2 epochs configured: `8`
+- World size: `5`
+- Stage 2 per-device batch size: `160`
+- Stage 2 contrastive forward batch: `800`
+- Gradient checkpointing: text and vision enabled
+Training-time retrieval evaluation used capped validation subsets and should be treated as monitoring rather than a final benchmark.
+## Loading
+This model uses custom Transformers code. Load it with `trust_remote_code=True`.
+```python
+import torch
+from transformers import AutoModel, AutoProcessor
+repo_id = "shenxiaochen/brain-mri-siglip"
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = AutoModel.from_pretrained(
+    repo_id,
+    trust_remote_code=True,
+    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
+).to(device).eval()
+processor = AutoProcessor.from_pretrained(
+    repo_id,
+    trust_remote_code=True,
+)
+```
+## NIfTI Preprocessing
+For reproducible inference from NIfTI files, pass paths directly to the saved processor. This repository includes the offline-aligned preprocessing implementation used to match the training tensor distribution.
+```python
+nifti_path = "/path/to/brain_mri.nii.gz"
+inputs = processor(
+    volumes=nifti_path,
+    return_tensors="pt",
+)
+pixel_values = inputs["pixel_values"].to(device)
+if torch.cuda.is_available():
+    pixel_values = pixel_values.to(dtype=torch.bfloat16)
+with torch.inference_mode():
+    image_embeds = model.get_image_features(pixel_values=pixel_values)
+print(pixel_values.shape)  # [1, 1, 128, 192, 192]
+print(image_embeds.shape)  # [1, 1152]
+```
+The saved path-based preprocessing recipe is:
+- canonicalize image orientation to closest RAS
+- build foreground mask with threshold `1e-3`
+- keep the largest connected foreground component
+- crop foreground with `5mm` margin
+- normalize foreground intensities with `0.5/99.5` percentiles
+- map intensities to `[-1, 1]`
+- resample to spacing `(1.25, 1.0, 1.0)`
+- downscale to fit `[128, 192, 192]`
+- center-pad with background value `-1.0`
+The exact settings are saved in `preprocessor_config.json` and `processor_config.json`.
+## Using Preprocessed `.pt` Inputs
+If your data is already stored in the offline-preprocessed tensor format used during training, you can load it directly:
+```python
+payload = torch.load("/path/to/sample.pt", map_location="cpu")
+pixel_values = payload["pixel_values"] if isinstance(payload, dict) else payload
+if pixel_values.ndim == 4:
+    pixel_values = pixel_values.unsqueeze(0)
+# Match the dtype the model was loaded with (bfloat16 on GPU, float32 on CPU).
+pixel_values = pixel_values.to(device=device, dtype=model.dtype)
+with torch.inference_mode():
+    image_embeds = model.get_image_features(pixel_values=pixel_values)
+```
+Expected tensor format:
+- shape `[1, 128, 192, 192]` for one volume, or `[B, 1, 128, 192, 192]` for a batch
+- values in `[-1, 1]`
+- padded background voxels near `-1.0`
+## VLM Integration Notes
+For VLM construction, use the 3D vision tower as a visual backbone and add a projector, Q-Former, Perceiver resampler, or other token compressor before connecting to an LLM.
+A practical downstream recipe is:
+1. Freeze this MRI encoder and train only the multimodal projector/resampler.
+2. Evaluate downstream classification, retrieval, report alignment, or instruction-following behavior.
+3. Optionally unfreeze the top vision layers with a much smaller learning rate.
+## Limitations
+- This checkpoint was trained for representation learning, not diagnosis.
+- Performance should be validated on task-specific subject-level or study-level splits.
+- Scanner, protocol, site, and preprocessing differences can affect embeddings.
+- External users should preserve the saved preprocessing pipeline for NIfTI inference.
+- Retrieval monitoring during training is not a substitute for downstream clinical validation.
+## Citation
+If you use this checkpoint, please cite this model repository and the upstream MedSigLIP model where appropriate.
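The README describes the training objective only as "SigLIP-style". As a rough illustration, here is a minimal pure-Python sketch of the pairwise sigmoid loss, mirroring `_siglip_sigmoid_loss` in `modeling_brain_mri_siglip.py` from this commit. The names `log_sigmoid` and `sigmoid_pair_loss` and the toy logits are illustrative assumptions; the committed implementation works on torch tensors.

```python
import math
from typing import List


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pair_loss(logits: List[List[float]]) -> float:
    """Pairwise sigmoid loss over a square text-by-image logit matrix.

    Diagonal entries are matched pairs (label +1); every off-diagonal
    entry is a negative pair (label -1). The negative log-likelihoods
    are summed per row, then averaged over rows.
    """
    n = len(logits)
    if any(len(row) != n for row in logits):
        raise ValueError("logits must be a square matrix")
    total = 0.0
    for i, row in enumerate(logits):
        for j, value in enumerate(row):
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * value)
    return total / n


# Hypothetical toy logits: well-separated pairs give a small loss,
# swapped pairs give a large one.
aligned = [[5.0, -5.0], [-5.0, 5.0]]
swapped = [[-5.0, 5.0], [5.0, -5.0]]
loss_aligned = sigmoid_pair_loss(aligned)
loss_swapped = sigmoid_pair_loss(swapped)
```

Assuming the README's "contrastive forward batch" of 800 is the side length of this matrix (5 ranks × 160 samples), each step scores 800 × 800 = 640,000 pairs, of which 800 are positives.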

__init__.py ADDED

	@@ -0,0 +1,19 @@

+from .configuration_brain_mri_siglip import BrainMRISiglipConfig
+from .modeling_brain_mri_siglip import BrainMRISiglipModel
+from .processing_brain_mri_siglip import BrainMRISiglipProcessor, BrainMRISiglipVolumeProcessor
+__all__ = [
+    "BrainMRISiglipConfig",
+    "BrainMRISiglipModel",
+    "BrainMRISiglipProcessor",
+    "BrainMRISiglipVolumeProcessor",
+]
+try:
+    BrainMRISiglipConfig.register_for_auto_class("AutoConfig")
+    BrainMRISiglipModel.register_for_auto_class("AutoModel")
+    BrainMRISiglipProcessor.register_for_auto_class("AutoProcessor")
+except Exception:
+    # Registration is best-effort and not required for local imports.
+    pass

common.py ADDED

	@@ -0,0 +1,36 @@

+"""Common utility helpers."""
+from __future__ import annotations
+import shutil
+from pathlib import Path
+from typing import Iterable, Sequence, Tuple, Union
+def to_3tuple(value: Union[int, Sequence[int]], name: str) -> Tuple[int, int, int]:
+    if isinstance(value, int):
+        return (value, value, value)
+    if len(value) != 3:
+        raise ValueError(f"`{name}` must be an int or length-3 sequence. Got: {value}")
+    return (int(value[0]), int(value[1]), int(value[2]))
+REMOTE_CODE_FILES = (
+    "__init__.py",
+    "common.py",
+    "configuration_brain_mri_siglip.py",
+    "modeling_brain_mri_siglip.py",
+    "offline_aligned_preprocessing.py",
+    "processing_brain_mri_siglip.py",
+)
+def copy_remote_code_files(destination: Union[str, Path], file_names: Iterable[str] = REMOTE_CODE_FILES) -> None:
+    src_dir = Path(__file__).resolve().parent
+    dst_dir = Path(destination)
+    dst_dir.mkdir(parents=True, exist_ok=True)
+    for name in file_names:
+        src_file = src_dir / name
+        if src_file.exists():
+            shutil.copy2(src_file, dst_dir / name)
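To make the behavior of `to_3tuple` concrete, the sketch below copies the helper from `common.py` so it runs on its own, then exercises the three cases it handles. The variable names `isotropic`, `anisotropic`, and `error_message` are illustrative only.

```python
from typing import Sequence, Tuple, Union


def to_3tuple(value: Union[int, Sequence[int]], name: str) -> Tuple[int, int, int]:
    # Copied from common.py in this commit so the example is self-contained.
    if isinstance(value, int):
        return (value, value, value)
    if len(value) != 3:
        raise ValueError(f"`{name}` must be an int or length-3 sequence. Got: {value}")
    return (int(value[0]), int(value[1]), int(value[2]))


# A bare int is broadcast to all three axes.
isotropic = to_3tuple(16, "patch_size")

# A length-3 sequence is kept per-axis (depth, height, width).
anisotropic = to_3tuple([8, 16, 16], "patch_size")

# Any other length is rejected with a message naming the offending argument.
try:
    to_3tuple((16, 16), "patch_size")
except ValueError as exc:
    error_message = str(exc)
else:
    error_message = ""
```

The `name` argument exists only to make the error message point at the right config field.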

config.json ADDED

	@@ -0,0 +1,181 @@

+{
+  "architectures": [
+    "BrainMRISiglipModel"
+  ],
+  "attn_implementation": null,
+  "auto_map": {
+    "AutoConfig": "configuration_brain_mri_siglip.BrainMRISiglipConfig",
+    "AutoModel": "modeling_brain_mri_siglip.BrainMRISiglipModel",
+    "AutoProcessor": "processing_brain_mri_siglip.BrainMRISiglipProcessor"
+  },
+  "dtype": "float32",
+  "initializer_range": 0.02,
+  "logit_bias_init_value": -10.0,
+  "logit_scale_init_value": 2.6592,
+  "logit_scale_max": 100.0,
+  "logit_scale_min": 0.001,
+  "max_text_length": 64,
+  "model_type": "brain-mri-siglip",
+  "num_channels": 1,
+  "patch_size": [
+    8,
+    16,
+    16
+  ],
+  "projection_dim": 1152,
+  "text_config": {
+    "_name_or_path": "",
+    "add_cross_attention": false,
+    "architectures": null,
+    "attention_dropout": 0.0,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": 49406,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "dtype": null,
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": 49407,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "hidden_act": "gelu_pytorch_tanh",
+    "hidden_size": 1152,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "intermediate_size": 4304,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "layer_norm_eps": 1e-06,
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "max_position_embeddings": 64,
+    "min_length": 0,
+    "model_type": "siglip_text_model",
+    "no_repeat_ngram_size": 0,
+    "num_attention_heads": 16,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_hidden_layers": 27,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": 1,
+    "prefix": null,
+    "problem_type": null,
+    "projection_size": 1152,
+    "pruned_heads": {},
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "sep_token_id": null,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torchscript": false,
+    "transformers_version": "4.57.6",
+    "typical_p": 1.0,
+    "use_bfloat16": false,
+    "vocab_size": 32000
+  },
+  "text_model_name_or_path": "google/medsiglip-448",
+  "transformers_version": "4.57.6",
+  "vision_config": {
+    "_name_or_path": "",
+    "add_cross_attention": false,
+    "architectures": null,
+    "attention_dropout": 0.0,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": null,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "dtype": null,
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": null,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "hidden_act": "gelu_pytorch_tanh",
+    "hidden_size": 1152,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "image_size": 448,
+    "intermediate_size": 4304,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "layer_norm_eps": 1e-06,
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "min_length": 0,
+    "model_type": "siglip_vision_model",
+    "no_repeat_ngram_size": 0,
+    "num_attention_heads": 16,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_channels": 1,
+    "num_hidden_layers": 27,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": null,
+    "patch_size": 14,
+    "prefix": null,
+    "problem_type": null,
+    "pruned_heads": {},
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "sep_token_id": null,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torchscript": false,
+    "transformers_version": "4.57.6",
+    "typical_p": 1.0,
+    "use_bfloat16": false
+  },
+  "volume_size": [
+    128,
+    192,
+    192
+  ]
+}
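A note on reading this config: the nested `vision_config` still carries 2D fields (`image_size: 448`, `patch_size: 14`), while the 3D embedding layer in `modeling_brain_mri_siglip.py` is built from the top-level `volume_size` and `patch_size`. The sketch below derives the patch grid and token count from those top-level values. The function name `patch_grid` is an illustrative assumption, not part of the repository.

```python
from math import prod
from typing import Sequence, Tuple


def patch_grid(volume_size: Sequence[int], patch_size: Sequence[int]) -> Tuple[int, ...]:
    """Patches per axis for a non-overlapping 3D patch embedding."""
    if any(v % p != 0 for v, p in zip(volume_size, patch_size)):
        raise ValueError("volume_size must be divisible by patch_size on every axis")
    return tuple(v // p for v, p in zip(volume_size, patch_size))


# Top-level values from config.json.
volume_size = (128, 192, 192)
patch_size = (8, 16, 16)

grid = patch_grid(volume_size, patch_size)  # patches along depth, height, width
num_tokens = prod(grid)                     # length of the patch sequence
voxels_per_patch = prod(patch_size)         # voxels covered by one patch
```

For the shipped values this gives a `(16, 12, 12)` grid, that is 2,304 patch tokens per volume, each covering 2,048 voxels.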

configuration_brain_mri_siglip.py ADDED

	@@ -0,0 +1,112 @@

+"""Configuration for Brain MRI SigLIP."""
+from __future__ import annotations
+from typing import Any, Dict, Mapping, Optional, Sequence, Union
+from transformers import PretrainedConfig, SiglipTextConfig, SiglipVisionConfig
+from .common import to_3tuple
+class BrainMRISiglipConfig(PretrainedConfig):
+    r"""Configuration class for :class:`BrainMRISiglipModel`."""
+    model_type = "brain-mri-siglip"
+    def __init__(
+        self,
+        text_config: Optional[Mapping[str, Any]] = None,
+        vision_config: Optional[Mapping[str, Any]] = None,
+        text_model_name_or_path: str = "google/medsiglip-448",
+        volume_size: Union[int, Sequence[int]] = (128, 192, 192),
+        patch_size: Union[int, Sequence[int]] = (8, 16, 16),
+        num_channels: int = 1,
+        projection_dim: Optional[int] = None,
+        logit_scale_init_value: float = 2.6592,
+        logit_scale_min: float = 1e-3,
+        logit_bias_init_value: float = -10.0,
+        logit_scale_max: float = 100.0,
+        attn_implementation: Optional[str] = None,
+        max_text_length: int = 64,
+        initializer_range: float = 0.02,
+        auto_map: Optional[Mapping[str, str]] = None,
+        **kwargs: Any,
+    ) -> None:
+        if text_config is None:
+            text_config_dict = SiglipTextConfig().to_dict()
+        else:
+            text_config_dict = dict(text_config)
+        if vision_config is None:
+            vision_config_dict = SiglipVisionConfig().to_dict()
+        else:
+            vision_config_dict = dict(vision_config)
+        resolved_volume_size = to_3tuple(volume_size, "volume_size")
+        resolved_patch_size = to_3tuple(patch_size, "patch_size")
+        if any(v <= 0 for v in resolved_volume_size):
+            raise ValueError(f"`volume_size` must contain positive integers. Got {resolved_volume_size}.")
+        if any(p <= 0 for p in resolved_patch_size):
+            raise ValueError(f"`patch_size` must contain positive integers. Got {resolved_patch_size}.")
+        if any(v % p != 0 for v, p in zip(resolved_volume_size, resolved_patch_size)):
+            raise ValueError(
+                f"`volume_size` must be divisible by `patch_size`. "
+                f"Got volume_size={resolved_volume_size}, patch_size={resolved_patch_size}."
+            )
+        vision_config_dict["num_channels"] = int(num_channels)
+        if projection_dim is None:
+            projection_dim = int(
+                text_config_dict.get(
+                    "projection_size",
+                    text_config_dict.get("hidden_size", vision_config_dict.get("hidden_size", 768)),
+                )
+            )
+        if auto_map is None:
+            # Keep module paths as `<module>.<Class>` for compatibility with HF dynamic loader.
+            auto_map = {
+                "AutoConfig": "configuration_brain_mri_siglip.BrainMRISiglipConfig",
+                "AutoModel": "modeling_brain_mri_siglip.BrainMRISiglipModel",
+                "AutoProcessor": "processing_brain_mri_siglip.BrainMRISiglipProcessor",
+            }
+        self.text_config = text_config_dict
+        self.vision_config = vision_config_dict
+        self.text_model_name_or_path = text_model_name_or_path
+        self.volume_size = list(resolved_volume_size)
+        self.patch_size = list(resolved_patch_size)
+        self.num_channels = int(num_channels)
+        self.projection_dim = int(projection_dim)
+        self.logit_scale_init_value = float(logit_scale_init_value)
+        self.logit_scale_min = float(logit_scale_min)
+        self.logit_bias_init_value = float(logit_bias_init_value)
+        self.logit_scale_max = float(logit_scale_max)
+        self.attn_implementation = attn_implementation
+        self.max_text_length = int(max_text_length)
+        self.initializer_range = float(initializer_range)
+        self.auto_map = dict(auto_map)
+        super().__init__(**kwargs)
+    def get_text_config(self, *args: Any, **kwargs: Any) -> SiglipTextConfig:
+        del args, kwargs
+        config = SiglipTextConfig(**self.text_config)
+        if self.attn_implementation:
+            config._attn_implementation = self.attn_implementation
+        elif getattr(config, "_attn_implementation", None) is None:
+            config._attn_implementation = "sdpa"
+        return config
+    def get_vision_config(self, *args: Any, **kwargs: Any) -> SiglipVisionConfig:
+        del args, kwargs
+        cfg_dict = dict(self.vision_config)
+        cfg_dict["num_channels"] = int(self.num_channels)
+        config = SiglipVisionConfig(**cfg_dict)
+        if self.attn_implementation:
+            config._attn_implementation = self.attn_implementation
+        elif getattr(config, "_attn_implementation", None) is None:
+            config._attn_implementation = "sdpa"
+        return config
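When `projection_dim` is left as `None`, the constructor above resolves it through a chain of fallbacks. The standalone sketch below isolates that lookup order so it can be checked without `transformers` installed. The function name `resolve_projection_dim` and the example dictionaries are hypothetical.

```python
from typing import Any, Mapping, Optional


def resolve_projection_dim(
    projection_dim: Optional[int],
    text_config: Mapping[str, Any],
    vision_config: Mapping[str, Any],
) -> int:
    """Mirror the fallback order used in BrainMRISiglipConfig.__init__."""
    if projection_dim is not None:
        return int(projection_dim)
    return int(
        text_config.get(
            "projection_size",
            text_config.get("hidden_size", vision_config.get("hidden_size", 768)),
        )
    )


# An explicit value always wins.
explicit = resolve_projection_dim(512, {"projection_size": 1152}, {})
# Otherwise the text tower's projection_size is preferred...
from_projection_size = resolve_projection_dim(None, {"projection_size": 1152, "hidden_size": 1024}, {})
# ...then the text hidden size, then the vision hidden size, then 768.
from_text_hidden = resolve_projection_dim(None, {"hidden_size": 1024}, {"hidden_size": 640})
from_vision_hidden = resolve_projection_dim(None, {}, {"hidden_size": 640})
default_value = resolve_projection_dim(None, {}, {})
```

In the published `config.json` both `projection_dim` and `text_config.projection_size` are 1152, so the explicit value and the first fallback agree.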

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:656edd47b6a98dfa950593e10cb0d8214b30e7eca4a4f725c69129f22aabf055
+size 3536557760
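The three lines above are a Git LFS pointer, not the weights themselves. As a quick sanity check, the sketch below parses the pointer and turns the byte size into a rough parameter estimate. The helper `parse_lfs_pointer` is hypothetical, and the estimate assumes every weight is stored as float32 (as the `"dtype": "float32"` entry in `config.json` suggests) and that the safetensors header is negligible.

```python
from typing import Dict

pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:656edd47b6a98dfa950593e10cb0d8214b30e7eca4a4f725c69129f22aabf055
size 3536557760
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


fields = parse_lfs_pointer(pointer_text)
size_bytes = int(fields["size"])
size_gib = size_bytes / 2**30

# Assumption: 4 bytes per parameter (float32) and a negligible file header.
approx_params = size_bytes // 4
```

Under those assumptions the file is about 3.29 GiB and holds roughly 0.88 billion parameters across both towers and the projection layers.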

modeling_brain_mri_siglip.py ADDED

	@@ -0,0 +1,615 @@

+"""Modeling code for Brain MRI SigLIP."""
+from __future__ import annotations
+import math
+from typing import Any, Mapping, Optional, Tuple
+import torch
+import torch.distributed as dist
+import torch.nn.functional as F
+from torch import nn
+from torch.distributed.nn.functional import all_gather as all_gather_with_grad
+from transformers import AutoConfig, AutoModel, PreTrainedModel
+from transformers.modeling_outputs import BaseModelOutputWithPooling
+from transformers.models.siglip import SiglipTextConfig, SiglipVisionConfig
+from transformers.models.siglip.modeling_siglip import (
+    SiglipAttention,
+    SiglipEncoder,
+    SiglipMLP,
+    SiglipMultiheadAttentionPoolingHead,
+    SiglipOutput,
+    SiglipTextModel,
+    default_flax_embed_init,
+)
+from .configuration_brain_mri_siglip import BrainMRISiglipConfig
+def _siglip_sigmoid_loss(logits_per_text: torch.Tensor) -> torch.Tensor:
+    eye = torch.eye(logits_per_text.size(0), device=logits_per_text.device, dtype=logits_per_text.dtype)
+    labels = -torch.ones_like(logits_per_text) + 2 * eye
+    loglik = F.logsigmoid(labels * logits_per_text)
+    nll = -torch.sum(loglik, dim=-1)
+    return nll.mean()
+def _lecun_normal_(tensor: torch.Tensor) -> torch.Tensor:
+    fan_in, _ = nn.init._calculate_fan_in_and_fan_out(tensor)
+    if fan_in <= 0:
+        return nn.init.normal_(tensor, mean=0.0, std=1.0)
+    return nn.init.normal_(tensor, mean=0.0, std=1.0 / math.sqrt(fan_in))
+def _siglip_embedding_init_(tensor: torch.Tensor) -> torch.Tensor:
+    default_flax_embed_init(tensor)
+    return tensor
+def _distributed_concat_with_grad(embeddings: torch.Tensor) -> torch.Tensor:
+    if not dist.is_available() or not dist.is_initialized():
+        return embeddings
+    world_size = dist.get_world_size()
+    local_batch = embeddings.shape[0]
+    local_batch_tensor = torch.tensor([local_batch], dtype=torch.long, device=embeddings.device)
+    batch_sizes = [torch.zeros_like(local_batch_tensor) for _ in range(world_size)]
+    dist.all_gather(batch_sizes, local_batch_tensor)
+    batch_sizes_int = [int(size.item()) for size in batch_sizes]
+    max_batch = max(batch_sizes_int)
+    if local_batch < max_batch:
+        pad_shape = (max_batch - local_batch, embeddings.shape[1])
+        padding = embeddings.new_zeros(pad_shape)
+        padded_embeddings = torch.cat([embeddings, padding], dim=0)
+    else:
+        padded_embeddings = embeddings
+    gathered = all_gather_with_grad(padded_embeddings)
+    if isinstance(gathered, torch.Tensor):
+        if gathered.ndim == 3 and gathered.shape[0] == world_size:
+            chunks = [gathered[rank] for rank in range(world_size)]
+        else:
+            chunks = list(torch.split(gathered, max_batch, dim=0))
+    else:
+        chunks = list(gathered)
+    trimmed = [chunk[: batch_sizes_int[rank]] for rank, chunk in enumerate(chunks) if batch_sizes_int[rank] > 0]
+    if not trimmed:
+        return embeddings.new_zeros((0, embeddings.shape[1]))
+    return torch.cat(trimmed, dim=0)
+def _load_state_dict_with_flexible_prefix(
+    module: nn.Module,
+    source_state_dict: Mapping[str, torch.Tensor],
+    strict: bool = True,
+) -> Tuple[Any, Any]:
+    target_keys = list(module.state_dict().keys())
+    source_keys = list(source_state_dict.keys())
+    if not target_keys or not source_keys:
+        return module.load_state_dict(source_state_dict, strict=strict)
+    target_has_text_model_prefix = all(key.startswith("text_model.") for key in target_keys)
+    source_has_text_model_prefix = all(key.startswith("text_model.") for key in source_keys)
+    aligned_state_dict = dict(source_state_dict)
+    if target_has_text_model_prefix and not source_has_text_model_prefix:
+        aligned_state_dict = {f"text_model.{key}": value for key, value in source_state_dict.items()}
+    elif source_has_text_model_prefix and not target_has_text_model_prefix:
+        aligned_state_dict = {
+            key[len("text_model.") :]: value for key, value in source_state_dict.items() if key.startswith("text_model.")
+        }
+    return module.load_state_dict(aligned_state_dict, strict=strict)
+class SiglipVisionEmbeddings3D(nn.Module):
+    """3D patch embeddings for MRI volumes."""
+    def __init__(
+        self,
+        vision_config: SiglipVisionConfig,
+        volume_size: Tuple[int, int, int],
+        patch_size: Tuple[int, int, int],
+        num_channels: int,
+    ) -> None:
+        super().__init__()
+        self.embed_dim = int(vision_config.hidden_size)
+        self.volume_size = tuple(int(v) for v in volume_size)
+        self.patch_size = tuple(int(v) for v in patch_size)
+        if any(v % p != 0 for v, p in zip(self.volume_size, self.patch_size)):
+            raise ValueError(
+                "Volume size must be divisible by patch size for all dimensions. "
+                f"Got volume_size={self.volume_size}, patch_size={self.patch_size}."
+            )
+        self.patch_embedding = nn.Conv3d(
+            in_channels=int(num_channels),
+            out_channels=self.embed_dim,
+            kernel_size=self.patch_size,
+            stride=self.patch_size,
+            padding=0,
+        )
+        patches_per_dim = tuple(v // p for v, p in zip(self.volume_size, self.patch_size))
+        self.grid_size = patches_per_dim
+        self.num_patches = int(patches_per_dim[0] * patches_per_dim[1] * patches_per_dim[2])
+        self.position_embedding = nn.Embedding(self.num_patches, self.embed_dim)
+        self.register_buffer("position_ids", torch.arange(self.num_patches).expand((1, -1)), persistent=False)
+    def _interpolate_position_embeddings(
+        self,
+        grid_size: Tuple[int, int, int],
+        target_dtype: torch.dtype,
+        target_device: torch.device,
+    ) -> torch.Tensor:
+        base_grid_depth, base_grid_height, base_grid_width = self.grid_size
+        position_embeddings = self.position_embedding.weight.reshape(
+            base_grid_depth,
+            base_grid_height,
+            base_grid_width,
+            self.embed_dim,
+        )
+        position_embeddings = position_embeddings.permute(3, 0, 1, 2).unsqueeze(0)
+        position_embeddings = F.interpolate(
+            position_embeddings,
+            size=grid_size,
+            mode="trilinear",
+            align_corners=False,
+        )
+        position_embeddings = position_embeddings.squeeze(0).permute(1, 2, 3, 0).reshape(1, -1, self.embed_dim)
+        return position_embeddings.to(dtype=target_dtype, device=target_device)
+    def _get_position_embeddings(
+        self,
+        grid_size: Tuple[int, int, int],
+        target_dtype: torch.dtype,
+        target_device: torch.device,
+        interpolate_pos_encoding: bool,
+    ) -> torch.Tensor:
+        num_patches = int(grid_size[0] * grid_size[1] * grid_size[2])
+        if num_patches == self.num_patches:
+            return self.position_embedding(self.position_ids).to(dtype=target_dtype, device=target_device)
+        if not interpolate_pos_encoding:
+            raise ValueError(
+                f"Unexpected number of patches: {num_patches} vs expected {self.num_patches}. "
+                "Enable `interpolate_pos_encoding=True` for variable volume sizes."
+            )
+        return self._interpolate_position_embeddings(grid_size, target_dtype=target_dtype, target_device=target_device)
+    def forward(self, pixel_values: torch.Tensor, interpolate_pos_encoding: bool = True) -> torch.Tensor:
+        if pixel_values.ndim != 5:
+            raise ValueError(
+                "`pixel_values` must have shape [batch, channels, depth, height, width]. "
+                f"Got shape {tuple(pixel_values.shape)}"
+            )
+        spatial_shape = tuple(int(v) for v in pixel_values.shape[-3:])
+        if any(dim % patch != 0 for dim, patch in zip(spatial_shape, self.patch_size)):
+            raise ValueError(
+                f"Input spatial size {spatial_shape} must be divisible by patch_size {self.patch_size}."
+            )
+        target_dtype = self.patch_embedding.weight.dtype
+        embeddings = self.patch_embedding(pixel_values.to(dtype=target_dtype))
+        grid_size = tuple(int(v) for v in embeddings.shape[-3:])
+        embeddings = embeddings.flatten(2).transpose(1, 2)
+        position_embeddings = self._get_position_embeddings(
+            grid_size=grid_size,
+            target_dtype=embeddings.dtype,
+            target_device=embeddings.device,
+            interpolate_pos_encoding=interpolate_pos_encoding,
+        )
+        return embeddings + position_embeddings
+class BrainMRISiglipVisionTransformer(nn.Module):
+    """SigLIP vision tower with 3D embeddings."""
+    def __init__(self, config: BrainMRISiglipConfig) -> None:
+        super().__init__()
+        vision_config = config.get_vision_config()
+        volume_size = tuple(int(v) for v in config.volume_size)
+        patch_size = tuple(int(v) for v in config.patch_size)
+        self.embeddings = SiglipVisionEmbeddings3D(
+            vision_config=vision_config,
+            volume_size=volume_size,
+            patch_size=patch_size,
+            num_channels=int(config.num_channels),
+        )
+        self.encoder = SiglipEncoder(vision_config)
+        self.post_layernorm = nn.LayerNorm(vision_config.hidden_size, eps=vision_config.layer_norm_eps)
+        self.head = SiglipMultiheadAttentionPoolingHead(vision_config)
+    def forward(
+        self,
+        pixel_values: torch.Tensor,
+        interpolate_pos_encoding: bool = True,
+        **kwargs: Any,
+    ) -> BaseModelOutputWithPooling:
+        hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
+        encoder_outputs = self.encoder(inputs_embeds=hidden_states, **kwargs)
+        last_hidden_state = self.post_layernorm(encoder_outputs.last_hidden_state)
+        pooler_output = self.head(last_hidden_state)
+        return BaseModelOutputWithPooling(last_hidden_state=last_hidden_state, pooler_output=pooler_output)
+class BrainMRISiglipPreTrainedModel(PreTrainedModel):
+    config_class = BrainMRISiglipConfig
+    base_model_prefix = "brain_mri_siglip"
+    supports_gradient_checkpointing = True
+    def _init_weights(self, module: nn.Module) -> None:
+        if isinstance(module, SiglipVisionEmbeddings3D):
+            width = int(self.config.get_vision_config().hidden_size)
+            nn.init.normal_(module.position_embedding.weight, std=1.0 / math.sqrt(width))
+            _lecun_normal_(module.patch_embedding.weight)
+            if module.patch_embedding.bias is not None:
+                nn.init.zeros_(module.patch_embedding.bias)
+            return
+        if isinstance(module, nn.Embedding):
+            _siglip_embedding_init_(module.weight)
+            return
+        if isinstance(module, SiglipAttention):
+            nn.init.xavier_uniform_(module.q_proj.weight)
+            nn.init.xavier_uniform_(module.k_proj.weight)
+            nn.init.xavier_uniform_(module.v_proj.weight)
+            nn.init.xavier_uniform_(module.out_proj.weight)
+            if module.q_proj.bias is not None:
+                nn.init.zeros_(module.q_proj.bias)
+            if module.k_proj.bias is not None:
+                nn.init.zeros_(module.k_proj.bias)
+            if module.v_proj.bias is not None:
+                nn.init.zeros_(module.v_proj.bias)
+            if module.out_proj.bias is not None:
+                nn.init.zeros_(module.out_proj.bias)
+            return
+        if isinstance(module, SiglipMLP):
+            nn.init.xavier_uniform_(module.fc1.weight)
+            nn.init.xavier_uniform_(module.fc2.weight)
+            if module.fc1.bias is not None:
+                nn.init.normal_(module.fc1.bias, std=1e-6)
+            if module.fc2.bias is not None:
+                nn.init.normal_(module.fc2.bias, std=1e-6)
+            return
+        if isinstance(module, SiglipMultiheadAttentionPoolingHead):
+            nn.init.xavier_uniform_(module.probe)
+            nn.init.xavier_uniform_(module.attention.in_proj_weight)
+            if module.attention.in_proj_bias is not None:
+                nn.init.zeros_(module.attention.in_proj_bias)
+            return
+        if isinstance(module, (nn.Linear, nn.Conv3d)):
+            _lecun_normal_(module.weight)
+            if module.bias is not None:
+                nn.init.zeros_(module.bias)
+            return
+        if isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+            return
+class BrainMRISiglipModel(BrainMRISiglipPreTrainedModel):
+    """3D MRI + text dual-encoder model with SigLIP contrastive loss."""
+    def __init__(self, config: BrainMRISiglipConfig) -> None:
+        super().__init__(config)
+        self.text_config = config.get_text_config()
+        self.vision_config = config.get_vision_config()
+        self.text_model = SiglipTextModel(self.text_config)
+        self.vision_model = BrainMRISiglipVisionTransformer(config)
+        projection_dim = int(config.projection_dim)
+        self.visual_projection = nn.Linear(self.vision_config.hidden_size, projection_dim, bias=False)
+        self.text_projection = nn.Linear(self.text_config.hidden_size, projection_dim, bias=False)
+        self.logit_scale = nn.Parameter(torch.tensor(float(config.logit_scale_init_value)))
+        self.logit_bias = nn.Parameter(torch.tensor(float(config.logit_bias_init_value)))
+        self.post_init()
+    @classmethod
+    def from_medsiglip_pretrained(
+        cls,
+        text_model_name_or_path: str = "google/medsiglip-448",
+        trust_remote_code: bool = True,
+        local_files_only: bool = False,
+        **kwargs: Any,
+    ) -> "BrainMRISiglipModel":
+        base_config = AutoConfig.from_pretrained(
+            text_model_name_or_path,
+            trust_remote_code=trust_remote_code,
+            local_files_only=local_files_only,
+        )
+        if hasattr(base_config, "text_config"):
+            raw_text_config = base_config.text_config
+            text_config = raw_text_config.to_dict() if hasattr(raw_text_config, "to_dict") else dict(raw_text_config)
+        else:
+            text_config = SiglipTextConfig().to_dict()
+        if hasattr(base_config, "vision_config"):
+            raw_vision_config = base_config.vision_config
+            vision_config = (
+                raw_vision_config.to_dict()
+                if hasattr(raw_vision_config, "to_dict")
+                else dict(raw_vision_config)
+            )
+        else:
+            vision_config = SiglipVisionConfig().to_dict()
+        projection_dim = kwargs.pop(
+            "projection_dim",
+            int(getattr(base_config, "projection_dim", text_config.get("projection_size", text_config["hidden_size"]))),
+        )
+        config = BrainMRISiglipConfig(
+            text_config=text_config,
+            vision_config=vision_config,
+            projection_dim=projection_dim,
+            text_model_name_or_path=text_model_name_or_path,
+            **kwargs,
+        )
+        model = cls(config)
+        model.load_text_tower_from_pretrained(
+            text_model_name_or_path,
+            trust_remote_code=trust_remote_code,
+            local_files_only=local_files_only,
+        )
+        return model
+    def load_text_tower_from_pretrained(
+        self,
+        text_model_name_or_path: str,
+        trust_remote_code: bool = True,
+        local_files_only: bool = False,
+        strict: bool = True,
+    ) -> Tuple[Any, Any]:
+        source_model = None
+        try:
+            source_model = AutoModel.from_pretrained(
+                text_model_name_or_path,
+                trust_remote_code=trust_remote_code,
+                local_files_only=local_files_only,
+            )
+            if hasattr(source_model, "text_model"):
+                source_text_model = source_model.text_model
+            elif isinstance(source_model, SiglipTextModel):
+                source_text_model = source_model
+            else:
+                raise ValueError(
+                    f"Could not find a SigLIP text tower in `{text_model_name_or_path}` "
+                    f"({type(source_model).__name__})."
+                )
+            missing, unexpected = _load_state_dict_with_flexible_prefix(
+                self.text_model,
+                source_text_model.state_dict(),
+                strict=strict,
+            )
+            if hasattr(source_model, "text_projection") and isinstance(source_model.text_projection, nn.Linear):
+                if source_model.text_projection.weight.shape == self.text_projection.weight.shape:
+                    self.text_projection.load_state_dict(source_model.text_projection.state_dict())
+            if hasattr(source_model, "logit_scale") and source_model.logit_scale.shape == self.logit_scale.shape:
+                self.logit_scale.data.copy_(source_model.logit_scale.data)
+            if hasattr(source_model, "logit_bias") and source_model.logit_bias.shape == self.logit_bias.shape:
+                self.logit_bias.data.copy_(source_model.logit_bias.data)
+            return missing, unexpected
+        finally:
+            if source_model is not None:
+                del source_model
+    def freeze_text_tower(self, trainable_layers: int = 0) -> None:
+        for parameter in self.text_model.parameters():
+            parameter.requires_grad = False
+        trainable_layers = int(trainable_layers)
+        if trainable_layers > 0 and hasattr(self.text_model, "text_model") and hasattr(
+            self.text_model.text_model, "encoder"
+        ):
+            layers = self.text_model.text_model.encoder.layers
+            for layer in layers[-trainable_layers:]:
+                for parameter in layer.parameters():
+                    parameter.requires_grad = True
+            for module_name in ("final_layer_norm", "head"):
+                if hasattr(self.text_model.text_model, module_name):
+                    for parameter in getattr(self.text_model.text_model, module_name).parameters():
+                        parameter.requires_grad = True
+        for parameter in self.text_projection.parameters():
+            parameter.requires_grad = True
+    def freeze_vision_tower(self, trainable_layers: int = 0, train_embeddings: bool = False) -> None:
+        for parameter in self.vision_model.parameters():
+            parameter.requires_grad = False
+        if train_embeddings:
+            for parameter in self.vision_model.embeddings.parameters():
+                parameter.requires_grad = True
+        trainable_layers = int(trainable_layers)
+        if trainable_layers > 0:
+            layers = self.vision_model.encoder.layers
+            for layer in layers[-trainable_layers:]:
+                for parameter in layer.parameters():
+                    parameter.requires_grad = True
+            for parameter in self.vision_model.post_layernorm.parameters():
+                parameter.requires_grad = True
+            for parameter in self.vision_model.head.parameters():
+                parameter.requires_grad = True
+        for parameter in self.visual_projection.parameters():
+            parameter.requires_grad = True
+    def get_text_features(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        text_kwargs: Optional[Mapping[str, Any]] = None,
+        **kwargs: Any,
+    ) -> torch.FloatTensor:
+        kwargs = dict(kwargs)
+        nested_text_kwargs = kwargs.pop("text_kwargs", None)
+        if kwargs:
+            raise TypeError(f"Unexpected keyword arguments for text tower: {sorted(kwargs.keys())}")
+        merged_text_kwargs: dict[str, Any] = {}
+        if nested_text_kwargs:
+            merged_text_kwargs.update(dict(nested_text_kwargs))
+        if text_kwargs:
+            merged_text_kwargs.update(dict(text_kwargs))
+        text_outputs = self.text_model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            **merged_text_kwargs,
+        )
+        text_features = self.text_projection(text_outputs.pooler_output)
+        return F.normalize(text_features, dim=-1)
+    def get_image_features(
+        self,
+        pixel_values: torch.FloatTensor,
+        interpolate_pos_encoding: bool = True,
+        vision_kwargs: Optional[Mapping[str, Any]] = None,
+        **kwargs: Any,
+    ) -> torch.FloatTensor:
+        kwargs = dict(kwargs)
+        nested_vision_kwargs = kwargs.pop("vision_kwargs", None)
+        legacy_interpolate_pos_encoding = kwargs.pop("interpolate_pos_encoding", None)
+        if kwargs:
+            raise TypeError(f"Unexpected keyword arguments for vision tower: {sorted(kwargs.keys())}")
+        merged_vision_kwargs: dict[str, Any] = {}
+        if nested_vision_kwargs:
+            merged_vision_kwargs.update(dict(nested_vision_kwargs))
+        if vision_kwargs:
+            merged_vision_kwargs.update(dict(vision_kwargs))
+        if legacy_interpolate_pos_encoding is not None:
+            interpolate_pos_encoding = bool(legacy_interpolate_pos_encoding)
+        vision_outputs = self.vision_model(
+            pixel_values=pixel_values,
+            interpolate_pos_encoding=interpolate_pos_encoding,
+            **merged_vision_kwargs,
+        )
+        image_features = self.visual_projection(vision_outputs.pooler_output)
+        return F.normalize(image_features, dim=-1)
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        return_loss: Optional[bool] = None,
+        gather_loss: bool = False,
+        interpolate_pos_encoding: bool = True,
+        vision_kwargs: Optional[Mapping[str, Any]] = None,
+        text_kwargs: Optional[Mapping[str, Any]] = None,
+        return_dict: Optional[bool] = None,
+        **kwargs: Any,
+    ) -> SiglipOutput:
+        if pixel_values is None:
+            raise ValueError("`pixel_values` must be provided.")
+        if input_ids is None:
+            raise ValueError("`input_ids` must be provided.")
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        return_loss = bool(return_loss) if return_loss is not None else False
+        kwargs = dict(kwargs)
+        nested_vision_kwargs = kwargs.pop("vision_kwargs", None)
+        nested_text_kwargs = kwargs.pop("text_kwargs", None)
+        legacy_interpolate_pos_encoding = kwargs.pop("interpolate_pos_encoding", None)
+        if kwargs:
+            raise TypeError(f"Unexpected keyword arguments in model.forward: {sorted(kwargs.keys())}")
+        merged_vision_kwargs: dict[str, Any] = {}
+        merged_text_kwargs: dict[str, Any] = {}
+        if nested_vision_kwargs:
+            merged_vision_kwargs.update(dict(nested_vision_kwargs))
+        if vision_kwargs:
+            merged_vision_kwargs.update(dict(vision_kwargs))
+        if nested_text_kwargs:
+            merged_text_kwargs.update(dict(nested_text_kwargs))
+        if text_kwargs:
+            merged_text_kwargs.update(dict(text_kwargs))
+        if legacy_interpolate_pos_encoding is not None:
+            interpolate_pos_encoding = bool(legacy_interpolate_pos_encoding)
+        vision_outputs = self.vision_model(
+            pixel_values=pixel_values,
+            interpolate_pos_encoding=interpolate_pos_encoding,
+            **merged_vision_kwargs,
+        )
+        text_outputs = self.text_model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            **merged_text_kwargs,
+        )
+        image_embeds = self.visual_projection(vision_outputs.pooler_output)
+        text_embeds = self.text_projection(text_outputs.pooler_output)
+        image_embeds = F.normalize(image_embeds, p=2, dim=-1)
+        text_embeds = F.normalize(text_embeds, p=2, dim=-1)
+        image_embeds_for_loss = image_embeds
+        text_embeds_for_loss = text_embeds
+        if gather_loss and return_loss:
+            image_embeds_for_loss = _distributed_concat_with_grad(image_embeds)
+            text_embeds_for_loss = _distributed_concat_with_grad(text_embeds)
+        logit_scale = self.logit_scale.exp().clamp(
+            min=float(self.config.logit_scale_min),
+            max=float(self.config.logit_scale_max),
+        )
+        local_logits_per_text = torch.matmul(
+            text_embeds,
+            image_embeds.t().to(text_embeds.device),
+        )
+        local_logits_per_text = local_logits_per_text * logit_scale + self.logit_bias
+        local_logits_per_image = local_logits_per_text.t()
+        loss = None
+        if return_loss:
+            loss_logits_per_text = torch.matmul(
+                text_embeds_for_loss,
+                image_embeds_for_loss.t().to(text_embeds_for_loss.device),
+            )
+            loss_logits_per_text = loss_logits_per_text * logit_scale + self.logit_bias
+            loss = _siglip_sigmoid_loss(loss_logits_per_text)
+        if not return_dict:
+            output = (
+                local_logits_per_image,
+                local_logits_per_text,
+                text_embeds,
+                image_embeds,
+                text_outputs,
+                vision_outputs,
+            )
+            return ((loss,) + output) if loss is not None else output
+        return SiglipOutput(
+            loss=loss,
+            logits_per_image=local_logits_per_image,
+            logits_per_text=local_logits_per_text,
+            text_embeds=text_embeds,
+            image_embeds=image_embeds,
+            text_model_output=text_outputs,
+            vision_model_output=vision_outputs,
+        )
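
Review note on the loss path in `forward` above: the scaled and biased text-to-image logits are handed to `_siglip_sigmoid_loss`, whose body is defined earlier in the file and is not visible in this part of the diff. The sketch below assumes that helper follows the standard SigLIP pairwise sigmoid objective (label +1 on the diagonal, -1 elsewhere, summed per row and averaged over rows). The names `log_sigmoid`, `scaled_logits`, and `siglip_sigmoid_loss`, and the scale and bias values, are illustrative and not taken from the checkpoint. It is written in plain Python rather than torch so it runs without dependencies.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def scaled_logits(similarity, log_scale, bias, scale_min, scale_max):
    """Mirror `sim * clamp(exp(logit_scale), min, max) + logit_bias` from forward()."""
    scale = min(max(math.exp(log_scale), scale_min), scale_max)
    return [[value * scale + bias for value in row] for row in similarity]


def siglip_sigmoid_loss(logits_per_text):
    """Assumed form of the loss: +1 labels on the diagonal, -1 elsewhere."""
    n = len(logits_per_text)
    total = 0.0
    for i, row in enumerate(logits_per_text):
        for j, logit in enumerate(row):
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    # Sum over each row, then average over rows (texts).
    return total / n


# Cosine similarities for two matched text/volume pairs (made-up numbers).
similarity = [[0.9, 0.1], [0.2, 0.8]]
logits = scaled_logits(
    similarity,
    log_scale=math.log(10.0),  # hypothetical learned log-scale
    bias=-5.0,                 # hypothetical learned bias
    scale_min=1.0,
    scale_max=100.0,
)
loss = siglip_sigmoid_loss(logits)
```

When `gather_loss=True` and `return_loss=True`, the source computes this loss over embeddings concatenated across ranks, while the returned `logits_per_image` and `logits_per_text` are still built from the local batch only.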

offline_aligned_preprocessing.py ADDED

	@@ -0,0 +1,286 @@

+"""Shared offline-aligned preprocessing helpers for 3D brain MRI volumes."""
+from __future__ import annotations
+import math
+from pathlib import Path
+from typing import Any, Mapping
+import nibabel as nib
+import numpy as np
+import torch
+import torch.nn.functional as F
+try:
+    from scipy import ndimage as scipy_ndimage
+except Exception:  # pragma: no cover - optional import surface
+    scipy_ndimage = None
+TARGET_SHAPE = (128, 192, 192)
+TARGET_SPACING = (1.25, 1.0, 1.0)
+CROP_MARGIN_MM = 5.0
+FOREGROUND_THRESHOLD = 1e-3
+BACKGROUND_VALUE = -1.0
+FOREGROUND_STRATEGY = "largest_component_nonzero"
+GENERIC_RECIPE_ID = "generic_foreground_128x192x192_fp16_v1"
+GENERIC_CACHE_VERSION = 1
+def load_canonical_nifti(path: str | Path):
+    return nib.as_closest_canonical(nib.load(str(path)))
+def load_image_spacing(image) -> tuple[float, float, float]:
+    zooms = image.header.get_zooms()[:3]
+    if len(zooms) != 3:
+        raise ValueError(f"Expected a 3D image spacing tuple, got {zooms}.")
+    return tuple(float(value) for value in zooms)
+def coerce_volume_to_3d(volume: np.ndarray) -> np.ndarray:
+    if volume.ndim == 3:
+        return volume.astype(np.float32, copy=False)
+    if volume.ndim != 4:
+        raise ValueError(f"Expected a 3D or 4D volume, got shape {volume.shape}.")
+    if volume.shape[0] <= 4 and volume.shape[-1] > 4:
+        selected = volume[0]
+    else:
+        selected = volume[..., 0]
+    return np.asarray(selected, dtype=np.float32)
+def largest_connected_component(mask: np.ndarray) -> np.ndarray:
+    if not mask.any() or scipy_ndimage is None:
+        return mask
+    structure = scipy_ndimage.generate_binary_structure(mask.ndim, 1)
+    labels, num_labels = scipy_ndimage.label(mask, structure=structure)
+    if num_labels <= 1:
+        return mask
+    counts = np.bincount(labels.reshape(-1))
+    if counts.size <= 1:
+        return mask
+    counts[0] = 0
+    winning_label = int(counts.argmax())
+    if winning_label <= 0 or counts[winning_label] <= 0:
+        return mask
+    return labels == winning_label
+def build_foreground_mask(volume: np.ndarray, threshold: float = FOREGROUND_THRESHOLD) -> np.ndarray:
+    sanitized = np.nan_to_num(volume, nan=0.0, posinf=0.0, neginf=0.0)
+    raw_mask = np.abs(sanitized) > float(threshold)
+    if not raw_mask.any():
+        return np.ones_like(sanitized, dtype=bool)
+    component_mask = largest_connected_component(raw_mask)
+    component_count = int(component_mask.sum())
+    raw_count = int(raw_mask.sum())
+    if component_count <= 0:
+        return raw_mask
+    if component_count < 512 and raw_count > component_count:
+        return raw_mask
+    return component_mask
+def compute_crop_bbox(
+    mask: np.ndarray,
+    spacing: tuple[float, float, float],
+    margin_mm: float = CROP_MARGIN_MM,
+) -> tuple[tuple[int, int], ...]:
+    coords = np.where(mask)
+    if coords[0].size == 0:
+        raise ValueError("Foreground mask contains no positive voxels after selection.")
+    bbox = []
+    for axis, values in enumerate(coords):
+        margin_voxels = int(math.ceil(float(margin_mm) / float(spacing[axis])))
+        start = max(0, int(values.min()) - margin_voxels)
+        stop = min(mask.shape[axis], int(values.max()) + margin_voxels + 1)
+        bbox.append((start, stop))
+    return tuple(bbox)
+def crop_volume_and_mask(
+    volume: np.ndarray,
+    mask: np.ndarray,
+    spacing: tuple[float, float, float],
+    margin_mm: float = CROP_MARGIN_MM,
+) -> tuple[np.ndarray, np.ndarray, tuple[tuple[int, int], ...]]:
+    bbox = compute_crop_bbox(mask, spacing, margin_mm=margin_mm)
+    slices = tuple(slice(start, stop) for start, stop in bbox)
+    return volume[slices], mask[slices], bbox
+def normalize_foreground_only(volume: np.ndarray, mask: np.ndarray) -> np.ndarray:
+    sanitized = np.nan_to_num(volume, nan=0.0, posinf=0.0, neginf=0.0).astype(np.float32, copy=False)
+    foreground_values = sanitized[mask]
+    if foreground_values.size == 0:
+        raise ValueError("Cannot normalize volume because the foreground mask is empty.")
+    if foreground_values.size > 1_000_000:
+        step = max(1, foreground_values.size // 1_000_000)
+        foreground_values = foreground_values[::step]
+    low, high = np.percentile(foreground_values, [0.5, 99.5])
+    if not np.isfinite(low) or not np.isfinite(high) or high <= low:
+        normalized = np.zeros_like(sanitized, dtype=np.float32)
+    else:
+        normalized = np.clip(sanitized, float(low), float(high))
+        normalized = np.clip((normalized - float(low)) / float(high - low), 0.0, 1.0)
+        normalized = normalized * 2.0 - 1.0
+    return normalized.astype(np.float32, copy=False)
+def resize_volume(volume: np.ndarray, size: tuple[int, int, int], mode: str) -> np.ndarray:
+    tensor = torch.from_numpy(volume).unsqueeze(0).unsqueeze(0)
+    kwargs = {}
+    if mode in {"linear", "bilinear", "bicubic", "trilinear"}:
+        kwargs["align_corners"] = False
+    tensor = F.interpolate(tensor, size=size, mode=mode, **kwargs)
+    return tensor.squeeze(0).squeeze(0).cpu().numpy().astype(np.float32, copy=False)
+def resize_mask(mask: np.ndarray, size: tuple[int, int, int]) -> np.ndarray:
+    tensor = torch.from_numpy(mask.astype(np.float32, copy=False)).unsqueeze(0).unsqueeze(0)
+    tensor = F.interpolate(tensor, size=size, mode="nearest")
+    return tensor.squeeze(0).squeeze(0).cpu().numpy() > 0.5
+def resample_to_target_spacing(
+    volume: np.ndarray,
+    mask: np.ndarray,
+    source_spacing: tuple[float, float, float],
+    target_spacing: tuple[float, float, float] = TARGET_SPACING,
+) -> tuple[np.ndarray, np.ndarray]:
+    target_shape = []
+    for current_size, src, dst in zip(volume.shape, source_spacing, target_spacing):
+        target_shape.append(max(1, int(round(float(current_size) * float(src) / float(dst)))))
+    target_shape_tuple = tuple(target_shape)
+    if target_shape_tuple == tuple(int(v) for v in volume.shape):
+        return volume.astype(np.float32, copy=False), mask
+    return (
+        resize_volume(volume, target_shape_tuple, mode="trilinear"),
+        resize_mask(mask, target_shape_tuple),
+    )
+def downscale_to_fit(
+    volume: np.ndarray,
+    mask: np.ndarray,
+    target_shape: tuple[int, int, int] = TARGET_SHAPE,
+) -> tuple[np.ndarray, np.ndarray]:
+    current_shape = tuple(int(v) for v in volume.shape)
+    if all(current <= target for current, target in zip(current_shape, target_shape)):
+        return volume, mask
+    scale = min(float(target) / float(current) for current, target in zip(current_shape, target_shape))
+    if scale >= 1.0:
+        return volume, mask
+    new_shape = tuple(
+        min(target, max(1, int(math.floor(float(current) * scale))))
+        for current, target in zip(current_shape, target_shape)
+    )
+    return (
+        resize_volume(volume, new_shape, mode="trilinear"),
+        resize_mask(mask, new_shape),
+    )
+def center_pad(
+    array: np.ndarray,
+    target_shape: tuple[int, int, int] = TARGET_SHAPE,
+    fill_value: float = BACKGROUND_VALUE,
+) -> np.ndarray:
+    if any(current > target for current, target in zip(array.shape, target_shape)):
+        raise ValueError(f"Cannot center-pad shape {array.shape} into smaller target {target_shape}.")
+    pad_width = []
+    for current, target in zip(array.shape, target_shape):
+        delta = target - current
+        before = delta // 2
+        after = delta - before
+        pad_width.append((before, after))
+    return np.pad(array, pad_width=tuple(pad_width), mode="constant", constant_values=fill_value)
+def preprocess_image_with_foreground_mask(
+    image_path: str | Path,
+    *,
+    target_shape: tuple[int, int, int] = TARGET_SHAPE,
+    target_spacing: tuple[float, float, float] = TARGET_SPACING,
+    crop_margin_mm: float = CROP_MARGIN_MM,
+    foreground_threshold: float = FOREGROUND_THRESHOLD,
+    background_value: float = BACKGROUND_VALUE,
+    foreground_strategy: str = FOREGROUND_STRATEGY,
+    recipe_id: str = GENERIC_RECIPE_ID,
+    cache_version: int = GENERIC_CACHE_VERSION,
+) -> dict[str, object]:
+    image_path = Path(image_path)
+    image = load_canonical_nifti(image_path)
+    source_shape = tuple(int(value) for value in image.shape)
+    source_spacing = load_image_spacing(image)
+    volume = np.asarray(image.get_fdata(dtype=np.float32), dtype=np.float32)
+    volume = coerce_volume_to_3d(volume)
+    foreground_mask = build_foreground_mask(volume, threshold=foreground_threshold)
+    cropped_volume, cropped_mask, crop_bbox = crop_volume_and_mask(
+        volume,
+        foreground_mask,
+        source_spacing,
+        margin_mm=crop_margin_mm,
+    )
+    normalized_volume = normalize_foreground_only(cropped_volume, cropped_mask)
+    resampled_volume, resampled_mask = resample_to_target_spacing(
+        normalized_volume,
+        cropped_mask,
+        source_spacing=source_spacing,
+        target_spacing=target_spacing,
+    )
+    fitted_volume, fitted_mask = downscale_to_fit(
+        resampled_volume,
+        resampled_mask,
+        target_shape=target_shape,
+    )
+    fitted_volume = np.clip(fitted_volume, -1.0, 1.0).astype(np.float32, copy=False)
+    fitted_volume[~fitted_mask] = float(background_value)
+    padded_volume = center_pad(
+        fitted_volume,
+        target_shape=target_shape,
+        fill_value=float(background_value),
+    ).astype(np.float32, copy=False)
+    pixel_values = torch.from_numpy(padded_volume).unsqueeze(0).to(dtype=torch.float16).contiguous()
+    return {
+        "pixel_values": pixel_values,
+        "source_image": str(image_path),
+        "source_shape": list(source_shape),
+        "source_spacing": list(source_spacing),
+        "crop_bbox": [[int(start), int(stop)] for start, stop in crop_bbox],
+        "foreground_strategy": foreground_strategy,
+        "recipe_id": recipe_id,
+        "cache_version": int(cache_version),
+    }
+def validate_fixed_payload(
+    payload: Mapping[str, Any],
+    *,
+    target_shape: tuple[int, int, int] = TARGET_SHAPE,
+) -> None:
+    pixel_values = payload.get("pixel_values")
+    if not isinstance(pixel_values, torch.Tensor):
+        raise TypeError("`pixel_values` must be a torch.Tensor.")
+    expected_shape = (1,) + tuple(target_shape)
+    if tuple(pixel_values.shape) != expected_shape:
+        raise ValueError(f"Expected tensor shape {expected_shape}, got {tuple(pixel_values.shape)}.")
+    if pixel_values.dtype != torch.float16:
+        raise ValueError(f"Expected tensor dtype torch.float16, got {pixel_values.dtype}.")
+    if not torch.isfinite(pixel_values).all():
+        raise ValueError("Tensor contains non-finite values.")
+    min_value = float(pixel_values.min().item())
+    max_value = float(pixel_values.max().item())
+    if min_value < -1.01 or max_value > 1.01:
+        raise ValueError(f"Expected tensor values in [-1, 1]. Got min={min_value}, max={max_value}.")
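
Review note on the shape arithmetic in `preprocess_image_with_foreground_mask`: the pipeline crops with a millimetre margin, normalizes, resamples to the target spacing, shrinks only if the result is too large, and then center-pads. The sketch below traces just the shapes through those steps, using the same formulas as `compute_crop_bbox`, `resample_to_target_spacing`, `downscale_to_fit`, and `center_pad`. The helper names (`margin_voxels`, `resampled_shape`, `fitted_shape`, `pad_widths`) and the example volume are hypothetical; no voxel data is touched.

```python
import math

TARGET_SHAPE = (128, 192, 192)
TARGET_SPACING = (1.25, 1.0, 1.0)


def margin_voxels(spacing, margin_mm=5.0):
    """Per-axis crop margin in voxels for a margin given in millimetres."""
    return tuple(math.ceil(margin_mm / s) for s in spacing)


def resampled_shape(shape, source_spacing, target_spacing=TARGET_SPACING):
    """Shape after resampling so the physical extent is preserved."""
    return tuple(
        max(1, int(round(n * src / dst)))
        for n, src, dst in zip(shape, source_spacing, target_spacing)
    )


def fitted_shape(shape, target_shape=TARGET_SHAPE):
    """Shrink isotropically only when some axis exceeds the target."""
    if all(c <= t for c, t in zip(shape, target_shape)):
        return tuple(shape)
    scale = min(t / c for c, t in zip(shape, target_shape))
    return tuple(
        min(t, max(1, math.floor(c * scale)))
        for c, t in zip(shape, target_shape)
    )


def pad_widths(shape, target_shape=TARGET_SHAPE):
    """Symmetric padding; an odd remainder goes to the trailing side."""
    widths = []
    for c, t in zip(shape, target_shape):
        delta = t - c
        widths.append((delta // 2, delta - delta // 2))
    return tuple(widths)


# Hypothetical cropped volume: 160 x 200 x 180 voxels at 1 mm isotropic.
cropped = (160, 200, 180)
resampled = resampled_shape(cropped, (1.0, 1.0, 1.0))
fitted = fitted_shape(resampled)
pads = pad_widths(fitted)
```

For this example the second axis (200 voxels) is the one that forces the downscale, so the other two axes end up smaller than the target and are filled with the background value on both sides.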

preprocessor_config.json ADDED

	@@ -0,0 +1,52 @@

+{
+  "canonicalize_orientation": true,
+  "clip_percentiles": [
+    0.5,
+    99.5
+  ],
+  "crop_margin": 4,
+  "do_clip": true,
+  "do_crop_foreground": true,
+  "do_normalize": true,
+  "effective_pad_value": -1.0,
+  "foreground_threshold": 0.001,
+  "image_processor_type": "BrainMRISiglipVolumeProcessor",
+  "interpolation_mode": "trilinear",
+  "max_channel_dim": 4,
+  "output_range": [
+    -1.0,
+    1.0
+  ],
+  "pad_value": null,
+  "path_background_value": -1.0,
+  "path_crop_margin_mm": 5.0,
+  "path_foreground_strategy": "largest_component_nonzero",
+  "path_foreground_threshold": 0.001,
+  "path_generic_cache_version": 1,
+  "path_generic_recipe_id": "generic_foreground_128x192x192_fp16_v1",
+  "path_recipe_mode": "auto",
+  "path_target_shape": [
+    128,
+    192,
+    192
+  ],
+  "path_target_spacing": [
+    1.25,
+    1.0,
+    1.0
+  ],
+  "prefer_nibabel_resample": false,
+  "resize_strategy": "pad_or_crop",
+  "spacing": [
+    1.25,
+    1.0,
+    1.0
+  ],
+  "spacing_tolerance": 0.001,
+  "use_foreground_intensity_stats": true,
+  "volume_size": [
+    128,
+    192,
+    192
+  ]
+}
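
Review note on the config above: it carries two parallel groups of keys, the unprefixed ones (`volume_size`, `spacing`, `effective_pad_value`) and the `path_*` ones (`path_target_shape`, `path_target_spacing`, `path_background_value`). Reading the unprefixed group as the settings for in-memory inputs and the `path_*` group as the settings for the file-path recipe is an assumption based on the key names and the constructor arguments. The sketch below checks that the two groups agree and derives the physical field of view. `find_mismatches` is a hypothetical helper, and `CONFIG_TEXT` repeats only the keys it needs. `crop_margin` and `path_crop_margin_mm` are left out because one is an integer count and the other is in millimetres, so they are not directly comparable.

```python
import json

# Subset of preprocessor_config.json, copied from the diff above.
CONFIG_TEXT = """
{
  "effective_pad_value": -1.0,
  "output_range": [-1.0, 1.0],
  "path_background_value": -1.0,
  "path_target_shape": [128, 192, 192],
  "path_target_spacing": [1.25, 1.0, 1.0],
  "spacing": [1.25, 1.0, 1.0],
  "volume_size": [128, 192, 192]
}
"""

# Key pairs that are expected to agree under the assumption stated above.
PAIRED_KEYS = [
    ("volume_size", "path_target_shape"),
    ("spacing", "path_target_spacing"),
    ("effective_pad_value", "path_background_value"),
]


def find_mismatches(config):
    """Return the key pairs whose values differ."""
    return [(a, b) for a, b in PAIRED_KEYS if config[a] != config[b]]


config = json.loads(CONFIG_TEXT)
mismatches = find_mismatches(config)

# Physical extent covered by the target grid, in millimetres per axis.
fov_mm = [n * s for n, s in zip(config["volume_size"], config["spacing"])]

# The pad value should sit at the low end of the output range.
pad_is_range_floor = config["effective_pad_value"] == config["output_range"][0]
```

With the values in this commit the two groups agree, and the target grid spans 160 mm along the first axis and 192 mm along the other two.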

processing_brain_mri_siglip.py ADDED

	@@ -0,0 +1,680 @@

+"""Processor code for Brain MRI SigLIP."""
+from __future__ import annotations
+import json
+import logging
+from pathlib import Path
+from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union
+import numpy as np
+import torch
+import torch.nn.functional as F
+from transformers import AutoTokenizer
+from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
+from transformers.processing_utils import ProcessorMixin
+from transformers.tokenization_utils_base import PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
+from .common import copy_remote_code_files, to_3tuple
+from .offline_aligned_preprocessing import (
+    BACKGROUND_VALUE as DEFAULT_PATH_BACKGROUND_VALUE,
+    CROP_MARGIN_MM as DEFAULT_PATH_CROP_MARGIN_MM,
+    FOREGROUND_STRATEGY as DEFAULT_PATH_FOREGROUND_STRATEGY,
+    FOREGROUND_THRESHOLD as DEFAULT_PATH_FOREGROUND_THRESHOLD,
+    GENERIC_CACHE_VERSION,
+    GENERIC_RECIPE_ID,
+    TARGET_SHAPE as DEFAULT_PATH_TARGET_SHAPE,
+    TARGET_SPACING as DEFAULT_PATH_TARGET_SPACING,
+    preprocess_image_with_foreground_mask,
+)
+try:
+    from scripts.fomo_300k_offline_pt.common import (
+        is_fomo_300k_path,
+        preprocess_fomo_300k_image,
+    )
+except Exception:  # pragma: no cover - optional import surface
+    is_fomo_300k_path = None
+    preprocess_fomo_300k_image = None
+try:
+    from scripts.mr_rate_offline_pt.common import (
+        is_mr_rate_path,
+        preprocess_mr_rate_image,
+    )
+except Exception:  # pragma: no cover - optional import surface
+    is_mr_rate_path = None
+    preprocess_mr_rate_image = None
+try:
+    import nibabel as nib
+    try:
+        from nibabel import processing as nib_processing
+    except Exception:  # pragma: no cover - optional import
+        nib_processing = None
+except Exception:  # pragma: no cover - optional import
+    nib = None
+    nib_processing = None
+VolumeInput = Union[str, Path, np.ndarray, torch.Tensor]
+SpacingInput = Optional[Union[Sequence[float], Sequence[Sequence[float]]]]
+LOGGER = logging.getLogger(__name__)
+def _ensure_list(values: Union[VolumeInput, Sequence[VolumeInput]]) -> List[VolumeInput]:
+    if isinstance(values, (str, Path, np.ndarray, torch.Tensor)):
+        return [values]
+    return list(values)
+def _normalize_spacing_value(value: Optional[Sequence[float]], field_name: str) -> Optional[Tuple[float, float, float]]:
+    if value is None:
+        return None
+    if len(value) != 3:
+        raise ValueError(f"`{field_name}` must be a length-3 sequence. Got: {value}")
+    return (float(value[0]), float(value[1]), float(value[2]))
+def _ensure_spacing_list(
+    source_spacings: SpacingInput,
+    batch_size: int,
+) -> List[Optional[Tuple[float, float, float]]]:
+    if source_spacings is None:
+        return [None] * batch_size
+    if batch_size == 1 and isinstance(source_spacings, Sequence) and len(source_spacings) == 3 and not isinstance(
+        source_spacings[0], (list, tuple)
+    ):
+        return [_normalize_spacing_value(source_spacings, "source_spacing")]
+    values = list(source_spacings)
+    if len(values) != batch_size:
+        raise ValueError(
+            f"`source_spacings` must have length {batch_size} to match the input batch. Got {len(values)}."
+        )
+    return [_normalize_spacing_value(value, "source_spacing") for value in values]
+def _normalize_shape_value(
+    value: Sequence[int],
+    field_name: str,
+) -> Tuple[int, int, int]:
+    normalized = to_3tuple(value, field_name)
+    return (int(normalized[0]), int(normalized[1]), int(normalized[2]))
+class BrainMRISiglipVolumeProcessor(BaseImageProcessor):
+    """Image processor for 3D brain MRI volumes."""
+    model_input_names = ["pixel_values"]
+    def __init__(
+        self,
+        volume_size: Union[int, Sequence[int]] = (128, 192, 192),
+        clip_percentiles: Tuple[float, float] = (0.5, 99.5),
+        output_range: Tuple[float, float] = (-1.0, 1.0),
+        do_clip: bool = True,
+        do_normalize: bool = True,
+        interpolation_mode: str = "trilinear",
+        max_channel_dim: int = 4,
+        canonicalize_orientation: bool = True,
+        spacing: Optional[Sequence[float]] = None,
+        spacing_tolerance: float = 1e-3,
+        prefer_nibabel_resample: bool = True,
+        use_foreground_intensity_stats: bool = True,
+        do_crop_foreground: bool = True,
+        foreground_threshold: float = 1e-3,
+        crop_margin: int = 4,
+        resize_strategy: str = "pad_or_crop",
+        pad_value: Optional[float] = None,
+        path_recipe_mode: str = "auto",
+        path_target_shape: Union[int, Sequence[int]] = DEFAULT_PATH_TARGET_SHAPE,
+        path_target_spacing: Optional[Sequence[float]] = DEFAULT_PATH_TARGET_SPACING,
+        path_crop_margin_mm: float = DEFAULT_PATH_CROP_MARGIN_MM,
+        path_foreground_threshold: float = DEFAULT_PATH_FOREGROUND_THRESHOLD,
+        path_background_value: float = DEFAULT_PATH_BACKGROUND_VALUE,
+        path_foreground_strategy: str = DEFAULT_PATH_FOREGROUND_STRATEGY,
+        path_generic_recipe_id: str = GENERIC_RECIPE_ID,
+        path_generic_cache_version: int = GENERIC_CACHE_VERSION,
+        **kwargs: Any,
+    ) -> None:
+        super().__init__(**kwargs)
+        self.volume_size = list(to_3tuple(volume_size, "volume_size"))
+        self.clip_percentiles = (float(clip_percentiles[0]), float(clip_percentiles[1]))
+        self.output_range = (float(output_range[0]), float(output_range[1]))
+        self.do_clip = bool(do_clip)
+        self.do_normalize = bool(do_normalize)
+        self.interpolation_mode = str(interpolation_mode)
+        self.max_channel_dim = int(max_channel_dim)
+        self.canonicalize_orientation = bool(canonicalize_orientation)
+        self.spacing = list(_normalize_spacing_value(spacing, "spacing")) if spacing is not None else None
+        self.spacing_tolerance = float(spacing_tolerance)
+        self.prefer_nibabel_resample = bool(prefer_nibabel_resample)
+        self.use_foreground_intensity_stats = bool(use_foreground_intensity_stats)
+        self.do_crop_foreground = bool(do_crop_foreground)
+        self.foreground_threshold = float(foreground_threshold)
+        self.crop_margin = int(crop_margin)
+        self.resize_strategy = str(resize_strategy)
+        self.pad_value = None if pad_value is None else float(pad_value)
+        self.path_recipe_mode = str(path_recipe_mode)
+        self.path_target_shape = list(_normalize_shape_value(path_target_shape, "path_target_shape"))
+        self.path_target_spacing = (
+            list(_normalize_spacing_value(path_target_spacing, "path_target_spacing"))
+            if path_target_spacing is not None
+            else None
+        )
+        self.path_crop_margin_mm = float(path_crop_margin_mm)
+        self.path_foreground_threshold = float(path_foreground_threshold)
+        self.path_background_value = float(path_background_value)
+        self.path_foreground_strategy = str(path_foreground_strategy)
+        self.path_generic_recipe_id = str(path_generic_recipe_id)
+        self.path_generic_cache_version = int(path_generic_cache_version)
+        self.effective_pad_value = self._resolve_pad_value()
+        if self.max_channel_dim <= 0:
+            raise ValueError(f"`max_channel_dim` must be > 0. Got {self.max_channel_dim}.")
+        if not (0.0 <= self.clip_percentiles[0] < self.clip_percentiles[1] <= 100.0):
+            raise ValueError(
+                "`clip_percentiles` must satisfy 0 <= low < high <= 100. "
+                f"Got {self.clip_percentiles}."
+            )
+        if self.resize_strategy not in {"pad_or_crop", "interpolate"}:
+            raise ValueError(
+                "`resize_strategy` must be one of: pad_or_crop, interpolate. "
+                f"Got {self.resize_strategy!r}."
+            )
+        if self.path_recipe_mode not in {"auto", "legacy"}:
+            raise ValueError(
+                "`path_recipe_mode` must be one of: auto, legacy. "
+                f"Got {self.path_recipe_mode!r}."
+            )
+        if self.path_crop_margin_mm < 0:
+            raise ValueError(f"`path_crop_margin_mm` must be >= 0. Got {self.path_crop_margin_mm}.")
+        if self.path_foreground_threshold < 0:
+            raise ValueError(
+                f"`path_foreground_threshold` must be >= 0. Got {self.path_foreground_threshold}."
+            )
+        if self.spacing_tolerance < 0:
+            raise ValueError(f"`spacing_tolerance` must be >= 0. Got {self.spacing_tolerance}.")
+    def get_path_recipe_config(self) -> Dict[str, Any]:
+        return {
+            "path_recipe_mode": self.path_recipe_mode,
+            "path_target_shape": list(self.path_target_shape),
+            "path_target_spacing": None if self.path_target_spacing is None else list(self.path_target_spacing),
+            "path_crop_margin_mm": self.path_crop_margin_mm,
+            "path_foreground_threshold": self.path_foreground_threshold,
+            "path_background_value": self.path_background_value,
+            "path_foreground_strategy": self.path_foreground_strategy,
+            "path_generic_recipe_id": self.path_generic_recipe_id,
+            "path_generic_cache_version": self.path_generic_cache_version,
+        }
+    def _target_spacing(self) -> Optional[Tuple[float, float, float]]:
+        if self.spacing is None:
+            return None
+        return tuple(float(item) for item in self.spacing)
+    def _resolve_pad_value(self) -> float:
+        if self.pad_value is not None:
+            return float(self.pad_value)
+        if self.do_normalize:
+            return float(self.output_range[0])
+        return 0.0
+    def _spacing_matches(
+        self,
+        source_spacing: Optional[Tuple[float, float, float]],
+        target_spacing: Optional[Tuple[float, float, float]],
+    ) -> bool:
+        if source_spacing is None or target_spacing is None:
+            return False
+        return all(abs(src - dst) <= self.spacing_tolerance for src, dst in zip(source_spacing, target_spacing))
+    def _nibabel_resample_order(self) -> int:
+        if self.interpolation_mode == "nearest":
+            return 0
+        return 1
+    def _resample_nifti_image(
+        self,
+        image,
+        source_spacing: Optional[Tuple[float, float, float]],
+    ) -> tuple[Any, Optional[Tuple[float, float, float]], bool]:
+        if not self.prefer_nibabel_resample or nib_processing is None:
+            return image, source_spacing, False
+        target_spacing = self._target_spacing()
+        if target_spacing is None or self._spacing_matches(source_spacing, target_spacing):
+            return image, source_spacing, False
+        resampled = nib_processing.resample_to_output(
+            image,
+            voxel_sizes=target_spacing,
+            order=self._nibabel_resample_order(),
+        )
+        return resampled, target_spacing, True
+    def _load_volume(
+        self,
+        value: VolumeInput,
+        source_spacing: Optional[Tuple[float, float, float]] = None,
+    ) -> tuple[np.ndarray, Optional[Tuple[float, float, float]], bool]:
+        if isinstance(value, (str, Path)):
+            if nib is None:
+                raise ImportError("`nibabel` is required to load NIfTI paths.")
+            image = nib.load(str(value))
+            if self.canonicalize_orientation:
+                image = nib.as_closest_canonical(image)
+            image_spacing = image.header.get_zooms()[:3]
+            resolved_spacing = None
+            if len(image_spacing) == 3:
+                resolved_spacing = tuple(float(item) for item in image_spacing)
+            image, resolved_spacing, used_nibabel_resample = self._resample_nifti_image(image, resolved_spacing)
+            return (
+                np.asarray(image.get_fdata(dtype=np.float32), dtype=np.float32),
+                resolved_spacing,
+                used_nibabel_resample,
+            )
+        if isinstance(value, torch.Tensor):
+            # Cast in torch first: NumPy has no bfloat16 dtype, so calling `.numpy()` on a bf16 tensor raises.
+            return value.detach().cpu().to(torch.float32).numpy(), source_spacing, False
+        if isinstance(value, np.ndarray):
+            return value.astype(np.float32, copy=False), source_spacing, False
+        raise TypeError(f"Unsupported volume input type: {type(value).__name__}")
+    def _preprocess_with_offline_recipe(self, value: VolumeInput) -> Optional[np.ndarray]:
+        if self.path_recipe_mode != "auto" or not isinstance(value, (str, Path)):
+            return None
+        image_path = str(value)
+        try:
+            if is_mr_rate_path is not None and preprocess_mr_rate_image is not None and is_mr_rate_path(image_path):
+                payload = preprocess_mr_rate_image(image_path)
+                return payload["pixel_values"].detach().cpu().numpy().astype(np.float32, copy=False)
+            if is_fomo_300k_path is not None and preprocess_fomo_300k_image is not None and is_fomo_300k_path(image_path):
+                payload = preprocess_fomo_300k_image(image_path)
+                return payload["pixel_values"].detach().cpu().numpy().astype(np.float32, copy=False)
+            payload = preprocess_image_with_foreground_mask(
+                image_path,
+                target_shape=tuple(int(dim) for dim in self.path_target_shape),
+                target_spacing=(
+                    None
+                    if self.path_target_spacing is None
+                    else tuple(float(item) for item in self.path_target_spacing)
+                ),
+                crop_margin_mm=self.path_crop_margin_mm,
+                foreground_threshold=self.path_foreground_threshold,
+                background_value=self.path_background_value,
+                foreground_strategy=self.path_foreground_strategy,
+                recipe_id=self.path_generic_recipe_id,
+                cache_version=self.path_generic_cache_version,
+            )
+            return payload["pixel_values"].detach().cpu().numpy().astype(np.float32, copy=False)
+        except Exception as exc:
+            LOGGER.warning(
+                "Falling back to legacy online preprocessing for %s after offline-recipe path failed: %s",
+                image_path,
+                exc,
+            )
+        return None
+    def _ensure_channel_first(self, volume: np.ndarray) -> np.ndarray:
+        if volume.ndim == 3:
+            return volume[None, ...]
+        if volume.ndim != 4:
+            raise ValueError(
+                "Volume must be 3D or 4D. For a 4D volume, expected channel-first `[C, D, H, W]` "
+                "or channel-last `[D, H, W, C]`."
+            )
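+        # Channel-first is tested first, so a channel-last volume whose leading spatial
+        # axis is also <= max_channel_dim is returned unchanged as if it were channel-first.
+        if volume.shape[0] <= self.max_channel_dim: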
+            return volume
+        if volume.shape[-1] <= self.max_channel_dim:
+            return np.moveaxis(volume, -1, 0)
+        raise ValueError(
+            f"Cannot infer channel dimension for shape {volume.shape}. Expected channel dim <= {self.max_channel_dim}. "
+            "Please provide the volume in [C, D, H, W] or [D, H, W, C] format."
+        )
+    def _foreground_mask(self, volume: np.ndarray) -> np.ndarray:
+        threshold = abs(self.foreground_threshold)
+        if volume.ndim == 4:
+            return np.any(np.abs(volume) > threshold, axis=0)
+        return np.abs(volume) > threshold
+    def _intensity_stats_values(self, volume: np.ndarray) -> np.ndarray:
+        if not self.use_foreground_intensity_stats:
+            return volume.reshape(-1)
+        mask = self._foreground_mask(volume)
+        if not mask.any():
+            return volume.reshape(-1)
+        if volume.ndim == 4:
+            return volume[:, mask].reshape(-1)
+        return volume[mask].reshape(-1)
+    def _clip_and_normalize(self, volume: np.ndarray) -> np.ndarray:
+        output = volume
+        if self.do_clip or self.do_normalize:
+            # Sanitize before percentile so NaN/inf don't corrupt the result.
+            output = np.nan_to_num(output, nan=0.0, posinf=0.0, neginf=0.0)
+            stats_values = self._intensity_stats_values(output)
+            if self.do_clip:
+                flat = stats_values
+                if flat.size > 1_000_000:
+                    # Deterministic stride-based subsample for speed.
+                    step = max(1, flat.size // 1_000_000)
+                    flat = flat[::step]
+                low, high = np.percentile(flat, self.clip_percentiles)
+            else:
+                low, high = float(stats_values.min()), float(stats_values.max())
+            if np.isfinite(low) and np.isfinite(high) and high > low:
+                if self.do_clip:
+                    output = np.clip(output, low, high)
+                if self.do_normalize:
+                    out_low, out_high = self.output_range
+                    output = np.clip((output - low) / (high - low), 0.0, 1.0)
+                    output = output * (out_high - out_low) + out_low
+            elif self.do_normalize:
+                output = np.zeros_like(output, dtype=np.float32)
+        return output.astype(np.float32, copy=False)
+    def _resample_spacing(
+        self,
+        volume: np.ndarray,
+        source_spacing: Optional[Tuple[float, float, float]],
+        affine: Optional[np.ndarray] = None,
+    ) -> np.ndarray:
+        if self.spacing is None or source_spacing is None:
+            return volume
+        target_spacing = self._target_spacing()
+        if self._spacing_matches(source_spacing, target_spacing):
+            return volume
+        target_shape = []
+        for current_size, src, dst in zip(volume.shape[1:], source_spacing, target_spacing):
+            target_shape.append(max(1, int(round(float(current_size) * float(src) / float(dst)))))
+        if tuple(target_shape) == tuple(int(dim) for dim in volume.shape[1:]):
+            return volume
+        tensor = torch.from_numpy(volume).unsqueeze(0)
+        tensor = F.interpolate(
+            tensor,
+            size=tuple(target_shape),
+            mode=self.interpolation_mode,
+            align_corners=False if self.interpolation_mode in {"linear", "bilinear", "bicubic", "trilinear"} else None,
+        )
+        return tensor.squeeze(0).numpy().astype(np.float32, copy=False)
+    # def _crop_foreground(self, volume: np.ndarray) -> np.ndarray:
+    #     if not self.do_crop_foreground:
+    #         return volume
+    #     # Per-axis projection avoids the massive temporary arrays from np.where.
+    #     src = volume[0] if volume.ndim == 4 else volume
+    #     mask = src > self.foreground_threshold
+    #     if not mask.any():
+    #         return volume
+    #     slices = []
+    #     for dim in range(mask.ndim):
+    #         proj = mask.any(axis=tuple(d for d in range(mask.ndim) if d != dim))
+    #         lo = int(np.argmax(proj))
+    #         hi = len(proj) - 1 - int(np.argmax(proj[::-1]))
+    #         slices.append(slice(lo, hi + 1))
+    #     return volume[(slice(None),) + tuple(slices)].astype(np.float32, copy=False)
+    def _crop_foreground(self, volume: np.ndarray) -> np.ndarray:
+        if not self.do_crop_foreground:
+            return volume
+        margin = self.crop_margin
+        src = self._foreground_mask(volume)
+        if not src.any():
+            return volume
+        slices = []
+        for dim in range(src.ndim):
+            proj = src.any(axis=tuple(d for d in range(src.ndim) if d != dim))
+            lo = int(np.argmax(proj))
+            hi = len(proj) - 1 - int(np.argmax(proj[::-1]))
+            lo = max(0, lo - margin)
+            hi = min(src.shape[dim] - 1, hi + margin)
+            slices.append(slice(lo, hi + 1))
+        return volume[(slice(None),) + tuple(slices)].astype(np.float32, copy=False)
+    def _pad_or_crop_volume(self, volume: np.ndarray) -> np.ndarray:
+        target_size = tuple(int(v) for v in self.volume_size)
+        if volume.shape[1:] == target_size:
+            return volume
+        # Center-crop every spatial axis that is larger than the target; keep the channel axis whole.
+        slices = [slice(None)]
+        for current, target in zip(volume.shape[1:], target_size):
+            if current > target:
+                start = max(0, (current - target) // 2)
+                slices.append(slice(start, start + target))
+            else:
+                slices.append(slice(0, current))
+        cropped = volume[tuple(slices)]
+        # Pad the remaining short axes symmetrically; an odd remainder goes to the trailing side.
+        pad_width = [(0, 0)]
+        for current, target in zip(cropped.shape[1:], target_size):
+            if current < target:
+                delta = target - current
+                before = delta // 2
+                after = delta - before
+                pad_width.append((before, after))
+            else:
+                pad_width.append((0, 0))
+        if any(before != 0 or after != 0 for before, after in pad_width[1:]):
+            cropped = np.pad(
+                cropped,
+                pad_width=pad_width,
+                mode="constant",
+                constant_values=self.effective_pad_value,
+            )
+        return cropped.astype(np.float32, copy=False)
+    def _resize_volume(self, volume: np.ndarray) -> np.ndarray:
+        target_size = tuple(int(v) for v in self.volume_size)
+        if volume.shape[1:] == target_size:
+            return volume
+        if self.resize_strategy == "pad_or_crop":
+            return self._pad_or_crop_volume(volume)
+        tensor = torch.from_numpy(volume).unsqueeze(0)
+        tensor = F.interpolate(
+            tensor,
+            size=target_size,
+            mode=self.interpolation_mode,
+            align_corners=False if self.interpolation_mode in {"linear", "bilinear", "bicubic", "trilinear"} else None,
+        )
+        return tensor.squeeze(0).numpy().astype(np.float32, copy=False)
+    def preprocess(
+        self,
+        volumes: Union[VolumeInput, Sequence[VolumeInput]],
+        return_tensors: Optional[Union[str, bool]] = "pt",
+        source_spacings: SpacingInput = None,
+        **kwargs: Any,
+    ) -> BatchFeature:
+        del kwargs
+        items = _ensure_list(volumes)
+        spacing_values = _ensure_spacing_list(source_spacings, len(items))
+        batch = []
+        for item, source_spacing in zip(items, spacing_values):
+            recipe_aligned = self._preprocess_with_offline_recipe(item)
+            if recipe_aligned is not None:
+                batch.append(torch.from_numpy(recipe_aligned))
+                continue
+            volume, loaded_spacing, used_nibabel_resample = self._load_volume(item, source_spacing=source_spacing)
+            volume = self._ensure_channel_first(volume)
+            if not used_nibabel_resample:
+                volume = self._resample_spacing(volume, source_spacing=loaded_spacing)
+            volume = self._crop_foreground(volume)
+            volume = self._clip_and_normalize(volume)
+            volume = self._resize_volume(volume)
+            batch.append(torch.from_numpy(volume))
+        pixel_values = torch.stack(batch, dim=0).to(dtype=torch.float32)
+        return BatchFeature(data={"pixel_values": pixel_values}, tensor_type=return_tensors)
+    def __call__(
+        self,
+        volumes: Union[VolumeInput, Sequence[VolumeInput]],
+        return_tensors: Optional[Union[str, bool]] = "pt",
+        **kwargs: Any,
+    ) -> BatchFeature:
+        return self.preprocess(volumes=volumes, return_tensors=return_tensors, **kwargs)
+class BrainMRISiglipProcessor(ProcessorMixin):
+    """Processor that pairs the MRI volume processor with the text tokenizer."""
+    attributes = ["image_processor", "tokenizer"]
+    image_processor_class = "BaseImageProcessor"
+    tokenizer_class = "AutoTokenizer"
+    def __init__(self, image_processor: BrainMRISiglipVolumeProcessor, tokenizer) -> None:
+        super().__init__(image_processor=image_processor, tokenizer=tokenizer)
+    @classmethod
+    def from_text_pretrained(
+        cls,
+        text_model_name_or_path: str = "google/medsiglip-448",
+        volume_size: Union[int, Sequence[int]] = (128, 192, 192),
+        local_files_only: bool = False,
+        trust_remote_code: bool = True,
+        **kwargs: Any,
+    ) -> "BrainMRISiglipProcessor":
+        tokenizer = AutoTokenizer.from_pretrained(
+            text_model_name_or_path,
+            local_files_only=local_files_only,
+            trust_remote_code=trust_remote_code,
+        )
+        image_processor = BrainMRISiglipVolumeProcessor(volume_size=volume_size, **kwargs)
+        return cls(image_processor=image_processor, tokenizer=tokenizer)
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, Path], **kwargs: Any):
+        image_processor_kwargs = dict(kwargs.pop("image_processor_kwargs", {}) or {})
+        tokenizer_kwargs = dict(kwargs.pop("tokenizer_kwargs", {}) or {})
+        # Backward-compatible convenience: treat image-specific keys as image processor kwargs.
+        image_only_keys = {
+            "volume_size",
+            "clip_percentiles",
+            "output_range",
+            "do_clip",
+            "do_normalize",
+            "interpolation_mode",
+            "max_channel_dim",
+            "canonicalize_orientation",
+            "spacing",
+            "spacing_tolerance",
+            "prefer_nibabel_resample",
+            "use_foreground_intensity_stats",
+            "do_crop_foreground",
+            "foreground_threshold",
+            "crop_margin",
+            "resize_strategy",
+            "pad_value",
+            "path_recipe_mode",
+            "path_target_shape",
+            "path_target_spacing",
+            "path_crop_margin_mm",
+            "path_foreground_threshold",
+            "path_background_value",
+            "path_foreground_strategy",
+            "path_generic_recipe_id",
+            "path_generic_cache_version",
+        }
+        shared_kwargs = dict(kwargs)
+        for key in list(shared_kwargs.keys()):
+            if key in image_only_keys and key not in image_processor_kwargs:
+                image_processor_kwargs[key] = shared_kwargs.pop(key)
+        image_processor = BrainMRISiglipVolumeProcessor.from_pretrained(
+            pretrained_model_name_or_path,
+            **shared_kwargs,
+            **image_processor_kwargs,
+        )
+        tokenizer = AutoTokenizer.from_pretrained(
+            pretrained_model_name_or_path,
+            **shared_kwargs,
+            **tokenizer_kwargs,
+        )
+        return cls(image_processor=image_processor, tokenizer=tokenizer)
+    def save_pretrained(self, save_directory: Union[str, Path], **kwargs: Any) -> tuple[str]:
+        save_path = Path(save_directory)
+        save_path.mkdir(parents=True, exist_ok=True)
+        self.image_processor.save_pretrained(str(save_path), **kwargs)
+        self.tokenizer.save_pretrained(str(save_path), **kwargs)
+        processor_config = {
+            "processor_class": self.__class__.__name__,
+            "auto_map": {"AutoProcessor": "processing_brain_mri_siglip.BrainMRISiglipProcessor"},
+            "offline_aligned_preprocessing": self.image_processor.get_path_recipe_config(),
+        }
+        (save_path / "processor_config.json").write_text(json.dumps(processor_config, indent=2), encoding="utf-8")
+        copy_remote_code_files(save_path)
+        return (str(save_path),)
+    @property
+    def model_input_names(self) -> List[str]:
+        names = list(self.tokenizer.model_input_names)
+        for item in self.image_processor.model_input_names:
+            if item not in names:
+                names.append(item)
+        return names
+    def __call__(
+        self,
+        text: Optional[Union[TextInput, PreTokenizedInput, Sequence[TextInput], Sequence[PreTokenizedInput]]] = None,
+        volumes: Optional[Union[VolumeInput, Sequence[VolumeInput]]] = None,
+        padding: Union[bool, str, PaddingStrategy] = "max_length",
+        truncation: Union[bool, str, TruncationStrategy] = True,
+        max_length: Optional[int] = None,
+        return_tensors: Optional[Union[str, bool]] = "pt",
+        **kwargs: Any,
+    ) -> BatchFeature:
+        if text is None and volumes is None:
+            raise ValueError("At least one of `text` or `volumes` must be provided.")
+        image_processor_kwargs = dict(kwargs.pop("image_processor_kwargs", {}) or {})
+        image_only_keys = {"source_spacings"}
+        for key in list(kwargs.keys()):
+            if key in image_only_keys and key not in image_processor_kwargs:
+                image_processor_kwargs[key] = kwargs.pop(key)
+        data: Dict[str, Any] = {}
+        if text is not None:
+            text_inputs = self.tokenizer(
+                text,
+                padding=padding,
+                truncation=truncation,
+                max_length=max_length,
+                return_tensors=return_tensors,
+                **kwargs,
+            )
+            data.update(dict(text_inputs))
+        if volumes is not None:
+            image_inputs = self.image_processor(
+                volumes=volumes,
+                return_tensors=return_tensors,
+                **image_processor_kwargs,
+            )
+            data.update(dict(image_inputs))
+        return BatchFeature(data=data, tensor_type=return_tensors)
+    def batch_decode(self, *args: Any, **kwargs: Any):
+        return self.tokenizer.batch_decode(*args, **kwargs)
+    def decode(self, *args: Any, **kwargs: Any):
+        return self.tokenizer.decode(*args, **kwargs)
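
The intensity step of the legacy online path (`_clip_and_normalize`) can be sketched in isolation as below. This is a minimal sketch, not the class's actual code path: it computes statistics over the whole volume rather than the foreground mask, skips the stride-based subsampling, and the function name `clip_and_rescale` and the `(0.5, 99.5)` percentile default are assumptions for illustration (the diff does not show the real default for `clip_percentiles`).

```python
import numpy as np


def clip_and_rescale(volume, percentiles=(0.5, 99.5), output_range=(-1.0, 1.0)):
    """Percentile-clip a volume and rescale it linearly into `output_range`.

    `percentiles` is an illustrative default, not a value taken from the repository.
    """
    # Replace NaN/inf first so they cannot distort the percentile estimate.
    volume = np.nan_to_num(
        np.asarray(volume, dtype=np.float32), nan=0.0, posinf=0.0, neginf=0.0
    )
    low, high = np.percentile(volume, percentiles)
    if not (np.isfinite(low) and np.isfinite(high) and high > low):
        # Degenerate (e.g. constant) volume: fall back to all zeros.
        return np.zeros_like(volume, dtype=np.float32)
    out_low, out_high = output_range
    # Map [low, high] onto [0, 1], clamping everything outside the window.
    unit = np.clip((volume - low) / (high - low), 0.0, 1.0)
    return (unit * (out_high - out_low) + out_low).astype(np.float32)


demo = np.linspace(0.0, 100.0, 1000, dtype=np.float32).reshape(10, 10, 10)
rescaled = clip_and_rescale(demo)
constant = clip_and_rescale(np.full((4, 4, 4), 7.0, dtype=np.float32))
```

Voxels below the low percentile land exactly on the lower bound of `output_range` and voxels above the high percentile on the upper bound, which is why the README's stated `[-1, 1]` input range holds regardless of scanner intensity scale.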

processor_config.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "processor_class": "BrainMRISiglipProcessor",
+  "auto_map": {
+    "AutoProcessor": "processing_brain_mri_siglip.BrainMRISiglipProcessor"
+  },
+  "offline_aligned_preprocessing": {
+    "path_recipe_mode": "auto",
+    "path_target_shape": [
+      128,
+      192,
+      192
+    ],
+    "path_target_spacing": [
+      1.25,
+      1.0,
+      1.0
+    ],
+    "path_crop_margin_mm": 5.0,
+    "path_foreground_threshold": 0.001,
+    "path_background_value": -1.0,
+    "path_foreground_strategy": "largest_component_nonzero",
+    "path_generic_recipe_id": "generic_foreground_128x192x192_fp16_v1",
+    "path_generic_cache_version": 1
+  }
+}
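
As a quick sanity check on these values, the physical extent covered by the offline recipe follows from multiplying `path_target_shape` by `path_target_spacing` axis by axis. The sketch below also derives the patch grid implied by the README's patch size `[8, 16, 16]`, assuming non-overlapping patches and no extra tokens; that assumption and the variable names are illustrative, not taken from the modeling code.

```python
import math

target_shape = (128, 192, 192)        # path_target_shape, in voxels
target_spacing_mm = (1.25, 1.0, 1.0)  # path_target_spacing, in mm per voxel
patch_size = (8, 16, 16)              # patch size listed in the README

# Field of view per axis: voxel count times voxel size.
fov_mm = tuple(n * s for n, s in zip(target_shape, target_spacing_mm))

# Patch grid under the assumption of non-overlapping patches.
patch_grid = tuple(n // p for n, p in zip(target_shape, patch_size))
num_patches = math.prod(patch_grid)

print(fov_mm, patch_grid, num_patches)
```

Under these assumptions each volume spans 160 mm x 192 mm x 192 mm and splits into a 16 x 12 x 12 grid, that is 2304 patches.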

special_tokens_map.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": true,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "</s>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": true,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": true,
+    "single_word": false
+  }
+}

spiece.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1e5036bed065526c3c212dfbe288752391797c4bb1a284aa18c9a0b23fcaf8ec
+size 798330

tokenizer_config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "added_tokens_decoder": {
+    "1": {
+      "content": "</s>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<unk>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [],
+  "clean_up_tokenization_spaces": true,
+  "do_lower_case": true,
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "model_input_names": [
+    "input_ids"
+  ],
+  "model_max_length": 64,
+  "pad_token": "</s>",
+  "processor_class": "SiglipProcessor",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "SiglipTokenizer",
+  "unk_token": "<unk>"
+}
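
Because `BrainMRISiglipProcessor.__call__` defaults to `padding="max_length"` and `truncation=True`, text inputs end up at a fixed length of `model_max_length` (64), padded with `</s>` (id 1 in `added_tokens_decoder`). The following is a simplified sketch of that effect on a list of token ids. The helper `pad_or_truncate` is hypothetical, and it ignores how the real tokenizer handles special tokens during truncation.

```python
from typing import List, Sequence

MODEL_MAX_LENGTH = 64  # "model_max_length" in tokenizer_config.json
PAD_TOKEN_ID = 1       # "</s>" is id 1 and is also configured as pad_token


def pad_or_truncate(
    token_ids: Sequence[int],
    max_length: int = MODEL_MAX_LENGTH,
    pad_id: int = PAD_TOKEN_ID,
) -> List[int]:
    """Return exactly `max_length` ids: cut long inputs, right-pad short ones."""
    clipped = list(token_ids)[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


short = pad_or_truncate([5, 6, 7])
long = pad_or_truncate(range(100))
```

One practical consequence is that metadata text longer than 64 tokens is cut off, so the most informative fields are best placed first.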