Commit a8bf2f3 (verified) · 0 parent(s)
Super-squash branch 'main' using huggingface_hub

Files changed:
- .gitattributes +35 -0
- README.md +100 -0
- common_spear.py +702 -0
- config.json +167 -0
- configuration_spear.py +347 -0
- generation_config.json +3 -0
- model-00001-of-00003.safetensors +3 -0
- model-00002-of-00003.safetensors +3 -0
- model-00003-of-00003.safetensors +3 -0
- model.safetensors.index.json +0 -0
- modeling_spear.py +0 -0
- processing_spear.py +1897 -0
.gitattributes (new file, +35 lines)

*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md (new file, +100 lines)

---
license: gemma
library_name: transformers
pipeline_tag: visual-question-answering
---
# SPEAR-1 model card

SPEAR-1 is a cutting-edge Vision-Language-Action (VLA) model that achieves performance __superior to or on par with state-of-the-art models such as pi0-FAST and pi0.5__
on multiple embodiments while being trained __on 20x less robot data__.

This model was developed by [INSAIT](https://insait.ai/), a special unit of Sofia University St. Kliment Ohridski, in Sofia, Bulgaria.

Code and model weights for SPEAR-1 models are free to use under the Gemma license.

This repo provides model weights fine-tuned for a Franka setup with one wrist camera and one external camera.

## Model description

The key to SPEAR-1's data efficiency is SPEAR-VLM, a 3D-aware VLM. SPEAR-VLM extends PaliGemma with the MoGe depth encoder and is trained on 3D VQA tasks using
primarily non-robot data sources, such as EgoExo-4D.

SPEAR-1's architecture combines SPEAR-VLM with a DiT action expert. It is first pre-trained on a mixture of robot demonstration datasets from Open X-Embodiment and
then fine-tuned for specific embodiments.

## Use with 🤗 Transformers

We provide a fully `AutoModel`-compatible implementation of SPEAR-1 that can be used via 🤗 Transformers.

### Environment setup

The current implementation requires the following additional dependencies: `roma`, `timm`, `flash-attn`.

Here is a snippet to set up a working environment for inference via `uv`:

```
uv venv --python 3.10.12
source .venv/bin/activate
uv pip install --torch-backend=cu126 roma==1.5.0 numpy==2.2.4 torch==2.6.0 torchvision==0.21.0 transformers==4.47.0 timm==1.0.15
uv pip install --no-build-isolation setuptools psutil flash-attn==2.7.3
```

### Example usage

```python
from typing import Dict

import numpy as np
import torch
from PIL import Image
from transformers import AutoModel

model = AutoModel.from_pretrained("INSAIT-Institute/spear1-franka")
model = model.to(dtype=torch.bfloat16, device="cuda").eval()

main_image = np.asarray(Image.open("path/to/main_image.png"))
wrist_image = np.asarray(Image.open("path/to/wrist_image.png"))

ee_translation = np.array([0.36, 0.0, 0.56])
ee_rotation = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
gripper = np.array(1.0)

model_input: Dict[str, np.ndarray | str | Dict[str, np.ndarray]] = {
    "images": {
        "main": main_image,    # (H, W, C)
        "wrist": wrist_image,  # (H, W, C)
    },
    "ee_translation": ee_translation,  # (3,)
    "ee_rotation": ee_rotation,        # (3, 3)
    "gripper": gripper,                # (1,)
    "language_instruction": "put the carrot on the blue plate",
    "dataset_name": "droid",
}

model_output: Dict[str, np.ndarray] = model.predict_action(model_input)

ctrl_translation: np.ndarray = model_output["translation"]  # (S, 3)
ctrl_rotation: np.ndarray = model_output["rotation"]        # (S, 3, 3)
ctrl_gripper: np.ndarray = model_output["gripper"]          # (S, 1)
```

## Action space

SPEAR-1 predicts action chunks of delta end-effector poses. Each step in the predicted action chunk is relative to the input state.

Given the current end-effector pose `[R, t]` and a model prediction `A_rel = [[R_1, t_1], ..., [R_n, t_n]]`, absolute end-effector pose commands can be computed as:
```
A_abs = [[R * R_1, t + t_1], ..., [R * R_n, t + t_n]]
```

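For reference, this composition can be written in a few lines of NumPy. The sketch below assumes the `model_output`, `ee_rotation`, and `ee_translation` variables from the example above; it is illustrative and not part of the packaged API:

```python
R = ee_rotation          # current rotation, (3, 3)
t = ee_translation       # current translation, (3,)

R_rel = model_output["rotation"]     # (S, 3, 3)
t_rel = model_output["translation"]  # (S, 3)

# Every step of the chunk is composed with the same input state.
R_abs = R @ R_rel   # (S, 3, 3), i.e. R * R_i for each step i
t_abs = t + t_rel   # (S, 3),    i.e. t + t_i for each step i
```
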
## Community Feedback

We welcome feedback from the community to help improve SPEAR-1. If you have suggestions, encounter any issues, or have ideas for improvements, please contact us.

## Summary

- __Model type__: Vision-Language-Action with flow-matching action decoding
- __Contact__: contact@insait.ai
- __License__: Gemma Terms of Use

common_spear.py (new file, +702 lines)

import collections.abc
import dataclasses
import enum
import inspect
import types
from collections.abc import Mapping as MappingABC
from functools import cached_property
from typing import (
    Any,
    Callable,
    Dict,
    Iterable,
    List,
    Mapping,
    Optional,
    Sequence,
    Tuple,
    Type,
    Union,
)

import torch
import transformers


class StrEnum(str, enum.Enum):
    """
    A minimal drop-in replacement for backports.strenum.StrEnum
    """

    def __str__(self):
        return str(self.value)

    def __new__(cls, value):
        # Create new instance that properly handles string initialization
        if isinstance(value, str):
            obj = str.__new__(cls, value)
            obj._value_ = value
            return obj
        return super().__new__(cls, value)

    @classmethod
    def _missing_(cls, value):
        # Enhanced lookup by string value with better error handling
        if isinstance(value, str):
            for member in cls:
                if member.value == value:
                    return member
        # Return None to let enum handle the KeyError
        return None

    def __eq__(self, other):
        # Allow comparison with string values
        if isinstance(other, str):
            return self.value == other
        return super().__eq__(other)

    def __hash__(self):
        # Ensure consistent hashing
        return hash(self.value)


class _cached_classproperty:
    def __init__(self, func):
        self.func = func
        self._values = {}

    def __get__(self, obj, klass):
        if klass not in self._values.keys():
            self._values[klass] = self.func.__get__(obj, klass)()
        return self._values[klass]


def cached_classproperty(func):
    if not isinstance(func, (classmethod, staticmethod)):
        func = classmethod(func)
    return _cached_classproperty(func)

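# Note: StrEnum members compare equal to, and hash like, their plain string values
# (see __eq__/__hash__ above), so a member whose value is "foo" satisfies `member == "foo"`.
# cached_classproperty evaluates the wrapped function once per class and caches the result.
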
@dataclasses.dataclass
class Dataclass:
    def __post_init__(self):
        pass

    @classmethod
    def make_empty(cls) -> "Dataclass":
        return cls(
            **{
                k: (v.make_empty() if inspect.isclass(v) and issubclass(v, Dataclass) else None)
                for (k, v) in cls.types.items()
            }
        )

    @cached_classproperty
    def fields(cls) -> Tuple[dataclasses.Field, ...]:
        """Returns a sorted list of the Field objects"""
        return tuple(sorted(dataclasses.fields(cls), key=lambda x: x.name))

    @cached_classproperty
    def types(cls) -> Dict[str, type]:
        return {f.name: f.type for f in cls.fields}

    def as_json(self, recursive: bool = True) -> dict:
        return {k: v.as_json() if isinstance(v, Dataclass) and recursive else v for (k, v) in self.items()}

    @classmethod
    def keys(cls) -> List[str]:
        return [field.name for field in cls.fields]

    def values(self):
        return [getattr(self, field.name) for field in self.fields]

    def items(self, recursive: bool = False):
        for key, value in zip(self.keys(), self.values(), strict=True):
            if recursive and isinstance(value, Dataclass):
                for subkey, subvalue in value.items(recursive=True):
                    yield (f"{key}.{subkey}", subvalue)
            else:
                yield (key, value)

    def replace(self, **kwargs):
        """
        Return a new instance of Dataclass with the kwargs overwritten.
        """
        kwargs = maybe_chained_keys_to_nested_dict(kwargs)
        data = self.as_json(recursive=False)
        for key, value in kwargs.items():
            value_type = self.types.get(key, None)
            if value_type is None:
                raise KeyError(f"Dataclass {self.__class__} does not have a field {key}")
            value_type = get_maybe_optional_type(value_type)
            if inspect.isclass(value_type) and issubclass(value_type, Dataclass):
                if isinstance(value, dict):
                    data[key] = data[key].replace(**value)
                else:
                    data[key] = value
            else:
                data[key] = value
        return self.__class__(**data)

    def apply(self, fcn: Callable, recursive: bool = True, skip_nones: bool = False) -> "Dataclass":
        def fcn_wrapper(value: Any) -> Any:
            if value is None and skip_nones:
                return None
            if isinstance(value, dict) and recursive:
                return type(value)(**{k: fcn(v) for (k, v) in value.items()})
            if isinstance(value, (list, tuple)) and recursive:
                return type(value)([fcn(v) for v in value])
            if isinstance(value, Dataclass) and recursive:
                return value.apply(fcn, recursive=True, skip_nones=skip_nones)
            return fcn(value)

        return self.__class__(**{key: fcn_wrapper(value) for (key, value) in self.items()})

    def __getitem__(self, index) -> "Dataclass":
        def extract(obj):
            if obj is None:
                return None
            if isinstance(obj, torch.Tensor):
                return obj[index]
            raise ValueError(f"Cannot slice {obj.__class__.__name__} object")

        return self.apply(extract)

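# Usage note: `replace` accepts dotted keys for nested Dataclass fields, so
# `obj.replace(**{"flow_input.timestep": t})` behaves the same as
# `obj.replace(flow_input={"timestep": t})` (see maybe_chained_keys_to_nested_dict below).

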
class Config:
    def __init__(self, **kwargs):
        self._apply_defaults()
        self._set_attributes(**kwargs)
        super().__init__()
        self.__post_init__()

    def _apply_defaults(self):
        """
        Initializes all annotated fields with defaults or sensible instances.
        """
        annotations = getattr(self, "__annotations__", {})
        for key, type_hint in annotations.items():
            # Skip if already set via class-level value or __init__ kwarg
            if hasattr(self, key):
                continue

            # Case 1: class variable has a default (declared at class level)
            if key in self.__class__.__dict__:
                setattr(self, key, getattr(self.__class__, key))
                continue

            # Case 2: if the type is another Config subclass, default-construct it
            if inspect.isclass(type_hint) and issubclass(type_hint, Config):
                setattr(self, key, type_hint())
                continue

            # Case 3: fallback None (or empty dict for mappings)
            if hasattr(type_hint, "__origin__") and type_hint.__origin__ in (
                dict,
                Dict,
                MappingABC,
            ):
                setattr(self, key, {})
            else:
                setattr(self, key, None)

    def _set_attributes(self, **kwargs):
        subconfig_types = self._subconfig_types
        for key, value in kwargs.items():
            if key in subconfig_types:
                if not isinstance(value, Mapping):
                    raise ValueError(
                        f"{self.__class__.__name__}.{key} expects dict-like object for nested config, but got: {value}"
                    )
                setattr(self, key, subconfig_types[key](**value))
            else:
                setattr(self, key, value)

    def keys(self) -> List[str]:
        """Get all annotated keys including those from parent classes."""
        all_keys = {}
        # Walk through MRO in reverse to respect inheritance order
        for cls in reversed(self.__class__.__mro__):
            if cls is object:
                continue
            all_keys.update(getattr(cls, "__annotations__", {}))
        return list(all_keys.keys())

    def items(self) -> Iterable[Tuple[str, Any]]:
        for key in self.keys():
            yield (key, getattr(self, key))

    @cached_classproperty
    def _subconfig_types(cls) -> dict[str, Type]:
        keys = {
            key: value
            for (key, value) in cls.__annotations__.items()
            if inspect.isclass(value) and issubclass(value, Config)
        }
        for base in cls.__bases__:
            if not issubclass(base, Config):
                continue
            keys = {**keys, **base._subconfig_types}
        return keys

    def __post_init__(self):
        pass

    def as_json(self) -> dict:
        data = {}
        for key, value in self.items():
            if isinstance(value, Config):
                data[key] = value.as_json()
            elif (
                isinstance(value, collections.abc.Sequence)
                and len(value) > 0
                and isinstance(value[0], Config)
            ):
                data[key] = [v.as_json() for v in value]
            elif (
                isinstance(value, collections.abc.Mapping)
                and len(value) > 0
                and isinstance(next(iter(value.values())), Config)
            ):
                data[key] = {k: v.as_json() for k, v in value.items()}
            else:
                data[key] = value

        return data


class HFConfigMixin(transformers.PretrainedConfig):
    """
    Bridge between your Config system and HF PretrainedConfig.

    Usage:
        class SPEAR1Config(HFConfigMixin, Config):
            model_type = "spear1"
            processor_config: PaliGemmaProcessorConfig
            ...
    """

    def __init__(self, **kwargs):
        # Let HF's machinery initialize its own attributes / defaults first.
        # PretrainedConfig.__init__ will set things like `model_type`,
        # `_name_or_path`, `architectures`, and keep a `kwargs`->dict of extra items.
        super().__init__(**kwargs)

        # Now initialize your Config behavior: set defaults and construct nested configs.
        # We call Config.__init__ explicitly because HFConfigMixin inherits from PretrainedConfig,
        # and the user's concrete class will use multiple-inheritance with Config.
        # (This approach mirrors the earlier MRO design: class Concrete(HFConfigMixin, Config).)
        # We pass kwargs again so nested configs get overridden by user kwargs.
        # Note: Config.__init__ itself calls super().__init__() — but because we are calling
        # Config.__init__ directly (not via super()) the MRO won't re-call PretrainedConfig.__init__ here.
        # (I.e., we are deliberately calling the concrete base initializer.)
        Config.__init__(self, **kwargs)  # type: ignore[name-defined]

    def to_dict(self) -> Dict[str, Any]:
        """
        Merge HF PretrainedConfig serialization and Config.as_json().

        Strategy:
        1. Take HF dict (super().to_dict()) so HF metadata/defaults are present.
        2. Take our nested config dict (Config.as_json(self)).
        3. Update the HF dict with our nested config dict so annotated fields
           (nested configs, lists/dicts that should be recursively serialized)
           take precedence.
        """
        # HF's representation (contains model_type, etc.). This is trusted HF serialization.
        hf = super().to_dict()

        # Our nested config representation (recursively serializes Config objects).
        # Do not call self.to_dict() because that would recurse back here.
        cfg_json = Config.as_json(self)  # type: ignore[name-defined]

        # Merge: prefer cfg_json values for keys present in our config (so nested configs
        # are represented as dicts rather than raw objects or omitted).
        merged: Dict[str, Any] = dict(hf)
        merged.update(cfg_json)
        return merged

    @classmethod
    def from_dict(
        cls: Type["HFConfigMixin"],
        config_dict: Dict[str, Any],
        **kwargs,
    ) -> "HFConfigMixin":
        """
        Construct by delegating to the class constructor — that will instantiate nested configs.
        This is simple and consistent with PretrainedConfig.from_dict/from_pretrained behavior.
        """
        return_unused_kwargs = kwargs.pop("return_unused_kwargs", False)

        instance = cls(**config_dict)

        if return_unused_kwargs:
            # Return tuple of (instance, unused_kwargs) if requested
            # Since we consume everything in __init__, unused is typically empty
            return instance, {}
        return instance


class Configurable:
    def __init__(self, config: Config):
        self._config = config

    @property
    def config(self) -> Config:
        return self._config


class RotationFormat(StrEnum):
    """Determines how rotations will be encoded in the loaded batch"""

    EULER = "euler"
    QUATERNION = "quaternion"
    ROTMAT = "rotmat"


class ResizeMode(StrEnum):
    """
    Different modes for resizing images.
    """

    MATCH_WIDTH = "match_width"
    MATCH_HEIGHT = "match_height"
    MATCH_MAX = "match_max"
    NAIVE = "naive"
    SMART = "smart"
    PAD = "pad"
    CROP = "crop"


class Normalization(StrEnum):
    """Action normalization types"""

    NONE = "none"
    BOUNDS = "bounds"
    BOUNDS_Q99 = "bounds_q99"
    MEAN = "mean"


def expand_dims(tensor: torch.Tensor, ndim: int, order: Sequence[int]) -> torch.Tensor:
    """
    Expand the dimensions of `tensor` to `ndim` such that all new dimensions have size of 1
    Args:
        tensor: torch.Tensor of any shape
        ndim: Number of output dimensions. Must be >= `tensor.ndim`
        order: Sequence of size `tensor.ndim + 1`. Contains only values of 1 and a single value of -1,
            indicating where the new `ndim - tensor.ndim` dimensions will be inserted
    Returns:
        torch.Tensor with dimensions `ndim`, a view of `tensor`

    Ex:
        expand_dims(torch.ones([2, 3, 4]), ndim=5, order=[1, -1, 1, 1]).shape -> [2, 1, 1, 3, 4]
        expand_dims(torch.ones([2, 3, 4]), ndim=5, order=[-1, 1, 1, 1]).shape -> [1, 1, 2, 3, 4]
        expand_dims(torch.ones([2, 3, 4]), ndim=5, order=[1, 1, 1, -1]).shape -> [2, 3, 4, 1, 1]
    """
    assert tensor.ndim <= ndim, f"{tensor.ndim} > {ndim}; shape={tensor.shape}"
    assert len(order) == tensor.ndim + 1, f"{len(order)} != {tensor.ndim + 1}; shape={tensor.shape}"
    order = list(order)
    assert order.count(-1) == 1, "Order must have exactly one value of -1"
    assert order.count(1) == len(order) - 1, "Order must have exactly len(order) - 1 values of 1"
    if tensor.ndim == ndim:
        return tensor
    insert_index = order.index(-1)
    view = list(tensor.shape[:insert_index]) + [1] * (ndim - tensor.ndim) + list(tensor.shape[insert_index:])
    tensor = tensor.view(view)
    return tensor


def merge_dicts_recursive(dict_1: Dict[str, Any], dict_2: Dict[str, Any]) -> Dict[str, Any]:
    """
    Merges dict_1 with dict_2 recursively.
    Handles clashing keys:
    1. If both values are dicts, merges them recursively
    2. If any value is not a dict, raises ValueError
    """
    merged = dict(dict_1)
    for key, value in dict_2.items():
        if key in merged:
            if not type(merged[key]) is type(value) is dict:
                raise ValueError(f"Multiple values provided for key {key}: {merged[key]} and {value}")
            merged[key] = merge_dicts_recursive(merged[key], value)
        else:
            merged[key] = value
    return merged


def maybe_chained_keys_to_nested_dict(data: Dict[str, Any]) -> Dict[str, Any]:
    """Converts a dict with keys of the form "key1.key2.key3" to a nested dict"""
    unpacked_data: Dict[str, Any] = {}
    for key, value in data.items():
        if "." not in key:
            unpacked_data = merge_dicts_recursive(unpacked_data, {key: value})
        else:
            (mainkey, subkey) = key.split(".", maxsplit=1)
            nested_value = maybe_chained_keys_to_nested_dict({subkey: value})
            unpacked_data = merge_dicts_recursive(unpacked_data, {mainkey: nested_value})
    return unpacked_data
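
# Example: maybe_chained_keys_to_nested_dict({"a.b": 1, "a.c": 2, "d": 3})
# returns {"a": {"b": 1, "c": 2}, "d": 3}.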


def annotation_is_union(type_value: Type) -> bool:
    return getattr(type_value, "__origin__", None) is Union or type(type_value) is types.UnionType


def annotation_is_optional(type_value: Type) -> bool:
    if annotation_is_union(type_value):
        union_args = set(type_value.__args__)
        if len(union_args) == 2 and type(None) in union_args:
            return True
    return False


def get_maybe_optional_type(type_value: Type[Optional[Any]]) -> Type[Any]:
    if annotation_is_optional(type_value):
        type_args = type_value.__args__
        if type_args[1] is type(None):
            return type_args[0]
        return type_args[1]
    return type_value

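# Example: get_maybe_optional_type(Optional[int]) is int; non-Optional annotations
# are returned unchanged.

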
@dataclasses.dataclass
class RoboticsTarget(Dataclass):
    control_tokens_ids: Optional[torch.Tensor]
    text_tokens_ids: Optional[torch.Tensor]
    translation: torch.Tensor
    rotation: torch.Tensor
    gripper: torch.Tensor
    valid_mask: torch.Tensor


@dataclasses.dataclass
class RoboticsControlPlan(Dataclass):
    translation_m: torch.Tensor
    rotmat: torch.Tensor
    gripper_prob: torch.Tensor
    valid_mask: torch.Tensor

    def __post_init__(self):
        super().__post_init__()
        assert self.translation_m.ndim == 3, self.translation_m.shape
        assert self.rotmat.ndim == 3, self.rotmat.shape
        assert self.gripper_prob.ndim == 3, self.gripper_prob.shape


@dataclasses.dataclass
class RoboticsInput(Dataclass):
    images: Dict[str, torch.Tensor]
    input_ids: torch.Tensor
    attn_mask: torch.Tensor
    ee_pose_translation: torch.Tensor
    ee_pose_rotation: torch.Tensor
    gripper: torch.Tensor
    joints: torch.Tensor
    control_tokens_ids: Optional[torch.Tensor]

    @property
    def inputs_embeds(self) -> Optional[torch.Tensor]:
        return None

    @property
    def past_key_values(self) -> Optional[List[torch.Tensor]]:
        return None

    @cached_property
    def multimodal_indices(self) -> torch.Tensor:
        """
        Returns a torch.Tensor containing only the indices of the batch examples which are multimodal.
        Return shape is [B]
        """
        return torch.arange(self.input_ids.shape[0], dtype=torch.int64, device=self.input_ids.device)

    @cached_property
    def unimodal_indices(self) -> torch.Tensor:
        """
        Returns a torch.Tensor containing only the indices of the batch examples which are unimodal.
        Return shape is [B]
        """
        return torch.tensor([], dtype=torch.int64, device=self.input_ids.device)


@dataclasses.dataclass
class FlowInput(Dataclass):
    timestep: torch.Tensor
    translation_t: torch.Tensor
    rotation_t: torch.Tensor
    gripper_t: torch.Tensor
    translation_t0: torch.Tensor
    rotation_t0: torch.Tensor
    gripper_t0: torch.Tensor


@dataclasses.dataclass
class RoboticsFlowInput(RoboticsInput):
    """Input to the entire Robotics VLM"""

    flow_input: FlowInput


@dataclasses.dataclass
class DiffusionInput(Dataclass):
    timestep: torch.Tensor
    noised_translation: torch.Tensor
    noised_rotation: torch.Tensor
    noised_gripper: torch.Tensor


@dataclasses.dataclass
class LLMOutput(Dataclass):
    """Fork of transformers.modeling_outputs.CausalLMOutputWithPast"""

    input_ids: torch.Tensor
    logits: Optional[torch.Tensor]
    output_ids: Optional[torch.Tensor]
    loss: Optional[torch.Tensor]
    past_key_values: List[Tuple[torch.Tensor, torch.Tensor]]
    hidden_states: List[torch.Tensor]
    text_indices: torch.Tensor
    image_indices: torch.Tensor

    @classmethod
    def from_transformers(
        cls,
        input_ids: torch.Tensor,
        llm_output: transformers.modeling_outputs.CausalLMOutputWithPast,
        text_indices: Optional[torch.Tensor],
        image_indices: Optional[torch.Tensor],
    ) -> "LLMOutput":
        return LLMOutput(
            input_ids=input_ids,
            logits=llm_output.logits,
            output_ids=None,
            loss=llm_output.loss,
            past_key_values=(
                list(llm_output.past_key_values) if llm_output.past_key_values is not None else []
            ),
            hidden_states=(list(llm_output.hidden_states) if llm_output.hidden_states is not None else []),
            text_indices=text_indices,
            image_indices=image_indices,
        )

    def compress(self) -> "LLMOutput":
        """
        Compress the data contained in the class so it can be moved between CPU and GPU or concatenated
        much faster:
        - hidden_states - huge tensors; take a lot of CPU time to move across devices or concat
        - past_key_values - huge tensors; take a lot of CPU time to move across devices or concat
        - logits - huge last dimension; takes a lot of CPU time to move across devices or concat
        """
        replace: Dict[str, Any] = {
            "hidden_states": [],
            "past_key_values": [],
            "loss": None,
            "input_ids": None,
        }
        if self.logits is not None:
            replace["logits"] = None
            if self.output_ids is None or self.output_ids.shape[1] != self.text_indices.shape[0]:
                replace["output_ids"] = (
                    torch.index_select(self.logits, dim=1, index=self.text_indices)
                    .argmax(dim=-1)
                    .to(dtype=torch.int64)
                )
        return self.replace(**replace)


@dataclasses.dataclass
class RoboticsOutput(Dataclass):
    translation: Optional[torch.Tensor]
    rotation: Optional[torch.Tensor]
    gripper: Optional[torch.Tensor]
    token_logits: Optional[torch.Tensor]
    token_ids: Optional[torch.Tensor]
    llm_output: LLMOutput

    def compress(self) -> "RoboticsOutput":
        """
        Compress output and drop unnecessary components to speed up transfer GPU <-> CPU.
        Note that LLM logits can be extremely expensive since their size is [B, S, vocab_size], which
        can reach millions or billions of values for large vocab_size
        """
        replace: Dict[str, Any] = {
            "llm_output": self.llm_output.compress(),
            "token_logits": None,
        }
        if self.token_logits is not None and self.token_ids is None:
            replace["token_ids"] = torch.argmax(self.token_logits, dim=-1)
        return self.replace(**replace)


@dataclasses.dataclass
class VLMOutput(Dataclass):
    llm_output: LLMOutput
    vit_tokens: Optional[torch.Tensor]
    attn_mask: torch.Tensor

    def compress(self) -> "VLMOutput":
        """
        Compress output and drop unnecessary components to speed up transfer GPU <-> CPU.
        Note that LLM logits can be extremely expensive since their size is [B, S, vocab_size], which
        can reach millions or billions of values for large vocab_size
        """
        return self.replace(llm_output=self.llm_output.compress())


def is_quaternion(quaternion: torch.Tensor) -> bool:
    return quaternion.shape[-1] == 4


def quaternion_half_cover(quaternion: torch.Tensor) -> torch.Tensor:
    """
    Flip quaternions so they cover only half of the space. If q_w is negative, flip the quaternion.
    If q_w is 0, then choose such that the first non-zero component is positive. Note that geometrically,
    this doesn't correspond to a single hemisphere of the unit sphere. Follows
    https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.transform.Rotation.as_quat.html#scipy.spatial.transform.Rotation.as_quat
    """
    assert is_quaternion(quaternion), quaternion.shape
    with torch.no_grad():
        is_zero = quaternion == 0
        flip_condition = (
            (quaternion[..., -1:] < 0)
            | is_zero[..., -1:] & (quaternion[..., 0:1] < 0)
            | is_zero[..., -1:] & is_zero[..., 0:1] & (quaternion[..., 1:2] < 0)
            | is_zero[..., -1:] & is_zero[..., 0:1] & is_zero[..., 1:2] & (quaternion[..., 2:3] < 0)
        )
    quaternion = torch.where(flip_condition, -quaternion, quaternion)
    return quaternion


def is_rotmat_3x3(rotmat: torch.Tensor) -> bool:
    return rotmat.shape[-2:] == torch.Size([3, 3])


def is_rotmat_9(rotmat: torch.Tensor) -> bool:
    return rotmat.shape[-1] == 9


def rotmat_as_9(rotmat: torch.Tensor) -> torch.Tensor:
    """Convert any rotmat input to [..., 9] shape"""
    if is_rotmat_9(rotmat):
        return rotmat
    if is_rotmat_3x3(rotmat):
        return rotmat.reshape(*rotmat.shape[:-2], 9)
    raise ValueError(f"Can't convert tensor of shape {rotmat.shape} to a 3x3 rotation matrix")


def is_rotmat(rotmat: torch.Tensor) -> bool:
    """
    Checks if the tensor shape matches that of a rotmat. However, it's not guaranteed the data is a
    valid rotmat. `is_orthonormal_rotmat` performs this additional check.
    NOTE: This might incorrectly return True if the underlying data is euler angles and accidentally
    `rotmat.shape[-2:] == [3, 3]`. This would happen very rarely, but use with caution
    """
    return is_rotmat_3x3(rotmat) or is_rotmat_9(rotmat)


def rotmat_as_3x3(rotmat: torch.Tensor) -> torch.Tensor:
    """Convert any rotmat input to [..., 3, 3] shape"""
    if rotmat.shape[-1] == 9:
        return rotmat.reshape(*rotmat.shape[:-1], 3, 3)
    if rotmat.shape[-2:] == torch.Size([3, 3]):
        return rotmat
    raise ValueError(f"Can't convert tensor of shape {rotmat.shape} to a 3x3 rotation matrix")
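
# Shape examples:
#   rotmat_as_9(torch.eye(3).expand(5, 3, 3)).shape  -> torch.Size([5, 9])
#   rotmat_as_3x3(torch.zeros(5, 9)).shape           -> torch.Size([5, 3, 3])
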
config.json (new file, +167 lines)

{
  "_auto_class": null,
  "_name_or_path": "/scratch/giuliano_albanese/spear-hf",
  "architectures": [
    "SPEAR1"
  ],
  "attribute_map": {},
  "auto_map": {
    "AutoConfig": "configuration_spear.SPEAR1Config",
    "AutoModel": "modeling_spear.SPEAR1"
  },
  "autoclass": "barrel.pipes.vlams.models.vlams.vlam.VLAM",
  "base_config_key": "",
  "control_module_config": {
    "control_decoder_config": {
      "block_config": {
        "activation": "GELU",
        "attn_implementation": "sdpa",
        "dropout": 0.0,
        "feature_size": 1024,
        "head_dim": 256,
        "hidden_size": 4096,
        "norm": "RMSNorm",
        "num_heads": 8,
        "num_kv_heads": 1,
        "position_embed_config": {
          "base": 10000,
          "cached": true,
          "embedding_dim": 256,
          "num_embeddings": 512
        }
      },
      "num_blocks": 18
    },
    "noised_control_proj_config": {
      "activation": "SiLU",
      "layers": [
        8,
        2048,
        1024,
        1024
      ],
      "norm": null,
      "time_embed": {
        "activation": "SiLU",
        "layers": [],
        "learnable_features": false,
        "max_period": 10000.0,
        "norm": null,
        "num_features": 1024
      }
    },
    "robot_state_proj_config": {
      "activation": "SiLU",
      "fourier": false,
      "layers": [
        8,
        1024
      ],
      "mode": "ee_pose_gripper"
    },
    "rotation_components": 4,
    "token_size": 1024
  },
  "is_composition": false,
  "model_type": "spear1",
  "processor_config": {
    "control_io_config": {
      "future_control_offset_sec": 0.0,
      "future_controls_sequence_length": 5,
      "future_controls_sequence_stride_sec": 0.2,
      "future_frames_sequence_length": 1,
      "future_frames_sequence_stride_sec": null,
      "past_frames_sequence_length": 1,
      "past_frames_stride_sec": null,
      "past_scalars_sequence_length": 1,
      "past_scalars_stride_sec": null,
      "sequence_frames": 1,
      "sequence_frames_stride_sec": null
    },
    "control_stats_path": "barrel/pipes/vlams/types/control_stats.yaml",
    "control_tokenizer_config": {},
    "delta_controls": true,
    "distribution_hyperparams": {
      "alpha": 1.5,
      "beta": 1.0
    },
    "eef_control_frame": false,
    "image_resize": "smart",
    "joints_norm": {
      "high": [
        3.141592653589793,
        3.141592653589793,
        3.141592653589793,
        3.141592653589793,
        3.141592653589793,
        3.141592653589793,
        3.141592653589793
      ],
      "low": [
        -3.141592653589793,
        -3.141592653589793,
        -3.141592653589793,
        -3.141592653589793,
        -3.141592653589793,
        -3.141592653589793,
        -3.141592653589793
      ]
    },
    "num_inference_steps": 10,
    "obs_rotation_norm": "none",
    "obs_translation_norm": "bounds_q99",
    "observation_stats_path": "barrel/pipes/vlams/types/observation_stats.yaml",
    "r0_distribution": "uniform",
    "rotation_format": "quaternion",
    "rotation_norm": "none",
    "sig_min": 0.001,
    "timestep_distribution": "beta",
    "translation_norm": {
      "high": [
        0.04,
        0.04,
        0.04
      ],
      "low": [
        -0.04,
        -0.04,
        -0.04
      ]
    }
  },
  "sub_configs": {},
  "torch_dtype": "float32",
  "transformers_version": "4.47.0",
  "vlm_config": {
    "attn_implementation": "flash_attention_2",
    "depth_tokens": 1024,
    "lm_head": false,
    "mean_resizing": false,
    "model_id": "google/paligemma-3b-mix-224",
    "paligemma_3d_config": {
      "depth_config": {
        "hf_filename": "moge/moge-vit-large-patch-14-backbone.pt",
        "hf_hub_repo": "nikonikolov/vlams"
      },
      "depth_layers": 4,
      "depth_only": false,
      "mask_prob": 0.0,
      "projection": "features_add"
    },
    "processor_config": {
      "image_sizes": {
        "main": {
          "height": 210,
          "width": 280
        },
        "wrist": {
          "height": 112,
          "width": 112
        }
      },
      "image_token": "<image>",
      "max_language_tokens": 75
    },
    "train_only_depth_tokens": false
  }
}
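
The `auto_map` entry above routes `AutoConfig`/`AutoModel` to the `configuration_spear.SPEAR1Config` and `modeling_spear.SPEAR1` classes bundled in this repo. A minimal loading sketch (it assumes the repo id used in the README; checkpoints that ship custom code like this generally need `trust_remote_code=True`):

```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("INSAIT-Institute/spear1-franka", trust_remote_code=True)
model = AutoModel.from_pretrained("INSAIT-Institute/spear1-franka", trust_remote_code=True)
```
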
configuration_spear.py (new file, +347 lines)

import collections
import collections.abc
from typing import Any, Dict, List, Optional, Tuple

import numpy as np

from .common_spear import (
    Config,
    HFConfigMixin,
    Normalization,
    ResizeMode,
    RotationFormat,
)


class InputSequencingConfig(Config):
    """
    past_frames_sequence_length: number of past images needed in a single robot state
    past_scalars_sequence_length: number of past scalar state data, e.g. actions, poses, etc,
        needed in a single robot state
    past_frames_stride_sec: sampling rate, determines how far apart in time each point in the sequence
        is. If None, ignored and takes the default data collection frequency from the dataset
    past_scalars_stride_sec: similar to past_frames_stride_sec

    sequence_frames: number of temporally-sequential points in a single example in the batch
    sequence_frames_stride_sec: sampling rate

    Understanding sequence_frames:
    TODO: sequences are possibly useful in some rare cases, maybe sequence modeling problems,
        but yet to be confirmed. Keeping for now, but could be removed if proved unnecessary

    - past_scalars_sequence_length, past_frames_sequence_length, future_controls_sequence_length,
      future_frames_sequence_length are hyperparameters referring to a SINGLE dataset example / 'state'.
      It is assumed that `past_scalars_sequence_length` and `past_frames_sequence_length` are the min
      number of observations that comprise a single 'state'
    - sequence_frames is a hyperparameter referring to the entire learning process. It controls the size
      of the sequence dimension in the batch. It's treated similarly to the batch dimension, with the
      difference that points in the sequence dimensions are temporally aligned. Unlike `past_*`
      attributes, in supervised learning a label is loaded for every point in the sequence dimension
      and the loss usually computed over the entire sequence dimension.
    """

    past_scalars_sequence_length: int = 1
    past_frames_sequence_length: int = 1
    past_scalars_stride_sec: Optional[float] = None
    past_frames_stride_sec: Optional[float] = None
    sequence_frames: int = 1
    sequence_frames_stride_sec: Optional[float] = None

    def __post_init__(self):
        super().__post_init__()
        assert self.past_scalars_sequence_length >= 1, self.past_scalars_sequence_length
        assert self.past_frames_sequence_length >= 1, self.past_frames_sequence_length
        assert self.sequence_frames >= 1, self.sequence_frames
        if self.past_frames_stride_sec is not None:
            assert self.past_frames_stride_sec >= 0.0, self.past_frames_stride_sec
        if self.past_scalars_stride_sec is not None:
            assert self.past_scalars_stride_sec >= 0.0, self.past_scalars_stride_sec
        if self.sequence_frames_stride_sec is not None:
            assert self.sequence_frames_stride_sec >= 0.0, self.sequence_frames_stride_sec

    def assert_same_past(self) -> None:
        assert (
            self.past_frames_stride_sec == self.past_scalars_stride_sec
        ), f"{self.past_frames_stride_sec} != {self.past_scalars_stride_sec}"
        assert (
            self.past_frames_sequence_length == self.past_scalars_sequence_length
        ), f"{self.past_frames_sequence_length} != {self.past_scalars_sequence_length}"


class OutputSequencingConfig(Config):
    """
    future_controls_sequence_length: number of control steps in the future the model predicts
    future_frames_sequence_length: number of future frames the model predicts
        (only relevant for neural networks that learn some sort of a world model)

    future_controls_sequence_stride_sec / future_frames_sequence_stride_sec: sampling rate
        that determines how far apart in time each point in the sequence is. If None,
        ignored and takes the default data collection frequency from the dataset

    future_control_offset_sec: time interval between the last observation and the first
        point at which control is predicted. Serves as a 'causality hyperparameter', allowing
        for predicting controls slightly further into the future in environments with dynamics
        where the observed effects of an action appear slightly later
    """

    future_controls_sequence_length: int = 1
    future_controls_sequence_stride_sec: Optional[float] = None
    future_frames_sequence_length: int = 1
    future_frames_sequence_stride_sec: Optional[float] = None
    future_control_offset_sec: float = 0.0

    def __post_init__(self):
        super().__post_init__()
        assert self.future_controls_sequence_length >= 1, self.future_controls_sequence_length
        assert self.future_frames_sequence_length >= 1, self.future_frames_sequence_length
        assert self.future_control_offset_sec >= 0.0, self.future_control_offset_sec
        if self.future_controls_sequence_stride_sec is not None:
            assert self.future_controls_sequence_stride_sec >= 0.0, self.future_controls_sequence_stride_sec
        if self.future_frames_sequence_stride_sec is not None:
            assert self.future_frames_sequence_stride_sec >= 0.0, self.future_frames_sequence_stride_sec


class ControlDataIOConfig(InputSequencingConfig, OutputSequencingConfig):
    pass


class ControlTokenizerConfig(Config):
    pass


class EmptyTokenizerConfig(ControlTokenizerConfig):
    pass


class VLAMProcessorConfig(Config):
    control_io_config: ControlDataIOConfig = ControlDataIOConfig()
    obs_translation_norm: Normalization | Dict[str, Tuple[float, float, float]] = Normalization.NONE
    obs_rotation_norm: Normalization = Normalization.NONE
    translation_norm: Normalization | Dict[str, Tuple[float, float, float]] = Normalization.NONE
    rotation_norm: Normalization = Normalization.NONE
    joints_norm: Dict[str, Tuple[float, ...]] = {
        "low": (-np.pi,) * 7,
        "high": (np.pi,) * 7,
    }
    rotation_format: RotationFormat = RotationFormat.QUATERNION
    eef_control_frame: bool = False
    delta_controls: bool = False
    image_resize: ResizeMode = ResizeMode.SMART
    control_tokenizer_config: EmptyTokenizerConfig = EmptyTokenizerConfig()
    control_stats_path: str = "barrel/pipes/vlams/types/control_stats.yaml"
    observation_stats_path: str = "barrel/pipes/vlams/types/observation_stats.yaml"

    def __post_init__(self):
        super().__post_init__()
        if isinstance(self.translation_norm, collections.abc.Mapping):
            assert all((len(value) == 3 for value in self.translation_norm.values())), self.translation_norm
            assert set(self.translation_norm.keys()) in (
                {"low", "high"},
                {"mean", "std"},
            ), self.translation_norm
        assert isinstance(self.joints_norm, collections.abc.Mapping), type(self.joints_norm)
        assert all((len(value) == 7 for value in self.joints_norm.values())), self.joints_norm
        assert set(self.joints_norm.keys()) in (
            {"low", "high"},
            {"mean", "std"},
        ), self.joints_norm


class RegressionProcessorConfig(VLAMProcessorConfig):
    pass


class PiZeroFlowProcessorConfig(RegressionProcessorConfig):
    num_inference_steps: int
    r0_distribution: str = "uniform"
    timestep_distribution: str
    distribution_hyperparams: Dict[str, Any] = {}
    sig_min: float = 0.001

    def __post_init__(self):
        super().__post_init__()
        assert self.r0_distribution in ["normal", "uniform"]


class VLMConfig(Config):
    pass


class VLMProcessorConfig(Config):
    pass


class ImageSizeConfig(Config):
    width: int
    height: int

    def to_dict(self):
        return {"width": self.width, "height": self.height}


class PaliGemmaProcessorConfig(Config):
    image_token: str = "<image>"
    image_sizes: Dict[str, ImageSizeConfig] = {"main": ImageSizeConfig(width=224, height=224)}
    max_language_tokens: int = 75

    def __post_init__(self):
        super().__post_init__()
        self.image_sizes = {
            camera_name: (
                ImageSizeConfig(**camera_image_size)
                if not isinstance(camera_image_size, ImageSizeConfig)
                else camera_image_size
            )
            for camera_name, camera_image_size in self.image_sizes.items()
        }
        for camera_name, camera_image_size in self.image_sizes.items():
            assert camera_image_size.height % 14 == 0, f"{camera_name}: {camera_image_size}"
            assert camera_image_size.width % 14 == 0, f"{camera_name}: {camera_image_size}"

    @property
    def num_image_tokens(self) -> Dict[str, int]:
        return {
            camera_name: camera_image_size.height // 14 * (camera_image_size.width // 14)
            for (camera_name, camera_image_size) in self.image_sizes.items()
        }
| 207 |
+
|
| 208 |
+
@property
|
| 209 |
+
def is_single_image_size(self) -> bool:
|
| 210 |
+
return (
|
| 211 |
+
len(self.image_sizes) == 1
|
| 212 |
+
or len(set(((image_size.height, image_size.width) for image_size in self.image_sizes.values())))
|
| 213 |
+
== 1
|
| 214 |
+
)
|
| 215 |
+
|
| 216 |
+
@property
|
| 217 |
+
def camera_names(self) -> List[str]:
|
| 218 |
+
return list(self.image_sizes.keys())
|
| 219 |
+
|
| 220 |
+
def to_dict(self) -> Dict[str, Any]:
|
| 221 |
+
base_dict = {
|
| 222 |
+
"image_token": self.image_token,
|
| 223 |
+
"max_language_tokens": self.max_language_tokens,
|
| 224 |
+
}
|
| 225 |
+
base_dict["image_sizes"] = {
|
| 226 |
+
camera_name: camera_image_size.to_dict()
|
| 227 |
+
for camera_name, camera_image_size in self.image_sizes.items()
|
| 228 |
+
}
|
| 229 |
+
return base_dict
|
| 230 |
+
|
| 231 |
+
|
| 232 |
+
class PaliGemmaVLMConfig(Config):
|
| 233 |
+
model_id: str = "google/paligemma-3b-mix-224"
|
| 234 |
+
attn_implementation: str = "flash_attention_2"
|
| 235 |
+
processor_config: PaliGemmaProcessorConfig
|
| 236 |
+
lm_head: bool = False
|
| 237 |
+
paligemma_3d_config: Dict[str, Any] = {}
|
| 238 |
+
depth_tokens: int = 0
|
| 239 |
+
train_only_depth_tokens: bool = False
|
| 240 |
+
mean_resizing: bool = False
|
| 241 |
+
|
| 242 |
+
def __post_init__(self):
|
| 243 |
+
super().__post_init__()
|
| 244 |
+
if self.train_only_depth_tokens:
|
| 245 |
+
assert self.depth_tokens > 0, self.depth_tokens
|
| 246 |
+
if self.paligemma_3d_config.get("mask_prob", 0.0) != 0.0:
|
| 247 |
+
raise NotImplementedError(
|
| 248 |
+
f"Masking is deprecated, but got mask_prob={self.paligemma_3d_config['mask_prob']}"
|
| 249 |
+
)
|
| 250 |
+
|
| 251 |
+
@property
|
| 252 |
+
def paligemma_3d_config_dict(self) -> Dict[str, Any]:
|
| 253 |
+
if len(self.paligemma_3d_config) == 0:
|
| 254 |
+
return {}
|
| 255 |
+
config = dict(self.paligemma_3d_config)
|
| 256 |
+
config["depth_config"] = dict(config["depth_config"])
|
| 257 |
+
config["depth_config"]["image_sizes"] = {
|
| 258 |
+
camera_name: camera_image_size.to_dict()
|
| 259 |
+
for camera_name, camera_image_size in self.processor_config.image_sizes.items()
|
| 260 |
+
}
|
| 261 |
+
return config
|
| 262 |
+
|
| 263 |
+
@property
|
| 264 |
+
def with_depth(self) -> bool:
|
| 265 |
+
return len(self.paligemma_3d_config) > 0
|
| 266 |
+
|
| 267 |
+
|
| 268 |
+
class FourierFeaturesConfig(Config):
|
| 269 |
+
num_features: int = 256
|
| 270 |
+
learnable_features: bool = False
|
| 271 |
+
max_period: float = 10000.0
|
| 272 |
+
layers: List[int] = [256, 512, 256]
|
| 273 |
+
activation: str = "SiLU"
|
| 274 |
+
norm: Optional[str] = None
|
| 275 |
+
|
| 276 |
+
|
| 277 |
+
class NoisedControlProjectorConfig(Config):
|
| 278 |
+
time_embed: FourierFeaturesConfig
|
| 279 |
+
layers: List[int] = []
|
| 280 |
+
activation: str = "SiLU"
|
| 281 |
+
norm: Optional[str] = None
|
| 282 |
+
|
| 283 |
+
|
| 284 |
+
class RobotStateProjectorConfig(Config):
|
| 285 |
+
layers: List[int] = []
|
| 286 |
+
mode: str = "none"
|
| 287 |
+
activation: str = "GELU"
|
| 288 |
+
fourier: bool = False
|
| 289 |
+
|
| 290 |
+
def __post_init__(self):
|
| 291 |
+
super().__post_init__()
|
| 292 |
+
assert self.mode in [
|
| 293 |
+
"ee_pose",
|
| 294 |
+
"ee_pose_gripper",
|
| 295 |
+
"ee_pose_joints",
|
| 296 |
+
"joints",
|
| 297 |
+
"all",
|
| 298 |
+
"none",
|
| 299 |
+
], self.mode
|
| 300 |
+
|
| 301 |
+
|
| 302 |
+
class RotaryPositionalEncodingConfig(Config):
|
| 303 |
+
num_embeddings: int
|
| 304 |
+
embedding_dim: int
|
| 305 |
+
base: int = 10000
|
| 306 |
+
cached: bool = True
|
| 307 |
+
|
| 308 |
+
|
| 309 |
+
class PiZeroFlowMatchingDecoderBlockConfig(Config):
|
| 310 |
+
feature_size: int
|
| 311 |
+
head_dim: int = 128
|
| 312 |
+
num_heads: int = 32
|
| 313 |
+
num_kv_heads: int = 1
|
| 314 |
+
hidden_size: int
|
| 315 |
+
activation: str = "GELU"
|
| 316 |
+
norm: str = "RMSNorm"
|
| 317 |
+
dropout: float = 0.0
|
| 318 |
+
attn_implementation: str = "sdpa"
|
| 319 |
+
position_embed_config: RotaryPositionalEncodingConfig
|
| 320 |
+
|
| 321 |
+
|
| 322 |
+
class PiZeroFlowMatchingDecoderConfig(Config):
|
| 323 |
+
num_blocks: int
|
| 324 |
+
block_config: PiZeroFlowMatchingDecoderBlockConfig
|
| 325 |
+
|
| 326 |
+
|
| 327 |
+
class PiZeroFlowMatchingModuleConfig(Config):
|
| 328 |
+
token_size: int = 1024
|
| 329 |
+
noised_control_proj_config: NoisedControlProjectorConfig
|
| 330 |
+
robot_state_proj_config: RobotStateProjectorConfig
|
| 331 |
+
control_decoder_config: PiZeroFlowMatchingDecoderConfig
|
| 332 |
+
rotation_components: int = 3
|
| 333 |
+
|
| 334 |
+
|
| 335 |
+
class SPEAR1Config(HFConfigMixin, Config):
|
| 336 |
+
model_type: str = "spear1"
|
| 337 |
+
processor_config: PiZeroFlowProcessorConfig
|
| 338 |
+
vlm_config: PaliGemmaVLMConfig
|
| 339 |
+
control_module_config: PiZeroFlowMatchingModuleConfig
|
| 340 |
+
|
| 341 |
+
def __init__(self, **kwargs):
|
| 342 |
+
if "auto_map" not in kwargs:
|
| 343 |
+
kwargs["auto_map"] = {
|
| 344 |
+
"AutoConfig": "configuration_spear.SPEAR1Config",
|
| 345 |
+
"AutoModel": "modeling_spear.SPEAR1",
|
| 346 |
+
}
|
| 347 |
+
super().__init__(**kwargs)
|
generation_config.json
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"transformers_version": "4.47.0"
|
| 3 |
+
}
|
model-00001-of-00003.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a0992d3b5ffdc8b896812ed19801bc9ebda65708237681ced90e642c90e0a0d2
|
| 3 |
+
size 4962008480
|
model-00002-of-00003.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:db48d29ee9567705a81718181eac6c644d2d996f1e91c497e8c891702050c36e
|
| 3 |
+
size 4999821656
|
model-00003-of-00003.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1c7e1d6dae46553546f53a3c9fa76a8a2d2e07664a575ce38962ae2930eb7562
|
| 3 |
+
size 4245980072
|
model.safetensors.index.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
modeling_spear.py
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
processing_spear.py
ADDED
|
@@ -0,0 +1,1897 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import collections
|
| 2 |
+
import collections.abc
|
| 3 |
+
import re
|
| 4 |
+
import warnings
|
| 5 |
+
from abc import abstractmethod
|
| 6 |
+
from functools import cached_property
|
| 7 |
+
from typing import Dict, List, Optional, Sequence, Tuple, TypeVar
|
| 8 |
+
|
| 9 |
+
import numpy as np
|
| 10 |
+
import PIL.Image
|
| 11 |
+
import roma
|
| 12 |
+
import torch
|
| 13 |
+
import torchvision.transforms.v2
|
| 14 |
+
import transformers
|
| 15 |
+
import yaml
|
| 16 |
+
|
| 17 |
+
from .common_spear import (
|
| 18 |
+
Configurable,
|
| 19 |
+
FlowInput,
|
| 20 |
+
Normalization,
|
| 21 |
+
ResizeMode,
|
| 22 |
+
RoboticsControlPlan,
|
| 23 |
+
RoboticsFlowInput,
|
| 24 |
+
RoboticsInput,
|
| 25 |
+
RoboticsOutput,
|
| 26 |
+
RoboticsTarget,
|
| 27 |
+
RotationFormat,
|
| 28 |
+
expand_dims,
|
| 29 |
+
is_quaternion,
|
| 30 |
+
is_rotmat,
|
| 31 |
+
is_rotmat_3x3,
|
| 32 |
+
is_rotmat_9,
|
| 33 |
+
quaternion_half_cover,
|
| 34 |
+
rotmat_as_3x3,
|
| 35 |
+
rotmat_as_9,
|
| 36 |
+
)
|
| 37 |
+
from .configuration_spear import (
|
| 38 |
+
ControlDataIOConfig,
|
| 39 |
+
ImageSizeConfig,
|
| 40 |
+
PaliGemmaProcessorConfig,
|
| 41 |
+
)
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
class VLMProcessor(Configurable):
|
| 45 |
+
@abstractmethod
|
| 46 |
+
def preprocess_inputs(
|
| 47 |
+
self, chat: List[str], images: Dict[str, List[PIL.Image.Image]]
|
| 48 |
+
) -> Dict[str, torch.Tensor | Dict[str, torch.Tensor]]: ...
|
| 49 |
+
|
| 50 |
+
@property
|
| 51 |
+
@abstractmethod
|
| 52 |
+
def tokenizer(self) -> transformers.PreTrainedTokenizerBase:
|
| 53 |
+
pass
|
| 54 |
+
|
| 55 |
+
@property
|
| 56 |
+
@abstractmethod
|
| 57 |
+
def image_sizes(self) -> Dict[str, ImageSizeConfig]:
|
| 58 |
+
pass
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
class EmptyTokenizer(Configurable):
|
| 62 |
+
"""
|
| 63 |
+
Takes the LLM hidden states from `llm_layer_indices` and concatenates them to produce the
|
| 64 |
+
desired result. Includes the hidden states for the image tokens.
|
| 65 |
+
"""
|
| 66 |
+
|
| 67 |
+
def __init__(self, config, tokenizer: transformers.PreTrainedTokenizerBase) -> None:
|
| 68 |
+
super().__init__(config)
|
| 69 |
+
self.tokenizer = tokenizer
|
| 70 |
+
|
| 71 |
+
def __call__(self, *_) -> str:
|
| 72 |
+
return ""
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def np_unique(
|
| 76 |
+
data: np.ndarray,
|
| 77 |
+
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
|
| 78 |
+
"""
|
| 79 |
+
Compute unique elements in data and corresponding indices.
|
| 80 |
+
|
| 81 |
+
np.unique returns the values in a sorted order, even if the source is not sorted. Thus, if you simply
|
| 82 |
+
run np.unique on unsorted data, the indices you will get will be invalid.
|
| 83 |
+
|
| 84 |
+
"""
|
| 85 |
+
(_, indices, inverse) = np.unique(data, return_index=True, return_inverse=True)
|
| 86 |
+
(_, indices_of_first_occurence, inverse_indices, counts) = np.unique(
|
| 87 |
+
indices[inverse], return_index=True, return_inverse=True, return_counts=True
|
| 88 |
+
)
|
| 89 |
+
unique_ids = data[indices_of_first_occurence]
|
| 90 |
+
return unique_ids, indices_of_first_occurence, inverse_indices, counts
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
def euler_to_rotmat(angles: torch.Tensor) -> torch.Tensor:
|
| 94 |
+
"""
|
| 95 |
+
Args:
|
| 96 |
+
angles: Euler angles in radians in the format 'xyz', shape [..., 3]
|
| 97 |
+
Returns:
|
| 98 |
+
torch.Tensor of shape [..., 3, 3] containing rotation matrices
|
| 99 |
+
"""
|
| 100 |
+
return roma.euler_to_rotmat(convention="xyz", angles=angles, degrees=False)
|
| 101 |
+
|
| 102 |
+
|
| 103 |
+
def euler_to_unit_quaternion(angles: torch.Tensor) -> torch.Tensor:
|
| 104 |
+
"""
|
| 105 |
+
Args:
|
| 106 |
+
angles: Euler angles in radians in the format 'xyz', shape [..., 3]
|
| 107 |
+
Returns:
|
| 108 |
+
torch.Tensor of shape [..., 4] containing unit quaternions
|
| 109 |
+
"""
|
| 110 |
+
return roma.euler_to_unitquat(convention="xyz", angles=angles, degrees=False, normalize=True)
|
| 111 |
+
|
| 112 |
+
|
| 113 |
+
def normalize_quaternion(quaternion: torch.Tensor, eps: float = 1e-08) -> torch.Tensor:
|
| 114 |
+
"""
|
| 115 |
+
Args:
|
| 116 |
+
quaternion: Unnormalized quaternion, torch.Tensor of shape [..., 4]
|
| 117 |
+
eps: Small constant to prevent division by zero
|
| 118 |
+
Returns:
|
| 119 |
+
torch.Tensor of shape [..., 4] of unit quaternions
|
| 120 |
+
"""
|
| 121 |
+
return quaternion / (quaternion.norm(dim=-1, keepdim=True).detach() + eps)
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
def quaternion_to_euler(quaternion: torch.Tensor) -> torch.Tensor:
|
| 125 |
+
"""
|
| 126 |
+
Args:
|
| 127 |
+
quaternion: torch.Tensor of shape [..., 4]; Can be non-normalized
|
| 128 |
+
Returns:
|
| 129 |
+
torch.Tensor of shape [..., 3, 3] containing rotation matrices in SO(3)
|
| 130 |
+
"""
|
| 131 |
+
unit_quat = normalize_quaternion(quaternion)
|
| 132 |
+
rotmat = roma.unitquat_to_euler(convention="xyz", quat=unit_quat, as_tuple=False, degrees=False)
|
| 133 |
+
return rotmat
|
| 134 |
+
|
| 135 |
+
|
| 136 |
+
def quaternion_to_rotmat(quaternion: torch.Tensor) -> torch.Tensor:
|
| 137 |
+
"""
|
| 138 |
+
Args:
|
| 139 |
+
quaternion: torch.Tensor of shape [..., 4]; Can be non-normalized
|
| 140 |
+
Returns:
|
| 141 |
+
torch.Tensor of shape [..., 3, 3] containing rotation matrices in SO(3)
|
| 142 |
+
"""
|
| 143 |
+
unit_quat = normalize_quaternion(quaternion)
|
| 144 |
+
rotmat = roma.unitquat_to_rotmat(unit_quat)
|
| 145 |
+
return rotmat
|
| 146 |
+
|
| 147 |
+
|
| 148 |
+
def rotmat_to_unit_quaternion(rotmat: torch.Tensor) -> torch.Tensor:
|
| 149 |
+
"""
|
| 150 |
+
Args:
|
| 151 |
+
rotmat: Batch of rotation matrices, shape [..., 3, 3]
|
| 152 |
+
Returns:
|
| 153 |
+
Batch of unit quaternions, shape [..., 4]
|
| 154 |
+
"""
|
| 155 |
+
rotmat = rotmat_as_3x3(rotmat)
|
| 156 |
+
return roma.rotmat_to_unitquat(rotmat)
|
| 157 |
+
|
| 158 |
+
|
| 159 |
+
def rotmat_to_euler(rotmat: torch.Tensor) -> torch.Tensor:
|
| 160 |
+
"""
|
| 161 |
+
Args:
|
| 162 |
+
rotmat: Batch of rotation matrices, shape [..., 3, 3]
|
| 163 |
+
Returns:
|
| 164 |
+
Batch of Euler angles in radiant, shape [..., 3]
|
| 165 |
+
"""
|
| 166 |
+
rotmat = rotmat_as_3x3(rotmat)
|
| 167 |
+
return roma.rotmat_to_euler(convention="xyz", rotmat=rotmat, as_tuple=False, degrees=False)
|
| 168 |
+
|
| 169 |
+
|
| 170 |
+
def symmetric_orthogonalization(x: torch.Tensor) -> torch.Tensor:
|
| 171 |
+
"""
|
| 172 |
+
Maps 9D input vectors onto SO(3) via symmetric orthogonalization.
|
| 173 |
+
- Let SVD(M) = U \Sigma V^T
|
| 174 |
+
- Returned value is SVD+(M) = U diag(1, 1, det(UV^T)) V^T
|
| 175 |
+
- det(UV^T) ensures that det(SVD+(M)) = 1
|
| 176 |
+
- The return value is a rotation matrix (ortonormal) with the least-squares distance to M
|
| 177 |
+
|
| 178 |
+
Args:
|
| 179 |
+
x: Input matrices, not necessarily orthonormal, shape [..., 9] or [..., 3, 3]
|
| 180 |
+
Returns:
|
| 181 |
+
torch.Tensor with the same shape as x, where each inner 3x3 matrix is in SO(3)
|
| 182 |
+
"""
|
| 183 |
+
with warnings.catch_warnings():
|
| 184 |
+
warnings.filterwarnings(
|
| 185 |
+
"ignore",
|
| 186 |
+
message="In CPU autocast, but the target dtype is not supported. Disabling autocast.",
|
| 187 |
+
)
|
| 188 |
+
with torch.autocast(device_type=x.device.type, dtype=torch.float32):
|
| 189 |
+
matrices = x.view(-1, 3, 3)
|
| 190 |
+
matrices = matrices.to(dtype=torch.float32)
|
| 191 |
+
(u, s, v) = torch.svd(matrices)
|
| 192 |
+
vt = torch.transpose(v, 1, 2)
|
| 193 |
+
det = torch.det(torch.matmul(u, vt)).view(-1, 1, 1)
|
| 194 |
+
diag_vt = torch.cat((vt[:, :2, :], vt[:, -1:, :] * det), dim=1)
|
| 195 |
+
result = torch.matmul(u, diag_vt)
|
| 196 |
+
result = result.view(*x.shape)
|
| 197 |
+
result = result.to(dtype=x.dtype)
|
| 198 |
+
return result
|
| 199 |
+
|
| 200 |
+
|
| 201 |
+
def is_rotmat_orthonormal(
|
| 202 |
+
rotmat: torch.Tensor, epsilon: float = 1e-06, reduction: str = "none"
|
| 203 |
+
) -> torch.Tensor | bool:
|
| 204 |
+
"""
|
| 205 |
+
Check if a rotation matrix is orthonormal or not.
|
| 206 |
+
Args:
|
| 207 |
+
rotmat: torch.Tensor of shape [..., 3, 3] or [..., 9]
|
| 208 |
+
epsilon: Tolerance for numerical comparisons. Bigger values allow for more freedom. Generally,
|
| 209 |
+
anything smaller than 1e-6 might incorrectly detect some otrhonormal matrices as not
|
| 210 |
+
reduction:
|
| 211 |
+
'none' - returns torch.Tensor of bools with the same batch shape
|
| 212 |
+
'all' - returns a bool, True is ALL matrices in the batch are orthonormal
|
| 213 |
+
Returns:
|
| 214 |
+
torch.Tensor with the same batch shape or bool
|
| 215 |
+
"""
|
| 216 |
+
assert is_rotmat(rotmat)
|
| 217 |
+
rotmat = rotmat_as_3x3(rotmat.to(dtype=torch.float32))
|
| 218 |
+
is_orthonormal = roma.is_orthonormal_matrix(rotmat, epsilon=epsilon)
|
| 219 |
+
if reduction == "none":
|
| 220 |
+
return is_orthonormal
|
| 221 |
+
if reduction == "all":
|
| 222 |
+
return bool(torch.all(is_orthonormal).item())
|
| 223 |
+
raise ValueError(f"Unknown reduction mode {reduction}")
|
| 224 |
+
|
| 225 |
+
|
| 226 |
+
def is_orthonormal_rotmat(rotmat: torch.Tensor) -> bool:
|
| 227 |
+
"""
|
| 228 |
+
Checks if the tensor shape matches that of a rotmat. If the last dimensions of shape are 3x3,
|
| 229 |
+
also checks if the data is a valid rotmat. This is to avoid a possible clash with euler angles
|
| 230 |
+
when accidentally `rotmat.shape[-2:] == [3, 3]`
|
| 231 |
+
"""
|
| 232 |
+
return (
|
| 233 |
+
is_rotmat_9(rotmat)
|
| 234 |
+
or is_rotmat_3x3(rotmat)
|
| 235 |
+
and is_rotmat_orthonormal(rotmat, epsilon=0.01, reduction="all")
|
| 236 |
+
)
|
| 237 |
+
|
| 238 |
+
|
| 239 |
+
def is_euler(euler: torch.Tensor) -> bool:
|
| 240 |
+
return euler.shape[-1] == 3 and not is_orthonormal_rotmat(euler)
|
| 241 |
+
|
| 242 |
+
|
| 243 |
+
def normalize_rotation(rotation: torch.Tensor) -> torch.Tensor:
|
| 244 |
+
if is_quaternion(rotation):
|
| 245 |
+
return normalize_quaternion(rotation)
|
| 246 |
+
if is_euler(rotation):
|
| 247 |
+
return rotation
|
| 248 |
+
if is_rotmat(rotation):
|
| 249 |
+
is_flat = is_rotmat_9(rotation)
|
| 250 |
+
rotation = rotmat_as_3x3(rotation) if is_flat else rotation
|
| 251 |
+
rotmat = roma.special_gramschmidt(rotation)
|
| 252 |
+
rotmat = rotmat_as_9(rotmat) if is_flat else rotmat
|
| 253 |
+
return rotmat
|
| 254 |
+
raise ValueError(f"Unknown rotation format: {rotation.shape}")
|
| 255 |
+
|
| 256 |
+
|
| 257 |
+
def rotation_format_from_tensor(rotation) -> RotationFormat:
|
| 258 |
+
if is_quaternion(rotation):
|
| 259 |
+
return RotationFormat.QUATERNION
|
| 260 |
+
if is_orthonormal_rotmat(rotation):
|
| 261 |
+
return RotationFormat.ROTMAT
|
| 262 |
+
if is_euler(rotation):
|
| 263 |
+
return RotationFormat.EULER
|
| 264 |
+
raise ValueError(f"Tensor shape {rotation.shape} is not a valid rotation format")
|
| 265 |
+
|
| 266 |
+
|
| 267 |
+
def is_unit_quaternion(
|
| 268 |
+
quaternion: torch.Tensor, epsilon: float = 1e-08, reduction: str = "none"
|
| 269 |
+
) -> torch.Tensor | bool:
|
| 270 |
+
"""
|
| 271 |
+
Check if a quternion is normalized or not.
|
| 272 |
+
Args:
|
| 273 |
+
quaternion: torch.Tensor of shape [..., 4]
|
| 274 |
+
tolerance: Tolerance for numerical comparisons
|
| 275 |
+
reduction:
|
| 276 |
+
'none' - returns torch.Tensor of bools with the same batch shape
|
| 277 |
+
'all' - returns a bool, True if ALL quaternions in the batch are normalized
|
| 278 |
+
Returns:
|
| 279 |
+
torch.Tensor with the same batch shape or bool
|
| 280 |
+
"""
|
| 281 |
+
assert is_quaternion(quaternion)
|
| 282 |
+
is_norm = torch.isclose(
|
| 283 |
+
quaternion.norm(dim=-1, keepdim=True),
|
| 284 |
+
torch.tensor(1.0, dtype=quaternion.dtype, device=quaternion.device),
|
| 285 |
+
atol=epsilon,
|
| 286 |
+
)
|
| 287 |
+
if reduction == "none":
|
| 288 |
+
return is_norm
|
| 289 |
+
if reduction == "all":
|
| 290 |
+
return bool(torch.all(is_norm).item())
|
| 291 |
+
raise ValueError(f"Unknown reduction mode {reduction}")
|
| 292 |
+
|
| 293 |
+
|
| 294 |
+
def convert_rotation(
|
| 295 |
+
rotation: torch.Tensor | np.ndarray,
|
| 296 |
+
output_format: RotationFormat,
|
| 297 |
+
autonorm: bool = True,
|
| 298 |
+
half_cover: bool = True,
|
| 299 |
+
) -> torch.Tensor | np.ndarray:
|
| 300 |
+
is_np = isinstance(rotation, np.ndarray)
|
| 301 |
+
if is_np:
|
| 302 |
+
rotation = torch.from_numpy(rotation)
|
| 303 |
+
if is_quaternion(rotation):
|
| 304 |
+
if autonorm and not is_unit_quaternion(rotation, reduction="all"):
|
| 305 |
+
rotation = normalize_quaternion(rotation)
|
| 306 |
+
if output_format == RotationFormat.QUATERNION:
|
| 307 |
+
output = rotation
|
| 308 |
+
elif output_format == RotationFormat.ROTMAT:
|
| 309 |
+
output = rotmat_as_9(quaternion_to_rotmat(rotation))
|
| 310 |
+
elif output_format == RotationFormat.EULER:
|
| 311 |
+
output = quaternion_to_euler(rotation)
|
| 312 |
+
else:
|
| 313 |
+
raise NotImplementedError(f"Unsupported rotation format: {output_format}")
|
| 314 |
+
elif is_orthonormal_rotmat(rotation):
|
| 315 |
+
if autonorm and not is_rotmat_orthonormal(rotation, epsilon=0.01, reduction="all"):
|
| 316 |
+
rotation = symmetric_orthogonalization(rotation)
|
| 317 |
+
if output_format == RotationFormat.QUATERNION:
|
| 318 |
+
output = rotmat_to_unit_quaternion(rotation)
|
| 319 |
+
elif output_format == RotationFormat.ROTMAT:
|
| 320 |
+
output = rotmat_as_9(rotation)
|
| 321 |
+
elif output_format == RotationFormat.EULER:
|
| 322 |
+
output = rotmat_to_euler(rotation)
|
| 323 |
+
else:
|
| 324 |
+
raise NotImplementedError(f"Unsupported rotation format: {output_format}")
|
| 325 |
+
elif is_euler(rotation):
|
| 326 |
+
if output_format == RotationFormat.QUATERNION:
|
| 327 |
+
output = euler_to_unit_quaternion(rotation)
|
| 328 |
+
elif output_format == RotationFormat.ROTMAT:
|
| 329 |
+
output = rotmat_as_9(euler_to_rotmat(rotation))
|
| 330 |
+
elif output_format == RotationFormat.EULER:
|
| 331 |
+
output = rotation
|
| 332 |
+
else:
|
| 333 |
+
raise NotImplementedError(f"Unsupported rotation format: {output_format}")
|
| 334 |
+
else:
|
| 335 |
+
raise ValueError(f"Unknown rotation encoding with shape {rotation.shape}")
|
| 336 |
+
if output_format == RotationFormat.QUATERNION and half_cover:
|
| 337 |
+
output = quaternion_half_cover(output)
|
| 338 |
+
if is_np:
|
| 339 |
+
output = output.numpy()
|
| 340 |
+
return output
|
| 341 |
+
|
| 342 |
+
|
| 343 |
+
def delta_to_relative_rotations(rotation_sequence: torch.Tensor) -> torch.Tensor:
|
| 344 |
+
"""
|
| 345 |
+
Transform a sequence of rotation representations encoded w.r.t. the PREVIOUS rotation frame in the
|
| 346 |
+
sequence to the 0-th element preceding the sequence
|
| 347 |
+
|
| 348 |
+
Ex:
|
| 349 |
+
`rotation_sequence` contains the rotations: R_01, R_12, R_23, R_34, where R0 is the base frame,
|
| 350 |
+
implicitly encoded in R_01 and R_10 converts from R0 frame to R1 frame
|
| 351 |
+
Output: R_01, R_02, R_03, R_04
|
| 352 |
+
|
| 353 |
+
Args:
|
| 354 |
+
rotation_sequence: torch.Tensor of shape [..., S, 9], [..., S, 3, 3] or [..., S, 4], containing
|
| 355 |
+
either rotation matrices (R_01, R_12, R_23, R_34, ...) or quaternions
|
| 356 |
+
Returns:
|
| 357 |
+
torch.Tensor of shape [..., S, 9], [..., S, 3, 3] or [..., S, 4] containing transformed rotations
|
| 358 |
+
(R_01, R_02, R_03, R_04, ...)
|
| 359 |
+
|
| 360 |
+
TODO: Can you make it work without for loop
|
| 361 |
+
"""
|
| 362 |
+
assert rotation_sequence.ndim >= 3, rotation_sequence.shape
|
| 363 |
+
rotation_format: RotationFormat = rotation_format_from_tensor(rotation_sequence)
|
| 364 |
+
rotation_sequence = convert_rotation(rotation_sequence, RotationFormat.QUATERNION)
|
| 365 |
+
batch_dims = np.arange(rotation_sequence.ndim - 2)
|
| 366 |
+
delta_rotations = torch.cat(
|
| 367 |
+
[rotation_sequence[..., :1, :]]
|
| 368 |
+
+ [
|
| 369 |
+
roma.quat_composition(rotation_sequence[..., :i, :].permute(-2, *batch_dims, -1).unsqueeze(-2))
|
| 370 |
+
for i in range(2, rotation_sequence.shape[-2] + 1)
|
| 371 |
+
],
|
| 372 |
+
dim=-2,
|
| 373 |
+
)
|
| 374 |
+
delta_rotations = convert_rotation(delta_rotations, rotation_format)
|
| 375 |
+
return delta_rotations
|
| 376 |
+
|
| 377 |
+
|
| 378 |
+
def assert_np_hwc_or_hw_image(image: np.ndarray | PIL.Image.Image) -> np.ndarray:
|
| 379 |
+
"""Make sure image is of type np.ndarray and HWC format"""
|
| 380 |
+
if isinstance(image, PIL.Image.Image):
|
| 381 |
+
image = np.asarray(image)
|
| 382 |
+
assert isinstance(image, np.ndarray), type(image)
|
| 383 |
+
assert image.ndim in [2, 3], image.shape
|
| 384 |
+
if image.ndim == 3:
|
| 385 |
+
assert image.shape[-1] <= 4, image.shape
|
| 386 |
+
return image
|
| 387 |
+
|
| 388 |
+
|
| 389 |
+
def hw_from_image(image: PIL.Image.Image | np.ndarray) -> tuple[int, int]:
|
| 390 |
+
if isinstance(image, np.ndarray):
|
| 391 |
+
(height, width) = image.shape[:2]
|
| 392 |
+
else:
|
| 393 |
+
(width, height) = image.size
|
| 394 |
+
return height, width
|
| 395 |
+
|
| 396 |
+
|
| 397 |
+
def pad_image(
|
| 398 |
+
image: PIL.Image.Image | np.ndarray,
|
| 399 |
+
target_size: dict[str, int],
|
| 400 |
+
pad_value: tuple[int, int, int] | tuple[float, float, float] | int | float = 0,
|
| 401 |
+
) -> PIL.Image.Image | np.ndarray:
|
| 402 |
+
"""Pad image adding a symmetric border around the height/width."""
|
| 403 |
+
assert isinstance(image, (PIL.Image.Image, np.ndarray)), type(image)
|
| 404 |
+
(height, width) = hw_from_image(image)
|
| 405 |
+
(target_width, target_height) = (target_size["width"], target_size["height"])
|
| 406 |
+
if width == target_width and height == target_height:
|
| 407 |
+
return image
|
| 408 |
+
assert target_width >= width, f"Can't pad image of width {width} to {target_width}"
|
| 409 |
+
assert target_height >= height, f"Can't pad image of height {height} to {target_height}"
|
| 410 |
+
(horizontal_pad, vertical_pad) = (
|
| 411 |
+
int((target_width - width) / 2),
|
| 412 |
+
int((target_height - height) / 2),
|
| 413 |
+
)
|
| 414 |
+
if isinstance(image, np.ndarray):
|
| 415 |
+
padding = ((vertical_pad, vertical_pad), (horizontal_pad, horizontal_pad)) + ((0, 0),) * (
|
| 416 |
+
image.ndim - 2
|
| 417 |
+
)
|
| 418 |
+
image = np.pad(image, padding, mode="constant", constant_values=pad_value)
|
| 419 |
+
else:
|
| 420 |
+
padding = (horizontal_pad, vertical_pad, horizontal_pad, vertical_pad)
|
| 421 |
+
image = torchvision.transforms.v2.functional.pad(
|
| 422 |
+
image, padding=padding, fill=pad_value, padding_mode="constant"
|
| 423 |
+
)
|
| 424 |
+
return image
|
| 425 |
+
|
| 426 |
+
|
| 427 |
+
def pad_image_to_ratio(
|
| 428 |
+
image: PIL.Image.Image | np.ndarray,
|
| 429 |
+
target_wh_ratio: float,
|
| 430 |
+
pad_value: tuple[int, int, int] | tuple[float, float, float] | int | float = 0,
|
| 431 |
+
) -> PIL.Image.Image | np.ndarray:
|
| 432 |
+
"""Pad image to a target aspect ratio."""
|
| 433 |
+
(height, width) = hw_from_image(image)
|
| 434 |
+
wh_ratio = width / height
|
| 435 |
+
if target_wh_ratio >= wh_ratio:
|
| 436 |
+
pad_size = {"width": round(height * target_wh_ratio), "height": height}
|
| 437 |
+
else:
|
| 438 |
+
pad_size = {"width": width, "height": round(width / target_wh_ratio)}
|
| 439 |
+
image = pad_image(image, target_size=pad_size, pad_value=pad_value)
|
| 440 |
+
return image
|
| 441 |
+
|
| 442 |
+
|
| 443 |
+
def crop_image(
|
| 444 |
+
image: np.ndarray | PIL.Image.Image,
|
| 445 |
+
start_height: int,
|
| 446 |
+
start_width: int,
|
| 447 |
+
target_height: int,
|
| 448 |
+
target_width: int,
|
| 449 |
+
) -> np.ndarray | PIL.Image.Image:
|
| 450 |
+
np_image = assert_np_hwc_or_hw_image(image)
|
| 451 |
+
(height, width) = hw_from_image(image)
|
| 452 |
+
assert target_width <= width, f"Can't crop image of width {width} to {target_width}"
|
| 453 |
+
assert target_height <= height, f"Can't crop image of width {height} to {target_height}"
|
| 454 |
+
(start_height, start_width) = (round(start_height), round(start_width))
|
| 455 |
+
(target_height, target_width) = (round(target_height), round(target_width))
|
| 456 |
+
np_image = np_image[
|
| 457 |
+
start_height : start_height + target_height,
|
| 458 |
+
start_width : start_width + target_width,
|
| 459 |
+
...,
|
| 460 |
+
]
|
| 461 |
+
image = PIL.Image.fromarray(np_image) if isinstance(image, PIL.Image.Image) else np_image
|
| 462 |
+
return image
|
| 463 |
+
|
| 464 |
+
|
| 465 |
+
def crop_image_center(
|
| 466 |
+
image: np.ndarray | PIL.Image.Image, target_size: dict[str, int]
|
| 467 |
+
) -> np.ndarray | PIL.Image.Image:
|
| 468 |
+
np_image = assert_np_hwc_or_hw_image(image)
|
| 469 |
+
(height, width) = np_image.shape[:2]
|
| 470 |
+
(target_height, target_width) = (target_size["height"], target_size["width"])
|
| 471 |
+
assert target_width <= width, f"Can't crop image of width {width} to {target_width}"
|
| 472 |
+
assert target_height <= height, f"Can't crop image of width {height} to {target_height}"
|
| 473 |
+
top = (height - target_height) // 2
|
| 474 |
+
left = (width - target_width) // 2
|
| 475 |
+
np_image = crop_image(np_image, top, left, target_height, target_width)
|
| 476 |
+
image = PIL.Image.fromarray(np_image) if isinstance(image, PIL.Image.Image) else np_image
|
| 477 |
+
return image
|
| 478 |
+
|
| 479 |
+
|
| 480 |
+
def crop_image_to_ratio(
|
| 481 |
+
image: PIL.Image.Image | np.ndarray, target_wh_ratio: float
|
| 482 |
+
) -> PIL.Image.Image | np.ndarray:
|
| 483 |
+
"""Pad image to a target aspect ratio."""
|
| 484 |
+
(height, width) = hw_from_image(image)
|
| 485 |
+
wh_ratio = width / height
|
| 486 |
+
if target_wh_ratio >= wh_ratio:
|
| 487 |
+
crop_size = {"width": width, "height": round(width / target_wh_ratio)}
|
| 488 |
+
else:
|
| 489 |
+
crop_size = {"width": round(height * target_wh_ratio), "height": height}
|
| 490 |
+
image = crop_image_center(image, target_size=crop_size)
|
| 491 |
+
return image
|
| 492 |
+
|
| 493 |
+
|
| 494 |
+
def crop_and_pad_image_to_ratio(
|
| 495 |
+
image: PIL.Image.Image | np.ndarray,
|
| 496 |
+
target_wh_ratio: float,
|
| 497 |
+
mode: ResizeMode | str,
|
| 498 |
+
pad_value: tuple[int, int, int] | tuple[float, float, float] | int | float = 0,
|
| 499 |
+
) -> PIL.Image.Image | np.ndarray:
|
| 500 |
+
"""
|
| 501 |
+
Crop and pad an image to a target size depending on the mode.
|
| 502 |
+
It's expected that the source image and target size have different aspect ratios.
|
| 503 |
+
|
| 504 |
+
Args:
|
| 505 |
+
image: The image to crop and pad.
|
| 506 |
+
target_size: The target size to crop and pad the image to.
|
| 507 |
+
mode: The mode to use for cropping and padding.
|
| 508 |
+
"""
|
| 509 |
+
(height, width) = hw_from_image(image)
|
| 510 |
+
wh_ratio = width / height
|
| 511 |
+
if np.isclose(wh_ratio, target_wh_ratio, rtol=0.01, atol=0.0001):
|
| 512 |
+
return image
|
| 513 |
+
if mode == ResizeMode.SMART:
|
| 514 |
+
aspect_ratio = max(width, height) / min(width, height)
|
| 515 |
+
target_ratio = max(target_wh_ratio, 1 / target_wh_ratio)
|
| 516 |
+
if aspect_ratio == 1:
|
| 517 |
+
if target_ratio >= 4 / 3 - 0.01:
|
| 518 |
+
crop_wh_ratio = 4 / 3 if target_wh_ratio >= 1.0 else 3 / 4
|
| 519 |
+
image = crop_image_to_ratio(image, crop_wh_ratio)
|
| 520 |
+
else:
|
| 521 |
+
pass
|
| 522 |
+
elif aspect_ratio <= 4 / 3 + 0.01:
|
| 523 |
+
if wh_ratio >= 1.0 != (target_wh_ratio >= 1.0):
|
| 524 |
+
image = crop_image_to_ratio(image, 1.0)
|
| 525 |
+
elif wh_ratio >= 1.0 != (target_wh_ratio >= 1.0):
|
| 526 |
+
image = crop_image_to_ratio(image, 1.0)
|
| 527 |
+
elif target_ratio >= 4 / 3 + 0.01:
|
| 528 |
+
pass
|
| 529 |
+
else:
|
| 530 |
+
crop_wh_ratio = 4 / 3 if target_wh_ratio >= 1.0 else 3 / 4
|
| 531 |
+
image = crop_image_to_ratio(image, crop_wh_ratio)
|
| 532 |
+
image = pad_image_to_ratio(image, target_wh_ratio, pad_value=pad_value)
|
| 533 |
+
elif mode == ResizeMode.PAD:
|
| 534 |
+
image = pad_image_to_ratio(image, target_wh_ratio, pad_value=pad_value)
|
| 535 |
+
elif mode == ResizeMode.CROP:
|
| 536 |
+
image = crop_image_to_ratio(image, target_wh_ratio)
|
| 537 |
+
else:
|
| 538 |
+
raise ValueError(f"Mode {mode} not supported")
|
| 539 |
+
return image
|
| 540 |
+
|
| 541 |
+
|
| 542 |
+
def is_single_channel_image(image: np.ndarray | PIL.Image.Image) -> bool:
|
| 543 |
+
if isinstance(image, PIL.Image.Image):
|
| 544 |
+
return image.mode in [
|
| 545 |
+
"1",
|
| 546 |
+
"L",
|
| 547 |
+
"LA",
|
| 548 |
+
"La",
|
| 549 |
+
"P",
|
| 550 |
+
"PA",
|
| 551 |
+
"F",
|
| 552 |
+
"I",
|
| 553 |
+
"I;16",
|
| 554 |
+
"I;16L",
|
| 555 |
+
"I;16B",
|
| 556 |
+
"I;16N",
|
| 557 |
+
]
|
| 558 |
+
if isinstance(image, np.ndarray):
|
| 559 |
+
return image.ndim == 2 or image.ndim == 3 and image.shape[2] == 1
|
| 560 |
+
raise ValueError(f"Unsupported image type: {type(image)}")
|
| 561 |
+
|
| 562 |
+
|
| 563 |
+
def is_binary_mask(image: np.ndarray | PIL.Image.Image) -> bool:
|
| 564 |
+
image = np.asarray(image)
|
| 565 |
+
return image.dtype in [np.uint8, np.bool_] and np.max(image) == 1
|
| 566 |
+
|
| 567 |
+
|
| 568 |
+
def resize_image(
|
| 569 |
+
image: PIL.Image.Image | np.ndarray,
|
| 570 |
+
target_size: dict[str, int],
|
| 571 |
+
mode: ResizeMode | str,
|
| 572 |
+
resample: PIL.Image.Resampling | str = "auto",
|
| 573 |
+
pad_value: tuple[int, int, int] | tuple[float, float, float] | int | float = 0,
|
| 574 |
+
) -> PIL.Image.Image | np.ndarray:
|
| 575 |
+
(target_width, target_height) = (target_size["width"], target_size["height"])
|
| 576 |
+
(height, width) = hw_from_image(image)
|
| 577 |
+
if height == target_height and width == target_width:
|
| 578 |
+
return image
|
| 579 |
+
if resample == "auto":
|
| 580 |
+
if is_single_channel_image(image):
|
| 581 |
+
resample = PIL.Image.Resampling.BILINEAR
|
| 582 |
+
else:
|
| 583 |
+
resample = PIL.Image.Resampling.LANCZOS
|
| 584 |
+
else:
|
| 585 |
+
assert isinstance(resample, PIL.Image.Resampling), resample
|
| 586 |
+
if is_single_channel_image(image) and resample not in [
|
| 587 |
+
PIL.Image.Resampling.BILINEAR,
|
| 588 |
+
PIL.Image.Resampling.BICUBIC,
|
| 589 |
+
]:
|
| 590 |
+
raise ValueError(
|
| 591 |
+
f"Single channel images must be resized with bilinear or bicubic, but got {resample}"
|
| 592 |
+
)
|
| 593 |
+
if is_bin_mask := is_binary_mask(image):
|
| 594 |
+
image = np.asarray(image).astype(np.uint8) * 255
|
| 595 |
+
if mode == ResizeMode.SMART:
|
| 596 |
+
image = crop_and_pad_image_to_ratio(
|
| 597 |
+
image,
|
| 598 |
+
target_wh_ratio=target_width / target_height,
|
| 599 |
+
mode=mode,
|
| 600 |
+
pad_value=pad_value,
|
| 601 |
+
)
|
| 602 |
+
pil_image = PIL.Image.fromarray(image) if isinstance(image, np.ndarray) else image
|
| 603 |
+
if mode in [ResizeMode.NAIVE, ResizeMode.SMART]:
|
| 604 |
+
pil_image = pil_image.resize((target_width, target_height), resample=resample)
|
| 605 |
+
else:
|
| 606 |
+
raise NotImplementedError(f"Mode {mode} not supported")
|
| 607 |
+
image = np.asarray(pil_image) if isinstance(image, np.ndarray) else pil_image
|
| 608 |
+
if is_bin_mask:
|
| 609 |
+
image = image.astype(np.uint8) > 127
|
| 610 |
+
return image
|
| 611 |
+
|
| 612 |
+
|
| 613 |
+
def is_global_norm(
|
| 614 |
+
norm: Normalization | Dict[str, torch.Tensor | np.ndarray | tuple | list],
|
| 615 |
+
) -> bool:
|
| 616 |
+
"""Return true if norm is NONE or global for all datasets"""
|
| 617 |
+
return norm == Normalization.NONE or isinstance(norm, collections.abc.Mapping)
|
| 618 |
+
|
| 619 |
+
|
| 620 |
+
def is_mean_norm(
|
| 621 |
+
norm: Normalization | Dict[str, torch.Tensor | np.ndarray | tuple | list],
|
| 622 |
+
) -> bool:
|
| 623 |
+
"""Return true if norm is based on mean and std"""
|
| 624 |
+
return (
|
| 625 |
+
norm == Normalization.MEAN
|
| 626 |
+
or isinstance(norm, collections.abc.Mapping)
|
| 627 |
+
and set(norm.keys()) == {"mean", "std"}
|
| 628 |
+
)
|
| 629 |
+
|
| 630 |
+
|
| 631 |
+
def _broadcast_shapes(
|
| 632 |
+
value: torch.Tensor, low: torch.Tensor, high: torch.Tensor
|
| 633 |
+
) -> Tuple[torch.Tensor, torch.Tensor]:
|
| 634 |
+
"""
|
| 635 |
+
Broadcast shapes for normalization:
|
| 636 |
+
Args:
|
| 637 |
+
value: torch.Tensor of shape [..., num_components]. The entire shape might be:
|
| 638 |
+
- [num_components]: `value` has no batch dimension
|
| 639 |
+
- [num_datasets, num_components]: `value` contains entries *aligned* with the dataset bounds
|
| 640 |
+
contained in `low` and `high`
|
| 641 |
+
- [num_datasets, ..., num_components]: `value` contains entries *aligned* with the dataset bounds
|
| 642 |
+
contained in `low` and `high`
|
| 643 |
+
- [..., num_components]: `value` contains multiple dimensions. In this case, `low` and `high`
|
| 644 |
+
must be for a single dataset, i.e. `num_datasets = 1`
|
| 645 |
+
|
| 646 |
+
low: torch.Tensor, shape [num_datasets, num_components], where `num_datasets` can be 1 when `low`
|
| 647 |
+
contains normalization bounds for a single dataset
|
| 648 |
+
high: torch.Tensor, shape [num_datasets, num_components], where `num_datasets` can be 1 when `high`
|
| 649 |
+
contains normalization bounds for a single dataset
|
| 650 |
+
Returns:
|
| 651 |
+
Tuple of torch.Tensors (low, high), where `low` and `high` have the same number of dimensions as `value`
|
| 652 |
+
"""
|
| 653 |
+
assert low.ndim == high.ndim == 2, f"{low.shape} != {high.shape} or ndim != 2"
|
| 654 |
+
assert value.shape[-1] == low.shape[-1] == high.shape[-1], f"{value.shape} != {low.shape} / {high.shape}"
|
| 655 |
+
if value.ndim == low.ndim == high.ndim:
|
| 656 |
+
return low, high
|
| 657 |
+
if value.ndim < low.ndim:
|
| 658 |
+
assert low.ndim == high.ndim == 2, f"{low.shape}, {high.shape}"
|
| 659 |
+
assert low.shape[0] == high.shape[0] == 1, f"{low.shape}, {high.shape}"
|
| 660 |
+
(low, high) = (low.view(-1), high.view(-1))
|
| 661 |
+
return low, high
|
| 662 |
+
if low.shape[0] == high.shape[0] == 1:
|
| 663 |
+
low = expand_dims(low.view(-1), ndim=value.ndim, order=[-1, 1])
|
| 664 |
+
high = expand_dims(high.view(-1), ndim=value.ndim, order=[-1, 1])
|
| 665 |
+
else:
|
| 666 |
+
assert value.shape[0] == low.shape[0] == high.shape[0], f"{value.shape} != {low.shape} / {high.shape}"
|
| 667 |
+
low = expand_dims(low, ndim=value.ndim, order=[1, -1, 1])
|
| 668 |
+
high = expand_dims(high, ndim=value.ndim, order=[1, -1, 1])
|
| 669 |
+
return low, high
|
| 670 |
+
|
| 671 |
+
|
| 672 |
+
def unnormalize_by_moments(value: torch.Tensor, mean: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
|
| 673 |
+
(mean, std) = _broadcast_shapes(value, mean, std)
|
| 674 |
+
(mean, std) = (mean.to(device=value.device), std.to(device=value.device))
|
| 675 |
+
return value * (std + 1e-08) + mean
|
| 676 |
+
|
| 677 |
+
|
| 678 |
+
def unnormalize_by_bounds(value: torch.Tensor, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
|
| 679 |
+
(low, high) = _broadcast_shapes(value, low, high)
|
| 680 |
+
(low, high) = (low.to(device=value.device), high.to(device=value.device))
|
| 681 |
+
return 0.5 * (value + 1) * (high - low) + low
|
| 682 |
+
|
| 683 |
+
|
| 684 |
+
def normalize_gripper_by_bounds(
|
| 685 |
+
value: torch.Tensor, low: torch.Tensor, high: torch.Tensor, binary: bool = True
|
| 686 |
+
) -> torch.Tensor:
|
| 687 |
+
"""
|
| 688 |
+
If binary, normalize to [0, 1], otherwise normalize to [-1, 1]
|
| 689 |
+
"""
|
| 690 |
+
(low, high) = _broadcast_shapes(value, low, high)
|
| 691 |
+
(low, high) = (low.to(device=value.device), high.to(device=value.device))
|
| 692 |
+
if binary:
|
| 693 |
+
return torch.clamp((value - low) / torch.clamp(high - low, min=1e-08), min=0.0, max=1.0)
|
| 694 |
+
return torch.clamp(2 * (value - low) / torch.clamp(high - low, min=1e-08) - 1, min=-1.0, max=1.0)
|
| 695 |
+
|
| 696 |
+
|
| 697 |
+
def normalize_by_moments(value: torch.Tensor, mean: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
|
| 698 |
+
(mean, std) = _broadcast_shapes(value, mean, std)
|
| 699 |
+
(mean, std) = (mean.to(device=value.device), std.to(device=value.device))
|
| 700 |
+
return (value - mean) / (std + 1e-08)
|
| 701 |
+
|
| 702 |
+
|
| 703 |
+
def normalize_by_bounds(value: torch.Tensor, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
|
| 704 |
+
(low, high) = _broadcast_shapes(value, low, high)
|
| 705 |
+
(low, high) = (low.to(device=value.device), high.to(device=value.device))
|
| 706 |
+
return torch.clamp(2 * (value - low) / torch.clamp(high - low, min=1e-08) - 1, min=-1.0, max=1.0)
|
| 707 |
+
|
| 708 |
+
|
| 709 |
+
def invert_gripper(gripper: np.ndarray, low: float, high: float) -> np.ndarray:
|
| 710 |
+
if low < 0.0:
|
| 711 |
+
return np.clip(-gripper, low, high)
|
| 712 |
+
return high - np.clip(gripper, low, high)
|
| 713 |
+
|
| 714 |
+
|
| 715 |
+
GRIPPER_BOUNDS = {
|
| 716 |
+
"bridge": (0.0, 1.0),
|
| 717 |
+
"bridge_orig": (0.0, 1.0),
|
| 718 |
+
"droid": (0.0, 1.0),
|
| 719 |
+
"roboset": (0.0, 1.0),
|
| 720 |
+
}
|
| 721 |
+
|
| 722 |
+
|
| 723 |
+
def preprocess_gripper_observation(
|
| 724 |
+
gripper: np.ndarray, dataset_name: str | np.ndarray, binary: bool = True
|
| 725 |
+
) -> np.ndarray:
|
| 726 |
+
"""
|
| 727 |
+
Preprocess gripper observation depending on dataset. Input is the raw gripper observation from the dataset
|
| 728 |
+
or from the robot and output is normalized continuous value.
|
| 729 |
+
- if `binary`, output is in [0, 1], with 0 = closed and 1 = open.
|
| 730 |
+
- otherwise, output is in [-1, 1], with -1 = closed and 1 = open.
|
| 731 |
+
|
| 732 |
+
Dataset-specific gripper observations:
|
| 733 |
+
bridge: continuous; ~[0=closed; 1=open]
|
| 734 |
+
bridge_orig: continuous; ~[0=closed; 1=open]
|
| 735 |
+
droid: continuous; [0=open, 1=closed]
|
| 736 |
+
roboset: continuous; [0=open, 1=closed]
|
| 737 |
+
"""
|
| 738 |
+
if isinstance(dataset_name, np.ndarray):
|
| 739 |
+
assert np.unique(dataset_name).size == 1, dataset_name
|
| 740 |
+
dataset_name = str(dataset_name[0])
|
| 741 |
+
if dataset_name in [
|
| 742 |
+
"droid",
|
| 743 |
+
"roboset",
|
| 744 |
+
]:
|
| 745 |
+
(low, high) = GRIPPER_BOUNDS[dataset_name]
|
| 746 |
+
gripper = normalize_gripper_by_bounds(
|
| 747 |
+
torch.from_numpy(invert_gripper(gripper, low=low, high=high)),
|
| 748 |
+
low=torch.full(gripper.shape, GRIPPER_BOUNDS[dataset_name][0], dtype=torch.float32),
|
| 749 |
+
high=torch.full(gripper.shape, GRIPPER_BOUNDS[dataset_name][1], dtype=torch.float32),
|
| 750 |
+
binary=binary,
|
| 751 |
+
).numpy()
|
| 752 |
+
elif dataset_name in [
|
| 753 |
+
"bridge",
|
| 754 |
+
"bridge_orig",
|
| 755 |
+
]:
|
| 756 |
+
(low, high) = GRIPPER_BOUNDS[dataset_name]
|
| 757 |
+
gripper = normalize_gripper_by_bounds(
|
| 758 |
+
torch.from_numpy(gripper),
|
| 759 |
+
low=torch.full(gripper.shape, low, dtype=torch.float32),
|
| 760 |
+
high=torch.full(gripper.shape, high, dtype=torch.float32),
|
| 761 |
+
binary=binary,
|
| 762 |
+
).numpy()
|
| 763 |
+
else:
|
| 764 |
+
raise NotImplementedError(f"Unknown dataset: {dataset_name}")
|
| 765 |
+
return gripper
|
| 766 |
+
|
| 767 |
+
|
| 768 |
+
def rotation_norm_bounds(
|
| 769 |
+
rotation_norm: Normalization,
|
| 770 |
+
rotation_format: RotationFormat,
|
| 771 |
+
stats: Dict[str, Dict[str, Dict[str, List[float]]]],
|
| 772 |
+
dataset_names: List[str],
|
| 773 |
+
) -> Dict[str, Dict[str, torch.Tensor]]:
|
| 774 |
+
if rotation_format == RotationFormat.EULER and rotation_norm != Normalization.NONE:
|
| 775 |
+
if rotation_norm == Normalization.BOUNDS:
|
| 776 |
+
results = {
|
| 777 |
+
dataset_name: {
|
| 778 |
+
"low": torch.tensor(dataset_stats["euler"]["min"]),
|
| 779 |
+
"high": torch.tensor(dataset_stats["euler"]["max"]),
|
| 780 |
+
}
|
| 781 |
+
for (dataset_name, dataset_stats) in stats.items()
|
| 782 |
+
}
|
| 783 |
+
elif rotation_norm == Normalization.BOUNDS_Q99:
|
| 784 |
+
results = {
|
| 785 |
+
dataset_name: {
|
| 786 |
+
"low": torch.tensor(dataset_stats["euler"]["q01"]),
|
| 787 |
+
"high": torch.tensor(dataset_stats["euler"]["q99"]),
|
| 788 |
+
}
|
| 789 |
+
for (dataset_name, dataset_stats) in stats.items()
|
| 790 |
+
}
|
| 791 |
+
else:
|
| 792 |
+
raise NotImplementedError(f"Normalization type {rotation_norm} not yet implemented")
|
| 793 |
+
else:
|
| 794 |
+
assert rotation_norm == Normalization.NONE, rotation_norm
|
| 795 |
+
if rotation_format == RotationFormat.EULER:
|
| 796 |
+
rotation_size = 3
|
| 797 |
+
elif rotation_format == RotationFormat.QUATERNION:
|
| 798 |
+
rotation_size = 4
|
| 799 |
+
else:
|
| 800 |
+
rotation_size = 9
|
| 801 |
+
results = {
|
| 802 |
+
dataset_name: {
|
| 803 |
+
"low": -1 * torch.ones(rotation_size, dtype=torch.float32),
|
| 804 |
+
"high": 1 * torch.ones(rotation_size, dtype=torch.float32),
|
| 805 |
+
}
|
| 806 |
+
for dataset_name in dataset_names
|
| 807 |
+
}
|
| 808 |
+
return results
|
| 809 |
+
|
| 810 |
+
|
| 811 |
+
def translation_norm_bounds(
|
| 812 |
+
translation_norm: Normalization | tuple,
|
| 813 |
+
stats: Dict[str, Dict[str, Dict[str, List[float]]]],
|
| 814 |
+
dataset_names: List[str],
|
| 815 |
+
) -> Dict[str, Dict[str, torch.Tensor]]:
|
| 816 |
+
if isinstance(translation_norm, (Normalization, str)) and translation_norm != Normalization.NONE:
|
| 817 |
+
if translation_norm == Normalization.BOUNDS:
|
| 818 |
+
results = {
|
| 819 |
+
dataset_name: {
|
| 820 |
+
"low": torch.tensor(dataset_stats["translation"]["min"]),
|
| 821 |
+
"high": torch.tensor(dataset_stats["translation"]["max"]),
|
| 822 |
+
}
|
| 823 |
+
for (dataset_name, dataset_stats) in stats.items()
|
| 824 |
+
}
|
| 825 |
+
elif translation_norm == Normalization.BOUNDS_Q99:
|
| 826 |
+
results = {
|
| 827 |
+
dataset_name: {
|
| 828 |
+
"low": torch.tensor(dataset_stats["translation"]["q01"]),
|
| 829 |
+
"high": torch.tensor(dataset_stats["translation"]["q99"]),
|
| 830 |
+
}
|
| 831 |
+
for (dataset_name, dataset_stats) in stats.items()
|
| 832 |
+
}
|
| 833 |
+
elif translation_norm == Normalization.MEAN:
|
| 834 |
+
results = {
|
| 835 |
+
dataset_name: {
|
| 836 |
+
"mean": torch.tensor(dataset_stats["translation"]["mean"]),
|
| 837 |
+
"std": torch.tensor(dataset_stats["translation"]["std"]),
|
| 838 |
+
}
|
| 839 |
+
for (dataset_name, dataset_stats) in stats.items()
|
| 840 |
+
}
|
| 841 |
+
else:
|
| 842 |
+
raise NotImplementedError(f"Normalization type {translation_norm} not yet implemented")
|
| 843 |
+
elif isinstance(translation_norm, Normalization) and translation_norm == Normalization.NONE:
|
| 844 |
+
results = {
|
| 845 |
+
dataset_name: {
|
| 846 |
+
"low": -1 * torch.ones(3, dtype=torch.float32),
|
| 847 |
+
"high": 1 * torch.ones(3, dtype=torch.float32),
|
| 848 |
+
}
|
| 849 |
+
for dataset_name in dataset_names
|
| 850 |
+
}
|
| 851 |
+
else:
|
| 852 |
+
assert isinstance(translation_norm, collections.abc.Mapping), type(translation_norm)
|
| 853 |
+
assert all((len(value) == 3 for value in translation_norm.values())), translation_norm
|
| 854 |
+
assert set(translation_norm.keys()) in (
|
| 855 |
+
{"low", "high"},
|
| 856 |
+
{"mean", "std"},
|
| 857 |
+
), translation_norm
|
| 858 |
+
results = {
|
| 859 |
+
dataset_name: {
|
| 860 |
+
key: torch.tensor(value, dtype=torch.float32) for (key, value) in translation_norm.items()
|
| 861 |
+
}
|
| 862 |
+
for dataset_name in dataset_names
|
| 863 |
+
}
|
| 864 |
+
return results
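# Editor's note (illustrative, values hypothetical): for Normalization.BOUNDS_Q99 both
# rotation_norm_bounds and translation_norm_bounds return one {"low", "high"} entry per
# dataset, e.g.
#   stats = {"bridge": {"translation": {"q01": [0.17, -0.16, -0.05],
#                                       "q99": [0.46, 0.24, 0.19]}}}
#   translation_norm_bounds(Normalization.BOUNDS_Q99, stats, ["bridge"])
#   # -> {"bridge": {"low": tensor([0.17, -0.16, -0.05]),
#   #                "high": tensor([0.46, 0.24, 0.19])}}
# With Normalization.NONE (or a literal low/high or mean/std mapping) the same structure
# is filled with [-1, 1] bounds (or the provided values) for every dataset name.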
|
| 865 |
+
|
| 866 |
+
|
| 867 |
+
VLAMProcessorConfigT = TypeVar("VLAMProcessorConfigT")
|
| 868 |
+
|
| 869 |
+
|
| 870 |
+
class VLAMProcessor(Configurable):
|
| 871 |
+
def __init__(self, config: VLAMProcessorConfigT, vlm_processor: VLMProcessor):
|
| 872 |
+
super().__init__(config)
|
| 873 |
+
self.vlm_processor = vlm_processor
|
| 874 |
+
self.control_tokenizer = EmptyTokenizer(
|
| 875 |
+
config=self.config.control_tokenizer_config, tokenizer=self.tokenizer
|
| 876 |
+
)
|
| 877 |
+
self.norm_bounds: Dict[str, Dict[str, Dict[str, torch.Tensor]]] = {
|
| 878 |
+
"obs_translation": self.obs_translation_norm_bounds,
|
| 879 |
+
"obs_rotation": self.obs_rotation_norm_bounds,
|
| 880 |
+
"translation": self.translation_norm_bounds,
|
| 881 |
+
"rotation": self.rotation_norm_bounds,
|
| 882 |
+
"joints": self.joints_norm_bounds,
|
| 883 |
+
}
|
| 884 |
+
|
| 885 |
+
@property
|
| 886 |
+
def tokenizer(self) -> transformers.PreTrainedTokenizerBase:
|
| 887 |
+
return self.vlm_processor.tokenizer
|
| 888 |
+
|
| 889 |
+
@property
|
| 890 |
+
def image_sizes(self) -> Dict[str, ImageSizeConfig]:
|
| 891 |
+
return self.vlm_processor.image_sizes
|
| 892 |
+
|
| 893 |
+
@property
|
| 894 |
+
def camera_names(self) -> List[str]:
|
| 895 |
+
return list(self.vlm_processor.image_sizes.keys())
|
| 896 |
+
|
| 897 |
+
@property
|
| 898 |
+
def control_io_config(self) -> ControlDataIOConfig:
|
| 899 |
+
return self.config.control_io_config
|
| 900 |
+
|
| 901 |
+
@cached_property
|
| 902 |
+
def rotation_components(self) -> int:
|
| 903 |
+
if self.config.rotation_format == RotationFormat.EULER:
|
| 904 |
+
return 3
|
| 905 |
+
if self.config.rotation_format == RotationFormat.QUATERNION:
|
| 906 |
+
return 4
|
| 907 |
+
if self.config.rotation_format == RotationFormat.ROTMAT:
|
| 908 |
+
return 9
|
| 909 |
+
raise NotImplementedError(self.config.rotation_format)
|
| 910 |
+
|
| 911 |
+
@abstractmethod
|
| 912 |
+
def policy_control_plan_from_model_target(
|
| 913 |
+
self, target: RoboticsTarget, dataset_name: np.ndarray
|
| 914 |
+
) -> RoboticsControlPlan:
|
| 915 |
+
pass
|
| 916 |
+
|
| 917 |
+
@abstractmethod
|
| 918 |
+
def policy_control_plan_from_model_output(
|
| 919 |
+
self,
|
| 920 |
+
model_output: RoboticsOutput,
|
| 921 |
+
dataset_name: np.ndarray,
|
| 922 |
+
valid_mask: torch.Tensor,
|
| 923 |
+
) -> RoboticsControlPlan:
|
| 924 |
+
pass
|
| 925 |
+
|
| 926 |
+
def resize_image(
|
| 927 |
+
self, camera_name: str, image: PIL.Image.Image | np.ndarray
|
| 928 |
+
) -> PIL.Image.Image | np.ndarray:
|
| 929 |
+
return resize_image(
|
| 930 |
+
image,
|
| 931 |
+
target_size={
|
| 932 |
+
"width": self.image_sizes[camera_name].width,
|
| 933 |
+
"height": self.image_sizes[camera_name].height,
|
| 934 |
+
},
|
| 935 |
+
mode=self.config.image_resize,
|
| 936 |
+
resample=PIL.Image.Resampling.LANCZOS,
|
| 937 |
+
)
|
| 938 |
+
|
| 939 |
+
def preprocess_inputs(
|
| 940 |
+
self,
|
| 941 |
+
chat: List[str],
|
| 942 |
+
images: Dict[str, PIL.Image.Image | List[PIL.Image.Image]],
|
| 943 |
+
ee_pose_translation: np.ndarray,
|
| 944 |
+
ee_pose_rotation: np.ndarray,
|
| 945 |
+
gripper: np.ndarray,
|
| 946 |
+
joints: np.ndarray,
|
| 947 |
+
dataset_name: np.ndarray,
|
| 948 |
+
inference_mode: bool,
|
| 949 |
+
control_target: Optional[RoboticsTarget] = None,
|
| 950 |
+
) -> Dict[str, torch.Tensor | Dict[str, torch.Tensor]]:
|
| 951 |
+
"""
|
| 952 |
+
Preprocess the inputs for a single example
|
| 953 |
+
Args:
|
| 954 |
+
chat: List of conversation turns; the first user turn carries the language instruction
|
| 955 |
+
images: History of input images with increasing timestamps
|
| 956 |
+
ee_pose_translation: np.ndarray, shape [..., num_past_scalars, 3]
|
| 957 |
+
ee_pose_rotation: np.ndarray, shape [..., num_past_scalars, 3 | 4 | 9]
|
| 958 |
+
joints: np.ndarray, shape [..., num_past_scalars, <= 7]
|
| 959 |
+
dataset_name: 1D np.ndarray
|
| 960 |
+
inference_mode: If True, prepare the input for inference (e.g. don't include target
|
| 961 |
+
tokens in the input if relevant). If control_target is available, it should
|
| 962 |
+
still be preprocessed for test dataset comparison
|
| 963 |
+
control_target: RoboticsTarget, each component of shape
|
| 964 |
+
[..., num_control_steps, num_control_components]. Provided only when available, usually
|
| 965 |
+
during training and dataset test
|
| 966 |
+
Returns:
|
| 967 |
+
Dict containing torch.Tensor with inputs
|
| 968 |
+
"""
|
| 969 |
+
del control_target
|
| 970 |
+
del inference_mode
|
| 971 |
+
inputs = self.vlm_processor.preprocess_inputs(chat=chat, images=images)
|
| 972 |
+
images: Dict[str, torch.Tensor] = inputs["images"]
|
| 973 |
+
input_ids: torch.Tensor = inputs["input_ids"][..., : self.tokenizer.model_max_length]
|
| 974 |
+
target_text_tokens_ids: torch.Tensor = inputs["target_ids"][..., : self.tokenizer.model_max_length]
|
| 975 |
+
attn_mask = torch.ones(input_ids.shape, dtype=torch.bool)
|
| 976 |
+
ee_pose_translation = torch.tensor(ee_pose_translation, dtype=torch.float32)
|
| 977 |
+
ee_pose_rotation = torch.tensor(ee_pose_rotation, dtype=torch.float32)
|
| 978 |
+
ee_pose_rotation = convert_rotation(ee_pose_rotation, self.config.rotation_format, autonorm=True)
|
| 979 |
+
gripper = preprocess_gripper_observation(gripper, dataset_name)
|
| 980 |
+
gripper = torch.tensor(gripper, dtype=torch.float32)
|
| 981 |
+
ee_pose_translation = self.normalize(
|
| 982 |
+
ee_pose_translation, dataset_name=dataset_name, key="obs_translation"
|
| 983 |
+
)
|
| 984 |
+
ee_pose_rotation = self.normalize(ee_pose_rotation, dataset_name=dataset_name, key="obs_rotation")
|
| 985 |
+
joints = torch.tensor(joints, dtype=torch.float32)
|
| 986 |
+
if joints.shape[-1] < 7:
|
| 987 |
+
missing_size = 7 - joints.shape[-1]
|
| 988 |
+
joints = torch.cat([joints, torch.zeros([*joints.shape[:-1], missing_size])], dim=-1)
|
| 989 |
+
joints = self.normalize(joints, dataset_name=dataset_name, key="joints")
|
| 990 |
+
outputs = {
|
| 991 |
+
"images": images,
|
| 992 |
+
"input_ids": input_ids,
|
| 993 |
+
"target_text_tokens_ids": target_text_tokens_ids,
|
| 994 |
+
"attn_mask": attn_mask,
|
| 995 |
+
"ee_pose_translation": ee_pose_translation,
|
| 996 |
+
"ee_pose_rotation": ee_pose_rotation,
|
| 997 |
+
"gripper": gripper,
|
| 998 |
+
"joints": joints,
|
| 999 |
+
"control_tokens_ids": None,
|
| 1000 |
+
"target_control_tokens_ids": None,
|
| 1001 |
+
}
|
| 1002 |
+
return outputs
|
| 1003 |
+
|
| 1004 |
+
def create_input(
|
| 1005 |
+
self,
|
| 1006 |
+
chat: List[str],
|
| 1007 |
+
images: Dict[str, List[PIL.Image.Image]],
|
| 1008 |
+
ee_pose_translation: np.ndarray,
|
| 1009 |
+
ee_pose_rotation: np.ndarray,
|
| 1010 |
+
gripper: np.ndarray,
|
| 1011 |
+
joints: np.ndarray,
|
| 1012 |
+
dataset_name: np.ndarray,
|
| 1013 |
+
inference_mode: bool,
|
| 1014 |
+
control_target: Optional[RoboticsTarget] = None,
|
| 1015 |
+
) -> RoboticsInput:
|
| 1016 |
+
inputs = self.preprocess_inputs(
|
| 1017 |
+
chat=chat,
|
| 1018 |
+
images=images,
|
| 1019 |
+
ee_pose_translation=ee_pose_translation,
|
| 1020 |
+
ee_pose_rotation=ee_pose_rotation,
|
| 1021 |
+
gripper=gripper,
|
| 1022 |
+
joints=joints,
|
| 1023 |
+
dataset_name=dataset_name,
|
| 1024 |
+
inference_mode=inference_mode,
|
| 1025 |
+
control_target=control_target,
|
| 1026 |
+
)
|
| 1027 |
+
inputs.pop("target_text_tokens_ids")
|
| 1028 |
+
inputs.pop("target_control_tokens_ids")
|
| 1029 |
+
return RoboticsInput(**inputs)
|
| 1030 |
+
|
| 1031 |
+
def normalize(self, value: torch.Tensor, dataset_name: np.ndarray, key: str) -> torch.Tensor:
|
| 1032 |
+
if is_mean_norm(getattr(self.config, f"{key}_norm")):
|
| 1033 |
+
(mean, std) = self._norm_bounds_from_dataset_name(dataset_name, component_key=key)
|
| 1034 |
+
output = normalize_by_moments(value, mean=mean, std=std)
|
| 1035 |
+
else:
|
| 1036 |
+
(low, high) = self._norm_bounds_from_dataset_name(dataset_name, component_key=key)
|
| 1037 |
+
output = normalize_by_bounds(value, low=low, high=high)
|
| 1038 |
+
return output
|
| 1039 |
+
|
| 1040 |
+
def unnormalize(self, value: torch.Tensor, dataset_name: np.ndarray, key: str) -> torch.Tensor:
|
| 1041 |
+
if is_mean_norm(getattr(self.config, f"{key}_norm")):
|
| 1042 |
+
(mean, std) = self._norm_bounds_from_dataset_name(dataset_name, component_key=key)
|
| 1043 |
+
output = unnormalize_by_moments(value, mean=mean, std=std)
|
| 1044 |
+
else:
|
| 1045 |
+
(low, high) = self._norm_bounds_from_dataset_name(dataset_name, component_key=key)
|
| 1046 |
+
output = unnormalize_by_bounds(value, low=low, high=high)
|
| 1047 |
+
return output
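# Editor's note: normalize()/unnormalize() dispatch on the configured Normalization for
# each component key ("obs_translation", "obs_rotation", "translation", "rotation",
# "joints"): mean/std statistics go through normalize_by_moments / unnormalize_by_moments,
# all other modes through normalize_by_bounds / unnormalize_by_bounds (helpers defined
# elsewhere in this repo). unnormalize() is the inverse mapping and is what turns model
# outputs back into metric controls in the processors below.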
|
| 1048 |
+
|
| 1049 |
+
def _norm_bounds_from_dataset_name(
|
| 1050 |
+
self, dataset_name: np.ndarray, component_key: str
|
| 1051 |
+
) -> Tuple[torch.Tensor, torch.Tensor]:
|
| 1052 |
+
"""
|
| 1053 |
+
Create an array of normalization bounds corresponding to dataset names
|
| 1054 |
+
Args:
|
| 1055 |
+
dataset_name: Array of shape [B] of dataset names for which to fetch the low and high
|
| 1056 |
+
normalization bounds. Note the values can be repeating
|
| 1057 |
+
component_key: str. One of 'obs_translation', 'obs_rotation', 'translation', 'rotation', 'joints'. Indicates for which component to
|
| 1058 |
+
compute the normalization bounds
|
| 1059 |
+
Returns:
|
| 1060 |
+
Tuple of low and high bounds or mean and std, each of shape [B, -1]
|
| 1061 |
+
"""
|
| 1062 |
+
norm = getattr(self.config, f"{component_key}_norm")
|
| 1063 |
+
if is_mean_norm(norm):
|
| 1064 |
+
(stats_key_1, stats_key_2) = ("mean", "std")
|
| 1065 |
+
else:
|
| 1066 |
+
(stats_key_1, stats_key_2) = ("low", "high")
|
| 1067 |
+
if component_key == "joints":
|
| 1068 |
+
if not isinstance(norm, collections.abc.Mapping):
|
| 1069 |
+
raise NotImplementedError()
|
| 1070 |
+
stats = {
|
| 1071 |
+
key: torch.from_numpy(np.tile(np.reshape(value, [1, -1]), [len(dataset_name), 1]))
|
| 1072 |
+
for (key, value) in self.joints_norm_bounds["ANY"].items()
|
| 1073 |
+
}
|
| 1074 |
+
return tuple(stats.values())
|
| 1075 |
+
component_size = list(list(self.norm_bounds[component_key].values())[0].values())[0].shape[-1]
|
| 1076 |
+
if self.dataset_names == ["ANY"]:
|
| 1077 |
+
stats_1 = self.norm_bounds[component_key]["ANY"][stats_key_1]
|
| 1078 |
+
stats_2 = self.norm_bounds[component_key]["ANY"][stats_key_2]
|
| 1079 |
+
stats_1 = np.repeat(np.expand_dims(stats_1, axis=0), len(dataset_name), axis=0)
|
| 1080 |
+
stats_2 = np.repeat(np.expand_dims(stats_2, axis=0), len(dataset_name), axis=0)
|
| 1081 |
+
else:
|
| 1082 |
+
(unique_names, _, inverse_indices, _) = np_unique(dataset_name)
|
| 1083 |
+
stats_1 = np.zeros([len(unique_names), component_size], dtype=np.float32)
|
| 1084 |
+
stats_2 = np.zeros([len(unique_names), component_size], dtype=np.float32)
|
| 1085 |
+
for i, ds_name in enumerate(unique_names):
|
| 1086 |
+
stats_1[i] = self.norm_bounds[component_key][ds_name][stats_key_1].numpy()
|
| 1087 |
+
stats_2[i] = self.norm_bounds[component_key][ds_name][stats_key_2].numpy()
|
| 1088 |
+
stats_1 = stats_1[inverse_indices]
|
| 1089 |
+
stats_2 = stats_2[inverse_indices]
|
| 1090 |
+
return torch.from_numpy(stats_1), torch.from_numpy(stats_2)
|
| 1091 |
+
|
| 1092 |
+
@cached_property
|
| 1093 |
+
def obs_rotation_norm_bounds(self) -> Dict[str, Dict[str, torch.Tensor]]:
|
| 1094 |
+
return rotation_norm_bounds(
|
| 1095 |
+
rotation_norm=self.config.obs_rotation_norm,
|
| 1096 |
+
rotation_format=self.config.rotation_format,
|
| 1097 |
+
stats=self._observation_stats,
|
| 1098 |
+
dataset_names=self.dataset_names,
|
| 1099 |
+
)
|
| 1100 |
+
|
| 1101 |
+
@cached_property
|
| 1102 |
+
def obs_translation_norm_bounds(self) -> Dict[str, Dict[str, torch.Tensor]]:
|
| 1103 |
+
return translation_norm_bounds(
|
| 1104 |
+
translation_norm=self.config.obs_translation_norm,
|
| 1105 |
+
stats=self._observation_stats,
|
| 1106 |
+
dataset_names=self.dataset_names,
|
| 1107 |
+
)
|
| 1108 |
+
|
| 1109 |
+
@cached_property
|
| 1110 |
+
def rotation_norm_bounds(self) -> Dict[str, Dict[str, torch.Tensor]]:
|
| 1111 |
+
return rotation_norm_bounds(
|
| 1112 |
+
rotation_norm=self.config.rotation_norm,
|
| 1113 |
+
rotation_format=self.config.rotation_format,
|
| 1114 |
+
stats=self._control_stats,
|
| 1115 |
+
dataset_names=self.dataset_names,
|
| 1116 |
+
)
|
| 1117 |
+
|
| 1118 |
+
@cached_property
|
| 1119 |
+
def translation_norm_bounds(self) -> Dict[str, Dict[str, torch.Tensor]]:
|
| 1120 |
+
return translation_norm_bounds(
|
| 1121 |
+
translation_norm=self.config.translation_norm,
|
| 1122 |
+
stats=self._control_stats,
|
| 1123 |
+
dataset_names=self.dataset_names,
|
| 1124 |
+
)
|
| 1125 |
+
|
| 1126 |
+
@cached_property
|
| 1127 |
+
def joints_norm_bounds(self) -> Dict[str, Dict[str, torch.Tensor]]:
|
| 1128 |
+
"""
|
| 1129 |
+
NOTE:
|
| 1130 |
+
- Joint values across all joints and all datasets vary in the range [-2pi; 2pi]
|
| 1131 |
+
- The effective range of a single joint is in practice one of [-2pi; 0], [-pi; pi], [0; 2pi]
|
| 1132 |
+
- It's possible to shift all ranges to [-pi; pi], but it requires careful handling for each joint
|
| 1133 |
+
"""
|
| 1134 |
+
low = torch.tensor(self.config.joints_norm["low"], dtype=torch.float32)
|
| 1135 |
+
high = torch.tensor(self.config.joints_norm["high"], dtype=torch.float32)
|
| 1136 |
+
results = {"ANY": {"low": low, "high": high}}
|
| 1137 |
+
return results
|
| 1138 |
+
|
| 1139 |
+
@cached_property
|
| 1140 |
+
def _observation_stats(self) -> Dict[str, Dict[str, Dict[str, List[float]]]]:
|
| 1141 |
+
return {
|
| 1142 |
+
"bridge": {
|
| 1143 |
+
"euler": {
|
| 1144 |
+
"max": [3.141592653589793, 1.570796251296997, 3.141204357147217],
|
| 1145 |
+
"mean": [
|
| 1146 |
+
-0.25754162314671525,
|
| 1147 |
+
-0.12370228389510128,
|
| 1148 |
+
0.1620053749182691,
|
| 1149 |
+
],
|
| 1150 |
+
"min": [-3.141592653492551, -1.4832241535186768, -3.14153790473938],
|
| 1151 |
+
"q01": [-3.138795563420751, -0.56544608771801, -1.4952478170394896],
|
| 1152 |
+
"q99": [3.138720980629329, 0.2677614077925682, 2.0032371997833236],
|
| 1153 |
+
"std": [3.0257414011616577, 0.1622662085147332, 0.6404942954645315],
|
| 1154 |
+
},
|
| 1155 |
+
"gripper": {
|
| 1156 |
+
"max": [1.0370277166366577],
|
| 1157 |
+
"min": [0.04637829214334488],
|
| 1158 |
+
"q01": [0.05192930996417999],
|
| 1159 |
+
"q99": [1.0118417739868164],
|
| 1160 |
+
},
|
| 1161 |
+
"joints": {
|
| 1162 |
+
"max": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1163 |
+
"mean": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1164 |
+
"min": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1165 |
+
"q01": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1166 |
+
"q99": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1167 |
+
"std": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1168 |
+
},
|
| 1169 |
+
"translation": {
|
| 1170 |
+
"max": [0.5862360596656799, 0.4034728705883026, 0.3568263053894043],
|
| 1171 |
+
"mean": [
|
| 1172 |
+
0.309032678604126,
|
| 1173 |
+
0.03403777256608009,
|
| 1174 |
+
0.061277542263269424,
|
| 1175 |
+
],
|
| 1176 |
+
"min": [
|
| 1177 |
+
-0.04167502000927925,
|
| 1178 |
+
-0.2889411449432373,
|
| 1179 |
+
-0.13934996724128723,
|
| 1180 |
+
],
|
| 1181 |
+
"q01": [
|
| 1182 |
+
0.1711955964565277,
|
| 1183 |
+
-0.15639324486255646,
|
| 1184 |
+
-0.048255354166030884,
|
| 1185 |
+
],
|
| 1186 |
+
"q99": [
|
| 1187 |
+
0.4604376256465912,
|
| 1188 |
+
0.24112474918365479,
|
| 1189 |
+
0.18886254727840424,
|
| 1190 |
+
],
|
| 1191 |
+
"std": [
|
| 1192 |
+
0.0635896623134613,
|
| 1193 |
+
0.09153717756271362,
|
| 1194 |
+
0.049334850162267685,
|
| 1195 |
+
],
|
| 1196 |
+
},
|
| 1197 |
+
},
|
| 1198 |
+
"bridge_orig": {
|
| 1199 |
+
"euler": {
|
| 1200 |
+
"max": [3.141592653589793, 1.570796251296997, 3.141204357147217],
|
| 1201 |
+
"mean": [
|
| 1202 |
+
-0.25754162314671525,
|
| 1203 |
+
-0.12370228389510128,
|
| 1204 |
+
0.1620053749182691,
|
| 1205 |
+
],
|
| 1206 |
+
"min": [-3.141592653492551, -1.4832241535186768, -3.14153790473938],
|
| 1207 |
+
"q01": [-3.138795563420751, -0.56544608771801, -1.4952478170394896],
|
| 1208 |
+
"q99": [3.138720980629329, 0.2677614077925682, 2.0032371997833236],
|
| 1209 |
+
"std": [3.0257414011616577, 0.1622662085147332, 0.6404942954645315],
|
| 1210 |
+
},
|
| 1211 |
+
"gripper": {
|
| 1212 |
+
"max": [1.0370277166366577],
|
| 1213 |
+
"min": [0.04637829214334488],
|
| 1214 |
+
"q01": [0.05192930996417999],
|
| 1215 |
+
"q99": [1.0118417739868164],
|
| 1216 |
+
},
|
| 1217 |
+
"joints": {
|
| 1218 |
+
"max": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1219 |
+
"mean": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1220 |
+
"min": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1221 |
+
"q01": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1222 |
+
"q99": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1223 |
+
"std": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
|
| 1224 |
+
},
|
| 1225 |
+
"translation": {
|
| 1226 |
+
"max": [0.5862360596656799, 0.4034728705883026, 0.3568263053894043],
|
| 1227 |
+
"mean": [
|
| 1228 |
+
0.309032678604126,
|
| 1229 |
+
0.03403777256608009,
|
| 1230 |
+
0.061277542263269424,
|
| 1231 |
+
],
|
| 1232 |
+
"min": [
|
| 1233 |
+
-0.04167502000927925,
|
| 1234 |
+
-0.2889411449432373,
|
| 1235 |
+
-0.13934996724128723,
|
| 1236 |
+
],
|
| 1237 |
+
"q01": [
|
| 1238 |
+
0.1711955964565277,
|
| 1239 |
+
-0.15639324486255646,
|
| 1240 |
+
-0.048255354166030884,
|
| 1241 |
+
],
|
| 1242 |
+
"q99": [
|
| 1243 |
+
0.4604376256465912,
|
| 1244 |
+
0.24112474918365479,
|
| 1245 |
+
0.18886254727840424,
|
| 1246 |
+
],
|
| 1247 |
+
"std": [
|
| 1248 |
+
0.0635896623134613,
|
| 1249 |
+
0.09153717756271362,
|
| 1250 |
+
0.049334850162267685,
|
| 1251 |
+
],
|
| 1252 |
+
},
|
| 1253 |
+
},
|
| 1254 |
+
"droid": {
|
| 1255 |
+
"euler": {
|
| 1256 |
+
"max": [3.141592502593994, 1.5705928802490234, 3.1415867805480957],
|
| 1257 |
+
"mean": [
|
| 1258 |
+
0.3140628098409554,
|
| 1259 |
+
-0.09296274023036387,
|
| 1260 |
+
-0.07227215454779846,
|
| 1261 |
+
],
|
| 1262 |
+
"min": [
|
| 1263 |
+
-3.141592502593994,
|
| 1264 |
+
-1.5691150426864624,
|
| 1265 |
+
-3.1415374279022217,
|
| 1266 |
+
],
|
| 1267 |
+
"q01": [
|
| 1268 |
+
-3.1378602981567383,
|
| 1269 |
+
-1.2125312042236327,
|
| 1270 |
+
-2.1614069032669065,
|
| 1271 |
+
],
|
| 1272 |
+
"q99": [3.137854380607605, 0.9200375998020163, 1.9367506909370364],
|
| 1273 |
+
"std": [2.926265757944871, 0.363273475703332, 0.7576065217938824],
|
| 1274 |
+
},
|
| 1275 |
+
"gripper": {
|
| 1276 |
+
"max": [1.0],
|
| 1277 |
+
"min": [0.0],
|
| 1278 |
+
"q01": [0.0],
|
| 1279 |
+
"q99": [0.9911894202232361],
|
| 1280 |
+
},
|
| 1281 |
+
"joints": {
|
| 1282 |
+
"max": [
|
| 1283 |
+
2.668445110321045,
|
| 1284 |
+
1.5691218376159668,
|
| 1285 |
+
2.666306734085083,
|
| 1286 |
+
-0.3114914000034332,
|
| 1287 |
+
2.6624162197113037,
|
| 1288 |
+
4.28157901763916,
|
| 1289 |
+
2.752457857131958,
|
| 1290 |
+
],
|
| 1291 |
+
"mean": [
|
| 1292 |
+
0.023137084334640106,
|
| 1293 |
+
0.2704989977282293,
|
| 1294 |
+
-0.01451389357228282,
|
| 1295 |
+
-2.018709403792315,
|
| 1296 |
+
-0.042720520800030394,
|
| 1297 |
+
2.350281188152209,
|
| 1298 |
+
0.12424663946659845,
|
| 1299 |
+
],
|
| 1300 |
+
"min": [
|
| 1301 |
+
-2.6536705493927,
|
| 1302 |
+
-1.547789216041565,
|
| 1303 |
+
-2.6781487464904785,
|
| 1304 |
+
-2.9409868717193604,
|
| 1305 |
+
-2.6705946922302246,
|
| 1306 |
+
0.24893812835216522,
|
| 1307 |
+
-2.7615714073181152,
|
| 1308 |
+
],
|
| 1309 |
+
"q01": [
|
| 1310 |
+
-0.9026106441020965,
|
| 1311 |
+
-0.8547340619564057,
|
| 1312 |
+
-0.9028875434398651,
|
| 1313 |
+
-2.7698556280136106,
|
| 1314 |
+
-1.6851656341552732,
|
| 1315 |
+
1.2335169839859008,
|
| 1316 |
+
-1.9587260699272155,
|
| 1317 |
+
],
|
| 1318 |
+
"q99": [
|
| 1319 |
+
0.9569852340221403,
|
| 1320 |
+
1.4148830294609054,
|
| 1321 |
+
0.7693877756595566,
|
| 1322 |
+
-0.4545914208889008,
|
| 1323 |
+
1.5623322343826267,
|
| 1324 |
+
3.475611729621887,
|
| 1325 |
+
2.263479118347167,
|
| 1326 |
+
],
|
| 1327 |
+
"std": [
|
| 1328 |
+
0.31695080251469465,
|
| 1329 |
+
0.49522214687158767,
|
| 1330 |
+
0.27993538230553827,
|
| 1331 |
+
0.478161574676113,
|
| 1332 |
+
0.4969961591445458,
|
| 1333 |
+
0.45101008525403846,
|
| 1334 |
+
0.7287264344068457,
|
| 1335 |
+
],
|
| 1336 |
+
},
|
| 1337 |
+
"translation": {
|
| 1338 |
+
"max": [0.8575563430786133, 0.799155592918396, 1.0043904781341553],
|
| 1339 |
+
"mean": [
|
| 1340 |
+
0.5283099395864883,
|
| 1341 |
+
0.005363794653877434,
|
| 1342 |
+
0.3120132207021294,
|
| 1343 |
+
],
|
| 1344 |
+
"min": [
|
| 1345 |
+
-0.15604186058044434,
|
| 1346 |
+
-0.827903687953949,
|
| 1347 |
+
-0.2347021996974945,
|
| 1348 |
+
],
|
| 1349 |
+
"q01": [
|
| 1350 |
+
0.26669957995414734,
|
| 1351 |
+
-0.43774398624897004,
|
| 1352 |
+
-0.048167889714241026,
|
| 1353 |
+
],
|
| 1354 |
+
"q99": [0.7774086785316463, 0.428325751423835, 0.776091011762619],
|
| 1355 |
+
"std": [
|
| 1356 |
+
0.1148424841779685,
|
| 1357 |
+
0.17489566608140428,
|
| 1358 |
+
0.16541062032731538,
|
| 1359 |
+
],
|
| 1360 |
+
},
|
| 1361 |
+
},
|
| 1362 |
+
"roboset": {
|
| 1363 |
+
"euler": {
|
| 1364 |
+
"max": [3.1415449294818236, 1.5705575529715636, 3.141527342124582],
|
| 1365 |
+
"mean": [
|
| 1366 |
+
-0.0398455755412464,
|
| 1367 |
+
1.0518070390619125,
|
| 1368 |
+
-0.015345692503002759,
|
| 1369 |
+
],
|
| 1370 |
+
"min": [
|
| 1371 |
+
-3.1415813300509536,
|
| 1372 |
+
-1.5222832468962035,
|
| 1373 |
+
-3.141575300866071,
|
| 1374 |
+
],
|
| 1375 |
+
"q01": [
|
| 1376 |
+
-2.9414386317311187,
|
| 1377 |
+
-0.24976770655101155,
|
| 1378 |
+
-2.985256521212579,
|
| 1379 |
+
],
|
| 1380 |
+
"q99": [2.9380437893235993, 1.5403010739503078, 2.9746912523985025],
|
| 1381 |
+
"std": [1.7866587696177456, 0.40620530263065, 1.7288511340250616],
|
| 1382 |
+
},
|
| 1383 |
+
"gripper": {
|
| 1384 |
+
"max": [0.83056640625],
|
| 1385 |
+
"min": [0.0001499652862548828],
|
| 1386 |
+
"q01": [0.0001499652862548828],
|
| 1387 |
+
"q99": [0.82666015625],
|
| 1388 |
+
},
|
| 1389 |
+
"joints": {
|
| 1390 |
+
"max": [
|
| 1391 |
+
0.96240234375,
|
| 1392 |
+
1.1162109375,
|
| 1393 |
+
1.1064453125,
|
| 1394 |
+
-0.98095703125,
|
| 1395 |
+
2.30859375,
|
| 1396 |
+
1.576171875,
|
| 1397 |
+
1.7412109375,
|
| 1398 |
+
],
|
| 1399 |
+
"mean": [
|
| 1400 |
+
0.005913593806326389,
|
| 1401 |
+
0.1877261847257614,
|
| 1402 |
+
0.04653879255056381,
|
| 1403 |
+
-2.0529513359069824,
|
| 1404 |
+
-0.011298442259430885,
|
| 1405 |
+
0.6185526251792908,
|
| 1406 |
+
-0.01701134257018566,
|
| 1407 |
+
],
|
| 1408 |
+
"min": [
|
| 1409 |
+
-0.8330078125,
|
| 1410 |
+
-0.74658203125,
|
| 1411 |
+
-0.8642578125,
|
| 1412 |
+
-2.892578125,
|
| 1413 |
+
-1.390625,
|
| 1414 |
+
-0.24658203125,
|
| 1415 |
+
-2.953125,
|
| 1416 |
+
],
|
| 1417 |
+
"q01": [
|
| 1418 |
+
-0.41015625,
|
| 1419 |
+
-0.5302734375,
|
| 1420 |
+
-0.6455078125,
|
| 1421 |
+
-2.57421875,
|
| 1422 |
+
-0.76416015625,
|
| 1423 |
+
-0.0386962890625,
|
| 1424 |
+
-1.435546875,
|
| 1425 |
+
],
|
| 1426 |
+
"q99": [
|
| 1427 |
+
0.66455078125,
|
| 1428 |
+
0.9501953125,
|
| 1429 |
+
0.7529296875,
|
| 1430 |
+
-1.251953125,
|
| 1431 |
+
0.75244140625,
|
| 1432 |
+
1.2314453125,
|
| 1433 |
+
1.384765625,
|
| 1434 |
+
],
|
| 1435 |
+
"std": [
|
| 1436 |
+
0.17915399372577667,
|
| 1437 |
+
0.32234326004981995,
|
| 1438 |
+
0.26069700717926025,
|
| 1439 |
+
0.31767210364341736,
|
| 1440 |
+
0.205329030752182,
|
| 1441 |
+
0.33385637402534485,
|
| 1442 |
+
0.6263682842254639,
|
| 1443 |
+
],
|
| 1444 |
+
},
|
| 1445 |
+
"translation": {
|
| 1446 |
+
"max": [0.5747738480567932, 0.3972920775413513, 0.7443570494651794],
|
| 1447 |
+
"mean": [
|
| 1448 |
+
0.3331542909145355,
|
| 1449 |
+
0.019357483834028244,
|
| 1450 |
+
0.37330344319343567,
|
| 1451 |
+
],
|
| 1452 |
+
"min": [
|
| 1453 |
+
0.09978063404560089,
|
| 1454 |
+
-0.29593944549560547,
|
| 1455 |
+
0.10065606236457825,
|
| 1456 |
+
],
|
| 1457 |
+
"q01": [
|
| 1458 |
+
0.18437016010284424,
|
| 1459 |
+
-0.25699371099472046,
|
| 1460 |
+
0.15134164690971375,
|
| 1461 |
+
],
|
| 1462 |
+
"q99": [0.543661892414093, 0.29646238684654236, 0.6682320833206177],
|
| 1463 |
+
"std": [
|
| 1464 |
+
0.07849054038524628,
|
| 1465 |
+
0.12241040915250778,
|
| 1466 |
+
0.1460595279932022,
|
| 1467 |
+
],
|
| 1468 |
+
},
|
| 1469 |
+
},
|
| 1470 |
+
}
|
| 1471 |
+
|
| 1472 |
+
@cached_property
|
| 1473 |
+
def _control_stats(self) -> Dict[str, Dict[str, Dict[str, List[float]]]]:
|
| 1474 |
+
if is_global_norm(self.config.rotation_norm) and is_global_norm(self.config.translation_norm):
|
| 1475 |
+
return {}
|
| 1476 |
+
with open(self.config.control_stats_path, "r") as file:
|
| 1477 |
+
stats = yaml.safe_load(file)
|
| 1478 |
+
if self.config.delta_controls:
|
| 1479 |
+
if self.control_io_config.future_controls_sequence_stride_sec is None:
|
| 1480 |
+
horizon = 0.0
|
| 1481 |
+
else:
|
| 1482 |
+
horizon = self.control_io_config.future_controls_sequence_stride_sec
|
| 1483 |
+
elif self.control_io_config.future_controls_sequence_stride_sec is None:
|
| 1484 |
+
if self.control_io_config.future_controls_sequence_length == 1:
|
| 1485 |
+
horizon = 0.0
|
| 1486 |
+
else:
|
| 1487 |
+
raise NotImplementedError()
|
| 1488 |
+
else:
|
| 1489 |
+
horizon = (
|
| 1490 |
+
self.control_io_config.future_controls_sequence_length
|
| 1491 |
+
* self.control_io_config.future_controls_sequence_stride_sec
|
| 1492 |
+
)
|
| 1493 |
+
key = f"horizon_{round(horizon, 2)}s"
|
| 1494 |
+
if key in stats:
|
| 1495 |
+
stats = stats[key]
|
| 1496 |
+
else:
|
| 1497 |
+
raise ValueError(
|
| 1498 |
+
f"Missing control statistics key {key} for future_controls_sequence_length={self.config.control_io_config.future_controls_sequence_length} future_controls_sequence_stride_sec={self.config.control_io_config.future_controls_sequence_stride_sec}. Available keys: [{stats.keys()}]"
|
| 1499 |
+
)
|
| 1500 |
+
return stats
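# Editor's note: the control statistics file is keyed by prediction horizon in seconds.
# For example, future_controls_sequence_length=10 with
# future_controls_sequence_stride_sec=0.1 looks up "horizon_1.0s" (10 * 0.1, rounded to
# 2 decimals); with delta_controls=True the horizon is a single stride, i.e. "horizon_0.1s".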
|
| 1501 |
+
|
| 1502 |
+
@cached_property
|
| 1503 |
+
def dataset_names(self) -> List[str]:
|
| 1504 |
+
if (
|
| 1505 |
+
is_global_norm(self.config.rotation_norm)
|
| 1506 |
+
and is_global_norm(self.config.obs_rotation_norm)
|
| 1507 |
+
and is_global_norm(self.config.translation_norm)
|
| 1508 |
+
and is_global_norm(self.config.obs_translation_norm)
|
| 1509 |
+
):
|
| 1510 |
+
return ["ANY"]
|
| 1511 |
+
return list(set(self._control_stats.keys()) | set(self._observation_stats.keys()))
|
| 1512 |
+
|
| 1513 |
+
|
| 1514 |
+
def delta_to_relative_translations(translation_sequence: torch.Tensor) -> torch.Tensor:
|
| 1515 |
+
"""
|
| 1516 |
+
Transform a sequence of translation vectors encoded w.r.t. PREVIOUS frame in the sequence to encoding
|
| 1517 |
+
w.r.t. the 0-th element preceding the sequence
|
| 1518 |
+
Ex:
|
| 1519 |
+
Sequence of points: T1, T2, T3, T4
|
| 1520 |
+
`translation_sequence` contains the vectors: T0T1, T1T2, T2T3, T3T4, where T0 is the base frame,
|
| 1521 |
+
implicitly encoded in T0T1
|
| 1522 |
+
Output: T0T1, T0T2, T0T3, T0T4
|
| 1523 |
+
|
| 1524 |
+
Args:
|
| 1525 |
+
translation_sequence: torch.Tensor of shape [..., S, 3], containing the translation vectors, where S
|
| 1526 |
+
corresponds to the sequence dimension
|
| 1527 |
+
Returns:
|
| 1528 |
+
torch.Tensor of the same shape as translation_sequence, containing delta translations
|
| 1529 |
+
"""
|
| 1530 |
+
assert translation_sequence.ndim >= 3, translation_sequence.shape
|
| 1531 |
+
delta_translations = torch.cumsum(translation_sequence, dim=-2)
|
| 1532 |
+
return delta_translations
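# Editor's example (illustrative): with per-step deltas
#   [[1, 0, 0], [0, 1, 0], [0, 0, 1]]        # shape [1, 3, 3] after adding a batch dim
# the cumulative sum over the sequence axis yields offsets from the base frame T0:
#   [[1, 0, 0], [1, 1, 0], [1, 1, 1]]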
|
| 1533 |
+
|
| 1534 |
+
|
| 1535 |
+
class RegressionProcessor(VLAMProcessor):
|
| 1536 |
+
def policy_control_plan_from_model_target(
|
| 1537 |
+
self, target: RoboticsTarget, dataset_name: np.ndarray
|
| 1538 |
+
) -> RoboticsControlPlan:
|
| 1539 |
+
translation_m = self.unnormalize(target.translation, dataset_name=dataset_name, key="translation")
|
| 1540 |
+
rotation = self.unnormalize(target.rotation, dataset_name=dataset_name, key="rotation")
|
| 1541 |
+
rotmat = convert_rotation(rotation, RotationFormat.ROTMAT)
|
| 1542 |
+
gripper_prob = target.gripper
|
| 1543 |
+
if self.config.delta_controls:
|
| 1544 |
+
translation_m = delta_to_relative_translations(translation_m)
|
| 1545 |
+
rotmat = delta_to_relative_rotations(rotmat)
|
| 1546 |
+
return RoboticsControlPlan(
|
| 1547 |
+
translation_m=translation_m,
|
| 1548 |
+
rotmat=rotmat,
|
| 1549 |
+
gripper_prob=gripper_prob,
|
| 1550 |
+
valid_mask=target.valid_mask,
|
| 1551 |
+
)
|
| 1552 |
+
|
| 1553 |
+
def policy_control_plan_from_model_output(
|
| 1554 |
+
self,
|
| 1555 |
+
model_output: RoboticsOutput,
|
| 1556 |
+
dataset_name: np.ndarray,
|
| 1557 |
+
valid_mask: torch.Tensor,
|
| 1558 |
+
) -> RoboticsControlPlan:
|
| 1559 |
+
"""Called during inference to create control plan from model output"""
|
| 1560 |
+
translation_m = self.unnormalize(
|
| 1561 |
+
model_output.translation, dataset_name=dataset_name, key="translation"
|
| 1562 |
+
)
|
| 1563 |
+
rotation = self.unnormalize(model_output.rotation, dataset_name=dataset_name, key="rotation")
|
| 1564 |
+
rotmat = convert_rotation(rotation, RotationFormat.ROTMAT, autonorm=True)
|
| 1565 |
+
gripper_prob = torch.sigmoid(model_output.gripper)
|
| 1566 |
+
if self.config.delta_controls:
|
| 1567 |
+
translation_m = delta_to_relative_translations(translation_m)
|
| 1568 |
+
rotmat = delta_to_relative_rotations(rotmat)
|
| 1569 |
+
return RoboticsControlPlan(
|
| 1570 |
+
translation_m=translation_m,
|
| 1571 |
+
rotmat=rotmat,
|
| 1572 |
+
gripper_prob=gripper_prob,
|
| 1573 |
+
valid_mask=valid_mask,
|
| 1574 |
+
)
|
| 1575 |
+
|
| 1576 |
+
|
| 1577 |
+
class PiZeroFlowMatchingProcessor(RegressionProcessor):
|
| 1578 |
+
def __init__(self, **kwargs):
|
| 1579 |
+
super().__init__(**kwargs)
|
| 1580 |
+
self.generator: torch.Generator = torch.Generator()
|
| 1581 |
+
|
| 1582 |
+
@cached_property
|
| 1583 |
+
def beta_distribution(self) -> torch.distributions.Beta:
|
| 1584 |
+
return torch.distributions.Beta(
|
| 1585 |
+
self.config.distribution_hyperparams.get("alpha", 1.5),
|
| 1586 |
+
self.config.distribution_hyperparams.get("beta", 1.0),
|
| 1587 |
+
)
|
| 1588 |
+
|
| 1589 |
+
def create_input(self, *args, **kwargs) -> RoboticsFlowInput:
|
| 1590 |
+
"""In practice used only during inference"""
|
| 1591 |
+
inputs = super().create_input(*args, **kwargs)
|
| 1592 |
+
flow_input: FlowInput = self.sample_t0_input(batch_size=1, device=torch.device("cpu"))
|
| 1593 |
+
inputs = RoboticsFlowInput(**inputs.as_json(), flow_input=flow_input[0, ...])
|
| 1594 |
+
return inputs
|
| 1595 |
+
|
| 1596 |
+
def sample_timestep(self, batch_size: int) -> torch.Tensor:
|
| 1597 |
+
if self.config.timestep_distribution.lower() == "uniform":
|
| 1598 |
+
eps = 1e-05
|
| 1599 |
+
sample = (torch.rand(1, generator=self.generator) + torch.arange(batch_size) / batch_size) % (
|
| 1600 |
+
1 - eps
|
| 1601 |
+
)
|
| 1602 |
+
elif self.config.timestep_distribution.lower() == "beta":
|
| 1603 |
+
sample = self.beta_distribution.sample([batch_size, 1, 1])
|
| 1604 |
+
sample = (1 - self.config.sig_min) * (1 - sample)
|
| 1605 |
+
else:
|
| 1606 |
+
raise NotImplementedError(self.config.timestep_distribution)
|
| 1607 |
+
sample = sample.view(batch_size, 1, 1)
|
| 1608 |
+
return sample
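# --- Editor's sketch (not part of the original file; torch is already imported at the
# top of this module, and the helper name is hypothetical). It mirrors the "uniform"
# branch above: one shared random offset plus an evenly spaced lattice i / batch_size
# gives a low-discrepancy set of flow-matching timesteps in [0, 1 - eps), instead of
# batch_size independent uniform draws. ---
def _sketch_stratified_timesteps(batch_size: int, seed: int = 0) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)
    eps = 1e-05
    u = torch.rand(1, generator=gen)  # one shared random offset
    t = (u + torch.arange(batch_size) / batch_size) % (1 - eps)
    return t.view(batch_size, 1, 1)  # same shape convention as sample_timestep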
|
| 1609 |
+
|
| 1610 |
+
def _psi_t(self, timestep: torch.Tensor, x_0: torch.Tensor, x_1: torch.Tensor) -> torch.Tensor:
|
| 1611 |
+
return (1 - (1 - self.config.sig_min) * timestep) * x_0 + timestep * x_1
|
| 1612 |
+
|
| 1613 |
+
def _dpsi_dt(self, x_0: torch.Tensor, x_1: torch.Tensor) -> torch.Tensor:
|
| 1614 |
+
return x_1 - (1 - self.config.sig_min) * x_0
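# Editor's note: _psi_t / _dpsi_dt implement the optimal-transport conditional path used
# in flow matching: psi_t(x0, x1) = (1 - (1 - sig_min) * t) * x0 + t * x1, whose time
# derivative x1 - (1 - sig_min) * x0 is the regression target for the predicted velocity.
# At t=0 the path sits at the noise sample x0; at t=1 it reaches x1 up to a residual
# sig_min * x0.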
|
| 1615 |
+
|
| 1616 |
+
def sample_t0_input(self, batch_size: int, device: torch.device) -> FlowInput:
|
| 1617 |
+
if self.config.r0_distribution == "normal":
|
| 1618 |
+
controls_t0 = torch.randn(
|
| 1619 |
+
[
|
| 1620 |
+
batch_size,
|
| 1621 |
+
self.config.control_io_config.future_controls_sequence_length,
|
| 1622 |
+
3 + self.rotation_components + 1,
|
| 1623 |
+
],
|
| 1624 |
+
generator=self.generator,
|
| 1625 |
+
).to(device=device)
|
| 1626 |
+
(translation_t0, rotation_t0, gripper_t0) = torch.split(
|
| 1627 |
+
controls_t0, [3, self.rotation_components, 1], dim=-1
|
| 1628 |
+
)
|
| 1629 |
+
rotation_t0 = normalize_rotation(rotation_t0)
|
| 1630 |
+
elif self.config.r0_distribution == "uniform":
|
| 1631 |
+
controls_t0 = torch.randn(
|
| 1632 |
+
[
|
| 1633 |
+
batch_size,
|
| 1634 |
+
self.config.control_io_config.future_controls_sequence_length,
|
| 1635 |
+
4,
|
| 1636 |
+
],
|
| 1637 |
+
generator=self.generator,
|
| 1638 |
+
).to(device=device)
|
| 1639 |
+
(translation_t0, gripper_t0) = torch.split(controls_t0, [3, 1], dim=-1)
|
| 1640 |
+
rotation_t0 = convert_rotation(
|
| 1641 |
+
roma.random_unitquat(
|
| 1642 |
+
(
|
| 1643 |
+
batch_size,
|
| 1644 |
+
self.config.control_io_config.future_controls_sequence_length,
|
| 1645 |
+
),
|
| 1646 |
+
device=device,
|
| 1647 |
+
),
|
| 1648 |
+
self.config.rotation_format,
|
| 1649 |
+
)
|
| 1650 |
+
else:
|
| 1651 |
+
raise NotImplementedError(self.config.r0_distribution)
|
| 1652 |
+
if self.config.rotation_format == RotationFormat.QUATERNION:
|
| 1653 |
+
rotation_t0 = quaternion_half_cover(rotation_t0)
|
| 1654 |
+
timestep = torch.zeros([batch_size, 1, 1], device=device)
|
| 1655 |
+
return FlowInput(
|
| 1656 |
+
timestep=timestep,
|
| 1657 |
+
translation_t0=translation_t0,
|
| 1658 |
+
rotation_t0=rotation_t0,
|
| 1659 |
+
gripper_t0=gripper_t0,
|
| 1660 |
+
translation_t=None,
|
| 1661 |
+
rotation_t=None,
|
| 1662 |
+
gripper_t=None,
|
| 1663 |
+
)
|
| 1664 |
+
|
| 1665 |
+
def policy_control_plan_from_model_output(
|
| 1666 |
+
self,
|
| 1667 |
+
model_output: RoboticsOutput,
|
| 1668 |
+
dataset_name: np.ndarray,
|
| 1669 |
+
valid_mask: torch.Tensor,
|
| 1670 |
+
) -> RoboticsControlPlan:
|
| 1671 |
+
if self.config.translation_norm == Normalization.NONE or is_mean_norm(self.config.translation_norm):
|
| 1672 |
+
model_output = model_output.replace(translation=torch.clamp(model_output.translation, -1, 1))
|
| 1673 |
+
if self.config.rotation_norm == Normalization.NONE or is_mean_norm(self.config.rotation_norm):
|
| 1674 |
+
model_output = model_output.replace(rotation=torch.clamp(model_output.rotation, -1, 1))
|
| 1675 |
+
control_plan = super().policy_control_plan_from_model_output(model_output, dataset_name, valid_mask)
|
| 1676 |
+
control_plan = control_plan.replace(gripper_prob=torch.clamp(model_output.gripper, 0, 1))
|
| 1677 |
+
return control_plan
|
| 1678 |
+
|
| 1679 |
+
|
| 1680 |
+
def make_causal_mask(shape: Sequence[int]) -> torch.Tensor:
|
| 1681 |
+
"""
|
| 1682 |
+
Create a causal attention mask of shape `shape`
|
| 1683 |
+
Args:
|
| 1684 |
+
shape: Shape of the output mask, the last two dimensions correspond to [query_seq_len, kv_seq_len]
|
| 1685 |
+
Returns:
|
| 1686 |
+
torch.Tensor of dtype torch.bool. False values indicate that the row (i.e. query) can't attend
|
| 1687 |
+
to the corresponding column (i.e. key)
|
| 1688 |
+
|
| 1689 |
+
Example:
|
| 1690 |
+
shape = (3, 5) -> Mask the upper triangular part
|
| 1691 |
+
[
|
| 1692 |
+
[ 1, 0, 0, 0, 0],
|
| 1693 |
+
[ 1, 1, 0, 0, 0],
|
| 1694 |
+
[ 1, 1, 1, 0, 0]
|
| 1695 |
+
]
|
| 1696 |
+
"""
|
| 1697 |
+
return torch.tril(torch.ones(shape, dtype=torch.bool), diagonal=0)
|
| 1698 |
+
|
| 1699 |
+
|
| 1700 |
+
def enable_full_attn_blocks(attn_mask: torch.Tensor, full_attn: torch.Tensor) -> torch.Tensor:
|
| 1701 |
+
"""
|
| 1702 |
+
Enable full bi-directional attention in `attn_mask` inside specific blocks
|
| 1703 |
+
Args:
|
| 1704 |
+
attn_mask: Existing attention mask of shape [..., query_seq_len, kv_seq_len] and dtype torch.bool
|
| 1705 |
+
where False values indicate disabled attention
|
| 1706 |
+
full_attn: torch.Tensor of shape [query_seq_len], dtype torch.bool. Blocks of True values indicate
|
| 1707 |
+
positions where full bi-directional attention should be enabled
|
| 1708 |
+
|
| 1709 |
+
Example:
|
| 1710 |
+
1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
|
| 1711 |
+
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
|
| 1712 |
+
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
|
| 1713 |
+
1, 1, 1, 1, 0, 0, 0, 0, -> 1, 1, 1, 1, 0, 0, 0, 0,
|
| 1714 |
+
1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
|
| 1715 |
+
1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
|
| 1716 |
+
1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
|
| 1717 |
+
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
| 1718 |
+
|
| 1719 |
+
"""
|
| 1720 |
+
assert full_attn.dtype == torch.bool, full_attn.dtype
|
| 1721 |
+
assert full_attn.ndim == 1, full_attn.shape
|
| 1722 |
+
assert full_attn.shape[0] == attn_mask.shape[-2], f"{full_attn.shape[0]}, {attn_mask.shape}"
|
| 1723 |
+
if attn_mask.shape[-1] != attn_mask.shape[-2]:
|
| 1724 |
+
raise NotImplementedError("Only self-attention supported right now.")
|
| 1725 |
+
x = full_attn.view(-1, 1) & full_attn.view(1, -1)
|
| 1726 |
+
x = x | make_causal_mask([full_attn.shape[0], full_attn.shape[0]])
|
| 1727 |
+
x = torch.cumprod(x, dim=1).to(dtype=torch.bool)
|
| 1728 |
+
x = x & x.permute(1, 0)
|
| 1729 |
+
mask_positions = (torch.sum(x, dim=0) == 1) & ~full_attn
|
| 1730 |
+
mask_indices = torch.where(mask_positions)[0]
|
| 1731 |
+
x[mask_indices, mask_indices] = 0
|
| 1732 |
+
attn_mask = attn_mask | expand_dims(x, ndim=attn_mask.ndim, order=[-1, 1, 1])
|
| 1733 |
+
return attn_mask
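# Editor's note: combined with make_causal_mask, this roughly mirrors PaliGemma's
# prefix-LM masking. PaliGemmaProcessor.preprocess_inputs passes
# full_attn=(target_ids == IGNORE_INDEX), so positions that carry no training target
# (image tokens, the instruction, separators) attend bidirectionally within their
# contiguous block, while positions that are trained (model turns, <eos>) stay causal.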
|
| 1734 |
+
|
| 1735 |
+
|
| 1736 |
+
IGNORE_INDEX = -100
|
| 1737 |
+
|
| 1738 |
+
|
| 1739 |
+
class PaliGemmaProcessor(VLMProcessor):
|
| 1740 |
+
def __init__(
|
| 1741 |
+
self,
|
| 1742 |
+
config: PaliGemmaProcessorConfig,
|
| 1743 |
+
hf_processor: transformers.models.paligemma.processing_paligemma.PaliGemmaProcessor,
|
| 1744 |
+
**kwargs,
|
| 1745 |
+
):
|
| 1746 |
+
del kwargs
|
| 1747 |
+
super().__init__(config)
|
| 1748 |
+
self.hf_processor = hf_processor
|
| 1749 |
+
self.hf_processor.image_processor.size = dict(self.config.image_sizes["main"].as_json())
|
| 1750 |
+
self.hf_processor.image_seq_length = self.config.num_image_tokens["main"]
|
| 1751 |
+
self.hf_processor.image_processor.image_seq_length = self.config.num_image_tokens["main"]
|
| 1752 |
+
self.bos_id: int = self.tokenizer.bos_token_id
|
| 1753 |
+
self.eos_id: int = self.tokenizer.eos_token_id
|
| 1754 |
+
self.sep_token = "\n"
|
| 1755 |
+
self.sep_id: int = self.tokenizer(
|
| 1756 |
+
self.sep_token,
|
| 1757 |
+
padding=False,
|
| 1758 |
+
add_special_tokens=False,
|
| 1759 |
+
return_attention_mask=False,
|
| 1760 |
+
)["input_ids"][0]
|
| 1761 |
+
self.image_token_id: int = self.tokenizer(
|
| 1762 |
+
self.config.image_token,
|
| 1763 |
+
padding=False,
|
| 1764 |
+
add_special_tokens=False,
|
| 1765 |
+
return_attention_mask=False,
|
| 1766 |
+
)["input_ids"][0]
|
| 1767 |
+
self.image_tokens: list[int] = [self.image_token_id] * sum(self.config.num_image_tokens.values())
|
| 1768 |
+
self.bbox_pattern = re.compile(
|
| 1769 |
+
"\\[(\\d+\\.\\d+),\\s*(\\d+\\.\\d+),\\s*(\\d+\\.\\d+),\\s*(\\d+\\.\\d+)\\]"
|
| 1770 |
+
)
|
| 1771 |
+
|
| 1772 |
+
def preprocess_inputs(
|
| 1773 |
+
self, chat: List[str], images: Dict[str, List[PIL.Image.Image]]
|
| 1774 |
+
) -> Dict[str, torch.Tensor | Dict[str, torch.Tensor]]:
|
| 1775 |
+
"""
|
| 1776 |
+
Based on PaliGemma paper https://arxiv.org/pdf/2407.07726 and example code at
|
| 1777 |
+
https://ai.google.dev/gemma/docs/paligemma/fine-tuning-paligemma#create_model_inputs
|
| 1778 |
+
The chat must always consist of alternating user and model messages, starting with the user
|
| 1779 |
+
|
| 1780 |
+
<image><image> ... <bos><instruction><sep><assistant><sep><instruction><sep><assistant>...<eos>
|
| 1781 |
+
|
| 1782 |
+
Args:
|
| 1783 |
+
chat: List[str] of even size where each entry corresponds to a different turn in the conversation
|
| 1784 |
+
images: Dict[str, List[PIL.Image.Image]] where different cameras correspond to different keys
|
| 1785 |
+
in the Dict and the List corresponds to history of images
|
| 1786 |
+
"""
|
| 1787 |
+
for key, value in images.items():
|
| 1788 |
+
if not isinstance(value, list):
|
| 1789 |
+
raise TypeError(f"Camera {key} contains values of type {type(value)} instead of list")
|
| 1790 |
+
(input_ids, target_ids) = ([], [])
|
| 1791 |
+
for i, text in enumerate(chat):
|
| 1792 |
+
text = text.replace(self.sep_token, " ").replace("<image>", "")
|
| 1793 |
+
text = self.bbox_pattern.sub(self._bbox_to_loc_tokens, text)
|
| 1794 |
+
turn_input_ids: List[int] = self.tokenizer(
|
| 1795 |
+
text,
|
| 1796 |
+
padding=False,
|
| 1797 |
+
add_special_tokens=False,
|
| 1798 |
+
return_attention_mask=False,
|
| 1799 |
+
)["input_ids"]
|
| 1800 |
+
if i % 2 == 0:
|
| 1801 |
+
turn_target_ids = [IGNORE_INDEX] * len(turn_input_ids)
|
| 1802 |
+
else:
|
| 1803 |
+
turn_target_ids = turn_input_ids
|
| 1804 |
+
if i != len(chat) - 1:
|
| 1805 |
+
turn_input_ids = turn_input_ids + [self.sep_id]
|
| 1806 |
+
turn_target_ids = turn_target_ids + [IGNORE_INDEX]
|
| 1807 |
+
input_ids = input_ids + turn_input_ids
|
| 1808 |
+
target_ids = target_ids + turn_target_ids
|
| 1809 |
+
input_ids = [self.bos_id] + input_ids + [self.eos_id]
|
| 1810 |
+
target_ids = [IGNORE_INDEX] + target_ids + [self.eos_id]
|
| 1811 |
+
image_tokens = self.image_tokens
|
| 1812 |
+
if self.config.max_language_tokens > 0:
|
| 1813 |
+
input_ids = input_ids[: self.config.max_language_tokens]
|
| 1814 |
+
target_ids = target_ids[: self.config.max_language_tokens]
|
| 1815 |
+
input_ids = image_tokens + input_ids
|
| 1816 |
+
target_ids = [IGNORE_INDEX] * len(image_tokens) + target_ids
|
| 1817 |
+
input_ids = torch.tensor(input_ids, dtype=torch.int64)
|
| 1818 |
+
target_ids = torch.tensor(target_ids, dtype=torch.int64)
|
| 1819 |
+
image_tensors: Dict[str, torch.Tensor] = {
|
| 1820 |
+
f"{camera_name}.siglip": self.hf_processor.image_processor(
|
| 1821 |
+
camera_images,
|
| 1822 |
+
size=self.config.image_sizes[camera_name].as_json(),
|
| 1823 |
+
return_tensors="pt",
|
| 1824 |
+
)["pixel_values"]
|
| 1825 |
+
for (camera_name, camera_images) in images.items()
|
| 1826 |
+
}
|
| 1827 |
+
attn_mask = make_causal_mask([len(input_ids), len(input_ids)])
|
| 1828 |
+
attn_mask = enable_full_attn_blocks(attn_mask, full_attn=target_ids == IGNORE_INDEX)
|
| 1829 |
+
return {
|
| 1830 |
+
"input_ids": input_ids,
|
| 1831 |
+
"target_ids": target_ids,
|
| 1832 |
+
"images": image_tensors,
|
| 1833 |
+
"attn_mask": attn_mask,
|
| 1834 |
+
}
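# Editor's example (illustrative; ids are placeholders): for a two-turn chat
#   ["pick up the cube", "<model answer>"]
# the packed input is
#   [<image> x num_image_tokens] [<bos>] [user ids] [<sep>] [model ids] [<eos>]
# and target_ids mirrors input_ids except that the image block, <bos>, the user turn and
# <sep> are set to IGNORE_INDEX, so the language-modelling loss only covers the model
# turn and <eos>.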
|
| 1835 |
+
|
| 1836 |
+
@property
|
| 1837 |
+
def tokenizer(self) -> transformers.PreTrainedTokenizerBase:
|
| 1838 |
+
return self.hf_processor.tokenizer
|
| 1839 |
+
|
| 1840 |
+
@staticmethod
|
| 1841 |
+
def _bbox_to_loc_tokens(match: re.Match) -> str:
|
| 1842 |
+
"""
|
| 1843 |
+
https://developers.googleblog.com/en/gemma-explained-paligemma-architecture/
|
| 1844 |
+
"""
|
| 1845 |
+
floats = list(map(float, match.groups()))
|
| 1846 |
+
transformed = [f"<loc{np.clip(round(num * 1024), 0, 1023):04d}>" for num in floats]
|
| 1847 |
+
return f"[{', '.join(transformed)}]"
|
| 1848 |
+
|
| 1849 |
+
@property
|
| 1850 |
+
def image_sizes(self) -> Dict[str, ImageSizeConfig]:
|
| 1851 |
+
return self.config.image_sizes
|
| 1852 |
+
|
| 1853 |
+
|
| 1854 |
+
class PaliGemmaDepthProcessor(PaliGemmaProcessor):
|
| 1855 |
+
def __init__(
|
| 1856 |
+
self,
|
| 1857 |
+
config: PaliGemmaProcessorConfig,
|
| 1858 |
+
hf_processor: transformers.models.paligemma.processing_paligemma.PaliGemmaProcessor,
|
| 1859 |
+
depth_tokens: int,
|
| 1860 |
+
):
|
| 1861 |
+
super().__init__(config, hf_processor)
|
| 1862 |
+
vocab_size = len(self.tokenizer)
|
| 1863 |
+
self.depth_token_ids = np.arange(vocab_size - depth_tokens, vocab_size)
|
| 1864 |
+
self.depth_input_transforms = {
|
| 1865 |
+
camera_name: torchvision.transforms.v2.Compose(
|
| 1866 |
+
[
|
| 1867 |
+
torchvision.transforms.v2.Resize(
|
| 1868 |
+
size=(camera_image_size.height, camera_image_size.width),
|
| 1869 |
+
interpolation=torchvision.transforms.v2.InterpolationMode.BICUBIC,
|
| 1870 |
+
max_size=None,
|
| 1871 |
+
antialias=True,
|
| 1872 |
+
),
|
| 1873 |
+
torchvision.transforms.v2.ToTensor(),
|
| 1874 |
+
torchvision.transforms.v2.Normalize(
|
| 1875 |
+
mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
|
| 1876 |
+
),
|
| 1877 |
+
]
|
| 1878 |
+
)
|
| 1879 |
+
for (camera_name, camera_image_size) in self.config.image_sizes.items()
|
| 1880 |
+
}
|
| 1881 |
+
|
| 1882 |
+
def preprocess_inputs(
|
| 1883 |
+
self, chat: List[str], images: Dict[str, List[PIL.Image.Image]]
|
| 1884 |
+
) -> Dict[str, torch.Tensor | Dict[str, torch.Tensor]]:
|
| 1885 |
+
inputs = super().preprocess_inputs(chat=chat, images=images)
|
| 1886 |
+
depth_images: Dict[str, torch.Tensor] = {
|
| 1887 |
+
f"{camera_name}.depth": torch.stack(
|
| 1888 |
+
self.depth_input_transforms[camera_name](camera_images), dim=0
|
| 1889 |
+
)
|
| 1890 |
+
for (camera_name, camera_images) in images.items()
|
| 1891 |
+
}
|
| 1892 |
+
inputs["images"] = {**inputs["images"], **depth_images}
|
| 1893 |
+
return inputs
|
| 1894 |
+
|
| 1895 |
+
@property
|
| 1896 |
+
def num_depth_tokens(self) -> int:
|
| 1897 |
+
return len(self.depth_token_ids)
|