EQX55 committed
Commit 8c21234 · verified · 1 Parent(s): 0047d0a

Upload 24 files

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ structured_outputs.png filter=lfs diff=lfs merge=lfs -text
37
+ open_vocab_detect.png filter=lfs diff=lfs merge=lfs -text
38
+ point_count.png filter=lfs diff=lfs merge=lfs -text
39
+ visual_reasoning.png filter=lfs diff=lfs merge=lfs -text
LICENSE.md ADDED
@@ -0,0 +1,25 @@
1
+ | License | Business Source License (BSL 1.1) |
2
+ | --- | --- |
3
+ | Licensor | M87 Labs, Inc. |
4
+ | Licensed Work | “Moondream 3 (Preview)” including Model Weights and any Derivatives (“Derivatives” include fine-tunes, merges, quantizations, weight deltas, and other weight-level modifications or conversions.) |
5
+ | Additional Use Grant | You may use the Licensed Work and Derivatives for any purpose, including commercial use, and you may self-host them for your or your organization’s internal use. You may not provide the Licensed Work, Derivatives, or any service that exposes their functionality to third parties (including via API, website, application, model hub, or dataset/model redistribution) without a separate commercial agreement with the Licensor. |
6
+ | Change Date | Two years after the first public release of this version of the Licensed Work |
7
+ | Change License | Apache License, Version 2.0 |
8
+
9
+ The text of the Business Source License 1.1 follows. License text copyright (c) 2020 MariaDB Corporation Ab, All Rights Reserved. “Business Source License” is a trademark of MariaDB Corporation Ab.
10
+
11
+ ## Terms
12
+
13
+ The Licensor hereby grants you the right to copy, modify, create derivative works, redistribute, and make non-production use of the Licensed Work. The Licensor may make an Additional Use Grant, above, permitting limited production use.
14
+
15
+ Effective on the Change Date, or the fourth anniversary of the first publicly available distribution of a specific version of the Licensed Work under this License, whichever comes first, the Licensor hereby grants you rights under the terms of the Change License, and the rights granted in the paragraph above terminate.
16
+
17
+ If your use of the Licensed Work does not comply with the requirements currently in effect as described in this License, you must purchase a commercial license from the Licensor, its affiliated entities, or authorized resellers, or you must refrain from using the Licensed Work.
18
+
19
+ All copies of the original and modified Licensed Work, and derivative works of the Licensed Work, are subject to this License. This License applies separately for each version of the Licensed Work and the Change Date may vary for each version of the Licensed Work released by Licensor.
20
+
21
+ You must conspicuously display this License on each original or modified copy of the Licensed Work. If you receive the Licensed Work in original or modified form from a third party, the terms and conditions set forth in this License apply to your use of that work.
22
+
23
+ Any use of the Licensed Work in violation of this License will automatically terminate your rights under this License for the current and all other versions of the Licensed Work.
24
+
25
+ This License does not grant you any right in any trademark or logo of Licensor or its affiliates (provided that you may use a trademark or logo of Licensor as expressly required by this License). TO THE EXTENT PERMITTED BY APPLICABLE LAW, THE LICENSED WORK IS PROVIDED ON AN “AS IS” BASIS. LICENSOR HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS OR IMPLIED, INCLUDING (WITHOUT LIMITATION) WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, AND TITLE.
README.md ADDED
@@ -0,0 +1,204 @@
1
+ ---
2
+ library_name: transformers
3
+ pipeline_tag: image-text-to-text
4
+ license: other
5
+ ---
6
+
7
+ **Moondream 3 (Preview)** is a vision language model with a mixture-of-experts architecture (9B total parameters, 2B active). This model makes no compromises, delivering state-of-the-art visual reasoning while still retaining our efficient and deployment-friendly ethos.
8
+
9
+ [✨ Demo](https://moondream.ai/c/playground)   ·   [☁️ Cloud API](https://moondream.ai/c/docs/quickstart)   ·   [📝 Release notes](https://moondream.ai/blog/moondream-3-preview)
10
+
11
+ ![](https://huggingface.co/moondream/moondream3-preview/resolve/main/open_vocab_detect.png)
12
+ ![](https://huggingface.co/moondream/moondream3-preview/resolve/main/visual_reasoning.png)
13
+ ![](https://huggingface.co/moondream/moondream3-preview/resolve/main/point_count.png)
14
+ ![](https://huggingface.co/moondream/moondream3-preview/resolve/main/structured_outputs.png)
15
+
16
+ ## Architecture
17
+
18
+ 1. 24 layers; the first four are dense, the rest have MoE FFNs with 64 experts, 8 activated per token
19
+ 2. MoE FFNs have GeGLU architecture, with inner/gate dim of 1024. The model's hidden dim is 2048.
20
+ 3. Usable context length increased to 32K, with [a custom efficient SuperBPE tokenizer](https://huggingface.co/moondream/starmie-v1)
21
+ 4. Multi-headed attention with learned position- and data-dependent temperature scaling
22
+ 5. SigLIP-based vision encoder, with multi-crop channel concatenation for token-efficient high resolution image processing
23
+
24
+ For more details, please refer to the [release notes](https://moondream.ai/blog/moondream-3-preview), or try the model out in our [playground demo](https://moondream.ai/c/playground).
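+
+ As a rough illustration of the MoE FFN shape listed above (hidden dim 2048, inner/gate dim 1024, GeGLU activation), here is a minimal sketch of a single expert. The `GeGLUExpert` class and standalone module are our own illustrative naming, not the shipped implementation, though the GeGLU form mirrors `moe_mlp` in `layers.py`:
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class GeGLUExpert(nn.Module):
+     """Illustrative single MoE expert FFN; not the shipped implementation."""
+
+     def __init__(self, dim: int = 2048, inner_dim: int = 1024):
+         super().__init__()
+         self.fc1 = nn.Linear(dim, 2 * inner_dim, bias=False)  # value and gate halves
+         self.fc2 = nn.Linear(inner_dim, dim, bias=False)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         v, g = self.fc1(x).chunk(2, dim=-1)
+         return self.fc2(F.gelu(v) * (g + 1))  # GeGLU form used in layers.py
+
+ # Each token is routed to 8 of the 64 such experts, and their outputs are
+ # combined with softmax router weights (see moe_mlp in layers.py).
+ x = torch.randn(1, 2048)
+ print(GeGLUExpert()(x).shape)  # torch.Size([1, 2048])
+ ```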
25
+
26
+ The following instructions demonstrate how to run the model locally using Transformers. We also offer a [cloud API](https://moondream.ai/c/docs/quickstart) with a generous free tier that can help you get started quicker!
27
+
28
+ ## Usage
29
+
30
+ Load the model and prepare it for inference. We use [FlexAttention for inference](https://pytorch.org/blog/flexattention-for-inference/), so calling `.compile()` is critical for fast decoding. Our `compile` implementation also handles warmup, so you can start making requests directly once it returns.
31
+
32
+ ```python
33
+ import torch
34
+ from transformers import AutoModelForCausalLM
35
+
36
+ moondream = AutoModelForCausalLM.from_pretrained(
37
+ "moondream/moondream3-preview",
38
+ trust_remote_code=True,
39
+ dtype=torch.bfloat16,
40
+ device_map={"": "cuda"},
41
+ )
42
+ moondream.compile()
43
+ ```
44
+
45
+ The model comes with four skills, tailored towards different visual understanding tasks.
46
+
47
+ ### Query
48
+
49
+ The `query` skill can be used to ask open-ended questions about images.
50
+
51
+ ```python
52
+ from PIL import Image
53
+
54
+ # Simple VQA
55
+ image = Image.open("photo.jpg")
56
+ result = moondream.query(image=image, question="What's in this image?")
57
+ print(result["answer"])
58
+ ```
59
+
60
+ By default, `query` runs in reasoning mode, allowing the model to "think" about the question before generating an answer. This is helpful for more complicated tasks, but sometimes the task you're running is simple and doesn't benefit from reasoning. To save on inference cost when this is the case, you can disable reasoning:
61
+
62
+ ```python
63
+ # Without reasoning for simple questions
64
+ result = moondream.query(
65
+ image=image,
66
+ question="What color is the sky?",
67
+ reasoning=False
68
+ )
69
+ print(result["answer"])
70
+ ```
71
+
72
+ If you want to stream outputs, pass in `stream=True`. You can control the temperature, top-p, and maximum number of tokens generated by passing in optional settings.
73
+
74
+ ```python
75
+ # Streaming with custom settings
76
+ settings = {
77
+ "temperature": 0.7,
78
+ "top_p": 0.95,
79
+ "max_tokens": 512
80
+ }
81
+
82
+ result = moondream.query(
83
+ image=image,
84
+ question="Describe what's happening in detail",
85
+ stream=True,
86
+ settings=settings
87
+ )
88
+
89
+ # Stream the answer
90
+ for chunk in result["answer"]:
91
+ print(chunk, end="", flush=True)
92
+ ```
93
+
94
+ Note that this isn't just for images; Moondream is also a strong general-purpose text model.
95
+
96
+ ```python
97
+ # Text-only example (no image)
98
+ result = moondream.query(
99
+ question="Explain the concept of machine learning in simple terms"
100
+ )
101
+ print(result["answer"])
102
+ ```
103
+
104
+ ### Caption
105
+
106
+ Whether you want short, normal-sized, or long descriptions of images, the `caption` skill has you covered.
107
+
108
+ ```python
109
+ # Different caption lengths
110
+ image = Image.open("landscape.jpg")
111
+
112
+ # Short caption
113
+ short = moondream.caption(image, length="short")
114
+ print(f"Short: {short['caption']}")
115
+
116
+ # Normal caption (default)
117
+ normal = moondream.caption(image, length="normal")
118
+ print(f"Normal: {normal['caption']}")
119
+
120
+ # Long caption
121
+ long = moondream.caption(image, length="long")
122
+ print(f"Long: {long['caption']}")
123
+ ```
124
+
125
+ It accepts the same streaming and sampling settings (temperature, top-p, etc.) as the `query` skill.
126
+
127
+ ```python
128
+ # Streaming caption with custom settings
129
+ result = moondream.caption(
130
+ image,
131
+ length="long",
132
+ stream=True,
133
+ settings={"temperature": 0.3}
134
+ )
135
+
136
+ for chunk in result["caption"]:
137
+ print(chunk, end="", flush=True)
138
+ ```
139
+
140
+ ### Point
141
+
142
+ The `point` skill identifies specific points (x, y coordinates) for objects in an image.
143
+
144
+ ```python
145
+ # Find points for specific objects
146
+ image = Image.open("crowd.jpg")
147
+ result = moondream.point(image, "person wearing a red shirt")
148
+
149
+ # Points are normalized coordinates (0-1)
150
+ for i, point in enumerate(result["points"]):
151
+ print(f"Point {i+1}: x={point['x']:.3f}, y={point['y']:.3f}")
152
+ ```
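+
+ Because the returned points are normalized to the 0-1 range, converting them to pixel coordinates is just a multiplication by the image dimensions. A small illustrative sketch, reusing the `image` and `result` from above:
+
+ ```python
+ # Scale normalized points to pixel coordinates (illustrative)
+ width, height = image.size
+ for point in result["points"]:
+     px, py = point["x"] * width, point["y"] * height
+     print(f"Pixel location: ({px:.0f}, {py:.0f})")
+ ```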
153
+
154
+ ### Detect
155
+
156
+ The `detect` skill provides bounding boxes for objects in an image.
157
+
158
+ ```python
159
+ # Detect objects with bounding boxes
160
+ image = Image.open("street_scene.jpg")
161
+ result = moondream.detect(image, "car")
162
+
163
+ # Bounding boxes are normalized coordinates (0-1)
164
+ for i, obj in enumerate(result["objects"]):
165
+ print(f"Object {i+1}: "
166
+ f"x_min={obj['x_min']:.3f}, y_min={obj['y_min']:.3f}, "
167
+ f"x_max={obj['x_max']:.3f}, y_max={obj['y_max']:.3f}")
168
+
169
+ # Control maximum number of objects
170
+ settings = {"max_objects": 10}
171
+ result = moondream.detect(image, "person", settings=settings)
172
+ ```
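+
+ To visualize the detections, the normalized boxes can be scaled back to pixel coordinates and drawn with Pillow. A minimal sketch, reusing the `image` and `result` from above (the output filename is arbitrary):
+
+ ```python
+ from PIL import ImageDraw
+
+ # Draw detected boxes on a copy of the image (illustrative)
+ annotated = image.copy()
+ draw = ImageDraw.Draw(annotated)
+ w, h = image.size
+ for obj in result["objects"]:
+     box = (obj["x_min"] * w, obj["y_min"] * h, obj["x_max"] * w, obj["y_max"] * h)
+     draw.rectangle(box, outline="red", width=3)
+ annotated.save("detections.png")
+ ```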
173
+
174
+ ### Caching image encodings (advanced)
175
+
176
+ If you're planning to run multiple inferences on the same image, you can pre-encode it once and reuse the encoding for better performance.
177
+
178
+ ```python
179
+ # Encode image once
180
+ image = Image.open("complex_scene.jpg")
181
+ encoded = moondream.encode_image(image)
182
+
183
+ # Reuse the encoding for multiple queries
184
+ questions = [
185
+ "How many people are in this image?",
186
+ "What time of day was this taken?",
187
+ "What's the weather like?"
188
+ ]
189
+
190
+ for q in questions:
191
+ result = moondream.query(image=encoded, question=q, reasoning=False)
192
+ print(f"Q: {q}")
193
+ print(f"A: {result['answer']}\n")
194
+
195
+ # Also works with other skills
196
+ caption = moondream.caption(encoded, length="normal")
197
+ objects = moondream.detect(encoded, "vehicle")
198
+ ```
199
+
200
+ ---
201
+
202
+ Copyright (c) 2025 M87 Labs, Inc.
203
+
204
+ This distribution includes Model Weights licensed under the [Business Source License 1.1 with an Additional Use Grant (No Third-Party Service)](https://huggingface.co/moondream/moondream3-preview/blob/main/LICENSE.md). Commercial hosting or rehosting requires an agreement with <contact@m87.ai>.
config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "architectures": [
3
+ "HfMoondream"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "hf_moondream.HfConfig",
7
+ "AutoModelForCausalLM": "hf_moondream.HfMoondream"
8
+ },
9
+ "config": {},
10
+ "model_type": "moondream1",
11
+ "torch_dtype": "bfloat16",
12
+ "transformers_version": "4.51.1"
13
+ }
config.py ADDED
@@ -0,0 +1,102 @@
1
+ from dataclasses import dataclass, field
2
+ from typing import Dict, List, Optional
3
+
4
+
5
+ @dataclass(frozen=True)
6
+ class TextMoeConfig:
7
+ num_experts: int = 64
8
+ start_layer: int = 4
9
+ experts_per_token: int = 8
10
+ expert_inner_dim: int = 1024
11
+
12
+
13
+ @dataclass(frozen=True)
14
+ class TextConfig:
15
+ dim: int = 2048
16
+ ff_dim: int = 8192
17
+ n_layers: int = 24
18
+ vocab_size: int = 51200
19
+ max_context: int = 4096
20
+ n_heads: int = 32
21
+ n_kv_heads: int = 32
22
+ prefix_attn: int = 730
23
+ group_size: Optional[int] = None
24
+ moe: Optional[TextMoeConfig] = TextMoeConfig()
25
+
26
+
27
+ @dataclass(frozen=True)
28
+ class VisionConfig:
29
+ enc_dim: int = 1152
30
+ enc_patch_size: int = 14
31
+ enc_n_layers: int = 27
32
+ enc_ff_dim: int = 4304
33
+ enc_n_heads: int = 16
34
+ proj_out_dim: int = 2048
35
+ crop_size: int = 378
36
+ in_channels: int = 3
37
+ max_crops: int = 12
38
+ overlap_margin: int = 4
39
+ proj_inner_dim: int = 8192
40
+
41
+
42
+ @dataclass(frozen=True)
43
+ class RegionConfig:
44
+ dim: int = 2048
45
+ coord_feat_dim: int = 256
46
+ coord_out_dim: int = 1024
47
+ size_feat_dim: int = 512
48
+ size_out_dim: int = 2048
49
+ group_size: Optional[int] = None
50
+
51
+
52
+ @dataclass(frozen=True)
53
+ class TokenizerConfig:
54
+ bos_id: int = 0
55
+ eos_id: int = 0
56
+ answer_id: int = 3
57
+ thinking_id: int = 4
58
+ coord_id: int = 5
59
+ size_id: int = 6
60
+ start_ground_points_id: int = 7
61
+ end_ground_id: int = 9
62
+ templates: Dict[str, Optional[Dict[str, List[int]]]] = field(
63
+ default_factory=lambda: {
64
+ "caption": {
65
+ "short": [1, 32708, 2, 12492, 3],
66
+ "normal": [1, 32708, 2, 6382, 3],
67
+ "long": [1, 32708, 2, 4059, 3],
68
+ },
69
+ "query": {"prefix": [1, 15381, 2], "suffix": [3]},
70
+ "detect": {"prefix": [1, 7235, 476, 2], "suffix": [3]},
71
+ "point": {"prefix": [1, 2581, 2], "suffix": [3]},
72
+ }
73
+ )
74
+
75
+
76
+ @dataclass(frozen=True)
77
+ class MoondreamConfig:
78
+ text: TextConfig = TextConfig()
79
+ vision: VisionConfig = VisionConfig()
80
+ region: RegionConfig = RegionConfig()
81
+ tokenizer: TokenizerConfig = TokenizerConfig()
82
+
83
+ @classmethod
84
+ def from_dict(cls, config_dict: dict):
85
+ text_config = TextConfig(**config_dict.get("text", {}))
86
+ vision_config = VisionConfig(**config_dict.get("vision", {}))
87
+ region_config = RegionConfig(**config_dict.get("region", {}))
88
+ tokenizer_config = TokenizerConfig(**config_dict.get("tokenizer", {}))
89
+ return cls(
90
+ text=text_config,
91
+ vision=vision_config,
92
+ region=region_config,
93
+ tokenizer=tokenizer_config,
94
+ )
95
+
96
+ def to_dict(self):
97
+ return {
98
+ "text": self.text.__dict__,
99
+ "vision": self.vision.__dict__,
100
+ "region": self.region.__dict__,
101
+ "tokenizer": self.tokenizer.__dict__,
102
+ }
hf_moondream.py ADDED
@@ -0,0 +1,183 @@
1
+ import torch
2
+ import torch.nn as nn
3
+
4
+ from transformers import PreTrainedModel, PretrainedConfig
5
+ from typing import Union
6
+
7
+ from .config import MoondreamConfig
8
+ from .moondream import MoondreamModel
9
+
10
+ # Files sometimes don't get loaded without these...
11
+ from .image_crops import *
12
+ from .vision import *
13
+ from .text import *
14
+ from .region import *
15
+ from .utils import *
16
+
17
+
18
+ def extract_question(text):
19
+ prefix = "<image>\n\nQuestion: "
20
+ suffix = "\n\nAnswer:"
21
+
22
+ if text.startswith(prefix) and text.endswith(suffix):
23
+ return text[len(prefix) : -len(suffix)]
24
+ else:
25
+ return None
26
+
27
+
28
+ class HfConfig(PretrainedConfig):
29
+ _auto_class = "AutoConfig"
30
+ model_type = "moondream1"
31
+
32
+ def __init__(self, **kwargs):
33
+ super().__init__(**kwargs)
34
+ self.config = {}
35
+
36
+
37
+ class HfMoondream(PreTrainedModel):
38
+ _auto_class = "AutoModelForCausalLM"
39
+ config_class = HfConfig
40
+
41
+ def __init__(self, config):
42
+ super().__init__(config)
43
+ self.model = MoondreamModel(
44
+ MoondreamConfig.from_dict(config.config), setup_caches=False
45
+ )
46
+ self._is_kv_cache_setup = False
47
+
48
+ def _setup_caches(self):
49
+ if not self._is_kv_cache_setup:
50
+ self.model._setup_caches()
51
+ self._is_kv_cache_setup = True
52
+
53
+ @property
54
+ def encode_image(self):
55
+ self._setup_caches()
56
+ return self.model.encode_image
57
+
58
+ @property
59
+ def query(self):
60
+ self._setup_caches()
61
+ return self.model.query
62
+
63
+ @property
64
+ def caption(self):
65
+ self._setup_caches()
66
+ return self.model.caption
67
+
68
+ @property
69
+ def detect(self):
70
+ self._setup_caches()
71
+ return self.model.detect
72
+
73
+ @property
74
+ def point(self):
75
+ self._setup_caches()
76
+ return self.model.point
77
+
78
+ @property
79
+ def detect_gaze(self):
80
+ self._setup_caches()
81
+ return self.model.detect_gaze
82
+
83
+ def answer_question(
84
+ self,
85
+ image_embeds,
86
+ question,
87
+ tokenizer=None,
88
+ chat_history="",
89
+ result_queue=None,
90
+ max_new_tokens=256,
91
+ **kwargs
92
+ ):
93
+ answer = self.query(image_embeds, question)["answer"].strip()
94
+
95
+ if result_queue is not None:
96
+ result_queue.put(answer)
97
+ return answer
98
+
99
+ def batch_answer(self, images, prompts, tokenizer=None, **kwargs):
100
+ answers = []
101
+ for image, prompt in zip(images, prompts):
102
+ answers.append(self.query(image, prompt)["answer"].strip())
103
+ return answers
104
+
105
+ def _unsupported_exception(self):
106
+ raise NotImplementedError(
107
+ "This method is not supported in the latest version of moondream. "
108
+ "Consider upgrading to the updated API spec, or alternately pin "
109
+ "to 'revision=2024-08-26'."
110
+ )
111
+
112
+ def generate(self, image_embeds, prompt, tokenizer, max_new_tokens=128, **kwargs):
113
+ """
114
+ Function definition remains unchanged for backwards compatibility.
115
+ Be aware that tokenizer, max_new_tokens, and kwargs are ignored.
116
+ """
117
+ prompt_extracted = extract_question(prompt)
118
+ if prompt_extracted is not None:
119
+ answer = self.model.query(
120
+ image=image_embeds, question=prompt_extracted, stream=False
121
+ )["answer"]
122
+ else:
123
+ image_embeds = self.encode_image(image_embeds)
124
+ prompt_tokens = torch.tensor(
125
+ [self.model.tokenizer.encode(prompt).ids],
126
+ device=self.device,
127
+ )
128
+
129
+ def generator():
130
+ for token in self.model._generate_answer(
131
+ prompt_tokens,
132
+ image_embeds.kv_cache,
133
+ image_embeds.pos,
134
+ max_new_tokens,
135
+ ):
136
+ yield token
137
+
138
+ answer = "".join(list(generator()))
139
+
140
+ return [answer]
141
+
142
+ def get_input_embeddings(self) -> nn.Embedding:
143
+ """
144
+ Lazily wrap the raw parameter `self.model.text.wte` in a real
145
+ `nn.Embedding` layer so that HF mix-ins recognise it. The wrapper
146
+ **shares** the weight tensor—no copy is made.
147
+ """
148
+ if not hasattr(self, "_input_embeddings"):
149
+ self._input_embeddings = nn.Embedding.from_pretrained(
150
+ self.model.text.wte, # tensor created in text.py
151
+ freeze=True, # set to False if you need it trainable
152
+ )
153
+ return self._input_embeddings
154
+
155
+ def set_input_embeddings(self, value: Union[nn.Embedding, nn.Module]) -> None:
156
+ """
157
+ Lets HF functions (e.g. `resize_token_embeddings`) replace or resize the
158
+ embeddings and keeps everything tied to `self.model.text.wte`.
159
+ """
160
+ # 1. point the low-level parameter to the new weight matrix
161
+ self.model.text.wte = value.weight
162
+ # 2. keep a reference for get_input_embeddings()
163
+ self._input_embeddings = value
164
+
165
+ def input_embeds(
166
+ self,
167
+ input_ids: Union[torch.LongTensor, list, tuple],
168
+ *,
169
+ device: torch.device | None = None
170
+ ) -> torch.FloatTensor:
171
+ """
172
+ Back-compat wrapper that turns token IDs into embeddings.
173
+
174
+ Example:
175
+ ids = torch.tensor([[1, 2, 3]])
176
+ embeds = model.input_embeds(ids) # (1, 3, hidden_dim)
177
+ """
178
+ if not torch.is_tensor(input_ids):
179
+ input_ids = torch.as_tensor(input_ids)
180
+ if device is not None:
181
+ input_ids = input_ids.to(device)
182
+
183
+ return self.get_input_embeddings()(input_ids)
image_crops.py ADDED
@@ -0,0 +1,231 @@
1
+ import math
2
+ import numpy as np
3
+ import torch
4
+
5
+ from typing import TypedDict
6
+
7
+ try:
8
+ import pyvips
9
+
10
+ HAS_VIPS = True
11
+ except ImportError:
12
+ from PIL import Image
13
+
14
+ HAS_VIPS = False
15
+
16
+
17
+ def select_tiling(
18
+ height: int, width: int, crop_size: int, max_crops: int
19
+ ) -> tuple[int, int]:
20
+ """
21
+ Determine the optimal number of tiles to cover an image with overlapping crops.
22
+ """
23
+ if height <= crop_size or width <= crop_size:
24
+ return (1, 1)
25
+
26
+ # Minimum required tiles in each dimension
27
+ min_h = math.ceil(height / crop_size)
28
+ min_w = math.ceil(width / crop_size)
29
+
30
+ # If minimum required tiles exceed max_crops, return proportional distribution
31
+ if min_h * min_w > max_crops:
32
+ ratio = math.sqrt(max_crops / (min_h * min_w))
33
+ return (max(1, math.floor(min_h * ratio)), max(1, math.floor(min_w * ratio)))
34
+
35
+ # Perfect aspect-ratio tiles that satisfy max_crops
36
+ h_tiles = math.floor(math.sqrt(max_crops * height / width))
37
+ w_tiles = math.floor(math.sqrt(max_crops * width / height))
38
+
39
+ # Ensure we meet minimum tile requirements
40
+ h_tiles = max(h_tiles, min_h)
41
+ w_tiles = max(w_tiles, min_w)
42
+
43
+ # If we exceeded max_crops, scale down the larger dimension
44
+ if h_tiles * w_tiles > max_crops:
45
+ if w_tiles > h_tiles:
46
+ w_tiles = math.floor(max_crops / h_tiles)
47
+ else:
48
+ h_tiles = math.floor(max_crops / w_tiles)
49
+
50
+ return (max(1, h_tiles), max(1, w_tiles))
51
+
52
+
53
+ class OverlapCropOutput(TypedDict):
54
+ crops: np.ndarray
55
+ tiling: tuple[int, int]
56
+
57
+
58
+ def overlap_crop_image(
59
+ image: np.ndarray,
60
+ overlap_margin: int,
61
+ max_crops: int,
62
+ base_size: tuple[int, int] = (378, 378),
63
+ patch_size: int = 14,
64
+ ) -> OverlapCropOutput:
65
+ """
66
+ Process an image using an overlap-and-resize cropping strategy with margin handling.
67
+
68
+ This function takes an input image and creates multiple overlapping crops with
69
+ consistent margins. It produces:
70
+ 1. A single global crop resized to base_size
71
+ 2. Multiple overlapping local crops that maintain high resolution details
72
+ 3. A patch ordering matrix that tracks correspondence between crops
73
+
74
+ The overlap strategy ensures:
75
+ - Smooth transitions between adjacent crops
76
+ - No loss of information at crop boundaries
77
+ - Proper handling of features that cross crop boundaries
78
+ - Consistent patch indexing across the full image
79
+
80
+ Args:
81
+ image (np.ndarray): Input image as numpy array with shape (H,W,C)
82
+ base_size (tuple[int,int]): Target size for crops, default (378,378)
83
+ patch_size (int): Size of patches in pixels, default 14
84
+ overlap_margin (int): Margin size in patch units, default 4
85
+ max_crops (int): Maximum number of crops allowed, default 12
86
+
87
+ Returns:
88
+ OverlapCropOutput: Dictionary containing:
89
+ - crops: A numpy array containing the global crop of the full image (index 0)
90
+ followed by the overlapping cropped regions (indices 1+)
91
+ - tiling: Tuple of (height,width) tile counts
92
+ """
93
+ original_h, original_w = image.shape[:2]
94
+
95
+ # Convert margin from patch units to pixels
96
+ margin_pixels = patch_size * overlap_margin
97
+ total_margin_pixels = margin_pixels * 2 # Both sides
98
+
99
+ # Calculate crop parameters
100
+ crop_patches = base_size[0] // patch_size # patches per crop dimension
101
+ crop_window_patches = crop_patches - (2 * overlap_margin) # usable patches
102
+ crop_window_size = crop_window_patches * patch_size # usable size in pixels
103
+
104
+ # Determine tiling
105
+ tiling = select_tiling(
106
+ original_h - total_margin_pixels,
107
+ original_w - total_margin_pixels,
108
+ crop_window_size,
109
+ max_crops,
110
+ )
111
+
112
+ # Pre-allocate crops.
113
+ n_crops = tiling[0] * tiling[1] + 1 # 1 = global crop
114
+ crops = np.zeros(
115
+ (n_crops, base_size[0], base_size[1], image.shape[2]), dtype=np.uint8
116
+ )
117
+
118
+ # Resize image to fit tiling
119
+ target_size = (
120
+ tiling[0] * crop_window_size + total_margin_pixels,
121
+ tiling[1] * crop_window_size + total_margin_pixels,
122
+ )
123
+
124
+ if HAS_VIPS:
125
+ # Convert to vips for resizing
126
+ vips_image = pyvips.Image.new_from_array(image)
127
+ scale_x = target_size[1] / image.shape[1]
128
+ scale_y = target_size[0] / image.shape[0]
129
+ resized = vips_image.resize(scale_x, vscale=scale_y)
130
+ image = resized.numpy()
131
+
132
+ # Create global crop
133
+ scale_x = base_size[1] / vips_image.width
134
+ scale_y = base_size[0] / vips_image.height
135
+ global_vips = vips_image.resize(scale_x, vscale=scale_y)
136
+ crops[0] = global_vips.numpy()
137
+ else:
138
+ # Fallback to PIL
139
+ pil_img = Image.fromarray(image)
140
+ resized = pil_img.resize(
141
+ (int(target_size[1]), int(target_size[0])),
142
+ resample=Image.Resampling.LANCZOS,
143
+ )
144
+ image = np.asarray(resized)
145
+
146
+ # Create global crop
147
+ global_pil = pil_img.resize(
148
+ (int(base_size[1]), int(base_size[0])), resample=Image.Resampling.LANCZOS
149
+ )
150
+ crops[0] = np.asarray(global_pil)
151
+
152
+ for i in range(tiling[0]):
153
+ for j in range(tiling[1]):
154
+ # Calculate crop coordinates
155
+ y0 = i * crop_window_size
156
+ x0 = j * crop_window_size
157
+
158
+ # Extract crop with padding if needed
159
+ y_end = min(y0 + base_size[0], image.shape[0])
160
+ x_end = min(x0 + base_size[1], image.shape[1])
161
+
162
+ crop_region = image[y0:y_end, x0:x_end]
163
+ crops[
164
+ 1 + i * tiling[1] + j, : crop_region.shape[0], : crop_region.shape[1]
165
+ ] = crop_region
166
+
167
+ return {"crops": crops, "tiling": tiling}
168
+
169
+
170
+ def reconstruct_from_crops(
171
+ crops: torch.Tensor,
172
+ tiling: tuple[int, int],
173
+ overlap_margin: int,
174
+ patch_size: int = 14,
175
+ ) -> torch.Tensor:
176
+ """
177
+ Reconstruct the original image from overlapping crops into a single seamless image.
178
+
179
+ Takes a list of overlapping image crops along with their positional metadata and
180
+ reconstructs them into a single coherent image by carefully stitching together
181
+ non-overlapping regions. Handles both numpy arrays and PyTorch tensors.
182
+
183
+ Args:
184
+ crops: List of image crops as numpy arrays or PyTorch tensors with shape
185
+ (H,W,C)
186
+ tiling: Tuple of (height,width) indicating crop grid layout
187
+ patch_size: Size in pixels of each patch, default 14
188
+ overlap_margin: Number of overlapping patches on each edge, default 4
189
+
190
+ Returns:
191
+ Reconstructed image as numpy array or PyTorch tensor matching input type,
192
+ with shape (H,W,C) where H,W are the original image dimensions
193
+ """
194
+ tiling_h, tiling_w = tiling
195
+ crop_height, crop_width = crops[0].shape[:2]
196
+ margin_pixels = overlap_margin * patch_size
197
+
198
+ # Calculate output size (only adding margins once)
199
+ output_h = (crop_height - 2 * margin_pixels) * tiling_h + 2 * margin_pixels
200
+ output_w = (crop_width - 2 * margin_pixels) * tiling_w + 2 * margin_pixels
201
+
202
+ reconstructed = torch.zeros(
203
+ (output_h, output_w, crops[0].shape[2]),
204
+ device=crops[0].device,
205
+ dtype=crops[0].dtype,
206
+ )
207
+
208
+ for i, crop in enumerate(crops):
209
+ tile_y = i // tiling_w
210
+ tile_x = i % tiling_w
211
+
212
+ # For each tile, determine which part to keep
213
+ # Keep left margin only for first column
214
+ x_start = 0 if tile_x == 0 else margin_pixels
215
+ # Keep right margin only for last column
216
+ x_end = crop_width if tile_x == tiling_w - 1 else crop_width - margin_pixels
217
+ # Keep top margin only for first row
218
+ y_start = 0 if tile_y == 0 else margin_pixels
219
+ # Keep bottom margin only for last row
220
+ y_end = crop_height if tile_y == tiling_h - 1 else crop_height - margin_pixels
221
+
222
+ # Calculate where this piece belongs in the output
223
+ out_x = tile_x * (crop_width - 2 * margin_pixels)
224
+ out_y = tile_y * (crop_height - 2 * margin_pixels)
225
+
226
+ # Place the piece
227
+ reconstructed[
228
+ out_y + y_start : out_y + y_end, out_x + x_start : out_x + x_end
229
+ ] = crop[y_start:y_end, x_start:x_end]
230
+
231
+ return reconstructed
layers.py ADDED
@@ -0,0 +1,234 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+
5
+ from dataclasses import dataclass
6
+ from typing import Literal, Optional
7
+
8
+ try:
9
+ from torchao import quantize_
10
+ from torchao.quantization import int4_weight_only
11
+ except ImportError:
12
+
13
+ def quantize_(model, quant_mode):
14
+ raise ImportError(
15
+ "torchao is not installed. Please install it with `pip install torchao`."
16
+ )
17
+
18
+ def int4_weight_only(group_size):
19
+ raise ImportError(
20
+ "torchao is not installed. Please install it with `pip install torchao`."
21
+ )
22
+
23
+
24
+ def gelu_approx(x):
25
+ return F.gelu(x, approximate="tanh")
26
+
27
+
28
+ @dataclass
29
+ class LinearWeights:
30
+ weight: torch.Tensor
31
+ bias: torch.Tensor
32
+
33
+
34
+ def linear(x: torch.Tensor, w: LinearWeights) -> torch.Tensor:
35
+ return F.linear(x, w.weight, w.bias)
36
+
37
+
38
+ def dequantize_tensor(W_q, scale, zero, orig_shape, dtype=torch.bfloat16):
39
+ _step = W_q.shape[0]
40
+ W_r = torch.empty([2 * _step, W_q.shape[1]], dtype=dtype, device=W_q.device)
41
+ W_r[:_step] = (W_q & 0b11110000) >> 4
42
+ W_r[_step:] = W_q & 0b00001111
43
+ W_r.sub_(zero).mul_(scale)
44
+ return W_r.reshape(orig_shape)
45
+
46
+
47
+ class QuantizedLinear(nn.Module):
48
+ def __init__(
49
+ self,
50
+ in_features: int,
51
+ out_features: int,
52
+ dtype: torch.dtype,
53
+ ):
54
+ # TODO: Take group_size as an input instead of hardcoding it here.
55
+ super().__init__()
56
+ self.in_features = in_features
57
+ self.out_features = out_features
58
+ self.weight = nn.ParameterDict(
59
+ {
60
+ "packed": nn.Parameter(
61
+ torch.empty(
62
+ out_features * in_features // (128 * 2), 128, dtype=torch.uint8
63
+ ),
64
+ requires_grad=False,
65
+ ),
66
+ "scale": nn.Parameter(
67
+ torch.empty(out_features * in_features // 128, 1),
68
+ requires_grad=False,
69
+ ),
70
+ "zero_point": nn.Parameter(
71
+ torch.empty(out_features * in_features // 128, 1),
72
+ requires_grad=False,
73
+ ),
74
+ }
75
+ )
76
+ self.bias = nn.Parameter(torch.empty(out_features), requires_grad=False)
77
+ self.unpacked = False
78
+
79
+ def unpack(self):
80
+ if self.unpacked:
81
+ return
82
+
83
+ self.weight = nn.Parameter(
84
+ dequantize_tensor(
85
+ self.weight["packed"],
86
+ self.weight["scale"],
87
+ self.weight["zero_point"],
88
+ (self.out_features, self.in_features),
89
+ torch.bfloat16,
90
+ )
91
+ )
92
+ with torch.device("meta"):
93
+ self.linear = nn.Linear(
94
+ self.in_features, self.out_features, dtype=torch.bfloat16
95
+ )
96
+ self.linear.weight = self.weight
97
+ self.linear.bias = nn.Parameter(
98
+ self.bias.to(torch.bfloat16), requires_grad=False
99
+ )
100
+
101
+ del self.weight, self.bias
102
+ quantize_(self, int4_weight_only(group_size=128))
103
+ self.unpacked = True
104
+ torch.cuda.empty_cache()
105
+
106
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
107
+ if not self.unpacked:
108
+ self.unpack()
109
+ return self.linear(x)
110
+
111
+
112
+ @dataclass
113
+ class LayerNormWeights:
114
+ weight: torch.Tensor
115
+ bias: torch.Tensor
116
+
117
+
118
+ def layer_norm(x: torch.Tensor, w: LayerNormWeights) -> torch.Tensor:
119
+ return F.layer_norm(x, w.bias.shape, w.weight, w.bias)
120
+
121
+
122
+ @dataclass
123
+ class MLPWeights:
124
+ fc1: LinearWeights
125
+ fc2: LinearWeights
126
+ act: Literal["gelu_approx"] = "gelu_approx"
127
+
128
+
129
+ def mlp(x: torch.Tensor, w: MLPWeights, lora: Optional[dict] = None) -> torch.Tensor:
130
+ x0 = w.fc1(x)
131
+ if lora is not None:
132
+ x1 = F.linear(F.linear(x, lora["fc1"]["A"]), lora["fc1"]["B"])
133
+ x = x0 + x1
134
+ else:
135
+ x = x0
136
+
137
+ x = gelu_approx(x)
138
+
139
+ x0 = w.fc2(x)
140
+ if lora is not None:
141
+ x1 = F.linear(F.linear(x, lora["fc2"]["A"]), lora["fc2"]["B"])
142
+ x = x0 + x1
143
+ else:
144
+ x = x0
145
+
146
+ return x
147
+
148
+
149
+ def moe_mlp(
150
+ x: torch.Tensor, mlp_module: nn.Module, experts_per_token: int
151
+ ) -> torch.Tensor:
152
+ B, T, C = x.shape
153
+ x = x.reshape(-1, C)
154
+
155
+ # Router computation
156
+ router_logits = mlp_module.router(x)
157
+ topk_logits, topk_idxs = torch.topk(router_logits, experts_per_token, dim=-1)
158
+ topk_weights = F.softmax(topk_logits, dim=-1, dtype=torch.float32).to(x.dtype)
159
+ num_tokens, top_k = topk_idxs.shape
160
+
161
+ if T == 1:
162
+ w1_weight = mlp_module.fc1.weight
163
+ w2_weight = mlp_module.fc2.weight
164
+
165
+ # Flatten to process all token-expert pairs at once
166
+ flat_idxs = topk_idxs.view(-1) # [T*A]
167
+ flat_weights = topk_weights.view(-1) # [T*A]
168
+
169
+ # Select expert weights
170
+ w1_selected = w1_weight[flat_idxs] # [T*A, H, D]
171
+ w2_selected = w2_weight[flat_idxs] # [T*A, D, H]
172
+
173
+ # Expand input for all token-expert pairs
174
+ x_expanded = x.unsqueeze(1).expand(-1, top_k, -1).reshape(-1, C) # [T*A, D]
175
+
176
+ # First linear layer with GeGLU: [T*A, H, D] @ [T*A, D, 1] -> [T*A, H]
177
+ x1_full = torch.bmm(w1_selected, x_expanded.unsqueeze(-1)).squeeze(
178
+ -1
179
+ ) # [T*A, H]
180
+ x1, g = x1_full.chunk(2, dim=-1)
181
+ x1 = F.gelu(x1) * (g + 1)
182
+
183
+ # Second linear layer: [T*A, D, H] @ [T*A, H, 1] -> [T*A, D]
184
+ expert_outs = torch.bmm(w2_selected, x1.unsqueeze(-1)).squeeze(-1) # [T*A, D]
185
+
186
+ # Apply weights and reshape
187
+ weighted_outs = expert_outs * flat_weights.unsqueeze(-1) # [T*A, D]
188
+ weighted_outs = weighted_outs.view(num_tokens, top_k, C) # [T, A, D]
189
+
190
+ # Sum over experts
191
+ mlp_out = weighted_outs.sum(dim=1) # [T, D]
192
+ mlp_out = mlp_out.view(B, T, C)
193
+
194
+ return mlp_out
195
+ else:
196
+ out = x.new_zeros(x.size())
197
+
198
+ for expert_id in range(mlp_module.fc1.weight.shape[0]):
199
+ token_pos, which_k = (topk_idxs == expert_id).nonzero(as_tuple=True)
200
+ if token_pos.numel() == 0:
201
+ continue
202
+
203
+ x_tok = x.index_select(0, token_pos)
204
+ gate_tok = topk_weights[token_pos, which_k]
205
+
206
+ h_full = F.linear(x_tok, mlp_module.fc1.weight[expert_id])
207
+ h, g = h_full.chunk(2, dim=-1)
208
+ h = F.gelu(h) * (g + 1)
209
+ y = F.linear(h, mlp_module.fc2.weight[expert_id])
210
+
211
+ y.mul_(gate_tok.unsqueeze(-1))
212
+ out.index_add_(0, token_pos, y)
213
+
214
+ return out.view(B, T, C)
215
+
216
+
217
+ @dataclass
218
+ class AttentionWeights:
219
+ qkv: LinearWeights
220
+ proj: LinearWeights
221
+
222
+
223
+ def attn(x: torch.Tensor, w: AttentionWeights, n_heads: int) -> torch.Tensor:
224
+ bsz, q_len, d_model = x.shape
225
+ head_dim = d_model // n_heads
226
+
227
+ q, k, v = [
228
+ t.view(bsz, q_len, n_heads, head_dim).transpose(1, 2)
229
+ for t in linear(x, w.qkv).chunk(3, dim=-1)
230
+ ]
231
+ out = F.scaled_dot_product_attention(q, k, v)
232
+ out = out.transpose(1, 2).reshape(bsz, q_len, d_model)
233
+ out = linear(out, w.proj)
234
+ return out
lora.py ADDED
@@ -0,0 +1,82 @@
1
+ import functools
2
+ import os
3
+ import shutil
4
+ import torch
5
+
6
+ from pathlib import Path
7
+ from urllib.request import Request, urlopen
8
+ from typing import Optional
9
+
10
+
11
+ def variant_cache_dir():
12
+ hf_hub_cache = os.environ.get("HF_HUB_CACHE")
13
+ if hf_hub_cache is not None:
14
+ return Path(hf_hub_cache) / "md_variants"
15
+
16
+ hf_home = os.environ.get("HF_HOME")
17
+ if hf_home is not None:
18
+ return Path(hf_home) / "hub" / "md_variants"
19
+
20
+ return Path("~/.cache/huggingface/hub").expanduser() / "md_variants"
21
+
22
+
23
+ def cached_variant_path(variant_id: str):
24
+ variant, *rest = variant_id.split("/", 1)
25
+ step = rest[0] if rest else "final"
26
+
27
+ cache_dir = variant_cache_dir() / variant
28
+ os.makedirs(cache_dir, exist_ok=True)
29
+ dest = cache_dir / f"{step}.pt"
30
+ if dest.exists():
31
+ return dest
32
+
33
+ md_endpoint = os.getenv("MOONDREAM_ENDPOINT", "https://api.moondream.ai")
34
+
35
+ headers = {"User-Agent": "moondream-torch"}
36
+ api_key = os.getenv("MOONDREAM_API_KEY")
37
+ if api_key is not None:
38
+ headers["X-Moondream-Auth"] = api_key
39
+
40
+ req = Request(f"{md_endpoint}/v1/variants/{variant_id}/download", headers=headers)
41
+ with urlopen(req) as r, open(dest, "wb") as f:
42
+ shutil.copyfileobj(r, f)
43
+ return dest
44
+
45
+
46
+ def nest(flat):
47
+ tree = {}
48
+ for k, v in flat.items():
49
+ parts = k.split(".")
50
+ d = tree
51
+ for p in parts[:-1]:
52
+ d = d.setdefault(p, {})
53
+ d[parts[-1]] = v
54
+ return tree
55
+
56
+
57
+ @functools.lru_cache(maxsize=5)
58
+ def variant_state_dict(variant_id: Optional[str] = None, device: str = "cpu"):
59
+ if variant_id is None:
60
+ return None
61
+
62
+ state_dict = torch.load(
63
+ cached_variant_path(variant_id), map_location=device, weights_only=True
64
+ )
65
+
66
+ # TODO: Move these into the training code that saves checkpoints...
67
+ rename_rules = [
68
+ ("text_model.transformer.h", "text.blocks"),
69
+ (".mixer", ".attn"),
70
+ (".out_proj", ".proj"),
71
+ (".Wqkv", ".qkv"),
72
+ (".parametrizations.weight.0", ""),
73
+ ]
74
+ new_state_dict = {}
75
+ for key, tensor in state_dict.items():
76
+ new_key = key
77
+ for old, new in rename_rules:
78
+ if old in new_key:
79
+ new_key = new_key.replace(old, new)
80
+ new_state_dict[new_key] = tensor
81
+
82
+ return nest(new_state_dict)
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fd4b3d0d6daae9c4212056cd64f02f408ff083bbb0244114eecd05fcba30037e
3
+ size 4907406296
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7cf6d17391db58801b61173510ba629875679dbcbe4bfd3cb38ac0958b3c70a0
3
+ size 4736548872
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f4391c6d6b46ed49aa00afddf1f7df9dd0845cbc681fdaf424e727b01ea2d3e4
3
+ size 4502742464
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6af14858bdd7cdea5d19d786726e48434b02a3c0c52a771a0f25b6a8ca640187
3
+ size 4390620392
model.safetensors.index.json ADDED
@@ -0,0 +1,667 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 18537245664
4
+ },
5
+ "weight_map": {
6
+ "model.region.coord_decoder.bias": "model-00004-of-00004.safetensors",
7
+ "model.region.coord_decoder.weight": "model-00004-of-00004.safetensors",
8
+ "model.region.coord_encoder.bias": "model-00004-of-00004.safetensors",
9
+ "model.region.coord_encoder.weight": "model-00004-of-00004.safetensors",
10
+ "model.region.coord_features": "model-00004-of-00004.safetensors",
11
+ "model.region.size_decoder.bias": "model-00004-of-00004.safetensors",
12
+ "model.region.size_decoder.weight": "model-00004-of-00004.safetensors",
13
+ "model.region.size_encoder.bias": "model-00004-of-00004.safetensors",
14
+ "model.region.size_encoder.weight": "model-00004-of-00004.safetensors",
15
+ "model.region.size_features": "model-00004-of-00004.safetensors",
16
+ "model.text.blocks.0.attn.proj.bias": "model-00001-of-00004.safetensors",
17
+ "model.text.blocks.0.attn.proj.weight": "model-00001-of-00004.safetensors",
18
+ "model.text.blocks.0.attn.qkv.bias": "model-00001-of-00004.safetensors",
19
+ "model.text.blocks.0.attn.qkv.weight": "model-00001-of-00004.safetensors",
20
+ "model.text.blocks.0.attn.tau.alpha": "model-00001-of-00004.safetensors",
21
+ "model.text.blocks.0.attn.tau.wq": "model-00001-of-00004.safetensors",
22
+ "model.text.blocks.0.attn.tau.wv": "model-00001-of-00004.safetensors",
23
+ "model.text.blocks.0.ln.bias": "model-00001-of-00004.safetensors",
24
+ "model.text.blocks.0.ln.weight": "model-00001-of-00004.safetensors",
25
+ "model.text.blocks.0.mlp.fc1.bias": "model-00001-of-00004.safetensors",
26
+ "model.text.blocks.0.mlp.fc1.weight": "model-00001-of-00004.safetensors",
27
+ "model.text.blocks.0.mlp.fc2.bias": "model-00001-of-00004.safetensors",
28
+ "model.text.blocks.0.mlp.fc2.weight": "model-00001-of-00004.safetensors",
29
+ "model.text.blocks.1.attn.proj.bias": "model-00001-of-00004.safetensors",
30
+ "model.text.blocks.1.attn.proj.weight": "model-00001-of-00004.safetensors",
31
+ "model.text.blocks.1.attn.qkv.bias": "model-00001-of-00004.safetensors",
32
+ "model.text.blocks.1.attn.qkv.weight": "model-00001-of-00004.safetensors",
33
+ "model.text.blocks.1.attn.tau.alpha": "model-00001-of-00004.safetensors",
34
+ "model.text.blocks.1.attn.tau.wq": "model-00001-of-00004.safetensors",
35
+ "model.text.blocks.1.attn.tau.wv": "model-00001-of-00004.safetensors",
36
+ "model.text.blocks.1.ln.bias": "model-00001-of-00004.safetensors",
37
+ "model.text.blocks.1.ln.weight": "model-00001-of-00004.safetensors",
38
+ "model.text.blocks.1.mlp.fc1.bias": "model-00001-of-00004.safetensors",
39
+ "model.text.blocks.1.mlp.fc1.weight": "model-00001-of-00004.safetensors",
40
+ "model.text.blocks.1.mlp.fc2.bias": "model-00001-of-00004.safetensors",
41
+ "model.text.blocks.1.mlp.fc2.weight": "model-00001-of-00004.safetensors",
42
+ "model.text.blocks.10.attn.proj.bias": "model-00002-of-00004.safetensors",
43
+ "model.text.blocks.10.attn.proj.weight": "model-00002-of-00004.safetensors",
44
+ "model.text.blocks.10.attn.qkv.bias": "model-00002-of-00004.safetensors",
45
+ "model.text.blocks.10.attn.qkv.weight": "model-00002-of-00004.safetensors",
46
+ "model.text.blocks.10.attn.tau.alpha": "model-00002-of-00004.safetensors",
47
+ "model.text.blocks.10.attn.tau.wq": "model-00002-of-00004.safetensors",
48
+ "model.text.blocks.10.attn.tau.wv": "model-00002-of-00004.safetensors",
49
+ "model.text.blocks.10.ln.bias": "model-00002-of-00004.safetensors",
50
+ "model.text.blocks.10.ln.weight": "model-00002-of-00004.safetensors",
51
+ "model.text.blocks.10.mlp.fc1.weight": "model-00002-of-00004.safetensors",
52
+ "model.text.blocks.10.mlp.fc2.weight": "model-00002-of-00004.safetensors",
53
+ "model.text.blocks.10.mlp.router.bias": "model-00002-of-00004.safetensors",
54
+ "model.text.blocks.10.mlp.router.weight": "model-00002-of-00004.safetensors",
55
+ "model.text.blocks.11.attn.proj.bias": "model-00002-of-00004.safetensors",
56
+ "model.text.blocks.11.attn.proj.weight": "model-00002-of-00004.safetensors",
57
+ "model.text.blocks.11.attn.qkv.bias": "model-00002-of-00004.safetensors",
58
+ "model.text.blocks.11.attn.qkv.weight": "model-00002-of-00004.safetensors",
59
+ "model.text.blocks.11.attn.tau.alpha": "model-00002-of-00004.safetensors",
60
+ "model.text.blocks.11.attn.tau.wq": "model-00002-of-00004.safetensors",
61
+ "model.text.blocks.11.attn.tau.wv": "model-00002-of-00004.safetensors",
62
+ "model.text.blocks.11.ln.bias": "model-00002-of-00004.safetensors",
63
+ "model.text.blocks.11.ln.weight": "model-00002-of-00004.safetensors",
64
+ "model.text.blocks.11.mlp.fc1.weight": "model-00002-of-00004.safetensors",
65
+ "model.text.blocks.11.mlp.fc2.weight": "model-00002-of-00004.safetensors",
66
+ "model.text.blocks.11.mlp.router.bias": "model-00002-of-00004.safetensors",
67
+ "model.text.blocks.11.mlp.router.weight": "model-00002-of-00004.safetensors",
68
+ "model.text.blocks.12.attn.proj.bias": "model-00002-of-00004.safetensors",
69
+ "model.text.blocks.12.attn.proj.weight": "model-00002-of-00004.safetensors",
70
+ "model.text.blocks.12.attn.qkv.bias": "model-00002-of-00004.safetensors",
71
+ "model.text.blocks.12.attn.qkv.weight": "model-00002-of-00004.safetensors",
72
+ "model.text.blocks.12.attn.tau.alpha": "model-00002-of-00004.safetensors",
73
+ "model.text.blocks.12.attn.tau.wq": "model-00002-of-00004.safetensors",
74
+ "model.text.blocks.12.attn.tau.wv": "model-00002-of-00004.safetensors",
75
+ "model.text.blocks.12.ln.bias": "model-00002-of-00004.safetensors",
76
+ "model.text.blocks.12.ln.weight": "model-00002-of-00004.safetensors",
77
+ "model.text.blocks.12.mlp.fc1.weight": "model-00002-of-00004.safetensors",
78
+ "model.text.blocks.12.mlp.fc2.weight": "model-00002-of-00004.safetensors",
79
+ "model.text.blocks.12.mlp.router.bias": "model-00002-of-00004.safetensors",
80
+ "model.text.blocks.12.mlp.router.weight": "model-00002-of-00004.safetensors",
81
+ "model.text.blocks.13.attn.proj.bias": "model-00002-of-00004.safetensors",
82
+ "model.text.blocks.13.attn.proj.weight": "model-00002-of-00004.safetensors",
83
+ "model.text.blocks.13.attn.qkv.bias": "model-00002-of-00004.safetensors",
84
+ "model.text.blocks.13.attn.qkv.weight": "model-00002-of-00004.safetensors",
85
+ "model.text.blocks.13.attn.tau.alpha": "model-00002-of-00004.safetensors",
86
+ "model.text.blocks.13.attn.tau.wq": "model-00002-of-00004.safetensors",
87
+ "model.text.blocks.13.attn.tau.wv": "model-00002-of-00004.safetensors",
88
+ "model.text.blocks.13.ln.bias": "model-00002-of-00004.safetensors",
89
+ "model.text.blocks.13.ln.weight": "model-00002-of-00004.safetensors",
90
+ "model.text.blocks.13.mlp.fc1.weight": "model-00002-of-00004.safetensors",
91
+ "model.text.blocks.13.mlp.fc2.weight": "model-00003-of-00004.safetensors",
92
+ "model.text.blocks.13.mlp.router.bias": "model-00002-of-00004.safetensors",
93
+ "model.text.blocks.13.mlp.router.weight": "model-00002-of-00004.safetensors",
94
+ "model.text.blocks.14.attn.proj.bias": "model-00003-of-00004.safetensors",
95
+ "model.text.blocks.14.attn.proj.weight": "model-00003-of-00004.safetensors",
96
+ "model.text.blocks.14.attn.qkv.bias": "model-00003-of-00004.safetensors",
97
+ "model.text.blocks.14.attn.qkv.weight": "model-00003-of-00004.safetensors",
98
+ "model.text.blocks.14.attn.tau.alpha": "model-00003-of-00004.safetensors",
99
+ "model.text.blocks.14.attn.tau.wq": "model-00003-of-00004.safetensors",
100
+ "model.text.blocks.14.attn.tau.wv": "model-00003-of-00004.safetensors",
101
+ "model.text.blocks.14.ln.bias": "model-00003-of-00004.safetensors",
102
+ "model.text.blocks.14.ln.weight": "model-00003-of-00004.safetensors",
103
+ "model.text.blocks.14.mlp.fc1.weight": "model-00003-of-00004.safetensors",
104
+ "model.text.blocks.14.mlp.fc2.weight": "model-00003-of-00004.safetensors",
105
+ "model.text.blocks.14.mlp.router.bias": "model-00003-of-00004.safetensors",
106
+ "model.text.blocks.14.mlp.router.weight": "model-00003-of-00004.safetensors",
107
+ "model.text.blocks.15.attn.proj.bias": "model-00003-of-00004.safetensors",
108
+ "model.text.blocks.15.attn.proj.weight": "model-00003-of-00004.safetensors",
109
+ "model.text.blocks.15.attn.qkv.bias": "model-00003-of-00004.safetensors",
110
+ "model.text.blocks.15.attn.qkv.weight": "model-00003-of-00004.safetensors",
111
+ "model.text.blocks.15.attn.tau.alpha": "model-00003-of-00004.safetensors",
112
+ "model.text.blocks.15.attn.tau.wq": "model-00003-of-00004.safetensors",
113
+ "model.text.blocks.15.attn.tau.wv": "model-00003-of-00004.safetensors",
114
+ "model.text.blocks.15.ln.bias": "model-00003-of-00004.safetensors",
115
+ "model.text.blocks.15.ln.weight": "model-00003-of-00004.safetensors",
116
+ "model.text.blocks.15.mlp.fc1.weight": "model-00003-of-00004.safetensors",
117
+ "model.text.blocks.15.mlp.fc2.weight": "model-00003-of-00004.safetensors",
118
+ "model.text.blocks.15.mlp.router.bias": "model-00003-of-00004.safetensors",
119
+ "model.text.blocks.15.mlp.router.weight": "model-00003-of-00004.safetensors",
120
+ "model.text.blocks.16.attn.proj.bias": "model-00003-of-00004.safetensors",
121
+ "model.text.blocks.16.attn.proj.weight": "model-00003-of-00004.safetensors",
122
+ "model.text.blocks.16.attn.qkv.bias": "model-00003-of-00004.safetensors",
123
+ "model.text.blocks.16.attn.qkv.weight": "model-00003-of-00004.safetensors",
124
+ "model.text.blocks.16.attn.tau.alpha": "model-00003-of-00004.safetensors",
125
+ "model.text.blocks.16.attn.tau.wq": "model-00003-of-00004.safetensors",
126
+ "model.text.blocks.16.attn.tau.wv": "model-00003-of-00004.safetensors",
127
+ "model.text.blocks.16.ln.bias": "model-00003-of-00004.safetensors",
128
+ "model.text.blocks.16.ln.weight": "model-00003-of-00004.safetensors",
129
+ "model.text.blocks.16.mlp.fc1.weight": "model-00003-of-00004.safetensors",
130
+ "model.text.blocks.16.mlp.fc2.weight": "model-00003-of-00004.safetensors",
131
+ "model.text.blocks.16.mlp.router.bias": "model-00003-of-00004.safetensors",
132
+ "model.text.blocks.16.mlp.router.weight": "model-00003-of-00004.safetensors",
133
+ "model.text.blocks.17.attn.proj.bias": "model-00003-of-00004.safetensors",
134
+ "model.text.blocks.17.attn.proj.weight": "model-00003-of-00004.safetensors",
135
+ "model.text.blocks.17.attn.qkv.bias": "model-00003-of-00004.safetensors",
136
+ "model.text.blocks.17.attn.qkv.weight": "model-00003-of-00004.safetensors",
137
+ "model.text.blocks.17.attn.tau.alpha": "model-00003-of-00004.safetensors",
138
+ "model.text.blocks.17.attn.tau.wq": "model-00003-of-00004.safetensors",
139
+ "model.text.blocks.17.attn.tau.wv": "model-00003-of-00004.safetensors",
140
+ "model.text.blocks.17.ln.bias": "model-00003-of-00004.safetensors",
141
+ "model.text.blocks.17.ln.weight": "model-00003-of-00004.safetensors",
142
+ "model.text.blocks.17.mlp.fc1.weight": "model-00003-of-00004.safetensors",
143
+ "model.text.blocks.17.mlp.fc2.weight": "model-00003-of-00004.safetensors",
144
+ "model.text.blocks.17.mlp.router.bias": "model-00003-of-00004.safetensors",
145
+ "model.text.blocks.17.mlp.router.weight": "model-00003-of-00004.safetensors",
146
+ "model.text.blocks.18.attn.proj.bias": "model-00003-of-00004.safetensors",
147
+ "model.text.blocks.18.attn.proj.weight": "model-00003-of-00004.safetensors",
148
+ "model.text.blocks.18.attn.qkv.bias": "model-00003-of-00004.safetensors",
149
+ "model.text.blocks.18.attn.qkv.weight": "model-00003-of-00004.safetensors",
150
+ "model.text.blocks.18.attn.tau.alpha": "model-00003-of-00004.safetensors",
151
+ "model.text.blocks.18.attn.tau.wq": "model-00003-of-00004.safetensors",
152
+ "model.text.blocks.18.attn.tau.wv": "model-00003-of-00004.safetensors",
153
+ "model.text.blocks.18.ln.bias": "model-00003-of-00004.safetensors",
154
+ "model.text.blocks.18.ln.weight": "model-00003-of-00004.safetensors",
155
+ "model.text.blocks.18.mlp.fc1.weight": "model-00003-of-00004.safetensors",
156
+ "model.text.blocks.18.mlp.fc2.weight": "model-00003-of-00004.safetensors",
157
+ "model.text.blocks.18.mlp.router.bias": "model-00003-of-00004.safetensors",
158
+ "model.text.blocks.18.mlp.router.weight": "model-00003-of-00004.safetensors",
159
+ "model.text.blocks.19.attn.proj.bias": "model-00003-of-00004.safetensors",
160
+ "model.text.blocks.19.attn.proj.weight": "model-00003-of-00004.safetensors",
161
+ "model.text.blocks.19.attn.qkv.bias": "model-00003-of-00004.safetensors",
162
+ "model.text.blocks.19.attn.qkv.weight": "model-00003-of-00004.safetensors",
163
+ "model.text.blocks.19.attn.tau.alpha": "model-00003-of-00004.safetensors",
164
+ "model.text.blocks.19.attn.tau.wq": "model-00003-of-00004.safetensors",
165
+ "model.text.blocks.19.attn.tau.wv": "model-00003-of-00004.safetensors",
166
+ "model.text.blocks.19.ln.bias": "model-00003-of-00004.safetensors",
167
+ "model.text.blocks.19.ln.weight": "model-00003-of-00004.safetensors",
168
+ "model.text.blocks.19.mlp.fc1.weight": "model-00004-of-00004.safetensors",
169
+ "model.text.blocks.19.mlp.fc2.weight": "model-00004-of-00004.safetensors",
170
+ "model.text.blocks.19.mlp.router.bias": "model-00003-of-00004.safetensors",
171
+ "model.text.blocks.19.mlp.router.weight": "model-00003-of-00004.safetensors",
172
+ "model.text.blocks.2.attn.proj.bias": "model-00001-of-00004.safetensors",
173
+ "model.text.blocks.2.attn.proj.weight": "model-00001-of-00004.safetensors",
174
+ "model.text.blocks.2.attn.qkv.bias": "model-00001-of-00004.safetensors",
175
+ "model.text.blocks.2.attn.qkv.weight": "model-00001-of-00004.safetensors",
176
+ "model.text.blocks.2.attn.tau.alpha": "model-00001-of-00004.safetensors",
177
+ "model.text.blocks.2.attn.tau.wq": "model-00001-of-00004.safetensors",
178
+ "model.text.blocks.2.attn.tau.wv": "model-00001-of-00004.safetensors",
179
+ "model.text.blocks.2.ln.bias": "model-00001-of-00004.safetensors",
180
+ "model.text.blocks.2.ln.weight": "model-00001-of-00004.safetensors",
181
+ "model.text.blocks.2.mlp.fc1.bias": "model-00001-of-00004.safetensors",
182
+ "model.text.blocks.2.mlp.fc1.weight": "model-00001-of-00004.safetensors",
183
+ "model.text.blocks.2.mlp.fc2.bias": "model-00001-of-00004.safetensors",
184
+ "model.text.blocks.2.mlp.fc2.weight": "model-00001-of-00004.safetensors",
185
+ "model.text.blocks.20.attn.proj.bias": "model-00004-of-00004.safetensors",
186
+ "model.text.blocks.20.attn.proj.weight": "model-00004-of-00004.safetensors",
187
+ "model.text.blocks.20.attn.qkv.bias": "model-00004-of-00004.safetensors",
188
+ "model.text.blocks.20.attn.qkv.weight": "model-00004-of-00004.safetensors",
189
+ "model.text.blocks.20.attn.tau.alpha": "model-00004-of-00004.safetensors",
190
+ "model.text.blocks.20.attn.tau.wq": "model-00004-of-00004.safetensors",
191
+ "model.text.blocks.20.attn.tau.wv": "model-00004-of-00004.safetensors",
192
+ "model.text.blocks.20.ln.bias": "model-00004-of-00004.safetensors",
193
+ "model.text.blocks.20.ln.weight": "model-00004-of-00004.safetensors",
194
+ "model.text.blocks.20.mlp.fc1.weight": "model-00004-of-00004.safetensors",
195
+ "model.text.blocks.20.mlp.fc2.weight": "model-00004-of-00004.safetensors",
196
+ "model.text.blocks.20.mlp.router.bias": "model-00004-of-00004.safetensors",
197
+ "model.text.blocks.20.mlp.router.weight": "model-00004-of-00004.safetensors",
198
+ "model.text.blocks.21.attn.proj.bias": "model-00004-of-00004.safetensors",
199
+ "model.text.blocks.21.attn.proj.weight": "model-00004-of-00004.safetensors",
200
+ "model.text.blocks.21.attn.qkv.bias": "model-00004-of-00004.safetensors",
201
+ "model.text.blocks.21.attn.qkv.weight": "model-00004-of-00004.safetensors",
202
+ "model.text.blocks.21.attn.tau.alpha": "model-00004-of-00004.safetensors",
203
+ "model.text.blocks.21.attn.tau.wq": "model-00004-of-00004.safetensors",
204
+ "model.text.blocks.21.attn.tau.wv": "model-00004-of-00004.safetensors",
205
+ "model.text.blocks.21.ln.bias": "model-00004-of-00004.safetensors",
206
+ "model.text.blocks.21.ln.weight": "model-00004-of-00004.safetensors",
207
+ "model.text.blocks.21.mlp.fc1.weight": "model-00004-of-00004.safetensors",
208
+ "model.text.blocks.21.mlp.fc2.weight": "model-00004-of-00004.safetensors",
209
+ "model.text.blocks.21.mlp.router.bias": "model-00004-of-00004.safetensors",
210
+ "model.text.blocks.21.mlp.router.weight": "model-00004-of-00004.safetensors",
211
+ "model.text.blocks.22.attn.proj.bias": "model-00004-of-00004.safetensors",
212
+ "model.text.blocks.22.attn.proj.weight": "model-00004-of-00004.safetensors",
213
+ "model.text.blocks.22.attn.qkv.bias": "model-00004-of-00004.safetensors",
214
+ "model.text.blocks.22.attn.qkv.weight": "model-00004-of-00004.safetensors",
215
+ "model.text.blocks.22.attn.tau.alpha": "model-00004-of-00004.safetensors",
216
+ "model.text.blocks.22.attn.tau.wq": "model-00004-of-00004.safetensors",
217
+ "model.text.blocks.22.attn.tau.wv": "model-00004-of-00004.safetensors",
218
+ "model.text.blocks.22.ln.bias": "model-00004-of-00004.safetensors",
219
+ "model.text.blocks.22.ln.weight": "model-00004-of-00004.safetensors",
220
+ "model.text.blocks.22.mlp.fc1.weight": "model-00004-of-00004.safetensors",
221
+ "model.text.blocks.22.mlp.fc2.weight": "model-00004-of-00004.safetensors",
222
+ "model.text.blocks.22.mlp.router.bias": "model-00004-of-00004.safetensors",
223
+ "model.text.blocks.22.mlp.router.weight": "model-00004-of-00004.safetensors",
224
+ "model.text.blocks.23.attn.proj.bias": "model-00004-of-00004.safetensors",
225
+ "model.text.blocks.23.attn.proj.weight": "model-00004-of-00004.safetensors",
226
+ "model.text.blocks.23.attn.qkv.bias": "model-00004-of-00004.safetensors",
227
+ "model.text.blocks.23.attn.qkv.weight": "model-00004-of-00004.safetensors",
228
+ "model.text.blocks.23.attn.tau.alpha": "model-00004-of-00004.safetensors",
229
+ "model.text.blocks.23.attn.tau.wq": "model-00004-of-00004.safetensors",
230
+ "model.text.blocks.23.attn.tau.wv": "model-00004-of-00004.safetensors",
231
+ "model.text.blocks.23.ln.bias": "model-00004-of-00004.safetensors",
232
+ "model.text.blocks.23.ln.weight": "model-00004-of-00004.safetensors",
233
+ "model.text.blocks.23.mlp.fc1.weight": "model-00004-of-00004.safetensors",
234
+ "model.text.blocks.23.mlp.fc2.weight": "model-00004-of-00004.safetensors",
235
+ "model.text.blocks.23.mlp.router.bias": "model-00004-of-00004.safetensors",
236
+ "model.text.blocks.23.mlp.router.weight": "model-00004-of-00004.safetensors",
237
+ "model.text.blocks.3.attn.proj.bias": "model-00001-of-00004.safetensors",
238
+ "model.text.blocks.3.attn.proj.weight": "model-00001-of-00004.safetensors",
239
+ "model.text.blocks.3.attn.qkv.bias": "model-00001-of-00004.safetensors",
240
+ "model.text.blocks.3.attn.qkv.weight": "model-00001-of-00004.safetensors",
241
+ "model.text.blocks.3.attn.tau.alpha": "model-00001-of-00004.safetensors",
242
+ "model.text.blocks.3.attn.tau.wq": "model-00001-of-00004.safetensors",
243
+ "model.text.blocks.3.attn.tau.wv": "model-00001-of-00004.safetensors",
244
+ "model.text.blocks.3.ln.bias": "model-00001-of-00004.safetensors",
245
+ "model.text.blocks.3.ln.weight": "model-00001-of-00004.safetensors",
246
+ "model.text.blocks.3.mlp.fc1.bias": "model-00001-of-00004.safetensors",
247
+ "model.text.blocks.3.mlp.fc1.weight": "model-00001-of-00004.safetensors",
248
+ "model.text.blocks.3.mlp.fc2.bias": "model-00001-of-00004.safetensors",
249
+ "model.text.blocks.3.mlp.fc2.weight": "model-00001-of-00004.safetensors",
250
+ "model.text.blocks.4.attn.proj.bias": "model-00001-of-00004.safetensors",
251
+ "model.text.blocks.4.attn.proj.weight": "model-00001-of-00004.safetensors",
252
+ "model.text.blocks.4.attn.qkv.bias": "model-00001-of-00004.safetensors",
253
+ "model.text.blocks.4.attn.qkv.weight": "model-00001-of-00004.safetensors",
254
+ "model.text.blocks.4.attn.tau.alpha": "model-00001-of-00004.safetensors",
255
+ "model.text.blocks.4.attn.tau.wq": "model-00001-of-00004.safetensors",
256
+ "model.text.blocks.4.attn.tau.wv": "model-00001-of-00004.safetensors",
257
+ "model.text.blocks.4.ln.bias": "model-00001-of-00004.safetensors",
258
+ "model.text.blocks.4.ln.weight": "model-00001-of-00004.safetensors",
259
+ "model.text.blocks.4.mlp.fc1.weight": "model-00001-of-00004.safetensors",
260
+ "model.text.blocks.4.mlp.fc2.weight": "model-00001-of-00004.safetensors",
261
+ "model.text.blocks.4.mlp.router.bias": "model-00001-of-00004.safetensors",
262
+ "model.text.blocks.4.mlp.router.weight": "model-00001-of-00004.safetensors",
263
+ "model.text.blocks.5.attn.proj.bias": "model-00001-of-00004.safetensors",
264
+ "model.text.blocks.5.attn.proj.weight": "model-00001-of-00004.safetensors",
265
+ "model.text.blocks.5.attn.qkv.bias": "model-00001-of-00004.safetensors",
266
+ "model.text.blocks.5.attn.qkv.weight": "model-00001-of-00004.safetensors",
267
+ "model.text.blocks.5.attn.tau.alpha": "model-00001-of-00004.safetensors",
268
+ "model.text.blocks.5.attn.tau.wq": "model-00001-of-00004.safetensors",
269
+ "model.text.blocks.5.attn.tau.wv": "model-00001-of-00004.safetensors",
270
+ "model.text.blocks.5.ln.bias": "model-00001-of-00004.safetensors",
271
+ "model.text.blocks.5.ln.weight": "model-00001-of-00004.safetensors",
272
+ "model.text.blocks.5.mlp.fc1.weight": "model-00001-of-00004.safetensors",
273
+ "model.text.blocks.5.mlp.fc2.weight": "model-00001-of-00004.safetensors",
274
+ "model.text.blocks.5.mlp.router.bias": "model-00001-of-00004.safetensors",
275
+ "model.text.blocks.5.mlp.router.weight": "model-00001-of-00004.safetensors",
276
+ "model.text.blocks.6.attn.proj.bias": "model-00001-of-00004.safetensors",
277
+ "model.text.blocks.6.attn.proj.weight": "model-00001-of-00004.safetensors",
278
+ "model.text.blocks.6.attn.qkv.bias": "model-00001-of-00004.safetensors",
279
+ "model.text.blocks.6.attn.qkv.weight": "model-00001-of-00004.safetensors",
280
+ "model.text.blocks.6.attn.tau.alpha": "model-00001-of-00004.safetensors",
281
+ "model.text.blocks.6.attn.tau.wq": "model-00001-of-00004.safetensors",
282
+ "model.text.blocks.6.attn.tau.wv": "model-00001-of-00004.safetensors",
283
+ "model.text.blocks.6.ln.bias": "model-00001-of-00004.safetensors",
284
+ "model.text.blocks.6.ln.weight": "model-00001-of-00004.safetensors",
285
+ "model.text.blocks.6.mlp.fc1.weight": "model-00001-of-00004.safetensors",
286
+ "model.text.blocks.6.mlp.fc2.weight": "model-00001-of-00004.safetensors",
287
+ "model.text.blocks.6.mlp.router.bias": "model-00001-of-00004.safetensors",
288
+ "model.text.blocks.6.mlp.router.weight": "model-00001-of-00004.safetensors",
289
+ "model.text.blocks.7.attn.proj.bias": "model-00001-of-00004.safetensors",
290
+ "model.text.blocks.7.attn.proj.weight": "model-00001-of-00004.safetensors",
291
+ "model.text.blocks.7.attn.qkv.bias": "model-00001-of-00004.safetensors",
292
+ "model.text.blocks.7.attn.qkv.weight": "model-00001-of-00004.safetensors",
293
+ "model.text.blocks.7.attn.tau.alpha": "model-00001-of-00004.safetensors",
294
+ "model.text.blocks.7.attn.tau.wq": "model-00001-of-00004.safetensors",
295
+ "model.text.blocks.7.attn.tau.wv": "model-00001-of-00004.safetensors",
296
+ "model.text.blocks.7.ln.bias": "model-00001-of-00004.safetensors",
297
+ "model.text.blocks.7.ln.weight": "model-00001-of-00004.safetensors",
298
+ "model.text.blocks.7.mlp.fc1.weight": "model-00001-of-00004.safetensors",
299
+ "model.text.blocks.7.mlp.fc2.weight": "model-00001-of-00004.safetensors",
300
+ "model.text.blocks.7.mlp.router.bias": "model-00001-of-00004.safetensors",
301
+ "model.text.blocks.7.mlp.router.weight": "model-00001-of-00004.safetensors",
302
+ "model.text.blocks.8.attn.proj.bias": "model-00001-of-00004.safetensors",
303
+ "model.text.blocks.8.attn.proj.weight": "model-00001-of-00004.safetensors",
304
+ "model.text.blocks.8.attn.qkv.bias": "model-00001-of-00004.safetensors",
305
+ "model.text.blocks.8.attn.qkv.weight": "model-00001-of-00004.safetensors",
306
+ "model.text.blocks.8.attn.tau.alpha": "model-00001-of-00004.safetensors",
307
+ "model.text.blocks.8.attn.tau.wq": "model-00001-of-00004.safetensors",
308
+ "model.text.blocks.8.attn.tau.wv": "model-00001-of-00004.safetensors",
309
+ "model.text.blocks.8.ln.bias": "model-00001-of-00004.safetensors",
310
+ "model.text.blocks.8.ln.weight": "model-00001-of-00004.safetensors",
311
+ "model.text.blocks.8.mlp.fc1.weight": "model-00002-of-00004.safetensors",
312
+ "model.text.blocks.8.mlp.fc2.weight": "model-00002-of-00004.safetensors",
313
+ "model.text.blocks.8.mlp.router.bias": "model-00001-of-00004.safetensors",
314
+ "model.text.blocks.8.mlp.router.weight": "model-00001-of-00004.safetensors",
315
+ "model.text.blocks.9.attn.proj.bias": "model-00002-of-00004.safetensors",
316
+ "model.text.blocks.9.attn.proj.weight": "model-00002-of-00004.safetensors",
317
+ "model.text.blocks.9.attn.qkv.bias": "model-00002-of-00004.safetensors",
318
+ "model.text.blocks.9.attn.qkv.weight": "model-00002-of-00004.safetensors",
319
+ "model.text.blocks.9.attn.tau.alpha": "model-00002-of-00004.safetensors",
320
+ "model.text.blocks.9.attn.tau.wq": "model-00002-of-00004.safetensors",
321
+ "model.text.blocks.9.attn.tau.wv": "model-00002-of-00004.safetensors",
322
+ "model.text.blocks.9.ln.bias": "model-00002-of-00004.safetensors",
323
+ "model.text.blocks.9.ln.weight": "model-00002-of-00004.safetensors",
324
+ "model.text.blocks.9.mlp.fc1.weight": "model-00002-of-00004.safetensors",
325
+ "model.text.blocks.9.mlp.fc2.weight": "model-00002-of-00004.safetensors",
326
+ "model.text.blocks.9.mlp.router.bias": "model-00002-of-00004.safetensors",
327
+ "model.text.blocks.9.mlp.router.weight": "model-00002-of-00004.safetensors",
328
+ "model.text.lm_head.bias": "model-00004-of-00004.safetensors",
329
+ "model.text.lm_head.weight": "model-00004-of-00004.safetensors",
330
+ "model.text.post_ln.bias": "model-00004-of-00004.safetensors",
331
+ "model.text.post_ln.weight": "model-00004-of-00004.safetensors",
332
+ "model.text.wte": "model-00001-of-00004.safetensors",
333
+ "model.vision.blocks.0.attn.proj.bias": "model-00001-of-00004.safetensors",
334
+ "model.vision.blocks.0.attn.proj.weight": "model-00001-of-00004.safetensors",
335
+ "model.vision.blocks.0.attn.qkv.bias": "model-00001-of-00004.safetensors",
336
+ "model.vision.blocks.0.attn.qkv.weight": "model-00001-of-00004.safetensors",
337
+ "model.vision.blocks.0.ln1.bias": "model-00001-of-00004.safetensors",
338
+ "model.vision.blocks.0.ln1.weight": "model-00001-of-00004.safetensors",
339
+ "model.vision.blocks.0.ln2.bias": "model-00001-of-00004.safetensors",
340
+ "model.vision.blocks.0.ln2.weight": "model-00001-of-00004.safetensors",
341
+ "model.vision.blocks.0.mlp.fc1.bias": "model-00001-of-00004.safetensors",
342
+ "model.vision.blocks.0.mlp.fc1.weight": "model-00001-of-00004.safetensors",
343
+ "model.vision.blocks.0.mlp.fc2.bias": "model-00001-of-00004.safetensors",
344
+ "model.vision.blocks.0.mlp.fc2.weight": "model-00001-of-00004.safetensors",
345
+ "model.vision.blocks.1.attn.proj.bias": "model-00001-of-00004.safetensors",
346
+ "model.vision.blocks.1.attn.proj.weight": "model-00001-of-00004.safetensors",
347
+ "model.vision.blocks.1.attn.qkv.bias": "model-00001-of-00004.safetensors",
348
+ "model.vision.blocks.1.attn.qkv.weight": "model-00001-of-00004.safetensors",
349
+ "model.vision.blocks.1.ln1.bias": "model-00001-of-00004.safetensors",
350
+ "model.vision.blocks.1.ln1.weight": "model-00001-of-00004.safetensors",
351
+ "model.vision.blocks.1.ln2.bias": "model-00001-of-00004.safetensors",
352
+ "model.vision.blocks.1.ln2.weight": "model-00001-of-00004.safetensors",
353
+ "model.vision.blocks.1.mlp.fc1.bias": "model-00001-of-00004.safetensors",
354
+ "model.vision.blocks.1.mlp.fc1.weight": "model-00001-of-00004.safetensors",
355
+ "model.vision.blocks.1.mlp.fc2.bias": "model-00001-of-00004.safetensors",
356
+ "model.vision.blocks.1.mlp.fc2.weight": "model-00001-of-00004.safetensors",
357
+ "model.vision.blocks.10.attn.proj.bias": "model-00001-of-00004.safetensors",
358
+ "model.vision.blocks.10.attn.proj.weight": "model-00001-of-00004.safetensors",
359
+ "model.vision.blocks.10.attn.qkv.bias": "model-00001-of-00004.safetensors",
360
+ "model.vision.blocks.10.attn.qkv.weight": "model-00001-of-00004.safetensors",
361
+ "model.vision.blocks.10.ln1.bias": "model-00001-of-00004.safetensors",
362
+ "model.vision.blocks.10.ln1.weight": "model-00001-of-00004.safetensors",
363
+ "model.vision.blocks.10.ln2.bias": "model-00001-of-00004.safetensors",
364
+ "model.vision.blocks.10.ln2.weight": "model-00001-of-00004.safetensors",
365
+ "model.vision.blocks.10.mlp.fc1.bias": "model-00001-of-00004.safetensors",
366
+ "model.vision.blocks.10.mlp.fc1.weight": "model-00001-of-00004.safetensors",
367
+ "model.vision.blocks.10.mlp.fc2.bias": "model-00001-of-00004.safetensors",
368
+ "model.vision.blocks.10.mlp.fc2.weight": "model-00001-of-00004.safetensors",
369
+ "model.vision.blocks.11.attn.proj.bias": "model-00001-of-00004.safetensors",
370
+ "model.vision.blocks.11.attn.proj.weight": "model-00001-of-00004.safetensors",
371
+ "model.vision.blocks.11.attn.qkv.bias": "model-00001-of-00004.safetensors",
372
+ "model.vision.blocks.11.attn.qkv.weight": "model-00001-of-00004.safetensors",
373
+ "model.vision.blocks.11.ln1.bias": "model-00001-of-00004.safetensors",
374
+ "model.vision.blocks.11.ln1.weight": "model-00001-of-00004.safetensors",
375
+ "model.vision.blocks.11.ln2.bias": "model-00001-of-00004.safetensors",
376
+ "model.vision.blocks.11.ln2.weight": "model-00001-of-00004.safetensors",
377
+ "model.vision.blocks.11.mlp.fc1.bias": "model-00001-of-00004.safetensors",
378
+ "model.vision.blocks.11.mlp.fc1.weight": "model-00001-of-00004.safetensors",
379
+ "model.vision.blocks.11.mlp.fc2.bias": "model-00001-of-00004.safetensors",
380
+ "model.vision.blocks.11.mlp.fc2.weight": "model-00001-of-00004.safetensors",
381
+ "model.vision.blocks.12.attn.proj.bias": "model-00001-of-00004.safetensors",
382
+ "model.vision.blocks.12.attn.proj.weight": "model-00001-of-00004.safetensors",
383
+ "model.vision.blocks.12.attn.qkv.bias": "model-00001-of-00004.safetensors",
384
+ "model.vision.blocks.12.attn.qkv.weight": "model-00001-of-00004.safetensors",
385
+ "model.vision.blocks.12.ln1.bias": "model-00001-of-00004.safetensors",
386
+ "model.vision.blocks.12.ln1.weight": "model-00001-of-00004.safetensors",
387
+ "model.vision.blocks.12.ln2.bias": "model-00001-of-00004.safetensors",
388
+ "model.vision.blocks.12.ln2.weight": "model-00001-of-00004.safetensors",
389
+ "model.vision.blocks.12.mlp.fc1.bias": "model-00001-of-00004.safetensors",
390
+ "model.vision.blocks.12.mlp.fc1.weight": "model-00001-of-00004.safetensors",
391
+ "model.vision.blocks.12.mlp.fc2.bias": "model-00001-of-00004.safetensors",
392
+ "model.vision.blocks.12.mlp.fc2.weight": "model-00001-of-00004.safetensors",
393
+ "model.vision.blocks.13.attn.proj.bias": "model-00001-of-00004.safetensors",
394
+ "model.vision.blocks.13.attn.proj.weight": "model-00001-of-00004.safetensors",
395
+ "model.vision.blocks.13.attn.qkv.bias": "model-00001-of-00004.safetensors",
396
+ "model.vision.blocks.13.attn.qkv.weight": "model-00001-of-00004.safetensors",
397
+ "model.vision.blocks.13.ln1.bias": "model-00001-of-00004.safetensors",
398
+ "model.vision.blocks.13.ln1.weight": "model-00001-of-00004.safetensors",
399
+ "model.vision.blocks.13.ln2.bias": "model-00001-of-00004.safetensors",
400
+ "model.vision.blocks.13.ln2.weight": "model-00001-of-00004.safetensors",
401
+ "model.vision.blocks.13.mlp.fc1.bias": "model-00001-of-00004.safetensors",
402
+ "model.vision.blocks.13.mlp.fc1.weight": "model-00001-of-00004.safetensors",
403
+ "model.vision.blocks.13.mlp.fc2.bias": "model-00001-of-00004.safetensors",
404
+ "model.vision.blocks.13.mlp.fc2.weight": "model-00001-of-00004.safetensors",
405
+ "model.vision.blocks.14.attn.proj.bias": "model-00001-of-00004.safetensors",
406
+ "model.vision.blocks.14.attn.proj.weight": "model-00001-of-00004.safetensors",
407
+ "model.vision.blocks.14.attn.qkv.bias": "model-00001-of-00004.safetensors",
408
+ "model.vision.blocks.14.attn.qkv.weight": "model-00001-of-00004.safetensors",
409
+ "model.vision.blocks.14.ln1.bias": "model-00001-of-00004.safetensors",
410
+ "model.vision.blocks.14.ln1.weight": "model-00001-of-00004.safetensors",
411
+ "model.vision.blocks.14.ln2.bias": "model-00001-of-00004.safetensors",
412
+ "model.vision.blocks.14.ln2.weight": "model-00001-of-00004.safetensors",
413
+ "model.vision.blocks.14.mlp.fc1.bias": "model-00001-of-00004.safetensors",
414
+ "model.vision.blocks.14.mlp.fc1.weight": "model-00001-of-00004.safetensors",
415
+ "model.vision.blocks.14.mlp.fc2.bias": "model-00001-of-00004.safetensors",
416
+ "model.vision.blocks.14.mlp.fc2.weight": "model-00001-of-00004.safetensors",
417
+ "model.vision.blocks.15.attn.proj.bias": "model-00001-of-00004.safetensors",
418
+ "model.vision.blocks.15.attn.proj.weight": "model-00001-of-00004.safetensors",
419
+ "model.vision.blocks.15.attn.qkv.bias": "model-00001-of-00004.safetensors",
420
+ "model.vision.blocks.15.attn.qkv.weight": "model-00001-of-00004.safetensors",
421
+ "model.vision.blocks.15.ln1.bias": "model-00001-of-00004.safetensors",
422
+ "model.vision.blocks.15.ln1.weight": "model-00001-of-00004.safetensors",
423
+ "model.vision.blocks.15.ln2.bias": "model-00001-of-00004.safetensors",
424
+ "model.vision.blocks.15.ln2.weight": "model-00001-of-00004.safetensors",
425
+ "model.vision.blocks.15.mlp.fc1.bias": "model-00001-of-00004.safetensors",
426
+ "model.vision.blocks.15.mlp.fc1.weight": "model-00001-of-00004.safetensors",
427
+ "model.vision.blocks.15.mlp.fc2.bias": "model-00001-of-00004.safetensors",
428
+ "model.vision.blocks.15.mlp.fc2.weight": "model-00001-of-00004.safetensors",
429
+ "model.vision.blocks.16.attn.proj.bias": "model-00001-of-00004.safetensors",
430
+ "model.vision.blocks.16.attn.proj.weight": "model-00001-of-00004.safetensors",
431
+ "model.vision.blocks.16.attn.qkv.bias": "model-00001-of-00004.safetensors",
432
+ "model.vision.blocks.16.attn.qkv.weight": "model-00001-of-00004.safetensors",
433
+ "model.vision.blocks.16.ln1.bias": "model-00001-of-00004.safetensors",
434
+ "model.vision.blocks.16.ln1.weight": "model-00001-of-00004.safetensors",
435
+ "model.vision.blocks.16.ln2.bias": "model-00001-of-00004.safetensors",
436
+ "model.vision.blocks.16.ln2.weight": "model-00001-of-00004.safetensors",
437
+ "model.vision.blocks.16.mlp.fc1.bias": "model-00001-of-00004.safetensors",
438
+ "model.vision.blocks.16.mlp.fc1.weight": "model-00001-of-00004.safetensors",
439
+ "model.vision.blocks.16.mlp.fc2.bias": "model-00001-of-00004.safetensors",
440
+ "model.vision.blocks.16.mlp.fc2.weight": "model-00001-of-00004.safetensors",
441
+ "model.vision.blocks.17.attn.proj.bias": "model-00001-of-00004.safetensors",
442
+ "model.vision.blocks.17.attn.proj.weight": "model-00001-of-00004.safetensors",
443
+ "model.vision.blocks.17.attn.qkv.bias": "model-00001-of-00004.safetensors",
444
+ "model.vision.blocks.17.attn.qkv.weight": "model-00001-of-00004.safetensors",
445
+ "model.vision.blocks.17.ln1.bias": "model-00001-of-00004.safetensors",
446
+ "model.vision.blocks.17.ln1.weight": "model-00001-of-00004.safetensors",
447
+ "model.vision.blocks.17.ln2.bias": "model-00001-of-00004.safetensors",
448
+ "model.vision.blocks.17.ln2.weight": "model-00001-of-00004.safetensors",
449
+ "model.vision.blocks.17.mlp.fc1.bias": "model-00001-of-00004.safetensors",
450
+ "model.vision.blocks.17.mlp.fc1.weight": "model-00001-of-00004.safetensors",
451
+ "model.vision.blocks.17.mlp.fc2.bias": "model-00001-of-00004.safetensors",
452
+ "model.vision.blocks.17.mlp.fc2.weight": "model-00001-of-00004.safetensors",
453
+ "model.vision.blocks.18.attn.proj.bias": "model-00001-of-00004.safetensors",
454
+ "model.vision.blocks.18.attn.proj.weight": "model-00001-of-00004.safetensors",
455
+ "model.vision.blocks.18.attn.qkv.bias": "model-00001-of-00004.safetensors",
456
+ "model.vision.blocks.18.attn.qkv.weight": "model-00001-of-00004.safetensors",
457
+ "model.vision.blocks.18.ln1.bias": "model-00001-of-00004.safetensors",
458
+ "model.vision.blocks.18.ln1.weight": "model-00001-of-00004.safetensors",
459
+ "model.vision.blocks.18.ln2.bias": "model-00001-of-00004.safetensors",
460
+ "model.vision.blocks.18.ln2.weight": "model-00001-of-00004.safetensors",
461
+ "model.vision.blocks.18.mlp.fc1.bias": "model-00001-of-00004.safetensors",
462
+ "model.vision.blocks.18.mlp.fc1.weight": "model-00001-of-00004.safetensors",
463
+ "model.vision.blocks.18.mlp.fc2.bias": "model-00001-of-00004.safetensors",
464
+ "model.vision.blocks.18.mlp.fc2.weight": "model-00001-of-00004.safetensors",
465
+ "model.vision.blocks.19.attn.proj.bias": "model-00001-of-00004.safetensors",
466
+ "model.vision.blocks.19.attn.proj.weight": "model-00001-of-00004.safetensors",
467
+ "model.vision.blocks.19.attn.qkv.bias": "model-00001-of-00004.safetensors",
468
+ "model.vision.blocks.19.attn.qkv.weight": "model-00001-of-00004.safetensors",
469
+ "model.vision.blocks.19.ln1.bias": "model-00001-of-00004.safetensors",
470
+ "model.vision.blocks.19.ln1.weight": "model-00001-of-00004.safetensors",
471
+ "model.vision.blocks.19.ln2.bias": "model-00001-of-00004.safetensors",
472
+ "model.vision.blocks.19.ln2.weight": "model-00001-of-00004.safetensors",
473
+ "model.vision.blocks.19.mlp.fc1.bias": "model-00001-of-00004.safetensors",
474
+ "model.vision.blocks.19.mlp.fc1.weight": "model-00001-of-00004.safetensors",
475
+ "model.vision.blocks.19.mlp.fc2.bias": "model-00001-of-00004.safetensors",
476
+ "model.vision.blocks.19.mlp.fc2.weight": "model-00001-of-00004.safetensors",
477
+ "model.vision.blocks.2.attn.proj.bias": "model-00001-of-00004.safetensors",
478
+ "model.vision.blocks.2.attn.proj.weight": "model-00001-of-00004.safetensors",
479
+ "model.vision.blocks.2.attn.qkv.bias": "model-00001-of-00004.safetensors",
480
+ "model.vision.blocks.2.attn.qkv.weight": "model-00001-of-00004.safetensors",
481
+ "model.vision.blocks.2.ln1.bias": "model-00001-of-00004.safetensors",
482
+ "model.vision.blocks.2.ln1.weight": "model-00001-of-00004.safetensors",
483
+ "model.vision.blocks.2.ln2.bias": "model-00001-of-00004.safetensors",
484
+ "model.vision.blocks.2.ln2.weight": "model-00001-of-00004.safetensors",
485
+ "model.vision.blocks.2.mlp.fc1.bias": "model-00001-of-00004.safetensors",
486
+ "model.vision.blocks.2.mlp.fc1.weight": "model-00001-of-00004.safetensors",
487
+ "model.vision.blocks.2.mlp.fc2.bias": "model-00001-of-00004.safetensors",
488
+ "model.vision.blocks.2.mlp.fc2.weight": "model-00001-of-00004.safetensors",
489
+ "model.vision.blocks.20.attn.proj.bias": "model-00001-of-00004.safetensors",
490
+ "model.vision.blocks.20.attn.proj.weight": "model-00001-of-00004.safetensors",
491
+ "model.vision.blocks.20.attn.qkv.bias": "model-00001-of-00004.safetensors",
492
+ "model.vision.blocks.20.attn.qkv.weight": "model-00001-of-00004.safetensors",
493
+ "model.vision.blocks.20.ln1.bias": "model-00001-of-00004.safetensors",
494
+ "model.vision.blocks.20.ln1.weight": "model-00001-of-00004.safetensors",
495
+ "model.vision.blocks.20.ln2.bias": "model-00001-of-00004.safetensors",
496
+ "model.vision.blocks.20.ln2.weight": "model-00001-of-00004.safetensors",
497
+ "model.vision.blocks.20.mlp.fc1.bias": "model-00001-of-00004.safetensors",
498
+ "model.vision.blocks.20.mlp.fc1.weight": "model-00001-of-00004.safetensors",
499
+ "model.vision.blocks.20.mlp.fc2.bias": "model-00001-of-00004.safetensors",
500
+ "model.vision.blocks.20.mlp.fc2.weight": "model-00001-of-00004.safetensors",
501
+ "model.vision.blocks.21.attn.proj.bias": "model-00001-of-00004.safetensors",
502
+ "model.vision.blocks.21.attn.proj.weight": "model-00001-of-00004.safetensors",
503
+ "model.vision.blocks.21.attn.qkv.bias": "model-00001-of-00004.safetensors",
504
+ "model.vision.blocks.21.attn.qkv.weight": "model-00001-of-00004.safetensors",
505
+ "model.vision.blocks.21.ln1.bias": "model-00001-of-00004.safetensors",
506
+ "model.vision.blocks.21.ln1.weight": "model-00001-of-00004.safetensors",
507
+ "model.vision.blocks.21.ln2.bias": "model-00001-of-00004.safetensors",
508
+ "model.vision.blocks.21.ln2.weight": "model-00001-of-00004.safetensors",
509
+ "model.vision.blocks.21.mlp.fc1.bias": "model-00001-of-00004.safetensors",
510
+ "model.vision.blocks.21.mlp.fc1.weight": "model-00001-of-00004.safetensors",
511
+ "model.vision.blocks.21.mlp.fc2.bias": "model-00001-of-00004.safetensors",
512
+ "model.vision.blocks.21.mlp.fc2.weight": "model-00001-of-00004.safetensors",
513
+ "model.vision.blocks.22.attn.proj.bias": "model-00001-of-00004.safetensors",
514
+ "model.vision.blocks.22.attn.proj.weight": "model-00001-of-00004.safetensors",
515
+ "model.vision.blocks.22.attn.qkv.bias": "model-00001-of-00004.safetensors",
516
+ "model.vision.blocks.22.attn.qkv.weight": "model-00001-of-00004.safetensors",
517
+ "model.vision.blocks.22.ln1.bias": "model-00001-of-00004.safetensors",
518
+ "model.vision.blocks.22.ln1.weight": "model-00001-of-00004.safetensors",
519
+ "model.vision.blocks.22.ln2.bias": "model-00001-of-00004.safetensors",
520
+ "model.vision.blocks.22.ln2.weight": "model-00001-of-00004.safetensors",
521
+ "model.vision.blocks.22.mlp.fc1.bias": "model-00001-of-00004.safetensors",
522
+ "model.vision.blocks.22.mlp.fc1.weight": "model-00001-of-00004.safetensors",
523
+ "model.vision.blocks.22.mlp.fc2.bias": "model-00001-of-00004.safetensors",
524
+ "model.vision.blocks.22.mlp.fc2.weight": "model-00001-of-00004.safetensors",
525
+ "model.vision.blocks.23.attn.proj.bias": "model-00001-of-00004.safetensors",
526
+ "model.vision.blocks.23.attn.proj.weight": "model-00001-of-00004.safetensors",
527
+ "model.vision.blocks.23.attn.qkv.bias": "model-00001-of-00004.safetensors",
528
+ "model.vision.blocks.23.attn.qkv.weight": "model-00001-of-00004.safetensors",
529
+ "model.vision.blocks.23.ln1.bias": "model-00001-of-00004.safetensors",
530
+ "model.vision.blocks.23.ln1.weight": "model-00001-of-00004.safetensors",
531
+ "model.vision.blocks.23.ln2.bias": "model-00001-of-00004.safetensors",
532
+ "model.vision.blocks.23.ln2.weight": "model-00001-of-00004.safetensors",
533
+ "model.vision.blocks.23.mlp.fc1.bias": "model-00001-of-00004.safetensors",
534
+ "model.vision.blocks.23.mlp.fc1.weight": "model-00001-of-00004.safetensors",
535
+ "model.vision.blocks.23.mlp.fc2.bias": "model-00001-of-00004.safetensors",
536
+ "model.vision.blocks.23.mlp.fc2.weight": "model-00001-of-00004.safetensors",
537
+ "model.vision.blocks.24.attn.proj.bias": "model-00001-of-00004.safetensors",
538
+ "model.vision.blocks.24.attn.proj.weight": "model-00001-of-00004.safetensors",
539
+ "model.vision.blocks.24.attn.qkv.bias": "model-00001-of-00004.safetensors",
540
+ "model.vision.blocks.24.attn.qkv.weight": "model-00001-of-00004.safetensors",
541
+ "model.vision.blocks.24.ln1.bias": "model-00001-of-00004.safetensors",
542
+ "model.vision.blocks.24.ln1.weight": "model-00001-of-00004.safetensors",
543
+ "model.vision.blocks.24.ln2.bias": "model-00001-of-00004.safetensors",
544
+ "model.vision.blocks.24.ln2.weight": "model-00001-of-00004.safetensors",
545
+ "model.vision.blocks.24.mlp.fc1.bias": "model-00001-of-00004.safetensors",
546
+ "model.vision.blocks.24.mlp.fc1.weight": "model-00001-of-00004.safetensors",
547
+ "model.vision.blocks.24.mlp.fc2.bias": "model-00001-of-00004.safetensors",
548
+ "model.vision.blocks.24.mlp.fc2.weight": "model-00001-of-00004.safetensors",
549
+ "model.vision.blocks.25.attn.proj.bias": "model-00001-of-00004.safetensors",
550
+ "model.vision.blocks.25.attn.proj.weight": "model-00001-of-00004.safetensors",
551
+ "model.vision.blocks.25.attn.qkv.bias": "model-00001-of-00004.safetensors",
552
+ "model.vision.blocks.25.attn.qkv.weight": "model-00001-of-00004.safetensors",
553
+ "model.vision.blocks.25.ln1.bias": "model-00001-of-00004.safetensors",
554
+ "model.vision.blocks.25.ln1.weight": "model-00001-of-00004.safetensors",
555
+ "model.vision.blocks.25.ln2.bias": "model-00001-of-00004.safetensors",
556
+ "model.vision.blocks.25.ln2.weight": "model-00001-of-00004.safetensors",
557
+ "model.vision.blocks.25.mlp.fc1.bias": "model-00001-of-00004.safetensors",
558
+ "model.vision.blocks.25.mlp.fc1.weight": "model-00001-of-00004.safetensors",
559
+ "model.vision.blocks.25.mlp.fc2.bias": "model-00001-of-00004.safetensors",
560
+ "model.vision.blocks.25.mlp.fc2.weight": "model-00001-of-00004.safetensors",
561
+ "model.vision.blocks.26.attn.proj.bias": "model-00001-of-00004.safetensors",
562
+ "model.vision.blocks.26.attn.proj.weight": "model-00001-of-00004.safetensors",
563
+ "model.vision.blocks.26.attn.qkv.bias": "model-00001-of-00004.safetensors",
564
+ "model.vision.blocks.26.attn.qkv.weight": "model-00001-of-00004.safetensors",
565
+ "model.vision.blocks.26.ln1.bias": "model-00001-of-00004.safetensors",
566
+ "model.vision.blocks.26.ln1.weight": "model-00001-of-00004.safetensors",
567
+ "model.vision.blocks.26.ln2.bias": "model-00001-of-00004.safetensors",
568
+ "model.vision.blocks.26.ln2.weight": "model-00001-of-00004.safetensors",
569
+ "model.vision.blocks.26.mlp.fc1.bias": "model-00001-of-00004.safetensors",
570
+ "model.vision.blocks.26.mlp.fc1.weight": "model-00001-of-00004.safetensors",
571
+ "model.vision.blocks.26.mlp.fc2.bias": "model-00001-of-00004.safetensors",
572
+ "model.vision.blocks.26.mlp.fc2.weight": "model-00001-of-00004.safetensors",
573
+ "model.vision.blocks.3.attn.proj.bias": "model-00001-of-00004.safetensors",
574
+ "model.vision.blocks.3.attn.proj.weight": "model-00001-of-00004.safetensors",
575
+ "model.vision.blocks.3.attn.qkv.bias": "model-00001-of-00004.safetensors",
576
+ "model.vision.blocks.3.attn.qkv.weight": "model-00001-of-00004.safetensors",
577
+ "model.vision.blocks.3.ln1.bias": "model-00001-of-00004.safetensors",
578
+ "model.vision.blocks.3.ln1.weight": "model-00001-of-00004.safetensors",
579
+ "model.vision.blocks.3.ln2.bias": "model-00001-of-00004.safetensors",
580
+ "model.vision.blocks.3.ln2.weight": "model-00001-of-00004.safetensors",
581
+ "model.vision.blocks.3.mlp.fc1.bias": "model-00001-of-00004.safetensors",
582
+ "model.vision.blocks.3.mlp.fc1.weight": "model-00001-of-00004.safetensors",
583
+ "model.vision.blocks.3.mlp.fc2.bias": "model-00001-of-00004.safetensors",
584
+ "model.vision.blocks.3.mlp.fc2.weight": "model-00001-of-00004.safetensors",
585
+ "model.vision.blocks.4.attn.proj.bias": "model-00001-of-00004.safetensors",
586
+ "model.vision.blocks.4.attn.proj.weight": "model-00001-of-00004.safetensors",
587
+ "model.vision.blocks.4.attn.qkv.bias": "model-00001-of-00004.safetensors",
588
+ "model.vision.blocks.4.attn.qkv.weight": "model-00001-of-00004.safetensors",
589
+ "model.vision.blocks.4.ln1.bias": "model-00001-of-00004.safetensors",
590
+ "model.vision.blocks.4.ln1.weight": "model-00001-of-00004.safetensors",
591
+ "model.vision.blocks.4.ln2.bias": "model-00001-of-00004.safetensors",
592
+ "model.vision.blocks.4.ln2.weight": "model-00001-of-00004.safetensors",
593
+ "model.vision.blocks.4.mlp.fc1.bias": "model-00001-of-00004.safetensors",
594
+ "model.vision.blocks.4.mlp.fc1.weight": "model-00001-of-00004.safetensors",
595
+ "model.vision.blocks.4.mlp.fc2.bias": "model-00001-of-00004.safetensors",
596
+ "model.vision.blocks.4.mlp.fc2.weight": "model-00001-of-00004.safetensors",
597
+ "model.vision.blocks.5.attn.proj.bias": "model-00001-of-00004.safetensors",
598
+ "model.vision.blocks.5.attn.proj.weight": "model-00001-of-00004.safetensors",
599
+ "model.vision.blocks.5.attn.qkv.bias": "model-00001-of-00004.safetensors",
600
+ "model.vision.blocks.5.attn.qkv.weight": "model-00001-of-00004.safetensors",
601
+ "model.vision.blocks.5.ln1.bias": "model-00001-of-00004.safetensors",
602
+ "model.vision.blocks.5.ln1.weight": "model-00001-of-00004.safetensors",
603
+ "model.vision.blocks.5.ln2.bias": "model-00001-of-00004.safetensors",
604
+ "model.vision.blocks.5.ln2.weight": "model-00001-of-00004.safetensors",
605
+ "model.vision.blocks.5.mlp.fc1.bias": "model-00001-of-00004.safetensors",
606
+ "model.vision.blocks.5.mlp.fc1.weight": "model-00001-of-00004.safetensors",
607
+ "model.vision.blocks.5.mlp.fc2.bias": "model-00001-of-00004.safetensors",
608
+ "model.vision.blocks.5.mlp.fc2.weight": "model-00001-of-00004.safetensors",
609
+ "model.vision.blocks.6.attn.proj.bias": "model-00001-of-00004.safetensors",
610
+ "model.vision.blocks.6.attn.proj.weight": "model-00001-of-00004.safetensors",
611
+ "model.vision.blocks.6.attn.qkv.bias": "model-00001-of-00004.safetensors",
612
+ "model.vision.blocks.6.attn.qkv.weight": "model-00001-of-00004.safetensors",
613
+ "model.vision.blocks.6.ln1.bias": "model-00001-of-00004.safetensors",
614
+ "model.vision.blocks.6.ln1.weight": "model-00001-of-00004.safetensors",
615
+ "model.vision.blocks.6.ln2.bias": "model-00001-of-00004.safetensors",
616
+ "model.vision.blocks.6.ln2.weight": "model-00001-of-00004.safetensors",
617
+ "model.vision.blocks.6.mlp.fc1.bias": "model-00001-of-00004.safetensors",
618
+ "model.vision.blocks.6.mlp.fc1.weight": "model-00001-of-00004.safetensors",
619
+ "model.vision.blocks.6.mlp.fc2.bias": "model-00001-of-00004.safetensors",
620
+ "model.vision.blocks.6.mlp.fc2.weight": "model-00001-of-00004.safetensors",
621
+ "model.vision.blocks.7.attn.proj.bias": "model-00001-of-00004.safetensors",
622
+ "model.vision.blocks.7.attn.proj.weight": "model-00001-of-00004.safetensors",
623
+ "model.vision.blocks.7.attn.qkv.bias": "model-00001-of-00004.safetensors",
624
+ "model.vision.blocks.7.attn.qkv.weight": "model-00001-of-00004.safetensors",
625
+ "model.vision.blocks.7.ln1.bias": "model-00001-of-00004.safetensors",
626
+ "model.vision.blocks.7.ln1.weight": "model-00001-of-00004.safetensors",
627
+ "model.vision.blocks.7.ln2.bias": "model-00001-of-00004.safetensors",
628
+ "model.vision.blocks.7.ln2.weight": "model-00001-of-00004.safetensors",
629
+ "model.vision.blocks.7.mlp.fc1.bias": "model-00001-of-00004.safetensors",
630
+ "model.vision.blocks.7.mlp.fc1.weight": "model-00001-of-00004.safetensors",
631
+ "model.vision.blocks.7.mlp.fc2.bias": "model-00001-of-00004.safetensors",
632
+ "model.vision.blocks.7.mlp.fc2.weight": "model-00001-of-00004.safetensors",
633
+ "model.vision.blocks.8.attn.proj.bias": "model-00001-of-00004.safetensors",
634
+ "model.vision.blocks.8.attn.proj.weight": "model-00001-of-00004.safetensors",
635
+ "model.vision.blocks.8.attn.qkv.bias": "model-00001-of-00004.safetensors",
636
+ "model.vision.blocks.8.attn.qkv.weight": "model-00001-of-00004.safetensors",
637
+ "model.vision.blocks.8.ln1.bias": "model-00001-of-00004.safetensors",
638
+ "model.vision.blocks.8.ln1.weight": "model-00001-of-00004.safetensors",
639
+ "model.vision.blocks.8.ln2.bias": "model-00001-of-00004.safetensors",
640
+ "model.vision.blocks.8.ln2.weight": "model-00001-of-00004.safetensors",
641
+ "model.vision.blocks.8.mlp.fc1.bias": "model-00001-of-00004.safetensors",
642
+ "model.vision.blocks.8.mlp.fc1.weight": "model-00001-of-00004.safetensors",
643
+ "model.vision.blocks.8.mlp.fc2.bias": "model-00001-of-00004.safetensors",
644
+ "model.vision.blocks.8.mlp.fc2.weight": "model-00001-of-00004.safetensors",
645
+ "model.vision.blocks.9.attn.proj.bias": "model-00001-of-00004.safetensors",
646
+ "model.vision.blocks.9.attn.proj.weight": "model-00001-of-00004.safetensors",
647
+ "model.vision.blocks.9.attn.qkv.bias": "model-00001-of-00004.safetensors",
648
+ "model.vision.blocks.9.attn.qkv.weight": "model-00001-of-00004.safetensors",
649
+ "model.vision.blocks.9.ln1.bias": "model-00001-of-00004.safetensors",
650
+ "model.vision.blocks.9.ln1.weight": "model-00001-of-00004.safetensors",
651
+ "model.vision.blocks.9.ln2.bias": "model-00001-of-00004.safetensors",
652
+ "model.vision.blocks.9.ln2.weight": "model-00001-of-00004.safetensors",
653
+ "model.vision.blocks.9.mlp.fc1.bias": "model-00001-of-00004.safetensors",
654
+ "model.vision.blocks.9.mlp.fc1.weight": "model-00001-of-00004.safetensors",
655
+ "model.vision.blocks.9.mlp.fc2.bias": "model-00001-of-00004.safetensors",
656
+ "model.vision.blocks.9.mlp.fc2.weight": "model-00001-of-00004.safetensors",
657
+ "model.vision.patch_emb.bias": "model-00001-of-00004.safetensors",
658
+ "model.vision.patch_emb.weight": "model-00001-of-00004.safetensors",
659
+ "model.vision.pos_emb": "model-00001-of-00004.safetensors",
660
+ "model.vision.post_ln.bias": "model-00001-of-00004.safetensors",
661
+ "model.vision.post_ln.weight": "model-00001-of-00004.safetensors",
662
+ "model.vision.proj_mlp.fc1.bias": "model-00001-of-00004.safetensors",
663
+ "model.vision.proj_mlp.fc1.weight": "model-00001-of-00004.safetensors",
664
+ "model.vision.proj_mlp.fc2.bias": "model-00001-of-00004.safetensors",
665
+ "model.vision.proj_mlp.fc2.weight": "model-00001-of-00004.safetensors"
666
+ }
667
+ }
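The index above maps every tensor name to one of the four shard files. A minimal sketch of how such an index is typically consumed, assuming the standard weight_map layout and that the shards sit next to the index file (this helper is illustrative only and is not part of the upload):

import json
from safetensors.torch import load_file

def load_sharded_state_dict(index_path: str = "model.safetensors.index.json") -> dict:
    with open(index_path) as f:
        index = json.load(f)
    state_dict = {}
    # Each weight_map entry points a tensor name, e.g.
    # "model.text.blocks.17.ln.weight", at the shard that stores it,
    # e.g. "model-00003-of-00004.safetensors".
    for shard in sorted(set(index["weight_map"].values())):
        state_dict.update(load_file(shard))
    return state_dict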
moondream.py ADDED
@@ -0,0 +1,1070 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import random
4
+
5
+ from typing import Literal, Tuple, TypedDict, Union, Dict, Any, Optional, List
6
+ from PIL import Image
7
+ from dataclasses import dataclass
8
+ from tokenizers import Tokenizer
9
+ from torch.nn.attention.flex_attention import create_block_mask
10
+
11
+ from .config import MoondreamConfig
12
+ from .image_crops import reconstruct_from_crops
13
+ from .vision import vision_encoder, vision_projection, prepare_crops, build_vision_model
14
+ from .text import build_text_model, text_encoder, lm_head, text_decoder
15
+ from .region import (
16
+ decode_coordinate,
17
+ encode_coordinate,
18
+ decode_size,
19
+ encode_size,
20
+ encode_spatial_refs,
21
+ SpatialRefs,
22
+ )
23
+ from .layers import QuantizedLinear
24
+ from .lora import variant_state_dict
25
+ from .utils import remove_outlier_points
26
+
27
+ ImageEncodingSettings = TypedDict(
28
+ "ImageEncodingSettings",
29
+ {"variant": str},
30
+ total=False,
31
+ )
32
+
33
+ TextSamplingSettings = TypedDict(
34
+ "TextSamplingSettings",
35
+ {
36
+ "max_tokens": int,
37
+ "temperature": float,
38
+ "top_p": float,
39
+ "variant": str,
40
+ },
41
+ total=False,
42
+ )
43
+
44
+ ObjectSamplingSettings = TypedDict(
45
+ "ObjectSamplingSettings",
46
+ {"max_objects": int, "variant": str},
47
+ total=False,
48
+ )
49
+
50
+
51
+ DEFAULT_MAX_TOKENS = 768
52
+ DEFAULT_TEMPERATURE = 0.5
53
+ DEFAULT_TOP_P = 0.9
54
+ DEFAULT_MAX_OBJECTS = 150
55
+
56
+
57
+ @dataclass(frozen=True)
58
+ class EncodedImage:
59
+ pos: int
60
+ caches: List[Tuple[torch.Tensor, torch.Tensor]]
61
+
62
+
63
+ class KVCache(nn.Module):
64
+
65
+ def __init__(self, n_heads, n_kv_heads, max_context, dim, device, dtype):
66
+ super().__init__()
67
+ cache_shape = (1, n_kv_heads, max_context, dim // n_heads)
68
+ self.register_buffer(
69
+ "k_cache", torch.zeros(*cache_shape, device=device, dtype=dtype)
70
+ )
71
+ self.register_buffer(
72
+ "v_cache", torch.zeros(*cache_shape, device=device, dtype=dtype)
73
+ )
74
+
75
+ def update(self, pos_ids, k, v):
76
+ kout, vout = self.k_cache, self.v_cache
77
+ kout[:, :, pos_ids, :] = k
78
+ vout[:, :, pos_ids, :] = v
79
+ return kout, vout
80
+
81
+
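# Editorial sketch, not part of the uploaded file: KVCache preallocates
# (1, n_kv_heads, max_context, dim // n_heads) buffers and scatters one new
# key/value row per decode step, so no memory is allocated during generation.
_demo_cache = KVCache(n_heads=1, n_kv_heads=1, max_context=8, dim=4,
                      device="cpu", dtype=torch.float32)
_k = _v = torch.ones(1, 1, 1, 4)
_k_all, _v_all = _demo_cache.update(torch.tensor([3]), _k, _v)
# Row 3 of both buffers now holds the new entry; every other row remains zero.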
82
+ def causal_mask(b, h, q_idx, kv_idx):
83
+ return q_idx >= kv_idx
84
+
85
+
86
+ def get_mask_mod(mask_mod, offset):
87
+ def _mask_mod(b, h, q, kv):
88
+ return mask_mod(b, h, q + offset, kv)
89
+
90
+ return _mask_mod
91
+
92
+
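# Editorial sketch, not part of the uploaded file: during single-token decoding
# the query length is 1, so get_mask_mod rebases the causal predicate onto the
# token's absolute position. With plain ints:
_shifted = get_mask_mod(causal_mask, offset=7)
assert _shifted(0, 0, 0, 5)      # absolute query position 7 may attend to kv 5
assert not _shifted(0, 0, 0, 9)  # but not to kv 9, which lies in the future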
93
+ class MoondreamModel(nn.Module):
94
+
95
+ def __init__(
96
+ self, config: MoondreamConfig, dtype=torch.bfloat16, setup_caches=True
97
+ ):
98
+ super().__init__()
99
+ self.config = config
100
+
101
+ self.tokenizer = Tokenizer.from_pretrained("moondream/starmie-v1")
102
+ self.vision = build_vision_model(config.vision, dtype)
103
+ self.text = build_text_model(config.text, dtype)
104
+
105
+ # Region Model
106
+ linear_cls = (
107
+ QuantizedLinear if config.region.group_size is not None else nn.Linear
108
+ )
109
+ self.region = nn.ModuleDict(
110
+ {
111
+ "coord_encoder": linear_cls(
112
+ config.region.coord_feat_dim, config.region.dim, dtype=dtype
113
+ ),
114
+ "coord_decoder": linear_cls(
115
+ config.region.dim, config.region.coord_out_dim, dtype=dtype
116
+ ),
117
+ "size_encoder": linear_cls(
118
+ config.region.size_feat_dim, config.region.dim, dtype=dtype
119
+ ),
120
+ "size_decoder": linear_cls(
121
+ config.region.dim, config.region.size_out_dim, dtype=dtype
122
+ ),
123
+ }
124
+ )
125
+ self.region.coord_features = nn.Parameter(
126
+ torch.empty(config.region.coord_feat_dim // 2, 1, dtype=dtype).T
127
+ )
128
+ self.region.size_features = nn.Parameter(
129
+ torch.empty(config.region.size_feat_dim // 2, 2, dtype=dtype).T
130
+ )
131
+
132
+ attn_mask = torch.tril(
133
+ torch.ones(
134
+ 1, 1, config.text.max_context, config.text.max_context, dtype=torch.bool
135
+ )
136
+ )
137
+ patch_w = config.vision.crop_size // config.vision.enc_patch_size
138
+ prefix_attn_len = 1 + patch_w**2
139
+ attn_mask[..., :prefix_attn_len, :prefix_attn_len] = 1
140
+ self.register_buffer("attn_mask", attn_mask, persistent=False)
141
+
142
+ self.use_flex_decoding = True
143
+ self._causal_block_mask = None
144
+ self._point_gen_indices = None
145
+
146
+ # Initialize KV caches.
147
+ if setup_caches:
148
+ self._setup_caches()
149
+
150
+ @property
151
+ def causal_block_mask(self):
152
+ # The things we do to deal with ZeroGPU...
153
+ if self._causal_block_mask is None:
154
+ self._causal_block_mask = create_block_mask(
155
+ causal_mask,
156
+ B=None,
157
+ H=None,
158
+ Q_LEN=self.config.text.max_context,
159
+ KV_LEN=self.config.text.max_context,
160
+ )
161
+ return self._causal_block_mask
162
+
163
+ @property
164
+ def point_gen_indices(self):
165
+ if self._point_gen_indices is None:
166
+ self._point_gen_indices = torch.tensor(
167
+ [self.config.tokenizer.coord_id, self.config.tokenizer.eos_id],
168
+ device=self.device,
169
+ )
170
+ return self._point_gen_indices
171
+
172
+ def _setup_caches(self):
173
+ c = self.config.text
174
+ for b in self.text.blocks:
175
+ b.kv_cache = KVCache(
176
+ c.n_heads,
177
+ c.n_kv_heads,
178
+ c.max_context,
179
+ c.dim,
180
+ device=self.device,
181
+ dtype=self.vision.pos_emb.dtype,
182
+ )
183
+
184
+ @property
185
+ def device(self):
186
+ return self.vision.pos_emb.device
187
+
188
+ def _vis_enc(self, x: torch.Tensor):
189
+ return vision_encoder(x, self.vision, self.config.vision)
190
+
191
+ def _vis_proj(self, g: torch.Tensor, r: torch.Tensor):
192
+ return vision_projection(g, r, self.vision, self.config.vision)
193
+
194
+ def _prefill(
195
+ self,
196
+ x: torch.Tensor,
197
+ attn_mask: torch.Tensor,
198
+ pos_ids: torch.Tensor,
199
+ lora: Optional[torch.Tensor],
200
+ ):
201
+ return text_decoder(x, self.text, attn_mask, pos_ids, self.config.text, lora)
202
+
203
+ def _decode_one_tok(
204
+ self,
205
+ x: torch.Tensor,
206
+ attn_mask: torch.Tensor,
207
+ pos_ids: torch.Tensor,
208
+ lora: Optional[torch.Tensor],
209
+ lm_head_indices: Optional[torch.Tensor] = None,
210
+ ):
211
+ if self.use_flex_decoding:
212
+ torch._assert(pos_ids.shape[-1] == 1, "Invalid position ID shape")
213
+ block_index = pos_ids // self.causal_block_mask.BLOCK_SIZE[0]
214
+ mask = self.causal_block_mask[:, :, block_index]
215
+ mask.seq_lengths = (1, mask.seq_lengths[1])
216
+ mask.mask_mod = get_mask_mod(self.causal_block_mask.mask_mod, pos_ids[0])
217
+ else:
218
+ mask = None
219
+
220
+ hidden = text_decoder(
221
+ x,
222
+ self.text,
223
+ attn_mask,
224
+ pos_ids,
225
+ self.config.text,
226
+ lora=lora,
227
+ flex_block_mask_slice=mask,
228
+ )
229
+ logits = lm_head(hidden, self.text, indices=lm_head_indices)
230
+ return logits, hidden
231
+
232
+ def compile(self):
233
+ for module in self.modules():
234
+ if isinstance(module, QuantizedLinear):
235
+ module.unpack()
236
+
237
+ # Initialize lazy properties to avoid first-call overhead
238
+ self.causal_block_mask
239
+ self.point_gen_indices
240
+
241
+ # TODO: vision_projection and _prefill are not being compiled
242
+ self._vis_enc = torch.compile(self._vis_enc, fullgraph=True)
243
+ self._decode_one_tok = torch.compile(
244
+ self._decode_one_tok, fullgraph=True, mode="reduce-overhead"
245
+ )
246
+
247
+ # Warm up compiled methods with dummy forward passes
248
+ device = self.device
249
+ dtype = self.vision.pos_emb.dtype
250
+ with torch.no_grad():
251
+ # Warmup vision encoder
252
+ dummy_crops = torch.randn(1, 3, 378, 378, device=device, dtype=dtype)
253
+ self._vis_enc(dummy_crops)
254
+
255
+ # Warmup _decode_one_tok (both normal and point generation modes)
256
+ dummy_emb = torch.randn(
257
+ 1, 1, self.config.text.dim, device=device, dtype=dtype
258
+ )
259
+ dummy_mask = torch.ones(
260
+ 1, 1, self.config.text.max_context, device=device, dtype=torch.bool
261
+ )
262
+ dummy_pos_ids = torch.tensor([100], device=device, dtype=torch.long)
263
+ self._decode_one_tok(dummy_emb, dummy_mask, dummy_pos_ids, None)
264
+ self._decode_one_tok(
265
+ dummy_emb,
266
+ dummy_mask,
267
+ dummy_pos_ids,
268
+ None,
269
+ lm_head_indices=self.point_gen_indices,
270
+ )
271
+
272
+ def _run_vision_encoder(self, image: Image.Image) -> torch.Tensor:
273
+ all_crops, tiling = prepare_crops(image, self.config.vision, device=self.device)
274
+
275
+ torch._dynamo.mark_dynamic(all_crops, 0)
276
+
277
+ outputs = self._vis_enc(all_crops)
278
+
279
+ global_features = outputs[0]
280
+ local_features = outputs[1:].view(
281
+ -1,
282
+ self.config.vision.enc_n_layers,
283
+ self.config.vision.enc_n_layers,
284
+ self.config.vision.enc_dim,
285
+ )
286
+
287
+ reconstructed = reconstruct_from_crops(
288
+ local_features,
289
+ tiling,
290
+ patch_size=1,
291
+ overlap_margin=self.config.vision.overlap_margin,
292
+ )
293
+
294
+ return self._vis_proj(global_features, reconstructed)
295
+
296
+ def encode_image(
297
+ self,
298
+ image: Union[Image.Image, EncodedImage],
299
+ settings: Optional[ImageEncodingSettings] = None,
300
+ ) -> EncodedImage:
301
+ if isinstance(image, EncodedImage):
302
+ return image
303
+ elif not isinstance(image, Image.Image):
304
+ raise ValueError("image must be a PIL Image or EncodedImage")
305
+
306
+ lora = (
307
+ variant_state_dict(settings["variant"], device=self.device)
308
+ if settings is not None and "variant" in settings
309
+ else None
310
+ )
311
+
312
+ # Run through text model in addition to the vision encoder, to minimize
313
+ # re-computation if multiple queries are performed on this image.
314
+ with torch.inference_mode():
315
+ img_emb = self._run_vision_encoder(image)
316
+ bos_emb = text_encoder(
317
+ torch.tensor([[self.config.tokenizer.bos_id]], device=self.device),
318
+ self.text,
319
+ )
320
+ inputs_embeds = torch.cat([bos_emb, img_emb[None]], dim=1)
321
+ mask = self.attn_mask[:, :, 0 : inputs_embeds.size(1), :]
322
+ pos_ids = torch.arange(
323
+ inputs_embeds.size(1), dtype=torch.long, device=self.device
324
+ )
325
+ self._prefill(inputs_embeds, mask, pos_ids, lora)
326
+
327
+ return EncodedImage(
328
+ pos=inputs_embeds.size(1),
329
+ caches=[
330
+ (
331
+ b.kv_cache.k_cache[:, :, : inputs_embeds.size(1), :].clone(),
332
+ b.kv_cache.v_cache[:, :, : inputs_embeds.size(1), :].clone(),
333
+ )
334
+ for b in self.text.blocks
335
+ ],
336
+ )
337
+
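# Editorial usage sketch, not part of the uploaded file: encode_image runs the
# vision encoder and the image prefill once, so the returned EncodedImage can
# be reused across several queries without recomputing them. "model" and
# "example.jpg" below are placeholders for an already-constructed
# MoondreamModel and a local image:
#
#     encoded = model.encode_image(Image.open("example.jpg"))
#     for q in ("What is in the image?", "Where is the dog?"):
#         result = model.query(image=encoded, question=q)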
338
+ def _apply_top_p(self, probs: torch.Tensor, top_p: float):
339
+ probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
340
+ probs_sum = torch.cumsum(probs_sort, dim=-1)
341
+ mask = probs_sum - probs_sort > top_p
342
+ probs_sort[mask] = 0.0
343
+ probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
344
+ next_probs = torch.zeros_like(probs)
345
+ next_probs.scatter_(dim=-1, index=probs_idx, src=probs_sort)
346
+ return next_probs
347
+
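# Editorial note, not part of the uploaded file: _apply_top_p implements
# nucleus (top-p) filtering. Worked example with probs = [0.50, 0.30, 0.15, 0.05]
# and top_p = 0.9:
#     sorted probs   : 0.50   0.30   0.15   0.05
#     cumulative sum : 0.50   0.80   0.95   1.00
#     cumsum - prob  : 0.00   0.50   0.80   0.95   (mass strictly before each token)
#     kept (<= 0.9)  : yes    yes    yes    no
#     renormalized   : 0.526  0.316  0.158  0.0    (divided by the kept mass, 0.95)
# A token is dropped only once the probability mass ahead of it already exceeds
# top_p, so the most likely token always survives.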
348
+ def _prefill_prompt(
349
+ self,
350
+ prompt_tokens: torch.Tensor,
351
+ pos: int,
352
+ temperature: float,
353
+ top_p: float,
354
+ spatial_refs: Optional[SpatialRefs] = None,
355
+ attn_mask: Optional[torch.Tensor] = None,
356
+ lora: Optional[dict] = None,
357
+ ):
358
+ with torch.inference_mode():
359
+ prompt_emb = text_encoder(prompt_tokens, self.text)
360
+
361
+ if spatial_refs:
362
+ encoded_refs = encode_spatial_refs(spatial_refs, self.region)
363
+ prompt_emb[prompt_tokens == self.config.tokenizer.coord_id] = (
364
+ encoded_refs["coords"]
365
+ )
366
+ if encoded_refs["sizes"] is not None:
367
+ prompt_emb[prompt_tokens == self.config.tokenizer.size_id] = (
368
+ encoded_refs["sizes"]
369
+ )
370
+
371
+ torch._dynamo.mark_dynamic(prompt_emb, 1)
372
+
373
+ if attn_mask is None:
374
+ attn_mask = self.attn_mask
375
+
376
+ mask = attn_mask[:, :, pos : pos + prompt_emb.size(1), :]
377
+ pos_ids = torch.arange(
378
+ pos, pos + prompt_emb.size(1), dtype=torch.long, device=self.device
379
+ )
380
+ hidden_BC = self._prefill(prompt_emb, mask, pos_ids, lora)
381
+ logits_BV = lm_head(hidden_BC, self.text)
382
+
383
+ if temperature == 0:
384
+ next_token = torch.argmax(logits_BV, dim=-1).unsqueeze(1)
385
+ else:
386
+ probs = torch.softmax(logits_BV / temperature, dim=-1)
387
+ probs = self._apply_top_p(probs, top_p)
388
+ next_token = torch.multinomial(probs, num_samples=1)
389
+
390
+ pos = pos + prompt_emb.size(1)
391
+ return logits_BV, hidden_BC, next_token, pos
392
+
393
+ def _generate_reasoning(
394
+ self,
395
+ prompt_tokens,
396
+ pos,
397
+ settings: Optional[TextSamplingSettings] = None,
398
+ spatial_refs: Optional[SpatialRefs] = None,
399
+ attn_mask: Optional[torch.Tensor] = None,
400
+ ) -> Tuple[int, str, List[dict]]:
401
+ max_tokens = (
402
+ settings.get("max_tokens", DEFAULT_MAX_TOKENS)
403
+ if settings
404
+ else DEFAULT_MAX_TOKENS
405
+ )
406
+ temperature = (
407
+ settings.get("temperature", DEFAULT_TEMPERATURE)
408
+ if settings
409
+ else DEFAULT_TEMPERATURE
410
+ )
411
+ lora = (
412
+ variant_state_dict(settings["variant"], device=self.device)
413
+ if settings is not None and "variant" in settings
414
+ else None
415
+ )
416
+
417
+ top_p = settings.get("top_p", DEFAULT_TOP_P) if settings else DEFAULT_TOP_P
418
+ eos_id = self.config.tokenizer.answer_id
419
+
420
+ _, last_hidden_BC, next_token, pos = self._prefill_prompt(
421
+ prompt_tokens,
422
+ pos,
423
+ temperature,
424
+ top_p,
425
+ spatial_refs,
426
+ attn_mask=attn_mask,
427
+ lora=lora,
428
+ )
429
+
430
+ text_token_chunks = [[]]
431
+ grounding_chunks = [[]]
432
+
433
+ mask = torch.zeros(
434
+ 1, 1, self.config.text.max_context, device=self.device, dtype=torch.bool
435
+ )
436
+ mask[:, :, :pos] = 1
437
+ pos_ids = torch.tensor([pos], device=self.device, dtype=torch.long)
438
+ generated_tokens = 0
439
+
440
+ while (
441
+ next_token_id := next_token.item()
442
+ ) != eos_id and generated_tokens < max_tokens:
443
+ if (
444
+ next_token_id == self.config.tokenizer.start_ground_points_id
445
+ or next_token_id == self.config.tokenizer.end_ground_id
446
+ ):
447
+ text_token_chunks.append([])
448
+ grounding_chunks.append([])
449
+
450
+ text_token_chunks[-1].append(next_token_id)
451
+
452
+ with torch.inference_mode():
453
+ if next_token_id == self.config.tokenizer.coord_id:
454
+ coord_logits = decode_coordinate(last_hidden_BC, self.region)
455
+ coord = torch.argmax(coord_logits, dim=-1) / coord_logits.size(-1)
456
+ grounding_chunks[-1].append(coord.item())
457
+
458
+ next_emb = encode_coordinate(
459
+ coord.to(dtype=coord_logits.dtype), self.region
460
+ ).unsqueeze(0)
461
+ else:
462
+ next_emb = text_encoder(next_token, self.text)
463
+
464
+ mask[:, :, pos], pos_ids[0] = 1, pos
465
+
466
+ logits_BV, last_hidden_BC = self._decode_one_tok(
467
+ next_emb, mask, pos_ids, lora
468
+ )
469
+ logits_BV[:, self.config.tokenizer.eos_id] = float("-inf")
470
+ logits_BV[:, self.config.tokenizer.size_id] = float("-inf")
471
+
472
+ pos += 1
473
+
474
+ if temperature == 0:
475
+ next_token = torch.argmax(logits_BV, dim=-1).unsqueeze(1) # (1, 1)
476
+ else:
477
+ probs = torch.softmax(logits_BV / temperature, dim=-1) # (1, V)
478
+ probs = self._apply_top_p(probs, top_p)
479
+ next_token = torch.multinomial(probs, num_samples=1) # (1, 1)
480
+
481
+ generated_tokens += 1
482
+
483
+ text_chunks = [
484
+ self.tokenizer.decode(chunk_tokens) for chunk_tokens in text_token_chunks
485
+ ]
486
+ text = "".join(text_chunks)
487
+
488
+ start_idx = 0
489
+ grounding = []
490
+ for text_chunk, grounding_chunk in zip(text_chunks, grounding_chunks):
491
+ if len(grounding_chunk) > 1:
492
+ points = []
493
+ for i in range(0, len(grounding_chunk) - (len(grounding_chunk) % 2), 2):
494
+ points.append((grounding_chunk[i], grounding_chunk[i + 1]))
495
+ grounding.append(
496
+ {
497
+ "start_idx": start_idx,
498
+ "end_idx": start_idx + len(text_chunk),
499
+ "points": points,
500
+ }
501
+ )
502
+ start_idx += len(text_chunk)
503
+
504
+ return pos, text, grounding
505
+
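# Editorial note, not part of the uploaded file: the grounding list built above
# maps character spans of the reasoning text to normalized image coordinates.
# Each coord token is decoded as argmax over the coordinate bins divided by the
# number of bins, so values fall in [0, 1), and consecutive decoded coordinates
# are paired up (presumably x then y). A single entry looks roughly like:
#     {"start_idx": 0, "end_idx": 18, "points": [(0.42, 0.61)]}
# where start_idx/end_idx index into the returned text.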
506
+ def _generate_answer(
507
+ self,
508
+ prompt_tokens: torch.Tensor,
509
+ pos: int,
510
+ settings: Optional[TextSamplingSettings] = None,
511
+ spatial_refs: Optional[SpatialRefs] = None,
512
+ eos_id: Optional[int] = None,
513
+ attn_mask: Optional[torch.Tensor] = None,
514
+ ):
515
+ max_tokens = (
516
+ settings.get("max_tokens", DEFAULT_MAX_TOKENS)
517
+ if settings
518
+ else DEFAULT_MAX_TOKENS
519
+ )
520
+ temperature = (
521
+ settings.get("temperature", DEFAULT_TEMPERATURE)
522
+ if settings
523
+ else DEFAULT_TEMPERATURE
524
+ )
525
+ top_p = settings.get("top_p", DEFAULT_TOP_P) if settings else DEFAULT_TOP_P
526
+ eos_id = eos_id if eos_id is not None else self.config.tokenizer.eos_id
527
+ lora = (
528
+ variant_state_dict(settings["variant"], device=self.device)
529
+ if settings is not None and "variant" in settings
530
+ else None
531
+ )
532
+
533
+ _, _, next_token, pos = self._prefill_prompt(
534
+ prompt_tokens,
535
+ pos,
536
+ temperature,
537
+ top_p,
538
+ spatial_refs,
539
+ attn_mask=attn_mask,
540
+ lora=lora,
541
+ )
542
+
543
+ def generator(next_token, pos):
544
+ mask = torch.zeros(
545
+ 1, 1, self.config.text.max_context, device=self.device, dtype=torch.bool
546
+ )
547
+ mask[:, :, :pos] = 1
548
+ pos_ids = torch.tensor([pos], device=self.device, dtype=torch.long)
549
+ generated_tokens = 0
550
+
551
+ # For properly handling token streaming with Unicode
552
+ token_cache = []
553
+ print_len = 0
554
+
555
+ while (
556
+ next_token_id := next_token.item()
557
+ ) != eos_id and generated_tokens < max_tokens:
558
+ # Add token to our cache
559
+ token_cache.append(next_token_id)
560
+
561
+ # Decode all tokens collected so far
562
+ text = self.tokenizer.decode(token_cache)
563
+
564
+ # After a newline, we flush the cache completely
565
+ if text.endswith("\n"):
566
+ printable_text = text[print_len:]
567
+ token_cache = []
568
+ print_len = 0
569
+ if printable_text:
570
+ yield printable_text
571
+ # If the last token is a CJK character, we can safely print it
572
+ elif len(text) > 0 and _is_cjk_char(ord(text[-1])):
573
+ printable_text = text[print_len:]
574
+ print_len += len(printable_text)
575
+ if printable_text:
576
+ yield printable_text
577
+ # Otherwise, only yield up to the last space to avoid cutting words
578
+ else:
579
+ last_space_idx = text.rfind(" ", print_len)
580
+ if last_space_idx >= print_len:
581
+ printable_text = text[print_len : last_space_idx + 1]
582
+ print_len += len(printable_text)
583
+ if printable_text:
584
+ yield printable_text
585
+
586
+ with torch.inference_mode():
587
+ next_emb = text_encoder(next_token, self.text)
588
+ mask[:, :, pos], pos_ids[0] = 1, pos
589
+
590
+ logits_BV, _ = self._decode_one_tok(next_emb, mask, pos_ids, lora)
591
+ logits_BV[:, self.config.tokenizer.answer_id] = float("-inf")
592
+
593
+ pos += 1
594
+
595
+ if temperature == 0:
596
+ next_token = torch.argmax(logits_BV, dim=-1).unsqueeze(
597
+ 1
598
+ ) # (1, 1)
599
+ else:
600
+ probs = torch.softmax(logits_BV / temperature, dim=-1) # (1, V)
601
+ probs = self._apply_top_p(probs, top_p)
602
+ next_token = torch.multinomial(probs, num_samples=1) # (1, 1)
603
+
604
+ generated_tokens += 1
605
+
606
+ # Flush any remaining text in the cache
607
+ if token_cache:
608
+ text = self.tokenizer.decode(token_cache)
609
+ printable_text = text[print_len:]
610
+ if printable_text:
611
+ yield printable_text
612
+
613
+ return generator(next_token, pos)
614
+
615
+ def query(
616
+ self,
617
+ image: Optional[Union[Image.Image, EncodedImage]] = None,
618
+ question: Optional[str] = None,
619
+ reasoning: bool = True,
620
+ spatial_refs: Optional[SpatialRefs] = None,
621
+ stream: bool = False,
622
+ settings: Optional[TextSamplingSettings] = None,
623
+ ):
624
+ if self.config.tokenizer.templates["query"] is None:
625
+ raise NotImplementedError("Model does not support querying.")
626
+
627
+ if question is None:
628
+ raise ValueError("question must be provided.")
629
+
630
+ if spatial_refs and image is None:
631
+ raise ValueError("spatial_refs can only be used with an image.")
632
+
633
+ attn_mask = self.attn_mask
634
+ if image is not None:
635
+ image = self.encode_image(image, settings)
636
+ self.load_encoded_image(image)
637
+ pos = image.pos
638
+ prompt_toks = self.config.tokenizer.templates["query"]["prefix"]
639
+ else:
640
+ self._setup_caches()
641
+ pos = 0
642
+ prompt_toks = [
643
+ self.config.tokenizer.bos_id
644
+ ] + self.config.tokenizer.templates["query"]["prefix"]
645
+ max_context = self.config.text.max_context
646
+ attn_mask = torch.tril(
647
+ torch.ones(1, 1, max_context, max_context, dtype=torch.bool)
648
+ ).to(self.device)
649
+
650
+ spatial_toks = []
651
+ if spatial_refs:
652
+ for ref in spatial_refs:
653
+ coord_id = self.config.tokenizer.coord_id
654
+ size_id = self.config.tokenizer.size_id
655
+ if len(ref) == 2:
656
+ spatial_toks.extend([coord_id, coord_id])
657
+ else:
658
+ spatial_toks.extend([coord_id, coord_id, size_id])
659
+
660
+ prompt_tokens = [
661
+ prompt_toks + spatial_toks + self.tokenizer.encode(question).ids
662
+ ]
663
+
664
+ if reasoning:
665
+ prompt_tokens[0] += [self.config.tokenizer.thinking_id]
666
+ prompt_tokens = torch.tensor(prompt_tokens, device=self.device)
667
+ pos, reasoning_text, reasoning_grounding = self._generate_reasoning(
668
+ prompt_tokens, pos, settings, spatial_refs, attn_mask=attn_mask
669
+ )
670
+ prompt_tokens = [self.config.tokenizer.templates["query"]["suffix"]]
671
+ reasoning_dict = {
672
+ "reasoning": {"text": reasoning_text, "grounding": reasoning_grounding}
673
+ }
674
+ else:
675
+ prompt_tokens[0] += self.config.tokenizer.templates["query"]["suffix"]
676
+ reasoning_dict = {}
677
+
678
+ prompt_tokens = torch.tensor(prompt_tokens, device=self.device)
679
+
680
+ def generator():
681
+ for token in self._generate_answer(
682
+ prompt_tokens, pos, settings, spatial_refs, attn_mask=attn_mask
683
+ ):
684
+ yield token
685
+
686
+ if stream:
687
+ return {**reasoning_dict, "answer": generator()}
688
+ else:
689
+ return {**reasoning_dict, "answer": "".join(list(generator()))}
690
+
691
+ def load_encoded_image(self, encoded_image: EncodedImage):
692
+ for b, (k, v) in zip(self.text.blocks, encoded_image.caches):
693
+ b.kv_cache.k_cache[:, :, : k.size(2), :] = k
694
+ b.kv_cache.v_cache[:, :, : v.size(2), :] = v
695
+
696
+ def caption(
697
+ self,
698
+ image: Union[Image.Image, EncodedImage],
699
+ length: Literal["normal", "short", "long"] = "normal",
700
+ stream: bool = False,
701
+ settings: Optional[TextSamplingSettings] = None,
702
+ ):
703
+ if self.config.tokenizer.templates["caption"] is None:
704
+ raise NotImplementedError("Model does not support captioning.")
705
+ if length not in self.config.tokenizer.templates["caption"]:
706
+ raise ValueError(f"Model does not support caption length '{length}'.")
707
+
708
+ image = self.encode_image(image, settings)
709
+ self.load_encoded_image(image)
710
+
711
+ prompt_tokens = torch.tensor(
712
+ [self.config.tokenizer.templates["caption"][length]], device=self.device
713
+ )
714
+
715
+ def generator():
716
+ for token in self._generate_answer(prompt_tokens, image.pos, settings):
717
+ yield token
718
+
719
+ if stream:
720
+ return {"caption": generator()}
721
+ else:
722
+ return {"caption": "".join(list(generator()))}
723
+
724
+ def _generate_points(
725
+ self,
726
+ hidden: torch.Tensor,
727
+ next_token: torch.Tensor,
728
+ pos: int,
729
+ include_size: bool = True,
730
+ max_objects: int = DEFAULT_MAX_OBJECTS,
731
+ lora: Optional[dict] = None,
732
+ ):
733
+ out = []
734
+ mask = torch.zeros(
735
+ 1, 1, self.config.text.max_context, device=self.device, dtype=torch.bool
736
+ )
737
+ mask[:, :, :pos] = 1
738
+ pos_ids = torch.tensor([pos], device=self.device, dtype=torch.long)
739
+
740
+ with torch.inference_mode():
741
+ while (
742
+ next_token.item() != self.config.tokenizer.eos_id
743
+ and len(out) < max_objects
744
+ ):
745
+ x_logits = decode_coordinate(hidden, self.region)
746
+ x_center = torch.argmax(x_logits, dim=-1) / x_logits.size(-1)
747
+ next_emb = encode_coordinate(
748
+ x_center.to(dtype=x_logits.dtype), self.region
749
+ ).unsqueeze(0)
750
+
751
+ # Decode y-coordinate
752
+ mask[:, :, pos], pos_ids[0] = 1, pos
753
+ _, hidden = self._decode_one_tok(next_emb, mask, pos_ids, lora)
754
+ pos += 1
755
+ y_logits = decode_coordinate(hidden, self.region)
756
+ y_center = torch.argmax(y_logits, dim=-1) / y_logits.size(-1)
757
+ next_emb = encode_coordinate(
758
+ y_center.to(dtype=y_logits.dtype), self.region
759
+ ).unsqueeze(0)
760
+
761
+ # Decode size
762
+ if include_size:
763
+ mask[:, :, pos], pos_ids[0] = 1, pos
764
+ logits, hidden = self._decode_one_tok(next_emb, mask, pos_ids, lora)
765
+ pos += 1
766
+ size_logits = decode_size(hidden, self.region)
767
+
768
+ # Get bin indices from the logits
769
+ w_bin = torch.argmax(size_logits[0], dim=-1)
770
+ h_bin = torch.argmax(size_logits[1], dim=-1)
771
+
772
+ # Convert from bin indices to actual size values using the inverse of the log-scale mapping
773
+ # Formula: size = 2^((bin / 1023.0) * 10.0 - 10.0)
774
+ w = torch.pow(2.0, (w_bin.float() / 1023.0) * 10.0 - 10.0)
775
+ h = torch.pow(2.0, (h_bin.float() / 1023.0) * 10.0 - 10.0)
776
+
777
+ next_emb = (
778
+ encode_size(
779
+ torch.tensor(
780
+ [w, h], device=self.device, dtype=size_logits.dtype
781
+ ),
782
+ self.region,
783
+ )
784
+ .unsqueeze(0)
785
+ .unsqueeze(0)
786
+ )
787
+
788
+ # Add object
789
+ out.append(
790
+ {
791
+ "x_min": x_center.item() - w.item() / 2,
792
+ "y_min": y_center.item() - h.item() / 2,
793
+ "x_max": x_center.item() + w.item() / 2,
794
+ "y_max": y_center.item() + h.item() / 2,
795
+ }
796
+ )
797
+ else:
798
+ out.append({"x": x_center.item(), "y": y_center.item()})
799
+
800
+ # Decode next token (x-coordinate, or eos)
801
+ mask[:, :, pos], pos_ids[0] = 1, pos
802
+ logits, hidden = self._decode_one_tok(
803
+ next_emb,
804
+ mask,
805
+ pos_ids,
806
+ lora,
807
+ lm_head_indices=self.point_gen_indices,
808
+ )
809
+ pos += 1
810
+ # Map back: index 0 -> coord_id, index 1 -> eos_id
811
+ next_token_idx = torch.argmax(logits, dim=-1)
812
+ next_token = self.point_gen_indices[next_token_idx]
813
+
814
+ return out
815
+
816
+ def detect(
817
+ self,
818
+ image: Union[Image.Image, EncodedImage],
819
+ object: str,
820
+ settings: Optional[ObjectSamplingSettings] = None,
821
+ ):
822
+ if self.config.tokenizer.templates["detect"] is None:
823
+ raise NotImplementedError("Model does not support object detection.")
824
+
825
+ image = self.encode_image(image, settings)
826
+ self.load_encoded_image(image)
827
+
828
+ prompt_tokens = torch.tensor(
829
+ [
830
+ self.config.tokenizer.templates["detect"]["prefix"]
831
+ + self.tokenizer.encode(" " + object).ids
832
+ + self.config.tokenizer.templates["detect"]["suffix"]
833
+ ],
834
+ device=self.device,
835
+ )
836
+
837
+ lora = (
838
+ variant_state_dict(settings["variant"], device=self.device)
839
+ if settings is not None and "variant" in settings
840
+ else None
841
+ )
842
+
843
+ _, hidden, next_token, pos = self._prefill_prompt(
844
+ prompt_tokens, image.pos, temperature=0, top_p=0, lora=lora
845
+ )
846
+ hidden = hidden[:, -1:, :]
847
+
848
+ max_objects = (
849
+ settings.get("max_objects", DEFAULT_MAX_OBJECTS)
850
+ if settings
851
+ else DEFAULT_MAX_OBJECTS
852
+ )
853
+ objects = self._generate_points(
854
+ hidden,
855
+ next_token,
856
+ pos,
857
+ include_size=True,
858
+ max_objects=max_objects,
859
+ lora=lora,
860
+ )
861
+
862
+ return {"objects": objects}
863
+
864
+ def point(
865
+ self,
866
+ image: Union[Image.Image, EncodedImage],
867
+ object: str,
868
+ settings: Optional[ObjectSamplingSettings] = None,
869
+ ):
870
+ if self.config.tokenizer.templates["point"] is None:
871
+ raise NotImplementedError("Model does not support pointing.")
872
+
873
+ image = self.encode_image(image, settings)
874
+ self.load_encoded_image(image)
875
+
876
+ prompt_tokens = torch.tensor(
877
+ [
878
+ self.config.tokenizer.templates["point"]["prefix"]
879
+ + self.tokenizer.encode(" " + object).ids
880
+ + self.config.tokenizer.templates["point"]["suffix"]
881
+ ],
882
+ device=self.device,
883
+ )
884
+
885
+ lora = (
886
+ variant_state_dict(settings["variant"], device=self.device)
887
+ if settings is not None and "variant" in settings
888
+ else None
889
+ )
890
+
891
+ _, hidden, next_token, pos = self._prefill_prompt(
892
+ prompt_tokens, image.pos, temperature=0, top_p=0, lora=lora
893
+ )
894
+ hidden = hidden[:, -1:, :]
895
+
896
+ max_objects = (
897
+ settings.get("max_objects", DEFAULT_MAX_OBJECTS)
898
+ if settings
899
+ else DEFAULT_MAX_OBJECTS
900
+ )
901
+ objects = self._generate_points(
902
+ hidden,
903
+ next_token,
904
+ pos,
905
+ include_size=False,
906
+ max_objects=max_objects,
907
+ lora=lora,
908
+ )
909
+
910
+ return {"points": objects}
911
+
912
+ def _detect_gaze(
913
+ self,
914
+ image: EncodedImage,
915
+ source: Tuple[float, float],
916
+ force_detect: bool = False,
917
+ ):
918
+ with torch.inference_mode():
919
+ before_emb = text_encoder(
920
+ torch.tensor(
921
+ [self.tokenizer.encode("\n\nPoint:").ids], device=self.device
922
+ ),
923
+ self.text,
924
+ )
925
+ after_emb = text_encoder(
926
+ torch.tensor(
927
+ [self.tokenizer.encode(" gaze\n\n").ids], device=self.device
928
+ ),
929
+ self.text,
930
+ )
931
+ x_emb = encode_coordinate(
932
+ torch.tensor([[[source[0]]]], device=self.device, dtype=torch.bfloat16),
933
+ self.region,
934
+ )
935
+ y_emb = encode_coordinate(
936
+ torch.tensor([[[source[1]]]], device=self.device, dtype=torch.bfloat16),
937
+ self.region,
938
+ )
939
+
940
+ prompt_emb = torch.cat([before_emb, x_emb, y_emb, after_emb], dim=1)
941
+
942
+ self.load_encoded_image(image)
943
+
944
+ mask = self.attn_mask[:, :, image.pos : image.pos + prompt_emb.size(1), :]
945
+ pos_ids = torch.arange(
946
+ image.pos,
947
+ image.pos + prompt_emb.size(1),
948
+ dtype=torch.long,
949
+ device=self.device,
950
+ )
951
+ hidden = self._prefill(prompt_emb, mask, pos_ids, lora=None)
952
+ logits = lm_head(hidden, self.text)
953
+ next_token = torch.argmax(logits, dim=-1)
954
+ pos = image.pos + prompt_emb.size(1)
955
+ hidden = hidden[:, -1:, :]
956
+
957
+ if force_detect:
958
+ next_token = torch.tensor([[0]], device=self.device)
959
+
960
+ if next_token.item() == self.config.tokenizer.eos_id:
961
+ return None
962
+
963
+ gaze = self._generate_points(
964
+ hidden, next_token, pos, include_size=False, max_objects=1
965
+ )
966
+ return gaze[0]
967
+
968
+ def detect_gaze(
969
+ self,
970
+ image: Union[Image.Image, EncodedImage],
971
+ eye: Optional[Tuple[float, float]] = None,
972
+ face: Optional[Dict[str, float]] = None,
973
+ unstable_settings: Dict[str, Any] = {},
974
+ ):
975
+ if "force_detect" in unstable_settings:
976
+ force_detect = unstable_settings["force_detect"]
977
+ else:
978
+ force_detect = False
979
+
980
+ if "prioritize_accuracy" in unstable_settings:
981
+ prioritize_accuracy = unstable_settings["prioritize_accuracy"]
982
+ else:
983
+ prioritize_accuracy = False
984
+
985
+ if not prioritize_accuracy:
986
+ if eye is None:
987
+ raise ValueError("eye must be provided when prioritize_accuracy=False")
988
+ image = self.encode_image(image)
989
+ return {"gaze": self._detect_gaze(image, eye, force_detect=force_detect)}
990
+ else:
991
+ if (
992
+ not isinstance(image, Image.Image)
993
+ and "flip_enc_img" not in unstable_settings
994
+ ):
995
+ raise ValueError(
996
+ "image must be a PIL Image when prioritize_accuracy=True, "
997
+ "or flip_enc_img must be provided"
998
+ )
999
+ if face is None:
1000
+ raise ValueError("face must be provided when prioritize_accuracy=True")
1001
+
1002
+ encoded_image = self.encode_image(image)
1003
+ if (
1004
+ isinstance(image, Image.Image)
1005
+ and "flip_enc_img" not in unstable_settings
1006
+ ):
1007
+ flipped_pil = image.copy()
1008
+ flipped_pil = flipped_pil.transpose(method=Image.FLIP_LEFT_RIGHT)
1009
+ encoded_flipped_image = self.encode_image(flipped_pil)
1010
+ else:
1011
+ encoded_flipped_image = unstable_settings["flip_enc_img"]
1012
+
1013
+ N = 10
1014
+
1015
+ detections = [
1016
+ self._detect_gaze(
1017
+ encoded_image,
1018
+ (
1019
+ random.uniform(face["x_min"], face["x_max"]),
1020
+ random.uniform(face["y_min"], face["y_max"]),
1021
+ ),
1022
+ force_detect=force_detect,
1023
+ )
1024
+ for _ in range(N)
1025
+ ]
1026
+ detections = [
1027
+ (gaze["x"], gaze["y"]) for gaze in detections if gaze is not None
1028
+ ]
1029
+ flipped_detections = [
1030
+ self._detect_gaze(
1031
+ encoded_flipped_image,
1032
+ (
1033
+ 1 - random.uniform(face["x_min"], face["x_max"]),
1034
+ random.uniform(face["y_min"], face["y_max"]),
1035
+ ),
1036
+ force_detect=force_detect,
1037
+ )
1038
+ for _ in range(N)
1039
+ ]
1040
+ detections.extend(
1041
+ [
1042
+ (1 - gaze["x"], gaze["y"])
1043
+ for gaze in flipped_detections
1044
+ if gaze is not None
1045
+ ]
1046
+ )
1047
+
1048
+ if len(detections) < N:
1049
+ return {"gaze": None}
1050
+
1051
+ detections = remove_outlier_points(detections)
1052
+ mean_gaze = (
1053
+ sum(gaze[0] for gaze in detections) / len(detections),
1054
+ sum(gaze[1] for gaze in detections) / len(detections),
1055
+ )
1056
+
1057
+ return {"gaze": {"x": mean_gaze[0], "y": mean_gaze[1]}}
1058
+
1059
+
1060
+ def _is_cjk_char(cp):
1061
+ """Checks whether CP is the codepoint of a CJK character."""
1062
+ # This defines a "chinese character" as anything in the CJK Unicode block:
1063
+ # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
1064
+ if (
1065
+ (cp >= 0x4E00 and cp <= 0x9FFF)
1066
+ or (cp >= 0x3400 and cp <= 0x4DBF)
1067
+ or (cp >= 0x2F800 and cp <= 0x2FA1F)
1068
+ ):
1069
+ return True
1070
+ return False
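
Taken together, the methods above expose the model's skills (query, caption, point, detect, detect_gaze). A minimal driver might look like the sketch below; the constructor, config, and weight-loading entry points are defined elsewhere in the repository, so the names MoondreamModel, MoondreamConfig, and load_weights_into_model are assumptions used purely for illustration. The skill-method calls match the signatures above, and all returned coordinates are normalized to [0, 1].

    # Hypothetical driver script; class/loader names are assumptions, not confirmed by this diff.
    from PIL import Image

    from config import MoondreamConfig            # assumed name
    from moondream import MoondreamModel          # assumed name
    from weights import load_weights_into_model   # assumed name

    model = MoondreamModel(MoondreamConfig())
    load_weights_into_model("model.safetensors", model)

    image = Image.open("example.jpg")

    # Grounded visual question answering; reasoning=True also returns grounded reasoning text.
    out = model.query(image, "What is the person holding?", reasoning=True)
    print(out["reasoning"]["text"])
    print(out["answer"])

    # Open-vocabulary pointing and detection.
    points = model.point(image, "person")["points"]      # [{"x": ..., "y": ...}, ...]
    objects = model.detect(image, "bottle")["objects"]   # [{"x_min": ..., "y_min": ..., "x_max": ..., "y_max": ...}, ...]
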
open_vocab_detect.png ADDED

Git LFS Details

  • SHA256: ea98c762d53ff5b619b51c538fda3e9dce2c895d32bcb38675591e79b55e8054
  • Pointer size: 132 Bytes
  • Size of remote file: 1.53 MB
point_count.png ADDED

Git LFS Details

  • SHA256: fa1d46a37291c46a006abd3579a241fe0fa5995d735883e90f5d5210a843b397
  • Pointer size: 132 Bytes
  • Size of remote file: 2 MB
region.py ADDED
@@ -0,0 +1,134 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import math
4
+
5
+ from typing import List, Tuple, Union
6
+
7
+ SpatialRefs = List[Union[Tuple[float, float], Tuple[float, float, float, float]]]
8
+
9
+
10
+ def fourier_features(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
11
+ """
12
+ Applies Fourier feature mapping to input tensor x using frequency matrix w. This
13
+ projects inputs through sinusoidal functions to create higher dimensional features
14
+ that help mitigate spectral bias - the tendency of neural networks to learn
15
+ low-frequency functions more easily than high-frequency ones. By explicitly
16
+ mapping inputs to higher frequencies through sin/cos transformations, we enable
17
+ better learning of fine details and higher frequency patterns.
18
+
19
+ Args:
20
+ x: Input tensor to transform
21
+ w: Matrix of frequencies for the Fourier features transformation
22
+
23
+ Returns:
24
+ Concatenated cosine and sine transformed features as a tensor
25
+ """
26
+ f = 2 * math.pi * x @ w
27
+ return torch.cat([f.cos(), f.sin()], dim=-1)
28
+
29
+
30
+ def encode_coordinate(coord: torch.Tensor, w: nn.Module) -> torch.Tensor:
31
+ """
32
+ Takes as input a tensor containing a single float coordinate value (x or y)
33
+ and encodes it into hidden states for input to the text model.
34
+
35
+ Args:
36
+ coord: Tensor with single float coordinate value
37
+
38
+ Returns:
39
+ Encoded hidden states tensor for input to text model
40
+ """
41
+ return w.coord_encoder(fourier_features(coord, w.coord_features))
42
+
43
+
44
+ def decode_coordinate(hidden_state: torch.Tensor, w: nn.Module) -> torch.Tensor:
45
+ """
46
+ Takes as input the last hidden state from the text model and outputs a single logit
47
+ representing either an x or y coordinate prediction.
48
+
49
+ Args:
50
+ hidden_state: The final hidden state tensor from the text model.
51
+
52
+ Returns:
53
+ A single logit representing the predicted coordinate value (x or y)
54
+ """
55
+ return w.coord_decoder(hidden_state)
56
+
57
+
58
+ def encode_size(size: torch.Tensor, w: nn.Module) -> torch.Tensor:
59
+ """
60
+ Takes a tensor containing width and height values and encodes them into
61
+ hidden states for input to the text model.
62
+
63
+ Args:
64
+ size: Tensor with two floats for width and height
65
+
66
+ Returns:
67
+ Encoded hidden states tensor for input to text model
68
+ """
69
+ return w.size_encoder(fourier_features(size, w.size_features))
70
+
71
+
72
+ def decode_size(hidden_state: torch.Tensor, w: nn.Module) -> torch.Tensor:
73
+ """
74
+ Takes as input the last hidden state from the text model and outputs logits
75
+ for 1024 bins representing width and height in log-scale.
76
+
77
+ The bins are distributed according to the formula:
78
+ bin = (log2(size) + 10.0) / 10.0 * 1023.0
79
+ where size values are clamped to be at least 1/1024.
80
+
81
+ To convert from bin back to size:
82
+ size = 2^((bin / 1023.0) * 10.0 - 10.0)
83
+
84
+ Args:
85
+ hidden_state: The final hidden state tensor from the text model.
86
+
87
+ Returns:
88
+ A tensor containing logits for 1024 bins for width and height.
89
+ Shape is (2, 1024) where the first dimension corresponds to width and height.
90
+ """
91
+ return w.size_decoder(hidden_state).view(2, -1)
92
+
93
+
94
+ def encode_spatial_refs(spatial_refs: SpatialRefs, w: nn.Module) -> dict:
95
+ """
96
+ Takes a list of spatial references (points or regions) and encodes them into
97
+ hidden states for input to the text model.
98
+
99
+ Args:
100
+ spatial_refs: List of spatial references (points or boxes)
101
+ - Points are represented as normalized (x, y) tuples
102
+ - Boxes are represented as normalized (x_min, y_min, x_max, y_max) tuples
103
+
104
+ Returns:
105
+ {"coords": torch.Tensor, "sizes": Optional[torch.Tensor]}
106
+ """
107
+ coords, sizes = [], []
108
+ for ref in spatial_refs:
109
+ if len(ref) == 2:
110
+ coords.append(ref[0])
111
+ coords.append(ref[1])
112
+ else:
113
+ x_c = (ref[0] + ref[2]) / 2
114
+ y_c = (ref[1] + ref[3]) / 2
115
+ width = ref[2] - ref[0]
116
+ height = ref[3] - ref[1]
117
+ coords.append(x_c)
118
+ coords.append(y_c)
119
+ sizes.append([width, height])
120
+
121
+ coords = torch.tensor(
122
+ coords, device=w.coord_features.device, dtype=w.coord_features.dtype
123
+ ).view(-1, 1)
124
+ coords = encode_coordinate(coords, w)
125
+
126
+ if sizes:
127
+ sizes = torch.tensor(
128
+ sizes, device=w.size_features.device, dtype=w.size_features.dtype
129
+ )
130
+ sizes = encode_size(sizes, w)
131
+ else:
132
+ sizes = None
133
+
134
+ return {"coords": coords, "sizes": sizes}
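
The size head above predicts width and height over 1024 log-scale bins; decode_size's docstring gives the bin mapping, and _generate_points in moondream.py applies the inverse. A small worked example of that round trip (plain Python, with illustrative helper names that are not part of the repository):

    import math

    # bin = (log2(size) + 10.0) / 10.0 * 1023.0, with size clamped to at least 1/1024
    def size_to_bin(size: float) -> int:
        size = max(size, 1.0 / 1024.0)
        return round((math.log2(size) + 10.0) / 10.0 * 1023.0)

    # size = 2 ** ((bin / 1023.0) * 10.0 - 10.0), the inverse used when decoding boxes
    def bin_to_size(b: int) -> float:
        return 2.0 ** ((b / 1023.0) * 10.0 - 10.0)

    assert size_to_bin(1.0) == 1023                           # full-image extent lands in the last bin
    assert abs(bin_to_size(size_to_bin(0.25)) - 0.25) < 1e-3  # quantization error stays small
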
rope.py ADDED
@@ -0,0 +1,47 @@
1
+ # Ethically sourced from https://github.com/xjdr-alt/entropix
2
+
3
+ import torch
4
+
5
+
6
+ def precompute_freqs_cis(
7
+ dim: int,
8
+ end: int,
9
+ theta: float = 1500000.0,
10
+ dtype: torch.dtype = torch.float32,
11
+ ) -> torch.Tensor:
12
+ freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=dtype)[: (dim // 2)] / dim))
13
+ t = torch.arange(end, dtype=dtype).unsqueeze(1)
14
+ freqs = t * freqs.unsqueeze(0)
15
+ freqs = torch.exp(1j * freqs)
16
+ return torch.stack([freqs.real, freqs.imag], dim=-1)
17
+
18
+
19
+ def apply_rotary_emb(
20
+ x: torch.Tensor,
21
+ freqs_cis: torch.Tensor,
22
+ position_ids: torch.Tensor,
23
+ num_heads: int,
24
+ rot_dim: int = 32,
25
+ interleave: bool = False,
26
+ ) -> torch.Tensor:
27
+ assert rot_dim == freqs_cis.shape[-2] * 2
28
+ assert num_heads == x.shape[1]
29
+
30
+ x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
31
+
32
+ if interleave:
33
+ xq_r = x_rot.float().reshape(*x_rot.shape[:-1], -1, 2)[..., 0]
34
+ xq_i = x_rot.float().reshape(*x_rot.shape[:-1], -1, 2)[..., 1]
35
+ else:
36
+ d_q = x_rot.shape[-1] // 2
37
+ xq_r, xq_i = x_rot[..., :d_q], x_rot[..., d_q:]
38
+
39
+ freqs_cos = freqs_cis[..., 0][position_ids, :].unsqueeze(0).unsqueeze(0)
40
+ freqs_sin = freqs_cis[..., 1][position_ids, :].unsqueeze(0).unsqueeze(0)
41
+
42
+ # Complex multiplication: (a + bi) * (c + di) = (ac - bd) + (ad + bc)i
43
+ xq_out_r = xq_r * freqs_cos - xq_i * freqs_sin
44
+ xq_out_i = xq_r * freqs_sin + xq_i * freqs_cos
45
+ xq_out = torch.stack((xq_out_r, xq_out_i), dim=-1).flatten(-2)
46
+
47
+ return torch.cat([xq_out.to(x.dtype), x_pass], dim=-1)
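
A quick shape check of the two helpers above, written as an illustrative snippet (it assumes rope.py is importable as a plain module; inside the package it is imported relatively from text.py):

    import torch
    from rope import precompute_freqs_cis, apply_rotary_emb   # assumed import path

    rot_dim, max_context, n_heads, head_dim = 32, 128, 4, 64
    freqs_cis = precompute_freqs_cis(rot_dim, max_context)
    assert freqs_cis.shape == (max_context, rot_dim // 2, 2)   # cos/sin stacked in the last dim

    x = torch.randn(1, n_heads, 16, head_dim)                  # (batch, heads, seq, head_dim)
    position_ids = torch.arange(16)
    out = apply_rotary_emb(x, freqs_cis, position_ids, n_heads, rot_dim=rot_dim)
    assert out.shape == x.shape                                # only the first rot_dim channels are rotated
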
structured_outputs.png ADDED

Git LFS Details

  • SHA256: 655bc4f2897175ca479b40308c3a0ae8e368b943b92bfbe73075821906a83b0f
  • Pointer size: 132 Bytes
  • Size of remote file: 1.87 MB
text.py ADDED
@@ -0,0 +1,234 @@
1
+ import torch
2
+ import torch.nn as nn
3
+
4
+ from torch.nn import functional as F
5
+ from torch.nn.attention.flex_attention import flex_attention
6
+ from typing import Optional
7
+
8
+ from .layers import layer_norm, mlp, QuantizedLinear, moe_mlp
9
+ from .rope import apply_rotary_emb, precompute_freqs_cis
10
+ from .config import TextConfig
11
+
12
+
13
+ def text_encoder(input_ids: torch.Tensor, w: nn.Module):
14
+ return F.embedding(input_ids, w.wte)
15
+
16
+
17
+ def attn(
18
+ x: torch.Tensor,
19
+ w: nn.Module,
20
+ freqs_cis: torch.Tensor,
21
+ kv_cache: nn.Module,
22
+ attn_mask: torch.Tensor,
23
+ n_heads: int,
24
+ n_kv_heads: int,
25
+ position_ids: torch.Tensor,
26
+ lora: Optional[dict] = None,
27
+ flex_block_mask_slice=None,
28
+ ):
29
+ bsz, q_len, d_model = x.shape
30
+ head_dim = d_model // n_heads
31
+
32
+ qkv_out = w.qkv(x) # shape: (bsz, q_len, (n_heads + 2*n_kv_heads)*head_dim)
33
+ if lora is not None:
34
+ qkv_out += F.linear(F.linear(x, lora["qkv"]["A"]), lora["qkv"]["B"])
35
+ q_dim = n_heads * head_dim
36
+ kv_dim = n_kv_heads * head_dim
37
+ q, k, v = qkv_out.split([q_dim, kv_dim, kv_dim], dim=-1)
38
+
39
+ q = q.view(bsz, q_len, n_heads, head_dim).transpose(1, 2)
40
+ k = k.view(bsz, q_len, n_kv_heads, head_dim).transpose(1, 2)
41
+ v = v.view(bsz, q_len, n_kv_heads, head_dim).transpose(1, 2)
42
+
43
+ if hasattr(w, "tau") and w.tau is not None:
44
+ tok_feat = F.gelu(qkv_out)
45
+ tok_q = torch.tanh(torch.matmul(tok_feat, w.tau["wq"].t())).permute(0, 2, 1)
46
+ tok_v = torch.tanh(torch.matmul(tok_feat, w.tau["wv"].t())).permute(0, 2, 1)
47
+ pos = position_ids.to(q.dtype) + 1
48
+ tau_pos = 1 + (
49
+ torch.sigmoid(w.tau["alpha"][:, None] * pos.log()) - 0.5
50
+ ) # (H,S)
51
+ tau_q = (tok_q + tau_pos[None]).unsqueeze(-1) # (B,H,S,1)
52
+ tau_v = (tok_v + tau_pos[None]).unsqueeze(-1)
53
+ q = q * tau_q
54
+ v = v * tau_v
55
+
56
+ q = apply_rotary_emb(q, freqs_cis, position_ids, n_heads)
57
+ k = apply_rotary_emb(k, freqs_cis, position_ids, n_kv_heads)
58
+
59
+ if kv_cache is not None:
60
+ k, v = kv_cache.update(position_ids, k, v)
61
+
62
+ if flex_block_mask_slice is not None:
63
+ torch._assert(n_heads == n_kv_heads, "gqa not supported yet")
64
+ out = flex_attention(q, k, v, block_mask=flex_block_mask_slice)
65
+ else:
66
+ out = F.scaled_dot_product_attention(
67
+ q, k, v, attn_mask=attn_mask, enable_gqa=n_heads != n_kv_heads
68
+ )
69
+
70
+ out = out.transpose(1, 2).reshape(bsz, q_len, d_model)
71
+
72
+ out0 = w.proj(out)
73
+ if lora is not None:
74
+ out1 = F.linear(F.linear(x, lora["proj"]["A"]), lora["proj"]["B"])
75
+ out = out0 + out1
76
+ else:
77
+ out = out0
78
+
79
+ return out
80
+
81
+
82
+ def text_decoder(
83
+ x: torch.Tensor,
84
+ w: nn.Module,
85
+ attn_mask: torch.Tensor,
86
+ position_ids: torch.Tensor,
87
+ config: TextConfig,
88
+ lora: Optional[dict] = None,
89
+ flex_block_mask_slice=None,
90
+ ):
91
+ for i, block in enumerate(w.blocks):
92
+ if lora is not None:
93
+ layer_lora = lora["text"]["blocks"][str(i)]
94
+ mlp_lora = layer_lora["mlp"]
95
+ attn_lora = layer_lora["attn"]
96
+ else:
97
+ mlp_lora = None
98
+ attn_lora = None
99
+
100
+ l_in = layer_norm(x, block.ln)
101
+ l_attn = attn(
102
+ l_in,
103
+ block.attn,
104
+ freqs_cis=w.freqs_cis,
105
+ kv_cache=block.kv_cache,
106
+ attn_mask=attn_mask,
107
+ n_heads=config.n_heads,
108
+ n_kv_heads=config.n_kv_heads,
109
+ position_ids=position_ids,
110
+ lora=attn_lora,
111
+ flex_block_mask_slice=flex_block_mask_slice,
112
+ )
113
+
114
+ if config.moe is not None and i >= config.moe.start_layer:
115
+ l_mlp = moe_mlp(l_in, block.mlp, config.moe.experts_per_token)
116
+ else:
117
+ l_mlp = mlp(l_in, block.mlp, lora=mlp_lora)
118
+
119
+ x = x + l_attn + l_mlp
120
+
121
+ return x
122
+
123
+
124
+ def lm_head(
125
+ hidden_BTC: torch.Tensor, w: nn.Module, indices: Optional[torch.Tensor] = None
126
+ ):
127
+ hidden_BC = hidden_BTC[:, -1, :]
128
+ hidden_BC = layer_norm(hidden_BC, w.post_ln)
129
+ if indices is not None:
130
+ # Only compute logits for specified token indices
131
+ logits = hidden_BC @ w.lm_head.weight[indices].T + w.lm_head.bias[indices]
132
+ else:
133
+ logits = w.lm_head(hidden_BC)
134
+ return logits
135
+
136
+
137
+ def build_dense_mlp(d_model, d_ffn, dtype, linear_cls):
138
+ return nn.ModuleDict(
139
+ {
140
+ "fc1": linear_cls(d_model, d_ffn, dtype=dtype),
141
+ "fc2": linear_cls(d_ffn, d_model, dtype=dtype),
142
+ }
143
+ )
144
+
145
+
146
+ def build_moe_mlp(d_model, d_ffn, n_experts, dtype):
147
+ # For GeGLU, fc1 needs to output 2 * d_ffn (for gating)
148
+ return nn.ModuleDict(
149
+ {
150
+ "router": nn.Linear(d_model, n_experts, dtype=dtype),
151
+ "fc1": nn.ParameterDict(
152
+ {
153
+ "weight": nn.Parameter(
154
+ torch.empty(n_experts, 2 * d_ffn, d_model, dtype=dtype)
155
+ )
156
+ }
157
+ ),
158
+ "fc2": nn.ParameterDict(
159
+ {
160
+ "weight": nn.Parameter(
161
+ torch.empty(n_experts, d_model, d_ffn, dtype=dtype)
162
+ )
163
+ }
164
+ ),
165
+ }
166
+ )
167
+
168
+
169
+ def build_text_model(config: TextConfig, dtype: torch.dtype) -> nn.Module:
170
+ qkv_dim = int(config.dim * (1 + 2 * config.n_kv_heads / config.n_heads))
171
+ linear_cls = QuantizedLinear if config.group_size is not None else nn.Linear
172
+
173
+ text = nn.ModuleDict(
174
+ {
175
+ "blocks": nn.ModuleList(
176
+ [
177
+ nn.ModuleDict(
178
+ {
179
+ "ln": nn.LayerNorm(config.dim, dtype=dtype),
180
+ "attn": nn.ModuleDict(
181
+ {
182
+ "qkv": linear_cls(config.dim, qkv_dim, dtype=dtype),
183
+ "proj": linear_cls(
184
+ config.dim, config.dim, dtype=dtype
185
+ ),
186
+ "tau": nn.ParameterDict(
187
+ {
188
+ "wq": nn.Parameter(
189
+ torch.empty(
190
+ config.n_heads, qkv_dim, dtype=dtype
191
+ )
192
+ ),
193
+ "wv": nn.Parameter(
194
+ torch.empty(
195
+ config.n_heads, qkv_dim, dtype=dtype
196
+ )
197
+ ),
198
+ "alpha": nn.Parameter(
199
+ torch.empty(config.n_heads, dtype=dtype)
200
+ ),
201
+ }
202
+ ),
203
+ }
204
+ ),
205
+ "mlp": (
206
+ build_moe_mlp(
207
+ config.dim,
208
+ config.moe.expert_inner_dim,
209
+ config.moe.num_experts,
210
+ dtype,
211
+ )
212
+ if config.moe is not None
213
+ and layer_idx >= config.moe.start_layer
214
+ else build_dense_mlp(
215
+ config.dim, config.ff_dim, dtype, linear_cls
216
+ )
217
+ ),
218
+ }
219
+ )
220
+ for layer_idx in range(config.n_layers)
221
+ ]
222
+ ),
223
+ "post_ln": nn.LayerNorm(config.dim, dtype=dtype),
224
+ "lm_head": nn.Linear(config.dim, config.vocab_size, dtype=dtype),
225
+ }
226
+ )
227
+ text.wte = nn.Parameter(torch.empty(config.vocab_size, config.dim, dtype=dtype))
228
+ text.register_buffer(
229
+ "freqs_cis",
230
+ precompute_freqs_cis(config.dim // (2 * config.n_heads), config.max_context),
231
+ persistent=False,
232
+ )
233
+
234
+ return text
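
The decode loops in moondream.py above sample the next token from temperature-scaled softmax probabilities over these logits, after a nucleus filter (self._apply_top_p). As a reference, a minimal top-p filter in the conventional formulation looks roughly like the sketch below; this is an illustration, not necessarily the repository's exact implementation.

    import torch

    def apply_top_p(probs: torch.Tensor, top_p: float) -> torch.Tensor:
        # probs: (1, vocab) softmax probabilities. Keep the smallest prefix of the
        # sorted distribution whose cumulative mass reaches top_p, then renormalize.
        sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        drop = cumulative - sorted_probs > top_p               # mass covered before this token
        sorted_probs = sorted_probs.masked_fill(drop, 0.0)
        filtered = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
        return filtered / filtered.sum(dim=-1, keepdim=True)

    probs = torch.softmax(torch.randn(1, 8), dim=-1)
    next_token = torch.multinomial(apply_top_p(probs, top_p=0.9), num_samples=1)
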
utils.py ADDED
@@ -0,0 +1,41 @@
1
+ import numpy as np
2
+
3
+
4
+ def remove_outlier_points(points_tuples, k_nearest=2, threshold=2.0):
5
+ """
6
+ Robust outlier detection for list of (x,y) tuples.
7
+ Only requires numpy.
8
+
9
+ Args:
10
+ points_tuples: list of (x,y) tuples
11
+ k_nearest: number of neighbors to consider
12
+ threshold: multiplier for median distance
13
+
14
+ Returns:
15
+ list: filtered list of (x,y) tuples with outliers removed
17
+ """
18
+ points = np.array(points_tuples)
19
+ n_points = len(points)
20
+
21
+ # Calculate pairwise distances manually
22
+ dist_matrix = np.zeros((n_points, n_points))
23
+ for i in range(n_points):
24
+ for j in range(i + 1, n_points):
25
+ # Euclidean distance between points i and j
26
+ dist = np.sqrt(np.sum((points[i] - points[j]) ** 2))
27
+ dist_matrix[i, j] = dist
28
+ dist_matrix[j, i] = dist
29
+
30
+ # Get k nearest neighbors' distances
31
+ k = min(k_nearest, n_points - 1)
32
+ neighbor_distances = np.partition(dist_matrix, k, axis=1)[:, :k]
33
+ avg_neighbor_dist = np.mean(neighbor_distances, axis=1)
34
+
35
+ # Calculate mask using median distance
36
+ median_dist = np.median(avg_neighbor_dist)
37
+ mask = avg_neighbor_dist <= threshold * median_dist
38
+
39
+ # Return filtered tuples and mask
40
+ filtered_tuples = [t for t, m in zip(points_tuples, mask) if m]
41
+ return filtered_tuples
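
An illustrative use of remove_outlier_points on the kind of data detect_gaze produces: a tight cluster of gaze estimates plus one stray detection. The stray point's average distance to its nearest neighbours far exceeds the median, so it is dropped.

    from utils import remove_outlier_points   # assuming utils.py is importable as a plain module

    points = [(0.50 + 0.01 * i, 0.40) for i in range(9)] + [(0.95, 0.95)]
    kept = remove_outlier_points(points, k_nearest=2, threshold=2.0)
    assert (0.95, 0.95) not in kept and len(kept) == 9
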
vision.py ADDED
@@ -0,0 +1,147 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ import numpy as np
5
+
6
+ from typing import Union, Tuple
7
+ from PIL import Image
8
+
9
+ from .layers import attn, layer_norm, mlp
10
+ from .image_crops import overlap_crop_image
11
+ from .config import VisionConfig
12
+
13
+ if torch.backends.mps.is_available():
14
+ # Non-divisible input sizes are not implemented on MPS device yet.
15
+ # https://github.com/pytorch/pytorch/issues/96056
16
+ def adaptive_avg_pool2d(input, output_size):
17
+ return F.adaptive_avg_pool2d(input.to("cpu"), output_size).to("mps")
18
+
19
+ else:
20
+ adaptive_avg_pool2d = F.adaptive_avg_pool2d
21
+
22
+ DeviceLike = Union[str, torch.device, int]
23
+
24
+
25
+ def prepare_crops(
26
+ image: Image.Image, config: VisionConfig, device: DeviceLike
27
+ ) -> Tuple[torch.Tensor, Tuple[int, int]]:
28
+ np_image = np.array(image.convert("RGB"))
29
+ overlap_crops = overlap_crop_image(
30
+ np_image, max_crops=config.max_crops, overlap_margin=config.overlap_margin
31
+ )
32
+ all_crops = overlap_crops["crops"]
33
+ all_crops = np.transpose(all_crops, (0, 3, 1, 2))
34
+ all_crops = (
35
+ torch.from_numpy(all_crops)
36
+ .to(device=device, dtype=torch.bfloat16)
37
+ .div_(255.0)
38
+ .sub_(0.5)
39
+ .div_(0.5)
40
+ )
41
+ return all_crops, overlap_crops["tiling"]
42
+
43
+
44
+ def create_patches(x, patch_size):
45
+ # Original shape: [B, C, H, W]
46
+ B, C, H, W = x.shape
47
+ P1 = P2 = patch_size
48
+
49
+ # Step 1: Split H and W dimensions into patches
50
+ # [B, C, H/P1, P1, W/P2, P2]
51
+ x = x.reshape(B, C, H // P1, P1, W // P2, P2)
52
+
53
+ # Step 2: Rearrange dimensions to match target shape
54
+ # [B, H/P1, W/P2, C, P1, P2]
55
+ x = x.permute(0, 2, 4, 1, 3, 5)
56
+
57
+ # Step 3: Combine dimensions to get final shape
58
+ # [B, (H/P1)*(W/P2), C*P1*P2]
59
+ x = x.reshape(B, (H // P1) * (W // P2), C * P1 * P2)
60
+
61
+ return x
62
+
63
+
64
+ def vision_encoder(input_BCHW: torch.Tensor, w: nn.Module, config: VisionConfig):
65
+ x = create_patches(input_BCHW, config.enc_patch_size)
66
+
67
+ x = w.patch_emb(x)
68
+ x = x + w.pos_emb
69
+ for block in w.blocks:
70
+ x = x + attn(layer_norm(x, block.ln1), block.attn, n_heads=config.enc_n_heads)
71
+ x = x + mlp(layer_norm(x, block.ln2), block.mlp)
72
+ x = layer_norm(x, w.post_ln)
73
+
74
+ return x
75
+
76
+
77
+ def vision_projection(
78
+ global_features: torch.Tensor,
79
+ reconstructed: torch.Tensor,
80
+ w: nn.Module,
81
+ config: VisionConfig,
82
+ ):
83
+ reconstructed = reconstructed.permute(2, 0, 1)
84
+ reconstructed = adaptive_avg_pool2d(
85
+ reconstructed, output_size=(config.enc_n_layers, config.enc_n_layers)
86
+ )
87
+ reconstructed = reconstructed.permute(1, 2, 0).view(729, config.enc_dim)
88
+ final_features = torch.cat([global_features, reconstructed], dim=-1)
89
+ return mlp(final_features, w.proj_mlp)
90
+
91
+
92
+ def build_vision_model(config: VisionConfig, dtype: torch.dtype):
93
+ patch_dim = config.enc_patch_size * config.enc_patch_size * config.in_channels
94
+ grid_size = config.crop_size // config.enc_patch_size
95
+ num_patches = grid_size * grid_size
96
+
97
+ vision = nn.ModuleDict(
98
+ {
99
+ "patch_emb": nn.Linear(patch_dim, config.enc_dim, dtype=dtype),
100
+ "blocks": nn.ModuleList(
101
+ [
102
+ nn.ModuleDict(
103
+ {
104
+ "ln1": nn.LayerNorm(config.enc_dim, dtype=dtype),
105
+ "attn": nn.ModuleDict(
106
+ {
107
+ "qkv": nn.Linear(
108
+ config.enc_dim, 3 * config.enc_dim, dtype=dtype
109
+ ),
110
+ "proj": nn.Linear(
111
+ config.enc_dim, config.enc_dim, dtype=dtype
112
+ ),
113
+ }
114
+ ),
115
+ "ln2": nn.LayerNorm(config.enc_dim, dtype=dtype),
116
+ "mlp": nn.ModuleDict(
117
+ {
118
+ "fc1": nn.Linear(
119
+ config.enc_dim, config.enc_ff_dim, dtype=dtype
120
+ ),
121
+ "fc2": nn.Linear(
122
+ config.enc_ff_dim, config.enc_dim, dtype=dtype
123
+ ),
124
+ }
125
+ ),
126
+ }
127
+ )
128
+ for _ in range(config.enc_n_layers)
129
+ ]
130
+ ),
131
+ "post_ln": nn.LayerNorm(config.enc_dim, dtype=dtype),
132
+ "proj_mlp": nn.ModuleDict(
133
+ {
134
+ "fc1": nn.Linear(
135
+ config.enc_dim * 2, config.proj_inner_dim, dtype=dtype
136
+ ),
137
+ "fc2": nn.Linear(
138
+ config.proj_inner_dim, config.proj_out_dim, dtype=dtype
139
+ ),
140
+ }
141
+ ),
142
+ }
143
+ )
144
+ vision.pos_emb = nn.Parameter(
145
+ torch.zeros(1, num_patches, config.enc_dim, dtype=dtype)
146
+ )
147
+ return vision
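
A shape sketch for create_patches, using an illustrative 378-pixel crop with 14-pixel patches (the actual values come from VisionConfig in the repo's config module and may differ). A crop of that size yields 27 * 27 = 729 patch tokens, matching the 729 rows that vision_projection reshapes the pooled features into.

    import torch
    from vision import create_patches   # assuming vision.py is importable as a plain module

    crop_size, patch_size = 378, 14      # illustrative assumptions
    x = torch.randn(2, 3, crop_size, crop_size)   # (B, C, H, W)
    patches = create_patches(x, patch_size)
    assert patches.shape == (2, (crop_size // patch_size) ** 2, 3 * patch_size * patch_size)
    assert patches.shape[1] == 729
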
visual_reasoning.png ADDED

Git LFS Details

  • SHA256: 43f9b5eb351a66dd8a05957813beb5e18367f6b3ad232b2bde8545d90132759d
  • Pointer size: 131 Bytes
  • Size of remote file: 434 kB