beshkenadze committed 3 days ago

Commit

5d99344

verified

1 Parent(s): b30ea75

Add 8-bit MLX quant of moondream3-preview (mlx-vlm)


Files changed (17)

LICENSE.md +27 -0
README.md +42 -0
config.json +30 -0
config.py +102 -0
hf_moondream.py +190 -0
image_crops.py +231 -0
layers.py +259 -0
lora.py +437 -0
model-00001-of-00002.safetensors +3 -0
model-00002-of-00002.safetensors +3 -0
model.safetensors.index.json +1059 -0
moondream.py +1097 -0
region.py +136 -0
rope.py +47 -0
text.py +223 -0
utils.py +41 -0
vision.py +147 -0

LICENSE.md ADDED

	@@ -0,0 +1,27 @@

+| License | Business Source License (BSL 1.1) |
+| --- | --- |
+| Licensor | M87 Labs, Inc. |
+| Licensed Work | “Moondream 3 (Preview)” including Model Weights and any Derivatives (“Derivatives” include fine-tunes, merges, quantizations, weight deltas, and other weight-level modifications or conversions.) |
+| Additional Use Grant | You may make production use of the Licensed Work, provided Your use does not include offering the Licensed Work to third parties on a hosted or embedded basis in order to compete with M87 Labs’s paid version(s) of the Licensed Work. For purposes of this license:<br><br>A “competitive offering” is a Product that is offered to third parties on a paid basis, including through paid support arrangements, that significantly overlaps with the capabilities of M87 Labs’s paid version(s) of the Licensed Work. If Your Product is not a competitive offering when You first make it generally available, it will not become a competitive offering later due to M87 Labs releasing a new version of the Licensed Work with additional capabilities. In addition, Products that are not provided on a paid basis are not competitive.<br><br>“Product” means software that is offered to end users to manage in their own environments or offered as a service on a hosted basis.<br><br>“Embedded” means including the source code or executable code from the Licensed Work in a competitive offering. “Embedded” also means packaging the competitive offering in such a way that the Licensed Work must be accessed or downloaded for the competitive offering to operate.<br><br>Hosting or using the Licensed Work(s) for internal purposes within an organization is not considered a competitive offering. M87 Labs considers your organization to include all of your affiliates under common control. |
+| Change Date | Two years after the first public release of this version of the Licensed Work |
+| Change License | Apache License, Version 2.0 |
+For information about alternative licensing arrangements for the Licensed Work, please contact [contact@m87.ai](mailto:contact@m87.ai).
+The text of the Business Source License 1.1 follows. License text copyright (c) 2020 MariaDB Corporation Ab, All Rights Reserved. “Business Source License” is a trademark of MariaDB Corporation Ab.
+## Terms
+The Licensor hereby grants you the right to copy, modify, create derivative works, redistribute, and make non-production use of the Licensed Work. The Licensor may make an Additional Use Grant, above, permitting limited production use.
+Effective on the Change Date, or the fourth anniversary of the first publicly available distribution of a specific version of the Licensed Work under this License, whichever comes first, the Licensor hereby grants you rights under the terms of the Change License, and the rights granted in the paragraph above terminate.
+If your use of the Licensed Work does not comply with the requirements currently in effect as described in this License, you must purchase a commercial license from the Licensor, its affiliated entities, or authorized resellers, or you must refrain from using the Licensed Work.
+All copies of the original and modified Licensed Work, and derivative works of the Licensed Work, are subject to this License. This License applies separately for each version of the Licensed Work and the Change Date may vary for each version of the Licensed Work released by Licensor.
+You must conspicuously display this License on each original or modified copy of the Licensed Work. If you receive the Licensed Work in original or modified form from a third party, the terms and conditions set forth in this License apply to your use of that work.
+Any use of the Licensed Work in violation of this License will automatically terminate your rights under this License for the current and all other versions of the Licensed Work.
+This License does not grant you any right in any trademark or logo of Licensor or its affiliates (provided that you may use a trademark or logo of Licensor as expressly required by this License).
+TO THE EXTENT PERMITTED BY APPLICABLE LAW, THE LICENSED WORK IS PROVIDED ON AN “AS IS” BASIS. LICENSOR HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS OR IMPLIED, INCLUDING (WITHOUT LIMITATION) WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, AND TITLE.

README.md ADDED

	@@ -0,0 +1,42 @@

+---
+library_name: mlx
+pipeline_tag: image-text-to-text
+license: other
+license_name: bsl-1.1
+license_link: LICENSE.md
+base_model: moondream/moondream3-preview
+tags:
+- mlx
+- moondream
+- moondream3
+- vision-language
+- image-text-to-text
+---
+# moondream3-preview-mlx-8bit
+An **8-bit** MLX quantization of [moondream/moondream3-preview](https://huggingface.co/moondream/moondream3-preview)
+for running on Apple Silicon with [mlx-vlm](https://github.com/Blaizzy/mlx-vlm).
+| | |
+|---|---|
+| Quantization | affine, **8 bits**, group size 64 (vision tower included) |
+| On-disk size | ~9.5 GB |
+| Peak memory | ~11 GB |
+| Tokenizer | loaded from [`moondream/starmie-v1`](https://huggingface.co/moondream/starmie-v1) at runtime (not bundled) |
+## Usage
+```bash
+pip install mlx-vlm
+python -m mlx_vlm.generate \
+  --model beshkenadze/moondream3-preview-mlx-8bit \
+  --image path/to/image.jpg \
+  --prompt "Describe this image." \
+  --max-tokens 128 --temperature 0.0
+```
+## License
+moondream3 is released under the **Business Source License 1.1 (BSL 1.1)** — see [LICENSE.md](LICENSE.md). This quantization is a derivative redistribution under the same terms.

config.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "architectures": [
+    "HfMoondream"
+  ],
+  "auto_map": {
+    "AutoConfig": "hf_moondream.HfConfig",
+    "AutoModelForCausalLM": "hf_moondream.HfMoondream"
+  },
+  "config": {
+    "skills": [
+      "query",
+      "caption",
+      "detect",
+      "point"
+    ]
+  },
+  "model_type": "moondream3",
+  "quantization": {
+    "group_size": 64,
+    "bits": 8,
+    "mode": "affine"
+  },
+  "quantization_config": {
+    "group_size": 64,
+    "bits": 8,
+    "mode": "affine"
+  },
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.51.1"
+}
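
The `quantization` block above can be turned into a rough storage estimate. The sketch below assumes that affine group quantization stores one scale and one bias per group of 64 weights, each in a 16-bit float type; the `meta_bits` parameter and the helper name are illustrative assumptions, not taken from this commit. It also ignores any tensors that may have been left unquantized.

```python
import json

# Quantization block copied from config.json above.
CONFIG_JSON = '{"quantization": {"group_size": 64, "bits": 8, "mode": "affine"}}'


def effective_bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Bits stored per weight, counting one scale and one bias per group.

    meta_bits=16 is an assumption: scale and bias kept in a 16-bit float type.
    """
    return bits + 2 * meta_bits / group_size


quant = json.loads(CONFIG_JSON)["quantization"]
bpw = effective_bits_per_weight(quant["bits"], quant["group_size"])
print(f"effective bits per weight: {bpw}")

# Rough parameter count implied by the README's ~9.5 GB on-disk size
# (treating 1 GB as 1e9 bytes and every tensor as quantized).
approx_params = 9.5e9 * 8 / bpw
print(f"implied parameter count: {approx_params / 1e9:.1f}B")
```

Under these assumptions the per-group metadata adds half a bit per weight on top of the nominal 8 bits, which is why the on-disk size is somewhat larger than one byte per parameter.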

config.py ADDED

	@@ -0,0 +1,102 @@

+from dataclasses import dataclass, field
+from typing import Dict, List, Optional
+@dataclass(frozen=True)
+class TextMoeConfig:
+    num_experts: int = 64
+    start_layer: int = 4
+    experts_per_token: int = 8
+    expert_inner_dim: int = 1024
+@dataclass(frozen=True)
+class TextConfig:
+    dim: int = 2048
+    ff_dim: int = 8192
+    n_layers: int = 24
+    vocab_size: int = 51200
+    max_context: int = 4096
+    n_heads: int = 32
+    n_kv_heads: int = 32
+    prefix_attn: int = 730
+    group_size: Optional[int] = None
+    moe: Optional[TextMoeConfig] = TextMoeConfig()
+@dataclass(frozen=True)
+class VisionConfig:
+    enc_dim: int = 1152
+    enc_patch_size: int = 14
+    enc_n_layers: int = 27
+    enc_ff_dim: int = 4304
+    enc_n_heads: int = 16
+    proj_out_dim: int = 2048
+    crop_size: int = 378
+    in_channels: int = 3
+    max_crops: int = 12
+    overlap_margin: int = 4
+    proj_inner_dim: int = 8192
+@dataclass(frozen=True)
+class RegionConfig:
+    dim: int = 2048
+    coord_feat_dim: int = 256
+    coord_out_dim: int = 1024
+    size_feat_dim: int = 512
+    size_out_dim: int = 2048
+    group_size: Optional[int] = None
+@dataclass(frozen=True)
+class TokenizerConfig:
+    bos_id: int = 0
+    eos_id: int = 0
+    answer_id: int = 3
+    thinking_id: int = 4
+    coord_id: int = 5
+    size_id: int = 6
+    start_ground_points_id: int = 7
+    end_ground_id: int = 9
+    templates: Dict[str, Optional[Dict[str, List[int]]]] = field(
+        default_factory=lambda: {
+            "caption": {
+                "short": [1, 32708, 2, 12492, 3],
+                "normal": [1, 32708, 2, 6382, 3],
+                "long": [1, 32708, 2, 4059, 3],
+            },
+            "query": {"prefix": [1, 15381, 2], "suffix": [3]},
+            "detect": {"prefix": [1, 7235, 476, 2], "suffix": [3]},
+            "point": {"prefix": [1, 2581, 2], "suffix": [3]},
+        }
+    )
+@dataclass(frozen=True)
+class MoondreamConfig:
+    text: TextConfig = TextConfig()
+    vision: VisionConfig = VisionConfig()
+    region: RegionConfig = RegionConfig()
+    tokenizer: TokenizerConfig = TokenizerConfig()
+    @classmethod
+    def from_dict(cls, config_dict: dict):
+        text_config = TextConfig(**config_dict.get("text", {}))
+        vision_config = VisionConfig(**config_dict.get("vision", {}))
+        region_config = RegionConfig(**config_dict.get("region", {}))
+        tokenizer_config = TokenizerConfig(**config_dict.get("tokenizer", {}))
+        return cls(
+            text=text_config,
+            vision=vision_config,
+            region=region_config,
+            tokenizer=tokenizer_config,
+        )
+    def to_dict(self):
+        return {
+            "text": self.text.__dict__,
+            "vision": self.vision.__dict__,
+            "region": self.region.__dict__,
+            "tokenizer": self.tokenizer.__dict__,
+        }
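
Several of the defaults in `config.py` are related by simple arithmetic. The sketch below copies a handful of values into two small stand-in dataclasses (`VisionDefaults` and `TextDefaults` are hypothetical names used only here) and derives the patch grid, the per-head dimension, and the fraction of experts active per token. The reading of `prefix_attn` as one BOS token plus one crop's worth of patches is an assumption that happens to match the numbers; the diff itself does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionDefaults:
    # Values copied from VisionConfig above.
    crop_size: int = 378
    enc_patch_size: int = 14


@dataclass(frozen=True)
class TextDefaults:
    # Values copied from TextConfig and TextMoeConfig above.
    dim: int = 2048
    n_heads: int = 32
    prefix_attn: int = 730
    num_experts: int = 64
    experts_per_token: int = 8


vision, text = VisionDefaults(), TextDefaults()

patches_per_side = vision.crop_size // vision.enc_patch_size  # patches along one edge
patches_per_crop = patches_per_side ** 2                      # patches in one square crop
head_dim = text.dim // text.n_heads                           # width of each attention head
active_expert_fraction = text.experts_per_token / text.num_experts

print(patches_per_side, patches_per_crop, head_dim, active_expert_fraction)

# Assumed interpretation: prefix_attn = 1 BOS token + one crop of patch tokens.
print(text.prefix_attn == 1 + patches_per_crop)
```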

hf_moondream.py ADDED

	@@ -0,0 +1,190 @@

+import torch
+import torch.nn as nn
+from transformers import PreTrainedModel, PretrainedConfig
+from typing import Union
+from .config import MoondreamConfig
+from .moondream import MoondreamModel
+# Files sometimes don't get loaded without these...
+from .image_crops import *
+from .vision import *
+from .text import *
+from .region import *
+from .utils import *
+def extract_question(text):
+    prefix = "<image>\n\nQuestion: "
+    suffix = "\n\nAnswer:"
+    if text.startswith(prefix) and text.endswith(suffix):
+        return text[len(prefix) : -len(suffix)]
+    else:
+        return None
+class HfConfig(PretrainedConfig):
+    _auto_class = "AutoConfig"
+    model_type = "moondream3"
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        self.config = {"skills": ["query", "caption", "detect", "point"]}
+class HfMoondream(PreTrainedModel):
+    _auto_class = "AutoModelForCausalLM"
+    config_class = HfConfig
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = MoondreamModel(
+            MoondreamConfig.from_dict(config.config), setup_caches=False
+        )
+        self._is_kv_cache_setup = False
+        self.post_init()
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):
+        output = super().from_pretrained(*args, **kwargs)
+        model = output[0] if isinstance(output, tuple) else output
+        model.model._refresh_runtime_buffers()
+        return output
+    def _setup_caches(self):
+        if not self._is_kv_cache_setup:
+            self.model._setup_caches()
+            self._is_kv_cache_setup = True
+    @property
+    def encode_image(self):
+        self._setup_caches()
+        return self.model.encode_image
+    @property
+    def query(self):
+        self._setup_caches()
+        return self.model.query
+    @property
+    def caption(self):
+        self._setup_caches()
+        return self.model.caption
+    @property
+    def detect(self):
+        self._setup_caches()
+        return self.model.detect
+    @property
+    def point(self):
+        self._setup_caches()
+        return self.model.point
+    @property
+    def detect_gaze(self):
+        self._setup_caches()
+        return self.model.detect_gaze
+    def answer_question(
+        self,
+        image_embeds,
+        question,
+        tokenizer=None,
+        chat_history="",
+        result_queue=None,
+        max_new_tokens=256,
+        **kwargs
+    ):
+        answer = self.query(image_embeds, question)["answer"].strip()
+        if result_queue is not None:
+            result_queue.put(answer)
+        return answer
+    def batch_answer(self, images, prompts, tokenizer=None, **kwargs):
+        answers = []
+        for image, prompt in zip(images, prompts):
+            answers.append(self.query(image, prompt)["answer"].strip())
+        return answers
+    def _unsupported_exception(self):
+        raise NotImplementedError(
+            "This method is not supported in the latest version of moondream. "
+            "Consider upgrading to the updated API spec, or alternately pin "
+            "to 'revision=2024-08-26'."
+        )
+    def generate(self, image_embeds, prompt, tokenizer, max_new_tokens=128, **kwargs):
+        """
+        Function definition remains unchanged for backwards compatibility.
+        Be aware that tokenizer and kwargs are ignored, and max_new_tokens is
+        only used when the prompt is not in the legacy question format.
+        """
+        prompt_extracted = extract_question(prompt)
+        if prompt_extracted is not None:
+            answer = self.model.query(
+                image=image_embeds, question=prompt_extracted, stream=False
+            )["answer"]
+        else:
+            image_embeds = self.encode_image(image_embeds)
+            prompt_tokens = torch.tensor(
+                [self.model.tokenizer.encode(prompt).ids],
+                device=self.device,
+            )
+            def generator():
+                for token in self.model._generate_answer(
+                    prompt_tokens,
+                    image_embeds.kv_cache,
+                    image_embeds.pos,
+                    max_new_tokens,
+                ):
+                    yield token
+            answer = "".join(list(generator()))
+        return [answer]
+    def get_input_embeddings(self) -> nn.Embedding:
+        """
+        Lazily wrap the raw parameter `self.model.text.wte` in a real
+        `nn.Embedding` layer so that HF mix-ins recognise it.  The wrapper
+        **shares** the weight tensor—no copy is made.
+        """
+        if not hasattr(self, "_input_embeddings"):
+            self._input_embeddings = nn.Embedding.from_pretrained(
+                self.model.text.wte,  # tensor created in text.py
+                freeze=True,  # set to False if you need it trainable
+            )
+        return self._input_embeddings
+    def set_input_embeddings(self, value: Union[nn.Embedding, nn.Module]) -> None:
+        """
+        Lets HF functions (e.g. `resize_token_embeddings`) replace or resize the
+        embeddings and keeps everything tied to `self.model.text.wte`.
+        """
+        # 1. point the low-level parameter to the new weight matrix
+        self.model.text.wte = value.weight
+        # 2. keep a reference for get_input_embeddings()
+        self._input_embeddings = value
+    def input_embeds(
+        self,
+        input_ids: Union[torch.LongTensor, list, tuple],
+        *,
+        device: torch.device | None = None
+    ) -> torch.FloatTensor:
+        """
+        Back-compat wrapper that turns token IDs into embeddings.
+        Example:
+            ids = torch.tensor([[1, 2, 3]])
+            embeds = model.input_embeds(ids)      # (1, 3, hidden_dim)
+        """
+        if not torch.is_tensor(input_ids):
+            input_ids = torch.as_tensor(input_ids)
+        if device is not None:
+            input_ids = input_ids.to(device)
+        return self.get_input_embeddings()(input_ids)
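
The legacy `generate` path in `hf_moondream.py` hinges on `extract_question`: a prompt in the old `<image>\n\nQuestion: ...\n\nAnswer:` format is unwrapped and routed to `query`, while anything else falls through to raw token generation. The standalone sketch below reproduces that string logic (with type hints and module-level constants added for illustration) so the two cases can be tried without loading the model.

```python
from typing import Optional

PREFIX = "<image>\n\nQuestion: "
SUFFIX = "\n\nAnswer:"


def extract_question(text: str) -> Optional[str]:
    """Return the bare question from a legacy prompt, or None if it does not match."""
    if text.startswith(PREFIX) and text.endswith(SUFFIX):
        return text[len(PREFIX) : -len(SUFFIX)]
    return None


legacy_prompt = "<image>\n\nQuestion: What color is the car?\n\nAnswer:"
print(extract_question(legacy_prompt))           # legacy format: question is unwrapped
print(extract_question("Describe this image."))  # any other prompt: None
```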

image_crops.py ADDED

	@@ -0,0 +1,231 @@

+import math
+import numpy as np
+import torch
+from typing import TypedDict
+try:
+    import pyvips
+    HAS_VIPS = True
+except Exception:  # pyvips is missing or libvips failed to load
+    from PIL import Image
+    HAS_VIPS = False
+def select_tiling(
+    height: int, width: int, crop_size: int, max_crops: int
+) -> tuple[int, int]:
+    """
+    Determine the optimal number of tiles to cover an image with overlapping crops.
+    """
+    if height <= crop_size or width <= crop_size:
+        return (1, 1)
+    # Minimum required tiles in each dimension
+    min_h = math.ceil(height / crop_size)
+    min_w = math.ceil(width / crop_size)
+    # If minimum required tiles exceed max_crops, return proportional distribution
+    if min_h * min_w > max_crops:
+        ratio = math.sqrt(max_crops / (min_h * min_w))
+        return (max(1, math.floor(min_h * ratio)), max(1, math.floor(min_w * ratio)))
+    # Perfect aspect-ratio tiles that satisfy max_crops
+    h_tiles = math.floor(math.sqrt(max_crops * height / width))
+    w_tiles = math.floor(math.sqrt(max_crops * width / height))
+    # Ensure we meet minimum tile requirements
+    h_tiles = max(h_tiles, min_h)
+    w_tiles = max(w_tiles, min_w)
+    # If we exceeded max_crops, scale down the larger dimension
+    if h_tiles * w_tiles > max_crops:
+        if w_tiles > h_tiles:
+            w_tiles = math.floor(max_crops / h_tiles)
+        else:
+            h_tiles = math.floor(max_crops / w_tiles)
+    return (max(1, h_tiles), max(1, w_tiles))
+class OverlapCropOutput(TypedDict):
+    crops: np.ndarray
+    tiling: tuple[int, int]
+def overlap_crop_image(
+    image: np.ndarray,
+    overlap_margin: int,
+    max_crops: int,
+    base_size: tuple[int, int] = (378, 378),
+    patch_size: int = 14,
+) -> OverlapCropOutput:
+    """
+    Process an image using an overlap-and-resize cropping strategy with margin handling.
+    This function takes an input image and creates multiple overlapping crops with
+    consistent margins. It produces:
+    1. A single global crop resized to base_size
+    2. Multiple overlapping local crops that maintain high resolution details
+    3. A patch ordering matrix that tracks correspondence between crops
+    The overlap strategy ensures:
+    - Smooth transitions between adjacent crops
+    - No loss of information at crop boundaries
+    - Proper handling of features that cross crop boundaries
+    - Consistent patch indexing across the full image
+    Args:
+        image (np.ndarray): Input image as numpy array with shape (H,W,C)
+        base_size (tuple[int,int]): Target size for crops, default (378,378)
+        patch_size (int): Size of patches in pixels, default 14
+        overlap_margin (int): Margin size in patch units (required; VisionConfig uses 4)
+        max_crops (int): Maximum number of crops allowed (required; VisionConfig uses 12)
+    Returns:
+        OverlapCropOutput: Dictionary containing:
+            - crops: A numpy array containing the global crop of the full image (index 0)
+                followed by the overlapping cropped regions (indices 1+)
+            - tiling: Tuple of (height,width) tile counts
+    """
+    original_h, original_w = image.shape[:2]
+    # Convert margin from patch units to pixels
+    margin_pixels = patch_size * overlap_margin
+    total_margin_pixels = margin_pixels * 2  # Both sides
+    # Calculate crop parameters
+    crop_patches = base_size[0] // patch_size  # patches per crop dimension
+    crop_window_patches = crop_patches - (2 * overlap_margin)  # usable patches
+    crop_window_size = crop_window_patches * patch_size  # usable size in pixels
+    # Determine tiling
+    tiling = select_tiling(
+        original_h - total_margin_pixels,
+        original_w - total_margin_pixels,
+        crop_window_size,
+        max_crops,
+    )
+    # Pre-allocate crops.
+    n_crops = tiling[0] * tiling[1] + 1  # 1 = global crop
+    crops = np.zeros(
+        (n_crops, base_size[0], base_size[1], image.shape[2]), dtype=np.uint8
+    )
+    # Resize image to fit tiling
+    target_size = (
+        tiling[0] * crop_window_size + total_margin_pixels,
+        tiling[1] * crop_window_size + total_margin_pixels,
+    )
+    if HAS_VIPS:
+        # Convert to vips for resizing
+        vips_image = pyvips.Image.new_from_array(image)
+        scale_x = target_size[1] / image.shape[1]
+        scale_y = target_size[0] / image.shape[0]
+        resized = vips_image.resize(scale_x, vscale=scale_y)
+        image = resized.numpy()
+        # Create global crop
+        scale_x = base_size[1] / vips_image.width
+        scale_y = base_size[0] / vips_image.height
+        global_vips = vips_image.resize(scale_x, vscale=scale_y)
+        crops[0] = global_vips.numpy()
+    else:
+        # Fallback to PIL
+        pil_img = Image.fromarray(image)
+        resized = pil_img.resize(
+            (int(target_size[1]), int(target_size[0])),
+            resample=Image.Resampling.LANCZOS,
+        )
+        image = np.asarray(resized)
+        # Create global crop
+        global_pil = pil_img.resize(
+            (int(base_size[1]), int(base_size[0])), resample=Image.Resampling.LANCZOS
+        )
+        crops[0] = np.asarray(global_pil)
+    for i in range(tiling[0]):
+        for j in range(tiling[1]):
+            # Calculate crop coordinates
+            y0 = i * crop_window_size
+            x0 = j * crop_window_size
+            # Extract crop with padding if needed
+            y_end = min(y0 + base_size[0], image.shape[0])
+            x_end = min(x0 + base_size[1], image.shape[1])
+            crop_region = image[y0:y_end, x0:x_end]
+            crops[
+                1 + i * tiling[1] + j, : crop_region.shape[0], : crop_region.shape[1]
+            ] = crop_region
+    return {"crops": crops, "tiling": tiling}
+def reconstruct_from_crops(
+    crops: torch.Tensor,
+    tiling: tuple[int, int],
+    overlap_margin: int,
+    patch_size: int = 14,
+) -> torch.Tensor:
+    """
+    Reconstruct the original image from overlapping crops into a single seamless image.
+    Takes a list of overlapping image crops along with their positional metadata and
+    reconstructs them into a single coherent image by carefully stitching together
+    non-overlapping regions. Handles both numpy arrays and PyTorch tensors.
+    Args:
+        crops: List of image crops as numpy arrays or PyTorch tensors with shape
+            (H,W,C)
+        tiling: Tuple of (height,width) indicating crop grid layout
+        patch_size: Size in pixels of each patch, default 14
+        overlap_margin: Number of overlapping patches on each edge, default 4
+    Returns:
+        Reconstructed image as numpy array or PyTorch tensor matching input type,
+        with shape (H,W,C) where H,W are the original image dimensions
+    """
+    tiling_h, tiling_w = tiling
+    crop_height, crop_width = crops[0].shape[:2]
+    margin_pixels = overlap_margin * patch_size
+    # Calculate output size (only adding margins once)
+    output_h = (crop_height - 2 * margin_pixels) * tiling_h + 2 * margin_pixels
+    output_w = (crop_width - 2 * margin_pixels) * tiling_w + 2 * margin_pixels
+    reconstructed = torch.zeros(
+        (output_h, output_w, crops[0].shape[2]),
+        device=crops[0].device,
+        dtype=crops[0].dtype,
+    )
+    for i, crop in enumerate(crops):
+        tile_y = i // tiling_w
+        tile_x = i % tiling_w
+        # For each tile, determine which part to keep
+        # Keep left margin only for first column
+        x_start = 0 if tile_x == 0 else margin_pixels
+        # Keep right margin only for last column
+        x_end = crop_width if tile_x == tiling_w - 1 else crop_width - margin_pixels
+        # Keep top margin only for first row
+        y_start = 0 if tile_y == 0 else margin_pixels
+        # Keep bottom margin only for last row
+        y_end = crop_height if tile_y == tiling_h - 1 else crop_height - margin_pixels
+        # Calculate where this piece belongs in the output
+        out_x = tile_x * (crop_width - 2 * margin_pixels)
+        out_y = tile_y * (crop_height - 2 * margin_pixels)
+        # Place the piece
+        reconstructed[
+            out_y + y_start : out_y + y_end, out_x + x_start : out_x + x_end
+        ] = crop[y_start:y_end, x_start:x_end]
+    return reconstructed
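
The crop geometry in `overlap_crop_image` is easier to follow with concrete numbers. The sketch below repeats the margin and window arithmetic from the function using the default sizes (378-pixel crops, 14-pixel patches, 4-patch margin) and checks that the local crops exactly cover the resized image. The `(2, 4)` tiling is a hypothetical example, and the helper names are illustrative rather than part of the file.

```python
def crop_geometry(base_size: int = 378, patch_size: int = 14, overlap_margin: int = 4):
    """Return (margin_pixels, crop_window_size) as computed in overlap_crop_image."""
    margin_pixels = patch_size * overlap_margin
    crop_patches = base_size // patch_size                    # patches per crop edge
    crop_window_patches = crop_patches - 2 * overlap_margin   # patches not shared with neighbours
    return margin_pixels, crop_window_patches * patch_size


def resized_shape(tiling, crop_window_size, margin_pixels):
    """Height and width the image is resized to before local crops are cut."""
    total_margin = 2 * margin_pixels
    return (
        tiling[0] * crop_window_size + total_margin,
        tiling[1] * crop_window_size + total_margin,
    )


margin, window = crop_geometry()
tiling = (2, 4)  # hypothetical (rows, columns) tiling
shape = resized_shape(tiling, window, margin)

# Top-left corner of each local crop; consecutive crops step by one window,
# so neighbours overlap by 2 * margin pixels.
origins = [(i * window, j * window) for i in range(tiling[0]) for j in range(tiling[1])]

print(margin, window, shape)
print(origins[-1])
```

Because each crop is one window plus two margins wide, the last crop's far edge lands exactly on the resized image's edge, which is why the padding branch in the loop is only a safeguard.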

layers.py ADDED

	@@ -0,0 +1,259 @@

+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from dataclasses import dataclass
+from typing import Literal, Optional
+from .lora import (
+    DenseLoRALayer,
+    MoELoRALayer,
+    apply_dense_lora,
+    apply_moe_lora_fc1_flat,
+    apply_moe_lora_fc2_flat,
+)
+try:
+    from torchao import quantize_
+    from torchao.quantization import int4_weight_only
+except ImportError:
+    def quantize_(model, quant_mode):
+        raise ImportError(
+            "torchao is not installed. Please install it with `pip install torchao`."
+        )
+    def int4_weight_only(group_size):
+        raise ImportError(
+            "torchao is not installed. Please install it with `pip install torchao`."
+        )
+def gelu_approx(x):
+    return F.gelu(x, approximate="tanh")
+@dataclass
+class LinearWeights:
+    weight: torch.Tensor
+    bias: torch.Tensor
+def linear(x: torch.Tensor, w: LinearWeights) -> torch.Tensor:
+    return F.linear(x, w.weight, w.bias)
+def dequantize_tensor(W_q, scale, zero, orig_shape, dtype=torch.bfloat16):
+    _step = W_q.shape[0]
+    W_r = torch.empty([2 * _step, W_q.shape[1]], dtype=dtype, device=W_q.device)
+    W_r[:_step] = (W_q & 0b11110000) >> 4
+    W_r[_step:] = W_q & 0b00001111
+    W_r.sub_(zero).mul_(scale)
+    return W_r.reshape(orig_shape)
+class QuantizedLinear(nn.Module):
+    def __init__(
+        self,
+        in_features: int,
+        out_features: int,
+        dtype: torch.dtype,
+    ):
+        # TODO: Take group_size as an input instead of hardcoding it here.
+        super().__init__()
+        self.in_features = in_features
+        self.out_features = out_features
+        self.weight = nn.ParameterDict(
+            {
+                "packed": nn.Parameter(
+                    torch.empty(
+                        out_features * in_features // (128 * 2), 128, dtype=torch.uint8
+                    ),
+                    requires_grad=False,
+                ),
+                "scale": nn.Parameter(
+                    torch.empty(out_features * in_features // 128, 1),
+                    requires_grad=False,
+                ),
+                "zero_point": nn.Parameter(
+                    torch.empty(out_features * in_features // 128, 1),
+                    requires_grad=False,
+                ),
+            }
+        )
+        self.bias = nn.Parameter(torch.empty(out_features), requires_grad=False)
+        self.unpacked = False
+    def unpack(self):
+        if self.unpacked:
+            return
+        self.weight = nn.Parameter(
+            dequantize_tensor(
+                self.weight["packed"],
+                self.weight["scale"],
+                self.weight["zero_point"],
+                (self.out_features, self.in_features),
+                torch.bfloat16,
+            )
+        )
+        with torch.device("meta"):
+            self.linear = nn.Linear(
+                self.in_features, self.out_features, dtype=torch.bfloat16
+            )
+        self.linear.weight = self.weight
+        self.linear.bias = nn.Parameter(
+            self.bias.to(torch.bfloat16), requires_grad=False
+        )
+        del self.weight, self.bias
+        quantize_(self, int4_weight_only(group_size=128))
+        self.unpacked = True
+        torch.cuda.empty_cache()
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        if not self.unpacked:
+            self.unpack()
+        return self.linear(x)
+@dataclass
+class LayerNormWeights:
+    weight: torch.Tensor
+    bias: torch.Tensor
+def layer_norm(x: torch.Tensor, w: LayerNormWeights) -> torch.Tensor:
+    return F.layer_norm(x, w.bias.shape, w.weight, w.bias)
+@dataclass
+class MLPWeights:
+    fc1: LinearWeights
+    fc2: LinearWeights
+    act: Literal["gelu_approx"] = "gelu_approx"
+def mlp(
+    x: torch.Tensor, w: MLPWeights, lora: Optional[DenseLoRALayer] = None
+) -> torch.Tensor:
+    x0 = w.fc1(x)
+    if lora is not None:
+        x = x0 + apply_dense_lora(x, lora.up_a, lora.up_b)
+    else:
+        x = x0
+    x = gelu_approx(x)
+    x0 = w.fc2(x)
+    if lora is not None:
+        x = x0 + apply_dense_lora(x, lora.down_a, lora.down_b)
+    else:
+        x = x0
+    return x
+def moe_mlp(
+    x: torch.Tensor,
+    mlp_module: nn.Module,
+    experts_per_token: int,
+    lora: Optional[MoELoRALayer] = None,
+) -> torch.Tensor:
+    B, T, C = x.shape
+    x = x.reshape(-1, C)
+    # Router computation
+    router_logits = mlp_module.router(x)
+    topk_logits, topk_idxs = torch.topk(router_logits, experts_per_token, dim=-1)
+    topk_weights = F.softmax(topk_logits, dim=-1, dtype=torch.float32).to(x.dtype)
+    num_tokens, top_k = topk_idxs.shape
+    if T == 1:
+        w1_weight = mlp_module.fc1.weight
+        w2_weight = mlp_module.fc2.weight
+        # Flatten to process all token-expert pairs at once
+        flat_idxs = topk_idxs.view(-1)  # [T*A]
+        flat_weights = topk_weights.view(-1)  # [T*A]
+        # Select expert weights
+        w1_selected = w1_weight[flat_idxs]
+        w2_selected = w2_weight[flat_idxs]
+        # Expand input for all token-expert pairs
+        x_expanded = x.unsqueeze(1).expand(-1, top_k, -1).reshape(-1, C)  # [T*A, D]
+        # First linear layer with GeGLU: [T*A, H, D] @ [T*A, D, 1] -> [T*A, H]
+        x1_full = torch.bmm(w1_selected, x_expanded.unsqueeze(-1)).squeeze(-1)  # [T*A, H]
+        if lora is not None:
+            x1_full = x1_full + apply_moe_lora_fc1_flat(x_expanded, lora, flat_idxs)
+        x1, g = x1_full.chunk(2, dim=-1)
+        x1 = F.gelu(x1) * (g + 1)
+        # Second linear layer: [T*A, D, H] @ [T*A, H, 1] -> [T*A, D]
+        expert_outs = torch.bmm(w2_selected, x1.unsqueeze(-1)).squeeze(-1)  # [T*A, D]
+        if lora is not None:
+            expert_outs = expert_outs + apply_moe_lora_fc2_flat(x1, lora, flat_idxs)
+        # Apply weights and reshape
+        weighted_outs = expert_outs * flat_weights.unsqueeze(-1)  # [T*A, D]
+        weighted_outs = weighted_outs.view(num_tokens, top_k, C)  # [T, A, D]
+        # Sum over experts
+        mlp_out = weighted_outs.sum(dim=1)  # [T, D]
+        mlp_out = mlp_out.view(B, T, C)
+        return mlp_out
+    else:
+        out = x.new_zeros(x.size())
+        for expert_id in range(mlp_module.fc1.weight.shape[0]):
+            token_pos, which_k = (topk_idxs == expert_id).nonzero(as_tuple=True)
+            if token_pos.numel() == 0:
+                continue
+            x_tok = x.index_select(0, token_pos)
+            gate_tok = topk_weights[token_pos, which_k]
+            w1 = mlp_module.fc1.weight[expert_id]
+            h_full = F.linear(x_tok, w1)
+            if lora is not None:
+                lora_up_a = lora.up_a[expert_id]
+                lora_up_b = lora.up_b[expert_id]
+                lora_mid = F.linear(x_tok, lora_up_a)
+                h_full = h_full + F.linear(lora_mid, lora_up_b)
+            h, g = h_full.chunk(2, dim=-1)
+            h = F.gelu(h) * (g + 1)
+            w2 = mlp_module.fc2.weight[expert_id]
+            y = F.linear(h, w2)
+            if lora is not None:
+                lora_down_a = lora.down_a[expert_id]
+                lora_down_b = lora.down_b[expert_id]
+                lora_mid = F.linear(h, lora_down_a)
+                y = y + F.linear(lora_mid, lora_down_b)
+            y.mul_(gate_tok.unsqueeze(-1))
+            out.index_add_(0, token_pos, y)
+        return out.view(B, T, C)
+@dataclass
+class AttentionWeights:
+    qkv: LinearWeights
+    proj: LinearWeights
+def attn(x: torch.Tensor, w: AttentionWeights, n_heads: int) -> torch.Tensor:
+    bsz, q_len, d_model = x.shape
+    head_dim = d_model // n_heads
+    q, k, v = [
+        t.view(bsz, q_len, n_heads, head_dim).transpose(1, 2)
+        for t in linear(x, w.qkv).chunk(3, dim=-1)
+    ]
+    out = F.scaled_dot_product_attention(q, k, v)
+    out = out.transpose(1, 2).reshape(bsz, q_len, d_model)
+    out = linear(out, w.proj)
+    return out
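
The per-expert fallback branch in the MoE code above gathers the tokens routed to each expert, scales each expert's output by its gate weight, and scatter-adds the result back to the token position. Here is a minimal pure-Python sketch of that pattern, assuming toy scalar "experts" in place of the real `fc1`/`fc2` weights. The function `routed_sum` and the sample data are illustrative and not part of the commit.

```python
from typing import Callable, List, Sequence


def routed_sum(
    xs: Sequence[float],
    topk_idxs: Sequence[Sequence[int]],
    topk_weights: Sequence[Sequence[float]],
    experts: Sequence[Callable[[float], float]],
) -> List[float]:
    """Loop over experts, gather the tokens routed to each, and scatter-add."""
    out = [0.0] * len(xs)
    for expert_id, expert in enumerate(experts):
        # Plays the role of (topk_idxs == expert_id).nonzero(as_tuple=True):
        # every (token position, top-k slot) pair that selected this expert.
        hits = [
            (t, k)
            for t, row in enumerate(topk_idxs)
            for k, e in enumerate(row)
            if e == expert_id
        ]
        if not hits:
            continue  # no token selected this expert
        for t, k in hits:
            # Gate-weighted expert output, accumulated at the token position.
            out[t] += topk_weights[t][k] * expert(xs[t])
    return out


# Toy stand-ins for the expert MLPs (assumption: scalar functions).
experts = [lambda v: v + 1.0, lambda v: 2.0 * v, lambda v: -v]
xs = [1.0, 2.0]
topk_idxs = [[0, 1], [1, 2]]
topk_weights = [[0.5, 0.5], [0.75, 0.25]]
result = routed_sum(xs, topk_idxs, topk_weights, experts)
print(result)
```

Each token's output is the weighted sum over only the experts it selected, which is the same quantity the flattened branch computes with `weighted_outs.sum(dim=1)`.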

lora.py ADDED

	@@ -0,0 +1,437 @@

+import json
+import os
+import re
+import shutil
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Dict, Optional, Tuple
+from urllib.request import Request, urlopen
+import torch
+from .config import TextConfig
+class AdapterLoadError(RuntimeError):
+    pass
+def _cache_root() -> Path:
+    hf_hub_cache = os.environ.get("HF_HUB_CACHE")
+    if hf_hub_cache:
+        return Path(hf_hub_cache)
+    hf_home = os.environ.get("HF_HOME")
+    if hf_home:
+        return Path(hf_home) / "hub"
+    return Path("~/.cache/huggingface/hub").expanduser()
+def adapter_cache_dir() -> Path:
+    return _cache_root() / "md_finetunes"
+def normalize_adapter_id(value: Optional[str]) -> Optional[str]:
+    if not value:
+        return None
+    tail = value.split("/")[-1].strip()
+    if "@" not in tail:
+        return None
+    return tail
+def parse_adapter_id(adapter_id: str) -> Tuple[str, str]:
+    if not adapter_id or "@" not in adapter_id:
+        raise AdapterLoadError(
+            f"Invalid adapter id '{adapter_id}'. Expected 'finetune_id@step'."
+        )
+    finetune_id, step = adapter_id.split("@", 1)
+    if not finetune_id or not step:
+        raise AdapterLoadError(
+            f"Invalid adapter id '{adapter_id}'. Expected 'finetune_id@step'."
+        )
+    return finetune_id, step
+def _fetch_presigned_url(finetune_id: str, step: str) -> str:
+    endpoint = os.getenv("MOONDREAM_ENDPOINT", "https://api.moondream.ai").rstrip("/")
+    api_key = os.getenv("MOONDREAM_API_KEY")
+    if not api_key:
+        raise AdapterLoadError("MOONDREAM_API_KEY is required to load finetune adapters.")
+    headers = {"User-Agent": "moondream-torch", "X-Moondream-Auth": api_key}
+    url = f"{endpoint}/v1/tuning/finetunes/{finetune_id}/checkpoints/{step}/download"
+    req = Request(url, headers=headers)
+    try:
+        with urlopen(req) as r:
+            payload = json.loads(r.read().decode("utf-8"))
+    except Exception as e:
+        raise AdapterLoadError(f"Failed to fetch adapter URL: {e}") from e
+    presigned = payload.get("url")
+    if not presigned:
+        raise AdapterLoadError("Adapter URL response missing 'url' field.")
+    return presigned
+def cached_adapter_path(adapter_id: str) -> Path:
+    finetune_id, step = parse_adapter_id(adapter_id)
+    cache_dir = adapter_cache_dir() / finetune_id / step
+    cache_dir.mkdir(parents=True, exist_ok=True)
+    for name in ("adapter.pt", "adapter.safetensors"):
+        path = cache_dir / name
+        if path.exists() and path.stat().st_size > 0:
+            return path
+    presigned_url = _fetch_presigned_url(finetune_id, step)
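+    dest = cache_dir / "adapter.pt"
+    tmp = dest.with_name(dest.name + ".part")
+    try:
+        # Download to a temporary file first so an interrupted transfer is
+        # never mistaken for a complete cached adapter on the next call.
+        with urlopen(presigned_url) as r, open(tmp, "wb") as f:
+            shutil.copyfileobj(r, f)
+        os.replace(tmp, dest)
+    except Exception as e:
+        tmp.unlink(missing_ok=True)
+        raise AdapterLoadError(f"Failed to download adapter: {e}") from e
+    return dest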
+def _load_state_dict(path: Path, device: torch.device) -> Dict[str, Any]:
+    if path.suffix == ".safetensors":
+        try:
+            from safetensors import safe_open
+        except Exception as e:
+            raise AdapterLoadError(
+                "safetensors is required to load .safetensors adapters."
+            ) from e
+        data = {}
+        with safe_open(str(path), framework="pt") as f:
+            for key in f.keys():
+                data[key] = f.get_tensor(key).to(device=device)
+        return data
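+    try:
+        # Prefer the restricted unpickler when this PyTorch version supports it.
+        return torch.load(path, map_location=device, weights_only=True)
+    except TypeError:
+        # Fallback for PyTorch versions that do not accept weights_only;
+        # this path unpickles the file without that restriction.
+        return torch.load(path, map_location=device)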
+@dataclass
+class DenseLoRALayer:
+    up_a: torch.Tensor
+    up_b: torch.Tensor
+    down_a: torch.Tensor
+    down_b: torch.Tensor
+@dataclass
+class MoELoRALayer:
+    up_a: torch.Tensor
+    up_b: torch.Tensor
+    down_a: torch.Tensor
+    down_b: torch.Tensor
+class TextLoRA:
+    def __init__(
+        self,
+        text_config: TextConfig,
+        *,
+        rank: int,
+        max_rank: int,
+        dtype: torch.dtype,
+        device: torch.device,
+        adapter_id: Optional[str] = None,
+    ) -> None:
+        if rank <= 0:
+            raise AdapterLoadError("LoRA rank must be positive.")
+        if max_rank < rank:
+            raise AdapterLoadError("max_rank must be >= rank.")
+        self.text_config = text_config
+        self.rank = rank
+        self.max_rank = max_rank
+        self.adapter_id = adapter_id
+        moe_cfg = text_config.moe
+        self.start_layer = moe_cfg.start_layer if moe_cfg else text_config.n_layers
+        if moe_cfg is not None:
+            self.rank_per_expert = rank // moe_cfg.experts_per_token
+            if self.rank_per_expert < 1:
+                raise AdapterLoadError(
+                    f"rank ({rank}) must be >= experts_per_token ({moe_cfg.experts_per_token})"
+                )
+            self.max_rank_per_expert = max_rank // moe_cfg.experts_per_token
+            if self.max_rank_per_expert < 1:
+                raise AdapterLoadError(
+                    f"max_rank ({max_rank}) must be >= experts_per_token ({moe_cfg.experts_per_token})"
+                )
+        else:
+            self.rank_per_expert = 0
+            self.max_rank_per_expert = 0
+        d_model = text_config.dim
+        d_ffn = text_config.ff_dim
+        self.dense: list[DenseLoRALayer] = []
+        for _ in range(self.start_layer):
+            self.dense.append(
+                DenseLoRALayer(
+                    up_a=torch.zeros((max_rank, d_model), device=device, dtype=dtype),
+                    up_b=torch.zeros((d_ffn, max_rank), device=device, dtype=dtype),
+                    down_a=torch.zeros((max_rank, d_ffn), device=device, dtype=dtype),
+                    down_b=torch.zeros((d_model, max_rank), device=device, dtype=dtype),
+                )
+            )
+        self.moe: list[MoELoRALayer] = []
+        if moe_cfg is not None:
+            num_experts = moe_cfg.num_experts
+            d_expert = moe_cfg.expert_inner_dim
+            for _ in range(text_config.n_layers - self.start_layer):
+                self.moe.append(
+                    MoELoRALayer(
+                        up_a=torch.zeros(
+                            (num_experts, self.max_rank_per_expert, d_model),
+                            device=device,
+                            dtype=dtype,
+                        ),
+                        up_b=torch.zeros(
+                            (num_experts, d_expert * 2, self.max_rank_per_expert),
+                            device=device,
+                            dtype=dtype,
+                        ),
+                        down_a=torch.zeros(
+                            (num_experts, self.max_rank_per_expert, d_expert),
+                            device=device,
+                            dtype=dtype,
+                        ),
+                        down_b=torch.zeros(
+                            (num_experts, d_model, self.max_rank_per_expert),
+                            device=device,
+                            dtype=dtype,
+                        ),
+                    )
+                )
+    def dense_layer(self, layer_idx: int) -> Optional[DenseLoRALayer]:
+        if layer_idx < len(self.dense):
+            return self.dense[layer_idx]
+        return None
+    def moe_layer(self, layer_idx: int) -> Optional[MoELoRALayer]:
+        moe_idx = layer_idx - self.start_layer
+        if 0 <= moe_idx < len(self.moe):
+            return self.moe[moe_idx]
+        return None
+    @staticmethod
+    def _pad_axis(tensor: torch.Tensor, target: int, axis: int) -> torch.Tensor:
+        if tensor.shape[axis] == target:
+            return tensor
+        if tensor.shape[axis] > target:
+            raise AdapterLoadError(
+                f"LoRA tensor rank {tensor.shape[axis]} exceeds max {target}"
+            )
+        pad_shape = list(tensor.shape)
+        pad_shape[axis] = target - tensor.shape[axis]
+        pad = torch.zeros(pad_shape, device=tensor.device, dtype=tensor.dtype)
+        return torch.cat([tensor, pad], dim=axis)
+    @staticmethod
+    def detect_rank(state_dict: Dict[str, Any], text_config: TextConfig) -> int:
+        for key, tensor in state_dict.items():
+            if "dense" in key and "up_a" in key:
+                return int(tensor.shape[0])
+        for key, tensor in state_dict.items():
+            if "moe" in key and "up_a" in key:
+                rank_per_expert = int(tensor.shape[1])
+                moe_cfg = text_config.moe
+                if moe_cfg:
+                    return rank_per_expert * moe_cfg.experts_per_token
+                return rank_per_expert
+        raise AdapterLoadError("Could not detect LoRA rank from state dict.")
+    @classmethod
+    def from_state_dict(
+        cls,
+        state_dict: Dict[str, Any],
+        *,
+        text_config: TextConfig,
+        max_rank: int,
+        dtype: torch.dtype,
+        device: torch.device,
+        adapter_id: Optional[str] = None,
+    ) -> "TextLoRA":
+        rank = cls.detect_rank(state_dict, text_config)
+        if rank > max_rank:
+            raise AdapterLoadError(
+                f"Adapter rank ({rank}) exceeds max_rank ({max_rank})."
+            )
+        lora = cls(
+            text_config,
+            rank=rank,
+            max_rank=max_rank,
+            dtype=dtype,
+            device=device,
+            adapter_id=adapter_id,
+        )
+        dense_seen = set()
+        moe_seen = set()
+        pattern = re.compile(r"(dense|moe)\.(\d+)\.(up_a|up_b|down_a|down_b)$")
+        for key, tensor in state_dict.items():
+            match = pattern.search(key)
+            if not match:
+                continue
+            kind, idx_str, name = match.group(1), match.group(2), match.group(3)
+            idx = int(idx_str)
+            arr = tensor.to(device=device, dtype=dtype)
+            if kind == "dense":
+                if idx >= len(lora.dense):
+                    raise AdapterLoadError(f"Dense LoRA layer index {idx} out of range.")
+                layer = lora.dense[idx]
+                if name in ("up_a", "down_a"):
+                    arr = cls._pad_axis(arr, lora.max_rank, axis=0)
+                else:
+                    arr = cls._pad_axis(arr, lora.max_rank, axis=1)
+                setattr(layer, name, arr)
+                dense_seen.add((idx, name))
+            else:
+                if idx >= len(lora.moe):
+                    raise AdapterLoadError(f"MoE LoRA layer index {idx} out of range.")
+                layer = lora.moe[idx]
+                if name in ("up_a", "down_a"):
+                    arr = cls._pad_axis(arr, lora.max_rank_per_expert, axis=1)
+                else:
+                    arr = cls._pad_axis(arr, lora.max_rank_per_expert, axis=2)
+                setattr(layer, name, arr)
+                moe_seen.add((idx, name))
+        for layer_idx in range(len(lora.dense)):
+            for name in ("up_a", "up_b", "down_a", "down_b"):
+                if (layer_idx, name) not in dense_seen:
+                    raise AdapterLoadError(
+                        f"Adapter missing dense LoRA for layer {layer_idx} ({name})."
+                    )
+        for layer_idx in range(len(lora.moe)):
+            for name in ("up_a", "up_b", "down_a", "down_b"):
+                if (layer_idx, name) not in moe_seen:
+                    raise AdapterLoadError(
+                        f"Adapter missing MoE LoRA for layer {layer_idx} ({name})."
+                    )
+        return lora
+def select_layer_lora(
+    lora: Optional[TextLoRA], layer_idx: int, *, is_moe: bool
+) -> Optional[object]:
+    if lora is None:
+        return None
+    return lora.moe_layer(layer_idx) if is_moe else lora.dense_layer(layer_idx)
+def apply_dense_lora(
+    x: torch.Tensor, lora_a: torch.Tensor, lora_b: torch.Tensor
+) -> torch.Tensor:
+    b, t, c = x.shape
+    x_flat = x.reshape(-1, c)
+    lora_mid = torch.matmul(x_flat, lora_a.t())
+    lora_out = torch.matmul(lora_mid, lora_b.t())
+    return lora_out.reshape(b, t, -1)
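+def apply_moe_lora_fc1_flat(
+    x_expanded: torch.Tensor, lora: MoELoRALayer, flat_idxs: torch.Tensor
+) -> torch.Tensor:
+    # Gather one pair of low-rank factors per routed (token, expert) slot.
+    # N = number of routed slots, r = max_rank_per_expert, D = d_model.
+    lora_up_a = lora.up_a[flat_idxs]  # [N, r, D]
+    lora_up_b = lora.up_b[flat_idxs]  # [N, 2 * d_expert, r]
+    # Batched matrix-vector products: input -> rank-r bottleneck -> fc1 width.
+    lora_mid = torch.bmm(lora_up_a, x_expanded.unsqueeze(-1)).squeeze(-1)  # [N, r]
+    lora_up = torch.bmm(lora_up_b, lora_mid.unsqueeze(-1)).squeeze(-1)  # [N, 2 * d_expert]
+    return lora_up
+def apply_moe_lora_fc2_flat(
+    h: torch.Tensor, lora: MoELoRALayer, flat_idxs: torch.Tensor
+) -> torch.Tensor:
+    # Same gather-then-bmm pattern for the down projection.
+    lora_down_a = lora.down_a[flat_idxs]  # [N, r, d_expert]
+    lora_down_b = lora.down_b[flat_idxs]  # [N, D, r]
+    lora_mid = torch.bmm(lora_down_a, h.unsqueeze(-1)).squeeze(-1)  # [N, r]
+    lora_down = torch.bmm(lora_down_b, lora_mid.unsqueeze(-1)).squeeze(-1)  # [N, D]
+    return lora_down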
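+# Process-wide adapter cache. Eviction follows insertion order (FIFO):
+# a cache hit in load_adapter does not refresh an entry's position.
+_ADAPTER_CACHE: Dict[Tuple[str, str, str, Tuple], TextLoRA] = {}
+_CACHE_ORDER: list[Tuple[str, str, str, Tuple]] = []
+_CACHE_SIZE = 8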
+def _config_key(text_config: TextConfig) -> Tuple:
+    moe = text_config.moe
+    moe_key = None
+    if moe is not None:
+        moe_key = (
+            moe.num_experts,
+            moe.start_layer,
+            moe.experts_per_token,
+            moe.expert_inner_dim,
+        )
+    return (
+        text_config.dim,
+        text_config.ff_dim,
+        text_config.n_layers,
+        moe_key,
+    )
+def load_adapter(
+    adapter_id: Optional[str],
+    *,
+    text_config: TextConfig,
+    device: torch.device,
+    dtype: torch.dtype,
+    max_rank: int = 16,
+) -> Optional[TextLoRA]:
+    if adapter_id is None:
+        return None
+    adapter_id = normalize_adapter_id(adapter_id)
+    if adapter_id is None:
+        return None
+    key = (adapter_id, str(device), str(dtype), _config_key(text_config))
+    cached = _ADAPTER_CACHE.get(key)
+    if cached is not None:
+        return cached
+    path = cached_adapter_path(adapter_id)
+    checkpoint = _load_state_dict(path, device)
+    if not isinstance(checkpoint, dict):
+        raise AdapterLoadError("Invalid adapter checkpoint format.")
+    state_dict = checkpoint.get("lora_state_dict", checkpoint)
+    if not isinstance(state_dict, dict):
+        raise AdapterLoadError("Adapter checkpoint missing lora_state_dict.")
+    lora = TextLoRA.from_state_dict(
+        state_dict,
+        text_config=text_config,
+        max_rank=max_rank,
+        dtype=dtype,
+        device=device,
+        adapter_id=adapter_id,
+    )
+    _ADAPTER_CACHE[key] = lora
+    _CACHE_ORDER.append(key)
+    if len(_CACHE_ORDER) > _CACHE_SIZE:
+        old = _CACHE_ORDER.pop(0)
+        _ADAPTER_CACHE.pop(old, None)
+    return lora
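
The bookkeeping at the end of `load_adapter` appends each new key to `_CACHE_ORDER` and evicts from the front, and a cache hit returns early without touching that list. As a rough illustration of what this implies, the sketch below models the same behaviour with plain strings and a cache size of 2 instead of 8. It then contrasts it with a hypothetical LRU variant built on `collections.OrderedDict`. The names `fifo_put`, `lru_get` and `lru_put` are assumptions for the demo, and the LRU variant is not what the commit implements.

```python
from collections import OrderedDict
from typing import Dict, List, Optional

CACHE_SIZE = 2  # the commit uses 8; a smaller value keeps the demo short


def fifo_put(cache: Dict[str, str], order: List[str], key: str, value: str) -> None:
    """Mirror load_adapter's bookkeeping: append, then evict the oldest insert."""
    cache[key] = value
    order.append(key)
    if len(order) > CACHE_SIZE:
        cache.pop(order.pop(0), None)


cache: Dict[str, str] = {}
order: List[str] = []
fifo_put(cache, order, "a@1", "A")
fifo_put(cache, order, "b@1", "B")
_ = cache.get("a@1")  # a hit does not touch `order`
fifo_put(cache, order, "c@1", "C")
fifo_keys = sorted(cache)  # "a@1" is evicted despite the recent hit

# Hypothetical LRU variant (not part of the commit): refresh position on hit.
lru: "OrderedDict[str, str]" = OrderedDict()


def lru_get(key: str) -> Optional[str]:
    if key in lru:
        lru.move_to_end(key)  # mark as most recently used
        return lru[key]
    return None


def lru_put(key: str, value: str) -> None:
    lru[key] = value
    lru.move_to_end(key)
    if len(lru) > CACHE_SIZE:
        lru.popitem(last=False)  # drop the least recently used entry


lru_put("a@1", "A")
lru_put("b@1", "B")
lru_get("a@1")
lru_put("c@1", "C")
lru_keys = sorted(lru)

print(fifo_keys, lru_keys)
```

Whether the difference matters depends on the workload: with at most eight adapters in rotation, both policies keep everything resident.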

model-00001-of-00002.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3b120e9db73fb74c13016f96e2a6217b25b3ec8b866a614456406b555feaba90
+size 5256085089

model-00002-of-00002.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:84a2b3dba409a84b1e847cb25ca6b84956b17518b33e471a58d06590c97bb454
+size 4720107260
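
The two shard entries above are Git LFS pointer files rather than the weights themselves: each records a spec version, a SHA-256 object ID, and the byte size of the real file. The sketch below parses the two pointers as they appear in this commit and compares their combined size with the `total_size` value from `model.safetensors.index.json`. The helper `parse_lfs_pointer` is illustrative. Reading the small difference as per-file overhead, such as safetensors headers, is an assumption that this page does not confirm.

```python
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


SHARD_1 = """\
version https://git-lfs.github.com/spec/v1
oid sha256:3b120e9db73fb74c13016f96e2a6217b25b3ec8b866a614456406b555feaba90
size 5256085089
"""

SHARD_2 = """\
version https://git-lfs.github.com/spec/v1
oid sha256:84a2b3dba409a84b1e847cb25ca6b84956b17518b33e471a58d06590c97bb454
size 4720107260
"""

# "total_size" as reported in model.safetensors.index.json in this commit.
INDEX_TOTAL_SIZE = 9976074848

pointers = [parse_lfs_pointer(SHARD_1), parse_lfs_pointer(SHARD_2)]
total_on_disk = sum(int(p["size"]) for p in pointers)
overhead = total_on_disk - INDEX_TOTAL_SIZE

print(total_on_disk, overhead)
```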

model.safetensors.index.json ADDED

	@@ -0,0 +1,1059 @@

+{
+    "metadata": {
+        "total_size": 9976074848
+    },
+    "weight_map": {
+        "text.lm_head.bias": "model-00002-of-00002.safetensors",
+        "text.lm_head.biases": "model-00002-of-00002.safetensors",
+        "text.lm_head.scales": "model-00002-of-00002.safetensors",
+        "text.lm_head.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.0.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.mlp.router.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.mlp.router.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.mlp.router.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.10.mlp.router.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.mlp.router.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.mlp.router.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.mlp.router.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.11.mlp.router.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.mlp.router.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.mlp.router.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.mlp.router.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.12.mlp.router.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.mlp.fc2.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.13.mlp.fc2.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.13.mlp.fc2.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.13.mlp.router.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.mlp.router.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.mlp.router.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.13.mlp.router.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.14.attn.proj.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.attn.proj.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.attn.proj.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.attn.proj.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.attn.qkv.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.attn.qkv.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.attn.qkv.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.attn.qkv.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.attn.tau.alpha": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.attn.tau.wq": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.attn.tau.wv": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.ln.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.ln.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.mlp.fc1.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.mlp.fc1.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.mlp.fc1.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.mlp.fc2.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.mlp.fc2.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.mlp.fc2.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.mlp.router.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.mlp.router.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.mlp.router.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.14.mlp.router.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.attn.proj.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.attn.proj.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.attn.proj.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.attn.proj.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.attn.qkv.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.attn.qkv.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.attn.qkv.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.attn.qkv.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.attn.tau.alpha": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.attn.tau.wq": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.attn.tau.wv": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.ln.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.ln.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.mlp.fc1.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.mlp.fc1.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.mlp.fc1.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.mlp.fc2.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.mlp.fc2.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.mlp.fc2.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.mlp.router.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.mlp.router.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.mlp.router.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.15.mlp.router.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.attn.proj.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.attn.proj.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.attn.proj.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.attn.proj.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.attn.qkv.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.attn.qkv.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.attn.qkv.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.attn.qkv.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.attn.tau.alpha": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.attn.tau.wq": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.attn.tau.wv": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.ln.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.ln.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.mlp.fc1.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.mlp.fc1.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.mlp.fc1.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.mlp.fc2.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.mlp.fc2.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.mlp.fc2.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.mlp.router.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.mlp.router.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.mlp.router.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.16.mlp.router.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.attn.proj.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.attn.proj.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.attn.proj.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.attn.proj.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.attn.qkv.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.attn.qkv.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.attn.qkv.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.attn.qkv.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.attn.tau.alpha": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.attn.tau.wq": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.attn.tau.wv": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.ln.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.ln.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.mlp.fc1.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.mlp.fc1.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.mlp.fc1.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.mlp.fc2.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.mlp.fc2.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.mlp.fc2.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.mlp.router.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.mlp.router.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.mlp.router.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.17.mlp.router.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.attn.proj.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.attn.proj.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.attn.proj.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.attn.proj.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.attn.qkv.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.attn.qkv.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.attn.qkv.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.attn.qkv.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.attn.tau.alpha": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.attn.tau.wq": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.attn.tau.wv": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.ln.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.ln.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.mlp.fc1.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.mlp.fc1.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.mlp.fc1.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.mlp.fc2.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.mlp.fc2.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.mlp.fc2.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.mlp.router.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.mlp.router.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.mlp.router.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.18.mlp.router.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.attn.proj.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.attn.proj.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.attn.proj.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.attn.proj.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.attn.qkv.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.attn.qkv.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.attn.qkv.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.attn.qkv.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.attn.tau.alpha": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.attn.tau.wq": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.attn.tau.wv": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.ln.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.ln.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.mlp.fc1.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.mlp.fc1.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.mlp.fc1.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.mlp.fc2.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.mlp.fc2.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.mlp.fc2.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.mlp.router.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.mlp.router.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.mlp.router.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.19.mlp.router.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.2.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.20.attn.proj.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.attn.proj.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.attn.proj.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.attn.proj.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.attn.qkv.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.attn.qkv.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.attn.qkv.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.attn.qkv.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.attn.tau.alpha": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.attn.tau.wq": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.attn.tau.wv": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.ln.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.ln.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.mlp.fc1.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.mlp.fc1.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.mlp.fc1.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.mlp.fc2.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.mlp.fc2.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.mlp.fc2.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.mlp.router.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.mlp.router.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.mlp.router.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.20.mlp.router.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.attn.proj.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.attn.proj.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.attn.proj.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.attn.proj.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.attn.qkv.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.attn.qkv.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.attn.qkv.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.attn.qkv.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.attn.tau.alpha": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.attn.tau.wq": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.attn.tau.wv": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.ln.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.ln.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.mlp.fc1.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.mlp.fc1.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.mlp.fc1.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.mlp.fc2.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.mlp.fc2.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.mlp.fc2.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.mlp.router.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.mlp.router.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.mlp.router.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.21.mlp.router.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.attn.proj.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.attn.proj.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.attn.proj.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.attn.proj.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.attn.qkv.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.attn.qkv.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.attn.qkv.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.attn.qkv.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.attn.tau.alpha": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.attn.tau.wq": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.attn.tau.wv": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.ln.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.ln.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.mlp.fc1.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.mlp.fc1.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.mlp.fc1.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.mlp.fc2.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.mlp.fc2.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.mlp.fc2.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.mlp.router.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.mlp.router.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.mlp.router.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.22.mlp.router.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.attn.proj.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.attn.proj.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.attn.proj.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.attn.proj.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.attn.qkv.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.attn.qkv.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.attn.qkv.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.attn.qkv.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.attn.tau.alpha": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.attn.tau.wq": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.attn.tau.wv": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.ln.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.ln.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.mlp.fc1.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.mlp.fc1.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.mlp.fc1.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.mlp.fc2.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.mlp.fc2.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.mlp.fc2.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.mlp.router.bias": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.mlp.router.biases": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.mlp.router.scales": "model-00002-of-00002.safetensors",
+        "text.model.blocks.23.mlp.router.weight": "model-00002-of-00002.safetensors",
+        "text.model.blocks.3.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.3.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.mlp.router.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.mlp.router.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.mlp.router.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.4.mlp.router.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.mlp.router.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.mlp.router.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.mlp.router.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.5.mlp.router.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.mlp.router.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.mlp.router.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.mlp.router.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.6.mlp.router.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.mlp.router.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.mlp.router.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.mlp.router.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.7.mlp.router.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.mlp.router.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.mlp.router.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.mlp.router.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.8.mlp.router.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.attn.tau.alpha": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.attn.tau.wq": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.attn.tau.wv": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.ln.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.ln.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.mlp.router.bias": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.mlp.router.biases": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.mlp.router.scales": "model-00001-of-00002.safetensors",
+        "text.model.blocks.9.mlp.router.weight": "model-00001-of-00002.safetensors",
+        "text.model.post_ln.bias": "model-00002-of-00002.safetensors",
+        "text.model.post_ln.weight": "model-00002-of-00002.safetensors",
+        "text.model.wte.biases": "model-00001-of-00002.safetensors",
+        "text.model.wte.scales": "model-00001-of-00002.safetensors",
+        "text.model.wte.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.13.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.14.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.15.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.16.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.17.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.18.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.19.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.20.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.21.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.22.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.23.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.24.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.25.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.26.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.3.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.4.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.5.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.6.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.7.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.8.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.attn.proj.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.attn.proj.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.attn.proj.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.attn.proj.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.attn.qkv.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.attn.qkv.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.attn.qkv.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.attn.qkv.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.ln1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.ln1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.ln2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.ln2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.blocks.9.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.patch_emb.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.patch_emb.weight": "model-00001-of-00002.safetensors",
+        "vision.encoder.pos_emb": "model-00001-of-00002.safetensors",
+        "vision.encoder.post_ln.bias": "model-00001-of-00002.safetensors",
+        "vision.encoder.post_ln.weight": "model-00001-of-00002.safetensors",
+        "vision.proj_mlp.fc1.bias": "model-00001-of-00002.safetensors",
+        "vision.proj_mlp.fc1.biases": "model-00001-of-00002.safetensors",
+        "vision.proj_mlp.fc1.scales": "model-00001-of-00002.safetensors",
+        "vision.proj_mlp.fc1.weight": "model-00001-of-00002.safetensors",
+        "vision.proj_mlp.fc2.bias": "model-00001-of-00002.safetensors",
+        "vision.proj_mlp.fc2.biases": "model-00001-of-00002.safetensors",
+        "vision.proj_mlp.fc2.scales": "model-00001-of-00002.safetensors",
+        "vision.proj_mlp.fc2.weight": "model-00001-of-00002.safetensors"
+    }
+}
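
Note on reading the index above: the sketch below is not part of this commit. It assumes the file follows the usual Hugging Face sharded-checkpoint layout, with a top-level `weight_map` object mapping tensor names to shard files (that key name is not visible in this part of the diff). Entries ending in `.scales` and `.biases` look like the per-group quantization parameters that MLX stores next to a quantized `.weight`, whereas `.bias` is the layer's ordinary additive bias. In the listing, `vision.encoder.blocks.4.mlp.fc1` carries both, while `vision.encoder.blocks.4.mlp.fc2` has only `.weight` and `.bias`, which suggests it was left unquantized. The helper names `load_weight_map` and `split_by_quantization` are hypothetical.

```python
import json
from collections import defaultdict


def load_weight_map(path):
    """Read the tensor-name -> shard mapping (assumes a top-level "weight_map" key)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["weight_map"]


def split_by_quantization(weight_map):
    """Group tensor names by layer prefix and split layers on whether a
    `.scales` entry accompanies their `.weight`."""
    leaves_by_layer = defaultdict(set)
    for name in weight_map:
        prefix, _, leaf = name.rpartition(".")
        leaves_by_layer[prefix].add(leaf)
    quantized = sorted(
        p for p, leaves in leaves_by_layer.items() if "scales" in leaves
    )
    plain = sorted(
        p
        for p, leaves in leaves_by_layer.items()
        if "weight" in leaves and "scales" not in leaves
    )
    return quantized, plain


# A few entries copied from the listing above.
shard = "model-00001-of-00002.safetensors"
sample = {
    "vision.encoder.blocks.4.mlp.fc1.bias": shard,
    "vision.encoder.blocks.4.mlp.fc1.biases": shard,
    "vision.encoder.blocks.4.mlp.fc1.scales": shard,
    "vision.encoder.blocks.4.mlp.fc1.weight": shard,
    "vision.encoder.blocks.4.mlp.fc2.bias": shard,
    "vision.encoder.blocks.4.mlp.fc2.weight": shard,
}

quantized, plain = split_by_quantization(sample)
print("quantized:", quantized)
print("plain:", plain)
```

To run this against the real file, pass `load_weight_map("model.safetensors.index.json")` to `split_by_quantization` instead of `sample`.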

moondream.py ADDED Viewed

	@@ -0,0 +1,1097 @@

+import torch
+import torch.nn as nn
+import random
+from typing import Literal, Tuple, TypedDict, Union, Dict, Any, Optional, List
+from PIL import Image
+from dataclasses import dataclass
+from tokenizers import Tokenizer
+from torch.nn.attention.flex_attention import create_block_mask
+from .config import MoondreamConfig
+from .image_crops import reconstruct_from_crops
+from .vision import vision_encoder, vision_projection, prepare_crops, build_vision_model
+from .text import build_text_model, text_encoder, lm_head, text_decoder
+from .region import (
+    decode_coordinate,
+    encode_coordinate,
+    decode_size,
+    encode_size,
+    encode_spatial_refs,
+    SpatialRefs,
+)
+from .layers import QuantizedLinear
+from .lora import load_adapter, normalize_adapter_id
+from .rope import precompute_freqs_cis
+from .utils import remove_outlier_points
+ImageEncodingSettings = TypedDict(
+    "ImageEncodingSettings",
+    {"adapter": str, "model": str},
+    total=False,
+)
+TextSamplingSettings = TypedDict(
+    "TextSamplingSettings",
+    {
+        "max_tokens": int,
+        "temperature": float,
+        "top_p": float,
+        "adapter": str,
+        "model": str,
+    },
+    total=False,
+)
+ObjectSamplingSettings = TypedDict(
+    "ObjectSamplingSettings",
+    {"max_objects": int, "adapter": str, "model": str},
+    total=False,
+)
+DEFAULT_MAX_TOKENS = 768
+DEFAULT_TEMPERATURE = 0.5
+DEFAULT_TOP_P = 0.9
+DEFAULT_MAX_OBJECTS = 150
+@dataclass(frozen=True)
+class EncodedImage:
+    pos: int
+    caches: List[Tuple[torch.Tensor, torch.Tensor]]
+class KVCache(nn.Module):
+    def __init__(self, n_heads, n_kv_heads, max_context, dim, device, dtype):
+        super().__init__()
+        cache_shape = (1, n_kv_heads, max_context, dim // n_heads)
+        self.register_buffer(
+            "k_cache", torch.zeros(*cache_shape, device=device, dtype=dtype)
+        )
+        self.register_buffer(
+            "v_cache", torch.zeros(*cache_shape, device=device, dtype=dtype)
+        )
+    def update(self, pos_ids, k, v):
+        kout, vout = self.k_cache, self.v_cache
+        kout[:, :, pos_ids, :] = k
+        vout[:, :, pos_ids, :] = v
+        return kout, vout
+def causal_mask(b, h, q_idx, kv_idx):
+    return q_idx >= kv_idx
+def get_mask_mod(mask_mod, offset):
+    def _mask_mod(b, h, q, kv):
+        return mask_mod(b, h, q + offset, kv)
+    return _mask_mod
+class MoondreamModel(nn.Module):
+    def __init__(
+        self, config: MoondreamConfig, dtype=torch.bfloat16, setup_caches=True
+    ):
+        super().__init__()
+        self.config = config
+        self.tokenizer = Tokenizer.from_pretrained("moondream/starmie-v1")
+        self.vision = build_vision_model(config.vision, dtype)
+        self.text = build_text_model(config.text, dtype)
+        # Region Model
+        linear_cls = (
+            QuantizedLinear if config.region.group_size is not None else nn.Linear
+        )
+        self.region = nn.ModuleDict(
+            {
+                "coord_encoder": linear_cls(
+                    config.region.coord_feat_dim, config.region.dim, dtype=dtype
+                ),
+                "coord_decoder": linear_cls(
+                    config.region.dim, config.region.coord_out_dim, dtype=dtype
+                ),
+                "size_encoder": linear_cls(
+                    config.region.size_feat_dim, config.region.dim, dtype=dtype
+                ),
+                "size_decoder": linear_cls(
+                    config.region.dim, config.region.size_out_dim, dtype=dtype
+                ),
+                "ln": nn.LayerNorm(config.region.dim, dtype=dtype),
+            }
+        )
+        self.region.coord_features = nn.Parameter(
+            torch.empty(config.region.coord_feat_dim // 2, 1, dtype=dtype).T
+        )
+        self.region.size_features = nn.Parameter(
+            torch.empty(config.region.size_feat_dim // 2, 2, dtype=dtype).T
+        )
+        attn_mask = torch.tril(
+            torch.ones(
+                1, 1, config.text.max_context, config.text.max_context, dtype=torch.bool
+            )
+        )
+        patch_w = config.vision.crop_size // config.vision.enc_patch_size
+        prefix_attn_len = 1 + patch_w**2
+        attn_mask[..., :prefix_attn_len, :prefix_attn_len] = 1
+        self.register_buffer("attn_mask", attn_mask, persistent=False)
+        self.use_flex_decoding = True
+        self._causal_block_mask = None
+        self._point_gen_indices = None
+        # Initialize KV caches.
+        if setup_caches:
+            self._setup_caches()
+    @property
+    def causal_block_mask(self):
+        # Build the block mask lazily on first access instead of in __init__,
+        # as a workaround for ZeroGPU.
+        if self._causal_block_mask is None:
+            self._causal_block_mask = create_block_mask(
+                causal_mask,
+                B=None,
+                H=None,
+                Q_LEN=self.config.text.max_context,
+                KV_LEN=self.config.text.max_context,
+            )
+        return self._causal_block_mask
+    @property
+    def point_gen_indices(self):
+        if self._point_gen_indices is None:
+            self._point_gen_indices = torch.tensor(
+                [self.config.tokenizer.coord_id, self.config.tokenizer.eos_id],
+                device=self.device,
+            )
+        return self._point_gen_indices
+    def _refresh_runtime_buffers(self):
+        attn_mask = torch.tril(
+            torch.ones(
+                1,
+                1,
+                self.config.text.max_context,
+                self.config.text.max_context,
+                dtype=torch.bool,
+                device=self.device,
+            )
+        )
+        patch_w = self.config.vision.crop_size // self.config.vision.enc_patch_size
+        prefix_attn_len = 1 + patch_w**2
+        attn_mask[..., :prefix_attn_len, :prefix_attn_len] = 1
+        self.attn_mask = attn_mask
+        self.text.freqs_cis = precompute_freqs_cis(
+            self.config.text.dim // (2 * self.config.text.n_heads),
+            self.config.text.max_context,
+        ).to(device=self.device)
+    def _setup_caches(self):
+        c = self.config.text
+        for b in self.text.blocks:
+            b.kv_cache = KVCache(
+                c.n_heads,
+                c.n_kv_heads,
+                c.max_context,
+                c.dim,
+                device=self.device,
+                dtype=self.vision.pos_emb.dtype,
+            )
+    def _adapter_id_from_settings(self, settings: Optional[dict]) -> Optional[str]:
+        if settings is None:
+            return None
+        adapter = settings.get("adapter")
+        if adapter is not None:
+            return normalize_adapter_id(adapter)
+        model_value = settings.get("model")
+        if isinstance(model_value, str):
+            return normalize_adapter_id(model_value)
+        return None
+    def _resolve_lora(self, settings: Optional[dict]) -> Optional[object]:
+        adapter_id = self._adapter_id_from_settings(settings)
+        if adapter_id is None:
+            return None
+        return load_adapter(
+            adapter_id,
+            text_config=self.config.text,
+            device=self.device,
+            dtype=self.vision.pos_emb.dtype,
+        )
+    @property
+    def device(self):
+        return self.vision.pos_emb.device
+    def _vis_enc(self, x: torch.Tensor):
+        return vision_encoder(x, self.vision, self.config.vision)
+    def _vis_proj(self, g: torch.Tensor, r: torch.Tensor):
+        return vision_projection(g, r, self.vision, self.config.vision)
+    def _prefill(
+        self,
+        x: torch.Tensor,
+        attn_mask: torch.Tensor,
+        pos_ids: torch.Tensor,
+        lora: Optional[torch.Tensor],
+    ):
+        return text_decoder(x, self.text, attn_mask, pos_ids, self.config.text, lora)
+    def _decode_one_tok(
+        self,
+        x: torch.Tensor,
+        attn_mask: torch.Tensor,
+        pos_ids: torch.Tensor,
+        lora: Optional[torch.Tensor],
+        lm_head_indices: Optional[torch.Tensor] = None,
+    ):
+        if self.use_flex_decoding:
+            torch._assert(pos_ids.shape[-1] == 1, "Invalid position ID shape")
+            block_index = pos_ids // self.causal_block_mask.BLOCK_SIZE[0]
+            mask = self.causal_block_mask[:, :, block_index]
+            mask.seq_lengths = (1, mask.seq_lengths[1])
+            mask.mask_mod = get_mask_mod(self.causal_block_mask.mask_mod, pos_ids[0])
+        else:
+            mask = None
+        hidden = text_decoder(
+            x,
+            self.text,
+            attn_mask,
+            pos_ids,
+            self.config.text,
+            lora=lora,
+            flex_block_mask_slice=mask,
+        )
+        logits = lm_head(hidden, self.text, indices=lm_head_indices)
+        return logits, hidden
+    def compile(self):
+        for module in self.modules():
+            if isinstance(module, QuantizedLinear):
+                module.unpack()
+        # Initialize lazy properties to avoid first-call overhead
+        self.causal_block_mask
+        self.point_gen_indices
+        # TODO: vision_projection and _prefill are not being compiled
+        self._vis_enc = torch.compile(self._vis_enc, fullgraph=True)
+        self._decode_one_tok = torch.compile(
+            self._decode_one_tok, fullgraph=True, mode="reduce-overhead"
+        )
+        # Warm up compiled methods with dummy forward passes
+        device = self.device
+        dtype = self.vision.pos_emb.dtype
+        with torch.no_grad():
+            # Warmup vision encoder
+            dummy_crops = torch.randn(1, 3, 378, 378, device=device, dtype=dtype)
+            self._vis_enc(dummy_crops)
+            # Warmup _decode_one_tok (both normal and point generation modes)
+            dummy_emb = torch.randn(
+                1, 1, self.config.text.dim, device=device, dtype=dtype
+            )
+            dummy_mask = torch.ones(
+                1, 1, self.config.text.max_context, device=device, dtype=torch.bool
+            )
+            dummy_pos_ids = torch.tensor([100], device=device, dtype=torch.long)
+            self._decode_one_tok(dummy_emb, dummy_mask, dummy_pos_ids, None)
+            self._decode_one_tok(
+                dummy_emb,
+                dummy_mask,
+                dummy_pos_ids,
+                None,
+                lm_head_indices=self.point_gen_indices,
+            )
+    def _run_vision_encoder(self, image: Image.Image) -> torch.Tensor:
+        all_crops, tiling = prepare_crops(image, self.config.vision, device=self.device)
+        torch._dynamo.mark_dynamic(all_crops, 0)
+        outputs = self._vis_enc(all_crops)
+        global_features = outputs[0]
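+        # Lay each crop's patch features out on its (patch_w x patch_w) grid.
+        patch_w = self.config.vision.crop_size // self.config.vision.enc_patch_size
+        local_features = outputs[1:].view(
+            -1,
+            patch_w,
+            patch_w,
+            self.config.vision.enc_dim,
+        )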
+        reconstructed = reconstruct_from_crops(
+            local_features,
+            tiling,
+            patch_size=1,
+            overlap_margin=self.config.vision.overlap_margin,
+        )
+        return self._vis_proj(global_features, reconstructed)
+    def encode_image(
+        self,
+        image: Union[Image.Image, EncodedImage],
+        settings: Optional[ImageEncodingSettings] = None,
+    ) -> EncodedImage:
+        if isinstance(image, EncodedImage):
+            return image
+        elif not isinstance(image, Image.Image):
+            raise ValueError("image must be a PIL Image or EncodedImage")
+        lora = self._resolve_lora(settings)
+        # Run through text model in addition to the vision encoder, to minimize
+        # re-computation if multiple queries are performed on this image.
+        with torch.inference_mode():
+            img_emb = self._run_vision_encoder(image)
+            bos_emb = text_encoder(
+                torch.tensor([[self.config.tokenizer.bos_id]], device=self.device),
+                self.text,
+            )
+            inputs_embeds = torch.cat([bos_emb, img_emb[None]], dim=1)
+            mask = self.attn_mask[:, :, 0 : inputs_embeds.size(1), :]
+            pos_ids = torch.arange(
+                inputs_embeds.size(1), dtype=torch.long, device=self.device
+            )
+            self._prefill(inputs_embeds, mask, pos_ids, lora)
+        return EncodedImage(
+            pos=inputs_embeds.size(1),
+            caches=[
+                (
+                    b.kv_cache.k_cache[:, :, : inputs_embeds.size(1), :].clone(),
+                    b.kv_cache.v_cache[:, :, : inputs_embeds.size(1), :].clone(),
+                )
+                for b in self.text.blocks
+            ],
+        )
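+    def _apply_top_p(self, probs: torch.Tensor, top_p: float):
+        # Nucleus (top-p) filtering: zero out every token whose higher-ranked
+        # tokens already cover more than top_p of the mass, then renormalize.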
+        probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
+        probs_sum = torch.cumsum(probs_sort, dim=-1)
+        mask = probs_sum - probs_sort > top_p
+        probs_sort[mask] = 0.0
+        probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
+        next_probs = torch.zeros_like(probs)
+        next_probs.scatter_(dim=-1, index=probs_idx, src=probs_sort)
+        return next_probs
+    def _prefill_prompt(
+        self,
+        prompt_tokens: torch.Tensor,
+        pos: int,
+        temperature: float,
+        top_p: float,
+        spatial_refs: Optional[SpatialRefs] = None,
+        attn_mask: Optional[torch.Tensor] = None,
+        lora: Optional[dict] = None,
+    ):
+        with torch.inference_mode():
+            prompt_emb = text_encoder(prompt_tokens, self.text)
+            if spatial_refs:
+                encoded_refs = encode_spatial_refs(spatial_refs, self.region)
+                prompt_emb[prompt_tokens == self.config.tokenizer.coord_id] = (
+                    encoded_refs["coords"]
+                )
+                if encoded_refs["sizes"] is not None:
+                    prompt_emb[prompt_tokens == self.config.tokenizer.size_id] = (
+                        encoded_refs["sizes"]
+                    )
+            torch._dynamo.mark_dynamic(prompt_emb, 1)
+            if attn_mask is None:
+                attn_mask = self.attn_mask
+            mask = attn_mask[:, :, pos : pos + prompt_emb.size(1), :]
+            pos_ids = torch.arange(
+                pos, pos + prompt_emb.size(1), dtype=torch.long, device=self.device
+            )
+            hidden_BC = self._prefill(prompt_emb, mask, pos_ids, lora)
+            logits_BV = lm_head(hidden_BC, self.text)
+            if temperature == 0:
+                next_token = torch.argmax(logits_BV, dim=-1).unsqueeze(1)
+            else:
+                probs = torch.softmax(logits_BV / temperature, dim=-1)
+                probs = self._apply_top_p(probs, top_p)
+                next_token = torch.multinomial(probs, num_samples=1)
+        pos = pos + prompt_emb.size(1)
+        return logits_BV, hidden_BC, next_token, pos
+    def _generate_reasoning(
+        self,
+        prompt_tokens,
+        pos,
+        settings: Optional[TextSamplingSettings] = None,
+        spatial_refs: Optional[SpatialRefs] = None,
+        attn_mask: Optional[torch.Tensor] = None,
+    ) -> Tuple[int, str, List[dict]]:
+        max_tokens = (
+            settings.get("max_tokens", DEFAULT_MAX_TOKENS)
+            if settings
+            else DEFAULT_MAX_TOKENS
+        )
+        temperature = (
+            settings.get("temperature", DEFAULT_TEMPERATURE)
+            if settings
+            else DEFAULT_TEMPERATURE
+        )
+        lora = self._resolve_lora(settings)
+        top_p = settings.get("top_p", DEFAULT_TOP_P) if settings else DEFAULT_TOP_P
+        eos_id = self.config.tokenizer.answer_id
+        _, last_hidden_BC, next_token, pos = self._prefill_prompt(
+            prompt_tokens,
+            pos,
+            temperature,
+            top_p,
+            spatial_refs,
+            attn_mask=attn_mask,
+            lora=lora,
+        )
+        text_token_chunks = [[]]
+        grounding_chunks = [[]]
+        mask = torch.zeros(
+            1, 1, self.config.text.max_context, device=self.device, dtype=torch.bool
+        )
+        mask[:, :, :pos] = 1
+        pos_ids = torch.tensor([pos], device=self.device, dtype=torch.long)
+        generated_tokens = 0
+        while (
+            next_token_id := next_token.item()
+        ) != eos_id and generated_tokens < max_tokens:
+            if (
+                next_token_id == self.config.tokenizer.start_ground_points_id
+                or next_token_id == self.config.tokenizer.end_ground_id
+            ):
+                text_token_chunks.append([])
+                grounding_chunks.append([])
+            text_token_chunks[-1].append(next_token_id)
+            with torch.inference_mode():
+                if next_token_id == self.config.tokenizer.coord_id:
+                    coord_logits = decode_coordinate(last_hidden_BC, self.region)
+                    coord = torch.argmax(coord_logits, dim=-1) / coord_logits.size(-1)
+                    grounding_chunks[-1].append(coord.item())
+                    next_emb = encode_coordinate(
+                        coord.to(dtype=coord_logits.dtype), self.region
+                    ).unsqueeze(0)
+                else:
+                    next_emb = text_encoder(next_token, self.text)
+                mask[:, :, pos], pos_ids[0] = 1, pos
+                logits_BV, last_hidden_BC = self._decode_one_tok(
+                    next_emb, mask, pos_ids, lora
+                )
+                logits_BV[:, self.config.tokenizer.eos_id] = float("-inf")
+                logits_BV[:, self.config.tokenizer.size_id] = float("-inf")
+                pos += 1
+                if temperature == 0:
+                    next_token = torch.argmax(logits_BV, dim=-1).unsqueeze(1)  # (1, 1)
+                else:
+                    probs = torch.softmax(logits_BV / temperature, dim=-1)  # (1, V)
+                    probs = self._apply_top_p(probs, top_p)
+                    next_token = torch.multinomial(probs, num_samples=1)  # (1, 1)
+                generated_tokens += 1
+        text_chunks = [
+            self.tokenizer.decode(chunk_tokens) for chunk_tokens in text_token_chunks
+        ]
+        text = "".join(text_chunks)
+        start_idx = 0
+        grounding = []
+        for text_chunk, grounding_chunk in zip(text_chunks, grounding_chunks):
+            if len(grounding_chunk) > 1:
+                points = []
+                for i in range(0, len(grounding_chunk) - (len(grounding_chunk) % 2), 2):
+                    points.append((grounding_chunk[i], grounding_chunk[i + 1]))
+                grounding.append(
+                    {
+                        "start_idx": start_idx,
+                        "end_idx": start_idx + len(text_chunk),
+                        "points": points,
+                    }
+                )
+            start_idx += len(text_chunk)
+        return pos, text, grounding
+    def _generate_answer(
+        self,
+        prompt_tokens: torch.Tensor,
+        pos: int,
+        settings: Optional[TextSamplingSettings] = None,
+        spatial_refs: Optional[SpatialRefs] = None,
+        eos_id: Optional[int] = None,
+        attn_mask: Optional[torch.Tensor] = None,
+    ):
+        max_tokens = (
+            settings.get("max_tokens", DEFAULT_MAX_TOKENS)
+            if settings
+            else DEFAULT_MAX_TOKENS
+        )
+        temperature = (
+            settings.get("temperature", DEFAULT_TEMPERATURE)
+            if settings
+            else DEFAULT_TEMPERATURE
+        )
+        top_p = settings.get("top_p", DEFAULT_TOP_P) if settings else DEFAULT_TOP_P
+        eos_id = eos_id if eos_id is not None else self.config.tokenizer.eos_id
+        lora = self._resolve_lora(settings)
+        _, _, next_token, pos = self._prefill_prompt(
+            prompt_tokens,
+            pos,
+            temperature,
+            top_p,
+            spatial_refs,
+            attn_mask=attn_mask,
+            lora=lora,
+        )
+        def generator(next_token, pos):
+            mask = torch.zeros(
+                1, 1, self.config.text.max_context, device=self.device, dtype=torch.bool
+            )
+            mask[:, :, :pos] = 1
+            pos_ids = torch.tensor([pos], device=self.device, dtype=torch.long)
+            generated_tokens = 0
+            # For properly handling token streaming with Unicode
+            token_cache = []
+            print_len = 0
+            while (
+                next_token_id := next_token.item()
+            ) != eos_id and generated_tokens < max_tokens:
+                # Add token to our cache
+                token_cache.append(next_token_id)
+                # Decode all tokens collected so far
+                text = self.tokenizer.decode(token_cache)
+                # After a newline, we flush the cache completely
+                if text.endswith("\n"):
+                    printable_text = text[print_len:]
+                    token_cache = []
+                    print_len = 0
+                    if printable_text:
+                        yield printable_text
+                # If the last token is a CJK character, we can safely print it
+                elif len(text) > 0 and _is_cjk_char(ord(text[-1])):
+                    printable_text = text[print_len:]
+                    print_len += len(printable_text)
+                    if printable_text:
+                        yield printable_text
+                # Otherwise, only yield up to the last space to avoid cutting words
+                else:
+                    last_space_idx = text.rfind(" ", print_len)
+                    if last_space_idx >= print_len:
+                        printable_text = text[print_len : last_space_idx + 1]
+                        print_len += len(printable_text)
+                        if printable_text:
+                            yield printable_text
+                with torch.inference_mode():
+                    next_emb = text_encoder(next_token, self.text)
+                    mask[:, :, pos], pos_ids[0] = 1, pos
+                    logits_BV, _ = self._decode_one_tok(next_emb, mask, pos_ids, lora)
+                    logits_BV[:, self.config.tokenizer.answer_id] = float("-inf")
+                    pos += 1
+                    if temperature == 0:
+                        next_token = torch.argmax(logits_BV, dim=-1).unsqueeze(
+                            1
+                        )  # (1, 1)
+                    else:
+                        probs = torch.softmax(logits_BV / temperature, dim=-1)  # (1, V)
+                        probs = self._apply_top_p(probs, top_p)
+                        next_token = torch.multinomial(probs, num_samples=1)  # (1, 1)
+                    generated_tokens += 1
+            # Flush any remaining text in the cache
+            if token_cache:
+                text = self.tokenizer.decode(token_cache)
+                printable_text = text[print_len:]
+                if printable_text:
+                    yield printable_text
+        return generator(next_token, pos)
+    def query(
+        self,
+        image: Optional[Union[Image.Image, EncodedImage]] = None,
+        question: Optional[str] = None,
+        reasoning: bool = True,
+        spatial_refs: Optional[SpatialRefs] = None,
+        stream: bool = False,
+        settings: Optional[TextSamplingSettings] = None,
+    ):
+        if self.config.tokenizer.templates["query"] is None:
+            raise NotImplementedError("Model does not support querying.")
+        if question is None:
+            raise ValueError("question must be provided.")
+        if spatial_refs and image is None:
+            raise ValueError("spatial_refs can only be used with an image.")
+        attn_mask = self.attn_mask
+        if image is not None:
+            image = self.encode_image(image, settings)
+            self.load_encoded_image(image)
+            pos = image.pos
+            prompt_toks = self.config.tokenizer.templates["query"]["prefix"]
+        else:
+            self._setup_caches()
+            pos = 0
+            prompt_toks = [
+                self.config.tokenizer.bos_id
+            ] + self.config.tokenizer.templates["query"]["prefix"]
+            max_context = self.config.text.max_context
+            attn_mask = torch.tril(
+                torch.ones(1, 1, max_context, max_context, dtype=torch.bool)
+            ).to(self.device)
+        spatial_toks = []
+        if spatial_refs:
+            for ref in spatial_refs:
+                coord_id = self.config.tokenizer.coord_id
+                size_id = self.config.tokenizer.size_id
+                if len(ref) == 2:
+                    spatial_toks.extend([coord_id, coord_id])
+                else:
+                    spatial_toks.extend([coord_id, coord_id, size_id])
+        prompt_tokens = [
+            prompt_toks + spatial_toks + self.tokenizer.encode(question).ids
+        ]
+        if reasoning:
+            prompt_tokens[0] += [self.config.tokenizer.thinking_id]
+            prompt_tokens = torch.tensor(prompt_tokens, device=self.device)
+            pos, reasoning_text, reasoning_grounding = self._generate_reasoning(
+                prompt_tokens, pos, settings, spatial_refs, attn_mask=attn_mask
+            )
+            prompt_tokens = [self.config.tokenizer.templates["query"]["suffix"]]
+            reasoning_dict = {
+                "reasoning": {"text": reasoning_text, "grounding": reasoning_grounding}
+            }
+            spatial_refs = None
+        else:
+            prompt_tokens[0] += self.config.tokenizer.templates["query"]["suffix"]
+            reasoning_dict = {}
+        prompt_tokens = torch.tensor(prompt_tokens, device=self.device)
+        def generator():
+            for token in self._generate_answer(
+                prompt_tokens, pos, settings, spatial_refs, attn_mask=attn_mask
+            ):
+                yield token
+        if stream:
+            return {**reasoning_dict, "answer": generator()}
+        else:
+            return {**reasoning_dict, "answer": "".join(list(generator()))}
+    def load_encoded_image(self, encoded_image: EncodedImage):
+        for b, (k, v) in zip(self.text.blocks, encoded_image.caches):
+            b.kv_cache.k_cache[:, :, : k.size(2), :] = k
+            b.kv_cache.v_cache[:, :, : v.size(2), :] = v
+    def caption(
+        self,
+        image: Union[Image.Image, EncodedImage],
+        length: Literal["normal", "short", "long"] = "normal",
+        stream: bool = False,
+        settings: Optional[TextSamplingSettings] = None,
+    ):
+        if self.config.tokenizer.templates["caption"] is None:
+            raise NotImplementedError("Model does not support captioning.")
+        if length not in self.config.tokenizer.templates["caption"]:
+            raise ValueError(f"Model does not support caption length '{length}'.")
+        image = self.encode_image(image, settings)
+        self.load_encoded_image(image)
+        prompt_tokens = torch.tensor(
+            [self.config.tokenizer.templates["caption"][length]], device=self.device
+        )
+        def generator():
+            for token in self._generate_answer(prompt_tokens, image.pos, settings):
+                yield token
+        if stream:
+            return {"caption": generator()}
+        else:
+            return {"caption": "".join(list(generator()))}
+    def _generate_points(
+        self,
+        hidden: torch.Tensor,
+        next_token: torch.Tensor,
+        pos: int,
+        include_size: bool = True,
+        max_objects: int = DEFAULT_MAX_OBJECTS,
+        lora: Optional[dict] = None,
+    ):
+        out = []
+        mask = torch.zeros(
+            1, 1, self.config.text.max_context, device=self.device, dtype=torch.bool
+        )
+        mask[:, :, :pos] = 1
+        pos_ids = torch.tensor([pos], device=self.device, dtype=torch.long)
+        with torch.inference_mode():
+            while (
+                next_token.item() != self.config.tokenizer.eos_id
+                and len(out) < max_objects
+            ):
+                x_logits = decode_coordinate(hidden, self.region)
+                x_center = torch.argmax(x_logits, dim=-1) / x_logits.size(-1)
+                next_emb = encode_coordinate(
+                    x_center.to(dtype=x_logits.dtype), self.region
+                ).unsqueeze(0)
+                # Decode y-coordinate
+                mask[:, :, pos], pos_ids[0] = 1, pos
+                _, hidden = self._decode_one_tok(next_emb, mask, pos_ids, lora)
+                pos += 1
+                y_logits = decode_coordinate(hidden, self.region)
+                y_center = torch.argmax(y_logits, dim=-1) / y_logits.size(-1)
+                next_emb = encode_coordinate(
+                    y_center.to(dtype=y_logits.dtype), self.region
+                ).unsqueeze(0)
+                # Decode size
+                if include_size:
+                    mask[:, :, pos], pos_ids[0] = 1, pos
+                    logits, hidden = self._decode_one_tok(next_emb, mask, pos_ids, lora)
+                    pos += 1
+                    size_logits = decode_size(hidden, self.region)
+                    # Get bin indices from the logits
+                    w_bin = torch.argmax(size_logits[0], dim=-1)
+                    h_bin = torch.argmax(size_logits[1], dim=-1)
+                    # Convert from bin indices to actual size values using the inverse of the log-scale mapping
+                    # Formula: size = 2^((bin / 1023.0) * 10.0 - 10.0)
+                    w = torch.pow(2.0, (w_bin.float() / 1023.0) * 10.0 - 10.0)
+                    h = torch.pow(2.0, (h_bin.float() / 1023.0) * 10.0 - 10.0)
+                    next_emb = (
+                        encode_size(
+                            torch.tensor(
+                                [w, h], device=self.device, dtype=size_logits.dtype
+                            ),
+                            self.region,
+                        )
+                        .unsqueeze(0)
+                        .unsqueeze(0)
+                    )
+                    # Add object
+                    out.append(
+                        {
+                            "x_min": x_center.item() - w.item() / 2,
+                            "y_min": y_center.item() - h.item() / 2,
+                            "x_max": x_center.item() + w.item() / 2,
+                            "y_max": y_center.item() + h.item() / 2,
+                        }
+                    )
+                else:
+                    out.append({"x": x_center.item(), "y": y_center.item()})
+                # Decode next token (x-coordinate, or eos)
+                mask[:, :, pos], pos_ids[0] = 1, pos
+                logits, hidden = self._decode_one_tok(
+                    next_emb,
+                    mask,
+                    pos_ids,
+                    lora,
+                    lm_head_indices=self.point_gen_indices,
+                )
+                pos += 1
+                # Map back: index 0 -> coord_id, index 1 -> eos_id
+                next_token_idx = torch.argmax(logits, dim=-1)
+                next_token = self.point_gen_indices[next_token_idx]
+        return out
+    def detect(
+        self,
+        image: Union[Image.Image, EncodedImage],
+        object: str,
+        settings: Optional[ObjectSamplingSettings] = None,
+    ):
+        if self.config.tokenizer.templates["detect"] is None:
+            raise NotImplementedError("Model does not support object detection.")
+        image = self.encode_image(image, settings)
+        self.load_encoded_image(image)
+        prompt_tokens = torch.tensor(
+            [
+                self.config.tokenizer.templates["detect"]["prefix"]
+                + self.tokenizer.encode(" " + object).ids
+                + self.config.tokenizer.templates["detect"]["suffix"]
+            ],
+            device=self.device,
+        )
+        lora = self._resolve_lora(settings)
+        _, hidden, next_token, pos = self._prefill_prompt(
+            prompt_tokens, image.pos, temperature=0, top_p=0, lora=lora
+        )
+        hidden = hidden[:, -1:, :]
+        max_objects = (
+            settings.get("max_objects", DEFAULT_MAX_OBJECTS)
+            if settings
+            else DEFAULT_MAX_OBJECTS
+        )
+        objects = self._generate_points(
+            hidden,
+            next_token,
+            pos,
+            include_size=True,
+            max_objects=max_objects,
+            lora=lora,
+        )
+        return {"objects": objects}
+    def point(
+        self,
+        image: Union[Image.Image, EncodedImage],
+        object: str,
+        settings: Optional[ObjectSamplingSettings] = None,
+    ):
+        if self.config.tokenizer.templates["point"] is None:
+            raise NotImplementedError("Model does not support pointing.")
+        image = self.encode_image(image, settings)
+        self.load_encoded_image(image)
+        prompt_tokens = torch.tensor(
+            [
+                self.config.tokenizer.templates["point"]["prefix"]
+                + self.tokenizer.encode(" " + object).ids
+                + self.config.tokenizer.templates["point"]["suffix"]
+            ],
+            device=self.device,
+        )
+        lora = self._resolve_lora(settings)
+        _, hidden, next_token, pos = self._prefill_prompt(
+            prompt_tokens, image.pos, temperature=0, top_p=0, lora=lora
+        )
+        hidden = hidden[:, -1:, :]
+        max_objects = (
+            settings.get("max_objects", DEFAULT_MAX_OBJECTS)
+            if settings
+            else DEFAULT_MAX_OBJECTS
+        )
+        objects = self._generate_points(
+            hidden,
+            next_token,
+            pos,
+            include_size=False,
+            max_objects=max_objects,
+            lora=lora,
+        )
+        return {"points": objects}
+    def _detect_gaze(
+        self,
+        image: EncodedImage,
+        source: Tuple[float, float],
+        force_detect: bool = False,
+    ):
+        with torch.inference_mode():
+            before_emb = text_encoder(
+                torch.tensor(
+                    [self.tokenizer.encode("\n\nPoint:").ids], device=self.device
+                ),
+                self.text,
+            )
+            after_emb = text_encoder(
+                torch.tensor(
+                    [self.tokenizer.encode(" gaze\n\n").ids], device=self.device
+                ),
+                self.text,
+            )
+            x_emb = encode_coordinate(
+                torch.tensor([[[source[0]]]], device=self.device, dtype=torch.bfloat16),
+                self.region,
+            )
+            y_emb = encode_coordinate(
+                torch.tensor([[[source[1]]]], device=self.device, dtype=torch.bfloat16),
+                self.region,
+            )
+            prompt_emb = torch.cat([before_emb, x_emb, y_emb, after_emb], dim=1)
+            self.load_encoded_image(image)
+            mask = self.attn_mask[:, :, image.pos : image.pos + prompt_emb.size(1), :]
+            pos_ids = torch.arange(
+                image.pos,
+                image.pos + prompt_emb.size(1),
+                dtype=torch.long,
+                device=self.device,
+            )
+            hidden = self._prefill(prompt_emb, mask, pos_ids, lora=None)
+            logits = lm_head(hidden, self.text)
+            next_token = torch.argmax(logits, dim=-1)
+            pos = image.pos + prompt_emb.size(1)
+            hidden = hidden[:, -1:, :]
+            if force_detect:
+                next_token = torch.tensor([[0]], device=self.device)
+            if next_token.item() == self.config.tokenizer.eos_id:
+                return None
+            gaze = self._generate_points(
+                hidden, next_token, pos, include_size=False, max_objects=1
+            )
+            return gaze[0]
+    def detect_gaze(
+        self,
+        image: Union[Image.Image, EncodedImage],
+        eye: Optional[Tuple[float, float]] = None,
+        face: Optional[Dict[str, float]] = None,
+        unstable_settings: Optional[Dict[str, Any]] = None,
+    ):
+        # Avoid a shared mutable default argument; treat None as "no settings".
+        unstable_settings = unstable_settings or {}
+        force_detect = unstable_settings.get("force_detect", False)
+        prioritize_accuracy = unstable_settings.get("prioritize_accuracy", False)
+        if not prioritize_accuracy:
+            if eye is None:
+                raise ValueError("eye must be provided when prioritize_accuracy=False")
+            image = self.encode_image(image)
+            return {"gaze": self._detect_gaze(image, eye, force_detect=force_detect)}
+        else:
+            if (
+                not isinstance(image, Image.Image)
+                and "flip_enc_img" not in unstable_settings
+            ):
+                raise ValueError(
+                    "image must be a PIL Image when prioritize_accuracy=True, "
+                    "or flip_enc_img must be provided"
+                )
+            if face is None:
+                raise ValueError("face must be provided when prioritize_accuracy=True")
+            encoded_image = self.encode_image(image)
+            if (
+                isinstance(image, Image.Image)
+                and "flip_enc_img" not in unstable_settings
+            ):
+                flipped_pil = image.copy()
+                flipped_pil = flipped_pil.transpose(method=Image.FLIP_LEFT_RIGHT)
+                encoded_flipped_image = self.encode_image(flipped_pil)
+            else:
+                encoded_flipped_image = unstable_settings["flip_enc_img"]
+            N = 10
+            detections = [
+                self._detect_gaze(
+                    encoded_image,
+                    (
+                        random.uniform(face["x_min"], face["x_max"]),
+                        random.uniform(face["y_min"], face["y_max"]),
+                    ),
+                    force_detect=force_detect,
+                )
+                for _ in range(N)
+            ]
+            detections = [
+                (gaze["x"], gaze["y"]) for gaze in detections if gaze is not None
+            ]
+            flipped_detections = [
+                self._detect_gaze(
+                    encoded_flipped_image,
+                    (
+                        1 - random.uniform(face["x_min"], face["x_max"]),
+                        random.uniform(face["y_min"], face["y_max"]),
+                    ),
+                    force_detect=force_detect,
+                )
+                for _ in range(N)
+            ]
+            detections.extend(
+                [
+                    (1 - gaze["x"], gaze["y"])
+                    for gaze in flipped_detections
+                    if gaze is not None
+                ]
+            )
+            if len(detections) < N:
+                return {"gaze": None}
+            detections = remove_outlier_points(detections)
+            mean_gaze = (
+                sum(gaze[0] for gaze in detections) / len(detections),
+                sum(gaze[1] for gaze in detections) / len(detections),
+            )
+            return {"gaze": {"x": mean_gaze[0], "y": mean_gaze[1]}}
+def _is_cjk_char(cp):
+    """Checks whether CP is the codepoint of a CJK character."""
+    # This defines a "chinese character" as anything in the CJK Unicode block:
+    # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+    if (
+        (cp >= 0x4E00 and cp <= 0x9FFF)
+        or (cp >= 0x3400 and cp <= 0x4DBF)
+        or (cp >= 0x2F800 and cp <= 0x2FA1F)
+    ):
+        return True
+    return False
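
The streaming generator in moondream.py holds back partial words: it yields text only up to the last space and flushes whatever remains once generation ends. The following is a minimal sketch of that buffering rule. The name `stream_whole_words` and the use of already-decoded string pieces (rather than re-decoding a token cache with the tokenizer) are assumptions made for illustration, and the other branches of the original generator are omitted.

```python
from typing import Iterable, Iterator


def stream_whole_words(pieces: Iterable[str]) -> Iterator[str]:
    """Yield text in chunks that end on a space, then flush the remainder.

    `pieces` stands in for incrementally decoded model output (hypothetical input).
    """
    text = ""
    print_len = 0  # number of characters already yielded
    for piece in pieces:
        text += piece
        # Only emit up to the last space so a word is never cut in half.
        last_space_idx = text.rfind(" ", print_len)
        if last_space_idx >= print_len:
            chunk = text[print_len : last_space_idx + 1]
            print_len += len(chunk)
            yield chunk
    # Flush any text still held back when the stream ends.
    if text[print_len:]:
        yield text[print_len:]


chunks = list(stream_whole_words(["Hel", "lo wor", "ld", "!"]))
```

Joining the yielded chunks always reproduces the full text, so a caller can print them as they arrive without re-rendering earlier output.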

region.py ADDED

	@@ -0,0 +1,136 @@

+import torch
+import torch.nn as nn
+import math
+from typing import List, Tuple, Union
+SpatialRefs = List[Union[Tuple[float, float], Tuple[float, float, float, float]]]
+def fourier_features(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
+    """
+    Applies Fourier feature mapping to input tensor x using frequency matrix w. This
+    projects inputs through sinusoidal functions to create higher dimensional features
+    that help mitigate spectral bias - the tendency of neural networks to learn
+    low-frequency functions more easily than high-frequency ones. By explicitly
+    mapping inputs to higher frequencies through sin/cos transformations, we enable
+    better learning of fine details and higher frequency patterns.
+    Args:
+        x: Input tensor to transform
+        w: Matrix of frequencies for the Fourier features transformation
+    Returns:
+        Concatenated cosine and sine transformed features as a tensor
+    """
+    f = 2 * math.pi * x @ w
+    return torch.cat([f.cos(), f.sin()], dim=-1)
+def encode_coordinate(coord: torch.Tensor, w: nn.Module) -> torch.Tensor:
+    """
+    Takes as input a tensor containing a single float coordinate value (x or y)
+    and encodes it into hidden states for input to the text model.
+    Args:
+        coord: Tensor with single float coordinate value
+    Returns:
+        Encoded hidden states tensor for input to text model
+    """
+    return w.coord_encoder(fourier_features(coord, w.coord_features))
+def decode_coordinate(hidden_state: torch.Tensor, w: nn.Module) -> torch.Tensor:
+    """
+    Takes as input the last hidden state from the text model and outputs logits over
+    the coordinate bins for either an x or y coordinate prediction.
+    Args:
+        hidden_state: The final hidden state tensor from the text model.
+    Returns:
+        Logits over the coordinate bins for the predicted coordinate (x or y); the
+        caller takes the argmax and divides by the number of bins.
+    """
+    hidden_state = w.ln(hidden_state)
+    return w.coord_decoder(hidden_state)
+def encode_size(size: torch.Tensor, w: nn.Module) -> torch.Tensor:
+    """
+    Takes a tensor containing width and height values and encodes them into
+    hidden states for input to the text model.
+    Args:
+        size: Tensor with two floats for width and height
+    Returns:
+        Encoded hidden states tensor for input to text model
+    """
+    return w.size_encoder(fourier_features(size, w.size_features))
+def decode_size(hidden_state: torch.Tensor, w: nn.Module) -> torch.Tensor:
+    """
+    Takes as input the last hidden state from the text model and outputs logits
+    for 1024 bins representing width and height in log-scale.
+    The bins are distributed according to the formula:
+    bin = (log2(size) + 10.0) / 10.0 * 1023.0
+    where size values are clamped to be at least 1/1024.
+    To convert from bin back to size:
+    size = 2^((bin / 1023.0) * 10.0 - 10.0)
+    Args:
+        hidden_state: The final hidden state tensor from the text model.
+    Returns:
+        A tensor containing logits for 1024 bins for width and height.
+        Shape is (2, 1024) where the first dimension corresponds to width and height.
+    """
+    hidden_state = w.ln(hidden_state)
+    return w.size_decoder(hidden_state).view(2, -1)
+def encode_spatial_refs(spatial_refs: SpatialRefs, w: nn.Module) -> torch.Tensor:
+    """
+    Takes a list of spatial references (points or regions) and encodes them into
+    hidden states for input to the text model.
+    Args:
+        spatial_refs: List of spatial references (points or boxes)
+            - Points are represented as normalized (x, y) tuples
+            - Boxes are represented as normalized (x_min, y_min, x_max, y_max) tuples
+    Returns:
+        {"coords": torch.Tensor, "sizes": Optional[torch.Tensor]}
+    """
+    coords, sizes = [], []
+    for ref in spatial_refs:
+        if len(ref) == 2:
+            coords.append(ref[0])
+            coords.append(ref[1])
+        else:
+            x_c = (ref[0] + ref[2]) / 2
+            y_c = (ref[1] + ref[3]) / 2
+            width = ref[2] - ref[0]
+            height = ref[3] - ref[1]
+            coords.append(x_c)
+            coords.append(y_c)
+            sizes.append([width, height])
+    coords = torch.tensor(
+        coords, device=w.coord_features.device, dtype=w.coord_features.dtype
+    ).view(-1, 1)
+    coords = encode_coordinate(coords, w)
+    if sizes:
+        sizes = torch.tensor(
+            sizes, device=w.size_features.device, dtype=w.size_features.dtype
+        )
+        sizes = encode_size(sizes, w)
+    else:
+        sizes = None
+    return {"coords": coords, "sizes": sizes}
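
The `decode_size` docstring describes a log-scale mapping between sizes and 1024 bins. As a rough illustration, the sketch below implements both directions in plain Python. The helper names `size_to_bin` and `bin_to_size` are hypothetical, and rounding to the nearest bin is an assumption: the docstring gives the formula but does not say how fractional bins are resolved.

```python
import math

N_BINS = 1024  # number of size bins, per the decode_size docstring


def size_to_bin(size: float) -> int:
    """Map a normalized size to a bin index on a log2 scale."""
    size = max(size, 1.0 / 1024.0)  # clamp as described in the docstring
    # Rounding is an assumption; the source only states the continuous formula.
    return round((math.log2(size) + 10.0) / 10.0 * (N_BINS - 1))


def bin_to_size(bin_idx: int) -> float:
    """Inverse mapping: size = 2^((bin / 1023) * 10 - 10)."""
    return 2.0 ** ((bin_idx / (N_BINS - 1)) * 10.0 - 10.0)


smallest = bin_to_size(0)       # 2^-10
largest = bin_to_size(1023)     # 2^0
```

Because the scale is logarithmic, small boxes get much finer size resolution than large ones, which is usually where precision matters most.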

rope.py ADDED

	@@ -0,0 +1,47 @@

+# Ethically sourced from https://github.com/xjdr-alt/entropix
+import torch
+def precompute_freqs_cis(
+    dim: int,
+    end: int,
+    theta: float = 1500000.0,
+    dtype: torch.dtype = torch.float32,
+) -> torch.Tensor:
+    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=dtype)[: (dim // 2)] / dim))
+    t = torch.arange(end, dtype=dtype).unsqueeze(1)
+    freqs = t * freqs.unsqueeze(0)
+    freqs = torch.exp(1j * freqs)
+    return torch.stack([freqs.real, freqs.imag], dim=-1)
+def apply_rotary_emb(
+    x: torch.Tensor,
+    freqs_cis: torch.Tensor,
+    position_ids: torch.Tensor,
+    num_heads: int,
+    rot_dim: int = 32,
+    interleave: bool = False,
+) -> torch.Tensor:
+    assert rot_dim == freqs_cis.shape[-2] * 2
+    assert num_heads == x.shape[1]
+    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
+    if interleave:
+        xq_r = x_rot.float().reshape(*x_rot.shape[:-1], -1, 2)[..., 0]
+        xq_i = x_rot.float().reshape(*x_rot.shape[:-1], -1, 2)[..., 1]
+    else:
+        d_q = x_rot.shape[-1] // 2
+        xq_r, xq_i = x_rot[..., :d_q], x_rot[..., d_q:]
+    freqs_cos = freqs_cis[..., 0][position_ids, :].unsqueeze(0).unsqueeze(0)
+    freqs_sin = freqs_cis[..., 1][position_ids, :].unsqueeze(0).unsqueeze(0)
+    # Complex multiplication: (a + bi) * (c + di) = (ac - bd) + (ad + bc)i
+    xq_out_r = xq_r * freqs_cos - xq_i * freqs_sin
+    xq_out_i = xq_r * freqs_sin + xq_i * freqs_cos
+    xq_out = torch.stack((xq_out_r, xq_out_i), dim=-1).flatten(-2)
+    return torch.cat([xq_out.to(x.dtype), x_pass], dim=-1)
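
To make the rotation in `apply_rotary_emb` concrete, here is a small pure-Python sketch of the non-interleaved path: the first half of the vector holds the real parts, the second half the imaginary parts, and the rotated pairs are written out interleaved, mirroring the `stack(...).flatten(-2)` step. The helper names (`rope_inv_freqs`, `rotate`, `dot`) and the toy vectors are assumptions for illustration; the real code operates on batched tensors.

```python
import math


def rope_inv_freqs(rot_dim, theta=1500000.0):
    # One inverse frequency per (real, imag) pair, as in precompute_freqs_cis.
    return [1.0 / (theta ** (i / rot_dim)) for i in range(0, rot_dim, 2)]


def rotate(vec, pos, inv_freqs):
    # Split into halves (real parts first), rotate each pair by pos * freq.
    half = len(vec) // 2
    real, imag = vec[:half], vec[half:]
    out = []
    for a, b, f in zip(real, imag, inv_freqs):
        c, s = math.cos(pos * f), math.sin(pos * f)
        # (a + bi) * (c + si) = (ac - bs) + (as + bc)i, written out interleaved.
        out.extend([a * c - b * s, a * s + b * c])
    return out


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
freqs = rope_inv_freqs(len(q))
# Both pairs of positions are 2 apart, so the attention scores should agree.
score_a = dot(rotate(q, 5, freqs), rotate(k, 3, freqs))
score_b = dot(rotate(q, 12, freqs), rotate(k, 10, freqs))
```

The two scores agree up to floating-point error because the rotated dot product depends only on the offset between positions, which is the property that makes rotary embeddings a relative position encoding.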

text.py ADDED

	@@ -0,0 +1,223 @@

+import torch
+import torch.nn as nn
+from torch.nn import functional as F
+from torch.nn.attention.flex_attention import flex_attention
+from typing import Optional
+from .layers import layer_norm, mlp, QuantizedLinear, moe_mlp
+from .rope import apply_rotary_emb, precompute_freqs_cis
+from .config import TextConfig
+from .lora import select_layer_lora
+def text_encoder(input_ids: torch.Tensor, w: nn.Module):
+    return F.embedding(input_ids, w.wte)
+def attn(
+    x: torch.Tensor,
+    w: nn.Module,
+    freqs_cis: torch.Tensor,
+    kv_cache: nn.Module,
+    attn_mask: torch.Tensor,
+    n_heads: int,
+    n_kv_heads: int,
+    position_ids: torch.Tensor,
+    flex_block_mask_slice=None,
+):
+    bsz, q_len, d_model = x.shape
+    head_dim = d_model // n_heads
+    qkv_out = w.qkv(x)  # shape: (bsz, q_len, (n_heads + 2*n_kv_heads)*head_dim)
+    q_dim = n_heads * head_dim
+    kv_dim = n_kv_heads * head_dim
+    q, k, v = qkv_out.split([q_dim, kv_dim, kv_dim], dim=-1)
+    q = q.view(bsz, q_len, n_heads, head_dim).transpose(1, 2)
+    k = k.view(bsz, q_len, n_kv_heads, head_dim).transpose(1, 2)
+    v = v.view(bsz, q_len, n_kv_heads, head_dim).transpose(1, 2)
+    if hasattr(w, "tau") and w.tau is not None:
+        tok_feat = F.gelu(qkv_out)
+        tok_q = torch.tanh(torch.matmul(tok_feat, w.tau["wq"].t())).permute(0, 2, 1)
+        tok_v = torch.tanh(torch.matmul(tok_feat, w.tau["wv"].t())).permute(0, 2, 1)
+        pos = position_ids.to(q.dtype) + 1
+        tau_pos = 1 + (
+            torch.sigmoid(w.tau["alpha"][:, None] * pos.log()) - 0.5
+        )  # (H,S)
+        tau_q = (tok_q + tau_pos[None]).unsqueeze(-1)  # (B,H,S,1)
+        tau_v = (tok_v + tau_pos[None]).unsqueeze(-1)
+        q = q * tau_q
+        v = v * tau_v
+    q = apply_rotary_emb(q, freqs_cis, position_ids, n_heads)
+    k = apply_rotary_emb(k, freqs_cis, position_ids, n_kv_heads)
+    if kv_cache is not None:
+        k, v = kv_cache.update(position_ids, k, v)
+    if flex_block_mask_slice is not None:
+        torch._assert(n_heads == n_kv_heads, "gqa not supported yet")
+        out = flex_attention(q, k, v, block_mask=flex_block_mask_slice)
+    else:
+        out = F.scaled_dot_product_attention(
+            q, k, v, attn_mask=attn_mask, enable_gqa=n_heads != n_kv_heads
+        )
+    out = out.transpose(1, 2).reshape(bsz, q_len, d_model)
+    return w.proj(out)
+def text_decoder(
+    x: torch.Tensor,
+    w: nn.Module,
+    attn_mask: torch.Tensor,
+    position_ids: torch.Tensor,
+    config: TextConfig,
+    lora: Optional[object] = None,
+    flex_block_mask_slice=None,
+):
+    for i, block in enumerate(w.blocks):
+        layer_lora = select_layer_lora(
+            lora, i, is_moe=config.moe is not None and i >= config.moe.start_layer
+        )
+        l_in = layer_norm(x, block.ln)
+        l_attn = attn(
+            l_in,
+            block.attn,
+            freqs_cis=w.freqs_cis,
+            kv_cache=block.kv_cache,
+            attn_mask=attn_mask,
+            n_heads=config.n_heads,
+            n_kv_heads=config.n_kv_heads,
+            position_ids=position_ids,
+            flex_block_mask_slice=flex_block_mask_slice,
+        )
+        if config.moe is not None and i >= config.moe.start_layer:
+            l_mlp = moe_mlp(
+                l_in, block.mlp, config.moe.experts_per_token, lora=layer_lora
+            )
+        else:
+            l_mlp = mlp(l_in, block.mlp, lora=layer_lora)
+        x = x + l_attn + l_mlp
+    return x
+def lm_head(
+    hidden_BTC: torch.Tensor, w: nn.Module, indices: Optional[torch.Tensor] = None
+):
+    hidden_BC = hidden_BTC[:, -1, :]
+    hidden_BC = layer_norm(hidden_BC, w.post_ln)
+    if indices is not None:
+        # Only compute logits for specified token indices
+        logits = hidden_BC @ w.lm_head.weight[indices].T + w.lm_head.bias[indices]
+    else:
+        logits = w.lm_head(hidden_BC)
+    return logits
+def build_dense_mlp(d_model, d_ffn, dtype, linear_cls):
+    return nn.ModuleDict(
+        {
+            "fc1": linear_cls(d_model, d_ffn, dtype=dtype),
+            "fc2": linear_cls(d_ffn, d_model, dtype=dtype),
+        }
+    )
+def build_moe_mlp(d_model, d_ffn, n_experts, dtype):
+    # For GeGLU, fc1 needs to output 2 * d_ffn (for gating)
+    mlp = nn.ModuleDict(
+        {
+            "router": nn.Linear(d_model, n_experts, dtype=dtype),
+            "fc1": nn.ParameterDict(
+                {
+                    "weight": nn.Parameter(
+                        torch.empty(n_experts, 2 * d_ffn, d_model, dtype=dtype)
+                    )
+                }
+            ),
+            "fc2": nn.ParameterDict(
+                {
+                    "weight": nn.Parameter(
+                        torch.empty(n_experts, d_model, d_ffn, dtype=dtype)
+                    )
+                }
+            ),
+        }
+    )
+    return mlp
+def build_text_model(config: TextConfig, dtype: torch.dtype) -> nn.Module:
+    qkv_dim = int(config.dim * (1 + 2 * config.n_kv_heads / config.n_heads))
+    linear_cls = QuantizedLinear if config.group_size is not None else nn.Linear
+    text = nn.ModuleDict(
+        {
+            "blocks": nn.ModuleList(
+                [
+                    nn.ModuleDict(
+                        {
+                            "ln": nn.LayerNorm(config.dim, dtype=dtype),
+                            "attn": nn.ModuleDict(
+                                {
+                                    "qkv": linear_cls(config.dim, qkv_dim, dtype=dtype),
+                                    "proj": linear_cls(
+                                        config.dim, config.dim, dtype=dtype
+                                    ),
+                                    "tau": nn.ParameterDict(
+                                        {
+                                            "wq": nn.Parameter(
+                                                torch.empty(
+                                                    config.n_heads, qkv_dim, dtype=dtype
+                                                )
+                                            ),
+                                            "wv": nn.Parameter(
+                                                torch.empty(
+                                                    config.n_heads, qkv_dim, dtype=dtype
+                                                )
+                                            ),
+                                            "alpha": nn.Parameter(
+                                                torch.empty(config.n_heads, dtype=dtype)
+                                            ),
+                                        }
+                                    ),
+                                }
+                            ),
+                            "mlp": (
+                                build_moe_mlp(
+                                    config.dim,
+                                    config.moe.expert_inner_dim,
+                                    config.moe.num_experts,
+                                    dtype,
+                                )
+                                if config.moe is not None
+                                and layer_idx >= config.moe.start_layer
+                                else build_dense_mlp(
+                                    config.dim, config.ff_dim, dtype, linear_cls
+                                )
+                            ),
+                        }
+                    )
+                    for layer_idx in range(config.n_layers)
+                ]
+            ),
+            "post_ln": nn.LayerNorm(config.dim, dtype=dtype),
+            "lm_head": nn.Linear(config.dim, config.vocab_size, dtype=dtype),
+        }
+    )
+    text.wte = nn.Parameter(torch.empty(config.vocab_size, config.dim, dtype=dtype))
+    text.register_buffer(
+        "freqs_cis",
+        precompute_freqs_cis(config.dim // (2 * config.n_heads), config.max_context),
+        persistent=False,
+    )
+    return text
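
The fused `qkv` projection width in `build_text_model` and the split sizes used in `attn` have to agree. The sketch below checks that arithmetic for one set of values. The numbers (`dim=2048`, `n_heads=32`, `n_kv_heads=8`) are hypothetical and are not read from this repository's config.json; `projection_sizes` is likewise an illustrative helper, not part of the source.

```python
def projection_sizes(dim: int, n_heads: int, n_kv_heads: int):
    """Return (q_dim, kv_dim, qkv_dim, rope_dim) using the formulas from text.py."""
    head_dim = dim // n_heads
    q_dim = n_heads * head_dim            # query slice of the fused projection
    kv_dim = n_kv_heads * head_dim        # key slice and value slice (each)
    qkv_dim = int(dim * (1 + 2 * n_kv_heads / n_heads))
    # The fused output must split cleanly into q, k and v.
    assert qkv_dim == q_dim + 2 * kv_dim
    # Dimension passed to precompute_freqs_cis in build_text_model.
    rope_dim = dim // (2 * n_heads)
    return q_dim, kv_dim, qkv_dim, rope_dim


# Hypothetical configuration values, chosen only for illustration.
sizes = projection_sizes(dim=2048, n_heads=32, n_kv_heads=8)
```

With these assumed values only half of each 64-wide head is rotated, which lines up with the default `rot_dim=32` in `apply_rotary_emb`.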

utils.py ADDED

	@@ -0,0 +1,41 @@

+import numpy as np
+def remove_outlier_points(points_tuples, k_nearest=2, threshold=2.0):
+    """
+    Robust outlier detection for list of (x,y) tuples.
+    Only requires numpy.
+    Args:
+        points_tuples: list of (x,y) tuples
+        k_nearest: number of neighbors to consider
+        threshold: multiplier for median distance
+    Returns:
+        list: filtered list of (x,y) tuples with outliers removed
+    """
+    points = np.array(points_tuples)
+    n_points = len(points)
+    # Calculate pairwise distances manually
+    dist_matrix = np.zeros((n_points, n_points))
+    for i in range(n_points):
+        for j in range(i + 1, n_points):
+            # Euclidean distance between points i and j
+            dist = np.sqrt(np.sum((points[i] - points[j]) ** 2))
+            dist_matrix[i, j] = dist
+            dist_matrix[j, i] = dist
+    # Take the k smallest distances per row; this includes the zero self-distance on the diagonal
+    k = min(k_nearest, n_points - 1)
+    neighbor_distances = np.partition(dist_matrix, k, axis=1)[:, :k]
+    avg_neighbor_dist = np.mean(neighbor_distances, axis=1)
+    # Calculate mask using median distance
+    median_dist = np.median(avg_neighbor_dist)
+    mask = avg_neighbor_dist <= threshold * median_dist
+    # Return the filtered tuples (the mask itself is not returned)
+    filtered_tuples = [t for t, m in zip(points_tuples, mask) if m]
+    return filtered_tuples
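
Note on remove_outlier_points: as written, each row of dist_matrix contains the point's zero distance to itself, so the k smallest entries average over that zero plus the nearest neighbors. The function also does not guard very small inputs. The sketch below is one possible variant, not the code in this commit: it uses only the standard library, skips the self-distance, keeps everything when there are fewer than three points, and returns the mask alongside the kept points. The name remove_outliers_sketch and the three-point cutoff are assumptions made for illustration.

```python
import math
from statistics import median


def remove_outliers_sketch(points, k_nearest=2, threshold=2.0):
    """Hypothetical variant: returns (kept_points, keep_mask)."""
    n = len(points)
    if n < 3:
        # Too few points to estimate a typical spacing; keep everything.
        return list(points), [True] * n

    # Use at least one and at most n - 1 neighbors.
    k = max(1, min(k_nearest, n - 1))

    avg_dists = []
    for i, p in enumerate(points):
        # Distances to every *other* point, so the self-distance is excluded.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        avg_dists.append(sum(dists[:k]) / k)

    # A point is kept if its mean neighbor distance is within
    # `threshold` times the median of those means.
    cutoff = threshold * median(avg_dists)
    mask = [d <= cutoff for d in avg_dists]
    kept = [p for p, keep in zip(points, mask) if keep]
    return kept, mask


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
    kept, mask = remove_outliers_sketch(pts)
    print(kept, mask)
```

Because the cutoff is relative to the median, including or excluding the self-distance mostly rescales the averages; the practical difference shows up for small k, where the original effectively averages over k - 1 real neighbors.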

vision.py ADDED

	@@ -0,0 +1,147 @@

+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import numpy as np
+from typing import Union, Tuple
+from PIL import Image
+from .layers import attn, layer_norm, mlp
+from .image_crops import overlap_crop_image
+from .config import VisionConfig
+if torch.backends.mps.is_available():
+    # Non-divisible input sizes are not implemented on MPS device yet.
+    # https://github.com/pytorch/pytorch/issues/96056
+    def adaptive_avg_pool2d(input, output_size):
+        return F.adaptive_avg_pool2d(input.to("cpu"), output_size).to("mps")
+else:
+    adaptive_avg_pool2d = F.adaptive_avg_pool2d
+DeviceLike = Union[str, torch.device, int]
+def prepare_crops(
+    image: Image.Image, config: VisionConfig, device: DeviceLike
+) -> Tuple[torch.Tensor, Tuple[int, int]]:
+    np_image = np.array(image.convert("RGB"))
+    overlap_crops = overlap_crop_image(
+        np_image, max_crops=config.max_crops, overlap_margin=config.overlap_margin
+    )
+    all_crops = overlap_crops["crops"]
+    all_crops = np.transpose(all_crops, (0, 3, 1, 2))
+    all_crops = (
+        torch.from_numpy(all_crops)
+        .to(device=device, dtype=torch.bfloat16)
+        .div_(255.0)
+        .sub_(0.5)
+        .div_(0.5)
+    )
+    return all_crops, overlap_crops["tiling"]
+def create_patches(x, patch_size):
+    # Original shape: [B, C, H, W]
+    B, C, H, W = x.shape
+    P1 = P2 = patch_size
+    # Step 1: Split H and W dimensions into patches
+    # [B, C, H/P1, P1, W/P2, P2]
+    x = x.reshape(B, C, H // P1, P1, W // P2, P2)
+    # Step 2: Rearrange dimensions to match target shape
+    # [B, H/P1, W/P2, C, P1, P2]
+    x = x.permute(0, 2, 4, 1, 3, 5)
+    # Step 3: Combine dimensions to get final shape
+    # [B, (H/P1)*(W/P2), C*P1*P2]
+    x = x.reshape(B, (H // P1) * (W // P2), C * P1 * P2)
+    return x
+def vision_encoder(input_BCHW: torch.Tensor, w: nn.Module, config: VisionConfig):
+    x = create_patches(input_BCHW, config.enc_patch_size)
+    x = w.patch_emb(x)
+    x = x + w.pos_emb
+    for block in w.blocks:
+        x = x + attn(layer_norm(x, block.ln1), block.attn, n_heads=config.enc_n_heads)
+        x = x + mlp(layer_norm(x, block.ln2), block.mlp)
+    x = layer_norm(x, w.post_ln)
+    return x
+def vision_projection(
+    global_features: torch.Tensor,
+    reconstructed: torch.Tensor,
+    w: nn.Module,
+    config: VisionConfig,
+):
+    reconstructed = reconstructed.permute(2, 0, 1)
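+    # NOTE: the pooled grid side comes from config.enc_n_layers and the view below
+    # hard-codes 729 rows, so this only works when enc_n_layers ** 2 == 729 (27 x 27).
+    reconstructed = adaptive_avg_pool2d(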
+        reconstructed, output_size=(config.enc_n_layers, config.enc_n_layers)
+    )
+    reconstructed = reconstructed.permute(1, 2, 0).view(729, config.enc_dim)
+    final_features = torch.cat([global_features, reconstructed], dim=-1)
+    return mlp(final_features, w.proj_mlp)
+def build_vision_model(config: VisionConfig, dtype: torch.dtype):
+    patch_dim = config.enc_patch_size * config.enc_patch_size * config.in_channels
+    grid_size = config.crop_size // config.enc_patch_size
+    num_patches = grid_size * grid_size
+    vision = nn.ModuleDict(
+        {
+            "patch_emb": nn.Linear(patch_dim, config.enc_dim, dtype=dtype),
+            "blocks": nn.ModuleList(
+                [
+                    nn.ModuleDict(
+                        {
+                            "ln1": nn.LayerNorm(config.enc_dim, dtype=dtype),
+                            "attn": nn.ModuleDict(
+                                {
+                                    "qkv": nn.Linear(
+                                        config.enc_dim, 3 * config.enc_dim, dtype=dtype
+                                    ),
+                                    "proj": nn.Linear(
+                                        config.enc_dim, config.enc_dim, dtype=dtype
+                                    ),
+                                }
+                            ),
+                            "ln2": nn.LayerNorm(config.enc_dim, dtype=dtype),
+                            "mlp": nn.ModuleDict(
+                                {
+                                    "fc1": nn.Linear(
+                                        config.enc_dim, config.enc_ff_dim, dtype=dtype
+                                    ),
+                                    "fc2": nn.Linear(
+                                        config.enc_ff_dim, config.enc_dim, dtype=dtype
+                                    ),
+                                }
+                            ),
+                        }
+                    )
+                    for _ in range(config.enc_n_layers)
+                ]
+            ),
+            "post_ln": nn.LayerNorm(config.enc_dim, dtype=dtype),
+            "proj_mlp": nn.ModuleDict(
+                {
+                    "fc1": nn.Linear(
+                        config.enc_dim * 2, config.proj_inner_dim, dtype=dtype
+                    ),
+                    "fc2": nn.Linear(
+                        config.proj_inner_dim, config.proj_out_dim, dtype=dtype
+                    ),
+                }
+            ),
+        }
+    )
+    vision.pos_emb = nn.Parameter(
+        torch.zeros(1, num_patches, config.enc_dim, dtype=dtype)
+    )
+    return vision
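
Note on prepare_crops and create_patches: the first rescales 8-bit pixel values with (v / 255 - 0.5) / 0.5, and the second flattens each P x P patch in channel, row, column order. The sketch below mirrors both steps on plain nested lists so the ordering is easy to check by hand. It is an illustration under simplifying assumptions (a single image, no batch dimension, no torch), and the function names are hypothetical.

```python
def normalize_pixel(value):
    """Map an 8-bit value in [0, 255] to [-1.0, 1.0], as in prepare_crops."""
    return (value / 255.0 - 0.5) / 0.5


def create_patches_sketch(image, patch_size):
    """Split a [C][H][W] nested list into a [(H/P)*(W/P)][C*P*P] nested list.

    Patches are ordered row-major over the patch grid; within a patch the
    values are ordered by channel, then patch row, then patch column, which
    is the order produced by the reshape/permute/reshape in create_patches.
    """
    channels = len(image)
    height = len(image[0])
    width = len(image[0][0])
    p = patch_size
    if height % p or width % p:
        raise ValueError("height and width must be divisible by patch_size")

    patches = []
    for i in range(height // p):        # patch row
        for j in range(width // p):     # patch column
            patches.append(
                [
                    image[c][i * p + p1][j * p + p2]
                    for c in range(channels)
                    for p1 in range(p)
                    for p2 in range(p)
                ]
            )
    return patches


if __name__ == "__main__":
    # One channel, 4 x 4 image holding the values 0..15 in row-major order.
    image = [[[r * 4 + c for c in range(4)] for r in range(4)]]
    for patch in create_patches_sketch(image, 2):
        print(patch)
    print(normalize_pixel(0), normalize_pixel(255))
```

The same arithmetic explains the sizes in build_vision_model: grid_size is crop_size // enc_patch_size and num_patches is its square. For example, if crop_size were 378 and enc_patch_size 14 (values assumed here, since config.py is not shown in this section), the grid would be 27 x 27 = 729 patches, which would match the 729 hard-coded in vision_projection.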