addpty TencentOpen committed on
Commit
44a5afa
·
0 Parent(s):

Duplicate from tencent/Youtu-VL-4B-Instruct

Browse files

Co-authored-by: TencentOpen <TencentOpen@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,37 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
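A rough way to check which files these patterns route through Git LFS is sketched below. This is a minimal sketch, assuming basename-style matching with Python's `fnmatch`; real gitattributes matching has more rules (the path pattern `saved_model/**/*` is deliberately left out), and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the basename patterns listed in the .gitattributes hunk.
LFS_PATTERNS = ["*.bin", "*.safetensors", "*.png", "*.tar.*", "*tfevents*", "tokenizer.json"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: match the file's basename against each pattern."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


if __name__ == "__main__":
    for candidate in ["model.safetensors", "assets/logo.png", "tokenizer.json", "config.json"]:
        print(candidate, is_lfs_tracked(candidate))
```

Note that `tokenizer.json` is tracked by name, while other JSON files such as `config.json` stay as ordinary text files.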
LICENSE.txt ADDED
@@ -0,0 +1,18 @@
1
+ Tencent is pleased to support the community by making Youtu-VL available.
2
+
3
+ Copyright (C) 2026 Tencent. All rights reserved.
4
+
5
+ Youtu-VL is licensed under the License Term of Youtu-VL.
6
+
7
+ For the avoidance of doubt, Youtu-VL refers solely to inference code and weights made publicly available by Tencent in accordance with the License Term of Youtu-VL.
8
+
9
+ Terms of the License Term of Youtu-VL:
10
+ --------------------------------------------------------------------
11
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
12
+
13
+ 1. Youtu-VL IS NOT INTENDED FOR USE WITHIN THE EUROPEAN UNION.
14
+
15
+ 2. The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
16
+
17
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
18
+
README.md ADDED
@@ -0,0 +1,185 @@
1
+ ---
2
+ license: other
3
+ license_name: youtu-vl
4
+ license_link: https://huggingface.co/tencent/Youtu-VL-4B-Instruct/blob/main/LICENSE.txt
5
+ pipeline_tag: image-text-to-text
6
+ extra_gated_eu_disallowed: true
7
+ library_name: transformers
8
+ ---
9
+
10
+ <div align="center">
11
+
12
+ # <img src="assets/youtu-vl-logo.png" alt="Youtu-VL Logo" height="100px">
13
+
14
+ [🏠 Project Page](https://youtu-tip.com/#llm) • [📃 License](LICENSE.txt) • [💻 Code](https://github.com/TencentCloudADP/youtu-vl) • [📑 Technical Report](https://arxiv.org/abs/2601.19798) • [📊 Benchmarks](#benchmarks) • [🚀 Getting Started](#quickstart)
15
+ </div>
16
+
17
+ ## 🎯 Introduction
18
+
19
+ **Youtu-VL** is a lightweight yet robust Vision-Language Model (VLM) built on the Youtu-LLM with 4B parameters. It pioneers Vision-Language Unified Autoregressive Supervision (VLUAS), which markedly strengthens visual perception and multimodal understanding. This enables a standard VLM to perform vision-centric tasks without task-specific additions. Across benchmarks, Youtu-VL stands out for its versatility, achieving competitive results on both vision-centric and general multimodal tasks.
20
+
21
+
22
+ ## ✨ Key Features
23
+
24
+ - **Comprehensive Vision-Centric Capabilities**: The model demonstrates strong, broad proficiency across classic vision-centric tasks, delivering competitive performance in visual grounding, image classification, object detection, referring segmentation, semantic segmentation, depth estimation, object counting, and human pose estimation.
25
+
26
+ - **Promising Performance with High Efficiency**: Despite its compact 4B-parameter architecture, the model achieves competitive results across a wide range of general multimodal tasks, including general visual question answering (VQA), multimodal reasoning and mathematics, optical character recognition (OCR), multi-image and real-world understanding, hallucination evaluation, and GUI agent tasks.
27
+
+ <p align="center">
+ <img src="assets/youtu-vl-overview.png" width="90%"/>
+ </p>
31
+
32
+ ## 🤗 Model Download
33
+
34
+ | Model Name | Description | Download |
35
+ | ----------- | ----------- | ----------- |
36
+ | Youtu-VL-4B-Instruct | Visual language model of Youtu-LLM | 🤗 [Model](https://huggingface.co/tencent/Youtu-VL-4B-Instruct)|
37
+ | Youtu-VL-4B-Instruct-GGUF | Visual language model of Youtu-LLM, in GGUF format | 🤗 [Model](https://huggingface.co/tencent/Youtu-VL-4B-Instruct-GGUF)|
38
+
39
+ ## 🧠 Model Architecture Highlights
40
+
41
+ - **Vision–Language Unified Autoregressive Supervision (VLUAS)**: Youtu-VL is built on the VLUAS paradigm to mitigate the text-dominant optimization bias in conventional VLMs, where visual signals are treated as passive conditions and fine-grained details are often dropped. Rather than using vision features only as inputs, Youtu-VL expands the text lexicon into a unified multimodal vocabulary through a learned visual codebook, turning visual signals into autoregressive supervision targets. Jointly reconstructing visual tokens and text explicitly preserves dense visual information while strengthening multimodal semantic understanding.
42
+
43
+ - **Vision-Centric Prediction with a Standard Architecture (no task-specific modules)**: Youtu-VL treats image and text tokens with equivalent autoregressive status, empowering it to perform vision-centric tasks for both dense vision prediction (e.g., segmentation, depth) and text-based prediction (e.g., grounding, detection) within a standard VLM architecture, eliminating the need for task-specific additions. This design yields a versatile general-purpose VLM, allowing a single model to flexibly accommodate a wide range of vision-centric and vision-language requirements.
44
+
+ <p align="center">
+ <img src="assets/architecture.png" width="90%"/>
+ </p>
48
+
49
+ <a id="benchmarks"></a>
50
+ ## 🏆 Model Performance
51
+
52
+ ### Vision-Centric Tasks
53
+
+ <p align="center">
+ <img src="assets/vision-centric-performance.png" width="90%"/>
+ </p>
57
+
58
+
59
+ ### General Multimodal Tasks
60
+
61
+
+ <p align="center">
+ <img src="assets/general-multimodal-performance.png" width="90%"/>
+ </p>
65
+
66
+
67
+ <a id="quickstart"></a>
68
+ ## 🚀 Quickstart
69
+
70
+ ### Using Transformers to Chat
71
+
72
+ Ensure your Python environment has the `transformers` library installed and that the version meets the requirements.
73
+
74
+ ```bash
75
+ pip install "transformers>=4.56.0,<=4.57.1" torch accelerate pillow torchvision git+https://github.com/lucasb-eyer/pydensecrf.git opencv-python-headless
76
+ ```
77
+
78
+ The snippet below shows how to interact with the chat model using `transformers`:
79
+
+ ```python
+ from transformers import AutoProcessor, AutoModelForCausalLM
+
+ # trust_remote_code=True is required because the model code ships with the repository.
+ # attn_implementation="flash_attention_2" additionally requires the flash-attn package.
+ model = AutoModelForCausalLM.from_pretrained(
+     "tencent/Youtu-VL-4B-Instruct", attn_implementation="flash_attention_2", torch_dtype="auto", device_map="cuda", trust_remote_code=True
+ ).eval()
+
+ processor = AutoProcessor.from_pretrained(
+     "tencent/Youtu-VL-4B-Instruct", use_fast=True, trust_remote_code=True
+ )
+
+ # One user turn containing an image followed by a text instruction.
+ img_path = "./assets/logo.png"
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": img_path},
+             {"type": "text", "text": "Describe the image"},
+         ],
+     }
+ ]
+
+ # Render the chat template and tokenize in a single step.
+ inputs = processor.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_dict=True,
+     return_tensors="pt"
+ ).to(model.device)
+
+ generated_ids = model.generate(
+     **inputs,
+     temperature=0.1,
+     top_p=0.001,
+     repetition_penalty=1.05,
+     do_sample=True,
+     max_new_tokens=32768,
+     img_input=img_path,
+ )
+
+ # Drop the prompt tokens so that only the newly generated tokens are decoded.
+ generated_ids_trimmed = [
+     out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ outputs = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ generated_text = outputs[0]
+ print(f"Youtu-VL output: {generated_text}")
+ ```
129
+
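The sampling settings in the snippet above (`temperature=0.1`, `top_p=0.001`) make generation behave almost like greedy decoding even though `do_sample=True`. The sketch below illustrates why with a hand-written nucleus filter. It is not the `transformers` implementation, and the function name `nucleus_candidates` and the toy logits are illustrative assumptions.

```python
import math


def nucleus_candidates(logits, temperature, top_p):
    """Return the token indices that survive temperature scaling and top-p filtering."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    # Softmax with the maximum subtracted for numerical stability.
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda index: probs[index], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


if __name__ == "__main__":
    toy_logits = [2.0, 1.5, 0.3]
    print(nucleus_candidates(toy_logits, temperature=0.1, top_p=0.001))
    print(nucleus_candidates(toy_logits, temperature=1.0, top_p=0.95))
```

With the Quickstart settings only the single most likely token remains in the candidate set, so sampling has nothing left to choose between. Raising both values widens the candidate set.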
130
+ ### Demo for VL and CV tasks
131
+
132
+ A simple demo for quick start, including VL and CV tasks: [jupyter notebook](https://github.com/TencentCloudADP/youtu-vl/blob/main/demo/demo.ipynb)
133
+
134
+ The core of this demo is the three lines below:
135
+
136
+ ```python
137
+ model_path = "tencent/Youtu-VL-4B-Instruct"
138
+ youtu_vl = YoutuVL(model_path)
139
+ response = youtu_vl(prompt, img_path, seg_mode=seg_mode)
140
+ ```
141
+
142
+ ### Qualitative Results
143
+ * **Task: Grounding**
144
+ > **Prompt:** Please provide the bounding box coordinate of the region this sentence describes: a black and white cat sitting on the edge of the bathtub
145
+ >
146
+ > <img src="https://cdn-uploads.huggingface.co/production/uploads/656312995475849b82c38bc4/349v2vYasfz4GtF_T_D09.png" width="300px">
147
+
148
+ * **Task: Object Detection**
149
+ > **Prompt:** Detect all objects in the provided image.
150
+ >
151
+ > <img src="https://cdn-uploads.huggingface.co/production/uploads/656312995475849b82c38bc4/rUJ6PzIjGJWwK4e9owPlY.png" width="300px">
152
+
153
+ * **Task: Referring Segmentation**
154
+ > **Prompt:** Can you segment "hotdog on left" in this image?
155
+ >
156
+ > <img src="https://cdn-uploads.huggingface.co/production/uploads/656312995475849b82c38bc4/K-5UG6HSLb28UFGx2pdPX.png" width="300px">
157
+
158
+ For more examples, please refer to the paper and the Jupyter notebook.
159
+
160
+
161
+ ## 🎉 Citation
162
+
163
+ If you find our work useful in your research, please consider citing our paper:
164
+
165
+ ```
166
+ @article{youtu-vl,
167
+ title={Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision},
168
+ author={Tencent Youtu Lab},
169
+ year={2026},
170
+ eprint={2601.19798},
171
+ archivePrefix={arXiv},
172
+ primaryClass={cs.CV},
173
+ url={https://arxiv.org/abs/2601.19798},
174
+ }
175
+
176
+ @article{youtu-llm,
177
+ title={Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models},
178
+ author={Tencent Youtu Lab},
179
+ year={2025},
180
+ eprint={2512.24618},
181
+ archivePrefix={arXiv},
182
+ primaryClass={cs.CL},
183
+ url={https://arxiv.org/abs/2512.24618},
184
+ }
185
+ ```
__init__.py ADDED
@@ -0,0 +1,31 @@
+ # Copyright 2025 The Youtu Team and The HuggingFace Inc. team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ from typing import TYPE_CHECKING
+ from transformers.utils import _LazyModule
+ from transformers.utils.import_utils import define_import_structure
+
+ if TYPE_CHECKING:
+     from .configuration_youtu_vl import *
+     from .modeling_youtu_vl import *
+     from .processing_youtu_vl import *
+     from .configuration_siglip2 import *
+     from .image_processing_siglip2_fast import *
+     from .modeling_siglip2 import *
+ else:
+     import sys
+
+     _file = globals()["__file__"]
+     sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
+
+
assets/architecture.png ADDED

Git LFS Details

  • SHA256: 19baf08183356a10afb306e4b7cc99ed4e4a26a865b72203d9fd441223e61273
  • Pointer size: 132 Bytes
  • Size of remote file: 1.07 MB
assets/general-multimodal-performance.png ADDED

Git LFS Details

  • SHA256: ec7f34def0ceaba9040ab0cdee0d66746c8ce0beffbfcf566c672d790ae421ac
  • Pointer size: 131 Bytes
  • Size of remote file: 408 kB
assets/logo.png ADDED

Git LFS Details

  • SHA256: dbcd8caf64935e9b33fc779e36eea69cfbd2e5a5a521e5fcefab9b8b8cc1c7d2
  • Pointer size: 131 Bytes
  • Size of remote file: 614 kB
assets/vision-centric-performance.png ADDED

Git LFS Details

  • SHA256: 498d7051881549e578946e909c09d5289157451d6d1408e58f3da3d841779d21
  • Pointer size: 131 Bytes
  • Size of remote file: 534 kB
assets/youtu-vl-logo.png ADDED

Git LFS Details

  • SHA256: 09433e54caa12a5173c78f79dac9db1085488a48f09f18aba6dbb43e5e8e57b0
  • Pointer size: 130 Bytes
  • Size of remote file: 95.4 kB
assets/youtu-vl-overview.png ADDED

Git LFS Details

  • SHA256: d5993219ccd330f00b0797db55f84e5e4d6cd287cc11b3d575d2f11212a3f449
  • Pointer size: 132 Bytes
  • Size of remote file: 2.51 MB
chat_template.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|begin_of_text|>system\nYou are a helpful assistant.<|end_of_text|>\n{% endif %}<|begin_of_text|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|end_of_text|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% endif %}{% endfor %}{% for content in message['content'] %}{% if content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% endif %}{% endfor %}{% for content in message['content'] %}{% if 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|end_of_text|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|begin_of_text|>assistant\n{% endif %}"
3
+ }
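To make the prompt layout produced by this template easier to see, the sketch below mirrors its logic in plain Python. It is a hand-written approximation for illustration, not the renderer that `transformers` uses (that goes through Jinja), and the function name `render_prompt` is hypothetical.

```python
DEFAULT_SYSTEM = "You are a helpful assistant."


def render_prompt(messages, add_generation_prompt=True, add_vision_id=False):
    """Build the prompt string the way the chat template above appears to."""
    parts = []
    image_count = 0
    video_count = 0
    for index, message in enumerate(messages):
        # A default system turn is prepended when the conversation has none.
        if index == 0 and message["role"] != "system":
            parts.append(f"<|begin_of_text|>system\n{DEFAULT_SYSTEM}<|end_of_text|>\n")
        parts.append(f"<|begin_of_text|>{message['role']}\n")
        content = message["content"]
        if isinstance(content, str):
            parts.append(f"{content}<|end_of_text|>\n")
            continue
        # Images come first, then videos, then all text parts.
        for item in content:
            if item.get("type") == "image" or "image" in item or "image_url" in item:
                image_count += 1
                if add_vision_id:
                    parts.append(f"Picture {image_count}: ")
                parts.append("<|vision_start|><|image_pad|><|vision_end|>")
        for item in content:
            if item.get("type") == "video" or "video" in item:
                video_count += 1
                if add_vision_id:
                    parts.append(f"Video {video_count}: ")
                parts.append("<|vision_start|><|video_pad|><|vision_end|>")
        for item in content:
            if "text" in item:
                parts.append(item["text"])
        parts.append("<|end_of_text|>\n")
    if add_generation_prompt:
        parts.append("<|begin_of_text|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    demo = [{"role": "user", "content": [
        {"type": "image", "image": "./assets/logo.png"},
        {"type": "text", "text": "Describe the image"},
    ]}]
    print(render_prompt(demo))
```

One consequence of the three separate loops is that image placeholders always precede the text of a turn, regardless of the order in which the content items are listed.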
config.json ADDED
@@ -0,0 +1,90 @@
1
+ {
2
+ "architectures": [
3
+ "YoutuVLForConditionalGeneration"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_youtu_vl.YoutuVLConfig",
9
+ "AutoModelForCausalLM": "modeling_youtu_vl.YoutuVLForConditionalGeneration",
10
+ "AutoProcessor": "processing_youtu_vl.YoutuVLProcessor",
11
+ "AutoImageProcessor": "image_processing_siglip2_fast.Siglip2ImageProcessorFast"
12
+ },
13
+ "bos_token_id": 128000,
14
+ "embedding_initializer_range": 0.025,
15
+ "eos_token_id": 128001,
16
+ "head_dim": 64,
17
+ "hidden_act": "silu",
18
+ "hidden_size": 2560,
19
+ "image_token_id": 128264,
20
+ "initializer_range": 0.0125,
21
+ "intermediate_size": 9728,
22
+ "kv_lora_rank": 512,
23
+ "max_position_embeddings": 32768,
24
+ "mlp_bias": false,
25
+ "model_type": "youtu_vl",
26
+ "num_attention_heads": 32,
27
+ "num_hidden_layers": 40,
28
+ "num_key_value_heads": 32,
29
+ "pad_token_id": 128001,
30
+ "q_lora_rank": 1536,
31
+ "qk_head_dim": 192,
32
+ "qk_nope_head_dim": 128,
33
+ "qk_rope_head_dim": 64,
34
+ "rms_norm_eps": 1e-06,
35
+ "rope_interleave": true,
36
+ "rope_theta": 500000,
37
+ "tie_word_embeddings": true,
38
+ "torch_dtype": "bfloat16",
39
+ "transformers_version": "4.56.0",
40
+ "use_cache": false,
41
+ "v_head_dim": 128,
42
+ "video_token_id": 128265,
43
+ "vision_config": {
44
+ "attention_dropout": 0.0,
45
+ "hidden_act": "gelu_pytorch_tanh",
46
+ "hidden_size": 1152,
47
+ "intermediate_size": 4304,
48
+ "layer_norm_eps": 1e-06,
49
+ "model_type": "siglip2_vision_model",
50
+ "num_attention_heads": 16,
51
+ "num_channels": 3,
52
+ "num_hidden_layers": 27,
53
+ "num_patches": 4096,
54
+ "out_hidden_size": 2560,
55
+ "patch_size": 16,
56
+ "tokens_per_second": 2,
57
+ "torch_dtype": "bfloat16",
58
+ "vision_use_head": false,
59
+ "window_size": 256,
60
+ "fullatt_block_indexes": [
61
+ 7,
62
+ 15,
63
+ 23,
64
+ 26
65
+ ]
66
+ },
67
+ "vision_end_token_id": 128263,
68
+ "vision_start_token_id": 128262,
69
+ "custom_tokens": {
70
+ "<custom_1>": [282363],
71
+ "<|image_pad|>": [128264],
72
+ "<ins>": [283365],
73
+ "<ref>": [283371],
74
+ "</ref>": [283372],
75
+ "<mask>": [27, 16499, 29],
76
+ "</mask>": [713, 16499, 29],
77
+ "<depth>": [
78
+ [440, 29064, 24661],
79
+ [440, 29064, 12672]
80
+ ],
81
+ "comma": [11],
82
+ "digit_start": [15],
83
+ "<OTHERS>": [283375],
84
+ "<x_0>": [278267],
85
+ "<y_2047>": [282362],
86
+ "<mask_rle>": [7],
87
+ "</mask_rle>": [8]
88
+ },
89
+ "vocab_size": 283386
90
+ }
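Several values in this config are related by simple arithmetic, which the sketch below checks on a hand-copied subset of the fields. Two readings are assumptions rather than facts stated in the commit: that the ids from `<x_0>` to `<y_2047>` form one contiguous block of coordinate tokens, and that the attention caches a compressed latent of size `kv_lora_rank + qk_rope_head_dim` per token (the modeling file is not shown here). The helper name `summarize` is hypothetical.

```python
# Subset of config.json, copied by hand for illustration.
config = {
    "num_attention_heads": 32,
    "kv_lora_rank": 512,
    "qk_head_dim": 192,
    "qk_nope_head_dim": 128,
    "qk_rope_head_dim": 64,
    "v_head_dim": 128,
    "vision_config": {"num_patches": 4096, "patch_size": 16},
    "custom_tokens": {"<x_0>": [278267], "<y_2047>": [282362]},
}


def summarize(cfg):
    """Derive a few quantities implied by the config values."""
    vision = cfg["vision_config"]
    tokens = cfg["custom_tokens"]
    return {
        # Query/key head width is the sum of the non-rotary and rotary parts.
        "qk_head_dim": cfg["qk_nope_head_dim"] + cfg["qk_rope_head_dim"],
        # Pixel budget if every one of num_patches patches is fully used.
        "max_pixels": vision["num_patches"] * vision["patch_size"] ** 2,
        # Assumption: ids from <x_0> through <y_2047> are contiguous.
        "coordinate_tokens": tokens["<y_2047>"][0] - tokens["<x_0>"][0] + 1,
        # Per-token, per-layer cache width with full keys and values.
        "full_kv_width": cfg["num_attention_heads"] * (cfg["qk_head_dim"] + cfg["v_head_dim"]),
        # Assumption: a compressed latent plus the shared rotary key is cached instead.
        "latent_kv_width": cfg["kv_lora_rank"] + cfg["qk_rope_head_dim"],
    }


if __name__ == "__main__":
    for key, value in summarize(config).items():
        print(f"{key}: {value}")
```

Under the contiguity assumption the block holds 4096 ids, which would fit 2048 x-coordinates plus 2048 y-coordinates. The pixel budget of 1,048,576 corresponds to, for example, a 1024×1024 image.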
configuration_siglip2.py ADDED
@@ -0,0 +1,178 @@
1
+ from transformers.configuration_utils import PretrainedConfig
2
+ from transformers.utils import logging
3
+
4
+
5
+ logger = logging.get_logger(__name__)
6
+
7
+
8
+ class Siglip2TextConfig(PretrainedConfig):
9
+ r"""
10
+ Args:
11
+ vocab_size (`int`, *optional*, defaults to 32000):
12
+ Vocabulary size of the Siglip2 text model. Defines the number of different tokens that can be represented by
13
+ the `input_ids` passed when calling [`Siglip2Model`].
14
+ hidden_size (`int`, *optional*, defaults to 768):
15
+ Dimensionality of the encoder layers and the pooler layer.
16
+ intermediate_size (`int`, *optional*, defaults to 3072):
17
+ Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
18
+ num_hidden_layers (`int`, *optional*, defaults to 12):
19
+ Number of hidden layers in the Transformer encoder.
20
+ num_attention_heads (`int`, *optional*, defaults to 12):
21
+ Number of attention heads for each attention layer in the Transformer encoder.
22
+ max_position_embeddings (`int`, *optional*, defaults to 64):
23
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
24
+ just in case (e.g., 512 or 1024 or 2048).
25
+ hidden_act (`str` or `function`, *optional*, defaults to `"gelu_pytorch_tanh"`):
26
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
27
+ `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported.
28
+ layer_norm_eps (`float`, *optional*, defaults to 1e-06):
29
+ The epsilon used by the layer normalization layers.
30
+ attention_dropout (`float`, *optional*, defaults to 0.0):
31
+ The dropout ratio for the attention probabilities.
32
+ pad_token_id (`int`, *optional*, defaults to 1):
33
+ The id of the padding token in the vocabulary.
34
+ bos_token_id (`int`, *optional*, defaults to 49406):
35
+ The id of the beginning-of-sequence token in the vocabulary.
36
+ eos_token_id (`int`, *optional*, defaults to 49407):
37
+ The id of the end-of-sequence token in the vocabulary.
38
+ projection_size (`int`, *optional*, defaults to `hidden_size`):
39
+ The size of the projection head.
40
+
41
+ """
42
+
43
+ model_type = "siglip2_text_model"
44
+ base_config_key = "text_config"
45
+
46
+ def __init__(
47
+ self,
48
+ vocab_size=32000,
49
+ hidden_size=768,
50
+ intermediate_size=3072,
51
+ num_hidden_layers=12,
52
+ num_attention_heads=12,
53
+ max_position_embeddings=64,
54
+ hidden_act="gelu_pytorch_tanh",
55
+ layer_norm_eps=1e-6,
56
+ attention_dropout=0.0,
57
+ pad_token_id=1,
58
+ bos_token_id=49406,
59
+ eos_token_id=49407,
60
+ projection_size=None,
61
+ **kwargs,
62
+ ):
63
+ super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
64
+
65
+ self.vocab_size = vocab_size
66
+ self.hidden_size = hidden_size
67
+ self.intermediate_size = intermediate_size
68
+ self.num_hidden_layers = num_hidden_layers
69
+ self.num_attention_heads = num_attention_heads
70
+ self.max_position_embeddings = max_position_embeddings
71
+ self.layer_norm_eps = layer_norm_eps
72
+ self.hidden_act = hidden_act
73
+ self.attention_dropout = attention_dropout
74
+ self.projection_size = projection_size if projection_size is not None else hidden_size
75
+
76
+
77
+ class Siglip2VisionConfig(PretrainedConfig):
78
+ r"""
79
+ Args:
80
+ hidden_size (`int`, *optional*, defaults to 768):
81
+ Dimensionality of the encoder layers and the pooler layer.
82
+ intermediate_size (`int`, *optional*, defaults to 3072):
83
+ Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
84
+ num_hidden_layers (`int`, *optional*, defaults to 12):
85
+ Number of hidden layers in the Transformer encoder.
86
+ num_attention_heads (`int`, *optional*, defaults to 12):
87
+ Number of attention heads for each attention layer in the Transformer encoder.
88
+ num_channels (`int`, *optional*, defaults to 3):
89
+ Number of channels in the input images.
90
+ num_patches (`int`, *optional*, defaults to 256):
91
+ The number of patches in the image with the size of (`patch_size`, `patch_size`).
+ The image is resized to fill at most this number of patches while preserving
+ the aspect ratio. If the resulting number of patches is lower, the image is
+ padded along the "patch" dimension.
95
+ patch_size (`int`, *optional*, defaults to 16):
96
+ The size (resolution) of each patch.
97
+ hidden_act (`str` or `function`, *optional*, defaults to `"gelu_pytorch_tanh"`):
98
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
99
+ `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported.
100
+ layer_norm_eps (`float`, *optional*, defaults to 1e-06):
101
+ The epsilon used by the layer normalization layers.
102
+ attention_dropout (`float`, *optional*, defaults to 0.0):
103
+ The dropout ratio for the attention probabilities.
104
+
105
+ """
106
+
107
+ model_type = "siglip2_vision_model"
108
+ base_config_key = "vision_config"
109
+
110
+ def __init__(
111
+ self,
112
+ hidden_size=768,
113
+ out_hidden_size=2048,
114
+ intermediate_size=3072,
115
+ num_hidden_layers=12,
116
+ num_attention_heads=12,
117
+ num_channels=3,
118
+ num_patches=256,
119
+ patch_size=16,
120
+ hidden_act="gelu_pytorch_tanh",
121
+ layer_norm_eps=1e-6,
122
+ attention_dropout=0.0,
123
+ **kwargs,
124
+ ):
125
+ super().__init__(**kwargs)
126
+
127
+ self.hidden_size = hidden_size
128
+ self.out_hidden_size = out_hidden_size
129
+ self.intermediate_size = intermediate_size
130
+ self.num_hidden_layers = num_hidden_layers
131
+ self.num_attention_heads = num_attention_heads
132
+ self.num_channels = num_channels
133
+ self.patch_size = patch_size
134
+ self.attention_dropout = attention_dropout
135
+ self.layer_norm_eps = layer_norm_eps
136
+ self.hidden_act = hidden_act
137
+ self.num_patches = num_patches
138
+ self.in_features = -1
139
+
140
+
141
+ class Siglip2Config(PretrainedConfig):
142
+ r"""
143
+ Args:
144
+ text_config (`dict`, *optional*):
145
+ Dictionary of configuration options used to initialize [`Siglip2TextConfig`].
146
+ vision_config (`dict`, *optional*):
147
+ Dictionary of configuration options used to initialize [`Siglip2VisionConfig`].
148
+ kwargs (*optional*):
149
+ Dictionary of keyword arguments.
150
+
151
+ """
152
+
153
+ model_type = "siglip2"
154
+ sub_configs = {"text_config": Siglip2TextConfig, "vision_config": Siglip2VisionConfig}
155
+
156
+ def __init__(self, text_config=None, vision_config=None, **kwargs):
157
+ super().__init__(**kwargs)
158
+
159
+ if text_config is None:
160
+ text_config = {}
161
+ logger.info("`text_config` is `None`. Initializing the `Siglip2TextConfig` with default values.")
162
+
163
+ if vision_config is None:
164
+ vision_config = {}
165
+ logger.info("`vision_config` is `None`. Initializing the `Siglip2VisionConfig` with default values.")
166
+
167
+ self.text_config = Siglip2TextConfig(**text_config)
168
+ self.vision_config = Siglip2VisionConfig(**vision_config)
169
+
170
+ self.initializer_factor = 1.0
171
+
172
+ @classmethod
173
+ def from_text_vision_configs(cls, text_config: Siglip2TextConfig, vision_config: Siglip2VisionConfig, **kwargs):
174
+
175
+ return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)
176
+
177
+
178
+ __all__ = ["Siglip2Config", "Siglip2TextConfig", "Siglip2VisionConfig"]
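The `num_patches` docstring in `Siglip2VisionConfig` describes resizing an image so that it fits a patch budget while keeping its aspect ratio, then padding the remainder. The sketch below is one simple way to compute such a grid. It is an illustration under stated assumptions, not the rule used by the image processor in this repository (that file is not shown here), and the function name `fit_patch_grid` is hypothetical.

```python
import math


def fit_patch_grid(height, width, patch_size=16, max_num_patches=256):
    """Return (rows, cols, padding) for an aspect-preserving patch grid within the budget."""
    # Scale factor at which the image area would exactly equal the patch budget.
    scale = math.sqrt(max_num_patches * patch_size**2 / (height * width))
    # Rounding down keeps rows * cols within the budget.
    rows = max(1, min(math.floor(height * scale / patch_size), max_num_patches))
    # The extra clamp only matters for extremely elongated images.
    cols = max(1, min(math.floor(width * scale / patch_size), max_num_patches // rows))
    padding = max_num_patches - rows * cols
    return rows, cols, padding


if __name__ == "__main__":
    # Defaults from Siglip2VisionConfig: 256 patches of 16x16 pixels.
    print(fit_patch_grid(640, 480))
    # Values from config.json in this commit: 4096 patches of 16x16 pixels.
    print(fit_patch_grid(1024, 1024, patch_size=16, max_num_patches=4096))
```

For a 640×480 input with the default budget this yields an 18×13 grid with 22 padded patch slots, and a 1024×1024 input uses the 4096-patch budget exactly.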
configuration_youtu_vl.py ADDED
@@ -0,0 +1,224 @@
1
+ # coding=utf-8
2
+ # Copyright 2026 Tencent Youtu Lab and the HuggingFace Inc. team. All rights reserved.
3
+
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ from transformers.configuration_utils import PretrainedConfig
16
+ from transformers.modeling_rope_utils import rope_config_validation
17
+ from .configuration_siglip2 import Siglip2VisionConfig
18
+
19
+
20
+ YOUTU_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
21
+
22
+ class YoutuVLConfig(PretrainedConfig):
23
+ r"""
24
+ Args:
25
+ vocab_size (`int`, *optional*, defaults to 129280):
26
+ Vocabulary size of the Youtu-VL model. Defines the number of different tokens that can be represented by the
27
+ `input_ids` passed when calling [`YoutuModel`]
28
+ hidden_size (`int`, *optional*, defaults to 7168):
29
+ Dimension of the hidden representations.
30
+ intermediate_size (`int`, *optional*, defaults to 18432):
31
+ Dimension of the MLP representations.
32
+ num_hidden_layers (`int`, *optional*, defaults to 61):
33
+ Number of hidden layers in the Transformer decoder.
34
+ num_attention_heads (`int`, *optional*, defaults to 128):
35
+ Number of attention heads for each attention layer in the Transformer decoder.
36
+ num_key_value_heads (`int`, *optional*, defaults to 128):
37
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
38
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
39
+ `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
40
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
41
+ by meanpooling all the original heads within that group. For more details checkout [this
42
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
43
+ `num_attention_heads`.
44
+ n_shared_experts (`int`, *optional*, defaults to 1):
45
+ Number of shared experts.
46
+ n_routed_experts (`int`, *optional*, defaults to 256):
47
+ Number of routed experts.
48
+ routed_scaling_factor (`float`, *optional*, defaults to 2.5):
49
+ Scaling factor for routed experts.
50
+ kv_lora_rank (`int`, *optional*, defaults to 512):
51
+ Rank of the LoRA matrices for key and value projections.
52
+ q_lora_rank (`int`, *optional*, defaults to 1536):
53
+ Rank of the LoRA matrices for query projections.
54
+ qk_rope_head_dim (`int`, *optional*, defaults to 64):
55
+ Dimension of the query/key heads that use rotary position embeddings.
56
+ v_head_dim (`int`, *optional*, defaults to 128):
57
+ Dimension of the value heads.
58
+ qk_nope_head_dim (`int`, *optional*, defaults to 128):
59
+ Dimension of the query/key heads that don't use rotary position embeddings.
60
+ n_group (`int`, *optional*, defaults to 8):
61
+ Number of groups for routed experts.
62
+ topk_group (`int`, *optional*, defaults to 4):
63
+ Number of selected groups for each token.
64
+ num_experts_per_tok (`int`, *optional*, defaults to 8):
65
+ Number of selected experts, None means dense model.
66
+ norm_topk_prob (`bool`, *optional*, defaults to `True`):
67
+ Whether to normalize the weights of the routed experts.
68
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
69
+ The non-linear activation function (function or string) in the decoder.
70
+ max_position_embeddings (`int`, *optional*, defaults to 4096):
71
+ The maximum sequence length that this model might ever be used with.
72
+ initializer_range (`float`, *optional*, defaults to 0.02):
73
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
74
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
75
+ The epsilon used by the rms normalization layers.
76
+ use_cache (`bool`, *optional*, defaults to `True`):
77
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
78
+ relevant if `config.is_decoder=True`.
79
+ pad_token_id (`int`, *optional*):
80
+ Padding token id.
81
+ bos_token_id (`int`, *optional*, defaults to 0):
82
+ Beginning of stream token id.
83
+ eos_token_id (`int`, *optional*, defaults to 1):
84
+ End of stream token id.
85
+ pretraining_tp (`int`, *optional*, defaults to 1):
86
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
87
+ document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
88
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
89
+ issue](https://github.com/pytorch/pytorch/issues/76232).
90
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
91
+ Whether to tie the input and output word embeddings.
92
+ rope_theta (`float`, *optional*, defaults to 10000.0):
93
+ The base period of the RoPE embeddings.
94
+ rope_scaling (`Dict`, *optional*):
95
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
96
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
97
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
98
+ `max_position_embeddings` to the expected new maximum.
99
+ rope_interleave (`bool`, *optional*, defaults to `True`):
100
+ Whether to interleave the rotary position embeddings.
101
+ attention_bias (`bool`, *optional*, defaults to `False`):
102
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
103
+ attention_dropout (`float`, *optional*, defaults to 0.0):
104
+ The dropout ratio for the attention probabilities.
105
+
106
+ """
107
+
108
+ sub_configs = {"vision_config": Siglip2VisionConfig}
109
+ model_type = "youtu_vl"
110
+ keys_to_ignore_at_inference = ["past_key_values"]
111
+ base_model_pp_plan = {
112
+ "embed_tokens": (["input_ids"], ["inputs_embeds"]),
113
+ "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
114
+ "norm": (["hidden_states"], ["hidden_states"]),
115
+ }
116
+
117
+ def __init__(
118
+ self,
119
+ vocab_size=129280,
120
+ hidden_size=7168,
121
+ intermediate_size=18432,
122
+ num_hidden_layers=61,
123
+ num_attention_heads=128,
124
+ num_key_value_heads=128,
125
+ n_shared_experts=1,
126
+ n_routed_experts=256,
127
+ routed_scaling_factor=2.5,
128
+ kv_lora_rank=512,
129
+ q_lora_rank=1536,
130
+ qk_rope_head_dim=64,
131
+ v_head_dim=128,
132
+ qk_nope_head_dim=128,
133
+ n_group=8,
134
+ topk_group=4,
135
+ num_experts_per_tok=8,
136
+ norm_topk_prob=True,
137
+ hidden_act="silu",
138
+ max_position_embeddings=4096,
139
+ initializer_range=None,
140
+ embedding_initializer_range=None,
141
+ rms_norm_eps=1e-6,
142
+ use_cache=True,
143
+ pad_token_id=None,
144
+ bos_token_id=0,
145
+ eos_token_id=1,
146
+ pretraining_tp=1,
147
+ tie_word_embeddings=False,
148
+ rope_theta=10000.0,
149
+ rope_scaling=None,
150
+ rope_interleave=True,
151
+ attention_bias=False,
152
+ attention_dropout=0.0,
153
+ vision_config=None,
154
+ custom_tokens=None,
155
+ **kwargs,
156
+ ):
157
+ if isinstance(vision_config, dict):
158
+ self.vision_config = self.sub_configs["vision_config"](**vision_config)
159
+ elif vision_config is None:
160
+ self.vision_config = self.sub_configs["vision_config"]()
161
+
162
+ self.vocab_size = vocab_size
163
+ self.max_position_embeddings = max_position_embeddings
164
+ self.hidden_size = hidden_size
165
+ self.intermediate_size = intermediate_size
166
+ self.num_hidden_layers = num_hidden_layers
167
+ self.num_attention_heads = num_attention_heads
168
+ self.n_shared_experts = n_shared_experts
169
+ self.n_routed_experts = n_routed_experts
170
+ self.routed_scaling_factor = routed_scaling_factor
171
+ self.kv_lora_rank = kv_lora_rank
172
+ self.q_lora_rank = q_lora_rank
173
+ self.qk_rope_head_dim = qk_rope_head_dim
174
+ self.v_head_dim = v_head_dim
175
+ self.qk_nope_head_dim = qk_nope_head_dim
176
+ self.qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
177
+ self.head_dim = qk_rope_head_dim
178
+ self.n_group = n_group
179
+ self.topk_group = topk_group
180
+ self.num_experts_per_tok = num_experts_per_tok
181
+ self.norm_topk_prob = norm_topk_prob
182
+ self.rope_interleave = rope_interleave
183
+ self.flash_att_sliding_window = None
184
+ self.custom_tokens = custom_tokens
185
+
186
+ self.mlp_bias = False
187
+ self.mtp_loss_weight = 0.3
188
+
189
+ if num_key_value_heads is None:
190
+ num_key_value_heads = num_attention_heads
191
+
192
+ self.num_key_value_heads = num_key_value_heads
193
+ self.hidden_act = hidden_act
194
+ self.initializer_range = (
195
+ (2.0 / (5.0 * self.hidden_size)) ** 0.5
196
+ if initializer_range is None
197
+ else initializer_range
198
+ )
199
+ self.embedding_initializer_range = (
200
+ self.initializer_range * 2.0
201
+ if embedding_initializer_range is None
202
+ else embedding_initializer_range
203
+ )
204
+ self.rms_norm_eps = rms_norm_eps
205
+ self.pretraining_tp = pretraining_tp
206
+ self.use_cache = use_cache
207
+ self.rope_theta = rope_theta
208
+ self.rope_scaling = rope_scaling
209
+ self.attention_bias = attention_bias
210
+ self.attention_dropout = attention_dropout
211
+ if self.rope_scaling is not None and "type" in self.rope_scaling:
212
+ self.rope_scaling["rope_type"] = self.rope_scaling["type"]
213
+ rope_config_validation(self)
214
+
215
+ super().__init__(
216
+ pad_token_id=pad_token_id,
217
+ bos_token_id=bos_token_id,
218
+ eos_token_id=eos_token_id,
219
+ tie_word_embeddings=tie_word_embeddings,
220
+ **kwargs,
221
+ )
222
+
223
+
224
+ __all__ = ["YoutuVLConfig"]
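The derived fields in `YoutuVLConfig.__init__` can be reproduced without `transformers`. The following is a minimal sketch, assuming only the arithmetic visible in the constructor above; the helper names `default_init_ranges` and `derived_head_dims` are hypothetical and not part of the repository.

```python
def default_init_ranges(hidden_size, initializer_range=None, embedding_initializer_range=None):
    """Mirror the fallback logic of the constructor (hypothetical helper)."""
    if initializer_range is None:
        # Same expression as in the config: sqrt(2 / (5 * hidden_size)).
        initializer_range = (2.0 / (5.0 * hidden_size)) ** 0.5
    if embedding_initializer_range is None:
        # The embedding range falls back to twice the weight range.
        embedding_initializer_range = initializer_range * 2.0
    return initializer_range, embedding_initializer_range


def derived_head_dims(qk_nope_head_dim=128, qk_rope_head_dim=64):
    """Mirror how the config derives `qk_head_dim` and `head_dim` (hypothetical helper)."""
    return {
        "qk_head_dim": qk_nope_head_dim + qk_rope_head_dim,
        "head_dim": qk_rope_head_dim,
    }


if __name__ == "__main__":
    weight_std, embed_std = default_init_ranges(hidden_size=7168)
    print(f"initializer_range={weight_std:.6f}, embedding_initializer_range={embed_std:.6f}")
    print(derived_head_dims())
```

With the constructor's default `hidden_size=7168`, the fallback standard deviation is roughly 0.0075, noticeably smaller than the 0.02 that is common in other configurations.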
generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 128000,
4
+ "eos_token_id": 128001,
5
+ "pad_token_id": 128001,
6
+ "transformers_version": "4.56.0"
7
+ }
image_processing_siglip2_fast.py ADDED
@@ -0,0 +1,328 @@
1
+ from typing import List, Optional, Tuple, Union
2
+ import os
3
+ import torch
4
+ import math
5
+ from torchvision.transforms import functional as F
6
+ from transformers.image_processing_utils import BatchFeature
7
+ from transformers.image_processing_utils_fast import (
8
+ BaseImageProcessorFast,
9
+ DefaultFastImageProcessorKwargs,
10
+ SizeDict,
11
+ )
12
+ from transformers.image_utils import (
13
+ ImageInput,
14
+ PILImageResampling,
15
+ )
16
+ from transformers.processing_utils import Unpack
17
+ from transformers.utils import (
18
+ TensorType,
19
+ add_start_docstrings,
20
+ is_torch_available,
21
+ is_torchvision_available,
22
+ is_torchvision_v2_available,
23
+ logging,
24
+ )
25
+
26
+ BASE_IMAGE_PROCESSOR_FAST_DOCSTRING = r"""
27
+
28
+ Args:
29
+ do_resize (`bool`, *optional*, defaults to `self.do_resize`):
30
+ Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the
31
+ `do_resize` parameter in the `preprocess` method.
32
+ size (`dict`, *optional*, defaults to `self.size`):
33
+ Size of the output image after resizing. Can be overridden by the `size` parameter in the `preprocess`
34
+ method.
35
+ default_to_square (`bool`, *optional*, defaults to `self.default_to_square`):
36
+ Whether to default to a square image when resizing, if size is an int.
37
+ resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
38
+ Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`. Can be
39
+ overridden by the `resample` parameter in the `preprocess` method.
40
+ do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
41
+ Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the
42
+ `preprocess` method.
43
+ crop_size (`Dict[str, int]` *optional*, defaults to `self.crop_size`):
44
+ Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess`
45
+ method.
46
+ do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
47
+ Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the
48
+ `do_rescale` parameter in the `preprocess` method.
49
+ rescale_factor (`int` or `float`, *optional*, defaults to `self.rescale_factor`):
50
+ Scale factor to use if rescaling the image. Only has an effect if `do_rescale` is set to `True`. Can be
51
+ overridden by the `rescale_factor` parameter in the `preprocess` method.
52
+ do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
53
+ Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
54
+ method.
55
+ image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
56
+ Mean to use if normalizing the image. This is a float or list of floats the length of the number of
57
+ channels in the image. Can be
58
+ overridden by the `image_mean` parameter in the `preprocess` method.
59
+ image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
60
+ Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
61
+ number of channels in the image.
62
+ Can be overridden by the `image_std` parameter in the `preprocess` method.
63
+ do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
64
+ Whether to convert the image to RGB.
65
+ return_tensors (`str` or `TensorType`, *optional*, defaults to `self.return_tensors`):
66
+ Returns stacked tensors if set to `"pt"`, otherwise returns a list of tensors.
67
+ data_format (`ChannelDimension` or `str`, *optional*, defaults to `self.data_format`):
68
+ Only `ChannelDimension.FIRST` is supported. Added for compatibility with slow processors.
69
+ input_data_format (`ChannelDimension` or `str`, *optional*, defaults to `self.input_data_format`):
70
+ The channel dimension format for the input image. If unset, the channel dimension format is inferred
71
+ from the input image. Can be one of:
72
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
73
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
74
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
75
+ device (`torch.device`, *optional*, defaults to `self.device`):
76
+ The device to process the images on. If unset, the device is inferred from the input images."""
77
+
78
+ BASE_IMAGE_PROCESSOR_FAST_DOCSTRING_PREPROCESS = r"""
79
+ Preprocess an image or batch of images.
80
+
81
+ Args:
82
+ images (`ImageInput`):
83
+ Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
84
+ passing in images with pixel values between 0 and 1, set `do_rescale=False`.
85
+ do_resize (`bool`, *optional*, defaults to `self.do_resize`):
86
+ Whether to resize the image.
87
+ size (`Dict[str, int]`, *optional*, defaults to `self.size`):
88
+ Describes the maximum input dimensions to the model.
89
+ resample (`PILImageResampling` or `InterpolationMode`, *optional*, defaults to `self.resample`):
90
+ Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
91
+ has an effect if `do_resize` is set to `True`.
92
+ do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
93
+ Whether to center crop the image.
94
+ crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
95
+ Size of the output image after applying `center_crop`.
96
+ do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
97
+ Whether to rescale the image.
98
+ rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
99
+ Rescale factor to rescale the image by if `do_rescale` is set to `True`.
100
+ do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
101
+ Whether to normalize the image.
102
+ image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
103
+ Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
104
+ image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
105
+ Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
106
+ `True`.
107
+ do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
108
+ Whether to convert the image to RGB.
109
+ return_tensors (`str` or `TensorType`, *optional*, defaults to `self.return_tensors`):
110
+ Returns stacked tensors if set to `"pt"`, otherwise returns a list of tensors.
111
+ data_format (`ChannelDimension` or `str`, *optional*, defaults to `self.data_format`):
112
+ Only `ChannelDimension.FIRST` is supported. Added for compatibility with slow processors.
113
+ input_data_format (`ChannelDimension` or `str`, *optional*, defaults to `self.input_data_format`):
114
+ The channel dimension format for the input image. If unset, the channel dimension format is inferred
115
+ from the input image. Can be one of:
116
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
117
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
118
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
119
+ device (`torch.device`, *optional*, defaults to `self.device`):
120
+ The device to process the images on. If unset, the device is inferred from the input images."""
121
+
122
+
123
+ if is_torch_available():
124
+ import torch
125
+
126
+ if is_torchvision_available():
127
+ if is_torchvision_v2_available():
128
+ from torchvision.transforms.v2 import functional as F
129
+ else:
130
+ from torchvision.transforms import functional as F
131
+
132
+
133
+ logger = logging.get_logger(__name__)
134
+
135
+
136
+ def get_image_size_for_patches(
137
+ image_height: int, image_width: int, patch_size: int, max_num_patches: int
138
+ ) -> Tuple[int, int]:
139
+ """
140
+ Args:
141
+ image_height (`int`):
142
+ Original image height.
143
+ image_width (`int`):
144
+ Original image width.
145
+ patch_size (`int`):
146
+ Patch size for processing.
+ max_num_patches (`int`):
+ Maximum number of patches allowed after resizing.
147
+
148
+ Returns:
149
+ Tuple: (target_height, target_width)
150
+ """
151
+
152
+ def get_scaled_image_size(scale: float, size: int, patch_size: int) -> int:
153
+ patch_size = patch_size * 2
154
+ scaled_size = size * scale
155
+ scaled_size = math.ceil(scaled_size / patch_size) * patch_size
156
+ scaled_size = max(patch_size, scaled_size)
157
+ return int(scaled_size)
158
+
159
+ scale = 1.0
160
+ while True:
161
+ target_height = get_scaled_image_size(scale, image_height, patch_size)
162
+ target_width = get_scaled_image_size(scale, image_width, patch_size)
163
+ num_patches = (target_height / patch_size) * (target_width / patch_size)
164
+
165
+ if num_patches > max_num_patches:
166
+ scale -= 0.02
167
+ else:
168
+ break
169
+
170
+ return target_height, target_width
171
+
172
+
173
+ def convert_image_to_patches(image: "torch.Tensor", patch_size: int, merge_size: int) -> "torch.Tensor":
174
+ """
175
+ Converts an input image into flattened patches.
176
+
177
+ Args:
178
+ image: Input image tensor of shape (channels, height, width)
179
+ patch_size: Size of each square patch (in pixels)
180
+ merge_size: Number of adjacent patches to merge
181
+
182
+ """
183
+
184
+ num_channels, image_height, image_width = image.shape
185
+ num_patches_height = image_height // patch_size
186
+ num_patches_width = image_width // patch_size
187
+ patched_image = image.reshape(num_channels,
188
+ num_patches_height//merge_size,
189
+ merge_size, patch_size,
190
+ num_patches_width//merge_size,
191
+ merge_size, patch_size)
192
+ patched_image = patched_image.permute(1, 4, 2, 5, 3, 6, 0)
193
+ patched_image = patched_image.reshape(num_patches_height * num_patches_width, -1)
194
+ return patched_image
195
+
196
+ def pad_along_first_dim(
197
+ tensor: "torch.Tensor", target_length: int, pad_value: int = 0
198
+ ) -> Tuple["torch.Tensor", "torch.Tensor"]:
199
+ """
200
+ Pad the input tensor along its first dimension to a target length.
201
+
202
+ Args:
203
+ tensor (torch.Tensor): The input tensor to be padded.
204
+ target_length (int): The desired length of the first dimension after padding.
205
+ pad_value (int, optional): The value to use for padding. Defaults to 0.
206
+ """
207
+ current_length = tensor.shape[0]
208
+ padding_length = target_length - current_length
209
+ mask = torch.ones((target_length,), dtype=torch.int32, device=tensor.device)
210
+ if padding_length > 0:
211
+ padding = [0, 0] * (tensor.ndim - 1) + [0, padding_length]
212
+ tensor = torch.nn.functional.pad(tensor, padding, mode="constant", value=pad_value)
213
+ mask[-padding_length:] = 0
214
+ return tensor, mask
215
+
216
+
217
+ class Siglip2FastImageProcessorKwargs(DefaultFastImageProcessorKwargs):
218
+ patch_size: Optional[int]
219
+ max_num_patches: Optional[int]
220
+
221
+
222
+ @add_start_docstrings(
223
+ r"Constructs a fast Siglip2 image processor.",
224
+ BASE_IMAGE_PROCESSOR_FAST_DOCSTRING,
225
+ """
226
+ patch_size (`int`, *optional*, defaults to 16):
227
+ The size (resolution) of each patch the image will be split to.
228
+ max_num_patches (`int`, *optional*, defaults to 256):
229
+ The image will be resized to have at most this number of patches,
230
+ and then padded in "patch" dimension to match this number exactly.
231
+ """,
232
+ )
233
+ class Siglip2ImageProcessorFast(BaseImageProcessorFast):
234
+ resample = PILImageResampling.BILINEAR
235
+ image_mean = [0.5, 0.5, 0.5]
236
+ image_std = [0.5, 0.5, 0.5]
237
+ do_resize = True
238
+ do_rescale = True
239
+ do_normalize = True
240
+ patch_size = 16
241
+ max_num_patches = 256
242
+ valid_kwargs = Siglip2FastImageProcessorKwargs
243
+ unused_kwargs = ["size", "do_center_crop", "crop_size"]
244
+ print_max_patched = True
245
+
246
+ def __init__(self, **kwargs: Unpack[Siglip2FastImageProcessorKwargs]):
247
+ super().__init__(**kwargs)
248
+
249
+ def _validate_preprocess_kwargs(self, **kwargs) -> tuple:
250
+ kwargs.pop("do_resize", None)
251
+ return super()._validate_preprocess_kwargs(**kwargs)
252
+
253
+ @add_start_docstrings(
254
+ BASE_IMAGE_PROCESSOR_FAST_DOCSTRING_PREPROCESS,
255
+ """
256
+ patch_size (`int`, *optional*, defaults to `self.patch_size`):
257
+ The size (resolution) of each patch the image will be split to.
258
+ max_num_patches (`int`, *optional*, defaults to `self.max_num_patches`):
259
+ The image will be resized to have at most this number of patches,
260
+ and then padded in "patch" dimension to match this number exactly.
261
+ """,
262
+ )
263
+ def preprocess(self, images: ImageInput, **kwargs: Unpack[Siglip2FastImageProcessorKwargs]) -> BatchFeature:
264
+ return super().preprocess(images, **kwargs)
265
+
266
+ def get_max_image_patches(self, images):
267
+ return 4096 * 6 * 6
268
+
269
+ def _preprocess(
270
+ self,
271
+ images: List["torch.Tensor"],
272
+ do_resize: bool,
273
+ patch_size: int,
274
+ max_num_patches: int,
275
+ interpolation: Optional["F.InterpolationMode"],
276
+ do_rescale: bool,
277
+ rescale_factor: float,
278
+ do_normalize: bool,
279
+ image_mean: Optional[Union[float, List[float]]],
280
+ image_std: Optional[Union[float, List[float]]],
281
+ return_tensors: Optional[Union[str, TensorType]],
282
+ **kwargs,
283
+ ) -> BatchFeature:
284
+ pixel_masks = []
285
+ pixel_values = []
286
+ spatial_shapes = []
287
+
288
+ if Siglip2ImageProcessorFast.print_max_patched:
289
+ Siglip2ImageProcessorFast.print_max_patched = False
290
+
291
+ for i, image in enumerate(images):
292
+ height, width = get_image_size_for_patches(
293
+ image_height=image.shape[1],
294
+ image_width=image.shape[2],
295
+ patch_size=patch_size,
296
+ max_num_patches=max_num_patches,
297
+ )
298
+
299
+ side_dict = SizeDict(height=height, width=width)
300
+ image = self.resize(image=image, size=side_dict, interpolation=interpolation)
301
+ image = self.rescale_and_normalize(image, do_rescale, rescale_factor, do_normalize, image_mean, image_std)
302
+
303
+ patches = convert_image_to_patches(image, patch_size, 2)
304
+ patches, mask = pad_along_first_dim(patches, len(patches))
305
+
306
+ num_patches_height = image.shape[1] // patch_size
307
+ num_patches_width = image.shape[2] // patch_size
308
+
309
+ spatial_shapes.append((num_patches_height, num_patches_width))
310
+ pixel_values.append(patches)
311
+ pixel_masks.append(mask)
312
+
313
+ pixel_values = torch.stack(pixel_values, dim=0)
314
+ pixel_masks = torch.stack(pixel_masks, dim=0)
315
+ spatial_shapes = torch.tensor(spatial_shapes)
316
+
317
+ batch_feature = BatchFeature(
318
+ data={
319
+ "pixel_values": pixel_values,
320
+ "pixel_attention_mask": pixel_masks,
321
+ "spatial_shapes": spatial_shapes,
322
+ },
323
+ tensor_type=return_tensors,
324
+ )
325
+ return batch_feature
326
+
327
+
328
+ __all__ = ["Siglip2ImageProcessorFast"]
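The resize rule in `get_image_size_for_patches` can be tried out without `torch`. The following is a minimal sketch, assuming the same logic as the function above: each side is rounded up to a multiple of `2 * patch_size`, and the scale is lowered in steps of 0.02 until the patch budget is met. The names `scaled_side` and `target_size`, the `merge_size` parameter, and the `ValueError` guard are assumptions added for illustration and are not in the original file.

```python
import math


def scaled_side(scale, size, patch_size, merge_size=2):
    """Round a scaled side length up to a multiple of patch_size * merge_size."""
    unit = patch_size * merge_size
    scaled = math.ceil(size * scale / unit) * unit
    # Never go below one merged block per side.
    return max(unit, int(scaled))


def target_size(height, width, patch_size, max_num_patches, step=0.02, merge_size=2):
    """Shrink the scale step by step until the patch count fits the budget."""
    # Guard added in this sketch: the smallest possible output is
    # merge_size x merge_size patches, so a smaller budget would loop forever.
    if max_num_patches < merge_size * merge_size:
        raise ValueError("max_num_patches is smaller than one merged block")
    scale = 1.0
    while True:
        h = scaled_side(scale, height, patch_size, merge_size)
        w = scaled_side(scale, width, patch_size, merge_size)
        if (h // patch_size) * (w // patch_size) <= max_num_patches:
            return h, w
        scale -= step


if __name__ == "__main__":
    for shape in [(100, 200), (1000, 1000), (10, 10)]:
        print(shape, "->", target_size(*shape, patch_size=16, max_num_patches=256))
```

Because every side is a multiple of `2 * patch_size`, the resulting patch grid always divides evenly into 2x2 groups, which matches the hard-coded `merge_size` of 2 passed to `convert_image_to_patches` in `_preprocess`.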
model-00001-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:107939f0c4aaf5fdc7c5d9f4ad741546e4ebaef32fcd93d36a462f1e5ea04d0b
3
+ size 4968257712
model-00002-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f7898bdeb9c5cbc996c0a2789c7b3495c3faa179c9690cb2c01bad65b615250e
3
+ size 4999465960
model-00003-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:548ebf9f8deea536596d6dffc3fed769a60f15e414e6fb9f735a7bb7fa627340
3
+ size 713493632
model.safetensors.index.json ADDED
@@ -0,0 +1,930 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 10681102048
4
+ },
5
+ "weight_map": {
6
+ "merger.ln_q.weight": "model-00001-of-00003.safetensors",
7
+ "merger.mlp.0.bias": "model-00001-of-00003.safetensors",
8
+ "merger.mlp.0.weight": "model-00001-of-00003.safetensors",
9
+ "merger.mlp.2.bias": "model-00001-of-00003.safetensors",
10
+ "merger.mlp.2.weight": "model-00001-of-00003.safetensors",
11
+ "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
12
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00003.safetensors",
13
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
14
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
15
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
16
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
17
+ "model.layers.0.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
18
+ "model.layers.0.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
19
+ "model.layers.0.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
20
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
21
+ "model.layers.0.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
22
+ "model.layers.0.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
23
+ "model.layers.0.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
24
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00003.safetensors",
25
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
26
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
27
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
28
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
29
+ "model.layers.1.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
30
+ "model.layers.1.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
31
+ "model.layers.1.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
32
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
33
+ "model.layers.1.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
34
+ "model.layers.1.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
35
+ "model.layers.1.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
36
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00003.safetensors",
37
+ "model.layers.10.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
38
+ "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
39
+ "model.layers.10.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
40
+ "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
41
+ "model.layers.10.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
42
+ "model.layers.10.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
43
+ "model.layers.10.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
44
+ "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
45
+ "model.layers.10.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
46
+ "model.layers.10.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
47
+ "model.layers.10.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
48
+ "model.layers.11.input_layernorm.weight": "model-00001-of-00003.safetensors",
49
+ "model.layers.11.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
50
+ "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
51
+ "model.layers.11.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
52
+ "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
53
+ "model.layers.11.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
54
+ "model.layers.11.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
55
+ "model.layers.11.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
56
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
57
+ "model.layers.11.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
58
+ "model.layers.11.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
59
+ "model.layers.11.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
60
+ "model.layers.12.input_layernorm.weight": "model-00002-of-00003.safetensors",
61
+ "model.layers.12.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
62
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
63
+ "model.layers.12.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
64
+ "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
65
+ "model.layers.12.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
66
+ "model.layers.12.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
67
+ "model.layers.12.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
68
+ "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
69
+ "model.layers.12.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
70
+ "model.layers.12.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
71
+ "model.layers.12.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
72
+ "model.layers.13.input_layernorm.weight": "model-00002-of-00003.safetensors",
73
+ "model.layers.13.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
74
+ "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
75
+ "model.layers.13.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
76
+ "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
77
+ "model.layers.13.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
78
+ "model.layers.13.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
79
+ "model.layers.13.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
80
+ "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
81
+ "model.layers.13.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
82
+ "model.layers.13.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
83
+ "model.layers.13.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
84
+ "model.layers.14.input_layernorm.weight": "model-00002-of-00003.safetensors",
85
+ "model.layers.14.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
86
+ "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
87
+ "model.layers.14.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
88
+ "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
89
+ "model.layers.14.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
90
+ "model.layers.14.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
91
+ "model.layers.14.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
92
+ "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
93
+ "model.layers.14.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
94
+ "model.layers.14.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
95
+ "model.layers.14.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.15.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.16.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.17.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.18.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.19.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.2.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.2.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
+ "model.layers.2.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.2.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.2.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.2.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.20.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.20.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.20.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.20.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.20.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.20.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.21.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.22.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.23.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.24.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.25.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.26.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.27.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.28.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.29.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.3.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.3.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
+ "model.layers.3.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.3.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.3.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.3.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.30.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.30.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.30.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.30.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.30.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.30.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.30.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.30.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.30.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.30.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.30.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.30.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.31.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.32.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.33.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.34.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.35.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.36.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.36.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.36.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.36.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.36.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.36.self_attn.kv_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.36.self_attn.kv_a_proj_with_mqa.weight": "model-00002-of-00003.safetensors",
+ "model.layers.36.self_attn.kv_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.36.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.36.self_attn.q_a_layernorm.weight": "model-00002-of-00003.safetensors",
+ "model.layers.36.self_attn.q_a_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.36.self_attn.q_b_proj.weight": "model-00002-of-00003.safetensors",
+ "model.layers.37.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.37.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.37.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.37.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.37.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.37.self_attn.kv_a_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.37.self_attn.kv_a_proj_with_mqa.weight": "model-00003-of-00003.safetensors",
+ "model.layers.37.self_attn.kv_b_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.37.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.37.self_attn.q_a_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.37.self_attn.q_a_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.37.self_attn.q_b_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.self_attn.kv_a_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.self_attn.kv_a_proj_with_mqa.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.self_attn.kv_b_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.self_attn.q_a_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.self_attn.q_a_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.38.self_attn.q_b_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.self_attn.kv_a_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.self_attn.kv_a_proj_with_mqa.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.self_attn.kv_b_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.self_attn.q_a_layernorm.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.self_attn.q_a_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.39.self_attn.q_b_proj.weight": "model-00003-of-00003.safetensors",
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.4.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.4.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
+ "model.layers.4.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.4.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.4.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.4.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.5.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.6.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.7.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.8.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.self_attn.kv_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.self_attn.kv_a_proj_with_mqa.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.self_attn.kv_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.self_attn.q_a_layernorm.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.self_attn.q_a_proj.weight": "model-00001-of-00003.safetensors",
+ "model.layers.9.self_attn.q_b_proj.weight": "model-00001-of-00003.safetensors",
+ "model.norm.weight": "model-00003-of-00003.safetensors",
+ "siglip2.vision_model.embeddings.patch_embedding.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.embeddings.patch_embedding.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.0.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.1.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.10.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.11.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.12.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.13.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.14.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.15.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.16.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.17.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.18.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.19.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.2.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.20.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.21.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.22.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.23.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.24.layer_norm1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.24.layer_norm1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.24.layer_norm2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.24.layer_norm2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.24.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.24.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.24.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.24.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.24.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.24.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "siglip2.vision_model.encoder.layers.24.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
778
+ "siglip2.vision_model.encoder.layers.24.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
779
+ "siglip2.vision_model.encoder.layers.24.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
780
+ "siglip2.vision_model.encoder.layers.24.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
781
+ "siglip2.vision_model.encoder.layers.24.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
782
+ "siglip2.vision_model.encoder.layers.24.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
783
+ "siglip2.vision_model.encoder.layers.25.layer_norm1.bias": "model-00001-of-00003.safetensors",
784
+ "siglip2.vision_model.encoder.layers.25.layer_norm1.weight": "model-00001-of-00003.safetensors",
785
+ "siglip2.vision_model.encoder.layers.25.layer_norm2.bias": "model-00001-of-00003.safetensors",
786
+ "siglip2.vision_model.encoder.layers.25.layer_norm2.weight": "model-00001-of-00003.safetensors",
787
+ "siglip2.vision_model.encoder.layers.25.mlp.fc1.bias": "model-00001-of-00003.safetensors",
788
+ "siglip2.vision_model.encoder.layers.25.mlp.fc1.weight": "model-00001-of-00003.safetensors",
789
+ "siglip2.vision_model.encoder.layers.25.mlp.fc2.bias": "model-00001-of-00003.safetensors",
790
+ "siglip2.vision_model.encoder.layers.25.mlp.fc2.weight": "model-00001-of-00003.safetensors",
791
+ "siglip2.vision_model.encoder.layers.25.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
792
+ "siglip2.vision_model.encoder.layers.25.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
793
+ "siglip2.vision_model.encoder.layers.25.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
794
+ "siglip2.vision_model.encoder.layers.25.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
795
+ "siglip2.vision_model.encoder.layers.25.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
796
+ "siglip2.vision_model.encoder.layers.25.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
797
+ "siglip2.vision_model.encoder.layers.25.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
798
+ "siglip2.vision_model.encoder.layers.25.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
799
+ "siglip2.vision_model.encoder.layers.26.layer_norm1.bias": "model-00001-of-00003.safetensors",
800
+ "siglip2.vision_model.encoder.layers.26.layer_norm1.weight": "model-00001-of-00003.safetensors",
801
+ "siglip2.vision_model.encoder.layers.26.layer_norm2.bias": "model-00001-of-00003.safetensors",
802
+ "siglip2.vision_model.encoder.layers.26.layer_norm2.weight": "model-00001-of-00003.safetensors",
803
+ "siglip2.vision_model.encoder.layers.26.mlp.fc1.bias": "model-00001-of-00003.safetensors",
804
+ "siglip2.vision_model.encoder.layers.26.mlp.fc1.weight": "model-00001-of-00003.safetensors",
805
+ "siglip2.vision_model.encoder.layers.26.mlp.fc2.bias": "model-00001-of-00003.safetensors",
806
+ "siglip2.vision_model.encoder.layers.26.mlp.fc2.weight": "model-00001-of-00003.safetensors",
807
+ "siglip2.vision_model.encoder.layers.26.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
808
+ "siglip2.vision_model.encoder.layers.26.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
809
+ "siglip2.vision_model.encoder.layers.26.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
810
+ "siglip2.vision_model.encoder.layers.26.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
811
+ "siglip2.vision_model.encoder.layers.26.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
812
+ "siglip2.vision_model.encoder.layers.26.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
813
+ "siglip2.vision_model.encoder.layers.26.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
814
+ "siglip2.vision_model.encoder.layers.26.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
815
+ "siglip2.vision_model.encoder.layers.3.layer_norm1.bias": "model-00001-of-00003.safetensors",
816
+ "siglip2.vision_model.encoder.layers.3.layer_norm1.weight": "model-00001-of-00003.safetensors",
817
+ "siglip2.vision_model.encoder.layers.3.layer_norm2.bias": "model-00001-of-00003.safetensors",
818
+ "siglip2.vision_model.encoder.layers.3.layer_norm2.weight": "model-00001-of-00003.safetensors",
819
+ "siglip2.vision_model.encoder.layers.3.mlp.fc1.bias": "model-00001-of-00003.safetensors",
820
+ "siglip2.vision_model.encoder.layers.3.mlp.fc1.weight": "model-00001-of-00003.safetensors",
821
+ "siglip2.vision_model.encoder.layers.3.mlp.fc2.bias": "model-00001-of-00003.safetensors",
822
+ "siglip2.vision_model.encoder.layers.3.mlp.fc2.weight": "model-00001-of-00003.safetensors",
823
+ "siglip2.vision_model.encoder.layers.3.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
824
+ "siglip2.vision_model.encoder.layers.3.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
825
+ "siglip2.vision_model.encoder.layers.3.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
826
+ "siglip2.vision_model.encoder.layers.3.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
827
+ "siglip2.vision_model.encoder.layers.3.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
828
+ "siglip2.vision_model.encoder.layers.3.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
829
+ "siglip2.vision_model.encoder.layers.3.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
830
+ "siglip2.vision_model.encoder.layers.3.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
831
+ "siglip2.vision_model.encoder.layers.4.layer_norm1.bias": "model-00001-of-00003.safetensors",
832
+ "siglip2.vision_model.encoder.layers.4.layer_norm1.weight": "model-00001-of-00003.safetensors",
833
+ "siglip2.vision_model.encoder.layers.4.layer_norm2.bias": "model-00001-of-00003.safetensors",
834
+ "siglip2.vision_model.encoder.layers.4.layer_norm2.weight": "model-00001-of-00003.safetensors",
835
+ "siglip2.vision_model.encoder.layers.4.mlp.fc1.bias": "model-00001-of-00003.safetensors",
836
+ "siglip2.vision_model.encoder.layers.4.mlp.fc1.weight": "model-00001-of-00003.safetensors",
837
+ "siglip2.vision_model.encoder.layers.4.mlp.fc2.bias": "model-00001-of-00003.safetensors",
838
+ "siglip2.vision_model.encoder.layers.4.mlp.fc2.weight": "model-00001-of-00003.safetensors",
839
+ "siglip2.vision_model.encoder.layers.4.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
840
+ "siglip2.vision_model.encoder.layers.4.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
841
+ "siglip2.vision_model.encoder.layers.4.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
842
+ "siglip2.vision_model.encoder.layers.4.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
843
+ "siglip2.vision_model.encoder.layers.4.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
844
+ "siglip2.vision_model.encoder.layers.4.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
845
+ "siglip2.vision_model.encoder.layers.4.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
846
+ "siglip2.vision_model.encoder.layers.4.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
847
+ "siglip2.vision_model.encoder.layers.5.layer_norm1.bias": "model-00001-of-00003.safetensors",
848
+ "siglip2.vision_model.encoder.layers.5.layer_norm1.weight": "model-00001-of-00003.safetensors",
849
+ "siglip2.vision_model.encoder.layers.5.layer_norm2.bias": "model-00001-of-00003.safetensors",
850
+ "siglip2.vision_model.encoder.layers.5.layer_norm2.weight": "model-00001-of-00003.safetensors",
851
+ "siglip2.vision_model.encoder.layers.5.mlp.fc1.bias": "model-00001-of-00003.safetensors",
852
+ "siglip2.vision_model.encoder.layers.5.mlp.fc1.weight": "model-00001-of-00003.safetensors",
853
+ "siglip2.vision_model.encoder.layers.5.mlp.fc2.bias": "model-00001-of-00003.safetensors",
854
+ "siglip2.vision_model.encoder.layers.5.mlp.fc2.weight": "model-00001-of-00003.safetensors",
855
+ "siglip2.vision_model.encoder.layers.5.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
856
+ "siglip2.vision_model.encoder.layers.5.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
857
+ "siglip2.vision_model.encoder.layers.5.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
858
+ "siglip2.vision_model.encoder.layers.5.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
859
+ "siglip2.vision_model.encoder.layers.5.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
860
+ "siglip2.vision_model.encoder.layers.5.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
861
+ "siglip2.vision_model.encoder.layers.5.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
862
+ "siglip2.vision_model.encoder.layers.5.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
863
+ "siglip2.vision_model.encoder.layers.6.layer_norm1.bias": "model-00001-of-00003.safetensors",
864
+ "siglip2.vision_model.encoder.layers.6.layer_norm1.weight": "model-00001-of-00003.safetensors",
865
+ "siglip2.vision_model.encoder.layers.6.layer_norm2.bias": "model-00001-of-00003.safetensors",
866
+ "siglip2.vision_model.encoder.layers.6.layer_norm2.weight": "model-00001-of-00003.safetensors",
867
+ "siglip2.vision_model.encoder.layers.6.mlp.fc1.bias": "model-00001-of-00003.safetensors",
868
+ "siglip2.vision_model.encoder.layers.6.mlp.fc1.weight": "model-00001-of-00003.safetensors",
869
+ "siglip2.vision_model.encoder.layers.6.mlp.fc2.bias": "model-00001-of-00003.safetensors",
870
+ "siglip2.vision_model.encoder.layers.6.mlp.fc2.weight": "model-00001-of-00003.safetensors",
871
+ "siglip2.vision_model.encoder.layers.6.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
872
+ "siglip2.vision_model.encoder.layers.6.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
873
+ "siglip2.vision_model.encoder.layers.6.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
874
+ "siglip2.vision_model.encoder.layers.6.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
875
+ "siglip2.vision_model.encoder.layers.6.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
876
+ "siglip2.vision_model.encoder.layers.6.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
877
+ "siglip2.vision_model.encoder.layers.6.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
878
+ "siglip2.vision_model.encoder.layers.6.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
879
+ "siglip2.vision_model.encoder.layers.7.layer_norm1.bias": "model-00001-of-00003.safetensors",
880
+ "siglip2.vision_model.encoder.layers.7.layer_norm1.weight": "model-00001-of-00003.safetensors",
881
+ "siglip2.vision_model.encoder.layers.7.layer_norm2.bias": "model-00001-of-00003.safetensors",
882
+ "siglip2.vision_model.encoder.layers.7.layer_norm2.weight": "model-00001-of-00003.safetensors",
883
+ "siglip2.vision_model.encoder.layers.7.mlp.fc1.bias": "model-00001-of-00003.safetensors",
884
+ "siglip2.vision_model.encoder.layers.7.mlp.fc1.weight": "model-00001-of-00003.safetensors",
885
+ "siglip2.vision_model.encoder.layers.7.mlp.fc2.bias": "model-00001-of-00003.safetensors",
886
+ "siglip2.vision_model.encoder.layers.7.mlp.fc2.weight": "model-00001-of-00003.safetensors",
887
+ "siglip2.vision_model.encoder.layers.7.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
888
+ "siglip2.vision_model.encoder.layers.7.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
889
+ "siglip2.vision_model.encoder.layers.7.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
890
+ "siglip2.vision_model.encoder.layers.7.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
891
+ "siglip2.vision_model.encoder.layers.7.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
892
+ "siglip2.vision_model.encoder.layers.7.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
893
+ "siglip2.vision_model.encoder.layers.7.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
894
+ "siglip2.vision_model.encoder.layers.7.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
895
+ "siglip2.vision_model.encoder.layers.8.layer_norm1.bias": "model-00001-of-00003.safetensors",
896
+ "siglip2.vision_model.encoder.layers.8.layer_norm1.weight": "model-00001-of-00003.safetensors",
897
+ "siglip2.vision_model.encoder.layers.8.layer_norm2.bias": "model-00001-of-00003.safetensors",
898
+ "siglip2.vision_model.encoder.layers.8.layer_norm2.weight": "model-00001-of-00003.safetensors",
899
+ "siglip2.vision_model.encoder.layers.8.mlp.fc1.bias": "model-00001-of-00003.safetensors",
900
+ "siglip2.vision_model.encoder.layers.8.mlp.fc1.weight": "model-00001-of-00003.safetensors",
901
+ "siglip2.vision_model.encoder.layers.8.mlp.fc2.bias": "model-00001-of-00003.safetensors",
902
+ "siglip2.vision_model.encoder.layers.8.mlp.fc2.weight": "model-00001-of-00003.safetensors",
903
+ "siglip2.vision_model.encoder.layers.8.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
904
+ "siglip2.vision_model.encoder.layers.8.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
905
+ "siglip2.vision_model.encoder.layers.8.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
906
+ "siglip2.vision_model.encoder.layers.8.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
907
+ "siglip2.vision_model.encoder.layers.8.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
908
+ "siglip2.vision_model.encoder.layers.8.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
909
+ "siglip2.vision_model.encoder.layers.8.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
910
+ "siglip2.vision_model.encoder.layers.8.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
911
+ "siglip2.vision_model.encoder.layers.9.layer_norm1.bias": "model-00001-of-00003.safetensors",
912
+ "siglip2.vision_model.encoder.layers.9.layer_norm1.weight": "model-00001-of-00003.safetensors",
913
+ "siglip2.vision_model.encoder.layers.9.layer_norm2.bias": "model-00001-of-00003.safetensors",
914
+ "siglip2.vision_model.encoder.layers.9.layer_norm2.weight": "model-00001-of-00003.safetensors",
915
+ "siglip2.vision_model.encoder.layers.9.mlp.fc1.bias": "model-00001-of-00003.safetensors",
916
+ "siglip2.vision_model.encoder.layers.9.mlp.fc1.weight": "model-00001-of-00003.safetensors",
917
+ "siglip2.vision_model.encoder.layers.9.mlp.fc2.bias": "model-00001-of-00003.safetensors",
918
+ "siglip2.vision_model.encoder.layers.9.mlp.fc2.weight": "model-00001-of-00003.safetensors",
919
+ "siglip2.vision_model.encoder.layers.9.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
920
+ "siglip2.vision_model.encoder.layers.9.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
921
+ "siglip2.vision_model.encoder.layers.9.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
922
+ "siglip2.vision_model.encoder.layers.9.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
923
+ "siglip2.vision_model.encoder.layers.9.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
924
+ "siglip2.vision_model.encoder.layers.9.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
925
+ "siglip2.vision_model.encoder.layers.9.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
926
+ "siglip2.vision_model.encoder.layers.9.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
927
+ "siglip2.vision_model.post_layernorm.bias": "model-00001-of-00003.safetensors",
928
+ "siglip2.vision_model.post_layernorm.weight": "model-00001-of-00003.safetensors"
929
+ }
930
+ }
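The listing above maps every tensor name to the shard file that stores it. Below is a minimal sketch of how such a map can be queried. It assumes the usual Hugging Face sharded-checkpoint layout, in which these entries sit under a top-level `weight_map` key (the key itself is not visible in this excerpt). `shard_for` and `tensors_by_shard` are hypothetical helper names, and the inline JSON is a two-entry stand-in for the real file.

```python
import json
from collections import defaultdict

# Two-entry stand-in for the index file; both entries are copied from the listing above.
# Assumption: the entries live under a top-level "weight_map" key.
INDEX_JSON = """
{
  "weight_map": {
    "siglip2.vision_model.post_layernorm.bias": "model-00001-of-00003.safetensors",
    "siglip2.vision_model.post_layernorm.weight": "model-00001-of-00003.safetensors"
  }
}
"""


def shard_for(index: dict, tensor_name: str) -> str:
    """Return the shard file that stores `tensor_name` (hypothetical helper)."""
    return index["weight_map"][tensor_name]


def tensors_by_shard(index: dict) -> dict:
    """Group tensor names by the shard file that holds them (hypothetical helper)."""
    grouped = defaultdict(list)
    for name, shard in index["weight_map"].items():
        grouped[shard].append(name)
    return dict(grouped)


index = json.loads(INDEX_JSON)
print(shard_for(index, "siglip2.vision_model.post_layernorm.bias"))
```

Grouping by shard is useful when only part of a model is needed: every `siglip2.vision_model.*` entry in this excerpt points at the first of the three shards.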
modeling_siglip2.py ADDED
@@ -0,0 +1,1623 @@
+ import math
+ import warnings
+ from dataclasses import dataclass
+ from typing import Any, Callable, Optional, Tuple, Union
+
+ import os
+ import numpy as np
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+ from torch.nn.init import _calculate_fan_in_and_fan_out
+
+ from transformers.activations import ACT2FN
+ from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask
+ from transformers.modeling_layers import GradientCheckpointingLayer
+ from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, ImageClassifierOutput
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+ from transformers.utils import (
+     ModelOutput,
+     add_start_docstrings,
+     add_start_docstrings_to_model_forward,
+     can_return_tuple,
+     logging,
+     replace_return_docstrings,
+     is_flash_attn_2_available,
+     is_flash_attn_greater_or_equal_2_10,
+ )
+
+ from .configuration_siglip2 import Siglip2Config, Siglip2TextConfig, Siglip2VisionConfig
+
+
+ logger = logging.get_logger(__name__)
+
+ _CONFIG_FOR_DOC = "Siglip2Config"
+
+ is_aiter_available = False
+
39
+ if is_flash_attn_2_available():
40
+ try:
41
+ from aiter import flash_attn_varlen_func
42
+ is_aiter_available = True
43
+ except ImportError:
44
+ from flash_attn import flash_attn_varlen_func
45
+ from flash_attn.layers.rotary import apply_rotary_emb
46
+
47
+ else:
48
+ flash_attn_varlen_func = None
49
+ apply_rotary_emb = None
50
+
51
+
52
+ if is_flash_attn_2_available():
53
+ from transformers.modeling_flash_attention_utils import _flash_attention_forward
54
+ else:
55
+ flash_attn_varlen_func = None
56
+
+ @dataclass
+ class Siglip2VisionOutput(ModelOutput):
+     """
+     Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
+
+     Args:
+         image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`,
+             *optional*, returned when the model is initialized with `with_projection=True`):
+             The image embeddings obtained by applying the projection layer to the pooler_output.
+         last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
+             Sequence of hidden-states at the output of the last layer of the model.
+         hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True`
+             is passed or when `config.output_hidden_states=True`):
+             Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+             one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
+
+             Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
+         attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or
+             when `config.output_attentions=True`):
+             Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+             sequence_length)`.
+
+             Attention weights after the attention softmax, used to compute the weighted average in the self-attention
+             heads.
+     """
+
+     image_embeds: Optional[torch.FloatTensor] = None
+     last_hidden_state: Optional[torch.FloatTensor] = None
+     hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+     attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+
+
+ @dataclass
+ class Siglip2TextOutput(ModelOutput):
+     """
+     Base class for text model's outputs that also contains a pooling of the last hidden states.
+
+     Args:
+         text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`,
+             *optional*, returned when the model is initialized with `with_projection=True`):
+             The text embeddings obtained by applying the projection layer to the pooler_output.
+         last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
+             Sequence of hidden-states at the output of the last layer of the model.
+         hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned
+             when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+             Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+             one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
+
+             Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
+         attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed
+             or when `config.output_attentions=True`):
+             Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+             sequence_length)`.
+
+             Attention weights after the attention softmax, used to compute the weighted average in the self-attention
+             heads.
+     """
+
+     text_embeds: Optional[torch.FloatTensor] = None
+     last_hidden_state: Optional[torch.FloatTensor] = None
+     hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+     attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+
+
+ @dataclass
+ class Siglip2Output(ModelOutput):
+     """
+     Args:
+         loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
+             Contrastive loss for image-text similarity.
+         logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
+             The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
+             similarity scores.
+         logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
+             The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
+             similarity scores.
+         text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`):
+             The text embeddings obtained by applying the projection layer to the pooled output of [`Siglip2TextModel`].
+         image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`):
+             The image embeddings obtained by applying the projection layer to
+             the pooled output of [`Siglip2VisionModel`].
+         text_model_output (`BaseModelOutputWithPooling`):
+             The output of the [`Siglip2TextModel`].
+         vision_model_output (`BaseModelOutputWithPooling`):
+             The output of the [`Siglip2VisionModel`].
+     """
+
+     loss: Optional[torch.FloatTensor] = None
+     logits_per_image: Optional[torch.FloatTensor] = None
+     logits_per_text: Optional[torch.FloatTensor] = None
+     text_embeds: Optional[torch.FloatTensor] = None
+     image_embeds: Optional[torch.FloatTensor] = None
+     text_model_output: BaseModelOutputWithPooling = None
+     vision_model_output: BaseModelOutputWithPooling = None
+
+     def to_tuple(self) -> Tuple[Any]:
+         return tuple(
+             self[k] if k not in ["text_model_output", "vision_model_output"] else getattr(self, k).to_tuple()
+             for k in self.keys()
+         )
+
+ class Siglip2VisionEmbeddings(nn.Module):
+     def __init__(self, config: Siglip2VisionConfig):
+         super().__init__()
+         self.config = config
+         self.embed_dim = config.hidden_size
+         self.patch_size = config.patch_size
+
+         if hasattr(config, 'in_features') and config.in_features > 0:
+             self.in_features = config.in_features
+         else:
+             self.in_features = config.num_channels * self.patch_size * self.patch_size
+
+         self.patch_embedding = nn.Linear(
+             in_features=self.in_features,
+             out_features=self.embed_dim,
+         )
+
+         self.num_patches = config.num_patches
+         self.position_embedding_size = int(self.num_patches**0.5)
+         self.position_embedding = nn.Embedding(self.num_patches, self.embed_dim)
+
+     @staticmethod
+     def resize_positional_embeddings(
+         positional_embeddings: torch.Tensor,
+         spatial_shapes: torch.LongTensor,
+         max_length: int,
+     ) -> torch.Tensor:
+         """
+         Resize positional embeddings to image-specific size and pad to a fixed size.
+
+         Args:
+             positional_embeddings (`torch.Tensor`):
+                 Position embeddings of shape (height, width, embed_dim)
+             spatial_shapes (`torch.LongTensor`):
+                 Spatial shapes of shape (batch_size, 2) to resize the positional embeddings to
+             max_length (`int`):
+                 Maximum length of the positional embeddings to pad resized positional embeddings to
+
+         Returns:
+             `torch.Tensor`: Embeddings of shape (batch_size, max_length, embed_dim)
+         """
+         batch_size = spatial_shapes.shape[0]
+         embed_dim = positional_embeddings.shape[-1]
+         source_dtype = positional_embeddings.dtype
+
+         resulted_positional_embeddings = torch.empty(
+             (batch_size, max_length, embed_dim),
+             device=positional_embeddings.device,
+             dtype=source_dtype,
+         )
+
+         positional_embeddings = positional_embeddings.permute(2, 0, 1).unsqueeze(0)
+         if positional_embeddings.device.type == "cpu":
+             positional_embeddings = positional_embeddings.to(torch.float32)
+         for i in range(batch_size):
+             height, width = spatial_shapes[i]
+             resized_embeddings = F.interpolate(
+                 positional_embeddings,
+                 size=(height, width),
+                 mode="bilinear",
+                 align_corners=False,
+                 antialias=True,
+             )
+             resized_embeddings = resized_embeddings.reshape(embed_dim, height * width).transpose(0, 1)
+             resized_embeddings = resized_embeddings.to(source_dtype)
+
+             resulted_positional_embeddings[i, : height * width] = resized_embeddings
+             resulted_positional_embeddings[i, height * width :] = resized_embeddings[0]
+
+         return resulted_positional_embeddings
+
+
+     def forward(self, pixel_values: torch.FloatTensor, spatial_shapes: torch.LongTensor) -> torch.Tensor:
+         """
+         Args:
+             pixel_values (`torch.FloatTensor`):
+                 Pixel values of shape (batch_size, max_num_patches, num_channels * patch_size * patch_size)
+             spatial_shapes (`torch.LongTensor`):
+                 Spatial shapes of shape (batch_size, 2) to resize the positional embeddings to
+         """
+
+         target_dtype = self.patch_embedding.weight.dtype
+         patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))
+         positional_embeddings = self.position_embedding.weight.reshape(
+             self.position_embedding_size, self.position_embedding_size, -1
+         )
+
+         resized_positional_embeddings = self.resize_positional_embeddings(
+             positional_embeddings, spatial_shapes, max_length=pixel_values.shape[1]
+         )
+         embeddings = patch_embeds + resized_positional_embeddings
+         return embeddings
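`resize_positional_embeddings` above resizes the learned position grid to each image's `(height, width)`, flattens it row-major, and fills the unused tail up to `max_length` with the first resized embedding. The pure-Python sketch below mirrors only that bookkeeping. It deliberately swaps the bilinear, antialiased `F.interpolate` call for nearest-neighbour sampling so that it runs without PyTorch. `resize_and_pad` is a hypothetical name, not part of the file.

```python
from typing import List

Vector = List[float]


def resize_and_pad(grid: List[List[Vector]], height: int, width: int, max_length: int) -> List[Vector]:
    """Nearest-neighbour resize of an embedding grid, flattened row-major and padded.

    Assumption: nearest-neighbour sampling stands in for the bilinear,
    antialiased interpolation used by the real method.
    """
    if height < 1 or width < 1:
        raise ValueError("height and width must be positive")
    if height * width > max_length:
        raise ValueError("max_length is smaller than height * width")

    src_h, src_w = len(grid), len(grid[0])
    resized: List[Vector] = []
    for row in range(height):
        src_row = min(src_h - 1, row * src_h // height)
        for col in range(width):
            src_col = min(src_w - 1, col * src_w // width)
            resized.append(list(grid[src_row][src_col]))

    # Pad the tail with copies of the first resized embedding,
    # mirroring the `resized_embeddings[0]` assignment in the original.
    padding = [list(resized[0]) for _ in range(max_length - height * width)]
    return resized + padding


# A 2x2 grid of 1-dimensional embeddings, resized to 1x2 and padded to length 4.
grid = [[[0.0], [1.0]], [[2.0], [3.0]]]
padded = resize_and_pad(grid, height=1, width=2, max_length=4)
print(padded)  # [[0.0], [1.0], [0.0], [0.0]]
```

Padding to a fixed `max_length` is what lets images with different patch grids share one batch tensor; in the real model the padded positions are presumably excluded by the attention mask.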
+
+
+ class Siglip2VisionEmbeddingsWoPos(nn.Module):
+     def __init__(self, config: Siglip2VisionConfig):
+         super().__init__()
+         self.config = config
+         self.embed_dim = config.hidden_size
+         self.patch_size = config.patch_size
+
+         if hasattr(config, 'in_features') and config.in_features > 0:
+             self.in_features = config.in_features
+         else:
+             self.in_features = config.num_channels * self.patch_size * self.patch_size
+
+         self.patch_embedding = nn.Linear(
+             in_features=self.in_features,
+             out_features=self.embed_dim,
+         )
+
+     def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
+         target_dtype = self.patch_embedding.weight.dtype
+         patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))
+         patch_embeds = patch_embeds.view(-1, self.embed_dim)
+         return patch_embeds
+
+
+ def eager_attention_forward(
+     module: nn.Module,
+     query: torch.Tensor,
+     key: torch.Tensor,
+     value: torch.Tensor,
+     attention_mask: Optional[torch.Tensor],
+     scaling: float,
+     dropout: float = 0.0,
+     **kwargs,
+ ):
+
+     attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
+     if attention_mask is not None:
+         attn_weights = attn_weights + attention_mask
+     attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
+     attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+     attn_output = torch.matmul(attn_weights, value)
+     attn_output = attn_output.transpose(1, 2).contiguous()
+     return attn_output, attn_weights
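`eager_attention_forward` above computes `softmax(Q K^T * scaling + mask) V` for every head. The sketch below reproduces the same arithmetic for a single head, assuming plain nested lists instead of tensors and leaving out dropout and the final transpose. `single_head_attention` is a hypothetical name used only for illustration.

```python
import math
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(
    query: Matrix,
    key: Matrix,
    value: Matrix,
    scaling: float,
    mask: Optional[Matrix] = None,
) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention for one head (no dropout)."""
    weights: Matrix = []
    for i, q in enumerate(query):
        # One score per key: scaled dot product of the query with that key.
        scores = [scaling * sum(a * b for a, b in zip(q, k)) for k in key]
        if mask is not None:
            # Additive mask: large negative values suppress a position.
            scores = [s + m for s, m in zip(scores, mask[i])]
        weights.append(softmax(scores))

    dim = len(value[0])
    output = [[sum(w * v[d] for w, v in zip(w_row, value)) for d in range(dim)] for w_row in weights]
    return output, weights


query = [[1.0, 0.0]]
key = [[1.0, 0.0], [0.0, 1.0]]
value = [[1.0], [3.0]]

# Masking the second key with -inf leaves all the weight on the first key.
out, attn = single_head_attention(query, key, value, scaling=1.0, mask=[[0.0, float("-inf")]])
print(out, attn)  # [[1.0]] [[1.0, 0.0]]
```

The mask is additive, exactly as in the function above: a masked position receives a very negative score before the softmax, so its weight becomes zero.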
295
+
296
+ def apply_rotary_pos_emb_flashatt(
297
+ q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor
298
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
299
+ cos = cos.chunk(2, dim=-1)[0].contiguous()
300
+ sin = sin.chunk(2, dim=-1)[0].contiguous()
301
+ q_embed = apply_rotary_emb(q.float(), cos.float(), sin.float()).type_as(q)
302
+ k_embed = apply_rotary_emb(k.float(), cos.float(), sin.float()).type_as(k)
303
+ return q_embed, k_embed
304
+
305
+ class Siglip2Attention(nn.Module):
306
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
307
+
308
+ def __init__(self, config: Union[Siglip2VisionConfig, Siglip2TextConfig]):
309
+ super().__init__()
310
+ self.config = config
311
+ self.embed_dim = config.hidden_size
312
+ self.num_heads = config.num_attention_heads
313
+ self.head_dim = self.embed_dim // self.num_heads
314
+ if self.head_dim * self.num_heads != self.embed_dim:
315
+ raise ValueError(
316
+ f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
317
+ f" {self.num_heads})."
318
+ )
319
+ self.scale = self.head_dim**-0.5
320
+ self.dropout = config.attention_dropout
321
+ self.is_causal = False
322
+
323
+ self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
324
+ self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
325
+ self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
326
+ self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)
327
+
328
+ def forward(
329
+ self,
330
+ hidden_states: torch.Tensor,
331
+ attention_mask: Optional[torch.Tensor] = None,
332
+ output_attentions: Optional[bool] = False,
333
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
334
+ """Input shape: Batch x Time x Channel"""
335
+
336
+ batch_size, seq_length, embed_dim = hidden_states.shape
337
+
338
+ queries = self.q_proj(hidden_states)
339
+ keys = self.k_proj(hidden_states)
340
+ values = self.v_proj(hidden_states)
341
+
342
+ queries = queries.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
343
+ keys = keys.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
344
+ values = values.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
345
+
346
+ attention_interface: Callable = eager_attention_forward
347
+ if self.config._attn_implementation != "eager":
348
+ if self.config._attn_implementation == "sdpa" and output_attentions:
349
+ logger.warning_once(
350
+ "`torch.nn.functional.scaled_dot_product_attention` does not support "
351
+ "`output_attentions=True`. Falling back to eager attention. This warning "
352
+ 'can be removed using the argument `attn_implementation="eager"` when loading the model.'
353
+ )
354
+ else:
355
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
356
+
357
+ attn_output, attn_weights = attention_interface(
358
+ self,
359
+ queries,
360
+ keys,
361
+ values,
362
+ attention_mask,
363
+ is_causal=self.is_causal,
364
+ scaling=self.scale,
365
+ dropout=0.0 if not self.training else self.dropout,
366
+ )
367
+
368
+ attn_output = attn_output.reshape(batch_size, seq_length, embed_dim).contiguous()
369
+ attn_output = self.out_proj(attn_output)
370
+
371
+ if not output_attentions:
372
+ attn_weights = None
373
+
374
+ return attn_output, attn_weights
375
+
376
+ class Vision_FlashAttention2(nn.Module):
377
+ def __init__(self, config: Union[Siglip2VisionConfig, Siglip2TextConfig]) -> None:
378
+ super().__init__()
379
+ dim = config.hidden_size
380
+ self.num_heads = config.num_attention_heads
381
+ self.k_proj = nn.Linear(dim, dim)
382
+ self.v_proj = nn.Linear(dim, dim)
383
+ self.q_proj = nn.Linear(dim, dim)
384
+ self.out_proj = nn.Linear(dim, dim)
385
+
386
+ def forward(
387
+ self,
388
+ hidden_states: torch.Tensor,
389
+ cu_seqlens: torch.Tensor,
390
+ rotary_pos_emb: Optional[torch.Tensor] = None,
391
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
392
+ ) -> Tuple[torch.Tensor, None]:
393
+
394
+ seq_length = hidden_states.shape[0]
395
+ q = self.q_proj(hidden_states).reshape(seq_length, self.num_heads, -1)
396
+ k = self.k_proj(hidden_states).reshape(seq_length, self.num_heads, -1)
397
+ v = self.v_proj(hidden_states).reshape(seq_length, self.num_heads, -1)
398
+
399
+
400
+ if position_embeddings is None:
401
+ logger.warning_once(
402
+ "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
403
+ "through `rotary_pos_emb` (2D tensor of RoPE theta values), to using externally computed "
404
+ "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.54 `rotary_pos_emb` will be "
405
+ "removed and `position_embeddings` will be mandatory."
406
+ )
407
+ emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
408
+ cos = emb.cos()
409
+ sin = emb.sin()
410
+ else:
411
+ cos, sin = position_embeddings
412
+
413
+ q, k = apply_rotary_pos_emb_flashatt(q.unsqueeze(0), k.unsqueeze(0), cos, sin)
414
+ q = q.squeeze(0)
415
+ k = k.squeeze(0)
416
+
417
+ max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()
418
+ if is_aiter_available:
419
+ attn_output = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens,
420
+ max_seqlen, max_seqlen, return_lse=True)[0].reshape(
421
+ seq_length, -1)
422
+ else:
423
+ attn_output = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens,
424
+ max_seqlen, max_seqlen).reshape(
425
+ seq_length, -1)
426
+ attn_output = self.out_proj(attn_output)
427
+ return attn_output, None
428
+
429
+
430
+
431
+ def rotate_half(x):
432
+ """Rotates half the hidden dims of the input."""
433
+ x1 = x[..., : x.shape[-1] // 2]
434
+ x2 = x[..., x.shape[-1] // 2 :]
435
+ return torch.cat((-x2, x1), dim=-1)
436
+
437
+
438
+ def apply_rotary_pos_emb_vision(
439
+ q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor
440
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
441
+ orig_q_dtype = q.dtype
442
+ orig_k_dtype = k.dtype
443
+ q, k = q.float(), k.float()
444
+ cos, sin = cos.unsqueeze(-2).float(), sin.unsqueeze(-2).float()
445
+ q_embed = (q * cos) + (rotate_half(q) * sin)
446
+ k_embed = (k * cos) + (rotate_half(k) * sin)
447
+ q_embed = q_embed.to(orig_q_dtype)
448
+ k_embed = k_embed.to(orig_k_dtype)
449
+ return q_embed, k_embed
450
+
451
+
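Review note: `rotate_half` and `apply_rotary_pos_emb_vision` above pair element `i` of the first half of a head vector with element `i` of the second half and rotate each pair by the same angle. The plain-Python sketch below mirrors that layout on a single vector. The function `apply_rope` is a hypothetical stand-in for the tensor version, and the angle list is duplicated the same way the diff does with `torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)`.

```python
import math


def rotate_half(x):
    # (x1, x2) -> (-x2, x1), where x1 and x2 are the two halves of x.
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, angles):
    # angles holds len(x) // 2 values; both halves share the same angles.
    full = angles + angles
    rotated = rotate_half(x)
    return [
        xi * math.cos(a) + ri * math.sin(a)
        for xi, ri, a in zip(x, rotated, full)
    ]


if __name__ == "__main__":
    vec = [1.0, 2.0, 3.0, 4.0]
    print(rotate_half(vec))              # [-3.0, -4.0, 1.0, 2.0]
    print(apply_rope(vec, [0.3, 1.1]))   # same length, same Euclidean norm
```

Because every pair undergoes a pure 2D rotation, the vector norm is unchanged, and zero angles leave the input as it was.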
452
+ class Vision_EagerAttention(nn.Module):
453
+ def __init__(self, config: Union[Siglip2VisionConfig, Siglip2TextConfig]) -> None:
454
+ super().__init__()
455
+ dim = config.hidden_size
456
+ self.num_heads = config.num_attention_heads
457
+ self.k_proj = nn.Linear(dim, dim)
458
+ self.v_proj = nn.Linear(dim, dim)
459
+ self.q_proj = nn.Linear(dim, dim)
460
+ self.out_proj = nn.Linear(dim, dim)
461
+ self.head_dim = dim // self.num_heads
462
+
463
+ def forward(
464
+ self,
465
+ hidden_states: torch.Tensor,
466
+ cu_seqlens: torch.Tensor,
467
+ rotary_pos_emb: Optional[torch.Tensor] = None,
468
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
469
+ ) -> Tuple[torch.Tensor, None]:
470
+ seq_length = hidden_states.shape[0]
471
+ q = self.q_proj(hidden_states).reshape(seq_length, self.num_heads, -1)
472
+ k = self.k_proj(hidden_states).reshape(seq_length, self.num_heads, -1)
473
+ v = self.v_proj(hidden_states).reshape(seq_length, self.num_heads, -1)
474
+
475
+ if position_embeddings is None:
476
+ logger.warning_once(
477
+ "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
478
+ "through `rotary_pos_emb` (2D tensor of RoPE theta values), to using externally computed "
479
+ "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.54 `rotary_pos_emb` will be "
480
+ "removed and `position_embeddings` will be mandatory."
481
+ )
482
+ emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
483
+ cos = emb.cos()
484
+ sin = emb.sin()
485
+ else:
486
+ cos, sin = position_embeddings
487
+ q, k = apply_rotary_pos_emb_vision(q, k, cos, sin)
488
+
489
+ attention_mask = torch.full(
490
+ [1, seq_length, seq_length], torch.finfo(q.dtype).min, device=q.device, dtype=q.dtype
491
+ )
492
+ for i in range(1, len(cu_seqlens)):
493
+ attention_mask[..., cu_seqlens[i - 1] : cu_seqlens[i], cu_seqlens[i - 1] : cu_seqlens[i]] = 0
494
+
495
+ q = q.transpose(0, 1)
496
+ k = k.transpose(0, 1)
497
+ v = v.transpose(0, 1)
498
+ attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
499
+ attn_weights = attn_weights + attention_mask
500
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
501
+ attn_output = torch.matmul(attn_weights, v)
502
+ attn_output = attn_output.transpose(0, 1)
503
+ attn_output = attn_output.reshape(seq_length, -1)
504
+ attn_output = self.out_proj(attn_output)
505
+ return attn_output, None
506
+
507
+
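Review note: `Vision_EagerAttention` packs all images (or windows) into one sequence and uses `cu_seqlens` to keep them from attending to each other. The sketch below builds the same block-diagonal additive mask in plain Python. The name `block_diagonal_mask` is hypothetical, and `-inf` is used as the blocked value here, whereas the diff fills with `torch.finfo(q.dtype).min`.

```python
def block_diagonal_mask(cu_seqlens, blocked=float("-inf")):
    """cu_seqlens is a cumulative list of segment boundaries, e.g. [0, 2, 5]."""
    total = cu_seqlens[-1]
    mask = [[blocked] * total for _ in range(total)]
    # Open up one square block per segment [start, end).
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = 0.0
    return mask


if __name__ == "__main__":
    for row in block_diagonal_mask([0, 2, 5]):
        print(row)
```

Two segments of lengths 2 and 3 give a 5 x 5 mask with a 2 x 2 and a 3 x 3 block of zeros on the diagonal.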
508
+ class Vision_SDPAAttention(nn.Module):
509
+ def __init__(self, config) -> None:
510
+ super().__init__()
511
+ dim, heads = config.hidden_size, config.num_attention_heads
512
+ self.num_heads, self.head_dim = heads, dim // heads
513
+ self.k_proj, self.v_proj, self.q_proj, self.out_proj = [nn.Linear(dim, dim) for _ in range(4)]
514
+ self.dropout = getattr(config, "attention_dropout", 0.0)
515
+
516
+ def forward(self, hidden_states, cu_seqlens, rotary_pos_emb=None, position_embeddings=None):
517
+ seq_length = hidden_states.shape[0]
518
+ q, k, v = self.q_proj(hidden_states).view(seq_length, self.num_heads, self.head_dim), self.k_proj(hidden_states).view(seq_length, self.num_heads, self.head_dim), self.v_proj(hidden_states).view(seq_length, self.num_heads, self.head_dim)
519
+ if position_embeddings is None:
520
+ emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
521
+ cos = emb.cos()
522
+ sin = emb.sin()
523
+ else:
524
+ cos, sin = position_embeddings
525
+ q, k = apply_rotary_pos_emb_vision(q, k, cos, sin)
526
+ attention_mask = torch.full([1, 1, seq_length, seq_length], torch.finfo(q.dtype).min, device=q.device, dtype=q.dtype)
527
+ for i in range(1, len(cu_seqlens)):
528
+ attention_mask[..., cu_seqlens[i-1]:cu_seqlens[i], cu_seqlens[i-1]:cu_seqlens[i]] = 0
529
+
530
+ q = q.transpose(0, 1).unsqueeze(0)
531
+ k = k.transpose(0, 1).unsqueeze(0)
532
+ v = v.transpose(0, 1).unsqueeze(0)
533
+ attn_output = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attention_mask)
534
+ return self.out_proj(attn_output.squeeze(0).transpose(0, 1).reshape(seq_length, -1).to(hidden_states.dtype)), None
535
+
536
+
537
+ VISION_ATTENTION_CLASSES = {
538
+ 'sdpa': Vision_SDPAAttention,
539
+ 'eager': Vision_EagerAttention,
540
+ 'flash_attention_2': Vision_FlashAttention2,
541
+ }
542
+
543
+
544
+ class Siglip2MLP(nn.Module):
545
+ def __init__(self, config):
546
+ super().__init__()
547
+ self.config = config
548
+ self.activation_fn = ACT2FN[config.hidden_act]
549
+ self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
550
+ self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
551
+
552
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
553
+ hidden_states = self.fc1(hidden_states)
554
+ hidden_states = self.activation_fn(hidden_states)
555
+ hidden_states = self.fc2(hidden_states)
556
+ return hidden_states
557
+
558
+
559
+ class Siglip2EncoderLayer(GradientCheckpointingLayer):
560
+ def __init__(self, config: Union[Siglip2VisionConfig, Siglip2TextConfig]):
561
+ super().__init__()
562
+ self.embed_dim = config.hidden_size
563
+ self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
564
+ self.self_attn = VISION_ATTENTION_CLASSES[config._attn_implementation](config=config)
565
+ self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
566
+ self.mlp = Siglip2MLP(config)
567
+
568
+ def forward(
569
+ self,
570
+ hidden_states: torch.Tensor,
571
+ attention_mask: torch.Tensor,
572
+ cu_seqlens: torch.Tensor,
573
+ rotary_pos_emb: Optional[torch.Tensor] = None,
574
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
575
+ output_attentions: Optional[bool] = False,
576
+ ) -> Tuple[torch.FloatTensor]:
577
+ """
578
+ Args:
579
+ hidden_states (`torch.FloatTensor`):
580
+ Input to the layer of shape `(seq_len, embed_dim)`, with all images packed along the first dimension.
581
+ attention_mask (`torch.FloatTensor`):
582
+ Attention mask of shape `(batch, 1, q_len, k_v_seq_len)` where padding elements
583
+ are indicated by very large negative values.
584
+ output_attentions (`bool`, *optional*, defaults to `False`):
585
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
586
+ returned tensors for more detail.
587
+ """
588
+ residual = hidden_states
589
+
590
+ hidden_states = self.layer_norm1(hidden_states)
591
+ hidden_states, attn_weights = self.self_attn(
592
+ hidden_states=hidden_states,
593
+ cu_seqlens=cu_seqlens,
594
+ rotary_pos_emb=rotary_pos_emb,
595
+ position_embeddings=position_embeddings,
596
+ )
597
+ hidden_states = residual + hidden_states
598
+
599
+ residual = hidden_states
600
+ hidden_states = self.layer_norm2(hidden_states)
601
+ hidden_states = self.mlp(hidden_states)
602
+ hidden_states = residual + hidden_states
603
+
604
+ outputs = (hidden_states,)
605
+
606
+ if output_attentions:
607
+ outputs += (attn_weights,)
608
+
609
+ return outputs
610
+
611
+ class VisionRope(nn.Module):
612
+ def __init__(self, dim: int, theta: float = 10000.0) -> None:
613
+ super().__init__()
614
+ inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
615
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
616
+
617
+ def forward(self, seqlen: int) -> torch.Tensor:
618
+ seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
619
+ freqs = torch.outer(seq, self.inv_freq)
620
+ return freqs
621
+
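Review note: `VisionRope` above returns a table of rotation angles, one row per position and one column per frequency. The plain-Python sketch below reproduces that table, assuming the same default `theta` of 10000. The function name `rope_frequencies` is hypothetical.

```python
def rope_frequencies(dim, seqlen, theta=10000.0):
    # One inverse frequency per pair of channels: theta ** (-i / dim), i = 0, 2, 4, ...
    inv_freq = [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]
    # Outer product of positions and inverse frequencies.
    return [[pos * f for f in inv_freq] for pos in range(seqlen)]


if __name__ == "__main__":
    for row in rope_frequencies(dim=4, seqlen=3):
        print(row)
```

In the encoder, this table is indexed by the height and width position ids of each patch, and the two lookups are flattened together into one angle vector per patch.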
622
+ class Siglip2Encoder(nn.Module):
623
+ """
624
+ Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
625
+ [`Siglip2EncoderLayer`].
626
+
627
+ Args:
628
+ config: Siglip2Config
629
+ """
630
+
631
+ def __init__(self, config: Siglip2Config):
632
+ super().__init__()
633
+ self.config = config
634
+ self.layers = nn.ModuleList([Siglip2EncoderLayer(config) for _ in range(config.num_hidden_layers)])
635
+ self.gradient_checkpointing = False
636
+ self.spatial_merge_size = 2
637
+ self.spatial_merge_unit = self.spatial_merge_size * self.spatial_merge_size
638
+ self.patch_size = config.patch_size
639
+ self.window_size = self.patch_size * 2 * 8
640
+
641
+ assert config.hidden_size % (config.num_attention_heads * 2) == 0
642
+ self.rotary_pos_emb = VisionRope(config.hidden_size//config.num_attention_heads//2)
643
+
644
+ def rot_pos_emb(self, spatial_shapes):
645
+ pos_ids = []
646
+
647
+ for h, w in spatial_shapes:
648
+ t = 1
649
+ hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w)
650
+
651
+ hpos_ids = hpos_ids.reshape(
652
+ h // self.spatial_merge_size,
653
+ self.spatial_merge_size,
654
+ w // self.spatial_merge_size,
655
+ self.spatial_merge_size,
656
+ )
657
+ hpos_ids = hpos_ids.permute(0, 2, 1, 3)
658
+ hpos_ids = hpos_ids.flatten()
659
+
660
+ wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1)
661
+ wpos_ids = wpos_ids.reshape(
662
+ h // self.spatial_merge_size,
663
+ self.spatial_merge_size,
664
+ w // self.spatial_merge_size,
665
+ self.spatial_merge_size,
666
+ )
667
+ wpos_ids = wpos_ids.permute(0, 2, 1, 3)
668
+ wpos_ids = wpos_ids.flatten()
669
+
670
+ pos_ids.append(torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1))
671
+ pos_ids = torch.cat(pos_ids, dim=0)
672
+ max_grid_size = spatial_shapes.max()
673
+ rotary_pos_emb_full = self.rotary_pos_emb(max_grid_size)
674
+ rotary_pos_emb = rotary_pos_emb_full[pos_ids].flatten(1)
675
+ return rotary_pos_emb
676
+
677
+ def get_window_index(self, spatial_shapes):
678
+ window_index: list = []
679
+ cu_window_seqlens: list = [0]
680
+ window_index_id = 0
681
+ vit_merger_window_size = self.window_size // self.spatial_merge_size // self.patch_size
682
+
683
+ for grid_h, grid_w in spatial_shapes:
684
+ grid_t = 1
685
+ llm_grid_h, llm_grid_w = (
686
+ grid_h // self.spatial_merge_size,
687
+ grid_w // self.spatial_merge_size,
688
+ )
689
+ index = torch.arange(grid_t * llm_grid_h * llm_grid_w).reshape(grid_t, llm_grid_h, llm_grid_w)
690
+ pad_h = (vit_merger_window_size - llm_grid_h % vit_merger_window_size) % vit_merger_window_size
691
+ pad_w = (vit_merger_window_size - llm_grid_w % vit_merger_window_size) % vit_merger_window_size
692
+ num_windows_h = (llm_grid_h + pad_h) // vit_merger_window_size
693
+ num_windows_w = (llm_grid_w + pad_w) // vit_merger_window_size
694
+ index_padded = F.pad(index, (0, pad_w, 0, pad_h), "constant", -100)
695
+ index_padded = index_padded.reshape(
696
+ grid_t,
697
+ num_windows_h,
698
+ vit_merger_window_size,
699
+ num_windows_w,
700
+ vit_merger_window_size,
701
+ )
702
+
703
+ index_padded = index_padded.permute(0, 1, 3, 2, 4).reshape(
704
+ grid_t,
705
+ num_windows_h * num_windows_w,
706
+ vit_merger_window_size,
707
+ vit_merger_window_size,
708
+ )
709
+ seqlens = (index_padded != -100).sum([2, 3]).reshape(-1)
710
+ index_padded = index_padded.reshape(-1)
711
+ index_new = index_padded[index_padded != -100]
712
+ window_index.append(index_new + window_index_id)
713
+ cu_seqlens_tmp = seqlens.cumsum(0) * self.spatial_merge_unit + cu_window_seqlens[-1]
714
+ cu_window_seqlens.extend(cu_seqlens_tmp.tolist())
715
+ window_index_id += (grid_t * llm_grid_h * llm_grid_w).item()
716
+
717
+ window_index = torch.cat(window_index, dim=0)
718
+
719
+ return window_index, cu_window_seqlens
720
+
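Review note: `get_window_index` above reorders the merged-token grid so that tokens of the same local window become contiguous, and records the cumulative window lengths. The plain-Python sketch below computes the same ordering for one image without the pad-and-filter step. `window_order` is a hypothetical name, and the returned lengths count merged tokens, whereas the diff multiplies them by `spatial_merge_unit`.

```python
def window_order(grid_h, grid_w, window):
    """Return (order, cu_seqlens) for a row-major grid_h x grid_w token grid."""
    order = []
    cu_seqlens = [0]
    # Visit windows top to bottom, left to right; edge windows may be smaller.
    for win_r in range(0, grid_h, window):
        for win_c in range(0, grid_w, window):
            cells = [
                r * grid_w + c
                for r in range(win_r, min(win_r + window, grid_h))
                for c in range(win_c, min(win_c + window, grid_w))
            ]
            order.extend(cells)
            cu_seqlens.append(cu_seqlens[-1] + len(cells))
    return order, cu_seqlens


if __name__ == "__main__":
    order, cu = window_order(grid_h=3, grid_w=3, window=2)
    print(order)  # [0, 1, 3, 4, 2, 5, 6, 7, 8]
    print(cu)     # [0, 4, 6, 8, 9]
```

With `window_size = patch_size * 2 * 8` and a merge size of 2, the window used in the diff works out to 8 x 8 merged tokens. The 2 x 2 window here is only to keep the example readable.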
721
+ @can_return_tuple
722
+ def forward(
723
+ self,
724
+ inputs_embeds,
725
+ spatial_shapes: torch.LongTensor,
726
+ attention_mask: Optional[torch.Tensor] = None,
727
+ output_attentions: Optional[bool] = None,
728
+ output_hidden_states: Optional[bool] = None,
729
+ ) -> BaseModelOutput:
730
+ r"""
731
+ Args:
732
+ inputs_embeds (`torch.FloatTensor` of shape `(sequence_length, hidden_size)`):
733
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
734
+ This is useful if you want more control over how to convert `input_ids` indices into associated vectors
735
+ than the model's internal embedding lookup matrix.
736
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
737
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
738
+
739
+ - 1 for tokens that are **not masked**,
740
+ - 0 for tokens that are **masked**.
741
+
742
+ [What are attention masks?](../glossary#attention-mask)
743
+ output_attentions (`bool`, *optional*):
744
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
745
+ returned tensors for more detail.
746
+ output_hidden_states (`bool`, *optional*):
747
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
748
+ for more detail.
749
+ return_dict (`bool`, *optional*):
750
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
751
+ """
752
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
753
+ output_hidden_states = (
754
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
755
+ )
756
+
757
+ encoder_states = () if output_hidden_states else None
758
+ all_attentions = () if output_attentions else None
759
+
760
+ hidden_states = inputs_embeds
761
+ rotary_pos_emb = self.rot_pos_emb(spatial_shapes)
762
+ window_index, cu_window_seqlens = self.get_window_index(spatial_shapes)
763
+ cu_window_seqlens = torch.tensor(
764
+ cu_window_seqlens,
765
+ device=hidden_states.device,
766
+ dtype=spatial_shapes.dtype if torch.jit.is_tracing() else torch.int32,
767
+ )
768
+ cu_window_seqlens = torch.unique_consecutive(cu_window_seqlens)
769
+
770
+ seq_len, _ = hidden_states.size()
771
+ hidden_states = hidden_states.reshape(seq_len // self.spatial_merge_unit, self.spatial_merge_unit, -1)
772
+ hidden_states = hidden_states[window_index, :, :]
773
+ hidden_states = hidden_states.reshape(seq_len, -1)
774
+ rotary_pos_emb = rotary_pos_emb.reshape(seq_len // self.spatial_merge_unit, self.spatial_merge_unit, -1)
775
+ rotary_pos_emb = rotary_pos_emb[window_index, :, :]
776
+ rotary_pos_emb = rotary_pos_emb.reshape(seq_len, -1)
777
+ emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
778
+ position_embeddings = (emb.cos(), emb.sin())
779
+
780
+ cu_seqlens = torch.repeat_interleave(spatial_shapes[:, 0] * spatial_shapes[:, 1], 1).cumsum(
781
+ dim=0,
782
+ dtype=spatial_shapes.dtype if torch.jit.is_tracing() else torch.int32,
783
+ )
784
+ cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)
785
+
786
+ for layer_num, encoder_layer in enumerate(self.layers):
787
+ if output_hidden_states:
788
+ encoder_states = encoder_states + (hidden_states,)
789
+
790
+ if (1+layer_num) % 8 == 0 or layer_num == len(self.layers) - 1:
791
+ cu_seqlens_now = cu_seqlens
792
+ else:
793
+ cu_seqlens_now = cu_window_seqlens
794
+
795
+ layer_outputs = encoder_layer(
796
+ hidden_states,
797
+ attention_mask,
798
+ cu_seqlens=cu_seqlens_now,
799
+ position_embeddings=position_embeddings
800
+ )
801
+
802
+ hidden_states = layer_outputs[0]
803
+ if output_attentions:
804
+ all_attentions = all_attentions + (layer_outputs[1],)
805
+
806
+
807
+ hidden_states = hidden_states.reshape(seq_len // self.spatial_merge_unit, self.spatial_merge_unit, -1)
808
+ reverse_indices = torch.argsort(window_index)
809
+ hidden_states = hidden_states[reverse_indices, :, :]
810
+ hidden_states = hidden_states.reshape(seq_len, -1)
811
+
812
+ if output_hidden_states:
813
+ encoder_states = encoder_states + (hidden_states,)
814
+
815
+ return BaseModelOutput(
816
+ last_hidden_state=hidden_states,
817
+ hidden_states=encoder_states,
818
+ attentions=all_attentions,
819
+ )
820
+
821
+
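Review note: `Siglip2Encoder.forward` above does two things that are easy to miss. It uses full per-image attention on every 8th layer and on the last layer, with windowed attention elsewhere. It also undoes the window reordering at the end with `torch.argsort(window_index)`. The plain-Python sketch below shows both. The helper names and the layer count of 27 are hypothetical and chosen only for illustration.

```python
def full_attention_layers(num_layers, every=8):
    # Mirrors: (1 + layer_num) % 8 == 0 or layer_num == len(self.layers) - 1
    return [
        i for i in range(num_layers)
        if (i + 1) % every == 0 or i == num_layers - 1
    ]


def argsort(perm):
    # Indices that would sort perm; for a permutation this is its inverse.
    return sorted(range(len(perm)), key=perm.__getitem__)


if __name__ == "__main__":
    print(full_attention_layers(27))  # [7, 15, 23, 26]

    tokens = ["a", "b", "c", "d", "e", "f"]
    window_index = [0, 1, 3, 4, 2, 5]
    shuffled = [tokens[i] for i in window_index]
    restored = [shuffled[i] for i in argsort(window_index)]
    print(restored == tokens)  # True
```

Indexing by the argsort of a permutation applies its inverse, which is why the final hidden states come back in the original patch order.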
822
+ SIGLIP2_VISION_INPUTS_DOCSTRING = r"""
823
+ Args:
824
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
825
+ Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
826
+ [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
827
+ output_attentions (`bool`, *optional*):
828
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
829
+ tensors for more detail.
830
+ output_hidden_states (`bool`, *optional*):
831
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
832
+ more detail.
833
+ interpolate_pos_encoding (`bool`, *optional*, defaults to `False`):
834
+ Whether to interpolate the pre-trained position encodings.
835
+ return_dict (`bool`, *optional*):
836
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
837
+ """
838
+
839
+
840
+ class Siglip2VisionTransformer(nn.Module):
841
+ def __init__(self, config: Siglip2VisionConfig):
842
+ super().__init__()
843
+ self.config = config
844
+ embed_dim = config.hidden_size
845
+
846
+ self.embeddings = Siglip2VisionEmbeddingsWoPos(config)
847
+ self.encoder = Siglip2Encoder(config)
848
+ self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
849
+ self.use_head = False
850
+ if self.use_head:
851
+ self.head = Siglip2MultiheadAttentionPoolingHead(config)
852
+ self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
853
+
854
+ @can_return_tuple
855
+ @add_start_docstrings_to_model_forward(SIGLIP2_VISION_INPUTS_DOCSTRING)
856
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=Siglip2VisionConfig)
857
+ def forward(
858
+ self,
859
+ pixel_values: torch.FloatTensor,
860
+ attention_mask: torch.Tensor,
861
+ spatial_shapes: torch.LongTensor,
862
+ output_attentions: Optional[bool] = None,
863
+ output_hidden_states: Optional[bool] = None,
864
+ ) -> BaseModelOutputWithPooling:
865
+ r"""
866
+ Returns:
867
+
868
+ """
869
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
870
+ output_hidden_states = (
871
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
872
+ )
873
+
874
+ bs, length, dim = pixel_values.shape
875
+ hidden_states = self.embeddings(pixel_values)
876
+
877
+ if attention_mask is not None and not self._use_flash_attention_2:
878
+ encoder_attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
879
+ else:
880
+ encoder_attention_mask = attention_mask
881
+
882
+ encoder_outputs: BaseModelOutput = self.encoder(
883
+ inputs_embeds=hidden_states,
884
+ spatial_shapes=spatial_shapes,
885
+ attention_mask=encoder_attention_mask,
886
+ output_attentions=output_attentions,
887
+ output_hidden_states=output_hidden_states,
888
+ )
889
+
890
+ last_hidden_state = encoder_outputs.last_hidden_state
891
+
892
+ last_hidden_state = self.post_layernorm(last_hidden_state)
893
+
894
+ return BaseModelOutputWithPooling(
895
+ last_hidden_state=last_hidden_state,
896
+ pooler_output=None,
897
+ hidden_states=encoder_outputs.hidden_states,
898
+ attentions=encoder_outputs.attentions,
899
+ )
900
+
901
+
902
+ class Siglip2TextEmbeddings(nn.Module):
903
+ def __init__(self, config: Siglip2TextConfig):
904
+ super().__init__()
905
+ embed_dim = config.hidden_size
906
+
907
+ self.token_embedding = nn.Embedding(config.vocab_size, embed_dim)
908
+ self.position_embedding = nn.Embedding(config.max_position_embeddings, embed_dim)
909
+
910
+ self.register_buffer(
911
+ "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
912
+ )
913
+
914
+ def forward(
915
+ self,
916
+ input_ids: Optional[torch.LongTensor] = None,
917
+ position_ids: Optional[torch.LongTensor] = None,
918
+ inputs_embeds: Optional[torch.FloatTensor] = None,
919
+ ) -> torch.Tensor:
920
+ seq_length = input_ids.shape[-1] if input_ids is not None else inputs_embeds.shape[-2]
921
+ max_position_embedding = self.position_embedding.weight.shape[0]
922
+ if seq_length > max_position_embedding:
923
+ raise ValueError(
924
+ f"Sequence length must not exceed max_position_embeddings (got `sequence length`: "
925
+ f"{seq_length} and max_position_embeddings: {max_position_embedding})."
926
+ )
927
+
928
+ if position_ids is None:
929
+ position_ids = self.position_ids[:, :seq_length]
930
+
931
+ if inputs_embeds is None:
932
+ inputs_embeds = self.token_embedding(input_ids)
933
+
934
+ position_embeddings = self.position_embedding(position_ids)
935
+ embeddings = inputs_embeds + position_embeddings
936
+
937
+ return embeddings
938
+
939
+
940
+ def _trunc_normal_(tensor, mean, std, a, b):
941
+ def norm_cdf(x):
942
+ return (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0
943
+
944
+ if (mean < a - 2 * std) or (mean > b + 2 * std):
945
+ warnings.warn(
946
+ "mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
947
+ "The distribution of values may be incorrect.",
948
+ stacklevel=2,
949
+ )
950
+
951
+ l = norm_cdf((a - mean) / std)
952
+ u = norm_cdf((b - mean) / std)
953
+
954
+ tensor.uniform_(2 * l - 1, 2 * u - 1)
955
+
956
+ tensor.erfinv_()
957
+
958
+ tensor.mul_(std * math.sqrt(2.0))
959
+ tensor.add_(mean)
960
+
961
+ tensor.clamp_(min=a, max=b)
962
+
963
+
964
+ def trunc_normal_tf_(
965
+ tensor: torch.Tensor, mean: float = 0.0, std: float = 1.0, a: float = -2.0, b: float = 2.0
966
+ ) -> None:
967
+ """
968
+ Args:
969
+ tensor: an n-dimensional `torch.Tensor`
970
+ mean: the mean of the normal distribution
971
+ std: the standard deviation of the normal distribution
972
+ a: the minimum cutoff value
973
+ b: the maximum cutoff value
974
+ """
975
+ with torch.no_grad():
976
+ _trunc_normal_(tensor, 0, 1.0, a, b)
977
+ tensor.mul_(std).add_(mean)
978
+
979
+
980
+ def variance_scaling_(tensor, scale=1.0, mode="fan_in", distribution="normal"):
981
+ fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
982
+ if mode == "fan_in":
983
+ denom = fan_in
984
+ elif mode == "fan_out":
985
+ denom = fan_out
986
+ elif mode == "fan_avg":
987
+ denom = (fan_in + fan_out) / 2
988
+
989
+ variance = scale / denom
990
+
991
+ if distribution == "truncated_normal":
992
+ trunc_normal_tf_(tensor, std=math.sqrt(variance) / 0.87962566103423978)
993
+ elif distribution == "normal":
994
+ with torch.no_grad():
995
+ tensor.normal_(std=math.sqrt(variance))
996
+ elif distribution == "uniform":
997
+ bound = math.sqrt(3 * variance)
998
+ with torch.no_grad():
999
+ tensor.uniform_(-bound, bound)
1000
+ else:
1001
+ raise ValueError(f"invalid distribution {distribution}")
1002
+
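Review note: in `variance_scaling_` above, the `truncated_normal` branch divides the target standard deviation by `0.87962566103423978`. That constant matches the standard deviation of a standard normal truncated to `[-2, 2]`, so the division compensates for the variance lost to truncation. The sketch below checks this with the closed-form variance of a truncated normal. `truncated_std` is a hypothetical helper, not part of the diff.

```python
import math


def truncated_std(a=-2.0, b=2.0):
    """Standard deviation of N(0, 1) restricted to the interval [a, b]."""

    def pdf(x):
        return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    z = cdf(b) - cdf(a)                      # probability mass kept
    mean = (pdf(a) - pdf(b)) / z             # zero for symmetric bounds
    var = 1.0 + (a * pdf(a) - b * pdf(b)) / z - mean * mean
    return math.sqrt(var)


if __name__ == "__main__":
    print(truncated_std())
```

Widening the bounds pushes the value toward 1, which is the untruncated case.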
1003
+
1004
+ def lecun_normal_(tensor):
1005
+ variance_scaling_(tensor, mode="fan_in", distribution="truncated_normal")
1006
+
1007
+
1008
+ def default_flax_embed_init(tensor):
1009
+ variance_scaling_(tensor, mode="fan_in", distribution="normal")
1010
+
1011
+
1012
+ SIGLIP2_TEXT_INPUTS_DOCSTRING = r"""
1013
+ Args:
1014
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
1015
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
1016
+ it.
1017
+
1018
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1019
+ [`PreTrainedTokenizer.__call__`] for details.
1020
+
1021
+ [What are input IDs?](../glossary#input-ids)
1022
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
1023
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1024
+
1025
+ - 1 for tokens that are **not masked**,
1026
+ - 0 for tokens that are **masked**.
1027
+
1028
+ [What are attention masks?](../glossary#attention-mask)
1029
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1030
+ Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
1031
+ config.max_position_embeddings - 1]`.
1032
+
1033
+ [What are position IDs?](../glossary#position-ids)
1034
+ output_attentions (`bool`, *optional*):
1035
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
1036
+ tensors for more detail.
1037
+ output_hidden_states (`bool`, *optional*):
1038
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
1039
+ more detail.
1040
+ return_dict (`bool`, *optional*):
1041
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
1042
+ """
1043
+
1044
+
1045
+ class Siglip2TextTransformer(nn.Module):
1046
+ def __init__(self, config: Siglip2TextConfig):
1047
+ super().__init__()
1048
+ self.config = config
1049
+ embed_dim = config.hidden_size
1050
+ self.embeddings = Siglip2TextEmbeddings(config)
1051
+ self.encoder = Siglip2Encoder(config)
1052
+ self.final_layer_norm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
1053
+
1054
+ self.head = nn.Linear(embed_dim, config.projection_size)
1055
+ self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
1056
+
1057
+ @can_return_tuple
1058
+ @add_start_docstrings_to_model_forward(SIGLIP2_TEXT_INPUTS_DOCSTRING)
1059
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=Siglip2TextConfig)
1060
+ def forward(
1061
+ self,
1062
+ input_ids: Optional[torch.Tensor] = None,
1063
+ attention_mask: Optional[torch.Tensor] = None,
1064
+ position_ids: Optional[torch.Tensor] = None,
1065
+ output_attentions: Optional[bool] = None,
1066
+ output_hidden_states: Optional[bool] = None,
1067
+ ) -> BaseModelOutputWithPooling:
1068
+ r"""
1069
+ Returns:
1070
+
1071
+ """
1072
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1073
+ output_hidden_states = (
1074
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1075
+ )
1076
+
1077
+ if input_ids is None:
1078
+ raise ValueError("You have to specify input_ids")
1079
+
1080
+ input_shape = input_ids.size()
1081
+ input_ids = input_ids.view(-1, input_shape[-1])
1082
+
1083
+ hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
1084
+
1085
+ if attention_mask is not None and not self._use_flash_attention_2:
1086
+ attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
1087
+
1088
+ encoder_outputs: BaseModelOutput = self.encoder(
1089
+ inputs_embeds=hidden_states,
1090
+ attention_mask=attention_mask,
1091
+ output_attentions=output_attentions,
1092
+ output_hidden_states=output_hidden_states,
1093
+ )
1094
+
1095
+ last_hidden_state = encoder_outputs.last_hidden_state
1096
+
1097
+ last_hidden_state = self.final_layer_norm(last_hidden_state)
1098
+
1099
+ pooled_output = last_hidden_state[:, -1, :]
1100
+ pooled_output = self.head(pooled_output)
1101
+
1102
+ return BaseModelOutputWithPooling(
1103
+ last_hidden_state=last_hidden_state,
1104
+ pooler_output=pooled_output,
1105
+ hidden_states=encoder_outputs.hidden_states,
1106
+ attentions=encoder_outputs.attentions,
1107
+ )
1108
+
1109
+
1110
+ SIGLIP2_START_DOCSTRING = r"""
1111
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
1112
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
1113
+ etc.)
1114
+
1115
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
1116
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
1117
+ and behavior.
1118
+
1119
+ Parameters:
1120
+ config ([`Siglip2Config`]): Model configuration class with all the parameters of the model.
1121
+ Initializing with a config file does not load the weights associated with the model, only the
1122
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
1123
+ """
1124
+
1125
+ SIGLIP2_INPUTS_DOCSTRING = r"""
1126
+ Args:
1127
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
1128
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
1129
+ it.
1130
+
1131
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1132
+ [`PreTrainedTokenizer.__call__`] for details.
1133
+
1134
+ [What are input IDs?](../glossary#input-ids)
1135
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
1136
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1137
+
1138
+ - 1 for tokens that are **not masked**,
1139
+ - 0 for tokens that are **masked**.
1140
+
1141
+ [What are attention masks?](../glossary#attention-mask)
1142
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1143
+ Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
1144
+ config.max_position_embeddings - 1]`.
1145
+
1146
+ [What are position IDs?](../glossary#position-ids)
1147
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
1148
+ Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
1149
+ [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
1150
+ return_loss (`bool`, *optional*):
1151
+ Whether or not to return the contrastive loss.
1152
+ output_attentions (`bool`, *optional*):
1153
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
1154
+ tensors for more detail.
1155
+ output_hidden_states (`bool`, *optional*):
1156
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
1157
+ more detail.
1158
+ interpolate_pos_encoding (`bool`, *optional*, defaults to `False`):
1159
+ Whether to interpolate the pre-trained position encodings.
1160
+ return_dict (`bool`, *optional*):
1161
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
1162
+ """
1163
+
1164
+
1165
+ class Siglip2PreTrainedModel(PreTrainedModel):
1166
+
1167
+ config_class = Siglip2Config
1168
+ base_model_prefix = "siglip2"
1169
+ supports_gradient_checkpointing = True
1170
+
1171
+ _no_split_modules = [
1172
+ "Siglip2TextEmbeddings",
1173
+ "Siglip2EncoderLayer",
1174
+ "Siglip2VisionEmbeddings",
1175
+ "Siglip2EncoderLayer",
1176
+ "Siglip2MultiheadAttentionPoolingHead",
1177
+ ]
1178
+ _supports_flash_attn_2 = True
1179
+ _supports_sdpa = True
1180
+
1181
+ def _init_weights(self, module):
1182
+ """Initialize the weights"""
1183
+ if isinstance(module, Siglip2VisionEmbeddings):
1184
+ width = (
1185
+ self.config.vision_config.hidden_size
1186
+ if isinstance(self.config, Siglip2Config)
1187
+ else self.config.hidden_size
1188
+ )
1189
+ nn.init.normal_(module.position_embedding.weight, std=1 / np.sqrt(width))
1190
+ elif isinstance(module, nn.Embedding):
1191
+ default_flax_embed_init(module.weight)
1192
+ elif isinstance(module, Siglip2Attention):
1193
+ nn.init.xavier_uniform_(module.q_proj.weight)
1194
+ nn.init.xavier_uniform_(module.k_proj.weight)
1195
+ nn.init.xavier_uniform_(module.v_proj.weight)
1196
+ nn.init.xavier_uniform_(module.out_proj.weight)
1197
+ nn.init.zeros_(module.q_proj.bias)
1198
+ nn.init.zeros_(module.k_proj.bias)
1199
+ nn.init.zeros_(module.v_proj.bias)
1200
+ nn.init.zeros_(module.out_proj.bias)
1201
+ elif isinstance(module, Siglip2MLP):
1202
+ nn.init.xavier_uniform_(module.fc1.weight)
1203
+ nn.init.xavier_uniform_(module.fc2.weight)
1204
+ nn.init.normal_(module.fc1.bias, std=1e-6)
1205
+ nn.init.normal_(module.fc2.bias, std=1e-6)
1206
+ elif isinstance(module, Siglip2MultiheadAttentionPoolingHead):
1207
+ nn.init.xavier_uniform_(module.probe.data)
1208
+ nn.init.xavier_uniform_(module.attention.in_proj_weight.data)
1209
+ nn.init.zeros_(module.attention.in_proj_bias.data)
1210
+ elif isinstance(module, Siglip2Model):
1211
+ logit_scale_init = torch.log(torch.tensor(1.0))
1212
+ module.logit_scale.data.fill_(logit_scale_init)
1213
+ module.logit_bias.data.zero_()
1214
+ elif isinstance(module, Siglip2ForImageClassification):
1215
+ nn.init.normal_(
1216
+ module.classifier.weight,
1217
+ std=self.config.vision_config.hidden_size**-0.5 * self.config.initializer_factor,
1218
+ )
1219
+ elif isinstance(module, (nn.Linear, nn.Conv2d)):
1220
+ lecun_normal_(module.weight)
1221
+ if module.bias is not None:
1222
+ nn.init.zeros_(module.bias)
1223
+ elif isinstance(module, nn.LayerNorm):
1224
+ module.bias.data.zero_()
1225
+ module.weight.data.fill_(1.0)
1226
+
1227
+
1228
+ @add_start_docstrings(
1229
+ """The text model from Siglip2 without any head or projection on top.""",
1230
+ SIGLIP2_START_DOCSTRING,
1231
+ )
1232
+ class Siglip2TextModel(Siglip2PreTrainedModel):
1233
+ config_class = Siglip2TextConfig
1234
+
1235
+ def __init__(self, config: Siglip2TextConfig):
1236
+ super().__init__(config)
1237
+ self.text_model = Siglip2TextTransformer(config)
1238
+ # Initialize weights and apply final processing
1239
+ self.post_init()
1240
+
1241
+ def get_input_embeddings(self) -> nn.Module:
1242
+ return self.text_model.embeddings.token_embedding
1243
+
1244
+ def set_input_embeddings(self, value):
1245
+ self.text_model.embeddings.token_embedding = value
1246
+
1247
+ @can_return_tuple
1248
+ @add_start_docstrings_to_model_forward(SIGLIP2_TEXT_INPUTS_DOCSTRING)
1249
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=Siglip2TextConfig)
1250
+ def forward(
1251
+ self,
1252
+ input_ids: Optional[torch.Tensor] = None,
1253
+ attention_mask: Optional[torch.Tensor] = None,
1254
+ position_ids: Optional[torch.Tensor] = None,
1255
+ output_attentions: Optional[bool] = None,
1256
+ output_hidden_states: Optional[bool] = None,
1257
+ ) -> BaseModelOutputWithPooling:
1258
+ r"""
1259
+ Returns:
1260
+
1261
+ """
1262
+
1263
+ return self.text_model(
1264
+ input_ids=input_ids,
1265
+ attention_mask=attention_mask,
1266
+ position_ids=position_ids,
1267
+ output_attentions=output_attentions,
1268
+ output_hidden_states=output_hidden_states,
1269
+ )
1270
+
1271
+
1272
+ class Siglip2MultiheadAttentionPoolingHead(nn.Module):
1273
+ """Multihead Attention Pooling."""
1274
+
1275
+ def __init__(self, config: Siglip2VisionConfig):
1276
+ super().__init__()
1277
+
1278
+ self.probe = nn.Parameter(torch.randn(1, 1, config.hidden_size))
1279
+ self.attention = torch.nn.MultiheadAttention(config.hidden_size, config.num_attention_heads, batch_first=True)
1280
+ self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
1281
+ self.mlp = Siglip2MLP(config)
1282
+ self.num_heads = config.num_attention_heads
1283
+
1284
+ def forward(self, hidden_state: torch.Tensor, attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
1285
+ batch_size = hidden_state.shape[0]
1286
+ probe = self.probe.repeat(batch_size, 1, 1)
1287
+
1288
+ if attention_mask is not None:
1289
+ target_len, source_len = probe.shape[1], hidden_state.shape[1]
1290
+ attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_state.dtype, target_len)
1291
+ attention_mask = attention_mask.repeat(1, self.num_heads, target_len, 1)
1292
+ attention_mask = attention_mask.reshape(-1, target_len, source_len)
1293
+
1294
+ hidden_state = self.attention(probe, hidden_state, hidden_state, attn_mask=attention_mask)[0]
1295
+
1296
+ residual = hidden_state
1297
+ hidden_state = self.layernorm(hidden_state)
1298
+ hidden_state = residual + self.mlp(hidden_state)
1299
+
1300
+ return hidden_state[:, 0]
1301
+
1302
+
1303
+ @add_start_docstrings(
1304
+ """The vision model from Siglip2 without any head or projection on top.""",
1305
+ SIGLIP2_START_DOCSTRING,
1306
+ )
1307
+ class Siglip2VisionModel(Siglip2PreTrainedModel):
1308
+ config_class = Siglip2VisionConfig
1309
+ main_input_name = "pixel_values"
1310
+
1311
+ def __init__(self, config: Siglip2VisionConfig):
1312
+ super().__init__(config)
1313
+
1314
+ self.vision_model = Siglip2VisionTransformer(config)
1315
+
1316
+ # Initialize weights and apply final processing
1317
+ self.post_init()
1318
+
1319
+ def get_input_embeddings(self) -> nn.Module:
1320
+ return self.vision_model.embeddings.patch_embedding
1321
+
1322
+ @can_return_tuple
1323
+ @add_start_docstrings_to_model_forward(SIGLIP2_VISION_INPUTS_DOCSTRING)
1324
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=Siglip2VisionConfig)
1325
+ def forward(
1326
+ self,
1327
+ pixel_values: torch.FloatTensor,
1328
+ pixel_attention_mask: torch.Tensor,
1329
+ spatial_shapes: torch.LongTensor,
1330
+ output_attentions: Optional[bool] = None,
1331
+ output_hidden_states: Optional[bool] = None,
1332
+ ) -> BaseModelOutputWithPooling:
1333
+ r"""
1334
+ Returns:
1335
+
1336
+ """
1337
+ return self.vision_model(
1338
+ pixel_values=pixel_values,
1339
+ attention_mask=pixel_attention_mask,
1340
+ spatial_shapes=spatial_shapes,
1341
+ output_attentions=output_attentions,
1342
+ output_hidden_states=output_hidden_states,
1343
+ )
1344
+
1345
+
1346
+ @add_start_docstrings(SIGLIP2_START_DOCSTRING)
1347
+ class Siglip2Model(Siglip2PreTrainedModel):
1348
+ config_class = Siglip2Config
1349
+
1350
+ def __init__(self, config: Siglip2Config):
1351
+ super().__init__(config)
1352
+
1353
+ if not isinstance(config.text_config, Siglip2TextConfig):
1354
+ raise TypeError(
1355
+ "config.text_config is expected to be of type Siglip2TextConfig but is of type"
1356
+ f" {type(config.text_config)}."
1357
+ )
1358
+
1359
+ if not isinstance(config.vision_config, Siglip2VisionConfig):
1360
+ raise TypeError(
1361
+ "config.vision_config is expected to be of type Siglip2VisionConfig but is of type"
1362
+ f" {type(config.vision_config)}."
1363
+ )
1364
+
1365
+ text_config = config.text_config
1366
+ vision_config = config.vision_config
1367
+
1368
+ text_model = Siglip2TextModel._from_config(text_config)
1369
+ vision_model = Siglip2VisionModel._from_config(vision_config)
1370
+
1371
+ self.text_model = text_model.text_model
1372
+ self.vision_model = vision_model.vision_model
1373
+
1374
+ self.logit_scale = nn.Parameter(torch.randn(1))
1375
+ self.logit_bias = nn.Parameter(torch.randn(1))
1376
+
1377
+ self.post_init()
1378
+
1379
+ @add_start_docstrings_to_model_forward(SIGLIP2_TEXT_INPUTS_DOCSTRING)
1380
+ def get_text_features(
1381
+ self,
1382
+ input_ids: Optional[torch.Tensor] = None,
1383
+ attention_mask: Optional[torch.Tensor] = None,
1384
+ position_ids: Optional[torch.Tensor] = None,
1385
+ output_attentions: Optional[bool] = None,
1386
+ output_hidden_states: Optional[bool] = None,
1387
+ ) -> torch.FloatTensor:
1388
+ r"""
1389
+ Returns:
1390
+ text_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
1391
+ applying the projection layer to the pooled output of [`Siglip2TextModel`].
1392
+
1393
+ """
1394
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1395
+ output_hidden_states = (
1396
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1397
+ )
1398
+
1399
+ text_outputs: BaseModelOutputWithPooling = self.text_model(
1400
+ input_ids=input_ids,
1401
+ attention_mask=attention_mask,
1402
+ position_ids=position_ids,
1403
+ output_attentions=output_attentions,
1404
+ output_hidden_states=output_hidden_states,
1405
+ )
1406
+
1407
+ pooled_output = text_outputs.pooler_output
1408
+
1409
+ return pooled_output
1410
+
1411
+ @add_start_docstrings_to_model_forward(SIGLIP2_VISION_INPUTS_DOCSTRING)
1412
+ def get_image_features(
1413
+ self,
1414
+ pixel_values: Optional[torch.FloatTensor] = None,
1415
+ pixel_attention_mask: Optional[torch.Tensor] = None,
1416
+ spatial_shapes: Optional[torch.LongTensor] = None,
1417
+ output_attentions: Optional[bool] = None,
1418
+ output_hidden_states: Optional[bool] = None,
1419
+ ) -> torch.FloatTensor:
1420
+ r"""
1421
+ Returns:
1422
+ image_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
1423
+ applying the projection layer to the pooled output of [`Siglip2VisionModel`].
1424
+
1425
+ """
1426
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1427
+ output_hidden_states = (
1428
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1429
+ )
1430
+
1431
+ vision_outputs: BaseModelOutputWithPooling = self.vision_model(
1432
+ pixel_values=pixel_values,
1433
+ attention_mask=pixel_attention_mask,
1434
+ spatial_shapes=spatial_shapes,
1435
+ output_attentions=output_attentions,
1436
+ output_hidden_states=output_hidden_states,
1437
+ )
1438
+
1439
+ pooled_output = vision_outputs.pooler_output
1440
+
1441
+ return pooled_output
1442
+
1443
+ @can_return_tuple
1444
+ @add_start_docstrings_to_model_forward(SIGLIP2_INPUTS_DOCSTRING)
1445
+ @replace_return_docstrings(output_type=Siglip2Output, config_class=Siglip2Config)
1446
+ def forward(
1447
+ self,
1448
+ input_ids: Optional[torch.LongTensor] = None,
1449
+ pixel_values: Optional[torch.FloatTensor] = None,
1450
+ pixel_attention_mask: Optional[torch.Tensor] = None,
1451
+ spatial_shapes: Optional[torch.LongTensor] = None,
1452
+ attention_mask: Optional[torch.Tensor] = None,
1453
+ position_ids: Optional[torch.LongTensor] = None,
1454
+ return_loss: Optional[bool] = None,
1455
+ output_attentions: Optional[bool] = None,
1456
+ output_hidden_states: Optional[bool] = None,
1457
+ ) -> Siglip2Output:
1458
+ r"""
1459
+ Returns:
1460
+
1461
+ """
1462
+
1463
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1464
+ output_hidden_states = (
1465
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1466
+ )
1467
+
1468
+ vision_outputs: BaseModelOutputWithPooling = self.vision_model(
1469
+ pixel_values=pixel_values,
1470
+ attention_mask=pixel_attention_mask,
1471
+ spatial_shapes=spatial_shapes,
1472
+ output_attentions=output_attentions,
1473
+ output_hidden_states=output_hidden_states,
1474
+ )
1475
+
1476
+ text_outputs: BaseModelOutputWithPooling = self.text_model(
1477
+ input_ids=input_ids,
1478
+ attention_mask=attention_mask,
1479
+ position_ids=position_ids,
1480
+ output_attentions=output_attentions,
1481
+ output_hidden_states=output_hidden_states,
1482
+ )
1483
+
1484
+ image_embeds = vision_outputs.pooler_output
1485
+ text_embeds = text_outputs.pooler_output
1486
+
1487
+ image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
1488
+ text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
1489
+
1490
+ logits_per_text = torch.matmul(text_embeds, image_embeds.t().to(text_embeds.device))
1491
+
1492
+ logit_scale, logit_bias = self.logit_scale.to(text_embeds.device), self.logit_bias.to(text_embeds.device)
1493
+ logits_per_text = logits_per_text * logit_scale.exp() + logit_bias
1494
+
1495
+ logits_per_image = logits_per_text.t()
1496
+
1497
+ loss = None
1498
+ if return_loss:
1499
+ eye = torch.eye(logits_per_text.size(0), device=logits_per_text.device)
1500
+ m1_diag1 = -torch.ones_like(logits_per_text) + 2 * eye
1501
+ loglik = torch.nn.functional.logsigmoid(m1_diag1 * logits_per_text)
1502
+ nll = -torch.sum(loglik, dim=-1)
1503
+ loss = nll.mean()
1504
+
1505
+ return Siglip2Output(
1506
+ loss=loss,
1507
+ logits_per_image=logits_per_image,
1508
+ logits_per_text=logits_per_text,
1509
+ text_embeds=text_embeds,
1510
+ image_embeds=image_embeds,
1511
+ text_model_output=text_outputs,
1512
+ vision_model_output=vision_outputs,
1513
+ )
1514
+
1515
+
1516
+ @add_start_docstrings(
1517
+ """
1518
+ Siglip2 vision encoder with an image classification head on top (a
1519
+ linear layer on top of the pooled final hidden states of
1520
+ the patch tokens) e.g. for ImageNet.
1521
+ """,
1522
+ SIGLIP2_START_DOCSTRING,
1523
+ )
1524
+ class Siglip2ForImageClassification(Siglip2PreTrainedModel):
1525
+ main_input_name = "pixel_values"
1526
+
1527
+ def __init__(self, config: Siglip2Config) -> None:
1528
+ super().__init__(config)
1529
+
1530
+ self.num_labels = config.num_labels
1531
+
1532
+ vision_model = Siglip2VisionModel._from_config(config.vision_config)
1533
+ self.vision_model = vision_model.vision_model
1534
+
1535
+ self.classifier = (
1536
+ nn.Linear(config.vision_config.hidden_size, config.num_labels) if config.num_labels > 0 else nn.Identity()
1537
+ )
1538
+
1539
+ self.post_init()
1540
+
1541
+ @can_return_tuple
1542
+ @add_start_docstrings_to_model_forward(SIGLIP2_INPUTS_DOCSTRING)
1543
+ @replace_return_docstrings(output_type=ImageClassifierOutput, config_class=_CONFIG_FOR_DOC)
1544
+ def forward(
1545
+ self,
1546
+ pixel_values: Optional[torch.Tensor] = None,
1547
+ pixel_attention_mask: Optional[torch.Tensor] = None,
1548
+ spatial_shapes: Optional[torch.LongTensor] = None,
1549
+ labels: Optional[torch.Tensor] = None,
1550
+ output_attentions: Optional[bool] = None,
1551
+ output_hidden_states: Optional[bool] = None,
1552
+ ) -> ImageClassifierOutput:
1553
+ r"""
1554
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1555
+ Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
1556
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
1557
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1558
+
1559
+ Returns:
1560
+
1561
+ """
1562
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1563
+ output_hidden_states = (
1564
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1565
+ )
1566
+
1567
+ outputs: BaseModelOutputWithPooling = self.vision_model(
1568
+ pixel_values,
1569
+ attention_mask=pixel_attention_mask,
1570
+ spatial_shapes=spatial_shapes,
1571
+ output_attentions=output_attentions,
1572
+ output_hidden_states=output_hidden_states,
1573
+ )
1574
+
1575
+ sequence_output = outputs.last_hidden_state
1576
+
1577
+ if pixel_attention_mask is not None:
1578
+ pool_mask = pixel_attention_mask[..., None].to(sequence_output.device)
1579
+ sequence_output = torch.sum(sequence_output * pool_mask, dim=1) / torch.sum(pool_mask, dim=1)
1580
+ else:
1581
+ sequence_output = torch.mean(sequence_output, dim=1)
1582
+
1583
+ logits = self.classifier(sequence_output)
1584
+
1585
+ loss = None
1586
+ if labels is not None:
1587
+ labels = labels.to(logits.device)
1588
+ if self.config.problem_type is None:
1589
+ if self.num_labels == 1:
1590
+ self.config.problem_type = "regression"
1591
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1592
+ self.config.problem_type = "single_label_classification"
1593
+ else:
1594
+ self.config.problem_type = "multi_label_classification"
1595
+
1596
+ if self.config.problem_type == "regression":
1597
+ loss_fct = MSELoss()
1598
+ if self.num_labels == 1:
1599
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
1600
+ else:
1601
+ loss = loss_fct(logits, labels)
1602
+ elif self.config.problem_type == "single_label_classification":
1603
+ loss_fct = CrossEntropyLoss()
1604
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1605
+ elif self.config.problem_type == "multi_label_classification":
1606
+ loss_fct = BCEWithLogitsLoss()
1607
+ loss = loss_fct(logits, labels)
1608
+
1609
+ return ImageClassifierOutput(
1610
+ loss=loss,
1611
+ logits=logits,
1612
+ hidden_states=outputs.hidden_states,
1613
+ attentions=outputs.attentions,
1614
+ )
1615
+
1616
+
1617
+ __all__ = [
1618
+ "Siglip2Model",
1619
+ "Siglip2PreTrainedModel",
1620
+ "Siglip2TextModel",
1621
+ "Siglip2VisionModel",
1622
+ "Siglip2ForImageClassification",
1623
+ ]
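Reviewer note on `Siglip2Model.forward` above: with `return_loss=True` it builds a matrix that is +1 on the diagonal and -1 elsewhere, multiplies it into `logits_per_text`, applies `logsigmoid`, sums each row, and averages over rows. The sketch below mirrors that arithmetic in plain Python on nested lists, so it can be checked by hand without PyTorch. The names `log_sigmoid` and `sigmoid_pairwise_loss` are hypothetical helpers for illustration and are not part of this commit; the input is assumed to be a square matrix, as the `torch.eye` call in the source requires.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(logits_per_text):
    """Illustrative re-statement of the loss branch in Siglip2Model.forward.

    logits_per_text: square list of lists, entry [i][j] scores text i
    against image j. Matching pairs sit on the diagonal.
    """
    n = len(logits_per_text)
    total = 0.0
    for i, row in enumerate(logits_per_text):
        row_nll = 0.0
        for j, logit in enumerate(row):
            # +1 for the matching pair, -1 for every other pair
            sign = 1.0 if i == j else -1.0
            row_nll -= log_sigmoid(sign * logit)
        total += row_nll
    # mean over rows, matching nll.mean() in the source
    return total / n


if __name__ == "__main__":
    uninformative = [[0.0, 0.0], [0.0, 0.0]]
    separated = [[10.0, -10.0], [-10.0, 10.0]]
    print(sigmoid_pairwise_loss(uninformative))
    print(sigmoid_pairwise_loss(separated))
```

Every pair contributes an independent binary term, so the loss needs no softmax normalization across the batch; all-zero logits give `n * log(2)` per row, and well-separated logits drive the loss toward zero.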
modeling_youtu_vl.py ADDED
@@ -0,0 +1,1338 @@
1
+ # coding=utf-8
2
+ # Copyright 2026 Tencent Youtu lab, DeepSeek-AI and The HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ import math
21
+ import os
22
+ from functools import partial
23
+ from typing import Callable, Optional, Tuple, Union
24
+
25
+ import torch
26
+ import torch.nn.functional as F
27
+ from torch import nn
28
+ import numpy as np
29
+ import pydensecrf.densecrf as dcrf
30
+ from pydensecrf.utils import unary_from_softmax
31
+ from PIL import Image
32
+ import requests
33
+ from io import BytesIO
34
+ import base64
35
+ import cv2
36
+
37
+ from transformers.activations import ACT2FN
38
+ from transformers.cache_utils import Cache, DynamicCache, StaticCache
39
+ from transformers.generation import GenerationMixin
40
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
41
+ from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
42
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
43
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
44
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
45
+ from transformers.processing_utils import Unpack
46
+ from transformers.utils import (
47
+ add_start_docstrings,
48
+ add_start_docstrings_to_model_forward,
49
+ can_return_tuple,
50
+ is_torch_flex_attn_available,
51
+ logging,
52
+ replace_return_docstrings,
53
+ is_flash_attn_2_available,
54
+ )
55
+ from transformers.utils.deprecation import deprecate_kwarg
56
+ from .configuration_youtu_vl import YoutuVLConfig
57
+
58
+ from .modeling_siglip2 import Siglip2VisionModel, Siglip2VisionEmbeddings
59
+ from .configuration_siglip2 import Siglip2VisionConfig
60
+
61
+
62
+
63
+ if is_torch_flex_attn_available():
64
+ from torch.nn.attention.flex_attention import BlockMask
65
+ from transformers.integrations.flex_attention import make_flex_block_causal_mask
66
+
67
+ is_aiter_available = False
68
+
69
+ if is_flash_attn_2_available():
70
+ try:
71
+ from aiter import flash_attn_varlen_func
72
+ is_aiter_available = True
73
+ except ImportError:
74
+ from flash_attn import flash_attn_varlen_func
75
+ else:
76
+ flash_attn_varlen_func = None
77
+
78
+ logger = logging.get_logger(__name__)
79
+ _CONFIG_FOR_DOC = "YoutuVLConfig"
80
+
81
+ class YoutuRMSNorm(nn.Module):
82
+ def __init__(self, hidden_size, eps=1e-6):
83
+ super().__init__()
84
+ self.weight = nn.Parameter(torch.ones(hidden_size))
85
+ self.variance_epsilon = eps
86
+
87
+ def forward(self, hidden_states):
88
+ input_dtype = hidden_states.dtype
89
+
90
+ hidden_states = hidden_states.to(torch.float32)
91
+
92
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
93
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
94
+ return self.weight * hidden_states.to(input_dtype)
95
+
96
+ def extra_repr(self):
97
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
98
+
99
+
100
+ class YoutuRotaryEmbedding(nn.Module):
101
+ def __init__(self, config: YoutuVLConfig, device=None):
102
+ super().__init__()
103
+ if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
104
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
105
+ else:
106
+ self.rope_type = "default"
107
+ self.max_seq_len_cached = config.max_position_embeddings
108
+ self.original_max_seq_len = config.max_position_embeddings
109
+
110
+ self.config = config
111
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
112
+
113
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
114
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
115
+ self.original_inv_freq = self.inv_freq
116
+
117
+ @torch.no_grad()
118
+ @dynamic_rope_update
119
+ def forward(self, x, position_ids):
120
+ """
121
+ Compute rotary positional embeddings.
122
+ Args:
123
+ x (torch.Tensor): Input tensor, shape (batch_size, seq_len, feature_dim)
124
+ position_ids (torch.LongTensor): Position indices, shape (batch_size, seq_len)
125
+
126
+ """
127
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
128
+ position_ids_expanded = position_ids[:, None, :].float()
129
+
130
+ device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
131
+ with torch.autocast(device_type=device_type, enabled=False): # Force float32
132
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
133
+ emb = torch.cat((freqs, freqs), dim=-1)
134
+ cos = emb.cos() * self.attention_scaling
135
+ sin = emb.sin() * self.attention_scaling
136
+
137
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
138
+
139
+
140
+ class YoutuMLP(nn.Module):
141
+ def __init__(self, config, hidden_size=None, intermediate_size=None):
142
+ super().__init__()
143
+ self.config = config
144
+ self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
145
+ self.intermediate_size = config.intermediate_size if intermediate_size is None else intermediate_size
146
+
147
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
148
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
149
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
150
+ self.act_fn = ACT2FN[config.hidden_act]
151
+
152
+ def forward(self, x):
153
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
154
+ return down_proj
155
+
156
+
157
+ def rotate_half(x):
158
+ """
159
+ Rotates half the hidden dims of the input.
160
+ """
161
+ x1 = x[..., : x.shape[-1] // 2]
162
+ x2 = x[..., x.shape[-1] // 2 :]
163
+ return torch.cat((-x2, x1), dim=-1)
164
+
165
+
166
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
167
+ """Applies Rotary Position Embedding to the query and key tensors.
168
+
169
+ Args:
170
+ q (`torch.Tensor`): The query tensor.
171
+ k (`torch.Tensor`): The key tensor.
172
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
173
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
174
+ position_ids (`torch.Tensor`, *optional*):
175
+ Deprecated and unused.
176
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
177
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
178
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
179
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
180
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
181
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
182
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
183
+ Returns:
184
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
185
+ """
186
+ cos = cos.unsqueeze(unsqueeze_dim)
187
+ sin = sin.unsqueeze(unsqueeze_dim)
188
+ q_embed = (q * cos) + (rotate_half(q) * sin)
189
+ k_embed = (k * cos) + (rotate_half(k) * sin)
190
+ return q_embed, k_embed
191
+
192
+
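To make `rotate_half` and `apply_rotary_pos_emb` concrete, here is a small sketch on plain Python lists for a single vector. The frequency schedule (`base ** (-2i / dim)`) is an assumption based on the usual RoPE formulation; in the model the `cos` and `sin` tensors come from `YoutuRotaryEmbedding`. The name `apply_rope` is hypothetical.

```python
import math


def rotate_half(x):
    # Mirrors the tensor version: concatenate (-second_half, first_half).
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, position, base=10000.0):
    dim = len(x)
    half = dim // 2
    # One angle per pair (i, i + half); the list is duplicated to cover both halves.
    angles = [position * base ** (-2.0 * i / dim) for i in range(half)] * 2
    cos = [math.cos(a) for a in angles]
    sin = [math.sin(a) for a in angles]
    rotated = rotate_half(x)
    # Same formula as in the source: x * cos + rotate_half(x) * sin.
    return [xv * c + rv * s for xv, c, rv, s in zip(x, cos, rotated, sin)]


q = [1.0, 2.0, 3.0, 4.0]
q_rot = apply_rope(q, position=5)
```

Each pair `(x[i], x[i + half])` is rotated by a position-dependent angle, so the vector norm is unchanged and the dot product of a rotated query and key depends only on the difference of their positions.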
193
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
194
+ """
195
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
196
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
197
+ """
198
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
199
+ if n_rep == 1:
200
+ return hidden_states
201
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
202
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
203
+
204
+
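The expand-and-reshape in `repeat_kv` repeats each key/value head `n_rep` times in place, so copies of the same head stay adjacent. A list-based sketch of that ordering follows; the string payloads stand in for per-head tensors and are purely illustrative.

```python
def repeat_kv(heads, n_rep):
    # heads: per-head payloads for one batch element.
    if n_rep == 1:
        return heads
    # Each head is repeated n_rep times consecutively (repeat_interleave order).
    return [h for h in heads for _ in range(n_rep)]


kv_heads = ["kv0", "kv1"]
expanded = repeat_kv(kv_heads, 3)
```

Note that `YoutuMLAttention` sets `num_key_value_groups = 1`, so in this model the function takes the early-return path and leaves the tensors untouched.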
205
+ def eager_attention_forward(
206
+ module: nn.Module,
207
+ query: torch.Tensor,
208
+ key: torch.Tensor,
209
+ value: torch.Tensor,
210
+ attention_mask: Optional[torch.Tensor],
211
+ scaling: float,
212
+ dropout: float = 0.0,
213
+ **kwargs,
214
+ ):
215
+ key_states = repeat_kv(key, module.num_key_value_groups)
216
+ value_states = repeat_kv(value, module.num_key_value_groups)
217
+
218
+ attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
219
+ if attention_mask is not None:
220
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
221
+ attn_weights = attn_weights + causal_mask
222
+
223
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
224
+ attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
225
+ attn_output = torch.matmul(attn_weights, value_states)
226
+ attn_output = attn_output.transpose(1, 2).contiguous()
227
+
228
+ return attn_output, attn_weights
229
+
230
+
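The eager path above is plain scaled dot-product attention: scores, additive mask, softmax, weighted sum of values. The following single-head sketch on Python lists shows the same steps. It builds the causal mask internally with `-inf`, whereas the source receives a precomputed additive mask; the function names are hypothetical.

```python
import math


def softmax(row):
    # Subtract the row maximum for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, key, value, scaling, causal=True):
    outputs, weights = [], []
    for i, q in enumerate(query):
        # Scaled dot product of this query with every key.
        scores = [scaling * sum(a * b for a, b in zip(q, k)) for k in key]
        if causal:
            # Positions after i are masked out before the softmax.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors.
        outputs.append([sum(wj * v[d] for wj, v in zip(w, value)) for d in range(len(value[0]))])
    return outputs, weights


query = key = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
value = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = attention(query, key, value, scaling=2 ** -0.5)
```

With a causal mask the first query can only see the first key, so its output is exactly the first value vector.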
231
+ def apply_rotary_pos_emb_interleave(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
232
+ r"""
233
+ Args:
234
+ q (`torch.Tensor`): The query tensor.
235
+ k (`torch.Tensor`): The key tensor.
236
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
237
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
238
+ position_ids (`torch.Tensor`):
239
+ The position indices of the tokens corresponding to the query and key tensors. For example, this can be
240
+ used to pass offset position ids when working with a KV-cache.
241
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
242
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
243
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
244
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
245
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
246
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
247
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
248
+ Returns:
249
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
250
+ """
251
+ cos = cos.unsqueeze(unsqueeze_dim)
252
+ sin = sin.unsqueeze(unsqueeze_dim)
253
+
254
+ b, h, s, d = q.shape
255
+ q = q.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
256
+
257
+ b, h, s, d = k.shape
258
+ k = k.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
259
+
260
+ q_embed = (q * cos) + (rotate_half(q) * sin)
261
+ k_embed = (k * cos) + (rotate_half(k) * sin)
262
+ return q_embed, k_embed
263
+
264
+
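The `view(..., d // 2, 2).transpose(4, 3).reshape(..., d)` step in `apply_rotary_pos_emb_interleave` is a permutation of the last dimension: it moves all even-indexed channels to the first half and all odd-indexed channels to the second half. A list-based sketch of that reordering (the helper name `deinterleave` is hypothetical):

```python
def deinterleave(x):
    # Even-indexed entries first, then odd-indexed entries.
    return x[0::2] + x[1::2]


channels = [0, 1, 2, 3, 4, 5]
reordered = deinterleave(channels)
```

After this reordering, an interleaved pair `(x[2i], x[2i + 1])` sits at positions `(i, i + d // 2)`, which is the pairing that `rotate_half` expects.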
265
+ def yarn_get_mscale(scale=1, mscale=1):
266
+ if scale <= 1:
267
+ return 1.0
268
+ return 0.1 * mscale * math.log(scale) + 1.0
269
+
270
+
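As a worked example of how `yarn_get_mscale` feeds into the attention scale set in `YoutuMLAttention.__init__`, the sketch below recomputes `scaling = qk_head_dim ** -0.5 * mscale ** 2`. The helper `attention_scaling` and the numbers passed to it (a head dimension of 64, a RoPE factor of 40) are hypothetical and not taken from the model's config.

```python
import math


def yarn_get_mscale(scale=1.0, mscale=1.0):
    # Same formula as in the source: no correction for scale <= 1.
    if scale <= 1:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0


def attention_scaling(qk_head_dim, factor=None, mscale_all_dim=0.0):
    # Base softmax scale 1 / sqrt(head_dim).
    scaling = qk_head_dim ** -0.5
    # With YaRN-style rope scaling, the scale is multiplied by mscale squared.
    if factor is not None and mscale_all_dim:
        m = yarn_get_mscale(factor, mscale_all_dim)
        scaling *= m * m
    return scaling


plain = attention_scaling(64)
yarn = attention_scaling(64, factor=40.0, mscale_all_dim=1.0)
```

The correction grows with the logarithm of the context-extension factor, so `yarn` comes out larger than `plain` whenever the factor exceeds 1.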
271
+ class YoutuMLAttention(nn.Module):
272
+ """Multi-head Latent Attention (MLA) from the paper 'DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model'."""
273
+
274
+ def __init__(self, config: YoutuVLConfig, layer_idx: int):
275
+ super().__init__()
276
+ self.config = config
277
+ self.layer_idx = layer_idx
278
+ self.num_key_value_groups = 1 # needed for eager attentions
279
+ self.attention_dropout = config.attention_dropout
280
+ self.num_heads = config.num_attention_heads
281
+ self.rope_theta = config.rope_theta
282
+ self.q_lora_rank = config.q_lora_rank
283
+ self.qk_rope_head_dim = config.qk_rope_head_dim
284
+ self.kv_lora_rank = config.kv_lora_rank
285
+ self.v_head_dim = config.v_head_dim
286
+ self.qk_nope_head_dim = config.qk_nope_head_dim
287
+ self.qk_head_dim = config.qk_head_dim
288
+ self.flash_att_sliding_window = config.flash_att_sliding_window
289
+ self.is_causal = True
290
+
291
+
292
+ if self.q_lora_rank is None:
293
+ self.q_proj = nn.Linear(config.hidden_size, self.num_heads * self.qk_head_dim, bias=False)
294
+ else:
295
+ self.q_a_proj = nn.Linear(config.hidden_size, config.q_lora_rank, bias=config.attention_bias)
296
+ self.q_a_layernorm = YoutuRMSNorm(config.q_lora_rank)
297
+ self.q_b_proj = nn.Linear(config.q_lora_rank, self.num_heads * self.qk_head_dim, bias=False)
298
+
299
+ self.kv_a_proj_with_mqa = nn.Linear(
300
+ config.hidden_size,
301
+ self.kv_lora_rank + self.qk_rope_head_dim,
302
+ bias=config.attention_bias,
303
+ )
304
+ self.kv_a_layernorm = YoutuRMSNorm(self.kv_lora_rank)
305
+ self.kv_b_proj = nn.Linear(
306
+ self.kv_lora_rank,
307
+ self.num_heads * (self.qk_nope_head_dim + self.v_head_dim),
308
+ bias=False,
309
+ )
310
+
311
+ self.o_proj = nn.Linear(
312
+ self.num_heads * self.v_head_dim,
313
+ config.hidden_size,
314
+ bias=config.attention_bias,
315
+ )
316
+
317
+ self.scaling = self.qk_head_dim ** (-0.5)
318
+ if self.config.rope_scaling is not None:
319
+ mscale_all_dim = self.config.rope_scaling.get("mscale_all_dim", 0)
320
+ scaling_factor = self.config.rope_scaling["factor"]
321
+ if mscale_all_dim:
322
+ mscale = yarn_get_mscale(scaling_factor, mscale_all_dim)
323
+ self.scaling = self.scaling * mscale * mscale
324
+
325
+ def forward(
326
+ self,
327
+ hidden_states: torch.Tensor,
328
+ position_embeddings: Tuple[torch.Tensor, torch.Tensor],
329
+ attention_mask: Optional[torch.Tensor],
330
+ instance_length: Optional[torch.LongTensor] = None,
331
+ past_key_value: Optional[Cache] = None,
332
+ cache_position: Optional[torch.LongTensor] = None,
333
+ **kwargs: Unpack[FlashAttentionKwargs],
334
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
335
+ batch_size, seq_length = hidden_states.shape[:-1]
336
+ query_shape = (batch_size, seq_length, -1, self.qk_head_dim)
337
+ key_shape = (batch_size, seq_length, -1, self.qk_nope_head_dim + self.v_head_dim)
338
+
339
+ if self.q_lora_rank is None:
340
+ q_states = self.q_proj(hidden_states).view(query_shape).transpose(1, 2)
341
+ else:
342
+ q_states = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states))).view(query_shape).transpose(1, 2)
343
+ q_pass, q_rot = torch.split(q_states, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
344
+
345
+ compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
346
+ k_pass, k_rot = torch.split(compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
347
+
348
+ k_pass = self.kv_b_proj(self.kv_a_layernorm(k_pass)).view(key_shape).transpose(1, 2)
349
+ k_pass, value_states = torch.split(k_pass, [self.qk_nope_head_dim, self.v_head_dim], dim=-1)
350
+
351
+ k_rot = k_rot.view(batch_size, 1, seq_length, self.qk_rope_head_dim)
352
+
353
+ cos, sin = position_embeddings
354
+ if self.config.rope_interleave: # support using interleaved weights for efficiency
355
+ q_rot, k_rot = apply_rotary_pos_emb_interleave(q_rot, k_rot, cos, sin)
356
+ else:
357
+ q_rot, k_rot = apply_rotary_pos_emb(q_rot, k_rot, cos, sin)
358
+ k_rot = k_rot.expand(*k_pass.shape[:-1], -1)
359
+
360
+ query_states = torch.cat((q_pass, q_rot), dim=-1)
361
+ key_states = torch.cat((k_pass, k_rot), dim=-1)
362
+
363
+ if past_key_value is not None:
364
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
365
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
366
+
367
+ if self.config._attn_implementation == "flash_attention_2" and self.qk_head_dim != self.v_head_dim:
368
+ value_states = F.pad(value_states, [0, self.qk_head_dim - self.v_head_dim])
369
+
370
+ attention_interface: Callable = eager_attention_forward
371
+ if self.config._attn_implementation != "eager":
372
+ if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
373
+ logger.warning_once(
374
+ "`torch.nn.functional.scaled_dot_product_attention` does not support "
375
+ "`output_attentions=True`. Falling back to eager attention. This warning "
376
+ 'can be removed using the argument `attn_implementation="eager"` when loading the model.'
377
+ )
378
+ else:
379
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
380
+
381
+ if instance_length is None or flash_attn_varlen_func is None:
382
+ attn_output, attn_weights = attention_interface(
383
+ self,
384
+ query_states,
385
+ key_states,
386
+ value_states,
387
+ attention_mask,
388
+ dropout=0.0 if not self.training else self.attention_dropout,
389
+ scaling=self.scaling,
390
+ **kwargs,
391
+ )
392
+ if self.config._attn_implementation == "flash_attention_2" and self.qk_head_dim != self.v_head_dim:
393
+ attn_output = attn_output[:, :, :, : self.v_head_dim]
394
+ else:
395
+ instance_length = instance_length.view(-1)
396
+ query_states = query_states.squeeze(0).transpose(0, 1)
397
+ key_states = key_states.squeeze(0).transpose(0, 1)
398
+ value_states = value_states.squeeze(0).transpose(0, 1)
399
+ max_seqlen_in_batch = instance_length.max().item()
400
+ cu_seqlens = F.pad(torch.cumsum(instance_length, dim=0, dtype=torch.int32), (1, 0))
401
+ if is_aiter_available:
402
+ attn_output = flash_attn_varlen_func(query_states, key_states, value_states, cu_seqlens,
403
+ cu_seqlens, max_seqlen_in_batch, max_seqlen_in_batch,
404
+ dropout_p=0.0 if not self.training else self.attention_dropout,
405
+ softmax_scale=self.scaling,
406
+ causal=self.is_causal, return_lse=True)[0]
407
+ else:
408
+ attn_output = flash_attn_varlen_func(query_states, key_states, value_states, cu_seqlens,
409
+ cu_seqlens, max_seqlen_in_batch, max_seqlen_in_batch,
410
+ dropout_p=0.0 if not self.training else self.attention_dropout,
411
+ softmax_scale=self.scaling,
412
+ causal=self.is_causal)
413
+
414
+ attn_output = attn_output.unsqueeze(0)
415
+ attn_output = attn_output[:, :, :, : self.v_head_dim]
416
+ attn_weights = None
417
+
418
+ attn_output = attn_output.reshape(batch_size, seq_length, -1).contiguous()
419
+ attn_output = self.o_proj(attn_output)
420
+ return attn_output, attn_weights
421
+
422
+
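The variable-length branch of `YoutuMLAttention.forward` packs several sequences into one row and describes their boundaries with `cu_seqlens`, a cumulative sum of `instance_length` with a leading zero. A stdlib sketch of that bookkeeping follows; the helper names and the example lengths are hypothetical.

```python
from itertools import accumulate


def cumulative_seqlens(instance_length):
    # Equivalent to F.pad(torch.cumsum(instance_length, 0), (1, 0)) in the source.
    return [0] + list(accumulate(instance_length))


def split_packed(tokens, instance_length):
    # Recover the individual sequences from the packed token list.
    cu = cumulative_seqlens(instance_length)
    return [tokens[start:end] for start, end in zip(cu[:-1], cu[1:])]


lengths = [3, 2, 4]
cu_seqlens = cumulative_seqlens(lengths)
max_seqlen_in_batch = max(lengths)
pieces = split_packed(list("abcdefghi"), lengths)
```

Consecutive entries of `cu_seqlens` delimit one sequence each, which is how the variable-length kernel keeps attention from crossing sequence boundaries inside the packed batch.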
423
+ class YoutuDecoderLayer(nn.Module):
424
+ def __init__(self, config: YoutuVLConfig, layer_idx: int):
425
+ super().__init__()
426
+ self.hidden_size = config.hidden_size
427
+ self.self_attn = YoutuMLAttention(config=config, layer_idx=layer_idx)
428
+ self.mlp = YoutuMLP(config)
429
+
430
+ self.input_layernorm = YoutuRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
431
+ self.post_attention_layernorm = YoutuRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
432
+
433
+ def forward(
434
+ self,
435
+ hidden_states: torch.Tensor,
436
+ attention_mask: Optional[torch.Tensor] = None,
437
+ position_ids: Optional[torch.LongTensor] = None,
438
+ past_key_value: Optional[Cache] = None,
439
+ output_attentions: Optional[bool] = False,
440
+ instance_length: Optional[torch.LongTensor] = None,
441
+ use_cache: Optional[bool] = False,
442
+ cache_position: Optional[torch.LongTensor] = None,
443
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
444
+ **kwargs: Unpack[FlashAttentionKwargs],
445
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
446
+ residual = hidden_states
447
+
448
+ hidden_states = self.input_layernorm(hidden_states)
449
+
450
+ hidden_states, self_attn_weights = self.self_attn(
451
+ hidden_states=hidden_states,
452
+ attention_mask=attention_mask,
453
+ position_ids=position_ids,
454
+ past_key_value=past_key_value,
455
+ output_attentions=output_attentions,
456
+ instance_length=instance_length,
457
+ use_cache=use_cache,
458
+ cache_position=cache_position,
459
+ position_embeddings=position_embeddings,
460
+ **kwargs,
461
+ )
462
+ hidden_states = residual + hidden_states
463
+
464
+ residual = hidden_states
465
+ hidden_states = self.post_attention_layernorm(hidden_states)
466
+ hidden_states = self.mlp(hidden_states)
467
+ hidden_states = residual + hidden_states
468
+
469
+ outputs = (hidden_states,)
470
+ if output_attentions:
471
+ outputs += (self_attn_weights,)
472
+
473
+ return outputs
474
+
475
+
476
+ YOUTU_VL_START_DOCSTRING = r"""
477
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
478
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
479
+ etc.)
480
+
481
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
482
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
483
+ and behavior.
484
+
485
+ Parameters:
486
+ config ([`YoutuVLConfig`]):
487
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
488
+ load the weights associated with the model, only the configuration. Check out the
489
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
490
+ """
491
+
492
+
493
+ @add_start_docstrings(
494
+ "The bare Youtu Model outputting raw hidden-states without any specific head on top.",
495
+ YOUTU_VL_START_DOCSTRING,
496
+ )
497
+ class YoutuPreTrainedModel(PreTrainedModel):
498
+ config_class = YoutuVLConfig
499
+ base_model_prefix = "model"
500
+ supports_gradient_checkpointing = True
501
+ _no_split_modules = ["YoutuDecoderLayer"]
502
+ _skip_keys_device_placement = ["past_key_values"]
503
+ _supports_flash_attn_2 = True
504
+ _supports_sdpa = True
505
+ _supports_flex_attn = True
506
+ _supports_cache_class = True
507
+ _supports_quantized_cache = True
508
+ _supports_static_cache = True
509
+ _supports_attention_backend = True
510
+
511
+ def init_weights(self):
512
+ if self.config.pruned_heads:
513
+ self.prune_heads(self.config.pruned_heads)
514
+
515
+ if "-init" in self.name_or_path:
516
+ self.apply(self._initialize_weights)
517
+
518
+ for name, module in self.named_modules():
519
+ if "o_proj" in name or "down_proj" in name:
520
+ scaled_std = self.config.initializer_range * (1.0 / self.config.num_hidden_layers) ** 0.5
521
+ module.weight.data.normal_(mean=0.0, std=scaled_std)
522
+
523
+ self.tie_weights()
524
+
525
+ def _init_weights(self, module):
526
+ std = self.config.initializer_range
527
+ embedding_std = self.config.embedding_initializer_range
528
+ if isinstance(module, nn.Linear):
529
+ module.weight.data.normal_(mean=0.0, std=std)
530
+ if module.bias is not None:
531
+ module.bias.data.zero_()
532
+ elif isinstance(module, nn.Embedding):
533
+ module.weight.data.normal_(mean=0.0, std=embedding_std)
534
+ if module.padding_idx is not None:
535
+ module.weight.data[module.padding_idx].zero_()
536
+ elif isinstance(module, nn.Parameter):
537
+ module.data.normal_(mean=0.0, std=std)
538
+ elif isinstance(module, YoutuRMSNorm):
539
+ module.weight.data.fill_(1.0)
540
+
541
+
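In `init_weights`, the output projections (`o_proj`, `down_proj`) are re-initialized with a standard deviation shrunk by `1 / sqrt(num_hidden_layers)`. A short worked example of that formula, with an `initializer_range` of 0.02 and 16 layers as purely hypothetical values:

```python
def scaled_std(initializer_range, num_hidden_layers):
    # Same expression as in the source: range * (1 / L) ** 0.5.
    return initializer_range * (1.0 / num_hidden_layers) ** 0.5


std_out_proj = scaled_std(0.02, 16)
```

With 16 layers the factor is 0.25, so the projections that write into the residual stream start at a quarter of the base standard deviation.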
542
+ YOUTU_VL_INPUTS_DOCSTRING = r"""
543
+ Args:
544
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
545
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
546
+ it.
547
+
548
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
549
+ [`PreTrainedTokenizer.__call__`] for details.
550
+
551
+ [What are input IDs?](../glossary#input-ids)
552
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
553
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
554
+
555
+ - 1 for tokens that are **not masked**,
556
+ - 0 for tokens that are **masked**.
557
+
558
+ [What are attention masks?](../glossary#attention-mask)
559
+
560
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
561
+ [`PreTrainedTokenizer.__call__`] for details.
562
+
563
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
564
+ `past_key_values`).
565
+
566
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
567
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
568
+ information on the default strategy.
569
+
570
+ - 1 indicates the head is **not masked**,
571
+ - 0 indicates the head is **masked**.
572
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
573
+ Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
574
+ config.n_positions - 1]`.
575
+
576
+ [What are position IDs?](../glossary#position-ids)
577
+ past_key_values (`Cache`, *optional*):
578
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
579
+ blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values`
580
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
581
+
582
+ It is a [`~cache_utils.Cache`] instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache).
583
+
584
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
585
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
586
+ of shape `(batch_size, sequence_length)`.
587
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
588
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
589
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
590
+ model's internal embedding lookup matrix.
591
+ use_cache (`bool`, *optional*):
592
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
593
+ `past_key_values`).
594
+ output_attentions (`bool`, *optional*):
595
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
596
+ tensors for more detail.
597
+ output_hidden_states (`bool`, *optional*):
598
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
599
+ more detail.
600
+ return_dict (`bool`, *optional*):
601
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
602
+ cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
603
+ Indices depicting the position of the input sequence tokens in the sequence. Contrary to `position_ids`,
604
+ this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
605
+ the complete sequence length.
606
+ """
607
+
608
+
609
+ @add_start_docstrings(
610
+ "The bare Youtu Model outputting raw hidden-states without any specific head on top.",
611
+ YOUTU_VL_START_DOCSTRING,
612
+ )
613
+ class YoutuModel(YoutuPreTrainedModel):
614
+ _keys_to_ignore_on_load_unexpected = [r"model\.layers\.61.*"]
615
+
616
+ def __init__(self, config: YoutuVLConfig):
617
+ super().__init__(config)
618
+ self.padding_idx = config.pad_token_id
619
+ self.vocab_size = config.vocab_size
620
+
621
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
622
+ self.layers = nn.ModuleList(
623
+ [YoutuDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
624
+ )
625
+ self.norm = YoutuRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
626
+ self.rotary_emb = YoutuRotaryEmbedding(config=config)
627
+ self.gradient_checkpointing = False
628
+
629
+ # Initialize weights and apply final processing
630
+ self.post_init()
631
+
632
+ def get_input_embeddings(self):
633
+ return self.embed_tokens
634
+
635
+ def set_input_embeddings(self, value):
636
+ self.embed_tokens = value
637
+
638
+ @can_return_tuple
639
+ @add_start_docstrings_to_model_forward(YOUTU_VL_INPUTS_DOCSTRING)
640
+ def forward(
641
+ self,
642
+ input_ids: Optional[torch.LongTensor] = None,
643
+ attention_mask: Optional[torch.Tensor] = None,
644
+ position_ids: Optional[torch.LongTensor] = None,
645
+ past_key_values: Optional[Cache] = None,
646
+ inputs_embeds: Optional[torch.FloatTensor] = None,
647
+ use_cache: Optional[bool] = None,
648
+ instance_length: Optional[torch.LongTensor] = None,
649
+ output_attentions: Optional[bool] = None,
650
+ output_hidden_states: Optional[bool] = None,
651
+ cache_position: Optional[torch.LongTensor] = None,
652
+ **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
653
+ ) -> BaseModelOutputWithPast:
654
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
655
+ output_hidden_states = (
656
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
657
+ )
658
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
659
+
660
+ if (input_ids is None) ^ (inputs_embeds is not None):
661
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
662
+
663
+ if inputs_embeds is None:
664
+ inputs_embeds = self.embed_tokens(input_ids)
665
+
666
+ if use_cache and past_key_values is None:
667
+ past_key_values = DynamicCache()
668
+
669
+ if cache_position is None:
670
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
671
+ cache_position = torch.arange(
672
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
673
+ )
674
+
675
+ if position_ids is None:
676
+ position_ids = cache_position.unsqueeze(0)
677
+
678
+ causal_mask = self._update_causal_mask(
679
+ attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
680
+ )
681
+
682
+ hidden_states = inputs_embeds
683
+ position_embeddings = self.rotary_emb(hidden_states, position_ids)
684
+
685
+ all_hidden_states = () if output_hidden_states else None
686
+ all_self_attns = () if output_attentions else None
687
+
688
+ for decoder_layer in self.layers[: self.config.num_hidden_layers]:
689
+ if output_hidden_states:
690
+ all_hidden_states += (hidden_states,)
691
+ layer_outputs = decoder_layer(
692
+ hidden_states,
693
+ attention_mask=causal_mask,
694
+ position_ids=position_ids,
695
+ past_key_value=past_key_values,
696
+ output_attentions=output_attentions,
697
+ instance_length=instance_length,
698
+ use_cache=use_cache,
699
+ cache_position=cache_position,
700
+ position_embeddings=position_embeddings,
701
+ **flash_attn_kwargs,
702
+ )
703
+ hidden_states = layer_outputs[0]
704
+ if output_attentions:
705
+ all_self_attns += (layer_outputs[1],)
706
+
707
+ hidden_states = self.norm(hidden_states)
708
+
709
+ if output_hidden_states:
710
+ all_hidden_states += (hidden_states,)
711
+
712
+ return BaseModelOutputWithPast(
713
+ last_hidden_state=hidden_states,
714
+ past_key_values=past_key_values if use_cache else None,
715
+ hidden_states=all_hidden_states,
716
+ attentions=all_self_attns,
717
+ )
718
+
719
+ def _update_causal_mask(
720
+ self,
721
+ attention_mask: torch.Tensor,
722
+ input_tensor: torch.Tensor,
723
+ cache_position: torch.Tensor,
724
+ past_key_values: Cache,
725
+ output_attentions: bool = False,
726
+ ):
727
+ if self.config._attn_implementation == "flash_attention_2":
728
+ if attention_mask is not None and (attention_mask == 0.0).any():
729
+ return attention_mask
730
+ return None
731
+
732
+ if self.config._attn_implementation == "flex_attention":
733
+ if isinstance(attention_mask, torch.Tensor):
734
+ attention_mask = make_flex_block_causal_mask(attention_mask)
735
+ if isinstance(attention_mask, BlockMask):
736
+ return attention_mask
737
+
738
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
739
+ using_static_cache = isinstance(past_key_values, StaticCache)
740
+
741
+ if self.config._attn_implementation == "sdpa" and not using_static_cache and not output_attentions:
742
+ if AttentionMaskConverter._ignore_causal_mask_sdpa(
743
+ attention_mask,
744
+ inputs_embeds=input_tensor,
745
+ past_key_values_length=past_seen_tokens,
746
+ is_training=self.training,
747
+ ):
748
+ return None
749
+
750
+ dtype, device = input_tensor.dtype, input_tensor.device
751
+ sequence_length = input_tensor.shape[1]
752
+ if using_static_cache:
753
+ target_length = past_key_values.get_max_cache_shape()
754
+ else:
755
+ target_length = (
756
+ attention_mask.shape[-1]
757
+ if isinstance(attention_mask, torch.Tensor)
758
+ else past_seen_tokens + sequence_length + 1
759
+ )
760
+
761
+ causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
762
+ attention_mask,
763
+ sequence_length=sequence_length,
764
+ target_length=target_length,
765
+ dtype=dtype,
766
+ device=device,
767
+ cache_position=cache_position,
768
+ batch_size=input_tensor.shape[0],
769
+ )
770
+
771
+ if (
772
+ self.config._attn_implementation == "sdpa"
773
+ and attention_mask is not None
774
+ and attention_mask.device.type in ["cuda", "xpu"]
775
+ and not output_attentions
776
+ ):
777
+ min_dtype = torch.finfo(dtype).min
778
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
779
+
780
+ return causal_mask
781
+
782
+ @staticmethod
783
+ def _prepare_4d_causal_attention_mask_with_cache_position(
784
+ attention_mask: torch.Tensor,
785
+ sequence_length: int,
786
+ target_length: int,
787
+ dtype: torch.dtype,
788
+ device: torch.device,
789
+ cache_position: torch.Tensor,
790
+ batch_size: int,
791
+ **kwargs,
792
+ ):
793
+ """
794
+ Args:
795
+ attention_mask (`torch.Tensor`):
796
+ A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape
797
+ `(batch_size, 1, query_length, key_value_length)`.
798
+ sequence_length (`int`):
799
+ The sequence length being processed.
800
+ target_length (`int`):
801
+ The target length: when generating with static cache, the mask should be as long as the static cache,
802
+ to account for the 0 padding, the part of the cache that is not filled yet.
803
+ dtype (`torch.dtype`):
804
+ The dtype to use for the 4D attention mask.
805
+ device (`torch.device`):
806
+ The device to place the 4D attention mask on.
807
+ cache_position (`torch.Tensor`):
808
+ Indices depicting the position of the input sequence tokens in the sequence.
809
+ batch_size (`int`):
810
+ Batch size.
811
+ """
812
+ if attention_mask is not None and attention_mask.dim() == 4:
813
+ causal_mask = attention_mask
814
+ else:
815
+ min_dtype = torch.finfo(dtype).min
816
+ causal_mask = torch.full(
817
+ (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
818
+ )
819
+ if sequence_length != 1:
820
+ causal_mask = torch.triu(causal_mask, diagonal=1)
821
+ causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
822
+ causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
823
+ if attention_mask is not None:
824
+ causal_mask = causal_mask.clone()
825
+ mask_length = attention_mask.shape[-1]
826
+ padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
827
+ causal_mask.device
828
+ )
829
+ padding_mask = padding_mask == 0
830
+ causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
831
+ padding_mask, min_dtype
832
+ )
833
+
834
+ return causal_mask
835
+
836
+
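The mask built by `_prepare_4d_causal_attention_mask_with_cache_position` blocks key position `j` for a query whenever `j` is greater than that query's `cache_position`, and additionally blocks padded key positions. The sketch below reproduces that rule for one batch element on Python lists. It uses `-inf` where the source uses `torch.finfo(dtype).min`, and the function name and inputs are hypothetical.

```python
def causal_mask(cache_position, target_length, padding_mask=None):
    masked = float("-inf")  # the source fills with torch.finfo(dtype).min instead
    rows = []
    for pos in cache_position:
        # Keys strictly after the query's absolute position are masked.
        row = [masked if j > pos else 0.0 for j in range(target_length)]
        if padding_mask is not None:
            # Keys flagged 0 in the 2D padding mask are masked as well.
            row = [masked if keep == 0 else v for v, keep in zip(row, padding_mask)]
        rows.append(row)
    return rows


# Two new tokens at absolute positions 2 and 3, attending over 4 key slots.
mask = causal_mask([2, 3], target_length=4)
padded = causal_mask([2, 3], target_length=4, padding_mask=[0, 1, 1, 1])
```

Because the rule is expressed through absolute positions, the same construction covers both the prefill step and later decoding steps that reuse cached keys.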
837
+ class KwargsForCausalLM(FlashAttentionKwargs): ...
838
+
839
+
840
+ class YoutuForCausalLM(YoutuPreTrainedModel, GenerationMixin):
841
+ _tied_weights_keys = ["lm_head.weight"]
842
+ _tp_plan = {"lm_head": "colwise_rep"}
843
+ _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
844
+
845
+ def __init__(self, config):
846
+ super().__init__(config)
847
+
848
+ self.model = YoutuModel(config)
849
+ self.vocab_size = config.vocab_size
850
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
851
+
852
+ self.post_init()
853
+
854
+ def get_input_embeddings(self):
855
+ return self.model.embed_tokens
856
+
857
+ def set_input_embeddings(self, value):
858
+ self.model.embed_tokens = value
859
+
860
+ def get_output_embeddings(self):
861
+ return self.lm_head
862
+
863
+ def set_output_embeddings(self, new_embeddings):
864
+ self.lm_head = new_embeddings
865
+
866
+ def set_decoder(self, decoder):
867
+ self.model = decoder
868
+
869
+ def get_decoder(self):
870
+ return self.model
871
+
872
+ @can_return_tuple
873
+ @deprecate_kwarg("num_logits_to_keep", version="4.50", new_name="logits_to_keep")
874
+ @add_start_docstrings_to_model_forward(YOUTU_VL_INPUTS_DOCSTRING)
875
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
876
+ def forward(
877
+ self,
878
+ input_ids: Optional[torch.LongTensor] = None,
879
+ attention_mask: Optional[torch.Tensor] = None,
880
+ position_ids: Optional[torch.LongTensor] = None,
881
+ past_key_values: Optional[Cache] = None,
882
+ inputs_embeds: Optional[torch.FloatTensor] = None,
883
+ labels: Optional[torch.LongTensor] = None,
884
+ use_cache: Optional[bool] = None,
885
+ output_attentions: Optional[bool] = None,
886
+ output_hidden_states: Optional[bool] = None,
887
+ cache_position: Optional[torch.LongTensor] = None,
888
+ logits_to_keep: Union[int, torch.Tensor] = 0,
889
+ **kwargs: Unpack[KwargsForCausalLM],
890
+ ) -> CausalLMOutputWithPast:
891
+ r"""
892
+ Returns:
893
+
894
+ """
895
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
896
+ output_hidden_states = (
897
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
898
+ )
899
+
900
+ outputs: BaseModelOutputWithPast = self.model(
901
+ input_ids=input_ids,
902
+ attention_mask=attention_mask,
903
+ position_ids=position_ids,
904
+ past_key_values=past_key_values,
905
+ inputs_embeds=inputs_embeds,
906
+ use_cache=use_cache,
907
+ output_attentions=output_attentions,
908
+ output_hidden_states=output_hidden_states,
909
+ cache_position=cache_position,
910
+ **kwargs,
911
+ )
912
+
913
+ hidden_states = outputs.last_hidden_state
914
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
915
+ logits = self.lm_head(hidden_states[:, slice_indices, :])
916
+
917
+ loss = None  # `labels` is accepted but no loss is computed in this forward pass
918
+
919
+ return CausalLMOutputWithPast(
920
+ loss=loss,
921
+ logits=logits,
922
+ past_key_values=outputs.past_key_values,
923
+ hidden_states=outputs.hidden_states,
924
+ attentions=outputs.attentions,
925
+ )
926
+
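The `logits_to_keep` handling in `YoutuForCausalLM.forward` relies on a small slicing trick: for an integer `n`, `slice(-n, None)` keeps the last `n` positions, and since `-0 == 0` the default of 0 keeps every position. A list-based sketch (the helper name `select_positions` is hypothetical; in the source a tensor of indices is used directly for indexing):

```python
def select_positions(hidden_states, logits_to_keep=0):
    # Same expression as in the source for building the index.
    idx = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
    if isinstance(idx, slice):
        return hidden_states[idx]
    # Non-integer input is treated as explicit positions to keep.
    return [hidden_states[i] for i in idx]


states = ["h0", "h1", "h2", "h3"]
all_positions = select_positions(states)          # default: keep everything
last_only = select_positions(states, 1)           # typical during generation
chosen = select_positions(states, [0, 2])         # explicit indices
```

Keeping only the last position avoids projecting the full sequence through `lm_head` when only the next-token logits are needed.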
927
+ class VLPatchMerger(nn.Module):
928
+ def __init__(self, dim: int, context_dim: int, spatial_merge_size: int = 2) -> None:
929
+ super().__init__()
930
+ self.hidden_size = context_dim * (spatial_merge_size**2)
931
+ self.ln_q = YoutuRMSNorm(context_dim, eps=1e-06)
932
+ self.mlp = nn.Sequential(
933
+ nn.Linear(self.hidden_size, self.hidden_size),
934
+ nn.GELU(),
935
+ nn.Linear(self.hidden_size, dim),
936
+ )
937
+
938
+ def forward(self, x: torch.Tensor, spatial_shapes: torch.Tensor) -> torch.Tensor:
939
+ x = self.ln_q(x).view(-1, self.hidden_size)
940
+ x = self.mlp(x)
941
+ return x
942
+
943
+
944
+ class YoutuDensePrediction:
945
+ def __init__(self, custom_tokens):
946
+ self.custom_tokens = custom_tokens
947
+ self.custom_ids = list(range(self.custom_tokens["<custom_1>"][0], self.custom_tokens["<custom_1>"][0] + 1000))
948
+
949
+ def dense_crf(self, probs, img, iters=10, kernel='both'):
950
+ C, H, W = probs.shape
951
+ img = np.array(img)
952
+ d = dcrf.DenseCRF2D(W, H, C)
953
+ U = unary_from_softmax(probs)
954
+ d.setUnaryEnergy(U)
955
+ d.addPairwiseGaussian(sxy=(3, 3), compat=3, kernel=dcrf.DIAG_KERNEL, normalization=dcrf.NORMALIZE_SYMMETRIC)
956
+ if kernel in ['bilateral', 'both']:
957
+ d.addPairwiseBilateral(sxy=(80, 80), srgb=(13, 13, 13), rgbim=img, compat=10,
958
+ kernel=dcrf.DIAG_KERNEL, normalization=dcrf.NORMALIZE_SYMMETRIC)
959
+ Q = d.inference(iters)
960
+ pred = np.argmax(Q, 0)
961
+ return pred
962
+
963
+ def contains_subsequence(self, seq, sub):
964
+ if not seq or not sub:
965
+ return False
966
+ if isinstance(sub[0], int):
967
+ subs = [sub]
968
+ else:
969
+ subs = sub
970
+ n = len(seq)
971
+ for s in subs:
972
+ m = len(s)
973
+ if m == 0 or m > n:
974
+ continue
975
+ for i in range(n - m + 1):
976
+ if seq[i: i + m] == s:
977
+ return True
978
+ return False
979
+
980
+
981
+ def extract_ref_spans(self, token_list):
982
+ spans = []
983
+ i = 0
984
+ while i < len(token_list):
985
+ if token_list[i] != self.custom_tokens["<ref>"][0]:
986
+ i += 1
987
+ continue
988
+ j = i + 1
989
+ while j < len(token_list) and token_list[j] != self.custom_tokens["</ref>"][0]:
990
+ j += 1
991
+ if j < len(token_list):
992
+ spans.append(token_list[i + 1 : j])
993
+ i = j + 1
994
+ else:
995
+ break
996
+ return spans
997
+
998
+ def dense_decoding(self, inp_ids, output, inp_shape=None, dense_logits=None, raw_img=None, use_crf=False):
999
+ img_token_id = self.custom_tokens["<|image_pad|>"][0]
1000
+ img_token_mask = inp_ids[0] == img_token_id
1001
+
1002
+ logits = dense_logits[0]
1003
+ img_logits = logits[img_token_mask]
1004
+ target_logits = []
1005
+ w, h = inp_shape
1006
+ raw_w, raw_h = raw_img.size
1007
+
1008
+ if self.contains_subsequence(output, self.custom_tokens["<depth>"]):
1009
+ target_logits = img_logits[:, self.custom_ids]
1010
+ pred = target_logits.reshape(1, h, w, -1).permute(0, 3, 1, 2)
1011
+ pred = F.interpolate(pred, size=(h*2, w*2), mode='bilinear', align_corners=False)
1012
+ pred = pred[0].argmax(0).cpu().numpy().astype('uint16')
1013
+ pred = pred.reshape(-1)
1014
+ else:
1015
+ labels = self.extract_ref_spans(output)
1016
+ for tokens in labels:
1017
+ if tokens:
1018
+ target_logits.append(img_logits[:, tokens].mean(-1))
1019
+ if target_logits:
1020
+ pred = torch.stack(target_logits, 0)
1021
+ if inp_shape is not None:
1022
+ if self.custom_tokens["<OTHERS>"] in labels:
1023
+ pred = torch.sigmoid(pred)
1024
+ others_idx = labels.index(self.custom_tokens["<OTHERS>"])
1025
+ pred[others_idx] = 0.5
1026
+ else:
1027
+ pred = pred / 0.2
1028
+ pred = torch.softmax(pred, dim=0)
1029
+
1030
+ pred_reshape = pred.reshape((-1, h, w))
1031
+ pred_resize = F.interpolate(pred_reshape.unsqueeze(0), size=(raw_h, raw_w), mode='bilinear', align_corners=False)
1032
+ pred_resize = pred_resize.float().cpu().numpy()
1033
+ if use_crf:
1034
+ pred = self.dense_crf(pred_resize[0], raw_img)
1035
+ else:
1036
+ pred = pred_resize[0].argmax(0).reshape(-1)
1037
+ else:
1038
+ pred = pred.argmax(0)
1039
+
1040
+ def encode_int_as_digit_tokens(x: int):
1041
+ s = str(int(x))
1042
+ return [self.custom_tokens["digit_start"][0] + (ord(ch) - ord("0")) for ch in s]
1043
+
1048
+ def rle_value_run(arr):
1049
+ if isinstance(arr, torch.Tensor):
1050
+ arr = arr.detach().cpu().numpy()
1051
+ runs = []
1052
+ n = len(arr)
1053
+ if n == 0:
1054
+ return runs
1055
+ prev = int(arr[0])
1056
+ cnt = 1
1057
+ for i in range(1, n):
1058
+ v = int(arr[i])
1059
+ if v == prev:
1060
+ cnt += 1
1061
+ else:
1062
+ runs.append((prev, cnt))
1063
+ prev = v
1064
+ cnt = 1
1065
+ runs.append((prev, cnt))
1066
+ return runs
1067
+
1068
+ def build_mask_rle_token_ids_from_runs(runs):
1069
+ body = []
1070
+ m = len(runs)
1071
+ for i, (v, c) in enumerate(runs):
1072
+ body.append(self.custom_tokens["<mask_rle>"][0])
1073
+ body.extend(encode_int_as_digit_tokens(v))
1074
+ body.append(self.custom_tokens["comma"][0])
1075
+ body.extend(encode_int_as_digit_tokens(c))
1076
+ body.append(self.custom_tokens["</mask_rle>"][0])
1077
+ if i != m - 1:
1078
+ body.append(self.custom_tokens["comma"][0])
1079
+ return self.custom_tokens["<mask>"] + body + self.custom_tokens["</mask>"]
1080
+
1081
+ runs = rle_value_run(pred if isinstance(pred, torch.Tensor) else torch.as_tensor(pred))
1082
+ mask_token_ids = build_mask_rle_token_ids_from_runs(runs)
1083
+ return mask_token_ids
1084
+
1085
+ def convert_coord_ids(self, ids, scale_x, scale_y, max_coord=2047):
1086
+ x0_id = self.custom_tokens["<x_0>"][0]
1087
+ y_max_id = self.custom_tokens["<y_2047>"][0]
1088
+ out = []
1089
+ for tid in ids:
1090
+ if x0_id <= tid <= y_max_id:
1091
+ offset = tid - x0_id
1092
+ is_y = (offset & 1) == 1
1093
+ i = offset >> 1
1094
+ if 0 <= i <= max_coord:
1095
+ if not is_y:
1096
+ new_i = int(round(i * scale_x))
1097
+ new_i = 0 if new_i < 0 else (max_coord if new_i > max_coord else new_i)
1098
+ new_tid = x0_id + (new_i << 1)
1099
+ else:
1100
+ new_i = int(round(i * scale_y))
1101
+ new_i = 0 if new_i < 0 else (max_coord if new_i > max_coord else new_i)
1102
+ new_tid = x0_id + (new_i << 1) + 1
1103
+ out.append(new_tid)
1104
+ continue
1105
+ out.append(tid)
1106
+ return out
1107
+
1108
+ def _is_url(self, s):
1109
+ return s.startswith("http://") or s.startswith("https://")
1110
+
1111
+ def load_image(self, img_input):
1112
+ if img_input is None:
1113
+ raise ValueError("img_input is None")
1114
+ if not isinstance(img_input, str):
1115
+ raise TypeError(
1116
+ f"Unsupported img_input type (only str supported): {type(img_input)}"
1117
+ )
1118
+ s = img_input.strip()
1119
+ if not s:
1120
+ raise ValueError("img_input is empty string")
1121
+ if self._is_url(s):
1122
+ resp = requests.get(s, timeout=10)
1123
+ resp.raise_for_status()
1124
+ img = Image.open(BytesIO(resp.content))
1125
+ return img.convert("RGB")
1126
+ if os.path.isfile(s):
1127
+ with open(s, "rb") as f:
1128
+ img = Image.open(f)
1129
+ return img.convert("RGB")
1130
+ try:
1131
+ b64 = "".join(s.split())
1132
+ img_bytes = base64.b64decode(b64, validate=True)
1133
+ except Exception as e:
1134
+ raise ValueError(
1135
+ "img_input is not a valid URL, file path, or pure base64 string"
1136
+ ) from e
1137
+ try:
1138
+ img = Image.open(BytesIO(img_bytes))
1139
+ return img.convert("RGB")
1140
+ except Exception as e:
1141
+ raise ValueError(
1142
+ "Base64 decoded successfully, but content is not a valid image"
1143
+ ) from e
1144
+
1145
+ def __call__(self, input_ids, spatial_shapes, dense_logits, output, img_input, use_crf):
1146
+ output_ids = output[0, input_ids.shape[1]:].tolist()
1147
+ if any(self.custom_tokens["<x_0>"][0] <= tid <= self.custom_tokens["<y_2047>"][0] for tid in output_ids):
1148
+ img = self.load_image(img_input)
1149
+ raw_w, raw_h = img.size
1150
+ inp_w, inp_h = spatial_shapes[0][1].item() * 16, spatial_shapes[0][0].item() * 16
1151
+ scale_w, scale_h = float(raw_w) / inp_w, float(raw_h) / inp_h
1152
+ coord_ids = self.convert_coord_ids(output_ids, scale_w, scale_h)
1153
+ coord_tensor = torch.tensor(coord_ids, dtype=output.dtype, device=output.device).unsqueeze(0)
1154
+ output = torch.cat([output[:, :input_ids.shape[1]], coord_tensor], dim=1)
1155
+ elif ((self.custom_tokens["<ref>"][0] in output_ids and self.custom_tokens["<ins>"][0] not in output_ids) or self.contains_subsequence(output_ids, self.custom_tokens["<depth>"])):
1156
+ img = self.load_image(img_input)
1157
+ inp_w, inp_h = spatial_shapes[0][1].item() // 2, spatial_shapes[0][0].item() // 2
1158
+ mask_ids = self.dense_decoding(input_ids, output_ids, (inp_w, inp_h), dense_logits, img, use_crf)
1159
+ mask_tensor = torch.tensor(mask_ids, dtype=output.dtype, device=output.device).unsqueeze(0)
1160
+ output = torch.cat([output, mask_tensor], dim=1)
1161
+ return output
1162
+
1163
+ class YoutuVLForConditionalGeneration(YoutuPreTrainedModel, GenerationMixin):
1164
+ _tied_weights_keys = ["lm_head.weight"]
1165
+ _tp_plan = {"lm_head": "colwise_rep"}
1166
+ _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
1167
+
1168
+ def __init__(self, config):
1169
+ super().__init__(config)
1170
+
1171
+ config.vision_config.out_hidden_size = config.hidden_size
1172
+ config.vision_config.vision_use_head = False
1173
+ self.siglip2 = Siglip2VisionModel._from_config(config.vision_config)
1174
+ self.merger = VLPatchMerger(
1175
+ dim=config.hidden_size,
1176
+ context_dim=config.vision_config.hidden_size,
1177
+ spatial_merge_size=2,
1178
+ )
1179
+ self.rope_deltas = None
1180
+
1181
+ self.model = YoutuModel(config)
1182
+ self.vocab_size = config.vocab_size
1183
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1184
+ self.dense_logits = None
1185
+ self.dense_prediction = YoutuDensePrediction(config.custom_tokens)
1186
+ self.post_init()
1187
+
1188
+
1189
+ def get_input_embeddings(self):
1190
+ return self.model.embed_tokens
1191
+
1192
+ def set_input_embeddings(self, value):
1193
+ self.model.embed_tokens = value
1194
+
1195
+ def get_output_embeddings(self):
1196
+ return self.lm_head
1197
+
1198
+ def set_output_embeddings(self, new_embeddings):
1199
+ self.lm_head = new_embeddings
1200
+
1201
+ def set_decoder(self, decoder):
1202
+ self.model = decoder
1203
+
1204
+ def get_decoder(self):
1205
+ return self.model
1206
+
1207
+
1208
+ def generate(self, *args, img_input=None, use_crf=False, **kwargs):
1209
+ kwargs.pop("img_input", None)
1210
+ kwargs.pop("use_crf", None)
1211
+ output = super().generate(*args, **kwargs)
1212
+ if img_input is None:
1213
+ return output
1214
+ if isinstance(output, torch.Tensor):
1215
+ sequences = output
1216
+ generate_output = None
1217
+ else:
1218
+ sequences = output.sequences
1219
+ generate_output = output
1220
+
1221
+ input_ids = kwargs.get("input_ids", None)
1222
+ spatial_shapes = kwargs.get("spatial_shapes", None)
1223
+ sequences_with_mask = self.dense_prediction(
1224
+ input_ids,
1225
+ spatial_shapes,
1226
+ self.dense_logits,
1227
+ sequences,
1228
+ img_input,
1229
+ use_crf
1230
+ )
1231
+ if generate_output is None:
1232
+ return sequences_with_mask
1233
+ else:
1234
+ generate_output.sequences = sequences_with_mask
1235
+ return generate_output
1236
+
1237
+
1238
+ @can_return_tuple
1239
+ @deprecate_kwarg("num_logits_to_keep", version="4.50", new_name="logits_to_keep")
1240
+ @add_start_docstrings_to_model_forward(YOUTU_VL_INPUTS_DOCSTRING)
1241
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1242
+ def forward(
1243
+ self,
1244
+ input_ids: Optional[torch.LongTensor] = None,
1245
+ attention_mask: Optional[torch.Tensor] = None,
1246
+ position_ids: Optional[torch.LongTensor] = None,
1247
+ past_key_values: Optional[Cache] = None,
1248
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1249
+ labels: Optional[torch.LongTensor] = None,
1250
+ use_cache: Optional[bool] = None,
1251
+ output_attentions: Optional[bool] = None,
1252
+ output_hidden_states: Optional[bool] = None,
1253
+ pixel_values: Optional[torch.Tensor] = None,
1254
+ pixel_attention_mask: Optional[torch.LongTensor] = None,
1255
+ spatial_shapes: Optional[torch.LongTensor] = None,
1256
+ instance_length: Optional[torch.LongTensor] = None,
1257
+ coefficients: Optional[torch.FloatTensor] = None,
1258
+ rope_deltas: Optional[torch.LongTensor] = None,
1259
+ cache_position: Optional[torch.LongTensor] = None,
1260
+ logits_to_keep: Union[int, torch.Tensor] = 0,
1261
+ **kwargs: Unpack[KwargsForCausalLM],
1262
+ ) -> CausalLMOutputWithPast:
1263
+ r"""
1264
+ Example:
1265
+ TODO: Add example
1266
+
1267
+ Returns:
1268
+ """
1269
+
1270
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1271
+ output_hidden_states = (
1272
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1273
+ )
1274
+
1275
+ if inputs_embeds is None:
1276
+ inputs_embeds = self.model.embed_tokens(input_ids)
1277
+
1278
+ if pixel_values is not None:
1279
+ bs, length, dim_size = inputs_embeds.shape
1280
+ pixel_values = pixel_values.type(self.siglip2.dtype)
1281
+
1282
+ image_embeds = self.siglip2(pixel_values, pixel_attention_mask, spatial_shapes).last_hidden_state
1283
+ image_embeds = self.merger(image_embeds, spatial_shapes)
1284
+
1285
+ n_image_tokens = (input_ids == self.config.image_token_id).sum().item()
1286
+ n_image_features = image_embeds.shape[0]
1287
+
1288
+ if n_image_tokens > n_image_features:
1289
+ raise ValueError(
1290
+ "Image features and image tokens do not match: tokens: {}, features {}".format(
1291
+ n_image_tokens, n_image_features
1292
+ )
1293
+ )
1294
+
1295
+ mask = input_ids == self.config.image_token_id
1296
+ mask_unsqueezed = mask.unsqueeze(-1)
1297
+ mask_expanded = mask_unsqueezed.expand_as(inputs_embeds)
1298
+ image_mask = mask_expanded.to(inputs_embeds.device)
1299
+ image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
1300
+
1301
+ if bs != 1:
1302
+ raise ValueError("Only support batch size = 1")
1303
+
1304
+ image_embeds = image_embeds.unsqueeze(0)
1305
+ inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
1306
+
1307
+ if attention_mask is not None:
1308
+ attention_mask = attention_mask.to(inputs_embeds.device)
1309
+
1310
+ outputs: BaseModelOutputWithPast = self.model(
1311
+ input_ids=None,
1312
+ attention_mask=attention_mask,
1313
+ position_ids=position_ids,
1314
+ past_key_values=past_key_values,
1315
+ inputs_embeds=inputs_embeds,
1316
+ use_cache=use_cache,
1317
+ output_attentions=output_attentions,
1318
+ output_hidden_states=output_hidden_states,
1319
+ cache_position=cache_position,
1320
+ instance_length=instance_length,
1321
+ **kwargs,
1322
+ )
1323
+
1324
+ hidden_states = outputs.last_hidden_state
1325
+ logits = self.lm_head(hidden_states)
1326
+ if logits.shape[1] != 1:
1327
+ self.dense_logits = logits
1328
+ loss = None
1329
+
1330
+ return CausalLMOutputWithPast(
1331
+ loss=loss,
1332
+ logits=logits,
1333
+ past_key_values=outputs.past_key_values,
1334
+ hidden_states=outputs.hidden_states,
1335
+ attentions=outputs.attentions,
1336
+ )
1337
+
1338
+ __all__ = ["YoutuPreTrainedModel", "YoutuModel", "YoutuVLForConditionalGeneration"]
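The bit arithmetic in `convert_coord_ids` above implies that coordinate tokens form one interleaved block starting at `<x_0>`: even offsets are x tokens, odd offsets are y tokens, and `offset >> 1` is the coordinate value. The sketch below restates that mapping for a single token as a standalone function. The base id `X0_ID = 1000` is a made-up value for illustration (the real ids come from `config.custom_tokens`), and computing the upper bound from `X0_ID` assumes the ids are contiguous, which the original code looks up rather than computes.

```python
MAX_COORD = 2047
X0_ID = 1000  # hypothetical id of "<x_0>"; the real value comes from config.custom_tokens


def rescale_coord_token(tid: int, scale_x: float, scale_y: float) -> int:
    """Rescale one interleaved coordinate token; other token ids pass through unchanged."""
    # Assumption: ids run contiguously from <x_0> to <y_2047>, alternating x and y.
    y_max_id = X0_ID + 2 * MAX_COORD + 1
    if not (X0_ID <= tid <= y_max_id):
        return tid
    offset = tid - X0_ID
    is_y = offset % 2 == 1      # odd offsets are y tokens
    index = offset // 2         # coordinate value encoded by the token
    scaled = int(round(index * (scale_y if is_y else scale_x)))
    scaled = max(0, min(MAX_COORD, scaled))  # clamp into the valid coordinate range
    return X0_ID + 2 * scaled + (1 if is_y else 0)
```

With these assumed ids, an x token for coordinate 100 scaled by 2.0 becomes the x token for coordinate 200, while a y token keeps its odd offset after scaling.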
preprocessor_config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_youtu_vl.YoutuVLProcessor",
4
+ "AutoImageProcessor": "image_processing_siglip2_fast.Siglip2ImageProcessorFast"
5
+ },
6
+ "processor_class": "YoutuVLProcessor",
7
+ "do_convert_rgb": null,
8
+ "do_normalize": true,
9
+ "do_rescale": true,
10
+ "do_resize": true,
11
+ "image_mean": [
12
+ 0.5,
13
+ 0.5,
14
+ 0.5
15
+ ],
16
+ "image_processor_type": "Siglip2ImageProcessorFast",
17
+ "image_std": [
18
+ 0.5,
19
+ 0.5,
20
+ 0.5
21
+ ],
22
+ "max_num_patches": 256,
23
+ "patch_size": 16,
24
+ "resample": 2,
25
+ "rescale_factor": 0.00392156862745098
26
+ }
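As a worked illustration of the values in this config: `rescale_factor` is 1/255, and with `image_mean` and `image_std` both 0.5, rescaling followed by normalization maps 8-bit channel values into roughly [-1, 1]. The sketch below only reproduces that arithmetic for one channel value; the function name is hypothetical and it is not the image processor's actual code path.

```python
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255, from preprocessor_config.json
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5


def normalize_pixel(value: int) -> float:
    """Rescale an 8-bit channel value to [0, 1], then normalize with mean and std."""
    return (value * RESCALE_FACTOR - IMAGE_MEAN) / IMAGE_STD
```

Under this arithmetic, 0 maps to -1.0, 255 maps to approximately 1.0, and mid-gray values land near 0.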
processing_youtu_vl.py ADDED
@@ -0,0 +1,187 @@
1
+ # coding=utf-8
2
+ # Copyright 2026 Tencent Youtu Lab and the HuggingFace Inc. team. All rights reserved.
3
+
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ from typing import List, Union
16
+ import numpy
17
+ from transformers.feature_extraction_utils import BatchFeature
18
+ from transformers.image_utils import ImageInput
19
+ from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack, VideosKwargs
20
+ from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
21
+
22
+ class YoutuVLVideosProcessorKwargs(VideosKwargs, total=False):
23
+ fps: Union[List[float], float]
24
+
25
+
26
+ class YoutuVLProcessorKwargs(ProcessingKwargs, total=False):
27
+ videos_kwargs: YoutuVLVideosProcessorKwargs
28
+ _defaults = {
29
+ "text_kwargs": {
30
+ "padding": False,
31
+ },
32
+ "videos_kwargs": {"fps": 2.0},
33
+ }
34
+
35
+
36
+ class YoutuVLProcessor(ProcessorMixin):
37
+
38
+ attributes = ["image_processor", "tokenizer"]
39
+ valid_kwargs = ["chat_template"]
40
+
41
+ image_processor_class = "AutoImageProcessor"
42
+ tokenizer_class = ("PreTrainedTokenizer", "PreTrainedTokenizerFast")
43
+
44
+ def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
45
+ self.image_token = "<|image_pad|>" if not hasattr(tokenizer, "image_token") else tokenizer.image_token
46
+ self.video_token = "<|video_pad|>" if not hasattr(tokenizer, "video_token") else tokenizer.video_token
47
+ super().__init__(image_processor, tokenizer, chat_template=chat_template)
48
+
49
+ def __call__(
50
+ self,
51
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
52
+ images: ImageInput = None,
53
+ max_image_patches: int = 36864,
54
+ **kwargs: Unpack[YoutuVLProcessorKwargs],
55
+ ) -> BatchFeature:
56
+ """
57
+ Args:
58
+ images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`,
59
+ `List[np.ndarray]`, `List[torch.Tensor]`):
60
+ The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
61
+ tensor. Both channels-first and channels-last formats are supported.
62
+ text (`str`, `List[str]`, `List[List[str]]`):
63
+ The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
64
+ (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
65
+ `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
66
+ return_tensors (`str` or [`~utils.TensorType`], *optional*):
67
+ If set, will return tensors of a particular framework. Acceptable values are:
68
+ - `'tf'`: Return TensorFlow `tf.constant` objects.
69
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
70
+ - `'np'`: Return NumPy `np.ndarray` objects.
71
+ - `'jax'`: Return JAX `jnp.ndarray` objects.
72
+
73
+ Returns:
74
+ [`BatchFeature`]: A [`BatchFeature`] with the following fields:
75
+
76
+ - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
77
+ - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
78
+ `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
79
+ `None`).
80
+ - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
81
+ - **pixel_values_videos** -- Pixel values of videos to be fed to a model.
82
+ Returned when `videos` is not `None`.
83
+ - **spatial_shapes** -- Patch-grid height and width of each image. Returned when `images` is not `None`.
84
+ - **video_grid_thw** -- List of video 3D grid in LLM. Returned when `videos` is not `None`.
85
+ - **second_per_grid_ts** -- List of video seconds per time grid. Returned when `videos` is not `None`.
86
+ """
87
+ output_kwargs = self._merge_kwargs(
88
+ YoutuVLProcessorKwargs,
89
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
90
+ **kwargs,
91
+ )
92
+ if images is not None:
93
+ image_inputs = self.image_processor(images=images, max_num_patches=max_image_patches, return_tensors="pt")
94
+ else:
95
+ image_inputs = {}
96
+ image_grid_thw = None
97
+
98
+ videos_inputs = {}
99
+ video_grid_thw = None
100
+
101
+ if not isinstance(text, list):
102
+ text = [text]
103
+
104
+ image_tokens = []
105
+ if images is not None:
106
+ merge_length = 4
107
+ index = 0
108
+ for i in range(len(text)):
109
+ while self.image_token in text[i]:
110
+ h = image_inputs['spatial_shapes'][index][0]
111
+ w = image_inputs['spatial_shapes'][index][1]
112
+ repeats = h * w // merge_length
113
+ text[i] = text[i].replace(
114
+ self.image_token,
115
+ "<|placeholder|>" * repeats,
116
+ 1,
117
+ )
118
+ index += 1
119
+ text[i] = text[i].replace("<|placeholder|>", self.image_token)
120
+ assert index == image_inputs['spatial_shapes'].shape[0]
121
+
122
+
123
+ if video_grid_thw is not None:
124
+ merge_length = self.image_processor.merge_size ** 2
125
+ index = 0
126
+ for i in range(len(text)):
127
+ while self.video_token in text[i]:
128
+ text[i] = text[i].replace(
129
+ self.video_token,
130
+ "<|placeholder|>" * (video_grid_thw[index].prod() // merge_length),
131
+ 1,
132
+ )
133
+ index += 1
134
+ text[i] = text[i].replace("<|placeholder|>", self.video_token)
135
+
136
+ text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
137
+
138
+ return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs})
139
+
140
+ def get_max_image_patches(self, images):
141
+ return self.image_processor.get_max_image_patches(images)
142
+
143
+ def batch_decode(self, *args, **kwargs):
144
+ return self.tokenizer.batch_decode(*args, **kwargs)
145
+
146
+ def decode(self, *args, **kwargs):
147
+ return self.tokenizer.decode(*args, **kwargs)
148
+
149
+ def post_process_image_text_to_text(
150
+ self, generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False, **kwargs
151
+ ):
152
+ """
153
+ Post-process the output of the model to decode the text.
154
+
155
+ Args:
156
+ generated_outputs (`torch.Tensor` or `np.ndarray`):
157
+ The output of the model `generate` function. The output is
158
+ expected to be a tensor of shape `(batch_size, sequence_length)`
159
+ or `(sequence_length,)`.
160
+ skip_special_tokens (`bool`, *optional*, defaults to `True`):
161
+ Whether or not to remove special tokens in the output. Argument
162
+ passed to the tokenizer's `batch_decode` method.
163
+ clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
164
+ Whether or not to clean up the tokenization spaces. Argument
165
+ passed to the tokenizer's `batch_decode` method.
166
+ **kwargs:
167
+ Additional arguments to be passed to the tokenizer's `batch_decode` method.
168
+
169
+ Returns:
170
+ `List[str]`: The decoded text.
171
+ """
172
+ return self.tokenizer.batch_decode(
173
+ generated_outputs,
174
+ skip_special_tokens=skip_special_tokens,
175
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
176
+ **kwargs,
177
+ )
178
+
179
+ @property
180
+ def model_input_names(self):
181
+ tokenizer_input_names = self.tokenizer.model_input_names
182
+ image_processor_input_names = self.image_processor.model_input_names
183
+ names_from_processor = list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
184
+ return names_from_processor + ["second_per_grid_ts"]
185
+
186
+
187
+ __all__ = ["YoutuVLProcessor"]
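The image-token expansion in `YoutuVLProcessor.__call__` can be hard to follow inline, so here is a minimal standalone sketch of the same idea: each `<|image_pad|>` in the prompt is replaced by `h * w // 4` copies, using a temporary placeholder so that already-expanded tokens are not matched again by the loop. The function name and the plain list of `(h, w)` tuples are assumptions made for illustration; the processor itself reads the shapes from `image_inputs['spatial_shapes']`.

```python
IMAGE_TOKEN = "<|image_pad|>"
PLACEHOLDER = "<|placeholder|>"
MERGE_LENGTH = 4  # 2 x 2 spatial merge, matching merge_length in __call__


def expand_image_tokens(text: str, spatial_shapes: list[tuple[int, int]]) -> str:
    """Expand each image token to one token per merged patch of its image."""
    index = 0
    while IMAGE_TOKEN in text:
        if index >= len(spatial_shapes):
            raise ValueError("more image tokens in the text than images provided")
        h, w = spatial_shapes[index]
        repeats = h * w // MERGE_LENGTH
        # Replace only the first remaining image token; the placeholder keeps
        # the expanded run from being matched again on the next iteration.
        text = text.replace(IMAGE_TOKEN, PLACEHOLDER * repeats, 1)
        index += 1
    if index != len(spatial_shapes):
        raise ValueError("fewer image tokens in the text than images provided")
    return text.replace(PLACEHOLDER, IMAGE_TOKEN)
```

For example, an image with a 4 x 6 patch grid yields 24 // 4 = 6 image tokens in the final prompt.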
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|begin_of_text|>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|end_of_text|>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<|end_of_text|>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ }
23
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:41998384e9cea31ab97207e2ed59fed66b5481bf0c85fd04f8c7bbd3f7648a6d
3
+ size 39743446
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
1
+ {
2
+ "bos_token": "<|begin_of_text|>",
3
+ "clean_up_tokenization_spaces": false,
4
+ "eos_token": "<|end_of_text|>",
5
+ "extra_special_tokens": {},
6
+ "model_input_names": [
7
+ "input_ids",
8
+ "attention_mask"
9
+ ],
10
+ "model_max_length": 131072,
11
+ "pad_token": "<|end_of_text|>",
12
+ "tokenizer_class": "PreTrainedTokenizerFast",
13
+ "truncation_side": "left",
14
+ "use_fast": true
15
+ }
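The `truncation_side` setting of `"left"` means that, when truncation is enabled in a tokenizer call, tokens are dropped from the start of the sequence so the most recent context survives. The sketch below shows that behavior on a plain list of ids; the function name is hypothetical and it ignores details such as strides or sequence pairs.

```python
MODEL_MAX_LENGTH = 131072  # from tokenizer_config.json


def truncate_left(token_ids: list[int], max_length: int = MODEL_MAX_LENGTH) -> list[int]:
    """Keep at most the last max_length ids, dropping the oldest ones first."""
    if max_length <= 0:
        return []
    return token_ids[-max_length:]
```

Sequences that already fit within `max_length` are returned unchanged.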