OscarGD6 commited on
Commit
f810e46
·
verified ·
1 Parent(s): 74a3bd8

Upload IsaacForConditionalGeneration

Browse files
README.md ADDED
@@ -0,0 +1,199 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ library_name: transformers
3
+ tags: []
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+ This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
config.json ADDED
@@ -0,0 +1,90 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architectures": [
3
+ "IsaacForConditionalGeneration"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "modular_isaac.IsaacConfig",
9
+ "AutoModelForCausalLM": "modular_isaac.IsaacForConditionalGeneration"
10
+ },
11
+ "bos_token_id": 151643,
12
+ "dtype": "float32",
13
+ "eos_token_id": 151645,
14
+ "head_dim": 128,
15
+ "hidden_act": "silu",
16
+ "hidden_size": 2048,
17
+ "initializer_range": 0.02,
18
+ "intermediate_size": 6144,
19
+ "layer_types": [
20
+ "full_attention",
21
+ "full_attention",
22
+ "full_attention",
23
+ "full_attention",
24
+ "full_attention",
25
+ "full_attention",
26
+ "full_attention",
27
+ "full_attention",
28
+ "full_attention",
29
+ "full_attention",
30
+ "full_attention",
31
+ "full_attention",
32
+ "full_attention",
33
+ "full_attention",
34
+ "full_attention",
35
+ "full_attention",
36
+ "full_attention",
37
+ "full_attention",
38
+ "full_attention",
39
+ "full_attention",
40
+ "full_attention",
41
+ "full_attention",
42
+ "full_attention",
43
+ "full_attention",
44
+ "full_attention",
45
+ "full_attention",
46
+ "full_attention",
47
+ "full_attention"
48
+ ],
49
+ "max_position_embeddings": 40960,
50
+ "max_sequence_length": 16384,
51
+ "max_window_layers": 28,
52
+ "model_type": "isaac",
53
+ "num_attention_heads": 16,
54
+ "num_hidden_layers": 28,
55
+ "num_key_value_heads": 8,
56
+ "pixel_shuffle_scale": 2,
57
+ "rms_norm_eps": 1e-06,
58
+ "rope_scaling": {
59
+ "mrope_interleaved": true,
60
+ "mrope_section": null,
61
+ "rope_type": "default"
62
+ },
63
+ "rope_theta": 1000000.0,
64
+ "sliding_window": null,
65
+ "tie_word_embeddings": false,
66
+ "transformers_version": "4.56.2",
67
+ "use_cache": true,
68
+ "use_sliding_window": false,
69
+ "video_patch_size": 16,
70
+ "vision_config": {
71
+ "attention_dropout": 0.0,
72
+ "dtype": "float32",
73
+ "hidden_act": "gelu_pytorch_tanh",
74
+ "hidden_size": 1152,
75
+ "image_size": 256,
76
+ "intermediate_size": 4304,
77
+ "layer_norm_eps": 1e-06,
78
+ "model_type": "pixel_shuffle_siglip2",
79
+ "num_attention_heads": 16,
80
+ "num_channels": 3,
81
+ "num_hidden_layers": 27,
82
+ "num_patches": 256,
83
+ "patch_size": 16,
84
+ "pixel_shuffle_scale_factor": 2
85
+ },
86
+ "vision_max_num_patches": 6144,
87
+ "vision_min_num_patches": 256,
88
+ "vision_token": "<image>",
89
+ "vocab_size": 151936
90
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 151643,
4
+ "eos_token_id": 151645,
5
+ "transformers_version": "4.56.2"
6
+ }
model-00001-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f60b6bc3c8ed16d95c88b5b6d33101d0aa9464f5f3f33e204342859b12e371bb
3
+ size 4969539560
model-00002-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b73a606d306a09519e3fbe7bfd29077d39db48fee47ce19521b6b5c398cdcc32
3
+ size 4054187824
model-00003-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6941d35ff1feae1603946f8746a71205bb86343b57968402df2e737faf9258a2
3
+ size 1244659840
model.safetensors.index.json ADDED
@@ -0,0 +1,758 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "metadata": {
3
+ "total_parameters": 2567073008,
4
+ "total_size": 10268292032
5
+ },
6
+ "weight_map": {
7
+ "lm_head.weight": "model-00003-of-00003.safetensors",
8
+ "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
9
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00003.safetensors",
10
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
11
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
12
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
13
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
14
+ "model.layers.0.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
15
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
16
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
17
+ "model.layers.0.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
18
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
19
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
20
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00003.safetensors",
21
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
22
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
23
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
24
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
25
+ "model.layers.1.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
26
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
27
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
28
+ "model.layers.1.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
29
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
30
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
31
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00003.safetensors",
32
+ "model.layers.10.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
33
+ "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
34
+ "model.layers.10.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
35
+ "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
36
+ "model.layers.10.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
37
+ "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
38
+ "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
39
+ "model.layers.10.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
40
+ "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
41
+ "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
42
+ "model.layers.11.input_layernorm.weight": "model-00001-of-00003.safetensors",
43
+ "model.layers.11.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
44
+ "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
45
+ "model.layers.11.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
46
+ "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
47
+ "model.layers.11.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
48
+ "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
49
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
50
+ "model.layers.11.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
51
+ "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
52
+ "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
53
+ "model.layers.12.input_layernorm.weight": "model-00001-of-00003.safetensors",
54
+ "model.layers.12.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
55
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
56
+ "model.layers.12.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
57
+ "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
58
+ "model.layers.12.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
59
+ "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
60
+ "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
61
+ "model.layers.12.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
62
+ "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
63
+ "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
64
+ "model.layers.13.input_layernorm.weight": "model-00001-of-00003.safetensors",
65
+ "model.layers.13.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
66
+ "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
67
+ "model.layers.13.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
68
+ "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
69
+ "model.layers.13.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
70
+ "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
71
+ "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
72
+ "model.layers.13.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
73
+ "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
74
+ "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
75
+ "model.layers.14.input_layernorm.weight": "model-00001-of-00003.safetensors",
76
+ "model.layers.14.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
77
+ "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
78
+ "model.layers.14.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
79
+ "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
80
+ "model.layers.14.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
81
+ "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
82
+ "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
83
+ "model.layers.14.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
84
+ "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
85
+ "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
86
+ "model.layers.15.input_layernorm.weight": "model-00001-of-00003.safetensors",
87
+ "model.layers.15.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
88
+ "model.layers.15.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
89
+ "model.layers.15.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
90
+ "model.layers.15.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
91
+ "model.layers.15.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
92
+ "model.layers.15.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
93
+ "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
94
+ "model.layers.15.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
95
+ "model.layers.15.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
96
+ "model.layers.15.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
97
+ "model.layers.16.input_layernorm.weight": "model-00001-of-00003.safetensors",
98
+ "model.layers.16.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
99
+ "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
100
+ "model.layers.16.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
101
+ "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
102
+ "model.layers.16.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
103
+ "model.layers.16.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
104
+ "model.layers.16.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
105
+ "model.layers.16.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
106
+ "model.layers.16.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
107
+ "model.layers.16.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
108
+ "model.layers.17.input_layernorm.weight": "model-00001-of-00003.safetensors",
109
+ "model.layers.17.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
110
+ "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
111
+ "model.layers.17.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
112
+ "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
113
+ "model.layers.17.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
114
+ "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
115
+ "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
116
+ "model.layers.17.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
117
+ "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
118
+ "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
119
+ "model.layers.18.input_layernorm.weight": "model-00002-of-00003.safetensors",
120
+ "model.layers.18.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
121
+ "model.layers.18.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
122
+ "model.layers.18.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
123
+ "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
124
+ "model.layers.18.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
125
+ "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
126
+ "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
127
+ "model.layers.18.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
128
+ "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
129
+ "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
130
+ "model.layers.19.input_layernorm.weight": "model-00002-of-00003.safetensors",
131
+ "model.layers.19.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
132
+ "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
133
+ "model.layers.19.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
134
+ "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
135
+ "model.layers.19.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
136
+ "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
137
+ "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
138
+ "model.layers.19.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
139
+ "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
140
+ "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
141
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00003.safetensors",
142
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
143
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
144
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
145
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
146
+ "model.layers.2.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
147
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
148
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
149
+ "model.layers.2.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
150
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
151
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
152
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00003.safetensors",
153
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
154
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
155
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
156
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
157
+ "model.layers.20.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
158
+ "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
159
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
160
+ "model.layers.20.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
161
+ "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
162
+ "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
163
+ "model.layers.21.input_layernorm.weight": "model-00002-of-00003.safetensors",
164
+ "model.layers.21.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
165
+ "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
166
+ "model.layers.21.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
167
+ "model.layers.21.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
168
+ "model.layers.21.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
169
+ "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
170
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
171
+ "model.layers.21.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
172
+ "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
173
+ "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
174
+ "model.layers.22.input_layernorm.weight": "model-00002-of-00003.safetensors",
175
+ "model.layers.22.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
176
+ "model.layers.22.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
177
+ "model.layers.22.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
178
+ "model.layers.22.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
179
+ "model.layers.22.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
180
+ "model.layers.22.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
181
+ "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
182
+ "model.layers.22.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
183
+ "model.layers.22.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
184
+ "model.layers.22.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
185
+ "model.layers.23.input_layernorm.weight": "model-00002-of-00003.safetensors",
186
+ "model.layers.23.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
187
+ "model.layers.23.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
188
+ "model.layers.23.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
189
+ "model.layers.23.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
190
+ "model.layers.23.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
191
+ "model.layers.23.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
192
+ "model.layers.23.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
193
+ "model.layers.23.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
194
+ "model.layers.23.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
195
+ "model.layers.23.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
196
+ "model.layers.24.input_layernorm.weight": "model-00002-of-00003.safetensors",
197
+ "model.layers.24.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
198
+ "model.layers.24.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
199
+ "model.layers.24.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
200
+ "model.layers.24.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
201
+ "model.layers.24.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
202
+ "model.layers.24.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
203
+ "model.layers.24.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
204
+ "model.layers.24.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
205
+ "model.layers.24.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
206
+ "model.layers.24.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
207
+ "model.layers.25.input_layernorm.weight": "model-00002-of-00003.safetensors",
208
+ "model.layers.25.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
209
+ "model.layers.25.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
210
+ "model.layers.25.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
211
+ "model.layers.25.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
212
+ "model.layers.25.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
213
+ "model.layers.25.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
214
+ "model.layers.25.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
215
+ "model.layers.25.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
216
+ "model.layers.25.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
217
+ "model.layers.25.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
218
+ "model.layers.26.input_layernorm.weight": "model-00002-of-00003.safetensors",
219
+ "model.layers.26.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
220
+ "model.layers.26.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
221
+ "model.layers.26.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
222
+ "model.layers.26.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
223
+ "model.layers.26.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
224
+ "model.layers.26.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
225
+ "model.layers.26.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
226
+ "model.layers.26.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
227
+ "model.layers.26.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
228
+ "model.layers.26.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
229
+ "model.layers.27.input_layernorm.weight": "model-00002-of-00003.safetensors",
230
+ "model.layers.27.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
231
+ "model.layers.27.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
232
+ "model.layers.27.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
233
+ "model.layers.27.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
234
+ "model.layers.27.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
235
+ "model.layers.27.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
236
+ "model.layers.27.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
237
+ "model.layers.27.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
238
+ "model.layers.27.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
239
+ "model.layers.27.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
240
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00003.safetensors",
241
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
242
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
243
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
244
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
245
+ "model.layers.3.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
246
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
247
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
248
+ "model.layers.3.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
249
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
250
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
251
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00003.safetensors",
252
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
253
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
254
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
255
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
256
+ "model.layers.4.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
257
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
258
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
259
+ "model.layers.4.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
260
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
261
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
262
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00003.safetensors",
263
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
264
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
265
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
266
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
267
+ "model.layers.5.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
268
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
269
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
270
+ "model.layers.5.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
271
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
272
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
273
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00003.safetensors",
274
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
275
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
276
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
277
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
278
+ "model.layers.6.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
279
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
280
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
281
+ "model.layers.6.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
282
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
283
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
284
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00003.safetensors",
285
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
286
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
287
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
288
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
289
+ "model.layers.7.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
290
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
291
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
292
+ "model.layers.7.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
293
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
294
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
295
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00003.safetensors",
296
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
297
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
298
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
299
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
300
+ "model.layers.8.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
301
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
302
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
303
+ "model.layers.8.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
304
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
305
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
306
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00003.safetensors",
307
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
308
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
309
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
310
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
311
+ "model.layers.9.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
312
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
313
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
314
+ "model.layers.9.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
315
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
316
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
317
+ "model.norm.weight": "model-00002-of-00003.safetensors",
318
+ "model.vision_embedding.0.embeddings.patch_embedding.bias": "model-00002-of-00003.safetensors",
319
+ "model.vision_embedding.0.embeddings.patch_embedding.weight": "model-00002-of-00003.safetensors",
320
+ "model.vision_embedding.0.embeddings.position_embedding.weight": "model-00002-of-00003.safetensors",
321
+ "model.vision_embedding.0.encoder.layers.0.layer_norm1.bias": "model-00002-of-00003.safetensors",
322
+ "model.vision_embedding.0.encoder.layers.0.layer_norm1.weight": "model-00002-of-00003.safetensors",
323
+ "model.vision_embedding.0.encoder.layers.0.layer_norm2.bias": "model-00002-of-00003.safetensors",
324
+ "model.vision_embedding.0.encoder.layers.0.layer_norm2.weight": "model-00002-of-00003.safetensors",
325
+ "model.vision_embedding.0.encoder.layers.0.mlp.fc1.bias": "model-00002-of-00003.safetensors",
326
+ "model.vision_embedding.0.encoder.layers.0.mlp.fc1.weight": "model-00002-of-00003.safetensors",
327
+ "model.vision_embedding.0.encoder.layers.0.mlp.fc2.bias": "model-00002-of-00003.safetensors",
328
+ "model.vision_embedding.0.encoder.layers.0.mlp.fc2.weight": "model-00002-of-00003.safetensors",
329
+ "model.vision_embedding.0.encoder.layers.0.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
330
+ "model.vision_embedding.0.encoder.layers.0.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
331
+ "model.vision_embedding.0.encoder.layers.0.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
332
+ "model.vision_embedding.0.encoder.layers.0.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
333
+ "model.vision_embedding.0.encoder.layers.0.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
334
+ "model.vision_embedding.0.encoder.layers.0.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
335
+ "model.vision_embedding.0.encoder.layers.0.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
336
+ "model.vision_embedding.0.encoder.layers.0.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
337
+ "model.vision_embedding.0.encoder.layers.1.layer_norm1.bias": "model-00002-of-00003.safetensors",
338
+ "model.vision_embedding.0.encoder.layers.1.layer_norm1.weight": "model-00002-of-00003.safetensors",
339
+ "model.vision_embedding.0.encoder.layers.1.layer_norm2.bias": "model-00002-of-00003.safetensors",
340
+ "model.vision_embedding.0.encoder.layers.1.layer_norm2.weight": "model-00002-of-00003.safetensors",
341
+ "model.vision_embedding.0.encoder.layers.1.mlp.fc1.bias": "model-00002-of-00003.safetensors",
342
+ "model.vision_embedding.0.encoder.layers.1.mlp.fc1.weight": "model-00002-of-00003.safetensors",
343
+ "model.vision_embedding.0.encoder.layers.1.mlp.fc2.bias": "model-00002-of-00003.safetensors",
344
+ "model.vision_embedding.0.encoder.layers.1.mlp.fc2.weight": "model-00002-of-00003.safetensors",
345
+ "model.vision_embedding.0.encoder.layers.1.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
346
+ "model.vision_embedding.0.encoder.layers.1.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
347
+ "model.vision_embedding.0.encoder.layers.1.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
348
+ "model.vision_embedding.0.encoder.layers.1.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
349
+ "model.vision_embedding.0.encoder.layers.1.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
350
+ "model.vision_embedding.0.encoder.layers.1.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
351
+ "model.vision_embedding.0.encoder.layers.1.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
352
+ "model.vision_embedding.0.encoder.layers.1.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
353
+ "model.vision_embedding.0.encoder.layers.10.layer_norm1.bias": "model-00002-of-00003.safetensors",
354
+ "model.vision_embedding.0.encoder.layers.10.layer_norm1.weight": "model-00002-of-00003.safetensors",
355
+ "model.vision_embedding.0.encoder.layers.10.layer_norm2.bias": "model-00002-of-00003.safetensors",
356
+ "model.vision_embedding.0.encoder.layers.10.layer_norm2.weight": "model-00002-of-00003.safetensors",
357
+ "model.vision_embedding.0.encoder.layers.10.mlp.fc1.bias": "model-00002-of-00003.safetensors",
358
+ "model.vision_embedding.0.encoder.layers.10.mlp.fc1.weight": "model-00002-of-00003.safetensors",
359
+ "model.vision_embedding.0.encoder.layers.10.mlp.fc2.bias": "model-00002-of-00003.safetensors",
360
+ "model.vision_embedding.0.encoder.layers.10.mlp.fc2.weight": "model-00002-of-00003.safetensors",
361
+ "model.vision_embedding.0.encoder.layers.10.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
362
+ "model.vision_embedding.0.encoder.layers.10.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
363
+ "model.vision_embedding.0.encoder.layers.10.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
364
+ "model.vision_embedding.0.encoder.layers.10.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
365
+ "model.vision_embedding.0.encoder.layers.10.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
366
+ "model.vision_embedding.0.encoder.layers.10.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
367
+ "model.vision_embedding.0.encoder.layers.10.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
368
+ "model.vision_embedding.0.encoder.layers.10.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
369
+ "model.vision_embedding.0.encoder.layers.11.layer_norm1.bias": "model-00002-of-00003.safetensors",
370
+ "model.vision_embedding.0.encoder.layers.11.layer_norm1.weight": "model-00002-of-00003.safetensors",
371
+ "model.vision_embedding.0.encoder.layers.11.layer_norm2.bias": "model-00002-of-00003.safetensors",
372
+ "model.vision_embedding.0.encoder.layers.11.layer_norm2.weight": "model-00002-of-00003.safetensors",
373
+ "model.vision_embedding.0.encoder.layers.11.mlp.fc1.bias": "model-00002-of-00003.safetensors",
374
+ "model.vision_embedding.0.encoder.layers.11.mlp.fc1.weight": "model-00002-of-00003.safetensors",
375
+ "model.vision_embedding.0.encoder.layers.11.mlp.fc2.bias": "model-00002-of-00003.safetensors",
376
+ "model.vision_embedding.0.encoder.layers.11.mlp.fc2.weight": "model-00002-of-00003.safetensors",
377
+ "model.vision_embedding.0.encoder.layers.11.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
378
+ "model.vision_embedding.0.encoder.layers.11.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
379
+ "model.vision_embedding.0.encoder.layers.11.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
380
+ "model.vision_embedding.0.encoder.layers.11.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
381
+ "model.vision_embedding.0.encoder.layers.11.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
382
+ "model.vision_embedding.0.encoder.layers.11.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
383
+ "model.vision_embedding.0.encoder.layers.11.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
384
+ "model.vision_embedding.0.encoder.layers.11.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
385
+ "model.vision_embedding.0.encoder.layers.12.layer_norm1.bias": "model-00002-of-00003.safetensors",
386
+ "model.vision_embedding.0.encoder.layers.12.layer_norm1.weight": "model-00002-of-00003.safetensors",
387
+ "model.vision_embedding.0.encoder.layers.12.layer_norm2.bias": "model-00002-of-00003.safetensors",
388
+ "model.vision_embedding.0.encoder.layers.12.layer_norm2.weight": "model-00002-of-00003.safetensors",
389
+ "model.vision_embedding.0.encoder.layers.12.mlp.fc1.bias": "model-00002-of-00003.safetensors",
390
+ "model.vision_embedding.0.encoder.layers.12.mlp.fc1.weight": "model-00002-of-00003.safetensors",
391
+ "model.vision_embedding.0.encoder.layers.12.mlp.fc2.bias": "model-00002-of-00003.safetensors",
392
+ "model.vision_embedding.0.encoder.layers.12.mlp.fc2.weight": "model-00002-of-00003.safetensors",
393
+ "model.vision_embedding.0.encoder.layers.12.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
394
+ "model.vision_embedding.0.encoder.layers.12.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
395
+ "model.vision_embedding.0.encoder.layers.12.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
396
+ "model.vision_embedding.0.encoder.layers.12.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
397
+ "model.vision_embedding.0.encoder.layers.12.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
398
+ "model.vision_embedding.0.encoder.layers.12.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
399
+ "model.vision_embedding.0.encoder.layers.12.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
400
+ "model.vision_embedding.0.encoder.layers.12.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
401
+ "model.vision_embedding.0.encoder.layers.13.layer_norm1.bias": "model-00002-of-00003.safetensors",
402
+ "model.vision_embedding.0.encoder.layers.13.layer_norm1.weight": "model-00002-of-00003.safetensors",
403
+ "model.vision_embedding.0.encoder.layers.13.layer_norm2.bias": "model-00002-of-00003.safetensors",
404
+ "model.vision_embedding.0.encoder.layers.13.layer_norm2.weight": "model-00002-of-00003.safetensors",
405
+ "model.vision_embedding.0.encoder.layers.13.mlp.fc1.bias": "model-00002-of-00003.safetensors",
406
+ "model.vision_embedding.0.encoder.layers.13.mlp.fc1.weight": "model-00002-of-00003.safetensors",
407
+ "model.vision_embedding.0.encoder.layers.13.mlp.fc2.bias": "model-00002-of-00003.safetensors",
408
+ "model.vision_embedding.0.encoder.layers.13.mlp.fc2.weight": "model-00002-of-00003.safetensors",
409
+ "model.vision_embedding.0.encoder.layers.13.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
410
+ "model.vision_embedding.0.encoder.layers.13.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
411
+ "model.vision_embedding.0.encoder.layers.13.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
412
+ "model.vision_embedding.0.encoder.layers.13.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
413
+ "model.vision_embedding.0.encoder.layers.13.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
414
+ "model.vision_embedding.0.encoder.layers.13.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
415
+ "model.vision_embedding.0.encoder.layers.13.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
416
+ "model.vision_embedding.0.encoder.layers.13.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
417
+ "model.vision_embedding.0.encoder.layers.14.layer_norm1.bias": "model-00002-of-00003.safetensors",
418
+ "model.vision_embedding.0.encoder.layers.14.layer_norm1.weight": "model-00002-of-00003.safetensors",
419
+ "model.vision_embedding.0.encoder.layers.14.layer_norm2.bias": "model-00002-of-00003.safetensors",
420
+ "model.vision_embedding.0.encoder.layers.14.layer_norm2.weight": "model-00002-of-00003.safetensors",
421
+ "model.vision_embedding.0.encoder.layers.14.mlp.fc1.bias": "model-00002-of-00003.safetensors",
422
+ "model.vision_embedding.0.encoder.layers.14.mlp.fc1.weight": "model-00002-of-00003.safetensors",
423
+ "model.vision_embedding.0.encoder.layers.14.mlp.fc2.bias": "model-00002-of-00003.safetensors",
424
+ "model.vision_embedding.0.encoder.layers.14.mlp.fc2.weight": "model-00002-of-00003.safetensors",
425
+ "model.vision_embedding.0.encoder.layers.14.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
426
+ "model.vision_embedding.0.encoder.layers.14.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
427
+ "model.vision_embedding.0.encoder.layers.14.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
428
+ "model.vision_embedding.0.encoder.layers.14.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
429
+ "model.vision_embedding.0.encoder.layers.14.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
430
+ "model.vision_embedding.0.encoder.layers.14.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
431
+ "model.vision_embedding.0.encoder.layers.14.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
432
+ "model.vision_embedding.0.encoder.layers.14.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
433
+ "model.vision_embedding.0.encoder.layers.15.layer_norm1.bias": "model-00002-of-00003.safetensors",
434
+ "model.vision_embedding.0.encoder.layers.15.layer_norm1.weight": "model-00002-of-00003.safetensors",
435
+ "model.vision_embedding.0.encoder.layers.15.layer_norm2.bias": "model-00002-of-00003.safetensors",
436
+ "model.vision_embedding.0.encoder.layers.15.layer_norm2.weight": "model-00002-of-00003.safetensors",
437
+ "model.vision_embedding.0.encoder.layers.15.mlp.fc1.bias": "model-00002-of-00003.safetensors",
438
+ "model.vision_embedding.0.encoder.layers.15.mlp.fc1.weight": "model-00002-of-00003.safetensors",
439
+ "model.vision_embedding.0.encoder.layers.15.mlp.fc2.bias": "model-00002-of-00003.safetensors",
440
+ "model.vision_embedding.0.encoder.layers.15.mlp.fc2.weight": "model-00002-of-00003.safetensors",
441
+ "model.vision_embedding.0.encoder.layers.15.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
442
+ "model.vision_embedding.0.encoder.layers.15.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
443
+ "model.vision_embedding.0.encoder.layers.15.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
444
+ "model.vision_embedding.0.encoder.layers.15.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
445
+ "model.vision_embedding.0.encoder.layers.15.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
446
+ "model.vision_embedding.0.encoder.layers.15.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
447
+ "model.vision_embedding.0.encoder.layers.15.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
448
+ "model.vision_embedding.0.encoder.layers.15.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
449
+ "model.vision_embedding.0.encoder.layers.16.layer_norm1.bias": "model-00002-of-00003.safetensors",
450
+ "model.vision_embedding.0.encoder.layers.16.layer_norm1.weight": "model-00002-of-00003.safetensors",
451
+ "model.vision_embedding.0.encoder.layers.16.layer_norm2.bias": "model-00002-of-00003.safetensors",
452
+ "model.vision_embedding.0.encoder.layers.16.layer_norm2.weight": "model-00002-of-00003.safetensors",
453
+ "model.vision_embedding.0.encoder.layers.16.mlp.fc1.bias": "model-00002-of-00003.safetensors",
454
+ "model.vision_embedding.0.encoder.layers.16.mlp.fc1.weight": "model-00002-of-00003.safetensors",
455
+ "model.vision_embedding.0.encoder.layers.16.mlp.fc2.bias": "model-00002-of-00003.safetensors",
456
+ "model.vision_embedding.0.encoder.layers.16.mlp.fc2.weight": "model-00002-of-00003.safetensors",
457
+ "model.vision_embedding.0.encoder.layers.16.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
458
+ "model.vision_embedding.0.encoder.layers.16.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
459
+ "model.vision_embedding.0.encoder.layers.16.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
460
+ "model.vision_embedding.0.encoder.layers.16.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
461
+ "model.vision_embedding.0.encoder.layers.16.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
462
+ "model.vision_embedding.0.encoder.layers.16.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
463
+ "model.vision_embedding.0.encoder.layers.16.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
464
+ "model.vision_embedding.0.encoder.layers.16.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
465
+ "model.vision_embedding.0.encoder.layers.17.layer_norm1.bias": "model-00002-of-00003.safetensors",
466
+ "model.vision_embedding.0.encoder.layers.17.layer_norm1.weight": "model-00002-of-00003.safetensors",
467
+ "model.vision_embedding.0.encoder.layers.17.layer_norm2.bias": "model-00002-of-00003.safetensors",
468
+ "model.vision_embedding.0.encoder.layers.17.layer_norm2.weight": "model-00002-of-00003.safetensors",
469
+ "model.vision_embedding.0.encoder.layers.17.mlp.fc1.bias": "model-00002-of-00003.safetensors",
470
+ "model.vision_embedding.0.encoder.layers.17.mlp.fc1.weight": "model-00002-of-00003.safetensors",
471
+ "model.vision_embedding.0.encoder.layers.17.mlp.fc2.bias": "model-00002-of-00003.safetensors",
472
+ "model.vision_embedding.0.encoder.layers.17.mlp.fc2.weight": "model-00002-of-00003.safetensors",
473
+ "model.vision_embedding.0.encoder.layers.17.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
474
+ "model.vision_embedding.0.encoder.layers.17.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
475
+ "model.vision_embedding.0.encoder.layers.17.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
476
+ "model.vision_embedding.0.encoder.layers.17.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
477
+ "model.vision_embedding.0.encoder.layers.17.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
478
+ "model.vision_embedding.0.encoder.layers.17.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
479
+ "model.vision_embedding.0.encoder.layers.17.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
480
+ "model.vision_embedding.0.encoder.layers.17.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
481
+ "model.vision_embedding.0.encoder.layers.18.layer_norm1.bias": "model-00002-of-00003.safetensors",
482
+ "model.vision_embedding.0.encoder.layers.18.layer_norm1.weight": "model-00002-of-00003.safetensors",
483
+ "model.vision_embedding.0.encoder.layers.18.layer_norm2.bias": "model-00002-of-00003.safetensors",
484
+ "model.vision_embedding.0.encoder.layers.18.layer_norm2.weight": "model-00002-of-00003.safetensors",
485
+ "model.vision_embedding.0.encoder.layers.18.mlp.fc1.bias": "model-00002-of-00003.safetensors",
486
+ "model.vision_embedding.0.encoder.layers.18.mlp.fc1.weight": "model-00002-of-00003.safetensors",
487
+ "model.vision_embedding.0.encoder.layers.18.mlp.fc2.bias": "model-00002-of-00003.safetensors",
488
+ "model.vision_embedding.0.encoder.layers.18.mlp.fc2.weight": "model-00002-of-00003.safetensors",
489
+ "model.vision_embedding.0.encoder.layers.18.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
490
+ "model.vision_embedding.0.encoder.layers.18.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
491
+ "model.vision_embedding.0.encoder.layers.18.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
492
+ "model.vision_embedding.0.encoder.layers.18.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
493
+ "model.vision_embedding.0.encoder.layers.18.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
494
+ "model.vision_embedding.0.encoder.layers.18.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
495
+ "model.vision_embedding.0.encoder.layers.18.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
496
+ "model.vision_embedding.0.encoder.layers.18.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
497
+ "model.vision_embedding.0.encoder.layers.19.layer_norm1.bias": "model-00002-of-00003.safetensors",
498
+ "model.vision_embedding.0.encoder.layers.19.layer_norm1.weight": "model-00002-of-00003.safetensors",
499
+ "model.vision_embedding.0.encoder.layers.19.layer_norm2.bias": "model-00002-of-00003.safetensors",
500
+ "model.vision_embedding.0.encoder.layers.19.layer_norm2.weight": "model-00002-of-00003.safetensors",
501
+ "model.vision_embedding.0.encoder.layers.19.mlp.fc1.bias": "model-00002-of-00003.safetensors",
502
+ "model.vision_embedding.0.encoder.layers.19.mlp.fc1.weight": "model-00002-of-00003.safetensors",
503
+ "model.vision_embedding.0.encoder.layers.19.mlp.fc2.bias": "model-00002-of-00003.safetensors",
504
+ "model.vision_embedding.0.encoder.layers.19.mlp.fc2.weight": "model-00002-of-00003.safetensors",
505
+ "model.vision_embedding.0.encoder.layers.19.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
506
+ "model.vision_embedding.0.encoder.layers.19.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
507
+ "model.vision_embedding.0.encoder.layers.19.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
508
+ "model.vision_embedding.0.encoder.layers.19.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
509
+ "model.vision_embedding.0.encoder.layers.19.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
510
+ "model.vision_embedding.0.encoder.layers.19.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
511
+ "model.vision_embedding.0.encoder.layers.19.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
512
+ "model.vision_embedding.0.encoder.layers.19.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
513
+ "model.vision_embedding.0.encoder.layers.2.layer_norm1.bias": "model-00002-of-00003.safetensors",
514
+ "model.vision_embedding.0.encoder.layers.2.layer_norm1.weight": "model-00002-of-00003.safetensors",
515
+ "model.vision_embedding.0.encoder.layers.2.layer_norm2.bias": "model-00002-of-00003.safetensors",
516
+ "model.vision_embedding.0.encoder.layers.2.layer_norm2.weight": "model-00002-of-00003.safetensors",
517
+ "model.vision_embedding.0.encoder.layers.2.mlp.fc1.bias": "model-00002-of-00003.safetensors",
518
+ "model.vision_embedding.0.encoder.layers.2.mlp.fc1.weight": "model-00002-of-00003.safetensors",
519
+ "model.vision_embedding.0.encoder.layers.2.mlp.fc2.bias": "model-00002-of-00003.safetensors",
520
+ "model.vision_embedding.0.encoder.layers.2.mlp.fc2.weight": "model-00002-of-00003.safetensors",
521
+ "model.vision_embedding.0.encoder.layers.2.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
522
+ "model.vision_embedding.0.encoder.layers.2.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
523
+ "model.vision_embedding.0.encoder.layers.2.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
524
+ "model.vision_embedding.0.encoder.layers.2.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
525
+ "model.vision_embedding.0.encoder.layers.2.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
526
+ "model.vision_embedding.0.encoder.layers.2.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
527
+ "model.vision_embedding.0.encoder.layers.2.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
528
+ "model.vision_embedding.0.encoder.layers.2.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
529
+ "model.vision_embedding.0.encoder.layers.20.layer_norm1.bias": "model-00002-of-00003.safetensors",
530
+ "model.vision_embedding.0.encoder.layers.20.layer_norm1.weight": "model-00002-of-00003.safetensors",
531
+ "model.vision_embedding.0.encoder.layers.20.layer_norm2.bias": "model-00002-of-00003.safetensors",
532
+ "model.vision_embedding.0.encoder.layers.20.layer_norm2.weight": "model-00002-of-00003.safetensors",
533
+ "model.vision_embedding.0.encoder.layers.20.mlp.fc1.bias": "model-00002-of-00003.safetensors",
534
+ "model.vision_embedding.0.encoder.layers.20.mlp.fc1.weight": "model-00002-of-00003.safetensors",
535
+ "model.vision_embedding.0.encoder.layers.20.mlp.fc2.bias": "model-00002-of-00003.safetensors",
536
+ "model.vision_embedding.0.encoder.layers.20.mlp.fc2.weight": "model-00002-of-00003.safetensors",
537
+ "model.vision_embedding.0.encoder.layers.20.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
538
+ "model.vision_embedding.0.encoder.layers.20.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
539
+ "model.vision_embedding.0.encoder.layers.20.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
540
+ "model.vision_embedding.0.encoder.layers.20.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
541
+ "model.vision_embedding.0.encoder.layers.20.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
542
+ "model.vision_embedding.0.encoder.layers.20.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
543
+ "model.vision_embedding.0.encoder.layers.20.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
544
+ "model.vision_embedding.0.encoder.layers.20.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
545
+ "model.vision_embedding.0.encoder.layers.21.layer_norm1.bias": "model-00002-of-00003.safetensors",
546
+ "model.vision_embedding.0.encoder.layers.21.layer_norm1.weight": "model-00002-of-00003.safetensors",
547
+ "model.vision_embedding.0.encoder.layers.21.layer_norm2.bias": "model-00002-of-00003.safetensors",
548
+ "model.vision_embedding.0.encoder.layers.21.layer_norm2.weight": "model-00002-of-00003.safetensors",
549
+ "model.vision_embedding.0.encoder.layers.21.mlp.fc1.bias": "model-00002-of-00003.safetensors",
550
+ "model.vision_embedding.0.encoder.layers.21.mlp.fc1.weight": "model-00002-of-00003.safetensors",
551
+ "model.vision_embedding.0.encoder.layers.21.mlp.fc2.bias": "model-00002-of-00003.safetensors",
552
+ "model.vision_embedding.0.encoder.layers.21.mlp.fc2.weight": "model-00002-of-00003.safetensors",
553
+ "model.vision_embedding.0.encoder.layers.21.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
554
+ "model.vision_embedding.0.encoder.layers.21.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
555
+ "model.vision_embedding.0.encoder.layers.21.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
556
+ "model.vision_embedding.0.encoder.layers.21.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
557
+ "model.vision_embedding.0.encoder.layers.21.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
558
+ "model.vision_embedding.0.encoder.layers.21.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
559
+ "model.vision_embedding.0.encoder.layers.21.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
560
+ "model.vision_embedding.0.encoder.layers.21.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
561
+ "model.vision_embedding.0.encoder.layers.22.layer_norm1.bias": "model-00002-of-00003.safetensors",
562
+ "model.vision_embedding.0.encoder.layers.22.layer_norm1.weight": "model-00002-of-00003.safetensors",
563
+ "model.vision_embedding.0.encoder.layers.22.layer_norm2.bias": "model-00002-of-00003.safetensors",
564
+ "model.vision_embedding.0.encoder.layers.22.layer_norm2.weight": "model-00002-of-00003.safetensors",
565
+ "model.vision_embedding.0.encoder.layers.22.mlp.fc1.bias": "model-00002-of-00003.safetensors",
566
+ "model.vision_embedding.0.encoder.layers.22.mlp.fc1.weight": "model-00002-of-00003.safetensors",
567
+ "model.vision_embedding.0.encoder.layers.22.mlp.fc2.bias": "model-00002-of-00003.safetensors",
568
+ "model.vision_embedding.0.encoder.layers.22.mlp.fc2.weight": "model-00002-of-00003.safetensors",
569
+ "model.vision_embedding.0.encoder.layers.22.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
570
+ "model.vision_embedding.0.encoder.layers.22.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
571
+ "model.vision_embedding.0.encoder.layers.22.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
572
+ "model.vision_embedding.0.encoder.layers.22.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
573
+ "model.vision_embedding.0.encoder.layers.22.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
574
+ "model.vision_embedding.0.encoder.layers.22.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
575
+ "model.vision_embedding.0.encoder.layers.22.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
576
+ "model.vision_embedding.0.encoder.layers.22.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
577
+ "model.vision_embedding.0.encoder.layers.23.layer_norm1.bias": "model-00002-of-00003.safetensors",
578
+ "model.vision_embedding.0.encoder.layers.23.layer_norm1.weight": "model-00002-of-00003.safetensors",
579
+ "model.vision_embedding.0.encoder.layers.23.layer_norm2.bias": "model-00002-of-00003.safetensors",
580
+ "model.vision_embedding.0.encoder.layers.23.layer_norm2.weight": "model-00002-of-00003.safetensors",
581
+ "model.vision_embedding.0.encoder.layers.23.mlp.fc1.bias": "model-00002-of-00003.safetensors",
582
+ "model.vision_embedding.0.encoder.layers.23.mlp.fc1.weight": "model-00002-of-00003.safetensors",
583
+ "model.vision_embedding.0.encoder.layers.23.mlp.fc2.bias": "model-00002-of-00003.safetensors",
584
+ "model.vision_embedding.0.encoder.layers.23.mlp.fc2.weight": "model-00002-of-00003.safetensors",
585
+ "model.vision_embedding.0.encoder.layers.23.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
586
+ "model.vision_embedding.0.encoder.layers.23.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
587
+ "model.vision_embedding.0.encoder.layers.23.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
588
+ "model.vision_embedding.0.encoder.layers.23.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
589
+ "model.vision_embedding.0.encoder.layers.23.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
590
+ "model.vision_embedding.0.encoder.layers.23.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
591
+ "model.vision_embedding.0.encoder.layers.23.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
592
+ "model.vision_embedding.0.encoder.layers.23.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
593
+ "model.vision_embedding.0.encoder.layers.24.layer_norm1.bias": "model-00002-of-00003.safetensors",
594
+ "model.vision_embedding.0.encoder.layers.24.layer_norm1.weight": "model-00002-of-00003.safetensors",
595
+ "model.vision_embedding.0.encoder.layers.24.layer_norm2.bias": "model-00002-of-00003.safetensors",
596
+ "model.vision_embedding.0.encoder.layers.24.layer_norm2.weight": "model-00002-of-00003.safetensors",
597
+ "model.vision_embedding.0.encoder.layers.24.mlp.fc1.bias": "model-00002-of-00003.safetensors",
598
+ "model.vision_embedding.0.encoder.layers.24.mlp.fc1.weight": "model-00002-of-00003.safetensors",
599
+ "model.vision_embedding.0.encoder.layers.24.mlp.fc2.bias": "model-00002-of-00003.safetensors",
600
+ "model.vision_embedding.0.encoder.layers.24.mlp.fc2.weight": "model-00002-of-00003.safetensors",
601
+ "model.vision_embedding.0.encoder.layers.24.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
602
+ "model.vision_embedding.0.encoder.layers.24.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
603
+ "model.vision_embedding.0.encoder.layers.24.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
604
+ "model.vision_embedding.0.encoder.layers.24.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
605
+ "model.vision_embedding.0.encoder.layers.24.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
606
+ "model.vision_embedding.0.encoder.layers.24.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
607
+ "model.vision_embedding.0.encoder.layers.24.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
608
+ "model.vision_embedding.0.encoder.layers.24.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
609
+ "model.vision_embedding.0.encoder.layers.25.layer_norm1.bias": "model-00002-of-00003.safetensors",
610
+ "model.vision_embedding.0.encoder.layers.25.layer_norm1.weight": "model-00002-of-00003.safetensors",
611
+ "model.vision_embedding.0.encoder.layers.25.layer_norm2.bias": "model-00002-of-00003.safetensors",
612
+ "model.vision_embedding.0.encoder.layers.25.layer_norm2.weight": "model-00002-of-00003.safetensors",
613
+ "model.vision_embedding.0.encoder.layers.25.mlp.fc1.bias": "model-00002-of-00003.safetensors",
614
+ "model.vision_embedding.0.encoder.layers.25.mlp.fc1.weight": "model-00002-of-00003.safetensors",
615
+ "model.vision_embedding.0.encoder.layers.25.mlp.fc2.bias": "model-00002-of-00003.safetensors",
616
+ "model.vision_embedding.0.encoder.layers.25.mlp.fc2.weight": "model-00002-of-00003.safetensors",
617
+ "model.vision_embedding.0.encoder.layers.25.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
618
+ "model.vision_embedding.0.encoder.layers.25.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
619
+ "model.vision_embedding.0.encoder.layers.25.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
620
+ "model.vision_embedding.0.encoder.layers.25.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
621
+ "model.vision_embedding.0.encoder.layers.25.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
622
+ "model.vision_embedding.0.encoder.layers.25.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
623
+ "model.vision_embedding.0.encoder.layers.25.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
624
+ "model.vision_embedding.0.encoder.layers.25.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
625
+ "model.vision_embedding.0.encoder.layers.26.layer_norm1.bias": "model-00002-of-00003.safetensors",
626
+ "model.vision_embedding.0.encoder.layers.26.layer_norm1.weight": "model-00002-of-00003.safetensors",
627
+ "model.vision_embedding.0.encoder.layers.26.layer_norm2.bias": "model-00002-of-00003.safetensors",
628
+ "model.vision_embedding.0.encoder.layers.26.layer_norm2.weight": "model-00002-of-00003.safetensors",
629
+ "model.vision_embedding.0.encoder.layers.26.mlp.fc1.bias": "model-00002-of-00003.safetensors",
630
+ "model.vision_embedding.0.encoder.layers.26.mlp.fc1.weight": "model-00002-of-00003.safetensors",
631
+ "model.vision_embedding.0.encoder.layers.26.mlp.fc2.bias": "model-00002-of-00003.safetensors",
632
+ "model.vision_embedding.0.encoder.layers.26.mlp.fc2.weight": "model-00002-of-00003.safetensors",
633
+ "model.vision_embedding.0.encoder.layers.26.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
634
+ "model.vision_embedding.0.encoder.layers.26.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
635
+ "model.vision_embedding.0.encoder.layers.26.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
636
+ "model.vision_embedding.0.encoder.layers.26.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
637
+ "model.vision_embedding.0.encoder.layers.26.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
638
+ "model.vision_embedding.0.encoder.layers.26.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
639
+ "model.vision_embedding.0.encoder.layers.26.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
640
+ "model.vision_embedding.0.encoder.layers.26.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
641
+ "model.vision_embedding.0.encoder.layers.3.layer_norm1.bias": "model-00002-of-00003.safetensors",
642
+ "model.vision_embedding.0.encoder.layers.3.layer_norm1.weight": "model-00002-of-00003.safetensors",
643
+ "model.vision_embedding.0.encoder.layers.3.layer_norm2.bias": "model-00002-of-00003.safetensors",
644
+ "model.vision_embedding.0.encoder.layers.3.layer_norm2.weight": "model-00002-of-00003.safetensors",
645
+ "model.vision_embedding.0.encoder.layers.3.mlp.fc1.bias": "model-00002-of-00003.safetensors",
646
+ "model.vision_embedding.0.encoder.layers.3.mlp.fc1.weight": "model-00002-of-00003.safetensors",
647
+ "model.vision_embedding.0.encoder.layers.3.mlp.fc2.bias": "model-00002-of-00003.safetensors",
648
+ "model.vision_embedding.0.encoder.layers.3.mlp.fc2.weight": "model-00002-of-00003.safetensors",
649
+ "model.vision_embedding.0.encoder.layers.3.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
650
+ "model.vision_embedding.0.encoder.layers.3.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
651
+ "model.vision_embedding.0.encoder.layers.3.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
652
+ "model.vision_embedding.0.encoder.layers.3.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
653
+ "model.vision_embedding.0.encoder.layers.3.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
654
+ "model.vision_embedding.0.encoder.layers.3.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
655
+ "model.vision_embedding.0.encoder.layers.3.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
656
+ "model.vision_embedding.0.encoder.layers.3.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
657
+ "model.vision_embedding.0.encoder.layers.4.layer_norm1.bias": "model-00002-of-00003.safetensors",
658
+ "model.vision_embedding.0.encoder.layers.4.layer_norm1.weight": "model-00002-of-00003.safetensors",
659
+ "model.vision_embedding.0.encoder.layers.4.layer_norm2.bias": "model-00002-of-00003.safetensors",
660
+ "model.vision_embedding.0.encoder.layers.4.layer_norm2.weight": "model-00002-of-00003.safetensors",
661
+ "model.vision_embedding.0.encoder.layers.4.mlp.fc1.bias": "model-00002-of-00003.safetensors",
662
+ "model.vision_embedding.0.encoder.layers.4.mlp.fc1.weight": "model-00002-of-00003.safetensors",
663
+ "model.vision_embedding.0.encoder.layers.4.mlp.fc2.bias": "model-00002-of-00003.safetensors",
664
+ "model.vision_embedding.0.encoder.layers.4.mlp.fc2.weight": "model-00002-of-00003.safetensors",
665
+ "model.vision_embedding.0.encoder.layers.4.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
666
+ "model.vision_embedding.0.encoder.layers.4.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
667
+ "model.vision_embedding.0.encoder.layers.4.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
668
+ "model.vision_embedding.0.encoder.layers.4.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
669
+ "model.vision_embedding.0.encoder.layers.4.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
670
+ "model.vision_embedding.0.encoder.layers.4.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
671
+ "model.vision_embedding.0.encoder.layers.4.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
672
+ "model.vision_embedding.0.encoder.layers.4.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
673
+ "model.vision_embedding.0.encoder.layers.5.layer_norm1.bias": "model-00002-of-00003.safetensors",
674
+ "model.vision_embedding.0.encoder.layers.5.layer_norm1.weight": "model-00002-of-00003.safetensors",
675
+ "model.vision_embedding.0.encoder.layers.5.layer_norm2.bias": "model-00002-of-00003.safetensors",
676
+ "model.vision_embedding.0.encoder.layers.5.layer_norm2.weight": "model-00002-of-00003.safetensors",
677
+ "model.vision_embedding.0.encoder.layers.5.mlp.fc1.bias": "model-00002-of-00003.safetensors",
678
+ "model.vision_embedding.0.encoder.layers.5.mlp.fc1.weight": "model-00002-of-00003.safetensors",
679
+ "model.vision_embedding.0.encoder.layers.5.mlp.fc2.bias": "model-00002-of-00003.safetensors",
680
+ "model.vision_embedding.0.encoder.layers.5.mlp.fc2.weight": "model-00002-of-00003.safetensors",
681
+ "model.vision_embedding.0.encoder.layers.5.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
682
+ "model.vision_embedding.0.encoder.layers.5.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
683
+ "model.vision_embedding.0.encoder.layers.5.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
684
+ "model.vision_embedding.0.encoder.layers.5.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
685
+ "model.vision_embedding.0.encoder.layers.5.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
686
+ "model.vision_embedding.0.encoder.layers.5.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
687
+ "model.vision_embedding.0.encoder.layers.5.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
688
+ "model.vision_embedding.0.encoder.layers.5.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
689
+ "model.vision_embedding.0.encoder.layers.6.layer_norm1.bias": "model-00002-of-00003.safetensors",
690
+ "model.vision_embedding.0.encoder.layers.6.layer_norm1.weight": "model-00002-of-00003.safetensors",
691
+ "model.vision_embedding.0.encoder.layers.6.layer_norm2.bias": "model-00002-of-00003.safetensors",
692
+ "model.vision_embedding.0.encoder.layers.6.layer_norm2.weight": "model-00002-of-00003.safetensors",
693
+ "model.vision_embedding.0.encoder.layers.6.mlp.fc1.bias": "model-00002-of-00003.safetensors",
694
+ "model.vision_embedding.0.encoder.layers.6.mlp.fc1.weight": "model-00002-of-00003.safetensors",
695
+ "model.vision_embedding.0.encoder.layers.6.mlp.fc2.bias": "model-00002-of-00003.safetensors",
696
+ "model.vision_embedding.0.encoder.layers.6.mlp.fc2.weight": "model-00002-of-00003.safetensors",
697
+ "model.vision_embedding.0.encoder.layers.6.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
698
+ "model.vision_embedding.0.encoder.layers.6.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
699
+ "model.vision_embedding.0.encoder.layers.6.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
700
+ "model.vision_embedding.0.encoder.layers.6.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
701
+ "model.vision_embedding.0.encoder.layers.6.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
702
+ "model.vision_embedding.0.encoder.layers.6.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
703
+ "model.vision_embedding.0.encoder.layers.6.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
704
+ "model.vision_embedding.0.encoder.layers.6.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
705
+ "model.vision_embedding.0.encoder.layers.7.layer_norm1.bias": "model-00002-of-00003.safetensors",
706
+ "model.vision_embedding.0.encoder.layers.7.layer_norm1.weight": "model-00002-of-00003.safetensors",
707
+ "model.vision_embedding.0.encoder.layers.7.layer_norm2.bias": "model-00002-of-00003.safetensors",
708
+ "model.vision_embedding.0.encoder.layers.7.layer_norm2.weight": "model-00002-of-00003.safetensors",
709
+ "model.vision_embedding.0.encoder.layers.7.mlp.fc1.bias": "model-00002-of-00003.safetensors",
710
+ "model.vision_embedding.0.encoder.layers.7.mlp.fc1.weight": "model-00002-of-00003.safetensors",
711
+ "model.vision_embedding.0.encoder.layers.7.mlp.fc2.bias": "model-00002-of-00003.safetensors",
712
+ "model.vision_embedding.0.encoder.layers.7.mlp.fc2.weight": "model-00002-of-00003.safetensors",
713
+ "model.vision_embedding.0.encoder.layers.7.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
714
+ "model.vision_embedding.0.encoder.layers.7.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
715
+ "model.vision_embedding.0.encoder.layers.7.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
716
+ "model.vision_embedding.0.encoder.layers.7.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
717
+ "model.vision_embedding.0.encoder.layers.7.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
718
+ "model.vision_embedding.0.encoder.layers.7.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
719
+ "model.vision_embedding.0.encoder.layers.7.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
720
+ "model.vision_embedding.0.encoder.layers.7.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
721
+ "model.vision_embedding.0.encoder.layers.8.layer_norm1.bias": "model-00002-of-00003.safetensors",
722
+ "model.vision_embedding.0.encoder.layers.8.layer_norm1.weight": "model-00002-of-00003.safetensors",
723
+ "model.vision_embedding.0.encoder.layers.8.layer_norm2.bias": "model-00002-of-00003.safetensors",
724
+ "model.vision_embedding.0.encoder.layers.8.layer_norm2.weight": "model-00002-of-00003.safetensors",
725
+ "model.vision_embedding.0.encoder.layers.8.mlp.fc1.bias": "model-00002-of-00003.safetensors",
726
+ "model.vision_embedding.0.encoder.layers.8.mlp.fc1.weight": "model-00002-of-00003.safetensors",
727
+ "model.vision_embedding.0.encoder.layers.8.mlp.fc2.bias": "model-00002-of-00003.safetensors",
728
+ "model.vision_embedding.0.encoder.layers.8.mlp.fc2.weight": "model-00002-of-00003.safetensors",
729
+ "model.vision_embedding.0.encoder.layers.8.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
730
+ "model.vision_embedding.0.encoder.layers.8.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
731
+ "model.vision_embedding.0.encoder.layers.8.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
732
+ "model.vision_embedding.0.encoder.layers.8.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
733
+ "model.vision_embedding.0.encoder.layers.8.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
734
+ "model.vision_embedding.0.encoder.layers.8.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
735
+ "model.vision_embedding.0.encoder.layers.8.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
736
+ "model.vision_embedding.0.encoder.layers.8.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
737
+ "model.vision_embedding.0.encoder.layers.9.layer_norm1.bias": "model-00002-of-00003.safetensors",
738
+ "model.vision_embedding.0.encoder.layers.9.layer_norm1.weight": "model-00002-of-00003.safetensors",
739
+ "model.vision_embedding.0.encoder.layers.9.layer_norm2.bias": "model-00002-of-00003.safetensors",
740
+ "model.vision_embedding.0.encoder.layers.9.layer_norm2.weight": "model-00002-of-00003.safetensors",
741
+ "model.vision_embedding.0.encoder.layers.9.mlp.fc1.bias": "model-00002-of-00003.safetensors",
742
+ "model.vision_embedding.0.encoder.layers.9.mlp.fc1.weight": "model-00002-of-00003.safetensors",
743
+ "model.vision_embedding.0.encoder.layers.9.mlp.fc2.bias": "model-00002-of-00003.safetensors",
744
+ "model.vision_embedding.0.encoder.layers.9.mlp.fc2.weight": "model-00002-of-00003.safetensors",
745
+ "model.vision_embedding.0.encoder.layers.9.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
746
+ "model.vision_embedding.0.encoder.layers.9.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
747
+ "model.vision_embedding.0.encoder.layers.9.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
748
+ "model.vision_embedding.0.encoder.layers.9.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
749
+ "model.vision_embedding.0.encoder.layers.9.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
750
+ "model.vision_embedding.0.encoder.layers.9.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
751
+ "model.vision_embedding.0.encoder.layers.9.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
752
+ "model.vision_embedding.0.encoder.layers.9.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
753
+ "model.vision_embedding.0.post_layernorm.bias": "model-00002-of-00003.safetensors",
754
+ "model.vision_embedding.0.post_layernorm.weight": "model-00002-of-00003.safetensors",
755
+ "model.vision_embedding.1.weight": "model-00002-of-00003.safetensors",
756
+ "model.vision_embedding.3.weight": "model-00002-of-00003.safetensors"
757
+ }
758
+ }
modular_isaac.py ADDED
@@ -0,0 +1,1626 @@
1
+ from __future__ import annotations
2
+
3
+ from collections import defaultdict
4
+ from typing import Any, Union, TypedDict
5
+
6
+ import math
7
+ import numpy as np
8
+ import torch
9
+ import torch.nn as nn
10
+ import torch.nn.functional as F
11
+ import PIL.Image
12
+
13
+
14
+ from transformers import (
15
+ AutoTokenizer,
16
+ BatchFeature,
17
+ Cache,
18
+ Qwen3Config,
19
+ Qwen3ForCausalLM,
20
+ Qwen3PreTrainedModel,
21
+ )
22
+ from transformers.cache_utils import SlidingWindowCache, StaticCache
23
+ from transformers.generation.utils import GenerationMixin
24
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
25
+ from transformers.models.qwen3.modeling_qwen3 import Qwen3DecoderLayer, Qwen3Model
26
+ from transformers.models.qwen2.tokenization_qwen2 import Qwen2Tokenizer
27
+ from transformers.processing_utils import ProcessorMixin
28
+ from transformers.tokenization_utils import TensorType
29
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
30
+ import re
31
+
32
+ from transformers.models.siglip2.modeling_siglip2 import (
33
+ Siglip2MLP,
34
+ )
35
+ from transformers.models.siglip2.configuration_siglip2 import Siglip2VisionConfig
36
+ from perceptron.tensorstream import (
37
+ Event,
38
+ Stream,
39
+ TensorStream,
40
+ TextType,
41
+ VisionType,
42
+ create_stream,
43
+ group_streams,
44
+ )
45
+ from perceptron.tensorstream.ops import (
46
+ compute_mrope_pos_tensor,
47
+ modality_mask,
48
+ reconstruct_tensor_stream_from_compact_dict,
49
+ slice as ts_slice,
50
+ tensor_stream_token_view,
51
+ )
52
+
53
+
54
+ class PixelShuffleSiglip2VisionConfig(Siglip2VisionConfig):
55
+ """Vision configuration for Isaac with Pixel Shuffle support.
56
+
57
+ Extends Siglip2VisionConfig with additional fields for pixel shuffle.
58
+ """
59
+
60
+ model_type = "pixel_shuffle_siglip2"
61
+ base_config_key = "vision_config"
62
+
63
+ def __init__(
64
+ self,
65
+ pixel_shuffle_scale_factor: int = 1,
66
+ num_patches: int = 256,
67
+ **kwargs,
68
+ ):
69
+ super().__init__(**kwargs)
70
+
71
+ # Add our custom fields
72
+ self.pixel_shuffle_scale_factor = pixel_shuffle_scale_factor
73
+ self.num_patches = num_patches
74
+
75
+
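# A minimal usage sketch (illustrative values, not necessarily the released defaults):
#   cfg = PixelShuffleSiglip2VisionConfig(pixel_shuffle_scale_factor=2, num_patches=256)
# The config keeps every Siglip2VisionConfig field (hidden_size, patch_size, ...) and
# adds the two pixel-shuffle fields read by the vision tower defined below.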
76
+ def create_cumulative_seq_lengths(seq_sizes: torch.Tensor, device: torch.device) -> tuple[torch.Tensor, int]:
77
+ """Create cumulative sequence lengths for variable-length attention."""
78
+ cu_seqlens = torch.zeros(len(seq_sizes) + 1, dtype=torch.int32, device=device)
79
+ cu_seqlens[1:] = seq_sizes.cumsum(0)
80
+ max_seqlen = int(seq_sizes.max().item()) if len(seq_sizes) > 0 else 0
81
+ return cu_seqlens, max_seqlen
82
+
83
+
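# Worked example (hypothetical sizes): three packed images with 4, 6 and 2 patches,
# i.e. seq_sizes = tensor([4, 6, 2]), yield cu_seqlens = [0, 4, 10, 12] and
# max_seqlen = 6, the boundary format consumed by variable-length flash attention.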
84
+ class Siglip2VariableSequenceEmbeddings(nn.Module):
85
+ def __init__(self, config: PixelShuffleSiglip2VisionConfig):
86
+ super().__init__()
87
+ self.config = config
88
+ self.embed_dim = config.hidden_size
89
+ self.patch_size = config.patch_size
90
+
91
+ self.patch_embedding = nn.Linear(
92
+ in_features=config.num_channels * self.patch_size * self.patch_size,
93
+ out_features=self.embed_dim,
94
+ )
95
+
96
+ self.num_patches = config.num_patches
97
+ self.position_embedding_size = int(self.num_patches**0.5)
98
+ self.position_embedding = nn.Embedding(self.num_patches, self.embed_dim)
99
+
100
+ def positional_embeddings(
101
+ self, packed_seq_patches: tuple[torch.Tensor, torch.Tensor, torch.Tensor]
102
+ ) -> torch.Tensor:
103
+ # Prepare positional embeddings grid: (1, embed_dim, h, w)
104
+ positional_embeddings = (
105
+ self.position_embedding.weight.reshape(self.position_embedding_size, self.position_embedding_size, -1)
106
+ .permute(2, 0, 1)
107
+ .unsqueeze(0)
108
+ )
109
+
110
+ _seq_patches, _seq_sizes, spatial_shapes = packed_seq_patches
111
+ pos_embeds_list = []
112
+ mode = "bilinear"
113
+ align_corners = False
114
+ antialias = True
115
+ for spatial_shape in spatial_shapes:
116
+ height, width = spatial_shape
117
+ # Guard to ensure height and width are positive for torch.compile
118
+ if height > 0 and width > 0:
119
+ resized_pos_embed = F.interpolate(
120
+ positional_embeddings,
121
+ size=(height, width),
122
+ mode=mode,
123
+ align_corners=align_corners,
124
+ antialias=antialias,
125
+ )
126
+ # Reshape from (1, embed_dim, height, width) to (height*width, embed_dim)
127
+ resized_pos_embed = resized_pos_embed.reshape(self.embed_dim, height * width).transpose(0, 1)
128
+ else:
129
+ # Fallback - should never happen in practice
130
+ resized_pos_embed = positional_embeddings.reshape(
131
+ self.embed_dim, self.position_embedding_size * self.position_embedding_size
132
+ ).transpose(0, 1)[: height * width]
133
+ pos_embeds_list.append(resized_pos_embed)
134
+
135
+ # Concatenate all positional embeddings along the sequence dimension
136
+ pos_embeds = torch.cat(pos_embeds_list, dim=0)
137
+ return pos_embeds
138
+
139
+ def forward(self, packed_seq_patches: tuple[torch.Tensor, torch.Tensor, torch.Tensor]):
140
+ seq_patches, _seq_sizes, _spatial_shapes = packed_seq_patches
141
+
142
+ # Apply patch embeddings
143
+ target_dtype = self.patch_embedding.weight.dtype
144
+ patch_embeds = self.patch_embedding(seq_patches.to(dtype=target_dtype))
145
+ pos_embeds = self.positional_embeddings(packed_seq_patches)
146
+
147
+ # Add positional embeddings to patch embeddings
148
+ embeddings = patch_embeds + pos_embeds
149
+ return embeddings
150
+
151
+
152
+ class Siglip2VariableLengthAttention(nn.Module):
153
+ """Custom attention that supports variable-length sequences with flash attention."""
154
+
155
+ def __init__(self, config):
156
+ super().__init__()
157
+ self.config = config
158
+ self.embed_dim = config.hidden_size
159
+ self.num_heads = config.num_attention_heads
160
+ self.head_dim = self.embed_dim // self.num_heads
161
+ if self.head_dim * self.num_heads != self.embed_dim:
162
+ raise ValueError(
163
+ f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
164
+ f" {self.num_heads})."
165
+ )
166
+ self.scale = self.head_dim**-0.5
167
+ self.dropout = config.attention_dropout
168
+
169
+ self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
170
+ self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
171
+ self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
172
+ self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)
173
+
174
+ def forward(self, hidden_states, cu_seqlens=None, max_seqlen=None):
175
+ batch_size, seq_len, _ = hidden_states.size()
176
+
177
+ # For variable-length attention, we need to reshape to (total_tokens, embed_dim)
178
+ if batch_size != 1:
179
+ raise ValueError("Variable-length attention expects batch_size=1 for packed sequences")
180
+ hidden_states = hidden_states.squeeze(0) # Remove batch dimension: (seq_len, embed_dim)
181
+
182
+ # Store original dtype
183
+ orig_dtype = hidden_states.dtype
184
+
185
+ # 1. Linear projections
186
+ Q = self.q_proj(hidden_states) # (seq_len, embed_dim)
187
+ K = self.k_proj(hidden_states) # (seq_len, embed_dim)
188
+ V = self.v_proj(hidden_states) # (seq_len, embed_dim)
189
+
190
+ # 2. Reshape for multi-head attention: (seq_len, n_heads, head_dim)
191
+ Q = Q.view(-1, self.num_heads, self.embed_dim // self.num_heads)
192
+ K = K.view(-1, self.num_heads, self.embed_dim // self.num_heads)
193
+ V = V.view(-1, self.num_heads, self.embed_dim // self.num_heads)
194
+
195
+ # 3. Apply variable-length attention using flash attention
196
+ attn_output, _, _, _, _ = torch.ops.aten._flash_attention_forward(
197
+ query=Q,
198
+ key=K,
199
+ value=V,
200
+ cum_seq_q=cu_seqlens,
201
+ cum_seq_k=cu_seqlens,
202
+ max_q=max_seqlen,
203
+ max_k=max_seqlen,
204
+ dropout_p=self.dropout if self.training else 0.0,
205
+ is_causal=False,
206
+ return_debug_mask=False,
207
+ scale=self.scale,
208
+ window_size_left=-1,
209
+ window_size_right=-1,
210
+ alibi_slopes=None,
211
+ )
212
+
213
+ # 4. Reshape attention output from (seq_len, n_heads, head_dim) to (seq_len, embed_dim)
214
+ attn_output = attn_output.reshape(seq_len, self.embed_dim)
215
+
216
+ # 5. Convert back to original dtype if needed
217
+ if attn_output.dtype != orig_dtype:
218
+ attn_output = attn_output.to(orig_dtype)
219
+
220
+ # 6. Project output
221
+ attn_output = self.out_proj(attn_output) # (seq_len, embed_dim)
222
+
223
+ # 7. Add back batch dimension for compatibility
224
+ attn_output = attn_output.unsqueeze(0) # (1, seq_len, embed_dim)
225
+
226
+ return attn_output, None
227
+
228
+
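# Note: with packed inputs the per-image boundaries come from cu_seqlens, so tokens of
# different images never attend to each other; e.g. cu_seqlens = [0, 4, 10] restricts
# attention to tokens 0-3 and 4-9 as two independent segments.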
229
+ class IsaacSiglip2EncoderLayer(nn.Module):
230
+ """Siglip2 encoder layer with variable-length attention."""
231
+
232
+ def __init__(self, config: PixelShuffleSiglip2VisionConfig):
233
+ super().__init__()
234
+ self.embed_dim = config.hidden_size
235
+ self.self_attn = Siglip2VariableLengthAttention(config)
236
+
237
+ self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
238
+ self.mlp = Siglip2MLP(config) # Use HF's Siglip2MLP
239
+ self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
240
+
241
+ def forward(
242
+ self,
243
+ hidden_states: torch.Tensor,
244
+ cu_seqlens: torch.Tensor = None,
245
+ max_seqlen: int = None,
246
+ ) -> tuple[torch.FloatTensor]:
247
+ residual = hidden_states
248
+
249
+ hidden_states = self.layer_norm1(hidden_states)
250
+
251
+ hidden_states, attn_weights = self.self_attn(
252
+ hidden_states=hidden_states,
253
+ cu_seqlens=cu_seqlens,
254
+ max_seqlen=max_seqlen,
255
+ )
256
+
257
+ hidden_states = residual + hidden_states
258
+
259
+ residual = hidden_states
260
+ hidden_states = self.layer_norm2(hidden_states)
261
+ hidden_states = self.mlp(hidden_states)
262
+ hidden_states = residual + hidden_states
263
+
264
+ return (hidden_states,)
265
+
266
+
267
+ class IsaacEncoder(nn.Module):
268
+ """Encoder using Isaac encoder layers with variable-length attention support."""
269
+
270
+ def __init__(self, config: PixelShuffleSiglip2VisionConfig):
271
+ super().__init__()
272
+ self.config = config
273
+ self.layers = nn.ModuleList([IsaacSiglip2EncoderLayer(config) for _ in range(config.num_hidden_layers)])
274
+
275
+ def forward(
276
+ self,
277
+ inputs_embeds,
278
+ cu_seqlens: torch.Tensor | None = None,
279
+ max_seqlen: int | None = None,
280
+ output_hidden_states: bool = False,
281
+ ):
282
+ all_hidden_states = () if output_hidden_states else None
283
+
284
+ hidden_states = inputs_embeds
285
+
286
+ for encoder_layer in self.layers:
287
+ if output_hidden_states:
288
+ all_hidden_states = all_hidden_states + (hidden_states,)
289
+
290
+ layer_outputs = encoder_layer(
291
+ hidden_states,
292
+ cu_seqlens,
293
+ max_seqlen,
294
+ )
295
+
296
+ hidden_states = layer_outputs[0]
297
+
298
+ if output_hidden_states:
299
+ all_hidden_states = all_hidden_states + (hidden_states,)
300
+
301
+ return hidden_states, all_hidden_states, None
302
+
303
+
304
+ def create_pixel_shuffle_index_map(
305
+ seq_sizes: torch.Tensor,
306
+ token_grids: torch.Tensor,
307
+ scale_factor: int = 1,
308
+ device: torch.device | None = None,
309
+ ) -> torch.Tensor:
310
+ """
311
+ Build a gather-index map that tells us, for every *output* token after
312
+ pixel-shuffle, which `scale_factor**2` *input* tokens are being merged.
313
+
314
+ Args
315
+ ----
316
+ seq_sizes : (num_images,) - #patches in each image (row-major order)
317
+ token_grids : (num_images,2) - (height, width) for every image
318
+ scale_factor : spatial down-scale factor (≥2)
319
+ device : (optional) overrides `seq_sizes.device`
320
+
321
+ Returns
322
+ -------
323
+ gather_idx : (new_total_seq_len, scale_factor**2) int64 tensor.
324
+ gather_idx[i, j] is the *flat* index into the *original*
325
+ packed sequence for the j-th sub-patch that forms the
326
+ i-th output token.
327
+ """
328
+ if device is None:
329
+ device = seq_sizes.device
330
+
331
+ r = int(scale_factor)
332
+ if r < 2:
333
+ raise ValueError("`scale_factor` must be ≥ 2")
334
+
335
+ # Safety: all spatial dims must be divisible by r
336
+ # Cannot run under torch compile fullgraph mode hence
337
+ if not torch.compiler.is_compiling():
338
+ if not ((token_grids[:, 0] % r == 0).all() and (token_grids[:, 1] % r == 0).all()):
339
+ raise AssertionError(
340
+ f"Every (H,W) in `token_grids` must be divisible by scale_factor={r}, got {token_grids.tolist()}"
341
+ )
342
+
343
+ gather_chunks: list[torch.Tensor] = []
344
+ tok_offset = 0
345
+
346
+ for seq_len, (h, w) in zip(seq_sizes.tolist(), token_grids.tolist(), strict=False):
347
+ # Build the (H, W) grid of flat indices for this image
348
+ grid = torch.arange(seq_len, device=device, dtype=torch.int64) + tok_offset
349
+ grid = grid.view(h, w) # (H, W)
350
+
351
+ # -------- identical ordering to your fixed-res routine --------
352
+ # Step 1: split width into blocks of r
353
+ grid = grid.view(h, w // r, r) # (H, W/r, r)
354
+ # Step 2: now split height into blocks of r
355
+ grid = grid.view(h // r, r, w // r, r) # (H/r, r, W/r, r)
356
+ # Step 3: final permutation to (H/r, W/r, r, r)
357
+ grid = grid.permute(0, 2, 1, 3).contiguous() # (H/r, W/r, r, r)
358
+ # Step 4: each (r, r) block forms one output token
359
+ gather_chunks.append(grid.reshape(-1, r * r)) # (H*W / r², r²)
360
+
361
+ tok_offset += seq_len
362
+
363
+ # Concatenate over all images in the packed batch
364
+ gather_idx = torch.cat(gather_chunks, dim=0) # (Σ_i HᵢWᵢ/r², r²)
365
+ return gather_idx
366
+
367
+
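# Worked example (hypothetical grid): one image with a 4x4 patch grid and scale_factor=2
# produces gather_idx = [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]],
# i.e. each output token collects one 2x2 block of the original row-major patch sequence.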
368
+ def pixel_shuffle_varlen(
369
+ x: torch.Tensor,
370
+ token_grids: torch.Tensor,
371
+ scale_factor: int = 1,
372
+ ) -> torch.Tensor:
373
+ r"""Apply pixel shuffle to a packed vision sequence without unpacking per image.
374
+
375
+ Args:
376
+ x (`torch.Tensor`):
377
+ Concatenated vision embeddings. Accepts `(seq_len, hidden_size)` or `(1, seq_len, hidden_size)` shapes
378
+ produced by stacking image patches.
379
+ token_grids (`torch.Tensor`):
380
+ Integer tensor of shape `(num_images, 2)` whose rows give the `(height, width)` patch grid sizes
381
+ corresponding to each image segment inside `x`.
382
+ scale_factor (`int`, *optional*, defaults to 1):
383
+ Spatial down-sampling factor specific to pixel shuffle. Values greater than one merge `scale_factor**2` neighboring patches into a
384
+ single embedding channel-group.
385
+
386
+ Returns:
387
+ `torch.Tensor`: Pixel-shuffled embeddings with shape matching the input convention:
388
+ `(seq_len, hidden_size * scale_factor**2)` when the input was 2D, or `(1, seq_len, hidden_size * scale_factor**2)`
389
+ if the singleton batch dimension was present.
390
+
391
+ Raises:
392
+ ValueError: If more than one batch item is provided.
393
+ """
394
+ keep_batch_dim = x.dim() == 3
395
+ if keep_batch_dim:
396
+ if x.size(0) != 1:
397
+ raise AssertionError("Packed sequence is expected to have batch_size == 1")
398
+ x_ = x.squeeze(0) # (seq, embed)
399
+ else:
400
+ x_ = x # (seq, embed)
401
+
402
+ embed_dim = x_.size(-1)
403
+ r = int(scale_factor)
404
+
405
+ # Calculate seq_sizes from token_grids
406
+ seq_sizes = torch.prod(token_grids, dim=-1)
407
+
408
+ # Build index map and gather in one go
409
+ gather_idx = create_pixel_shuffle_index_map(
410
+ seq_sizes=seq_sizes,
411
+ token_grids=token_grids,
412
+ scale_factor=r,
413
+ device=x_.device,
414
+ ) # (new_seq, r²)
415
+
416
+ # Gather → (new_seq, r², embed_dim)
417
+ gathered = x_[gather_idx] # fancy indexing keeps gradient
418
+
419
+ # Merge the r² group dimension into channels to finish the shuffle
420
+ out = gathered.reshape(gathered.size(0), embed_dim * r * r)
421
+
422
+ # Restore batch dimension if needed
423
+ if keep_batch_dim:
424
+ out = out.unsqueeze(0)
425
+ return out
426
+
427
+
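# Shape sketch (hypothetical sizes): for a single 4x4 image packed as x of shape
# (1, 16, hidden) with scale_factor=2, the output has shape (1, 4, 4 * hidden):
# four times fewer tokens, each four times wider.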
428
+ class Siglip2SequenceVisionTransformer(nn.Module):
429
+ def __init__(self, config: PixelShuffleSiglip2VisionConfig):
430
+ super().__init__()
431
+ self.config = config
432
+ self.embeddings = Siglip2VariableSequenceEmbeddings(config)
433
+ self.encoder = IsaacEncoder(config)
434
+ self.post_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
435
+ self.pixel_shuffle_scale_factor = config.pixel_shuffle_scale_factor
436
+
437
+ def forward(self, packed_seq_patches: tuple[torch.Tensor, torch.Tensor]):
438
+ seq_patches, token_grids = packed_seq_patches
439
+ seq_sizes = torch.prod(token_grids, dim=-1)
440
+
441
+ # Get embeddings from packed sequence
442
+ hidden_states = self.embeddings((seq_patches, seq_sizes, token_grids))
443
+
444
+ # Add a pseudo batch dimension for the encoder
445
+ hidden_states = hidden_states.unsqueeze(0)
446
+
447
+ # Generate cumulative sequence lengths for variable-length attention
448
+ cu_seqlens, max_seqlen = create_cumulative_seq_lengths(seq_sizes, hidden_states.device)
449
+
450
+ # Pass through encoder with variable-length attention parameters
451
+ hidden_states, _, _ = self.encoder(
452
+ inputs_embeds=hidden_states,
453
+ cu_seqlens=cu_seqlens,
454
+ max_seqlen=max_seqlen,
455
+ )
456
+
457
+ # Apply final layer normalization
458
+ hidden_states = self.post_layernorm(hidden_states)
459
+
460
+ if self.pixel_shuffle_scale_factor > 1:
461
+ hidden_states = pixel_shuffle_varlen(
462
+ x=hidden_states,
463
+ token_grids=token_grids,
464
+ scale_factor=self.pixel_shuffle_scale_factor,
465
+ )
466
+ # Remove the pseudo batch dimension we added earlier
467
+ hidden_states = hidden_states.squeeze(0)
468
+
469
+ # Return the full sequence of embeddings
470
+ return hidden_states
471
+
472
+
473
+ # ============================================================================
474
+ # Configuration
475
+ # ============================================================================
476
+
477
+ MAX_PIXELS = 60_000_000 # 60‑megapixel ceiling ≈ 8200 × 7300 px
478
+
479
+ # Vision preprocessing constants
480
+ VISION_MEAN = (0.5, 0.5, 0.5)
481
+ VISION_STD = (0.5, 0.5, 0.5)
482
+ VISION_SCALE = 1 / 255
483
+
484
+
485
+ def _make_writeable(arr: np.ndarray) -> np.ndarray:
486
+ """Return *arr* itself if it is already writeable, otherwise try to flip the
487
+ write flag in-place and finally fall back to `arr.copy()`.
488
+ This guarantees the buffer handed to `torch.from_numpy()` is always
489
+ writeable, silencing the PyTorch warning about undefined behaviour.
490
+ """
491
+ if arr.flags.writeable:
492
+ return arr
493
+
494
+ # First, try the cheap path — in‑place flag toggle (works for mmap'd arrays
495
+ # and some shared memory buffers):
496
+ try:
497
+ arr.setflags(write=True)
498
+ return arr # success: no data copy
499
+ except ValueError:
500
+ # Buffer is inherently read‑only (e.g. backed by PyAV / PIL): make copy
501
+ return arr.copy()
502
+
503
+
504
+ def extract_image_pil(image: PIL.Image.Image) -> torch.Tensor | None:
505
+ if image.width * image.height > MAX_PIXELS:
506
+ raise ValueError(f"Image (w={image.width}, h={image.height}) > MAX=`{MAX_PIXELS}`")
507
+ img = image if image.mode == "RGB" else image.convert("RGB")
508
+ arr = np.asarray(img)
509
+ arr = _make_writeable(arr)
510
+ return torch.from_numpy(arr)
511
+
512
+
513
+ def get_image_size_for_max_num_patches(
514
+ image_height: int,
515
+ image_width: int,
516
+ patch_size: int,
517
+ max_num_patches: int,
518
+ min_num_patches: int | None = None,
519
+ eps: float = 1e-5,
520
+ pixel_shuffle_scale: int = 1,
521
+ ) -> tuple[int, int]:
522
+ r"""Compute a target resolution whose patch grid satisfies the patch-count constraints.
523
+
524
+ Args:
525
+ image_height (`int`):
526
+ Height in pixels of the source image prior to any resizing.
527
+ image_width (`int`):
528
+ Width in pixels of the source image prior to any resizing.
529
+ patch_size (`int`):
530
+ Size of the square patch used by the vision encoder.
531
+ max_num_patches (`int`):
532
+ Upper bound on `(height / patch_size) * (width / patch_size)` after resizing.
533
+ min_num_patches (`int`, *optional*):
534
+ Lower bound on the number of patches. When provided the image will be scaled up if necessary.
535
+ eps (`float`, *optional*, defaults to 1e-5):
536
+ Convergence tolerance for the internal binary search to determine the target dimensions.
537
+ pixel_shuffle_scale (`int`, *optional*, defaults to 1):
538
+ Additional stride multiplier applied when pixel shuffle later reduces spatial resolution.
539
+
540
+ Returns:
541
+ `tuple[int, int]`: Height and width (in pixels) that are multiples of `patch_size * pixel_shuffle_scale`
542
+ and respect both the maximum and optional minimum patch-count constraints.
543
+ """
544
+
545
+ def get_scaled_image_size(scale, original_size, patch_size, pixel_shuffle_scale):
546
+ scaled_size = scale * original_size
547
+ divisor = patch_size * pixel_shuffle_scale
548
+ scaled_size = math.ceil(scaled_size / divisor) * divisor
549
+ scaled_size = max(divisor, scaled_size)
550
+ return int(scaled_size)
551
+
552
+ # Ensure divisibility
553
+ divisor = patch_size * pixel_shuffle_scale
554
+ adjusted_height = math.ceil(image_height / divisor) * divisor
555
+ adjusted_height = max(divisor, adjusted_height)
556
+ adjusted_width = math.ceil(image_width / divisor) * divisor
557
+ adjusted_width = max(divisor, adjusted_width)
558
+
559
+ num_patches = (adjusted_height / patch_size) * (adjusted_width / patch_size)
560
+
561
+ if min_num_patches is not None and num_patches < min_num_patches:
562
+ # Scale up
563
+ scale_min, scale_max = 1.0, 100.0
564
+ while (scale_max - scale_min) >= eps:
565
+ scale = (scale_min + scale_max) / 2
566
+ target_height = get_scaled_image_size(scale, image_height, patch_size, pixel_shuffle_scale)
567
+ target_width = get_scaled_image_size(scale, image_width, patch_size, pixel_shuffle_scale)
568
+ num_patches = (target_height / patch_size) * (target_width / patch_size)
569
+ if num_patches >= min_num_patches:
570
+ scale_max = scale
571
+ else:
572
+ scale_min = scale
573
+ scale = scale_max
574
+ target_height = get_scaled_image_size(scale, image_height, patch_size, pixel_shuffle_scale)
575
+ target_width = get_scaled_image_size(scale, image_width, patch_size, pixel_shuffle_scale)
576
+ return target_height, target_width
577
+ elif num_patches <= max_num_patches:
578
+ return adjusted_height, adjusted_width
579
+ else:
580
+ # Scale down
581
+ scale_min, scale_max = eps / 10, 1.0
582
+ while (scale_max - scale_min) >= eps:
583
+ scale = (scale_min + scale_max) / 2
584
+ target_height = get_scaled_image_size(scale, image_height, patch_size, pixel_shuffle_scale)
585
+ target_width = get_scaled_image_size(scale, image_width, patch_size, pixel_shuffle_scale)
586
+ num_patches = (target_height / patch_size) * (target_width / patch_size)
587
+ if num_patches <= max_num_patches:
588
+ scale_min = scale
589
+ else:
590
+ scale_max = scale
591
+ scale = scale_min
592
+ target_height = get_scaled_image_size(scale, image_height, patch_size, pixel_shuffle_scale)
593
+ target_width = get_scaled_image_size(scale, image_width, patch_size, pixel_shuffle_scale)
594
+ return target_height, target_width
595
+
596
+
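# Worked example (hypothetical inputs): a 100x80 image with patch_size=16,
# max_num_patches=256 and pixel_shuffle_scale=1 is only rounded up to multiples of 16,
# returning (112, 80), i.e. a 7x5 = 35-patch grid that already fits the budget.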
597
+ _MEAN_TENSOR = torch.tensor(VISION_MEAN, dtype=torch.float32).view(1, 1, 1, -1)
598
+ _STD_TENSOR = torch.tensor(VISION_STD, dtype=torch.float32).view(1, 1, 1, -1)
599
+
600
+
601
+ def prepare_image_tensor(
602
+ image: torch.Tensor,
603
+ scale: float = VISION_SCALE,
604
+ ) -> torch.Tensor:
605
+ r"""Standardize RGB images prior to patch extraction via rescaling and whitening.
606
+
607
+ Args:
608
+ image (`torch.Tensor`):
609
+ Tensor with shape `(..., height, width, 3)` containing RGB values. The tensor is converted to floating
610
+ point if needed.
611
+ scale (`float`, *optional*, defaults to `VISION_SCALE`):
612
+ Scalar multiplier applied before normalization.
613
+ Returns:
614
+ `torch.Tensor`: Normalized tensor with the same shape as the input and dtype `torch.float32`.
615
+ """
616
+ if not torch.is_floating_point(image):
617
+ image = image.float()
618
+ rescaled = image * scale
619
+
620
+ # Use precomputed tensors and move to the correct device if needed
621
+ mean_tensor = _MEAN_TENSOR.to(image.device)
622
+ std_tensor = _STD_TENSOR.to(image.device)
623
+
624
+ normalized = (rescaled - mean_tensor) / std_tensor
625
+ return normalized
626
+
627
+
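# Numeric sketch: with the constants above, a raw channel value of 255 maps to
# (255 * 1/255 - 0.5) / 0.5 = 1.0 and a value of 0 maps to -1.0, so inputs are
# normalized into [-1, 1] before patch extraction.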
628
+ def patchify_vision(image: torch.Tensor, patch_size: int) -> torch.Tensor:
629
+ r"""Convert normalized images into flattened ViT-style patches.
630
+
631
+ Args:
632
+ image (`torch.Tensor`):
633
+ Tensor of shape `(num_images, height, width, channels)`.
634
+ patch_size (`int`):
635
+ Edge length of the square patches.
636
+
637
+ Returns:
638
+ `torch.Tensor`:
639
+ Patch tensor where each position stores the flattened pixels belonging to that patch.
640
+
641
+ Raises:
642
+ ValueError: If `height` or `width` is not divisible by `patch_size`.
643
+ """
644
+ num_images, height, width, channels = image.shape
645
+ if height % patch_size or width % patch_size:
646
+ raise ValueError(f"Dimensions of images {image.shape} are not divisible by patch_size={patch_size}.")
647
+ patches = image.reshape(num_images, height // patch_size, patch_size, width // patch_size, patch_size, channels)
648
+ patches = patches.permute(0, 1, 3, 2, 4, 5)
649
+ patches = patches.reshape(num_images, height // patch_size, width // patch_size, channels * patch_size * patch_size)
650
+ return patches
651
+
652
+
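# Shape sketch (hypothetical sizes): an input of shape (1, 224, 224, 3) with
# patch_size=16 becomes (1, 14, 14, 768), since each patch flattens 3 * 16 * 16 values.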
653
+ def process_vision_for_patches(
654
+ images: torch.Tensor,
655
+ patch_size: int,
656
+ max_num_patches: int,
657
+ min_num_patches: int | None = None,
658
+ pixel_shuffle_scale: int = 1,
659
+ ) -> tuple[torch.Tensor, list[int]]:
660
+ r"""Resize, normalize, and patchify RGB images for the vision encoder.
661
+
662
+ Args:
663
+ images (`torch.Tensor`):
664
+ Either `(height, width, channels)` for a single image or `(num_images, height, width, channels)` for a
665
+ batch. Channels are expected to be RGB.
666
+ patch_size (`int`):
667
+ Edge length of square patches; implicitly controls resize grid granularity.
668
+ max_num_patches (`int`):
669
+ Maximum number of patches allowed after resizing.
670
+ min_num_patches (`int`, *optional*):
671
+ Minimum number of patches. If provided, the routine upsamples images as needed to satisfy the lower bound.
672
+ pixel_shuffle_scale (`int`, *optional*, defaults to 1):
673
+ Pixel shuffle scale factor; influences the target grid that the function produces.
674
+
675
+ Returns:
676
+ `tuple[torch.Tensor, list[int]]`: A pair `(patches, dims_virtual)` where `patches` has shape
677
+ `(num_images, target_h / patch_size, target_w / patch_size, channels * patch_size**2)` and `dims_virtual`
678
+ encodes effective `(images, height, width)` dimensions after optional pixel shuffling.
679
+ """
680
+ # Add batch dim if single image
681
+ if images.dim() == 3:
682
+ images = images.unsqueeze(0)
683
+
684
+ # Permute to channel first for resize
685
+ images = images.permute(0, 3, 1, 2)
686
+
687
+ # Get target dimensions
688
+ _, _, orig_height, orig_width = images.shape
689
+ target_height, target_width = get_image_size_for_max_num_patches(
690
+ orig_height,
691
+ orig_width,
692
+ patch_size,
693
+ max_num_patches,
694
+ min_num_patches=min_num_patches,
695
+ pixel_shuffle_scale=pixel_shuffle_scale,
696
+ )
697
+
698
+ # Resize
699
+ images = F.interpolate(
700
+ images,
701
+ size=(target_height, target_width),
702
+ mode="bilinear",
703
+ align_corners=False,
704
+ )
705
+
706
+ # Back to channel last
707
+ images = images.permute(0, 2, 3, 1)
708
+
709
+ # Normalize
710
+ images = prepare_image_tensor(images)
711
+
712
+ # Patchify
713
+ patches = patchify_vision(images, patch_size=patch_size)
714
+
715
+ # Calculate dimensions for the patches
716
+ n_images, h_patches, w_patches, _ = patches.shape
717
+ dims_virtual = (
718
+ [1, h_patches, w_patches]
719
+ if pixel_shuffle_scale == 1
720
+ else [1, h_patches // pixel_shuffle_scale, w_patches // pixel_shuffle_scale]
721
+ )
722
+
723
+ return patches, dims_virtual
724
+
725
+
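+ # Illustrative sketch (image_hwc is an assumed (H, W, 3) tensor): resize, normalize, and patchify
+ # in one call; with pixel_shuffle_scale == 1, dims_virtual is [1, h_patches, w_patches]:
+ #     patches, dims_virtual = process_vision_for_patches(image_hwc, patch_size=16, max_num_patches=256)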
726
+ def precompute_inv_freq(theta: float, dim: int) -> torch.Tensor:
727
+ """
728
+ Returns shape (dim//2,).
729
+ """
730
+ inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
731
+ return inv_freq # type: ignore[return-value]
732
+
733
+
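+ # Illustrative sketch: the standard RoPE frequency ladder; e.g. theta=10000 and dim=128
+ # yields 64 inverse frequencies decaying geometrically from 1.0:
+ #     inv_freq = precompute_inv_freq(10000.0, 128)  # shape (64,)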
734
+ def precompute_cos_sin_3d(
735
+ position_ids: torch.Tensor, # shape (3, B, T)
736
+ inv_freq: torch.Tensor, # shape (dim//2,)
737
+ mrope_half_section: list[int], # sum to dim//2
738
+ ) -> tuple[torch.Tensor, torch.Tensor]:
739
+ r"""Generate 3D rotary embeddings for multi-axis positions.
740
+
741
+ Args:
742
+ position_ids (`torch.Tensor`):
743
+ Tensor of shape `(3, batch_size, seq_len)` containing positional indices for the x/y/t axes.
744
+ inv_freq (`torch.Tensor`):
745
+ Precomputed inverse frequency vector used to derive rotary phases.
746
+ mrope_half_section (`list[int]`):
747
+ Sizes the axis-specific frequency blocks.
748
+
749
+ Returns:
750
+ `tuple[torch.Tensor, torch.Tensor]`: Cosine and sine tensors, each of shape `(batch_size, seq_len, dim)`, ready
751
+ to be passed into rotary attention layers.
752
+ """
753
+ B = position_ids.shape[1]
754
+ T = position_ids.shape[2]
755
+ dim_half = inv_freq.shape[0]
756
+ device = position_ids.device
757
+
758
+ # Initialize with full dimension (not half) to match LLaMA
759
+ cos_3d = torch.zeros((B, T, dim_half * 2), dtype=torch.float32, device=device)
760
+ sin_3d = torch.zeros((B, T, dim_half * 2), dtype=torch.float32, device=device)
761
+
762
+ offset = 0
763
+ for d in range(3):
764
+ block_size = mrope_half_section[d]
765
+ freq_slice = inv_freq[offset : offset + block_size] # shape => (block_size,)
766
+ # shape => (B, T, block_size)
767
+ phase = position_ids[d].unsqueeze(-1).float() * freq_slice
768
+
769
+ cos_part = phase.cos()
770
+ sin_part = phase.sin()
771
+
772
+ # Duplicate values for both halves of the dimension
773
+ cos_3d[:, :, offset : offset + block_size] = cos_part
774
+ cos_3d[:, :, dim_half + offset : dim_half + offset + block_size] = cos_part
775
+ sin_3d[:, :, offset : offset + block_size] = sin_part
776
+ sin_3d[:, :, dim_half + offset : dim_half + offset + block_size] = sin_part
777
+
778
+ offset += block_size
779
+
780
+ return cos_3d, sin_3d
781
+
782
+
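+ # Illustrative sketch (shapes assumed): with head_dim=128 the blocks [32, 16, 16] sum to dim_half=64,
+ # so cos/sin come back as (B, T, 128) with each axis' phases copied into both halves:
+ #     pos = torch.zeros(3, 1, 8, dtype=torch.long)
+ #     cos, sin = precompute_cos_sin_3d(pos, precompute_inv_freq(10000.0, 128), [32, 16, 16])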
783
+ class RopeScaling(TypedDict, total=False):
784
+ rope_type: str
785
+ factor: float
786
+ mrope_section: list[int]
787
+ mrope_interleaved: bool
788
+ low_freq_factor: float
789
+ high_freq_factor: float
790
+ original_max_position_embeddings: int
791
+
792
+
793
+ class IsaacConfig(Qwen3Config):
794
+ """Configuration class for Isaac multimodal model."""
795
+
796
+ model_type = "isaac"
797
+ sub_configs = {"vision_config": PixelShuffleSiglip2VisionConfig}
798
+
799
+ def __init__(
800
+ self,
801
+ vision_config=None,
802
+ vision_patch_size: int = 16,
803
+ vision_max_num_patches: int = 256,
804
+ vision_min_num_patches: int | None = None,
805
+ pixel_shuffle_scale: int = 1,
806
+ max_sequence_length: int = 16384,
807
+ vision_token: str = "<image>",
808
+ **kwargs,
809
+ ):
810
+ super().__init__(**kwargs)
811
+
812
+ # Handle vision config - either dict or PixelShuffleSiglip2VisionConfig instance
813
+ if isinstance(vision_config, dict):
814
+ self.vision_config = self.sub_configs["vision_config"](**vision_config)
815
+ elif vision_config is None:
816
+ self.vision_config = self.sub_configs["vision_config"]()
817
+ else:
818
+ self.vision_config = vision_config
819
+
820
+ # EventStreamProcessor parameters (for backward compatibility)
821
+ self.video_patch_size = vision_patch_size
822
+ self.vision_max_num_patches = vision_max_num_patches
823
+ self.vision_min_num_patches = vision_min_num_patches
824
+ self.pixel_shuffle_scale = pixel_shuffle_scale
825
+
826
+ # Processing parameters
827
+ self.max_sequence_length = max_sequence_length
828
+ self.vision_token = vision_token
829
+
830
+
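+ # Illustrative sketch (values assumed): vision_config may be omitted, a dict, or a
+ # PixelShuffleSiglip2VisionConfig instance; omitted fields fall back to defaults:
+ #     config = IsaacConfig(vision_patch_size=16, vision_max_num_patches=256, vision_token="<image>")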
831
+ # ============================================================================
832
+ # Processor Components
833
+ # ============================================================================
834
+
835
+
836
+ def create_text_event(tokenizer: AutoTokenizer, text: str, time: float = 0.0) -> Event:
837
+ r"""Wrap a text into an `Event` compatible with the multimodal TensorStream.
838
+
839
+ Args:
840
+ tokenizer (`AutoTokenizer`):
841
+ Tokenizer used to convert text into model vocabulary ids.
842
+ text (`str`):
843
+ Plain-text fragment to encode.
844
+ time (`float`, *optional*, defaults to 0.0):
845
+ Timeline coordinate associated with the event. Both start and end times use the same value because text
846
+ segments are instantaneous in the scheduler.
847
+
848
+ Returns:
849
+ `Event`: Event carrying a `(num_tokens, 1)` tensor of token ids with matching
850
+ metadata so that downstream processors can compute modality-specific embeddings.
851
+ """
852
+ tokens = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").squeeze(0)
853
+
854
+ # Calculate dimensions for the event
855
+ num_tokens = len(tokens)
856
+ dims_virtual = [num_tokens, 1] # [sequence_length, 1]
857
+ dims_real = dims_virtual.copy()
858
+
859
+ # Ensure tokens has the right shape for tensor_stream_token_view
860
+ # It expects a 2D tensor where sum(dim=-1) gives the token IDs
861
+ if tokens.dim() == 1:
862
+ tokens = tokens.unsqueeze(-1)
863
+
864
+ return Event(
865
+ data=tokens,
866
+ type=TextType.text,
867
+ time=(time, time),
868
+ dims_virtual=dims_virtual,
869
+ dims_real=dims_real,
870
+ idx_range=(0, num_tokens),
871
+ )
872
+
873
+
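+ # Illustrative sketch (`tokenizer` is an assumed, already-loaded tokenizer): the returned Event
+ # holds a (num_tokens, 1) tensor and an instantaneous (time, time) span:
+ #     event = create_text_event(tokenizer, "Describe the image:", time=0.0)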
874
+ # ============================================================================
875
+ # Processor
876
+ # ============================================================================
877
+
878
+
879
+ class IsaacProcessor(ProcessorMixin):
880
+ attributes = ["tokenizer"]
881
+ tokenizer_class = ("Qwen2Tokenizer", "Qwen2TokenizerFast")
882
+
883
+
884
+ def __init__(
885
+ self,
886
+ tokenizer: Qwen2Tokenizer,
887
+ config: IsaacConfig | dict,
888
+ ):
889
+ super().__init__(tokenizer)
890
+ self.tokenizer = tokenizer
891
+
892
+ if isinstance(config, dict):
893
+ config = IsaacConfig(**config)
894
+ self.config = config
895
+
896
+ # Use vision token from config
897
+ self.vision_token = config.vision_token
898
+
899
+ # Processing parameters
900
+ self.max_sequence_length = config.max_sequence_length
901
+
902
+ # Vision processing parameters
903
+ self.patch_size = config.video_patch_size
904
+ self.max_num_patches = config.vision_max_num_patches
905
+ self.min_num_patches = config.vision_min_num_patches
906
+ self.pixel_shuffle_scale = config.pixel_shuffle_scale
907
+
908
+ def apply_chat_template(
909
+ self,
910
+ messages: list[dict[str, Any]],
911
+ tokenize: bool = False,
912
+ add_generation_prompt: bool = False,
913
+ **kwargs,
914
+ ) -> Any:
915
+ return self.tokenizer.apply_chat_template(
916
+ messages, tokenize=tokenize, add_generation_prompt=add_generation_prompt, **kwargs
917
+ )
918
+
919
+ def build_event_stream_simple(
920
+ self,
921
+ text: str,
922
+ images: list[PIL.Image.Image] | None = None,
923
+ ) -> Stream:
924
+ events = []
925
+ # Process text and images
926
+ # Find all occurrences of vision token
927
+
928
+ pattern = re.escape(self.vision_token)
929
+ parts = re.split(f"({pattern})", text) # Keep the delimiter in the result
930
+
931
+ image_idx = 0
932
+ for current_time, part in enumerate(parts):
933
+ if part == self.vision_token:
934
+ # Replace vision token with image event
935
+ if images is not None and image_idx < len(images):
936
+ # Create vision event from PIL image
937
+ image_tensor = extract_image_pil(images[image_idx])
938
+ if image_tensor is not None:
939
+ # Create a vision event with the image tensor
940
+ vision_event = Event(
941
+ data=image_tensor.unsqueeze(0), # HWC format from extract_image_pil
942
+ type=VisionType.image, # I-frame
943
+ time=(current_time, current_time),
944
+ )
945
+ events.append(vision_event)
946
+ image_idx += 1
947
+ elif part: # Non-empty text part
948
+ # tokens = self.text_processor.tokenize(part, add_special_tokens=False)
949
+ text_event = create_text_event(self.tokenizer, part, time=current_time)
950
+ events.append(text_event)
951
+
952
+ # Process vision events if any
953
+ if any(event.type == VisionType.image for event in events):
954
+ # Separate text and vision events for processing
955
+ text_events = [event for event in events if event.type == TextType.text]
956
+ vision_events = [event for event in events if event.type == VisionType.image]
957
+
958
+ # Process vision events using functional approach
959
+ processed_vision_events = []
960
+ for vision_event in vision_events:
961
+ # Process the vision data
962
+ patches, dims_virtual = process_vision_for_patches(
963
+ vision_event.data.squeeze(0), # Remove the extra dimension
964
+ patch_size=self.patch_size,
965
+ max_num_patches=self.max_num_patches,
966
+ min_num_patches=self.min_num_patches,
967
+ pixel_shuffle_scale=self.pixel_shuffle_scale,
968
+ )
969
+
970
+ # Update event with processed data
971
+ vision_event.data = patches.unsqueeze(1) # Add back frame dimension
972
+ vision_event.dims_virtual = dims_virtual
973
+ vision_event.dims_real = (
974
+ dims_virtual
975
+ if self.pixel_shuffle_scale == 1
976
+ else [
977
+ dims_virtual[0],
978
+ dims_virtual[1] * self.pixel_shuffle_scale,
979
+ dims_virtual[2] * self.pixel_shuffle_scale,
980
+ ]
981
+ )
982
+ vision_event.idx_range = (0, math.prod(dims_virtual))
983
+
984
+ # Flatten the patches
985
+ vision_event.data = vision_event.data.reshape(-1, vision_event.data.shape[-1])
986
+ processed_vision_events.append(vision_event)
987
+
988
+ events = text_events + processed_vision_events
989
+
990
+ # Create the stream, letting the scheduler order events with text prioritized over vision
991
+ return create_stream(events, priority=[TextType.text, VisionType.image], schedule=True)
992
+
993
+ def __call__(
994
+ self,
995
+ text: Union[str, list[str]],
996
+ images: Union[PIL.Image.Image, list[PIL.Image.Image], None] = None,
997
+ return_tensors: str | TensorType | None = TensorType.PYTORCH,
998
+ **kwargs,
999
+ ) -> BatchFeature:
1000
+ """
1001
+ Process text and images into TensorStream format.
1002
+ Args:
1003
+ text: Input text or list of texts with vision tokens
1004
+ images: PIL image or list of images (optional)
1005
+ return_tensors: Format for output tensors
1006
+
1007
+ Returns:
1008
+ BatchFeature with input_ids and tensor_stream
1009
+ """
1010
+ # Normalize inputs to lists
1011
+ if isinstance(text, str):
1012
+ texts = [text]
1013
+ else:
1014
+ texts = text
1015
+
1016
+ if images is not None:
1017
+ if isinstance(images, PIL.Image.Image):
1018
+ images_list = [images]
1019
+ else:
1020
+ images_list = images
1021
+ else:
1022
+ images_list = None
1023
+
1024
+ if len(texts) != 1:
1025
+ raise ValueError("IsaacProcessor currently supports batch_size=1")
1026
+ if images_list is not None:
1027
+ # Count vision tokens in text to validate image count
1028
+ vision_token_count = texts[0].count(self.vision_token)
1029
+ if vision_token_count != len(images_list):
1030
+ raise ValueError(
1031
+ f"Number of {self.vision_token} tokens in text ({vision_token_count}) "
1032
+ f"must match number of images ({len(images_list)})"
1033
+ )
1034
+
1035
+ # Build event stream
1036
+ stream = self.build_event_stream_simple(
1037
+ text=texts[0],
1038
+ images=images_list,
1039
+ )
1040
+
1041
+ # Create TensorStream
1042
+ tensor_stream = TensorStream([stream])
1043
+
1044
+ # Slice to max length if needed
1045
+ _, T = tensor_stream.shape
1046
+ if T > self.max_sequence_length:
1047
+ tensor_stream = ts_slice(tensor_stream, start=T - self.max_sequence_length, end=T)
1048
+
1049
+ # Get token view
1050
+ tokens = tensor_stream_token_view(tensor_stream)
1051
+ if return_tensors in (TensorType.PYTORCH, "pt"):
1052
+ input_ids = torch.as_tensor(tokens, dtype=torch.long)
1053
+ else:
1054
+ input_ids = tokens
1055
+
1056
+ data = {
1057
+ "input_ids": input_ids,
1058
+ "tensor_stream": tensor_stream,
1059
+ }
1060
+
1061
+ return BatchFeature(data=data)
1062
+
1063
+
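+ # Illustrative usage sketch (`tokenizer`, `config`, and `pil_image` are assumed to exist): the
+ # number of "<image>" tokens in the text must equal the number of images passed in:
+ #     processor = IsaacProcessor(tokenizer=tokenizer, config=config)
+ #     batch = processor(text="<image> What is shown here?", images=[pil_image])
+ #     # batch contains "input_ids" and "tensor_stream"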
1064
+ # ============================================================================
1065
+ # Model
1066
+ # ============================================================================
1067
+
1068
+
1069
+ def compute_position_ids_input_ids(input_ids: torch.Tensor) -> torch.Tensor:
1070
+ r"""Create 3D positional indices for token input.
1071
+
1072
+ Args:
1073
+ input_ids (`torch.Tensor`):
1074
+ Tensor of shape `(batch_size, seq_len)` containing token ids.
1075
+
1076
+ Returns:
1077
+ `torch.Tensor`: Positional indices with shape `(batch_size, seq_len, 3)` where each channel duplicates the
1078
+ 1D position so it can be consumed by the 3-axis MRoPE rotary embedding.
1079
+ """
1080
+ batch_size, seq_length = input_ids.shape
1081
+ position_ids = torch.arange(seq_length, device=input_ids.device)
1082
+ position_ids = position_ids.view(1, -1).expand(batch_size, -1)
1083
+ position_ids = position_ids.unsqueeze(2).expand(-1, -1, 3) # Add 3D for MRoPE
1084
+ return position_ids
1085
+
1086
+
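+ # Illustrative sketch: plain text receives 1D positions replicated across the three MRoPE axes,
+ # e.g. a (2, 5) batch of ids yields a (2, 5, 3) tensor with identical channels:
+ #     pos = compute_position_ids_input_ids(torch.zeros(2, 5, dtype=torch.long))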
1087
+ class IsaacRotaryEmbedding(nn.Module):
1088
+ def __init__(self, config: IsaacConfig, device=None):
1089
+ super().__init__()
1090
+
1091
+ # Extract dimensions from config
1092
+ self.hidden_size = config.hidden_size
1093
+ self.num_attention_heads = config.num_attention_heads
1094
+ self.head_dim = config.head_dim
1095
+
1096
+ # Get rope_scaling config - use direct access when available
1097
+ rope_scaling = getattr(config, "rope_scaling", None) or {}
1098
+
1099
+ # Read RopeScaling parameters
1100
+ self.rope_type = rope_scaling.get("rope_type", "default")
1101
+
1102
+ self.mrope_section = [
1103
+ self.head_dim // 4, # 2x more for temporal dim
1104
+ self.head_dim // 8,
1105
+ self.head_dim // 8,
1106
+ ]
1107
+
1108
+ rope_base = getattr(config, "rope_theta", 10000.0)
1109
+ inv_freq = precompute_inv_freq(rope_base, self.head_dim)
1110
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
1111
+
1112
+ def forward(self, position_ids: torch.Tensor, modality_tensor: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
1113
+ with torch.no_grad():
1114
+ # Ensure non-spatial tokens have 1D rotation equivalence
1115
+ not_spatial = ~(modality_tensor == VisionType.image.value)
1116
+ # shape is [N, 1]
1117
+ data_1d = position_ids[not_spatial][..., 0].unsqueeze(-1)
1118
+ # now broadcast it from [N, 1] -> [N, D] so it matches pos[not_spatial] exactly
1119
+ data_1d = data_1d.expand(-1, position_ids.shape[-1]) # expand along the last dim
1120
+ position_ids = position_ids.clone() # Clone to avoid warning about in-place operations on expanded tensors
1121
+ position_ids[not_spatial] = data_1d
1122
+ position_ids = position_ids.permute(2, 0, 1) # pos dim first -> (3, B, L)
1123
+ cos, sin = precompute_cos_sin_3d(position_ids, self.inv_freq, self.mrope_section)
1124
+
1125
+ return cos, sin
1126
+
1127
+
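+ # Illustrative sketch (`rotary_emb`, `position_ids`, and `modality_tensor` assumed built as in
+ # IsaacModel.forward): text tokens collapse to 1D rotation while image tokens keep (t, h, w):
+ #     cos, sin = rotary_emb(position_ids, modality_tensor)  # each (batch, seq_len, head_dim)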
1128
+ class IsaacModel(Qwen3Model):
1129
+ def __init__(self, config: IsaacConfig):
1130
+ super().__init__(config)
1131
+ text_cfg = getattr(config, "get_text_config", lambda: config)()
1132
+ self.layers = torch.nn.ModuleList(
1133
+ [Qwen3DecoderLayer(text_cfg, layer_idx) for layer_idx in range(config.num_hidden_layers)]
1134
+ )
1135
+ self.rotary_emb = IsaacRotaryEmbedding(config, device=self.device)
1136
+
1137
+ vision_cfg = config.vision_config
1138
+ if vision_cfg is None:
1139
+ raise ValueError("IsaacConfig should always have vision_config")
1140
+
1141
+ hidden_dim = vision_cfg.hidden_size * (vision_cfg.pixel_shuffle_scale_factor**2)
1142
+ self.vision_embedding = nn.Sequential(
1143
+ Siglip2SequenceVisionTransformer(vision_cfg),
1144
+ nn.Linear(
1145
+ hidden_dim,
1146
+ 4 * hidden_dim,
1147
+ bias=False,
1148
+ ),
1149
+ nn.SiLU(),
1150
+ nn.Linear(4 * hidden_dim, config.hidden_size, bias=False),
1151
+ )
1152
+
1153
+ # Dispatch table for TensorStream balanced embedding (text + vision)
1154
+ self.embed_fns = {
1155
+ TextType: self.embed_text_tokens,
1156
+ VisionType: self.embed_vision,
1157
+ }
1158
+
1159
+ def embed_text_tokens(self, token_ids: torch.Tensor) -> torch.Tensor:
1160
+ """Embed text tokens, squeezing singleton dimensions."""
1161
+ # Text events are shaped as (..., 1); squeeze the singleton index dim
1162
+ h = self.embed_tokens(token_ids)
1163
+ if h.dim() >= 2 and h.size(-2) == 1:
1164
+ h = h[..., 0, :]
1165
+ return h
1166
+
1167
+ def embed_vision(self, vision_tokens: tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
1168
+ """Embed vision tokens using the vision encoder."""
1169
+ # vision_tokens is a (seq_patches, token_grids) tuple
1170
+ return self.vision_embedding(vision_tokens)
1171
+
1172
+ def embed_stream(self, tensor_stream: TensorStream) -> torch.Tensor:
1173
+ """
1174
+ Embed each modality stream independently, preserving the original TensorStream
1175
+ structure.
1176
+ """
1177
+ flat_stream = tensor_stream.flat_stream()
1178
+ per_modality_stream = group_streams(flat_stream, group_fn=lambda ev: ev.type, schedule=False)
1179
+ per_modality_compact_stream = {k: v.compact() for k, v in per_modality_stream.items()}
1180
+
1181
+ # Collect per-event grids for vision tokens (spatial H, W dims, excluding time)
1182
+ token_grids = defaultdict(list)
1183
+ for stream in tensor_stream.streams:
1184
+ for event in stream:
1185
+ token_grids[event.type].append(event.dims(virtual=False))
1186
+
1187
+ embedded_compact = {}
1188
+ for stream_type, modality_payload_tensor in per_modality_compact_stream.items():
1189
+ if stream_type.modality == VisionType:
1190
+ # Build a (N_events, 2) grid tensor with spatial dims only
1191
+ grids = token_grids.get(stream_type, [])
1192
+ if len(grids) == 0:
1193
+ input_tensor = modality_payload_tensor
1194
+ else:
1195
+ token_grids_tensor = torch.tensor(grids, dtype=torch.long, device=tensor_stream.device)[:, 1:]
1196
+ input_tensor = (modality_payload_tensor, token_grids_tensor)
1197
+ embedded_compact[stream_type] = self.embed_fns[stream_type.modality](input_tensor)
1198
+ else:
1199
+ embedded_compact[stream_type] = self.embed_fns[stream_type.modality](modality_payload_tensor)
1200
+
1201
+ # Reconstruct a TensorStream with embedded payloads and compact
1202
+ embedded_ts = reconstruct_tensor_stream_from_compact_dict(tensor_stream, embedded_compact)
1203
+ h = embedded_ts.compact() # (B, T, D)
1204
+ return h
1205
+
1206
+ def forward(
1207
+ self,
1208
+ input_ids: torch.LongTensor | None = None,
1209
+ tensor_stream: TensorStream | None = None,
1210
+ attention_mask: torch.Tensor | None = None,
1211
+ position_ids: torch.LongTensor | None = None,
1212
+ modality_tensor: torch.LongTensor | None = None,
1213
+ past_key_values: list[torch.FloatTensor] | None = None,
1214
+ inputs_embeds: torch.FloatTensor | None = None,
1215
+ use_cache: bool | None = None,
1216
+ output_hidden_states: bool | None = None,
1217
+ return_dict: bool | None = None,
1218
+ cache_position: torch.LongTensor | None = None,
1219
+ **kwargs,
1220
+ ) -> tuple | BaseModelOutputWithPast:
1221
+ """
1222
+ Forward pass with MRoPE position embeddings.
1223
+
1224
+ Computes position embeddings once and passes them through all layers.
1225
+ """
1226
+ output_hidden_states = (
1227
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1228
+ )
1229
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
1230
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1231
+
1232
+ # Get inputs
1233
+ if tensor_stream is not None and inputs_embeds is not None:
1234
+ raise ValueError("You cannot specify both tensor_stream and inputs_embeds")
1235
+ elif tensor_stream is not None:
1236
+ # Embed TensorStream directly
1237
+ inputs_embeds = self.embed_stream(tensor_stream)
1238
+ # Create modality tensor if not provided
1239
+ if modality_tensor is None:
1240
+ modality_tensor = modality_mask(tensor_stream)
1241
+ elif input_ids is not None and inputs_embeds is not None:
1242
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
1243
+ elif input_ids is not None:
1244
+ inputs_embeds = self.embed_tokens(input_ids)
1245
+ # Create text modality tensor if not provided
1246
+ if modality_tensor is None:
1247
+ batch_size, seq_length = input_ids.shape
1248
+ modality_tensor = torch.full(
1249
+ (batch_size, seq_length), TextType.text.value, device=input_ids.device, dtype=torch.long
1250
+ )
1251
+ elif inputs_embeds is None:
1252
+ raise ValueError("You have to specify either tensor_stream, input_ids or inputs_embeds")
1253
+
1254
+ # Create default position_ids if not provided
1255
+ if position_ids is None:
1256
+ if tensor_stream is not None:
1257
+ position_ids = compute_mrope_pos_tensor(tensor_stream) # (B,L,3)
1258
+ else:
1259
+ position_ids = compute_position_ids_input_ids(input_ids)
1260
+
1261
+ # Compute MRoPE position embeddings if we have custom rotary_emb
1262
+ cos, sin = self.rotary_emb(position_ids, modality_tensor)
1263
+ cos = cos.to(inputs_embeds.dtype)
1264
+ sin = sin.to(inputs_embeds.dtype)
1265
+
1266
+ # Prepare attention mask
1267
+ if attention_mask is not None:
1268
+ attention_mask = self._update_causal_mask(
1269
+ attention_mask, inputs_embeds, cache_position, past_key_values, False
1270
+ )
1271
+
1272
+ # Initialize hidden states
1273
+ hidden_states = inputs_embeds
1274
+
1275
+ for decoder_layer in self.layers:
1276
+ layer_outputs = decoder_layer(
1277
+ hidden_states,
1278
+ attention_mask=attention_mask,
1279
+ position_ids=position_ids,
1280
+ past_key_value=past_key_values,
1281
+ use_cache=use_cache,
1282
+ cache_position=cache_position,
1283
+ position_embeddings=(cos, sin),
1284
+ **kwargs,
1285
+ )
1286
+
1287
+ hidden_states = layer_outputs[0] if isinstance(layer_outputs, tuple) else layer_outputs
1288
+
1289
+ # Final layer norm
1290
+ hidden_states = self.norm(hidden_states)
1291
+
1292
+ return BaseModelOutputWithPast(
1293
+ last_hidden_state=hidden_states,
1294
+ past_key_values=past_key_values,
1295
+ )
1296
+
1297
+ def _update_causal_mask(
1298
+ self,
1299
+ attention_mask: torch.Tensor,
1300
+ input_tensor: torch.Tensor,
1301
+ cache_position: torch.Tensor,
1302
+ past_key_values: Cache,
1303
+ output_attentions: bool = False,
1304
+ ):
1305
+ if self.config._attn_implementation == "flash_attention_2":
1306
+ if attention_mask is not None and past_key_values is not None:
1307
+ is_padding_right = attention_mask[:, -1].sum().item() != input_tensor.size()[0]
1308
+ if is_padding_right:
1309
+ raise ValueError(
1310
+ "You are attempting to perform batched generation with padding_side='right'"
1311
+ " this may lead to unexpected behaviour for Flash Attention version of Qwen3. Make sure to "
1312
+ " call `tokenizer.padding_side = 'left'` before tokenizing the input. "
1313
+ )
1314
+ if attention_mask is not None and 0.0 in attention_mask:
1315
+ return attention_mask
1316
+ return None
1317
+
1318
+ # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
1319
+ # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
1320
+ # to infer the attention mask.
1321
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
1322
+ using_static_cache = isinstance(past_key_values, StaticCache)
1323
+ using_sliding_window_cache = isinstance(past_key_values, SlidingWindowCache)
1324
+
1325
+ # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
1326
+ if (
1327
+ self.config._attn_implementation == "sdpa"
1328
+ and not (using_static_cache or using_sliding_window_cache)
1329
+ and not output_attentions
1330
+ ):
1331
+ if AttentionMaskConverter._ignore_causal_mask_sdpa(
1332
+ attention_mask,
1333
+ inputs_embeds=input_tensor,
1334
+ past_key_values_length=past_seen_tokens,
1335
+ sliding_window=self.config.sliding_window,
1336
+ is_training=self.training,
1337
+ ):
1338
+ return None
1339
+
1340
+ dtype, device = input_tensor.dtype, input_tensor.device
1341
+ min_dtype = torch.finfo(dtype).min
1342
+ sequence_length = input_tensor.shape[1]
1343
+ # SlidingWindowCache or StaticCache
1344
+ if using_sliding_window_cache or using_static_cache:
1345
+ target_length = past_key_values.get_max_cache_shape()
1346
+ # DynamicCache or no cache
1347
+ else:
1348
+ target_length = (
1349
+ attention_mask.shape[-1]
1350
+ if isinstance(attention_mask, torch.Tensor)
1351
+ else past_seen_tokens + sequence_length + 1
1352
+ )
1353
+
1354
+ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
1355
+ causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
1356
+ attention_mask,
1357
+ sequence_length=sequence_length,
1358
+ target_length=target_length,
1359
+ dtype=dtype,
1360
+ device=device,
1361
+ cache_position=cache_position,
1362
+ batch_size=input_tensor.shape[0],
1363
+ config=self.config,
1364
+ past_key_values=past_key_values,
1365
+ )
1366
+
1367
+ if (
1368
+ self.config._attn_implementation == "sdpa"
1369
+ and attention_mask is not None
1370
+ and attention_mask.device.type in ["cuda", "xpu", "npu"]
1371
+ and not output_attentions
1372
+ ):
1373
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
1374
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
1375
+ # Details: https://github.com/pytorch/pytorch/issues/110213
1376
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
1377
+
1378
+ return causal_mask
1379
+
1380
+ @staticmethod
1381
+ def _prepare_4d_causal_attention_mask_with_cache_position(
1382
+ attention_mask: torch.Tensor,
1383
+ sequence_length: int,
1384
+ target_length: int,
1385
+ dtype: torch.dtype,
1386
+ device: torch.device,
1387
+ cache_position: torch.Tensor,
1388
+ batch_size: int,
1389
+ config: Qwen3Config,
1390
+ past_key_values: Cache,
1391
+ ):
1392
+ """
1393
+ Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
1394
+ `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
1395
+
1396
+ Args:
1397
+ attention_mask (`torch.Tensor`):
1398
+ A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
1399
+ sequence_length (`int`):
1400
+ The sequence length being processed.
1401
+ target_length (`int`):
1402
+ The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
1403
+ dtype (`torch.dtype`):
1404
+ The dtype to use for the 4D attention mask.
1405
+ device (`torch.device`):
1406
+ The device to place the 4D attention mask on.
1407
+ cache_position (`torch.Tensor`):
1408
+ Indices depicting the position of the input sequence tokens in the sequence.
1409
+ batch_size (`torch.Tensor`):
1410
+ Batch size.
1411
+ config (`Qwen3Config`):
1412
+ The model's configuration class
1413
+ past_key_values (`Cache`):
1414
+ The cache class that is being used currently to generate
1415
+ """
1416
+ if attention_mask is not None and attention_mask.dim() == 4:
1417
+ # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
1418
+ causal_mask = attention_mask
1419
+ else:
1420
+ min_dtype = torch.finfo(dtype).min
1421
+ causal_mask = torch.full(
1422
+ (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
1423
+ )
1424
+ diagonal_attend_mask = torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
1425
+ if config.sliding_window is not None:
1426
+ # if we have sliding window, we should not attend to tokens beyond sliding window length, so we mask them out also
1427
+ # the check is needed to verify is current checkpoint was trained with sliding window or not
1428
+ if not isinstance(past_key_values, SlidingWindowCache) or sequence_length > target_length:
1429
+ sliding_attend_mask = torch.arange(target_length, device=device) <= (
1430
+ cache_position.reshape(-1, 1) - config.sliding_window
1431
+ )
1432
+ diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
1433
+ causal_mask *= diagonal_attend_mask
1434
+ causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
1435
+ if attention_mask is not None:
1436
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
1437
+ if attention_mask.shape[-1] > target_length:
1438
+ attention_mask = attention_mask[:, :target_length]
1439
+ mask_length = attention_mask.shape[-1]
1440
+ padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
1441
+ causal_mask.device
1442
+ )
1443
+ padding_mask = padding_mask == 0
1444
+ causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
1445
+ padding_mask, min_dtype
1446
+ )
1447
+ return causal_mask
1448
+
1449
+
1450
+
1451
+ class IsaacForConditionalGeneration(Qwen3ForCausalLM, GenerationMixin):
1452
+ """Isaac multimodal model for conditional generation."""
1453
+
1454
+ config_class = IsaacConfig
1455
+
1456
+ def __init__(self, config: IsaacConfig):
1457
+ Qwen3PreTrainedModel.__init__(self, config)
1458
+ self.model = IsaacModel(config) # Use our custom model
1459
+ self.vocab_size = config.vocab_size
1460
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1461
+ # Tracks rotary position offsets computed during a full forward pass so decode steps can reuse them.
1462
+ self.rope_deltas = None
1463
+
1464
+ self.config = config
1465
+
1466
+ def get_rope_index(
1467
+ self,
1468
+ input_ids: torch.Tensor | None,
1469
+ tensor_stream: TensorStream | None,
1470
+ attention_mask: torch.Tensor | None,
1471
+ ) -> tuple[torch.Tensor, torch.Tensor]:
1472
+ """Compute MRoPE position ids from a TensorStream (or 1D fallback).
1473
+
1474
+ Returns (position_ids, rope_deltas). position_ids is (B,L,3) for MRoPE.
1475
+ rope_deltas is (B,1) used to advance positions in decode.
1476
+ """
1477
+ # tensor_stream present: compute 3D coords
1478
+ if tensor_stream is None and input_ids is None:
1479
+ raise ValueError("`tensor_stream` or `input_ids` must be provided to compute rope indices")
1480
+
1481
+ if tensor_stream is not None:
1482
+ pos_3d = compute_mrope_pos_tensor(tensor_stream) # (B,L,3)
1483
+ else:
1484
+ pos_3d = compute_position_ids_input_ids(input_ids)
1485
+ B, L, _ = pos_3d.shape
1486
+
1487
+ # Max position per batch across the 3 planes and sequence dimension: (B,)
1488
+ m_per_batch = pos_3d.amax(dim=(1, 2))
1489
+
1490
+ # Sequence lengths per batch: (B,)
1491
+ if attention_mask is None:
1492
+ seq_lens = torch.full_like(m_per_batch, L)
1493
+ else:
1494
+ seq_lens = attention_mask.eq(1).sum(dim=-1).to(dtype=m_per_batch.dtype, device=m_per_batch.device)
1495
+
1496
+ rope_deltas = (m_per_batch + 1 - seq_lens).to(dtype=pos_3d.dtype).unsqueeze(1)
1497
+ return pos_3d, rope_deltas
1498
+
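+ # Illustrative sketch (tensors assumed prepared by the processor): rope_deltas records
+ # max_position + 1 - seq_len per sample, so decode steps can offset 1D positions accordingly:
+ #     position_ids, rope_deltas = self.get_rope_index(None, tensor_stream, attention_mask)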
1499
+ def forward(
1500
+ self,
1501
+ input_ids: torch.LongTensor | None = None,
1502
+ tensor_stream: TensorStream | None = None,
1503
+ attention_mask: torch.Tensor | None = None,
1504
+ position_ids: torch.LongTensor | None = None,
1505
+ past_key_values: list[torch.FloatTensor] | None = None,
1506
+ inputs_embeds: torch.FloatTensor | None = None,
1507
+ labels: torch.LongTensor | None = None,
1508
+ use_cache: bool | None = None,
1509
+ output_hidden_states: bool | None = None,
1510
+ return_dict: bool | None = None,
1511
+ cache_position: torch.LongTensor | None = None,
1512
+ **kwargs,
1513
+ ) -> tuple | CausalLMOutputWithPast:
1514
+ """
1515
+ Forward pass for conditional generation supporting both standard inputs and TensorStream.
1516
+ Uses our embed_stream approach for multimodal inputs.
1517
+ """
1518
+
1519
+ # Don't compute embeddings here - let the model handle it
1520
+ if tensor_stream is not None:
1521
+ input_ids = None
1522
+ if input_ids is None and inputs_embeds is None and tensor_stream is None:
1523
+ raise ValueError("Either input_ids, inputs_embeds, or tensor_stream must be provided.")
1524
+
1525
+ # Build position ids (MRoPE) if needed and tensor_stream is available
1526
+ # During decode we reuse `self.rope_deltas` computed on the initial forward pass; `rope_delta` captures how far
1527
+ # cached rotary phases have progressed so we can advance `position_ids` without rebuilding the TensorStream.
1528
+ if position_ids is None and tensor_stream is not None:
1529
+ position_ids, self.rope_deltas = self.get_rope_index(input_ids, tensor_stream, attention_mask)
1530
+ elif position_ids is None and input_ids is not None:
1531
+ # For text inputs build position ids and modality tensor
1532
+ position_ids = compute_position_ids_input_ids(input_ids)
1533
+ if cache_position is not None and self.rope_deltas is not None:
1534
+ # Combine the incremental decode step (`cache_position`) with cached offsets so hidden states continue
1535
+ # rotating in lockstep across generation steps.
1536
+ rope_delta = (cache_position[0] + self.rope_deltas).to(input_ids.device)
1537
+ else:
1538
+ rope_delta = 0
1539
+ if cache_position is not None and not isinstance(rope_delta, int): # otherwise `deltas` is an int `0`
1540
+ batch_size = input_ids.shape[0]
1541
+ rope_delta = rope_delta.repeat_interleave(batch_size // rope_delta.shape[0], dim=0)
1542
+ position_ids = position_ids.add(rope_delta)
1543
+
1544
+ if tensor_stream is not None:
1545
+ modality_tensor = modality_mask(tensor_stream)
1546
+ else:
1547
+ batch_size, seq_len = input_ids.shape
1548
+ modality_tensor = torch.full((batch_size, seq_len), TextType.text.value, device=position_ids.device, dtype=torch.long)
1549
+
1550
+ outputs = self.model(
1551
+ input_ids=input_ids,
1552
+ tensor_stream=tensor_stream,
1553
+ attention_mask=attention_mask,
1554
+ position_ids=position_ids,
1555
+ modality_tensor=modality_tensor,
1556
+ past_key_values=past_key_values,
1557
+ inputs_embeds=inputs_embeds,
1558
+ use_cache=use_cache,
1559
+ output_hidden_states=output_hidden_states,
1560
+ return_dict=return_dict,
1561
+ cache_position=cache_position,
1562
+ **kwargs,
1563
+ )
1564
+
1565
+ hidden_states = outputs[0]
1566
+ logits = self.lm_head(hidden_states)
1567
+
1568
+ loss = None
1569
+ if labels is not None:
1570
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size)
1571
+
1572
+ return CausalLMOutputWithPast(
1573
+ loss=loss,
1574
+ logits=logits,
1575
+ past_key_values=outputs.past_key_values,
1576
+ hidden_states=outputs.hidden_states,
1577
+ attentions=None,
1578
+ )
1579
+
1580
+ def prepare_inputs_for_generation(
1581
+ self,
1582
+ input_ids: torch.LongTensor,
1583
+ past_key_values: list[torch.FloatTensor] | None = None,
1584
+ attention_mask: torch.Tensor | None = None,
1585
+ inputs_embeds: torch.FloatTensor | None = None,
1586
+ tensor_stream: TensorStream | None = None,
1587
+ cache_position: torch.LongTensor | None = None,
1588
+ position_ids: torch.LongTensor | None = None,
1589
+ use_cache: bool = True,
1590
+ **kwargs,
1591
+ ) -> dict[str, Any]:
1592
+ """
1593
+ Prepare inputs for generation, handling TensorStream inputs properly.
1594
+ """
1595
+ # Call parent preparation
1596
+ model_inputs = super().prepare_inputs_for_generation(
1597
+ input_ids,
1598
+ past_key_values=past_key_values,
1599
+ attention_mask=attention_mask,
1600
+ inputs_embeds=inputs_embeds,
1601
+ cache_position=cache_position,
1602
+ position_ids=position_ids,
1603
+ use_cache=use_cache,
1604
+ **kwargs,
1605
+ )
1606
+
1607
+ # Handle TensorStream for first forward pass only
1608
+ if tensor_stream is not None and (cache_position is None or cache_position[0] == 0):
1609
+ model_inputs["tensor_stream"] = tensor_stream
1610
+ # Let forward rebuild position_ids using cached deltas during decode
1611
+ model_inputs["position_ids"] = None
1612
+ # Drop tensor_stream after step 0
1613
+ if cache_position is not None and cache_position[0] != 0:
1614
+ model_inputs["tensor_stream"] = None
1615
+ return model_inputs
1616
+
1617
+ def can_generate(self) -> bool:
1618
+ return True
1619
+
1620
+
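+ # Illustrative end-to-end sketch (repo path, processor construction, and image loading are
+ # assumptions, not taken from this file):
+ #     model = IsaacForConditionalGeneration.from_pretrained(model_path, trust_remote_code=True)
+ #     batch = processor(text="<image> Describe the scene.", images=[pil_image])
+ #     output_ids = model.generate(**batch, max_new_tokens=64)
+ #     print(processor.tokenizer.decode(output_ids[0], skip_special_tokens=True))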
1621
+ __all__ = [
1622
+ "IsaacConfig",
1623
+ "IsaacModel",
1624
+ "IsaacForConditionalGeneration",
1625
+ "IsaacProcessor",
1626
+ ]