Danny Yin committed on
Commit
73b433d
·
1 Parent(s): d5add18
LICENSE ADDED
File without changes
README.md ADDED
@@ -0,0 +1,176 @@
+ ---
+ license: cc-by-nc-4.0
+ tags:
+ - AutoGaze
+ - NVILA
+ ---
+
+ ## Model Overview
+
+ ### Description: <br>
+
+ NVILA-HD-Video is a Multi-modal Large Language Model with 8B parameters that understands and answers questions about videos with up to 4K resolution and 1K frames.
+
+ Specifically, NVILA-HD-Video uses [AutoGaze](https://huggingface.co/nvidia/AutoGaze) to remove redundant patches from a video before running the ViT or LLM. Empirically, AutoGaze can reduce the number of tokens in a video by up to 100x, which reduces ViT/LLM latency by up to 19x/10x. This enables NVILA-HD-Video to scale efficiently to 4K-resolution, 1K-frame videos, with improved performance on benchmarks such as VideoMME and state-of-the-art performance on HLVid, a high-resolution long-form video benchmark also proposed in this work.
+
+ This model is for research and development only.
+
+ ### Quick Start:
+
+ Note: please first install [AutoGaze](https://github.com/NVlabs/AutoGaze).
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoProcessor
+
+ model_path = "nvidia/NVILA-8B-HD-Video"
+ video_path = "https://huggingface.co/datasets/bfshi/HLVid/resolve/main/example/clip_av_video_5_001.mp4"
+ prompt = "Question: What does the white text on the green road sign say?\n \
+ A. Hampden St\n \
+ B. Hampden Ave\n \
+ C. Hampden Blvd\n \
+ D. Hampden Rd\n \
+ Please answer directly with the letter of the correct answer."
+
+ # ----- Video processing args -----
+ num_video_frames = 128  # Total sampled frames for tiles
+ num_video_frames_thumbnail = 64  # Total sampled frames for thumbnails
+ max_tiles_video = 48  # Max spatial tiles per video (one tile is 392x392)
+
+ # ----- AutoGaze args (tiles) -----
+ gazing_ratio_tile = [0.2] + [0.06] * 15  # Per-frame max gazing ratios (single float or list)
+ task_loss_requirement_tile = 0.6
+
+ # ----- AutoGaze args (thumbnails) -----
+ gazing_ratio_thumbnail = 1  # Set to None to skip gazing on thumbnails
+ task_loss_requirement_thumbnail = None
+
+ # ----- Batching -----
+ max_batch_size_autogaze = 16
+ max_batch_size_siglip = 32
+
+ # Load processor and model
+ processor = AutoProcessor.from_pretrained(
+     model_path,
+     num_video_frames=num_video_frames,
+     num_video_frames_thumbnail=num_video_frames_thumbnail,
+     max_tiles_video=max_tiles_video,
+     gazing_ratio_tile=gazing_ratio_tile,
+     gazing_ratio_thumbnail=gazing_ratio_thumbnail,
+     task_loss_requirement_tile=task_loss_requirement_tile,
+     task_loss_requirement_thumbnail=task_loss_requirement_thumbnail,
+     max_batch_size_autogaze=max_batch_size_autogaze,
+     trust_remote_code=True,
+ )
+
+ model = AutoModel.from_pretrained(
+     model_path,
+     trust_remote_code=True,
+     device_map="auto",
+     max_batch_size_siglip=max_batch_size_siglip,
+ )
+ model.eval()
+
+ # Run inference
+ video_token = processor.tokenizer.video_token
+ inputs = processor(text=f"{video_token}\n\n{prompt}", videos=video_path, return_tensors="pt")
+ inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
+
+ outputs = model.generate(**inputs)
+ response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0].strip()
+ print(response)
+ ```
+
+ For more details, see the [VILA GitHub repo](https://github.com/NVlabs/VILA/tree/main/vila_hd/nvila_hd_video).
+
+ ### License/Terms of Use: <br>
+
+ Governing Terms: [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en). Additional Information: [Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/) for [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).
+
+ ### Deployment Geography:
+
+ Global
+
+ ### Use Case: <br>
+
+ The model is used for understanding high-resolution long-form videos.
+
+ ## Reference(s):
+
+ AutoGaze GitHub: https://github.com/NVlabs/AutoGaze <br>
+
+ ## Model Architecture:
+ **Architecture Type:** Neural Network
+
+ **Network Architecture:** Multi-modal Large Language Model
+
+ **Number of model parameters:** 8B <br>
+
+ **This model was developed based on [AutoGaze](https://huggingface.co/nvidia/AutoGaze) and [NVILA-Lite-8B](https://huggingface.co/Efficient-Large-Model/NVILA-Lite-8B)** <br>
+
+ ## Input: <br>
+ **Input Type(s):** Video and Text <br>
+ **Input Format:** Red, Green, Blue (RGB) and strings <br>
+ **Input Parameters:** Three Dimensional (3D) and One Dimensional (1D) <br>
+ **Other Properties Related to Input:** Videos with resolution up to 4K and up to 1K frames, and text input up to 20K tokens <br>
+
+ ## Output: <br>
+ **Output Type(s):** Text <br>
+ **Output Format:** Strings <br>
+ **Output Parameters:** One Dimensional (1D) <br>
+ **Other Properties Related to Output:** Text output up to 20K tokens <br>
+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
+
+ ## Software Integration:
+ **Runtime Engine(s):**
+ Not Applicable (N/A) <br>
+
+ **Supported Hardware Microarchitecture Compatibility:** <br>
+ NVIDIA Ampere <br>
+ NVIDIA Blackwell <br>
+ NVIDIA Hopper <br>
+ NVIDIA Jetson <br>
+
+ **Preferred/Supported Operating System(s):** <br>
+ Linux <br>
+ Linux 4 Tegra <br>
+ QNX <br>
+ Windows <br>
+
+ The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>
+
+ ## Model Version(s):
+ v1.0 - Initial release
+
+ ## Training Datasets: <br>
+
+ 72 datasets. See the NVILA paper for more details.
+
+ Dataset partition: Training 100% <br>
+
+ ## Training Dataset:
+
+ **Link:**
+ See the NVILA paper for more details.
+
+ **Data Collection Method by dataset:** <br>
+ [Hybrid: Automated, Human]
+
+ **Labeling Method by dataset:** <br>
+ [Hybrid: Automated, Human]
+
+ **Properties (Quantity, Dataset Descriptions, Sensor(s)):** <br>
+ 72 datasets split into 5 stages (Projector Alignment, Vision Encoder Alignment, Pre-Training, Image Instruction-Tuning, and Patch Selection Tuning) <br>
+
+
+
+
+ ## Inference:
+ **Acceleration Engine:** N/A <br>
+ **Test Hardware:** <br>
+ The model was tested on an NVIDIA A100 GPU.
+
+
+
+ ### Ethical Considerations:
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
added_tokens.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "<image>": 151649,
+   "<vila/sentinel>": 151648,
+   "<vila/video>": 151650,
+   "<|endoftext|>": 151643,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "[BOS]": 151646,
+   "[PAD]": 151647
+ }
chat_template.jinja ADDED
@@ -0,0 +1,7 @@
+ {% for message in messages %}{% if loop.first and message['role'] != 'system' %}{{ '<|im_start|>system
+ You are a helpful assistant<|im_end|>
+ ' }}{% endif %}{{ '<|im_start|>' + message['role'] + '
+ ' }}{% if message['content'] is string %}{{ message['content'] + '<|im_end|>
+ ' }}{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{{ '<image>' }}{% elif content['type'] == 'video' or 'video' in content %}{{ '<vila/video>' }}{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}{{ '<|im_end|>
+ ' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
+ ' }}{% endif %}
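Review note: the template above is dense, so here is a plain-Python sketch of the same branching, useful for checking what string a given message list should produce. The function name `render_chat` and the sample message are hypothetical and are not part of this commit; the real rendering is done by the Jinja template through the tokenizer.

```python
def render_chat(messages, add_generation_prompt=False):
    """Mirror the branching of chat_template.jinja in plain Python (illustrative only)."""
    parts = []
    for index, message in enumerate(messages):
        # A default system turn is injected when the first message is not a system message.
        if index == 0 and message["role"] != "system":
            parts.append("<|im_start|>system\nYou are a helpful assistant<|im_end|>\n")
        parts.append("<|im_start|>" + message["role"] + "\n")
        content = message["content"]
        if isinstance(content, str):
            parts.append(content + "<|im_end|>\n")
        else:
            for item in content:
                kind = item.get("type")
                if kind == "image" or "image" in item or "image_url" in item:
                    parts.append("<image>")
                elif kind == "video" or "video" in item:
                    parts.append("<vila/video>")
                elif "text" in item:
                    parts.append(item["text"])
            parts.append("<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Hypothetical message list: one video placeholder followed by a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "clip.mp4"},
            {"type": "text", "text": "What does the sign say?"},
        ],
    }
]
rendered = render_chat(messages, add_generation_prompt=True)
print(rendered)
```

Note that media placeholders and text are concatenated with no separator, so any newline between the video token and the question has to come from the text item itself.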
config.json ADDED
@@ -0,0 +1,104 @@
+ {
+   "architectures": [
+     "NVILAForConditionalGeneration"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_nvila.NVILAConfig",
+     "AutoModel": "modeling_nvila.NVILAForConditionalGeneration",
+     "AutoModelForCausalLM": "modeling_nvila.NVILAForConditionalGeneration",
+     "AutoModelForImageTextToText": "modeling_nvila.NVILAForConditionalGeneration",
+     "AutoModelForVision2Seq": "modeling_nvila.NVILAForConditionalGeneration"
+   },
+   "image_token_id": 151649,
+   "model_type": "nvila",
+   "text_config": {
+     "_attn_implementation_autoset": false,
+     "architectures": [
+       "Qwen2ForCausalLM"
+     ],
+     "attention_dropout": 0.0,
+     "bos_token_id": 151643,
+     "eos_token_id": 151645,
+     "hidden_act": "silu",
+     "hidden_size": 3584,
+     "initializer_range": 0.02,
+     "intermediate_size": 18944,
+     "layer_types": [
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention"
+     ],
+     "max_position_embeddings": 32768,
+     "max_window_layers": 28,
+     "model_max_length": 40960,
+     "model_type": "qwen2",
+     "num_attention_heads": 28,
+     "num_hidden_layers": 28,
+     "num_key_value_heads": 4,
+     "rms_norm_eps": 1e-06,
+     "rope_scaling": null,
+     "rope_theta": 1000000.0,
+     "sliding_window": null,
+     "tokenizer_model_max_length": 40960,
+     "tokenizer_padding_side": "right",
+     "torch_dtype": "bfloat16",
+     "use_cache": true,
+     "use_sliding_window": false,
+     "vocab_size": 151651
+   },
+   "max_batch_size_siglip": 128,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.55.4",
+   "video_token_id": 151650,
+   "vision_config": {
+     "_attn_implementation_autoset": false,
+     "architectures": [
+       "SiglipVisionModel"
+     ],
+     "attention_dropout": 0.0,
+     "attn_implementation": "sdpa",
+     "attn_type": "block_causal",
+     "hidden_act": "gelu_pytorch_tanh",
+     "hidden_size": 1152,
+     "image_size": 448,
+     "intermediate_size": 4304,
+     "layer_norm_eps": 1e-06,
+     "max_embed_batch_size": 16,
+     "model_type": "siglip_vision_model",
+     "num_attention_heads": 16,
+     "num_channels": 3,
+     "num_hidden_layers": 27,
+     "num_image_tokens": 256,
+     "patch_size": 14,
+     "projection_dim": 2048,
+     "projector_hidden_act": "gelu_fast",
+     "scales": "56+112+196+392",
+     "torch_dtype": "bfloat16",
+     "vision_use_head": false
+   }
+ }
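Review note: `modeling_nvila.py` in this commit derives the per-frame patch count from the `scales` and `patch_size` fields of `vision_config`. The sketch below repeats that arithmetic with the values from this config so the resulting numbers are visible. The helper name `patches_per_scale` is hypothetical.

```python
def patches_per_scale(scales: str, patch_size: int) -> list[int]:
    """Number of square patches at each scale, smallest scale first."""
    sizes = sorted(int(s) for s in scales.split("+"))
    return [(size // patch_size) ** 2 for size in sizes]


# Values copied from vision_config in config.json.
per_scale = patches_per_scale("56+112+196+392", 14)
total_per_frame = sum(per_scale)

print(per_scale)        # patches at 56, 112, 196 and 392 pixels
print(total_per_frame)  # patches per frame when every patch is kept
```

With these values the scales contribute 16, 64, 196 and 784 patches, which gives 1060 patches per frame when no gazing reduction is applied.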
configuration_nvila.py ADDED
@@ -0,0 +1,35 @@
+ import sys
+ from pathlib import Path
+ from typing import Any
+
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.models.qwen2 import Qwen2Config
+ from autogaze.vision_encoders.siglip.configuration_siglip import SiglipVisionConfig
+
+
+ class NVILAConfig(PretrainedConfig):
+     model_type = "nvila"
+     sub_configs = {
+         "text_config": Qwen2Config,
+         "vision_config": SiglipVisionConfig,
+     }
+     _auto_class = "AutoConfig"
+
+     def __init__(
+         self,
+         *,
+         text_config: dict[str, Any] | None = None,
+         vision_config: dict[str, Any] | None = None,
+         image_token_id: int | None = None,
+         video_token_id: int | None = None,
+         max_batch_size_siglip: int = 16,
+         **kwargs,
+     ):
+         self.text_config = Qwen2Config(**text_config) if text_config is not None else Qwen2Config()
+         self.vision_config = SiglipVisionConfig(**vision_config) if vision_config is not None else SiglipVisionConfig()
+
+         self.image_token_id = image_token_id if image_token_id is not None else -1
+         self.video_token_id = video_token_id if video_token_id is not None else -1
+         self.max_batch_size_siglip = max_batch_size_siglip
+
+         super().__init__(**kwargs)
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 151643,
+   "eos_token_id": 151645,
+   "transformers_version": "4.55.4"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6fcee82d90b6709f451256bbebfc2cadf7fe55731cac9878d43dd35ce9443272
+ size 5242359656
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d6650c3b5f2192619c44e63ab4b52b86062162c7106e3d5bc7c336e17d16d1a4
+ size 5321808048
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:610da6f8adea1111d364c7acb9abb46abd0123a440caccc50c9c9b6c8c30d7fe
+ size 5368631104
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2ced3a3efab23c75b5621ac2693c7c989c850a6c8aa61b9c743a43753f3fed37
+ size 241471808
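Review note: the four Git LFS pointers above give the shard sizes, which allows a rough cross-check against the "8B parameters" claim in the README. The sketch below parses a pointer and sums the sizes. It assumes that nearly all bytes are bfloat16 weights (2 bytes each, matching `torch_dtype` in config.json) and ignores safetensors header overhead, so the result is an estimate only. The helper name `parse_lfs_pointer` is hypothetical.

```python
def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer file into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:2ced3a3efab23c75b5621ac2693c7c989c850a6c8aa61b9c743a43753f3fed37\n"
    "size 241471808\n"
)

# Shard sizes in bytes, copied from the four pointers in this commit.
shard_sizes = [5242359656, 5321808048, 5368631104, int(pointer["size"])]
total_bytes = sum(shard_sizes)

# Assumption: 2 bytes per parameter (bfloat16), no header overhead.
approx_params_billions = round(total_bytes / 2 / 1e9, 2)
print(total_bytes, approx_params_billions)
```

Under that assumption the shards hold roughly 8.09 billion parameters, which is consistent with the stated model size.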
modeling_nvila.py ADDED
@@ -0,0 +1,604 @@
+ import contextlib
+ import sys
+ from pathlib import Path
+ from typing import Optional
+
+ import einops
+ import numpy as np
+ import torch
+ import torch.nn as nn
+ from torch import Tensor
+ from transformers import Qwen2ForCausalLM
+ from transformers.cache_utils import Cache
+ from transformers.generation.utils import GenerationMixin
+ from transformers.modeling_outputs import BaseModelOutputWithPooling, CausalLMOutputWithPast
+ from transformers.modeling_utils import PreTrainedModel
+ from autogaze.vision_encoders.siglip.modeling_siglip import SiglipVisionModel
+
+ from .configuration_nvila import NVILAConfig
+
+
+ MM_HIDDEN_SIZE = 1152
+
+
+ class TokenShuffle(nn.Module):
+     """Token shuffle module that groups tokens and concatenates their features."""
+     def __init__(self, shuffle_num: int):
+         super().__init__()
+         self.shuffle_num = shuffle_num
+
+     def forward(self, x: Tensor) -> Tensor:
+         """
+         Args:
+             x: (B, N, C) tensor where B is batch size, N is sequence length, C is hidden size
+         Returns:
+             (B, N', C * shuffle_num) tensor where N' = ceil(N / shuffle_num)
+         """
+         # x: (B, N, C)
+         if x.shape[1] % self.shuffle_num != 0:
+             # Pad with the last token to make sequence length divisible by shuffle_num
+             pad_size = self.shuffle_num - (x.shape[1] % self.shuffle_num)
+             x = torch.cat([x, x[:, -1:].repeat(1, pad_size, 1)], dim=1)
+         # Rearrange: (B, N, C) -> (B, N//k, k*C) where k = shuffle_num
+         return einops.rearrange(x, "b (n k) c -> b n (k c)", k=self.shuffle_num)
+
+
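Review note: to make the `TokenShuffle` behaviour concrete, here is a dependency-free sketch of the same pad-then-group logic on nested lists for a single sequence. It is not the module from this commit (which operates on batched tensors via `einops.rearrange`); the function name `token_shuffle` and the toy input are hypothetical.

```python
def token_shuffle(tokens: list[list[float]], k: int) -> list[list[float]]:
    """Group every k consecutive tokens and concatenate their feature vectors.

    If the sequence length is not a multiple of k, the last token is repeated
    as padding, matching the padding rule in TokenShuffle.forward.
    """
    remainder = len(tokens) % k
    if remainder != 0:
        tokens = tokens + [tokens[-1]] * (k - remainder)
    return [
        [value for token in tokens[start:start + k] for value in token]
        for start in range(0, len(tokens), k)
    ]


# Four tokens with two features each, grouped three at a time.
tokens = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
shuffled = token_shuffle(tokens, 3)
print(len(shuffled), len(shuffled[0]))
print(shuffled[1])
```

Here N = 4 and k = 3, so two copies of the last token are appended and the output has ceil(4 / 3) = 2 tokens of 3 * 2 = 6 features each. The second output token consists entirely of copies of the last input token.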
+ class NVILAMultiModalProjector(nn.Module):
+     """Multi-modal projector using mlp_shuffle_9 architecture."""
+     def __init__(self, config: NVILAConfig):
+         super().__init__()
+
+         self.layers = nn.Sequential(
+             TokenShuffle(9),
+             nn.LayerNorm(MM_HIDDEN_SIZE * 9),
+             nn.Linear(MM_HIDDEN_SIZE * 9, MM_HIDDEN_SIZE * 3),
+             nn.GELU(),
+             nn.LayerNorm(MM_HIDDEN_SIZE * 3),
+             nn.Linear(MM_HIDDEN_SIZE * 3, config.text_config.hidden_size),
+             nn.GELU(),
+             nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size),
+         )
+
+     def forward(self, x: Tensor) -> Tensor:
+         return self.layers(x)
+
+
+ class NVILAForConditionalGeneration(PreTrainedModel, GenerationMixin):
+     config_class = NVILAConfig
+     base_model_prefix: str = "llm"
+     _auto_class = "AutoModel"
+     _supports_flash_attn_2 = True
+     _supports_sdpa = True
+
+     def __init__(self, config: NVILAConfig):
+         super().__init__(config)
+
+         self.config: NVILAConfig
+
+         @contextlib.contextmanager
+         def default_torch_dtype(dtype):
+             original_dtype = torch.get_default_dtype()
+             torch.set_default_dtype(dtype)
+             try:
+                 yield
+             finally:
+                 torch.set_default_dtype(original_dtype)
+
+         with default_torch_dtype(config.torch_dtype):
+             self.vision_tower = SiglipVisionModel(config.vision_config)
+             self.mm_projector = NVILAMultiModalProjector(config)
+             self.llm = Qwen2ForCausalLM(config.text_config)
+
+         self.post_init()
+
+     def forward(
+         self,
+         *,
+         input_ids: Tensor | None = None,
+         inputs_embeds: Tensor | None = None,
+         pixel_values: Tensor | None = None,
+         pixel_values_images_tiles: list[Tensor] | None = None,
+         pixel_values_images_thumbnails: list[Tensor] | None = None,
+         num_spatial_tiles_each_image: list[int] | None = None,
+         pixel_values_videos_tiles: list[Tensor] | None = None,
+         pixel_values_videos_thumbnails: list[Tensor] | None = None,
+         gazing_info: dict | None = None,
+         num_spatial_tiles_each_video: list[int] | None = None,
+         **kwargs,
+     ) -> CausalLMOutputWithPast:
+         assert (input_ids is None) != (
+             inputs_embeds is None
+         ), "Exactly one of `input_ids` or `inputs_embeds` must be specified."
+
+         # Pop processor-only fields that the LLM should not see
+         kwargs.pop("pixel_values_videos_tiles_autogaze", None)
+         kwargs.pop("pixel_values_videos_thumbnails_autogaze", None)
+         kwargs.pop("pixel_values_videos", None)
+
+         if input_ids is not None and torch.any(
+             torch.isin(
+                 input_ids,
+                 torch.tensor(
+                     [self.config.image_token_id, self.config.video_token_id],
+                     device=input_ids.device,
+                 ),
+             ).any()
+         ):  # Prefill
+             # Extract fields from kwargs if not passed as explicit args
+             if gazing_info is None:
+                 gazing_info = kwargs.pop("gazing_info", None)
+             if pixel_values_images_tiles is None:
+                 pixel_values_images_tiles = kwargs.pop("pixel_values_images_tiles", None)
+             if pixel_values_images_thumbnails is None:
+                 pixel_values_images_thumbnails = kwargs.pop("pixel_values_images_thumbnails", None)
+             if num_spatial_tiles_each_image is None:
+                 num_spatial_tiles_each_image = kwargs.pop("num_spatial_tiles_each_image", None)
+             if pixel_values_videos_tiles is None:
+                 pixel_values_videos_tiles = kwargs.pop("pixel_values_videos_tiles", None)
+             if pixel_values_videos_thumbnails is None:
+                 pixel_values_videos_thumbnails = kwargs.pop("pixel_values_videos_thumbnails", None)
+             if num_spatial_tiles_each_video is None:
+                 num_spatial_tiles_each_video = kwargs.pop("num_spatial_tiles_each_video", None)
+
+             inputs_embeds = self._embed(
+                 input_ids=input_ids,
+                 pixel_values=pixel_values,
+                 pixel_values_images_tiles=pixel_values_images_tiles,
+                 pixel_values_images_thumbnails=pixel_values_images_thumbnails,
+                 num_spatial_tiles_each_image=num_spatial_tiles_each_image,
+                 pixel_values_videos_tiles=pixel_values_videos_tiles,
+                 pixel_values_videos_thumbnails=pixel_values_videos_thumbnails,
+                 gazing_info=gazing_info,
+                 num_spatial_tiles_each_video=num_spatial_tiles_each_video,
+             )
+             input_ids = None
+
+         outputs = self.llm(
+             input_ids=input_ids,
+             inputs_embeds=inputs_embeds,
+             **kwargs,
+         )
+
+         return outputs
+
+     def _embed(
+         self,
+         *,
+         input_ids: Tensor,
+         pixel_values: Tensor | None,
+         pixel_values_images_tiles: list[Tensor] | None,
+         pixel_values_images_thumbnails: list[Tensor] | None,
+         num_spatial_tiles_each_image: list[int] | None,
+         pixel_values_videos_tiles: list[Tensor] | None,
+         pixel_values_videos_thumbnails: list[Tensor] | None,
+         gazing_info: dict | None = None,
+         num_spatial_tiles_each_video: list[int] | None = None,
+     ) -> Tensor:
+         inputs_embeds: Tensor = self.llm.model.embed_tokens(input_ids)
+
+         # Handle images
+         if pixel_values_images_tiles is not None and len(pixel_values_images_tiles) > 0:
+             per_image_features = self._encode_images(
+                 pixel_values_images_tiles=pixel_values_images_tiles,
+                 pixel_values_images_thumbnails=pixel_values_images_thumbnails,
+                 num_spatial_tiles_each_image=num_spatial_tiles_each_image,
+             )
+             all_features = torch.cat(per_image_features, dim=0)
+
+             image_token_mask = input_ids == self.config.image_token_id
+             num_image_tokens = image_token_mask.sum().item()
+             num_image_features = all_features.shape[0]
+
+             assert num_image_features == num_image_tokens, (
+                 f"Number of image features {num_image_features} does not match "
+                 f"number of image tokens {num_image_tokens}"
+             )
+
+             inputs_embeds[image_token_mask] = all_features.to(inputs_embeds.dtype)
+
+         # Handle videos
+         if pixel_values_videos_tiles is not None:
+             per_video_features = self._encode_vision(
+                 pixel_values_videos_tiles=pixel_values_videos_tiles,
+                 pixel_values_videos_thumbnails=pixel_values_videos_thumbnails,
+                 gazing_info=gazing_info,
+                 num_spatial_tiles_each_video=num_spatial_tiles_each_video,
+             )
+             # per_video_features: list of (num_tokens_i, llm_hidden) tensors
+             all_features = torch.cat(per_video_features, dim=0)
+
+             # Match vision features to video tokens
+             video_token_mask = input_ids == self.config.video_token_id
+             num_video_tokens = video_token_mask.sum().item()
+             num_vision_features = all_features.shape[0]
+
+             assert num_vision_features == num_video_tokens, (
+                 f"Number of vision features {num_vision_features} does not match "
+                 f"number of video tokens {num_video_tokens}"
+             )
+
+             inputs_embeds[video_token_mask] = all_features.to(inputs_embeds.dtype)
+
+         return inputs_embeds
+
+     def _make_default_gazing_info(
+         self,
+         total_items: int,
+         T: int,
+         device: torch.device,
+     ) -> dict:
+         """Create gazing_info that gazes at every patch (no reduction).
+
+         Args:
+             total_items: Number of items (tiles or thumbnails) in the batch.
+             T: Temporal frames per item.
+             device: Target torch device.
+
+         Returns:
+             gazing_info dict with ``gazing_pos``, ``num_gazing_each_frame``,
+             ``if_padded_gazing``.
+         """
+         image_size = self.vision_tower.config.image_size
+         patch_size = self.vision_tower.config.patch_size
+         scales = sorted(
+             int(s) for s in self.vision_tower.config.scales.split("+")
+         )
+         num_patches_each_scale = [(s // patch_size) ** 2 for s in scales]
+         total_patches_per_frame = sum(num_patches_each_scale)
+
+         # Gazing positions: all patches for every frame
+         per_item_pos = []
+         for t in range(T):
+             start = t * total_patches_per_frame
+             per_item_pos.append(
+                 torch.arange(start, start + total_patches_per_frame, device=device, dtype=torch.long)
+             )
+         per_item_pos = torch.cat(per_item_pos)  # (T * total_patches_per_frame,)
+
+         gazing_pos = per_item_pos.unsqueeze(0).expand(total_items, -1)  # (B, N)
+         num_gazing_each_frame = torch.full(
+             (T,), total_patches_per_frame, device=device, dtype=torch.long
+         )
+         if_padded_gazing = torch.zeros_like(gazing_pos, dtype=torch.bool)
+
+         return {
+             "gazing_pos": gazing_pos,
+             "num_gazing_each_frame": num_gazing_each_frame,
+             "if_padded_gazing": if_padded_gazing,
+         }
+
+     def _encode_images(
+         self,
+         pixel_values_images_tiles: list[Tensor],
+         pixel_values_images_thumbnails: list[Tensor] | None,
+         num_spatial_tiles_each_image: list[int],
+     ) -> list[Tensor]:
+         """Encode image tiles + thumbnails and return projected features per image.
+
+         Each image is a set of spatial tiles plus one thumbnail (T=1 each).
+         All patches are kept (no gazing reduction). For each image the
+         spatial tiles are merged into one effective frame, the thumbnail
+         forms a second effective frame, and both are padded to
+         ``shuffle_num`` before projection through the mm_projector.
+
+         Args:
+             pixel_values_images_tiles: Per-image tile tensors, each
+                 ``(num_tiles_i, 1, C, H, W)``.
+             pixel_values_images_thumbnails: Per-image thumbnail tensors,
+                 each ``(1, 1, C, H, W)``. May be ``None``.
+             num_spatial_tiles_each_image: Number of spatial tiles per image.
+
+         Returns:
+             List of tensors (one per image), each ``(num_tokens_i, llm_hidden)``.
+         """
+         shuffle_num = 9
+         device = self.vision_tower.device
+
+         # --- Run vision tower on all tiles ---
+         all_tiles = torch.cat(pixel_values_images_tiles, dim=0)  # (total_tiles, 1, C, H, W)
+         total_tiles = all_tiles.shape[0]
+
+         gi_tiles = self._make_default_gazing_info(total_tiles, 1, device)
+         tiles_features = self._run_vision_tower_batched(all_tiles, gi_tiles)  # (total_tiles, N, H)
+
+         num_gaze_tiles = gi_tiles["num_gazing_each_frame"]  # (1,)
+         if_padded_tiles = gi_tiles["if_padded_gazing"]  # (total_tiles, N)
+         frame_lens_tiles = num_gaze_tiles.tolist()
+
+         tile_feats: list[Tensor] = []
+         for idx in range(total_tiles):
+             feats = tiles_features[idx]
+             pad_mask = if_padded_tiles[idx]
+             frame_feats = feats.split(frame_lens_tiles, dim=0)
+             frame_pads = pad_mask.split(frame_lens_tiles, dim=0)
+             tile_feats.append(
+                 torch.cat([f[~p] for f, p in zip(frame_feats, frame_pads)], dim=0)
+             )
+
+         # --- Run vision tower on all thumbnails ---
+         thumb_feats: list[Tensor] | None = None
+         if pixel_values_images_thumbnails is not None and len(pixel_values_images_thumbnails) > 0:
+             all_thumbs = torch.cat(pixel_values_images_thumbnails, dim=0)  # (num_images, 1, C, H, W)
+             total_thumbs = all_thumbs.shape[0]
+
+             gi_thumbs = self._make_default_gazing_info(total_thumbs, 1, device)
+             thumbs_features = self._run_vision_tower_batched(all_thumbs, gi_thumbs)
+
+             num_gaze_thumbs = gi_thumbs["num_gazing_each_frame"]
+             if_padded_thumbs = gi_thumbs["if_padded_gazing"]
+             frame_lens_thumbs = num_gaze_thumbs.tolist()
+
+             thumb_feats = []
+             for idx in range(total_thumbs):
+                 feats = thumbs_features[idx]
+                 pad_mask = if_padded_thumbs[idx]
+                 frame_feats = feats.split(frame_lens_thumbs, dim=0)
+                 frame_pads = pad_mask.split(frame_lens_thumbs, dim=0)
+                 thumb_feats.append(
+                     torch.cat([f[~p] for f, p in zip(frame_feats, frame_pads)], dim=0)
+                 )
+
+         # --- Build per-image sequences ---
+         tile_offset = 0
+         per_image_sequences: list[Tensor] = []
+         per_image_token_counts: list[int] = []
+
+         for img_idx, ns in enumerate(num_spatial_tiles_each_image):
+             effective_frames: list[Tensor] = []
+
+             # Tiles effective frame: merge all spatial tiles
+             spatial_feats = tile_feats[tile_offset : tile_offset + ns]
+             tile_offset += ns
+             effective_frames.append(torch.cat(spatial_feats, dim=0))
+
+             # Thumbnail effective frame
+             if thumb_feats is not None:
+                 effective_frames.append(thumb_feats[img_idx])
+
+             # Pad each effective frame to divisible by shuffle_num
+             padded_frames: list[Tensor] = []
+             for frame in effective_frames:
+                 n = frame.shape[0]
+                 pad = (shuffle_num - n % shuffle_num) % shuffle_num
+                 if pad > 0:
+                     frame = torch.cat([frame, frame[-1:].expand(pad, -1)], dim=0)
+                 padded_frames.append(frame)
+
+             image_seq = torch.cat(padded_frames, dim=0)
+             per_image_sequences.append(image_seq)
+             per_image_token_counts.append(image_seq.shape[0] // shuffle_num)
+
+         all_features = torch.cat(per_image_sequences, dim=0).unsqueeze(0)
+         projected = self.mm_projector(
+             all_features.to(device=self.device, dtype=self.dtype)
+         )
+         projected = projected.squeeze(0)
+
+         return list(projected.split(per_image_token_counts, dim=0))
+
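Review note: `_embed` asserts that the number of projected features equals the number of `<image>` placeholder tokens, so it helps to know how many tokens `_encode_images` yields per image. The sketch below repeats the padding arithmetic from the loop above. It assumes the vision tower returns one feature per patch, i.e. 1060 features per tile or thumbnail with the `scales` and `patch_size` values in config.json; that figure is an inference from the default gazing info, not something stated in the commit. The helper names are hypothetical.

```python
def padded_length(n: int, k: int = 9) -> int:
    """Round n up to the next multiple of k, as done for each effective frame."""
    return n + (k - n % k) % k


def image_token_count(
    num_tiles: int,
    features_per_tile: int = 1060,  # assumed: 16 + 64 + 196 + 784 patches
    has_thumbnail: bool = True,
    k: int = 9,
) -> int:
    """Tokens produced for one image after padding and 9-to-1 token shuffle."""
    frame_lengths = [num_tiles * features_per_tile]  # all tiles merged into one frame
    if has_thumbnail:
        frame_lengths.append(features_per_tile)      # thumbnail is its own frame
    return sum(padded_length(n, k) for n in frame_lengths) // k


print(image_token_count(1))
print(image_token_count(4))
```

Under these assumptions a four-tile image contributes 472 tokens for the merged tiles and 118 for the thumbnail, so the processor would need to emit 590 `<image>` placeholders for it.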
379
+ def _run_vision_tower_batched(
380
+ self,
381
+ all_pixels: Tensor,
382
+ gazing_info_batch: dict,
383
+ ) -> Tensor:
384
+ """Run the vision tower in minibatches and concatenate features.
385
+
386
+ Args:
387
+ all_pixels: ``(B, T, C, H, W)`` tensor.
388
+ gazing_info_batch: Dict with ``gazing_pos`` ``(B, N)``,
389
+ ``if_padded_gazing`` ``(B, N)``, and
390
+ ``num_gazing_each_frame`` ``(T,)`` (shared across batch).
391
+
392
+ Returns:
393
+ ``(B, N, H)`` hidden features from the second-to-last layer.
394
+ """
395
+ device = self.vision_tower.device
396
+ dtype = self.vision_tower.dtype
397
+ total = all_pixels.shape[0]
398
+ bs = self.config.max_batch_size_siglip
399
+
400
+ if total <= bs:
401
+ out: BaseModelOutputWithPooling = self.vision_tower(
402
+ all_pixels.to(device=device, dtype=dtype),
403
+ gazing_info=gazing_info_batch,
404
+ output_hidden_states=True,
405
+ )
406
+ assert out.hidden_states is not None
407
+ return out.hidden_states[-2]
408
+
409
+ num_gaze_shared = gazing_info_batch["num_gazing_each_frame"]
410
+ all_pos = gazing_info_batch["gazing_pos"]
411
+ all_pad = gazing_info_batch["if_padded_gazing"]
412
+
413
+ feature_chunks: list[Tensor] = []
414
+ for start in range(0, total, bs):
415
+ end = min(start + bs, total)
416
+ mini_gi = {
417
+ "gazing_pos": all_pos[start:end],
418
+ "if_padded_gazing": all_pad[start:end],
419
+ "num_gazing_each_frame": num_gaze_shared,
420
+ }
421
+ out = self.vision_tower(
422
+ all_pixels[start:end].to(device=device, dtype=dtype),
423
+ gazing_info=mini_gi,
424
+ output_hidden_states=True,
425
+ )
426
+ assert out.hidden_states is not None
427
+ feature_chunks.append(out.hidden_states[-2])
428
+
429
+ return torch.cat(feature_chunks, dim=0)
430
+
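The minibatching pattern in `_run_vision_tower_batched` can be sketched in isolation. This is an illustrative example only: `run_in_minibatches` and `fake_tower` are hypothetical names, and plain lists stand in for the pixel and gazing tensors. The point it shows is that per-item data is sliced together with the inputs while shared data (like `num_gazing_each_frame`) is passed unchanged to every chunk.

```python
from typing import Callable


def run_in_minibatches(
    items: list,
    per_item_info: list,
    shared_info,
    fn: Callable,
    batch_size: int,
) -> list:
    """Apply `fn` chunk by chunk and concatenate the per-item outputs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(items) != len(per_item_info):
        raise ValueError("items and per_item_info must have equal length")
    outputs: list = []
    for start in range(0, len(items), batch_size):
        end = min(start + batch_size, len(items))
        # Slice per-item data consistently; shared data is reused as-is.
        outputs.extend(fn(items[start:end], per_item_info[start:end], shared_info))
    return outputs


def fake_tower(batch, info, shared):
    # Stand-in for the vision tower: one output per input item.
    return [(x, i, shared) for x, i in zip(batch, info)]


result = run_in_minibatches([10, 20, 30, 40, 50], list("abcde"), "T", fake_tower, 2)
print(len(result))  # 5
```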
431
+ def _encode_vision(
432
+ self,
433
+ pixel_values_videos_tiles: list[Tensor],
434
+ pixel_values_videos_thumbnails: list[Tensor],
435
+ gazing_info: dict | None,
436
+ num_spatial_tiles_each_video: list[int],
437
+ ) -> list[Tensor]:
438
+ """Encode tiles and thumbnails and return projected features per video.
439
+
440
+ Workflow
441
+ --------
442
+ 1. Batch all tiles / thumbnails across videos and run the vision tower
443
+ (in minibatches controlled by ``config.max_batch_size_siglip``).
444
+ 2. Remove padded gazing features.
445
+ 3. Re-order per video: for each global temporal frame gather all spatial
446
+ tiles, then append thumbnail frames.
447
+ 4. Pad each effective frame to be divisible by ``shuffle_num`` (9).
448
+ 5. Concatenate all videos into a single sequence (batch=1), project
449
+ through ``mm_projector``, then split back per video.
450
+
451
+ Args:
452
+ pixel_values_videos_tiles: Per-video tile tensors, each
453
+ ``(num_tiles_i, T_tile, C, H, W)``.
454
+ pixel_values_videos_thumbnails: Per-video thumbnail tensors, each
455
+ ``(T_thumb_i, 1, C, H, W)``.
456
+ gazing_info: Dict produced by the processor containing per-video
457
+ gazing data for tiles and thumbnails. ``None`` triggers
458
+ default "gaze at all patches" behaviour.
459
+ num_spatial_tiles_each_video: Number of spatial tiles per video.
460
+
461
+ Returns:
462
+ List of tensors (one per video), each ``(num_tokens_i, llm_hidden)``.
463
+ """
464
+ shuffle_num = 9 # must match TokenShuffle in NVILAMultiModalProjector
465
+ device = self.vision_tower.device
466
+ dtype = self.vision_tower.dtype
467
+
468
+ num_videos = len(pixel_values_videos_tiles)
469
+ num_tiles_per_video = [t.shape[0] for t in pixel_values_videos_tiles]
470
+ num_thumbs_per_video = [t.shape[0] for t in pixel_values_videos_thumbnails]
471
+
472
+ # ---- 1. Batch & run vision tower on tiles ----
473
+ all_tiles = torch.cat(pixel_values_videos_tiles, dim=0) # (total_tiles, T_tile, C, H, W)
474
+ T_tile = all_tiles.shape[1]
475
+
476
+ if gazing_info is not None:
477
+ tiles_nge = gazing_info["num_gazing_each_frame_tiles"]
478
+ ref = tiles_nge[0][0]
479
+ assert all(
480
+ torch.equal(t[0], ref) for t in tiles_nge
481
+ ), "num_gazing_each_frame must be identical across all videos for tiles"
482
+ tiles_gi = {
483
+ "gazing_pos": torch.cat(gazing_info["gazing_pos_tiles"], dim=0).to(device),
484
+ "num_gazing_each_frame": gazing_info["num_gazing_each_frame_tiles"][0][0].to(device),
485
+ "if_padded_gazing": torch.cat(gazing_info["if_padded_gazing_tiles"], dim=0).to(device),
486
+ }
487
+ else:
488
+ tiles_gi = self._make_default_gazing_info(all_tiles.shape[0], T_tile, device)
489
+
490
+ tiles_features = self._run_vision_tower_batched(all_tiles, tiles_gi) # (total_tiles, N, H)
491
+
492
+ # ---- 2. Batch & run vision tower on thumbnails ----
493
+ all_thumbs = torch.cat(pixel_values_videos_thumbnails, dim=0) # (total_thumbs, 1, C, H, W)
494
+
495
+ if gazing_info is not None:
496
+ thumbs_nge = gazing_info["num_gazing_each_frame_thumbnails"]
497
+ ref = thumbs_nge[0][0]
498
+ assert all(
499
+ torch.equal(t[0], ref) for t in thumbs_nge
500
+ ), "num_gazing_each_frame must be identical across all videos for thumbnails"
501
+ thumbs_gi = {
502
+ "gazing_pos": torch.cat(gazing_info["gazing_pos_thumbnails"], dim=0).to(device),
503
+ "num_gazing_each_frame": gazing_info["num_gazing_each_frame_thumbnails"][0][0].to(device),
504
+ "if_padded_gazing": torch.cat(gazing_info["if_padded_gazing_thumbnails"], dim=0).to(device),
505
+ }
506
+ else:
507
+ thumbs_gi = self._make_default_gazing_info(all_thumbs.shape[0], 1, device)
508
+
509
+ thumbs_features = self._run_vision_tower_batched(all_thumbs, thumbs_gi) # (total_thumbs, N', H)
510
+
511
+ # ---- 3. Remove padded features & split by frame ----
512
+ # For each tile: list of T_tile tensors, each (n_i, hidden)
513
+ all_tiles_if_padded = tiles_gi["if_padded_gazing"]
514
+ all_tiles_num_gaze = tiles_gi["num_gazing_each_frame"] # 1-D (T_tile,)
515
+ tiles_frame_lens = all_tiles_num_gaze.tolist()
516
+
517
+ all_tiles_frame_feats: list[list[Tensor]] = []
518
+ for idx in range(tiles_features.shape[0]):
519
+ feats = tiles_features[idx] # (N, hidden)
520
+ pad_mask = all_tiles_if_padded[idx] # (N,)
521
+ frame_feats = feats.split(tiles_frame_lens, dim=0)
522
+ frame_pads = pad_mask.split(tiles_frame_lens, dim=0)
523
+ all_tiles_frame_feats.append(
524
+ [f[~p] for f, p in zip(frame_feats, frame_pads)]
525
+ )
526
+
527
+ # For each thumbnail: list with 1 tensor (n_i, hidden)
528
+ all_thumbs_if_padded = thumbs_gi["if_padded_gazing"]
529
+ all_thumbs_num_gaze = thumbs_gi["num_gazing_each_frame"] # 1-D (1,)
530
+ thumbs_frame_lens = all_thumbs_num_gaze.tolist()
531
+
532
+ all_thumbs_frame_feats: list[list[Tensor]] = []
533
+ for idx in range(thumbs_features.shape[0]):
534
+ feats = thumbs_features[idx]
535
+ pad_mask = all_thumbs_if_padded[idx]
536
+ frame_feats = feats.split(thumbs_frame_lens, dim=0)
537
+ frame_pads = pad_mask.split(thumbs_frame_lens, dim=0)
538
+ all_thumbs_frame_feats.append(
539
+ [f[~p] for f, p in zip(frame_feats, frame_pads)]
540
+ )
541
+
542
+ # ---- 4. Per-video: reorder, pad frames, build sequences ----
543
+ tile_offset = 0
544
+ thumb_offset = 0
545
+ per_video_sequences: list[Tensor] = []
546
+ per_video_token_counts: list[int] = []
547
+
548
+ for vid_idx in range(num_videos):
549
+ ns = num_spatial_tiles_each_video[vid_idx]
550
+ nt = num_tiles_per_video[vid_idx]
551
+ tc = nt // ns # temporal chunks
552
+ total_frames = tc * T_tile
553
+ n_thumbs = num_thumbs_per_video[vid_idx]
554
+
555
+ vid_tile_feats = all_tiles_frame_feats[tile_offset: tile_offset + nt]
556
+ tile_offset += nt
557
+ vid_thumb_feats = all_thumbs_frame_feats[thumb_offset: thumb_offset + n_thumbs]
558
+ thumb_offset += n_thumbs
559
+
560
+ # -- Reorder tile features to frame-first --
561
+ # Tiles from processor are ordered:
562
+ # chunk0: [S0, S1, ..., S_{ns-1}], chunk1: [S0, ...], ...
563
+ # We want: for each global frame g, cat all spatial tiles.
564
+ effective_frames: list[Tensor] = []
565
+ for g in range(total_frames):
566
+ chunk = g // T_tile
567
+ f_in_chunk = g % T_tile
568
+ spatial_feats = [
569
+ vid_tile_feats[chunk * ns + s][f_in_chunk]
570
+ for s in range(ns)
571
+ ]
572
+ effective_frames.append(torch.cat(spatial_feats, dim=0))
573
+
574
+ # -- Append thumbnail frames --
575
+ for thumb in vid_thumb_feats:
576
+ effective_frames.append(thumb[0]) # single frame
577
+
578
+ # -- Pad each effective frame to divisible by shuffle_num --
579
+ padded_frames: list[Tensor] = []
580
+ for frame in effective_frames:
581
+ n = frame.shape[0]
582
+ pad = (shuffle_num - n % shuffle_num) % shuffle_num
583
+ if pad > 0:
584
+ padded_frame = torch.cat(
585
+ [frame, frame[-1:].expand(pad, -1)], dim=0
586
+ )
587
+ else:
588
+ padded_frame = frame
589
+ padded_frames.append(padded_frame)
590
+
591
+ video_seq = torch.cat(padded_frames, dim=0) # (total_padded, hidden)
592
+ per_video_sequences.append(video_seq)
593
+ per_video_token_counts.append(video_seq.shape[0] // shuffle_num)
594
+
595
+ # ---- 5. Concat all videos, project, split back ----
596
+ all_features = torch.cat(per_video_sequences, dim=0).unsqueeze(0) # (1, total, hidden)
597
+ projected = self.mm_projector(
598
+ all_features.to(device=self.device, dtype=self.dtype)
599
+ ) # (1, total // shuffle_num, llm_hidden)
600
+ projected = projected.squeeze(0) # (total // shuffle_num, llm_hidden)
601
+
602
+ per_video_features = list(projected.split(per_video_token_counts, dim=0))
603
+
604
+ return per_video_features
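The chunk-first to frame-first reordering in step 4 of `_encode_vision` is easy to get wrong, so here is a small sketch of the index arithmetic. It assumes string labels in place of feature tensors, and `frame_first_order` is a hypothetical helper written for this illustration.

```python
def frame_first_order(tile_feats: list, num_spatial: int, frames_per_tile: int) -> list:
    """Merge chunk-first tiles into one token list per global frame.

    tile_feats[tile][frame] is a list of tokens; tiles are ordered
    chunk0: [S0..S_{ns-1}], chunk1: [S0..S_{ns-1}], ...
    """
    if num_spatial <= 0 or len(tile_feats) % num_spatial != 0:
        raise ValueError("tile count must be a multiple of num_spatial")
    num_chunks = len(tile_feats) // num_spatial
    frames = []
    for g in range(num_chunks * frames_per_tile):
        chunk, f_in_chunk = divmod(g, frames_per_tile)
        merged = []
        for s in range(num_spatial):
            merged.extend(tile_feats[chunk * num_spatial + s][f_in_chunk])
        frames.append(merged)
    return frames


# 2 temporal chunks x 2 spatial tiles, 2 frames per tile, 1 token per frame.
tiles = [
    [[f"c{c}s{s}f{f}"] for f in range(2)]
    for c in range(2)
    for s in range(2)
]
for frame in frame_first_order(tiles, num_spatial=2, frames_per_tile=2):
    print(frame)
```

Each printed row holds all spatial tiles of one global frame, which is the unit that later gets padded to a multiple of `shuffle_num`.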
preprocessor_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_nvila.NVILAProcessor"
4
+ },
5
+ "do_convert_rgb": null,
6
+ "do_normalize": true,
7
+ "do_rescale": true,
8
+ "do_resize": true,
9
+ "image_mean": [
10
+ 0.5,
11
+ 0.5,
12
+ 0.5
13
+ ],
14
+ "image_processor_type": "SiglipImageProcessor",
15
+ "image_std": [
16
+ 0.5,
17
+ 0.5,
18
+ 0.5
19
+ ],
20
+ "processor_class": "NVILAProcessor",
21
+ "resample": 3,
22
+ "rescale_factor": 0.00392156862745098,
23
+ "size": {
24
+ "height": 392,
25
+ "width": 392
26
+ },
27
+ "autogaze_model_id": "bfshi/AutoGaze",
28
+ "gazing_ratio_tile": 0.75,
29
+ "gazing_ratio_thumbnail": 0.75,
30
+ "task_loss_requirement_tile": 0.7,
31
+ "task_loss_requirement_thumbnail": 0.7,
32
+ "target_scales": [56, 112, 196, 392],
33
+ "target_patch_size": 16,
34
+ "num_video_frames": 8,
35
+ "max_tiles_video": 8,
36
+ "num_video_frames_thumbnail": 8,
37
+ "mm_projector_shuffle_num": 9,
38
+ "max_batch_size_autogaze": 32
39
+ }
processing_nvila.py ADDED
@@ -0,0 +1,1092 @@
1
+ import glob
2
+ import os
3
+ import re
4
+ import tempfile
5
+ import urllib.request
6
+ from os import PathLike
7
+ from typing import cast, Optional
8
+ from urllib.parse import urlparse
9
+
10
+ import cv2
11
+ import numpy as np
12
+ import torch
13
+ import transformers.image_transforms as image_transforms
14
+ import transformers.image_utils as image_utils
15
+ import transformers.video_utils as video_utils
16
+ from PIL import Image
17
+ from transformers.feature_extraction_utils import BatchFeature
18
+ from transformers.image_utils import ImageInput
19
+ from transformers.models.qwen2 import Qwen2Tokenizer, Qwen2TokenizerFast
20
+ from transformers.models.siglip import SiglipImageProcessor, SiglipImageProcessorFast
21
+ from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack, VideosKwargs
22
+ from transformers.tokenization_utils_base import BatchEncoding, TextInput
23
+ from transformers.video_utils import VideoInput, VideoMetadata
24
+
25
+ from autogaze.models.autogaze import AutoGaze
26
+ from autogaze.models.autogaze import AutoGazeImageProcessor
27
+ from autogaze.datasets.video_utils import transform_video_for_pytorch
28
+
29
+
30
+ def _find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
31
+ """Find the closest aspect ratio from a set of target ratios.
32
+
33
+ Referenced from https://github.com/OpenGVLab/InternVL and llava/mm_utils.py
34
+ """
35
+ best_ratio_diff = float("inf")
36
+ best_ratio = (1, 1)
37
+ area = width * height
38
+ for ratio in target_ratios:
39
+ target_aspect_ratio = ratio[0] / ratio[1]
40
+ ratio_diff = abs(aspect_ratio - target_aspect_ratio)
41
+ if ratio_diff < best_ratio_diff:
42
+ best_ratio_diff = ratio_diff
43
+ best_ratio = ratio
44
+ elif ratio_diff == best_ratio_diff:
45
+ if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
46
+ best_ratio = ratio
47
+ return best_ratio
48
+
49
+
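To see how `_find_closest_aspect_ratio` picks a tile grid, the sketch below re-implements the same selection rule in a self-contained form. `choose_grid` is a hypothetical wrapper, and the deterministic tie-break in the sort key is an assumption added here for reproducibility (the diff sorts candidates by tile count only).

```python
def choose_grid(width: int, height: int, max_tiles: int, tile_size: int) -> tuple:
    """Return (cols, rows) whose aspect ratio is closest to width / height."""
    aspect = width / height
    candidates = sorted(
        {
            (c, r)
            for c in range(1, max_tiles + 1)
            for r in range(1, max_tiles + 1)
            if c * r <= max_tiles
        },
        key=lambda cr: (cr[0] * cr[1], cr),  # fewest tiles first
    )
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidates:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff:
            # On a tie, prefer the larger grid only if the image is big enough.
            if width * height > 0.5 * tile_size * tile_size * cols * rows:
                best = (cols, rows)
    return best


print(choose_grid(1920, 1080, max_tiles=8, tile_size=392))  # (4, 2)
print(choose_grid(300, 300, max_tiles=8, tile_size=392))    # (1, 1)
```

For a 1920x1080 frame, both 2x1 and 4x2 have ratio 2.0; the area check promotes the 4x2 grid because the frame has enough pixels to fill it.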
50
+ class NVILAProcessorKwargs(ProcessingKwargs, total=False):
51
+ _defaults = {} # type: ignore
52
+
53
+
54
+ def _load_video_frames(video_path: str, num_frames: int = 8) -> list[Image]:
55
+ """
56
+ Load video frames from a video file path.
57
+ Similar to _load_video in llava/utils/media.py
58
+
59
+ Args:
60
+ video_path: Path to the video file or directory of frames
61
+ num_frames: Number of frames to extract
62
+
63
+ Returns:
64
+ List of PIL Images representing video frames
65
+ """
66
+ vidcap = cv2.VideoCapture(video_path)
67
+
68
+ if not vidcap.isOpened():
69
+ raise ValueError(f"Failed to open video: {video_path}")
70
+
71
+ frame_count = int(vidcap.get(cv2.CAP_PROP_FRAME_COUNT))
72
+ while frame_count > 0:
73
+ vidcap.set(cv2.CAP_PROP_POS_FRAMES, frame_count - 1)
74
+ if vidcap.grab():
75
+ break
76
+ frame_count -= 1
77
+ else:
78
+ vidcap.release()
79
+ raise ValueError(f"Video '{video_path}' has no frames.")
80
+
81
+ indices = np.round(np.linspace(0, frame_count - 1, num_frames)).astype(int)
82
+ frames = {}
83
+ for index in indices:
84
+ if index in frames:
85
+ continue
86
+ vidcap.set(cv2.CAP_PROP_POS_FRAMES, index)
87
+ success, frame = vidcap.read()
88
+ if not success:
89
+ continue
90
+ frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
91
+ frames[index] = Image.fromarray(frame)
92
+
93
+ vidcap.release()
94
+
95
+ frames_to_return = [frames[index] for index in indices if index in frames]
96
+ if len(frames_to_return) < num_frames:
97
+ if frames_to_return:
98
+ frames_to_return = frames_to_return + [frames_to_return[-1]] * (num_frames - len(frames_to_return))
99
+ else:
100
+ raise ValueError(f"Could not extract any frames from video: {video_path}")
101
+
102
+ return frames_to_return
103
+
104
+
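The uniform sampling in `_load_video_frames` can be previewed without OpenCV or NumPy. The sketch below is an approximation written in pure Python; `sample_frame_indices` is a hypothetical helper, and its results may differ from `np.round(np.linspace(...))` only in rare floating-point edge cases.

```python
def sample_frame_indices(frame_count: int, num_frames: int) -> list:
    """Evenly spaced frame indices from 0 to frame_count - 1 (inclusive)."""
    if frame_count <= 0 or num_frames <= 0:
        raise ValueError("frame_count and num_frames must be positive")
    if num_frames == 1:
        return [0]
    step = (frame_count - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


print(sample_frame_indices(100, 8))  # [0, 14, 28, 42, 57, 71, 85, 99]
print(sample_frame_indices(3, 8))    # indices repeat when the video is short
```

When the video has fewer frames than requested, indices repeat; the function above handles that case by reading each distinct index once and reusing the decoded frame.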
105
+ class NVILAProcessor(ProcessorMixin):
106
+ attributes = [
107
+ "image_processor",
108
+ "tokenizer",
109
+ ]
110
+ image_processor_class = "AutoImageProcessor"
111
+ tokenizer_class = "AutoTokenizer"
112
+ _auto_class = "AutoProcessor"
113
+
114
+ def __init__(
115
+ self,
116
+ image_processor: SiglipImageProcessor | SiglipImageProcessorFast,
117
+ tokenizer: Qwen2Tokenizer | Qwen2TokenizerFast,
118
+ chat_template: str | None = None,
119
+ autogaze_model_id: str | None = None,
120
+ gazing_ratio_tile: list[float] | float = 0.75,
121
+ gazing_ratio_thumbnail: float | None = 0.75,
122
+ task_loss_requirement_tile: float = 0.7,
123
+ task_loss_requirement_thumbnail: float | None = 0.7,
124
+ target_scales: list[int] | None = None,
125
+ target_patch_size: int | None = None,
126
+ max_tiles_image: int = 12,
127
+ num_video_frames: int = 8,
128
+ max_tiles_video: int = 8,
129
+ num_video_frames_thumbnail: int = 8,
130
+ mm_projector_shuffle_num: int = 9,
131
+ max_batch_size_autogaze: int = 32,
132
+ **kwargs,
133
+ ):
134
+ super().__init__(
135
+ image_processor,
136
+ tokenizer,
137
+ chat_template=chat_template,
138
+ **kwargs,
139
+ )
140
+
141
+ self.image_processor: SiglipImageProcessor | SiglipImageProcessorFast
142
+ self.tokenizer: Qwen2Tokenizer | Qwen2TokenizerFast
143
+
144
+ # AutoGaze configuration
145
+ self.autogaze_model_id = autogaze_model_id or "bfshi/AutoGaze"
146
+ self.gazing_ratio_tile = gazing_ratio_tile
147
+ self.gazing_ratio_thumbnail = gazing_ratio_thumbnail
148
+ self.task_loss_requirement_tile = task_loss_requirement_tile
149
+ self.task_loss_requirement_thumbnail = task_loss_requirement_thumbnail
150
+ self.target_scales = target_scales or [56, 112, 224, 448]
151
+ self.target_patch_size = target_patch_size or 16
152
+
153
+ # Image / video processing configuration
154
+ self.max_tiles_image = max_tiles_image
155
+ self.num_video_frames = num_video_frames
156
+ self.max_tiles_video = max_tiles_video
157
+ self.num_video_frames_thumbnail = num_video_frames_thumbnail
158
+ self.mm_projector_shuffle_num = mm_projector_shuffle_num
159
+ self.max_batch_size_autogaze = max_batch_size_autogaze
160
+
161
+ # Load AutoGaze eagerly; it is moved to CUDA below, so a GPU is required
162
+ self._autogaze_model = None
163
+ self._autogaze_model = AutoGaze.from_pretrained(
164
+ self.autogaze_model_id,
165
+ device_map=None,
166
+ )
167
+ self._autogaze_model.to("cuda").eval()
168
+ print("AutoGaze loaded successfully in processor")
169
+
170
+ def __call__(
171
+ self,
172
+ *,
173
+ text: TextInput | list[TextInput],
174
+ images: ImageInput | None = None,
175
+ videos: VideoInput | None = None,
176
+ **kwargs: Unpack[NVILAProcessorKwargs],
177
+ ) -> BatchFeature:
178
+ normalized_text, normalized_images, normalized_videos = self._normalize_inputs(
179
+ text=text,
180
+ images=images,
181
+ videos=videos,
182
+ )
183
+
184
+ images_inputs, image_token_padding_strategy = (
185
+ self._preprocess_images(
186
+ normalized_images,
187
+ **kwargs,
188
+ )
189
+ if len(normalized_images) > 0
190
+ else (BatchFeature(), [])
191
+ )
192
+
193
+ videos_inputs = (
194
+ self._preprocess_videos(
195
+ normalized_videos,
196
+ **kwargs,
197
+ )
198
+ if len(normalized_videos) > 0
199
+ else BatchFeature()
200
+ )
201
+
202
+ # Run AutoGaze on preprocessed tiles/thumbnails and compute padding
203
+ gazing_info = None
204
+ video_token_padding_strategy = []
205
+ skip_tiles_gaze = self._should_gaze_all_patches(self.gazing_ratio_tile, self.task_loss_requirement_tile)
206
+ skip_thumbs_gaze = self._should_gaze_all_patches(self.gazing_ratio_thumbnail, self.task_loss_requirement_thumbnail)
207
+ can_construct_without_autogaze = skip_tiles_gaze and skip_thumbs_gaze
208
+ if len(normalized_videos) > 0 and (self._autogaze_model is not None or can_construct_without_autogaze):
209
+ gazing_info = self._get_gazing_info_from_videos(videos_inputs)
210
+ # Compute video padding strategy from gazing results.
211
+ # Because the mm_projector uses TokenShuffle(9), each
212
+ # "effective frame" is padded to a multiple of 9 before
213
+ # projection, then divided by 9. So total tokens per
214
+ # video = sum_over_frames(ceil(non_padded_per_frame / 9)).
215
+ shuffle_num = self.mm_projector_shuffle_num
216
+ ns_list = videos_inputs["num_spatial_tiles_each_video"]
217
+
218
+ for vid_idx in range(len(gazing_info["if_padded_gazing_tiles"])):
219
+ tiles_if_pad = gazing_info["if_padded_gazing_tiles"][vid_idx] # (num_tiles, N)
220
+ tiles_num_gaze = gazing_info["num_gazing_each_frame_tiles"][vid_idx] # (num_tiles, T_tile)
221
+ thumbs_if_pad = gazing_info["if_padded_gazing_thumbnails"][vid_idx] # (T_thumb, N')
222
+ thumbs_num_gaze = gazing_info["num_gazing_each_frame_thumbnails"][vid_idx] # (T_thumb, 1)
223
+
224
+ ns = ns_list[vid_idx]
225
+ num_tiles = tiles_if_pad.shape[0]
226
+ T_tile = tiles_num_gaze.shape[1]
227
+ tc = num_tiles // ns # temporal chunks
228
+ total_frames = tc * T_tile
229
+
230
+ # Non-padded count per tile per frame
231
+ tile_non_padded = [] # tile_non_padded[tile][frame] = int
232
+ for t_idx in range(num_tiles):
233
+ frame_sizes = tiles_num_gaze[t_idx].tolist()
234
+ frame_pad_segs = tiles_if_pad[t_idx].split(frame_sizes)
235
+ tile_non_padded.append(
236
+ [int((~seg).sum().item()) for seg in frame_pad_segs]
237
+ )
238
+
239
+ total_tokens = 0
240
+
241
+ # Tile effective frames (all spatial tiles for one temporal frame)
242
+ for g in range(total_frames):
243
+ chunk = g // T_tile
244
+ f_in_chunk = g % T_tile
245
+ frame_count = sum(
246
+ tile_non_padded[chunk * ns + s][f_in_chunk]
247
+ for s in range(ns)
248
+ )
249
+ total_tokens += (frame_count + shuffle_num - 1) // shuffle_num
250
+
251
+ # Thumbnail frames (each is 1 frame)
252
+ for th_idx in range(thumbs_if_pad.shape[0]):
253
+ frame_sizes = thumbs_num_gaze[th_idx].tolist()
254
+ frame_pad_segs = thumbs_if_pad[th_idx].split(frame_sizes)
255
+ non_pad = sum(int((~seg).sum().item()) for seg in frame_pad_segs)
256
+ total_tokens += (non_pad + shuffle_num - 1) // shuffle_num
257
+
258
+ video_token_padding_strategy.append([total_tokens])
259
+ else:
260
+ video_token_padding_strategy = [[(self.num_video_frames + self.num_video_frames_thumbnail) * 118] for _ in normalized_videos]
261
+
262
+ # Remove AutoGaze-processed pixel values — they were only needed
263
+ # for computing gazing_info and should not be sent to the model.
264
+ if len(normalized_videos) > 0:
265
+ videos_inputs.pop("pixel_values_videos_tiles_autogaze", None)
266
+ videos_inputs.pop("pixel_values_videos_thumbnails_autogaze", None)
267
+
268
+ text_inputs = self._preprocess_text(
269
+ normalized_text,
270
+ image_token_padding_strategy=image_token_padding_strategy,
271
+ video_token_padding_strategy=video_token_padding_strategy,
272
+ **kwargs,
273
+ )
274
+
275
+ # Combine all inputs
276
+ batch_feature = BatchFeature(
277
+ {
278
+ **text_inputs,
279
+ **images_inputs,
280
+ **videos_inputs,
281
+ }
282
+ )
283
+
284
+ # Attach gazing_info so the model can use it downstream
285
+ if gazing_info is not None:
286
+ batch_feature["gazing_info"] = gazing_info
287
+
288
+ return batch_feature
289
+
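The per-video token count computed in `__call__` is a sum of per-frame ceilings. The following sketch reproduces that arithmetic on plain integers; `count_video_tokens` is a hypothetical helper, and the input counts are made-up numbers chosen only to show the rounding.

```python
def count_video_tokens(
    tile_counts: list,
    num_spatial: int,
    thumb_counts: list,
    shuffle_num: int = 9,
) -> int:
    """tile_counts[tile][frame] and thumb_counts[i] are non-padded token counts."""
    frames_per_tile = len(tile_counts[0]) if tile_counts else 0
    num_chunks = len(tile_counts) // num_spatial
    total = 0
    for g in range(num_chunks * frames_per_tile):
        chunk, f_in_chunk = divmod(g, frames_per_tile)
        # One effective frame = all spatial tiles at the same time step.
        frame_total = sum(
            tile_counts[chunk * num_spatial + s][f_in_chunk]
            for s in range(num_spatial)
        )
        total += (frame_total + shuffle_num - 1) // shuffle_num
    for n in thumb_counts:
        total += (n + shuffle_num - 1) // shuffle_num
    return total


# One temporal chunk, two spatial tiles, two frames per tile, two thumbnails.
print(count_video_tokens([[5, 9], [6, 0]], num_spatial=2, thumb_counts=[10, 1]))  # 6
```

Frame 0 has 11 tokens (2 groups of 9), frame 1 has 9 (1 group), and the thumbnails contribute 2 + 1, giving 6 placeholder tokens for this video.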
290
+ def batch_decode(self, *args, **kwargs) -> list[str]:
291
+ return self.tokenizer.batch_decode(*args, **kwargs)
292
+
293
+ def _normalize_inputs(
294
+ self,
295
+ *,
296
+ text: TextInput | list[TextInput],
297
+ images: ImageInput | None,
298
+ videos: VideoInput | None,
299
+ ) -> tuple[list[str], list[Image], list[list[Image]]]:
300
+ if isinstance(text, list):
301
+ normalized_text = text
302
+ else:
303
+ normalized_text = [text]
304
+
305
+ if images is not None and images != []:
306
+ image_flat_list = cast(list, image_utils.make_flat_list_of_images(images))
307
+ normalized_images = [cast(Image, image_transforms.to_pil_image(image)) for image in image_flat_list]
308
+ else:
309
+ normalized_images = []
310
+
311
+ if videos is not None and videos != []:
312
+ # Handle video inputs - can be file paths (str) or lists of PIL Images
313
+ # videos can be a single item or a list
314
+ if not isinstance(videos, (list, tuple)):
315
+ videos = [videos]
316
+
317
+ normalized_videos = []
318
+ # Use num_video_frames from processor config
319
+ num_frames = self.num_video_frames
320
+ for video_input in videos:
321
+ if isinstance(video_input, str):
322
+ parsed = urlparse(video_input)
323
+ if parsed.scheme in ("http", "https"):
324
+ suffix = os.path.splitext(parsed.path)[1] or ".mp4"
325
+ tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
326
+ try:
327
+ urllib.request.urlretrieve(video_input, tmp.name)
328
+ video_frames = _load_video_frames(tmp.name, num_frames=num_frames)
329
+ finally:
330
+ tmp.close()
331
+ os.unlink(tmp.name)
332
+ else:
333
+ video_frames = _load_video_frames(video_input, num_frames=num_frames)
334
+ normalized_videos.append(video_frames)
335
+ elif isinstance(video_input, (list, tuple)):
336
+ # If it's already a list of images, convert them to PIL Images
337
+ normalized_videos.append([
338
+ cast(Image, image_transforms.to_pil_image(image)) for image in video_input
339
+ ])
340
+ else:
341
+ # Try to use video_utils for other types
342
+ try:
343
+ video_list = cast(list[list], video_utils.make_batched_videos([video_input]))
344
+ normalized_videos.extend([
345
+ [cast(Image, image_transforms.to_pil_image(image)) for image in video]
346
+ for video in video_list
347
+ ])
348
+ except Exception:
349
+ raise ValueError(
350
+ f"Unsupported video input type: {type(video_input)}. "
351
+ "Expected str (file path or http(s) URL) or list of PIL Images."
352
+ )
353
+ else:
354
+ normalized_videos = []
355
+
356
+ return normalized_text, normalized_images, normalized_videos
357
+
358
+ def _preprocess_images(
359
+ self,
360
+ images: list[Image],
361
+ **kwargs: Unpack[NVILAProcessorKwargs],
362
+ ) -> tuple[BatchFeature, list[list[int]]]:
363
+ """Preprocess images into spatial tiles plus a thumbnail.
364
+
365
+ Each image is split into a grid of spatial tiles whose count is at
366
+ most ``max_tiles_image``. A thumbnail (the whole image resized to
367
+ ``image_size × image_size``) is appended. Every tile / thumbnail
368
+ is a single-frame "video" of shape ``(1, C, H, W)``. No AutoGaze
369
+ is applied — all patches are kept.
370
+
371
+ Returns:
372
+ A tuple ``(images_inputs, padding_strategy)`` where
373
+ ``images_inputs`` is a ``BatchFeature`` with:
374
+
375
+ - ``"pixel_values_images_tiles"`` – list of tensors, one per
376
+ image, each ``(num_tiles_i, 1, C, H, W)``.
377
+ - ``"pixel_values_images_thumbnails"`` – list of tensors, one
378
+ per image, each ``(1, 1, C, H, W)``.
379
+ - ``"num_spatial_tiles_each_image"`` – list of ints.
380
+
381
+ ``padding_strategy`` is a list (one per image) of
382
+ ``[total_tokens]`` used for text-token padding.
383
+ """
384
+ merged_kwargs = self._merge_kwargs(
385
+ NVILAProcessorKwargs, # type: ignore
386
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
387
+ **kwargs,
388
+ )
389
+
390
+ if hasattr(self.image_processor, "size"):
391
+ image_size = self.image_processor.size.get("height", 392)
392
+ else:
393
+ image_size = 392
394
+
395
+ shuffle_num = self.mm_projector_shuffle_num
396
+
397
+ num_patches_each_scale = [
398
+ (s // self.target_patch_size) ** 2 for s in self.target_scales
399
+ ]
400
+ total_patches_per_frame = sum(num_patches_each_scale)
401
+
402
+ pixel_values_images_tiles: list[torch.Tensor] = []
403
+ pixel_values_images_thumbnails: list[torch.Tensor] = []
404
+ num_spatial_tiles_each_image: list[int] = []
405
+ padding_strategy: list[list[int]] = []
406
+
407
+ for image in images:
408
+ image = image.convert("RGB")
409
+ orig_width, orig_height = image.size
410
+
411
+ max_spatial_tiles = max(self.max_tiles_image, 1)
412
+ aspect_ratio = orig_width / orig_height
413
+
414
+ target_ratios = {
415
+ (i, j)
416
+ for n in range(1, max_spatial_tiles + 1)
417
+ for i in range(1, n + 1)
418
+ for j in range(1, n + 1)
419
+ if 1 <= i * j <= max_spatial_tiles
420
+ }
421
+ target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
422
+
423
+ target_aspect_ratio = _find_closest_aspect_ratio(
424
+ aspect_ratio, target_ratios, orig_width, orig_height, image_size
425
+ )
426
+
427
+ target_width = image_size * target_aspect_ratio[0]
428
+ target_height = image_size * target_aspect_ratio[1]
429
+ num_tiles = target_aspect_ratio[0] * target_aspect_ratio[1]
430
+ num_cols = target_aspect_ratio[0]
431
+
432
+ resized = image.resize((target_width, target_height))
433
+
434
+ # Spatial tiles + thumbnail (whole image resized)
435
+ all_tile_images: list[Image] = []
436
+ for tile_idx in range(num_tiles):
437
+ col = tile_idx % num_cols
438
+ row = tile_idx // num_cols
439
+ box = (
440
+ col * image_size,
441
+ row * image_size,
442
+ (col + 1) * image_size,
443
+ (row + 1) * image_size,
444
+ )
445
+ all_tile_images.append(resized.crop(box))
446
+
447
+ thumbnail = image.resize((image_size, image_size))
448
+ all_images_for_siglip = all_tile_images + [thumbnail]
449
+
450
+ # SigLIP: process tiles + thumbnail at once → (num_tiles+1, C, H, W)
451
+ siglip_processed = self.image_processor(
452
+ all_images_for_siglip, **merged_kwargs["images_kwargs"],
453
+ )["pixel_values"]
454
+ if not isinstance(siglip_processed, torch.Tensor):
455
+ siglip_processed = torch.tensor(np.array(siglip_processed))
456
+
457
+ # Split into tiles and thumbnail, add temporal dim
458
+ tiles_pv = siglip_processed[:num_tiles].unsqueeze(1) # (num_tiles, 1, C, H, W)
459
+ thumb_pv = siglip_processed[num_tiles:].unsqueeze(1) # (1, 1, C, H, W)
460
+
461
+ pixel_values_images_tiles.append(tiles_pv)
462
+ pixel_values_images_thumbnails.append(thumb_pv)
463
+ num_spatial_tiles_each_image.append(num_tiles)
464
+
465
+ # Padding: tiles effective frame + thumbnail effective frame
466
+ tiles_tokens = (num_tiles * total_patches_per_frame + shuffle_num - 1) // shuffle_num
467
+ thumb_tokens = (total_patches_per_frame + shuffle_num - 1) // shuffle_num
468
+ padding_strategy.append([tiles_tokens + thumb_tokens])
469
+
470
+ images_inputs = BatchFeature({
471
+ "pixel_values_images_tiles": pixel_values_images_tiles,
472
+ "pixel_values_images_thumbnails": pixel_values_images_thumbnails,
473
+ "num_spatial_tiles_each_image": num_spatial_tiles_each_image,
474
+ })
475
+
476
+ return images_inputs, padding_strategy
477
+
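As a worked example of the image padding arithmetic, the sketch below plugs in the values from `preprocessor_config.json` in this commit (`target_scales = [56, 112, 196, 392]`, `target_patch_size = 16`, `mm_projector_shuffle_num = 9`). `image_token_count` is a hypothetical helper that only mirrors the formula above.

```python
def image_token_count(
    num_tiles: int,
    target_scales: list,
    patch_size: int,
    shuffle_num: int = 9,
) -> int:
    """Placeholder tokens for one image: tiles frame plus thumbnail frame."""
    patches_per_frame = sum((s // patch_size) ** 2 for s in target_scales)
    tiles_tokens = (num_tiles * patches_per_frame + shuffle_num - 1) // shuffle_num
    thumb_tokens = (patches_per_frame + shuffle_num - 1) // shuffle_num
    return tiles_tokens + thumb_tokens


scales = [56, 112, 196, 392]
print(sum((s // 16) ** 2 for s in scales))  # 778 patches per frame
print(image_token_count(4, scales, 16))     # 433
```

With these values each frame has 9 + 49 + 144 + 576 = 778 patches, so a 4-tile image needs ceil(3112 / 9) + ceil(778 / 9) = 346 + 87 = 433 tokens.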
478
+ def _preprocess_text(
479
+ self,
480
+ text: list[str],
481
+ *,
482
+ image_token_padding_strategy: list[list[int]],
483
+ video_token_padding_strategy: list[list[int]],
484
+ **kwargs: Unpack[NVILAProcessorKwargs],
485
+ ) -> BatchEncoding:
486
+ # Apply chat template to text
487
+ messages = [[
488
+ {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
489
+ {"role": "user", "content": t}
490
+ ] for t in text]
491
+ text = self.tokenizer.apply_chat_template(
492
+ messages,
493
+ tokenize=False,
494
+ add_generation_prompt=True
495
+ )
496
+
497
+ # Pad media tokens.
498
+ assert isinstance(self.tokenizer.image_token, str)
499
+ assert isinstance(self.tokenizer.video_token, str)
500
+
501
+ for media_token, padding_strategy in (
502
+ (self.tokenizer.image_token, image_token_padding_strategy),
503
+ (self.tokenizer.video_token, video_token_padding_strategy),
504
+ ):
505
+ assert sum([s.count(media_token) for s in text]) == len(padding_strategy)
506
+
507
+ # Pad to number of tiles.
508
+ pad_lens = [len(x) for x in padding_strategy]
509
+ text = [re.sub(rf"({re.escape(media_token)})", lambda _: media_token * pad_lens.pop(0), s) for s in text]
510
+
511
+ # Pad to number of features.
512
+ pad_lens = [y for x in padding_strategy for y in x]
513
+ text = [re.sub(rf"({re.escape(media_token)})", lambda _: media_token * pad_lens.pop(0), s) for s in text]
514
+
515
+ merged_kwargs = self._merge_kwargs(
516
+ NVILAProcessorKwargs, # type: ignore
517
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
518
+ **kwargs,
519
+ )
520
+
521
+ text_inputs = self.tokenizer(
522
+ text=text,
523
+ **merged_kwargs["text_kwargs"],
524
+ )
525
+
526
+ return text_inputs
527
+
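The media-token expansion in `_preprocess_text` relies on `re.sub` calling a replacement function once per match, in order. The sketch below shows that mechanism in a single pass (the code above uses two passes); `expand_media_tokens` and the `<vid>` token string are hypothetical and chosen only for this example.

```python
import re


def expand_media_tokens(texts: list, media_token: str, counts: list) -> list:
    """Replace the k-th occurrence of `media_token` with counts[k] copies."""
    remaining = list(counts)

    def repl(_match):
        if not remaining:
            raise ValueError("more media tokens in text than counts provided")
        return media_token * remaining.pop(0)

    expanded = [re.sub(re.escape(media_token), repl, t) for t in texts]
    if remaining:
        raise ValueError("fewer media tokens in text than counts provided")
    return expanded


print(expand_media_tokens(["a <vid> b <vid>"], "<vid>", [2, 3]))
# ['a <vid><vid> b <vid><vid><vid>']
```

Because `re.sub` does not rescan the text it just inserted, the repeated tokens are not expanded a second time within the same pass.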
528
+ def _preprocess_videos(
529
+ self,
530
+ videos: list[list[Image]],
531
+ **kwargs: Unpack[NVILAProcessorKwargs],
532
+ ) -> BatchFeature:
533
+ """Preprocess videos into spatiotemporal tiles and thumbnails.
534
+
535
+ Each video is split into a grid of spatiotemporal tiles and a set of
536
+ low-resolution thumbnail frames. Both SigLIP-processed and
537
+ AutoGaze-processed copies are produced.
538
+
539
+ Spatial tiling
540
+ Every frame is resized so that its dimensions become a multiple of
541
+ ``image_size`` (from the SigLIP image processor) and then cropped
542
+ into ``(cols, rows)`` spatial tiles, where ``cols * rows <=
543
+ max_tiles_video``. The best ``(cols, rows)`` is chosen by matching
544
+ the original frame aspect ratio (same logic as
545
+ ``dynamic_preprocess`` in ``llava/mm_utils.py``).
546
+
547
+ Temporal chunking
548
+ The T sampled frames are divided into ``T // max_num_frames``
549
+ consecutive chunks of ``max_num_frames`` frames each, where
550
+ ``max_num_frames`` comes from the AutoGaze model config.
551
+ ``T`` must be divisible by ``max_num_frames``.
552
+
553
+ Tile ordering
554
+ Tiles are ordered **temporal-chunk-first**: all spatial tiles for
555
+ the first temporal chunk, then all spatial tiles for the second
556
+ temporal chunk, and so on.
557
+
558
+ Thumbnails
559
+ Each frame is also resized to ``image_size × image_size`` to form a
560
+ thumbnail. If the number of frames exceeds
561
+ ``num_video_frames_thumbnail``, thumbnails are uniformly subsampled
562
+ (every k-th frame) to that count. Each thumbnail is treated as a
563
+ single-frame video (temporal dim = 1).
564
+
565
+ Args:
566
+ videos: List of videos, where each video is a list of PIL Images
567
+ (one per frame).
568
+ **kwargs: Additional keyword arguments forwarded to the SigLIP
569
+ image processor.
570
+
571
+ Returns:
572
+ Only ``videos_inputs`` is returned (a single ``BatchFeature``, not a tuple), where
573
+
574
+ ``videos_inputs`` is a ``BatchFeature`` dict with the keys:
575
+
576
+ - ``"pixel_values_videos_tiles"`` – list of tensors, one per video.
577
+ Each tensor has shape ``(num_tiles, T_tile, C, H, W)`` where
578
+ ``num_tiles = num_spatial_tiles * temporal_chunks``,
579
+ ``T_tile = max_num_frames`` (from AutoGaze config),
580
+ and ``H = W = image_size``.
581
+ Processed by the SigLIP image processor.
582
+ - ``"pixel_values_videos_thumbnails"`` – list of tensors, one per
583
+ video. Each tensor has shape
584
+ ``(T_thumbnail, 1, C, H, W)`` where ``T_thumbnail <=
585
+ num_video_frames_thumbnail`` and ``H = W = image_size``.
586
+ Processed by the SigLIP image processor.
587
+ - ``"pixel_values_videos_tiles_autogaze"`` *(optional)* – same
588
+ structure as ``pixel_values_videos_tiles`` but processed by the
589
+ AutoGaze ``transform_video_for_pytorch`` transform.
590
+ Only present when AutoGaze is available.
591
+ - ``"pixel_values_videos_thumbnails_autogaze"`` *(optional)* – same
592
+ structure as ``pixel_values_videos_thumbnails`` but processed by
593
+ the AutoGaze transform. Only present when AutoGaze is available.
594
+
595
+ - ``"num_spatial_tiles_each_video"`` – list of ints, one per video,
596
+ giving the number of spatial tiles (``cols * rows``) chosen for it.
597
+ No ``padding_strategy`` is returned by this method.
598
+ """
599
+ merged_kwargs = self._merge_kwargs(
600
+ NVILAProcessorKwargs, # type: ignore
601
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
602
+ **kwargs,
603
+ )
604
+
605
+ # Get siglip image size (tile spatial resolution)
606
+ if hasattr(self.image_processor, "size"):
607
+ image_size = self.image_processor.size.get("height", 392)
608
+ else:
609
+ image_size = 392
610
+
611
+ # Get AutoGaze max_num_frames for temporal chunking
612
+ if self._autogaze_model is not None:
613
+ autogaze_max_num_frames = self._autogaze_model.config.max_num_frames
614
+ else:
615
+ autogaze_max_num_frames = 16 # default
616
+
617
+ # Load AutoGaze transform if available
618
+ autogaze_transform = None
619
+ largest_scale = max(self.target_scales)
620
+ autogaze_transform = AutoGazeImageProcessor.from_pretrained(
621
+ self.autogaze_model_id,
622
+ size=(largest_scale, largest_scale),
623
+ )
624
+
625
+ pixel_values_videos_tiles = []
626
+ pixel_values_videos_thumbnails = []
627
+ pixel_values_videos_tiles_autogaze = []
628
+ pixel_values_videos_thumbnails_autogaze = []
629
+ num_spatial_tiles_each_video = []
630
+
631
+ for video in videos:
632
+ video = [img.convert("RGB") for img in video]
633
+ num_frames = len(video)
634
+ orig_width, orig_height = video[0].size
635
+
636
+ # --- Temporal chunking ---
637
+ temporal_chunks = num_frames // autogaze_max_num_frames
638
+ assert temporal_chunks >= 1 and num_frames % autogaze_max_num_frames == 0, (
639
+ f"Number of frames ({num_frames}) must be divisible by "
640
+ f"AutoGaze max_num_frames ({autogaze_max_num_frames})"
641
+ )
642
+
643
+ # --- Spatial tiling ---
644
+ # max_tiles_video directly controls the max number of spatial tiles
645
+ max_spatial_tiles = max(self.max_tiles_video, 1)
646
+
647
+ # Use dynamic_preprocess-style approach for finding best spatial aspect ratio
648
+ aspect_ratio = orig_width / orig_height
649
+
650
+ target_ratios = {
651
+ (i, j)
652
+ for n in range(1, max_spatial_tiles + 1)
653
+ for i in range(1, n + 1)
654
+ for j in range(1, n + 1)
655
+ if 1 <= i * j <= max_spatial_tiles
656
+ }
657
+ target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
658
+
659
+ target_aspect_ratio = _find_closest_aspect_ratio(
660
+ aspect_ratio, target_ratios, orig_width, orig_height, image_size
661
+ )
662
+
663
+ target_width = image_size * target_aspect_ratio[0] # cols * image_size
664
+ target_height = image_size * target_aspect_ratio[1] # rows * image_size
665
+ num_spatial_tiles = target_aspect_ratio[0] * target_aspect_ratio[1]
666
+ num_cols = target_aspect_ratio[0]
667
+
668
+ # --- Build per-frame spatial tiles and thumbnails ---
669
+ # spatial_tile_frames[spatial_idx] = list of T PIL Images
670
+ spatial_tile_frames = [[] for _ in range(num_spatial_tiles)]
671
+ thumbnail_frames = []
672
+
673
+ for frame in video:
674
+ # Resize frame for spatial tiling
675
+ resized_frame = frame.resize((target_width, target_height))
676
+
677
+ # Split into spatial tiles
678
+ for tile_idx in range(num_spatial_tiles):
679
+ col = tile_idx % num_cols
680
+ row = tile_idx // num_cols
681
+ box = (
682
+ col * image_size,
683
+ row * image_size,
684
+ (col + 1) * image_size,
685
+ (row + 1) * image_size,
686
+ )
687
+ tile = resized_frame.crop(box)
688
+ spatial_tile_frames[tile_idx].append(tile)
689
+
690
+ # Thumbnail: resize whole frame to image_size x image_size
691
+ thumbnail = frame.resize((image_size, image_size))
692
+ thumbnail_frames.append(thumbnail)
693
+
694
+ # --- Assemble spatiotemporal tiles ---
695
+ # Collect all tile images in flat order: temporal chunk (outer) ×
696
+ # spatial tile (inner) × frame-within-chunk (innermost).
697
+ num_tiles = temporal_chunks * num_spatial_tiles
698
+ T_tile = autogaze_max_num_frames
699
+ all_tile_images = []
700
+ for t_chunk in range(temporal_chunks):
701
+ for spatial_idx in range(num_spatial_tiles):
702
+ start = t_chunk * T_tile
703
+ end = start + T_tile
704
+ all_tile_images.extend(spatial_tile_frames[spatial_idx][start:end])
705
+
706
+ # SigLIP: process all tile images at once → (num_tiles * T_tile, C, H, W)
707
+ siglip_processed = self.image_processor(
708
+ all_tile_images, **merged_kwargs["images_kwargs"],
709
+ )["pixel_values"]
710
+ if not isinstance(siglip_processed, torch.Tensor):
711
+ siglip_processed = torch.tensor(np.array(siglip_processed))
712
+ video_tiles_siglip = siglip_processed.reshape(num_tiles, T_tile, *siglip_processed.shape[1:])
713
+ pixel_values_videos_tiles.append(video_tiles_siglip)
714
+
715
+ # AutoGaze transform: process all tile images at once
716
+ if autogaze_transform is not None:
717
+ all_tile_np = np.stack([np.array(f) for f in all_tile_images]) # (num_tiles * T_tile, H, W, 3)
718
+ autogaze_processed = transform_video_for_pytorch(all_tile_np, autogaze_transform)
719
+ video_tiles_autogaze = autogaze_processed.reshape(num_tiles, T_tile, *autogaze_processed.shape[1:])
720
+ pixel_values_videos_tiles_autogaze.append(video_tiles_autogaze)
721
+
722
+ # --- Assemble thumbnails ---
723
+ # Subsample thumbnails if needed (keep every k-th frame)
724
+ if len(thumbnail_frames) > self.num_video_frames_thumbnail:
725
+ step = len(thumbnail_frames) // self.num_video_frames_thumbnail
726
+ sampled_thumbnail_frames = thumbnail_frames[::step][: self.num_video_frames_thumbnail]
727
+ else:
728
+ sampled_thumbnail_frames = thumbnail_frames
729
+
730
+ T_thumb = len(sampled_thumbnail_frames)
731
+
732
+ # SigLIP: process all thumbnail images at once → (T_thumb, C, H, W)
733
+ siglip_processed = self.image_processor(
734
+ sampled_thumbnail_frames, **merged_kwargs["images_kwargs"],
735
+ )["pixel_values"]
736
+ if not isinstance(siglip_processed, torch.Tensor):
737
+ siglip_processed = torch.tensor(np.array(siglip_processed))
738
+ # Each thumbnail is a single-frame video → (T_thumb, 1, C, H, W)
739
+ video_thumbnails_siglip = siglip_processed.unsqueeze(1)
740
+ pixel_values_videos_thumbnails.append(video_thumbnails_siglip)
741
+
742
+ # AutoGaze transform: process all thumbnail images at once
743
+ if autogaze_transform is not None:
744
+ all_thumb_np = np.stack([np.array(f) for f in sampled_thumbnail_frames]) # (T_thumb, H, W, 3)
745
+ autogaze_processed = transform_video_for_pytorch(all_thumb_np, autogaze_transform)
746
+ video_thumbnails_autogaze = autogaze_processed.unsqueeze(1) # (T_thumb, 1, C, H, W)
747
+ pixel_values_videos_thumbnails_autogaze.append(video_thumbnails_autogaze)
748
+
749
+ num_spatial_tiles_each_video.append(num_spatial_tiles)
750
+
751
+ print(
752
+ f"Video tiling: {num_frames} frames @ {orig_width}x{orig_height} → "
753
+ f"{num_spatial_tiles} spatial × {temporal_chunks} temporal = "
754
+ f"{num_spatial_tiles * temporal_chunks} tiles, each "
755
+ f"{autogaze_max_num_frames}×{image_size}×{image_size}; "
756
+ f"{len(sampled_thumbnail_frames)} thumbnail frames"
757
+ )
758
+
759
+ # Build output BatchFeature
760
+ videos_inputs = BatchFeature(
761
+ {
762
+ "pixel_values_videos_tiles": pixel_values_videos_tiles,
763
+ "pixel_values_videos_thumbnails": pixel_values_videos_thumbnails,
764
+ "num_spatial_tiles_each_video": num_spatial_tiles_each_video,
765
+ }
766
+ )
767
+ if pixel_values_videos_tiles_autogaze:
768
+ videos_inputs["pixel_values_videos_tiles_autogaze"] = pixel_values_videos_tiles_autogaze
769
+ if pixel_values_videos_thumbnails_autogaze:
770
+ videos_inputs["pixel_values_videos_thumbnails_autogaze"] = pixel_values_videos_thumbnails_autogaze
771
+
772
+ return videos_inputs
773
+
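The tiling arithmetic in `_preprocess_videos` can be sketched without any image library. The helper names below are hypothetical, and the sketch leaves out `_find_closest_aspect_ratio`, which is not shown in this diff. The extra tie-break in the sort key is an addition for deterministic output, not something the source does.

```python
def candidate_grids(max_spatial_tiles):
    """All (cols, rows) grids with cols * rows <= max_spatial_tiles."""
    grids = {
        (i, j)
        for i in range(1, max_spatial_tiles + 1)
        for j in range(1, max_spatial_tiles + 1)
        if i * j <= max_spatial_tiles
    }
    # Sort by tile count; the (cols, rows) tie-break is an assumption.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def tile_box(tile_idx, num_cols, image_size):
    """Crop box (left, upper, right, lower) of a spatial tile, row-major."""
    col, row = tile_idx % num_cols, tile_idx // num_cols
    return (col * image_size, row * image_size,
            (col + 1) * image_size, (row + 1) * image_size)


def tile_order(temporal_chunks, num_spatial_tiles):
    """(chunk, spatial) index of each tile in temporal-chunk-first order."""
    return [(t, s) for t in range(temporal_chunks) for s in range(num_spatial_tiles)]


def subsample_thumbnails(frames, max_count):
    """Keep every k-th frame, then truncate to max_count."""
    if len(frames) > max_count:
        step = len(frames) // max_count
        return frames[::step][:max_count]
    return frames


grids = candidate_grids(4)
box = tile_box(3, num_cols=2, image_size=392)
order = tile_order(temporal_chunks=2, num_spatial_tiles=3)
thumbs = subsample_thumbnails(list(range(100)), 32)
print(grids, box, order, thumbs[-1])
```

Because the step is a floor division followed by truncation, 100 frames subsampled to 32 stop at frame 93, so the last few frames of a video may not be represented in the thumbnails.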
774
+    @staticmethod
+    def _should_gaze_all_patches(gazing_ratio, task_loss_requirement) -> bool:
+        """Return True when the gazing config means every patch is kept.
+
+        This is the case when ``gazing_ratio`` is ``None`` (no gazing at all),
+        or when ``gazing_ratio == 1`` (keep 100 %) **and**
+        ``task_loss_requirement is None`` (no adaptive pruning).
+        """
+        if gazing_ratio is None:
+            return True
+        if task_loss_requirement is not None:
+            return False
+        if isinstance(gazing_ratio, (list, tuple)):
+            return all(r == 1 for r in gazing_ratio)
+        return gazing_ratio == 1
789
+
790
+    @staticmethod
+    def _sort_gazing_pos_per_frame(
+        gazing_pos: torch.Tensor,
+        if_padded: torch.Tensor,
+        num_gazing_each_frame: torch.Tensor,
+    ) -> torch.Tensor:
+        """Sort non-padded gazing positions in ascending order within each frame.
+
+        Padded positions are left untouched at the end of each frame's segment
+        so that the total count (padded + non-padded) per frame is unchanged.
+
+        Args:
+            gazing_pos: ``(B, N)`` tensor of gazing patch indices.
+            if_padded: ``(B, N)`` bool tensor (``True`` = padded / dummy).
+            num_gazing_each_frame: ``(B, T)`` tensor giving the number of
+                gazing positions (padded + non-padded) for each frame.
+
+        Returns:
+            A new ``(B, N)`` tensor with the same values as *gazing_pos*
+            except that the non-padded entries within every frame are sorted.
+        """
+        sorted_pos = gazing_pos.clone()
+        B, _ = gazing_pos.shape
+        T = num_gazing_each_frame.shape[1]
+
+        for b in range(B):
+            offset = 0
+            for t in range(T):
+                count = int(num_gazing_each_frame[b, t].item())
+                frame_pos = gazing_pos[b, offset : offset + count]
+                frame_pad = if_padded[b, offset : offset + count]
+
+                # Indices of non-padded (real) positions within the frame segment
+                real_mask = ~frame_pad
+                real_pos = frame_pos[real_mask]
+
+                # Sort the real positions
+                real_pos_sorted = real_pos.sort()[0]
+
+                # Write sorted values back at the correct locations
+                real_indices = real_mask.nonzero(as_tuple=True)[0]
+                sorted_pos[b, offset + real_indices] = real_pos_sorted
+
+                offset += count
+
+        return sorted_pos
836
+
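The per-frame sort is easier to follow on plain lists. The sketch below mirrors the logic of `_sort_gazing_pos_per_frame` for a single row; the function name `sort_real_positions` and the sample values are illustrative assumptions, and plain Python lists stand in for tensors.

```python
def sort_real_positions(gazing_pos, if_padded, num_gazing_each_frame):
    """Sort the non-padded entries inside each frame segment of one row."""
    out = list(gazing_pos)
    offset = 0
    for count in num_gazing_each_frame:
        segment = range(offset, offset + count)
        # Slots in this frame that hold real (non-padded) positions.
        real_idx = [i for i in segment if not if_padded[i]]
        real_sorted = sorted(gazing_pos[i] for i in real_idx)
        # Write sorted values back into the same real slots.
        for i, value in zip(real_idx, real_sorted):
            out[i] = value
        offset += count
    return out


pos = [7, 2, 5, 0, 9, 4, 0]
pad = [False, False, False, True, False, False, True]
counts = [4, 3]  # frame 0 owns four slots, frame 1 owns three
result = sort_real_positions(pos, pad, counts)
print(result)
```

Padded slots keep their values and their places; only the real entries inside each frame segment are reordered.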
837
+ def _run_autogaze_batched(
838
+ self,
839
+ all_videos: torch.Tensor,
840
+ autogaze_device: torch.device,
841
+ cpu_device: torch.device,
842
+ gazing_ratio,
843
+ task_loss_requirement,
844
+ ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
845
+ """Run AutoGaze in minibatches and return combined results on CPU.
846
+
847
+ Different minibatches may produce different per-frame gazing counts
848
+ (e.g. when ``task_loss_requirement`` triggers adaptive pruning).
849
+ This method pads each frame's segment to the *maximum* count across
850
+ all minibatches so that the results can be concatenated along the
851
+ batch dimension.
852
+
853
+ Args:
854
+ all_videos: ``(B, T, C, H, W)`` tensor of videos to process.
855
+ autogaze_device: Device where AutoGaze runs (typically CUDA).
856
+ cpu_device: Device for the returned tensors (typically CPU).
857
+ gazing_ratio: Gazing ratio to pass to AutoGaze.
858
+ task_loss_requirement: Task loss requirement to pass to AutoGaze.
859
+
860
+ Returns:
861
+ A tuple ``(gazing_pos, if_padded, num_gazing)`` where
862
+
863
+ - ``gazing_pos`` is ``(B, N_max)`` on *cpu_device*
864
+ - ``if_padded`` is ``(B, N_max)`` bool on *cpu_device*
865
+ - ``num_gazing`` is ``(B, T)`` on *cpu_device*
866
+
867
+ ``N_max = sum(max_per_frame)`` where ``max_per_frame[t]`` is the
868
+ largest per-frame count across all minibatches.
869
+ """
870
+ total = all_videos.shape[0]
871
+ bs = self.max_batch_size_autogaze
872
+
873
+ batch_results: list[dict] = []
874
+
875
+ with torch.inference_mode():
876
+ for start in range(0, total, bs):
877
+ batch = all_videos[start : start + bs]
878
+
879
+ gaze = self._autogaze_model(
880
+ {"video": batch.to(autogaze_device)},
881
+ gazing_ratio=gazing_ratio,
882
+ task_loss_requirement=task_loss_requirement,
883
+ target_scales=self.target_scales,
884
+ target_patch_size=self.target_patch_size,
885
+ )
886
+
887
+ ng = gaze["num_gazing_each_frame"]
888
+ # Normalize to a tensor on cpu_device; lists and other
889
+ # array-likes share one conversion path.
890
+ if isinstance(ng, torch.Tensor):
891
+ ng = ng.to(cpu_device)
892
+ else:
893
+ ng = torch.tensor(ng, device=cpu_device, dtype=torch.long)
894
+ if ng.dim() == 2:
895
+ ng = ng[0]
896
+
897
+ batch_results.append({
898
+ "gazing_pos": gaze["gazing_pos"].to(cpu_device),
899
+ "if_padded": gaze["if_padded_gazing"].to(cpu_device),
900
+ "num_gazing": ng,
901
+ "batch_size": batch.shape[0],
902
+ })
903
+
904
+ # Fast path: single minibatch — no cross-batch padding needed
905
+ if len(batch_results) == 1:
906
+ r = batch_results[0]
907
+ num_gazing = r["num_gazing"].unsqueeze(0).expand(total, -1).contiguous()
908
+ return r["gazing_pos"], r["if_padded"], num_gazing
909
+
910
+ # Compute the max per-frame count across all minibatches
911
+ all_ng = torch.stack([r["num_gazing"] for r in batch_results], dim=0) # (num_minibatches, T)
912
+ max_per_frame = all_ng.max(dim=0).values # (T,)
913
+ max_N = int(max_per_frame.sum().item())
914
+ T = max_per_frame.shape[0]
915
+
916
+ padded_pos_list = []
917
+ padded_mask_list = []
918
+
919
+ for r in batch_results:
920
+ src_pos = r["gazing_pos"] # (mini_B, N_src)
921
+ src_pad = r["if_padded"] # (mini_B, N_src)
922
+ src_ng = r["num_gazing"] # (T,)
923
+ mini_B = r["batch_size"]
924
+
925
+ if int(src_ng.sum().item()) == max_N:
926
+ padded_pos_list.append(src_pos)
927
+ padded_mask_list.append(src_pad)
928
+ continue
929
+
930
+ dst_pos = torch.zeros(mini_B, max_N, device=cpu_device, dtype=src_pos.dtype)
931
+ dst_pad = torch.ones(mini_B, max_N, device=cpu_device, dtype=torch.bool)
932
+
933
+ src_off = 0
934
+ dst_off = 0
935
+ for t in range(T):
936
+ sc = int(src_ng[t].item())
937
+ dc = int(max_per_frame[t].item())
938
+ dst_pos[:, dst_off : dst_off + sc] = src_pos[:, src_off : src_off + sc]
939
+ dst_pad[:, dst_off : dst_off + sc] = src_pad[:, src_off : src_off + sc]
940
+ src_off += sc
941
+ dst_off += dc
942
+
943
+ padded_pos_list.append(dst_pos)
944
+ padded_mask_list.append(dst_pad)
945
+
946
+ gazing_pos = torch.cat(padded_pos_list, dim=0)
947
+ if_padded = torch.cat(padded_mask_list, dim=0)
948
+ num_gazing = max_per_frame.unsqueeze(0).expand(total, -1).contiguous()
949
+
950
+ return gazing_pos, if_padded, num_gazing
951
+
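The cross-minibatch padding in `_run_autogaze_batched` can be sketched with plain lists. This is a simplified illustration under stated assumptions: the helper `pad_to_max_per_frame` is hypothetical, the source rows are assumed to contain no padding of their own, and the counts are invented.

```python
def pad_to_max_per_frame(rows, counts, max_per_frame, pad_value=0):
    """Re-lay out each row so frame t occupies max_per_frame[t] slots.

    Returns (padded_rows, padded_mask); the mask is True for filler slots.
    """
    padded_rows, padded_mask = [], []
    for row in rows:
        dst, mask = [], []
        src_off = 0
        for sc, dc in zip(counts, max_per_frame):
            # Copy this frame's entries, then fill up to the shared width.
            dst += row[src_off:src_off + sc] + [pad_value] * (dc - sc)
            mask += [False] * sc + [True] * (dc - sc)
            src_off += sc
        padded_rows.append(dst)
        padded_mask.append(mask)
    return padded_rows, padded_mask


# Two minibatches with different per-frame gazing counts (T = 2 frames).
batches = [
    {"rows": [[10, 11, 20]], "counts": [2, 1]},
    {"rows": [[30, 40, 41, 42]], "counts": [1, 3]},
]
max_per_frame = [max(c) for c in zip(*(b["counts"] for b in batches))]

all_rows, all_mask = [], []
for b in batches:
    r, m = pad_to_max_per_frame(b["rows"], b["counts"], max_per_frame)
    all_rows += r
    all_mask += m
print(max_per_frame, all_rows)
```

After padding, every row has `sum(max_per_frame)` entries, so rows from different minibatches can be concatenated along the batch dimension.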
952
+ def _get_gazing_info_from_videos(
953
+ self,
954
+ videos_inputs: BatchFeature,
955
+ ) -> Optional[dict]:
956
+ """Run AutoGaze on the preprocessed tiles and thumbnails.
957
+
958
+ All tiles from all videos are batched together (they share the same
959
+ temporal dimension ``T_tile``). Similarly, all thumbnails are batched
960
+ together (temporal dim = 1). AutoGaze is run once on each batch and
961
+ the results are split back per-video.
962
+
963
+ When a gazing ratio is 1 and the corresponding task_loss_requirement is
964
+ None (or gazing_ratio is None), all patches are kept and AutoGaze is
965
+ skipped for that component. If both tiles and thumbnails meet this
966
+ condition, AutoGaze is not invoked at all.
967
+
968
+ Args:
969
+ videos_inputs: The ``BatchFeature`` returned by
970
+ ``_preprocess_videos``, which must contain the keys
971
+ ``pixel_values_videos_tiles_autogaze`` and
972
+ ``pixel_values_videos_thumbnails_autogaze`` (unless the
973
+ corresponding component can skip AutoGaze).
974
+
975
+ Returns:
976
+ A dict with the following keys (or ``None`` if AutoGaze is
977
+ unavailable or the required inputs are missing):
978
+
979
+ - ``"gazing_pos_tiles"`` – list of tensors, one per video, each
980
+ shaped ``(num_tiles_i, N)``.
981
+ - ``"num_gazing_each_frame_tiles"`` – list of tensors, one per
982
+ video, each shaped ``(num_tiles_i, T_tile)``.
983
+ - ``"if_padded_gazing_tiles"`` – list of bool tensors, one per
984
+ video, each shaped ``(num_tiles_i, N)``.
985
+ - ``"gazing_pos_thumbnails"`` – list of tensors, one per video,
986
+ each shaped ``(T_thumb_i, N')``.
987
+ - ``"num_gazing_each_frame_thumbnails"`` – list of tensors, one per
988
+ video, each shaped ``(T_thumb_i, 1)``.
989
+ - ``"if_padded_gazing_thumbnails"`` – list of bool tensors, one per
990
+ video, each shaped ``(T_thumb_i, N')``.
991
+ """
992
+ skip_tiles = self._should_gaze_all_patches(
993
+ self.gazing_ratio_tile, self.task_loss_requirement_tile
994
+ )
995
+ skip_thumbnails = self._should_gaze_all_patches(
996
+ self.gazing_ratio_thumbnail, self.task_loss_requirement_thumbnail
997
+ )
998
+ need_autogaze = not skip_tiles or not skip_thumbnails
999
+
1000
+ if need_autogaze and self._autogaze_model is None:
1001
+ return None
1002
+
1003
+ # Per-video tile/thumbnail counts from SigLIP tensors (always present)
1004
+ siglip_tiles = videos_inputs["pixel_values_videos_tiles"]
1005
+ siglip_thumbs = videos_inputs["pixel_values_videos_thumbnails"]
1006
+ num_tiles_per_video = [t.shape[0] for t in siglip_tiles]
1007
+ num_thumbs_per_video = [t.shape[0] for t in siglip_thumbs]
1008
+
1009
+ device = torch.device("cpu")
1010
+ autogaze_device = torch.device("cuda") if torch.cuda.is_available() else device
1011
+
1012
+ # Total patches per frame across all scales
1013
+ num_patches_each_scale = [
1014
+ (s // self.target_patch_size) ** 2 for s in self.target_scales
1015
+ ]
1016
+ total_patches_per_frame = sum(num_patches_each_scale)
1017
+
1018
+ # Ensure AutoGaze model is on GPU for inference
1019
+ if need_autogaze:
1020
+ current_device = next(self._autogaze_model.parameters()).device
1021
+ if current_device != autogaze_device:
1022
+ self._autogaze_model = self._autogaze_model.to(autogaze_device)
1023
+
1024
+ # --- Tiles ---
1025
+ if skip_tiles:
1026
+ total_tiles = sum(num_tiles_per_video)
1027
+ T_tile = siglip_tiles[0].shape[1]
1028
+ per_frame_pos = torch.arange(total_patches_per_frame, device=device, dtype=torch.long)
1029
+ tiles_gazing_pos = per_frame_pos.repeat(T_tile).unsqueeze(0).expand(total_tiles, -1).contiguous()
1030
+ tiles_if_padded = torch.zeros(
1031
+ total_tiles, T_tile * total_patches_per_frame, device=device, dtype=torch.bool
1032
+ )
1033
+ tiles_num_gazing = torch.full(
1034
+ (total_tiles, T_tile), total_patches_per_frame, device=device, dtype=torch.long
1035
+ )
1036
+ else:
1037
+ tiles_autogaze = videos_inputs.get("pixel_values_videos_tiles_autogaze")
1038
+ if tiles_autogaze is None:
1039
+ return None
1040
+
1041
+ all_tiles = torch.cat(tiles_autogaze, dim=0)
1042
+ tiles_gazing_pos, tiles_if_padded, tiles_num_gazing = self._run_autogaze_batched(
1043
+ all_tiles, autogaze_device, device,
1044
+ self.gazing_ratio_tile, self.task_loss_requirement_tile,
1045
+ )
1046
+ tiles_gazing_pos = self._sort_gazing_pos_per_frame(
1047
+ tiles_gazing_pos, tiles_if_padded, tiles_num_gazing
1048
+ )
1049
+
1050
+ # --- Thumbnails ---
1051
+ if skip_thumbnails:
1052
+ total_thumbs = sum(num_thumbs_per_video)
1053
+ per_thumb_pos = torch.arange(
1054
+ total_patches_per_frame, device=device, dtype=torch.long
1055
+ )
1056
+ thumbs_gazing_pos = per_thumb_pos.unsqueeze(0).expand(total_thumbs, -1).contiguous()
1057
+ thumbs_if_padded = torch.zeros_like(thumbs_gazing_pos, dtype=torch.bool)
1058
+ thumbs_num_gazing = torch.full(
1059
+ (total_thumbs, 1), total_patches_per_frame,
1060
+ device=device, dtype=torch.long,
1061
+ )
1062
+ else:
1063
+ thumbs_autogaze = videos_inputs.get("pixel_values_videos_thumbnails_autogaze")
1064
+ if thumbs_autogaze is None:
1065
+ return None
1066
+
1067
+ all_thumbs = torch.cat(thumbs_autogaze, dim=0)
1068
+ thumbs_gazing_pos, thumbs_if_padded, thumbs_num_gazing = self._run_autogaze_batched(
1069
+ all_thumbs, autogaze_device, device,
1070
+ self.gazing_ratio_thumbnail, self.task_loss_requirement_thumbnail,
1071
+ )
1072
+ thumbs_gazing_pos = self._sort_gazing_pos_per_frame(
1073
+ thumbs_gazing_pos, thumbs_if_padded, thumbs_num_gazing
1074
+ )
1075
+
1076
+ # --- Split results back per video ---
1077
+ tiles_gazing_pos_list = list(torch.split(tiles_gazing_pos, num_tiles_per_video, dim=0))
1078
+ tiles_if_padded_list = list(torch.split(tiles_if_padded, num_tiles_per_video, dim=0))
1079
+ tiles_num_gazing_list = list(torch.split(tiles_num_gazing, num_tiles_per_video, dim=0))
1080
+
1081
+ thumbs_gazing_pos_list = list(torch.split(thumbs_gazing_pos, num_thumbs_per_video, dim=0))
1082
+ thumbs_if_padded_list = list(torch.split(thumbs_if_padded, num_thumbs_per_video, dim=0))
1083
+ thumbs_num_gazing_list = list(torch.split(thumbs_num_gazing, num_thumbs_per_video, dim=0))
1084
+
1085
+ return {
1086
+ "gazing_pos_tiles": tiles_gazing_pos_list,
1087
+ "num_gazing_each_frame_tiles": tiles_num_gazing_list,
1088
+ "if_padded_gazing_tiles": tiles_if_padded_list,
1089
+ "gazing_pos_thumbnails": thumbs_gazing_pos_list,
1090
+ "num_gazing_each_frame_thumbnails": thumbs_num_gazing_list,
1091
+ "if_padded_gazing_thumbnails": thumbs_if_padded_list,
1092
+ }
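The "keep every patch" path and the per-video split in `_get_gazing_info_from_videos` reduce to simple arithmetic. The sketch below uses plain lists in place of tensors; the helper names and the scale and patch-size values are hypothetical and are not taken from the model config.

```python
def patches_per_frame(target_scales, target_patch_size):
    """Total patch count per frame, summed over all scales."""
    return sum((s // target_patch_size) ** 2 for s in target_scales)


def keep_all_positions(num_tiles, frames_per_tile, n_patches):
    """Positions when nothing is pruned: 0..n_patches-1 repeated per frame."""
    row = list(range(n_patches)) * frames_per_tile
    return [list(row) for _ in range(num_tiles)]


def split_per_video(rows, counts):
    """Split batched rows back into one chunk per video."""
    out, start = [], 0
    for c in counts:
        out.append(rows[start:start + c])
        start += c
    return out


# Hypothetical values chosen only to keep the numbers small.
n_patches = patches_per_frame(target_scales=[28, 56], target_patch_size=14)
rows = keep_all_positions(num_tiles=3, frames_per_tile=2, n_patches=n_patches)
per_video = split_per_video(rows, counts=[1, 2])
print(n_patches, len(rows[0]), [len(v) for v in per_video])
```

With these values each frame has 4 + 16 = 20 patches, each tile row has 40 positions, and the three tiles are split into videos holding one and two tiles.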
processor_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_nvila.NVILAProcessor"
4
+ },
5
+ "processor_class": "NVILAProcessor"
6
+ }
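The `auto_map` entry names a class inside a module that ships with the repository. As a small sketch of how that string is structured, the snippet below parses a copy of the config and splits the entry into its module and class parts. It does not load the model, and how `transformers` resolves the entry internally is not shown here.

```python
import json

# Copy of the processor_config.json contents shown above.
config_text = """
{
  "auto_map": {"AutoProcessor": "processing_nvila.NVILAProcessor"},
  "processor_class": "NVILAProcessor"
}
"""

config = json.loads(config_text)
# The entry has the form "<module>.<class>".
module_name, class_name = config["auto_map"]["AutoProcessor"].rsplit(".", 1)
print(module_name, class_name)
```

Since the class lives in repository code rather than in the library, loading it through `AutoProcessor.from_pretrained` would normally need `trust_remote_code=True`.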
pytorch_model.bin.index.json ADDED
@@ -0,0 +1,793 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 16174169312
4
+ },
5
+ "weight_map": {
6
+ "llm.lm_head.weight": "model-00001-of-00004.safetensors",
7
+ "llm.model.embed_tokens.weight": "model-00001-of-00004.safetensors",
8
+ "llm.model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
9
+ "llm.model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
10
+ "llm.model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
11
+ "llm.model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
12
+ "llm.model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
13
+ "llm.model.layers.0.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
14
+ "llm.model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
15
+ "llm.model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
16
+ "llm.model.layers.0.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
17
+ "llm.model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
18
+ "llm.model.layers.0.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
19
+ "llm.model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
20
+ "llm.model.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
21
+ "llm.model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
22
+ "llm.model.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
23
+ "llm.model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
24
+ "llm.model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
25
+ "llm.model.layers.1.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
26
+ "llm.model.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
27
+ "llm.model.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
28
+ "llm.model.layers.1.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
29
+ "llm.model.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
30
+ "llm.model.layers.1.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
31
+ "llm.model.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
32
+ "llm.model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
33
+ "llm.model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
34
+ "llm.model.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
35
+ "llm.model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
36
+ "llm.model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
37
+ "llm.model.layers.2.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
38
+ "llm.model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
39
+ "llm.model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
40
+ "llm.model.layers.2.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
41
+ "llm.model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
42
+ "llm.model.layers.2.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
43
+ "llm.model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
44
+ "llm.model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
45
+ "llm.model.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
46
+ "llm.model.layers.3.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
47
+ "llm.model.layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
48
+ "llm.model.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
49
+ "llm.model.layers.3.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
50
+ "llm.model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
51
+ "llm.model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
52
+ "llm.model.layers.3.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
53
+ "llm.model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
54
+ "llm.model.layers.3.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
55
+ "llm.model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
56
+ "llm.model.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
57
+ "llm.model.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
58
+ "llm.model.layers.4.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
59
+ "llm.model.layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
60
+ "llm.model.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
61
+ "llm.model.layers.4.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
62
+ "llm.model.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
63
+ "llm.model.layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
64
+ "llm.model.layers.4.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
65
+ "llm.model.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
66
+ "llm.model.layers.4.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
67
+ "llm.model.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
68
+ "llm.model.layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
69
+ "llm.model.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
70
+ "llm.model.layers.5.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
71
+ "llm.model.layers.5.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
72
+ "llm.model.layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
73
+ "llm.model.layers.5.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
74
+ "llm.model.layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
75
+ "llm.model.layers.5.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
76
+ "llm.model.layers.5.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
77
+ "llm.model.layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
78
+ "llm.model.layers.5.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
79
+ "llm.model.layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
80
+ "llm.model.layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
81
+ "llm.model.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
82
+ "llm.model.layers.6.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
83
+ "llm.model.layers.6.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
84
+ "llm.model.layers.6.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
85
+ "llm.model.layers.6.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
86
+ "llm.model.layers.6.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
87
+ "llm.model.layers.6.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
88
+ "llm.model.layers.6.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
89
+ "llm.model.layers.6.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
90
+ "llm.model.layers.6.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
91
+ "llm.model.layers.6.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
92
+ "llm.model.layers.7.input_layernorm.weight": "model-00002-of-00004.safetensors",
93
+ "llm.model.layers.7.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
94
+ "llm.model.layers.7.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
95
+ "llm.model.layers.7.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
96
+ "llm.model.layers.7.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
97
+ "llm.model.layers.7.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
98
+ "llm.model.layers.7.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
99
+ "llm.model.layers.7.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
100
+ "llm.model.layers.7.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
101
+ "llm.model.layers.7.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
102
+ "llm.model.layers.7.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
103
+ "llm.model.layers.7.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
104
+ "llm.model.layers.8.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
105
+ "llm.model.layers.8.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
106
+ "llm.model.layers.8.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
107
+ "llm.model.layers.8.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
108
+ "llm.model.layers.8.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
109
+ "llm.model.layers.8.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
110
+ "llm.model.layers.8.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
111
+ "llm.model.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
112
+ "llm.model.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
113
+ "llm.model.layers.10.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
114
+ "llm.model.layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
115
+ "llm.model.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
116
+ "llm.model.layers.10.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
117
+ "llm.model.layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
118
+ "llm.model.layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
119
+ "llm.model.layers.10.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
120
+ "llm.model.layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
121
+ "llm.model.layers.10.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
122
+ "llm.model.layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
123
+ "llm.model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
124
+ "llm.model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
125
+ "llm.model.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
126
+ "llm.model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
127
+ "llm.model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
128
+ "llm.model.layers.11.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
129
+ "llm.model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
130
+ "llm.model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
131
+ "llm.model.layers.11.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
132
+ "llm.model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
133
+ "llm.model.layers.11.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
134
+ "llm.model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
135
+ "llm.model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
136
+ "llm.model.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
137
+ "llm.model.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
138
+ "llm.model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
139
+ "llm.model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
140
+ "llm.model.layers.12.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
141
+ "llm.model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
142
+ "llm.model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
143
+ "llm.model.layers.12.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
144
+ "llm.model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
145
+ "llm.model.layers.12.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
146
+ "llm.model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
147
+ "llm.model.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
148
+ "llm.model.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
149
+ "llm.model.layers.13.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
150
+ "llm.model.layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
151
+ "llm.model.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
152
+ "llm.model.layers.13.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
153
+ "llm.model.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
154
+ "llm.model.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
155
+ "llm.model.layers.13.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
156
+ "llm.model.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
157
+ "llm.model.layers.13.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
158
+ "llm.model.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
159
+ "llm.model.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
160
+ "llm.model.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
161
+ "llm.model.layers.14.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
162
+ "llm.model.layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
163
+ "llm.model.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
164
+ "llm.model.layers.14.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
165
+ "llm.model.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
166
+ "llm.model.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
167
+ "llm.model.layers.14.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
168
+ "llm.model.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
169
+ "llm.model.layers.14.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
170
+ "llm.model.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
171
+ "llm.model.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
172
+ "llm.model.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
173
+ "llm.model.layers.15.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
174
+ "llm.model.layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
175
+ "llm.model.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
176
+ "llm.model.layers.15.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
177
+ "llm.model.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
178
+ "llm.model.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
179
+ "llm.model.layers.15.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
180
+ "llm.model.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
181
+ "llm.model.layers.15.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
182
+ "llm.model.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
183
+ "llm.model.layers.16.input_layernorm.weight": "model-00002-of-00004.safetensors",
184
+ "llm.model.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
185
+ "llm.model.layers.16.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
186
+ "llm.model.layers.16.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
187
+ "llm.model.layers.16.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
188
+ "llm.model.layers.16.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
189
+ "llm.model.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
190
+ "llm.model.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
191
+ "llm.model.layers.16.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
192
+ "llm.model.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
193
+ "llm.model.layers.16.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
194
+ "llm.model.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
195
+ "llm.model.layers.17.input_layernorm.weight": "model-00002-of-00004.safetensors",
196
+ "llm.model.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
197
+ "llm.model.layers.17.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
198
+ "llm.model.layers.17.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
199
+ "llm.model.layers.17.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
200
+ "llm.model.layers.17.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
201
+ "llm.model.layers.17.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
202
+ "llm.model.layers.17.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
203
+ "llm.model.layers.17.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
204
+ "llm.model.layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
205
+ "llm.model.layers.17.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
206
+ "llm.model.layers.17.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
207
+ "llm.model.layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
208
+ "llm.model.layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
209
+ "llm.model.layers.18.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
210
+ "llm.model.layers.18.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
211
+ "llm.model.layers.18.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
212
+ "llm.model.layers.18.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
213
+ "llm.model.layers.18.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
214
+ "llm.model.layers.18.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
215
+ "llm.model.layers.18.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
216
+ "llm.model.layers.8.input_layernorm.weight": "model-00002-of-00004.safetensors",
217
+ "llm.model.layers.8.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
218
+ "llm.model.layers.8.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
219
+ "llm.model.layers.8.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
220
+ "llm.model.layers.8.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
221
+ "llm.model.layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
222
+ "llm.model.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
223
+ "llm.model.layers.9.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
224
+ "llm.model.layers.9.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
225
+ "llm.model.layers.9.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
226
+ "llm.model.layers.9.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
227
+ "llm.model.layers.9.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
228
+ "llm.model.layers.9.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
229
+ "llm.model.layers.9.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
230
+ "llm.model.layers.9.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
231
+ "llm.model.layers.9.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
232
+ "llm.model.layers.9.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
233
+ "llm.model.layers.18.input_layernorm.weight": "model-00003-of-00004.safetensors",
234
+ "llm.model.layers.18.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
235
+ "llm.model.layers.18.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
236
+ "llm.model.layers.19.input_layernorm.weight": "model-00003-of-00004.safetensors",
237
+ "llm.model.layers.19.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
238
+ "llm.model.layers.19.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
239
+ "llm.model.layers.19.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
240
+ "llm.model.layers.19.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
241
+ "llm.model.layers.19.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
242
+ "llm.model.layers.19.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
243
+ "llm.model.layers.19.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
244
+ "llm.model.layers.19.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
245
+ "llm.model.layers.19.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
246
+ "llm.model.layers.19.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
247
+ "llm.model.layers.19.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
248
+ "llm.model.layers.20.input_layernorm.weight": "model-00003-of-00004.safetensors",
249
+ "llm.model.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
250
+ "llm.model.layers.20.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
251
+ "llm.model.layers.20.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
252
+ "llm.model.layers.20.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
253
+ "llm.model.layers.20.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
254
+ "llm.model.layers.20.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
255
+ "llm.model.layers.20.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
256
+ "llm.model.layers.20.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
257
+ "llm.model.layers.20.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
258
+ "llm.model.layers.20.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
259
+ "llm.model.layers.20.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
260
+ "llm.model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
261
+ "llm.model.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
262
+ "llm.model.layers.21.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
263
+ "llm.model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
264
+ "llm.model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
265
+ "llm.model.layers.21.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
266
+ "llm.model.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
267
+ "llm.model.layers.21.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
268
+ "llm.model.layers.21.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
269
+ "llm.model.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
270
+ "llm.model.layers.21.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
271
+ "llm.model.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
272
+ "llm.model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
273
+ "llm.model.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
274
+ "llm.model.layers.22.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
275
+ "llm.model.layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
276
+ "llm.model.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
277
+ "llm.model.layers.22.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
278
+ "llm.model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
279
+ "llm.model.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
280
+ "llm.model.layers.22.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
281
+ "llm.model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
282
+ "llm.model.layers.22.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
283
+ "llm.model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
284
+ "llm.model.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
285
+ "llm.model.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
286
+ "llm.model.layers.23.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
287
+ "llm.model.layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
288
+ "llm.model.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
289
+ "llm.model.layers.23.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
290
+ "llm.model.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
291
+ "llm.model.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
292
+ "llm.model.layers.23.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
293
+ "llm.model.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
294
+ "llm.model.layers.23.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
295
+ "llm.model.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
296
+ "llm.model.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
297
+ "llm.model.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
298
+ "llm.model.layers.24.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
299
+ "llm.model.layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
300
+ "llm.model.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
301
+ "llm.model.layers.24.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
302
+ "llm.model.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
303
+ "llm.model.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
304
+ "llm.model.layers.24.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
305
+ "llm.model.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
306
+ "llm.model.layers.24.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
307
+ "llm.model.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
308
+ "llm.model.layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
309
+ "llm.model.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
310
+ "llm.model.layers.25.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
311
+ "llm.model.layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
312
+ "llm.model.layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
313
+ "llm.model.layers.25.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
314
+ "llm.model.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
315
+ "llm.model.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
316
+ "llm.model.layers.25.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
317
+ "llm.model.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
318
+ "llm.model.layers.25.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
319
+ "llm.model.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
320
+ "llm.model.layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
321
+ "llm.model.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
322
+ "llm.model.layers.26.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
323
+ "llm.model.layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
324
+ "llm.model.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
325
+ "llm.model.layers.26.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
326
+ "llm.model.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
327
+ "llm.model.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
328
+ "llm.model.layers.26.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
329
+ "llm.model.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
330
+ "llm.model.layers.26.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
331
+ "llm.model.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
332
+ "llm.model.layers.27.input_layernorm.weight": "model-00003-of-00004.safetensors",
333
+ "llm.model.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
334
+ "llm.model.layers.27.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
335
+ "llm.model.layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
336
+ "llm.model.layers.27.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
337
+ "llm.model.layers.27.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
338
+ "llm.model.layers.27.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
339
+ "llm.model.layers.27.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
340
+ "llm.model.layers.27.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
341
+ "llm.model.layers.27.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
342
+ "llm.model.layers.27.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
343
+ "llm.model.layers.27.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
344
+ "llm.model.norm.weight": "model-00003-of-00004.safetensors",
345
+ "vision_tower.vision_model.embeddings.patch_embedding.weight": "model-00003-of-00004.safetensors",
346
+ "vision_tower.vision_model.embeddings.patch_embedding.bias": "model-00003-of-00004.safetensors",
347
+ "vision_tower.vision_model.embeddings.position_embedding.weight": "model-00003-of-00004.safetensors",
348
+ "vision_tower.vision_model.encoder.layers.0.layer_norm1.weight": "model-00003-of-00004.safetensors",
349
+ "vision_tower.vision_model.encoder.layers.0.layer_norm1.bias": "model-00003-of-00004.safetensors",
350
+ "vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
351
+ "vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
352
+ "vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
353
+ "vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
354
+ "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
355
+ "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
356
+ "vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
357
+ "vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
358
+ "vision_tower.vision_model.encoder.layers.0.layer_norm2.weight": "model-00003-of-00004.safetensors",
359
+ "vision_tower.vision_model.encoder.layers.0.layer_norm2.bias": "model-00003-of-00004.safetensors",
360
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc1.weight": "model-00003-of-00004.safetensors",
361
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc1.bias": "model-00003-of-00004.safetensors",
362
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc2.weight": "model-00003-of-00004.safetensors",
363
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc2.bias": "model-00003-of-00004.safetensors",
364
+ "vision_tower.vision_model.encoder.layers.1.layer_norm1.weight": "model-00003-of-00004.safetensors",
365
+ "vision_tower.vision_model.encoder.layers.1.layer_norm1.bias": "model-00003-of-00004.safetensors",
366
+ "vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
367
+ "vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
368
+ "vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
369
+ "vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
370
+ "vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
371
+ "vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
372
+ "vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
373
+ "vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
374
+ "vision_tower.vision_model.encoder.layers.1.layer_norm2.weight": "model-00003-of-00004.safetensors",
375
+ "vision_tower.vision_model.encoder.layers.1.layer_norm2.bias": "model-00003-of-00004.safetensors",
376
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc1.weight": "model-00003-of-00004.safetensors",
377
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc1.bias": "model-00003-of-00004.safetensors",
378
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc2.weight": "model-00003-of-00004.safetensors",
379
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc2.bias": "model-00003-of-00004.safetensors",
380
+ "vision_tower.vision_model.encoder.layers.2.layer_norm1.weight": "model-00003-of-00004.safetensors",
381
+ "vision_tower.vision_model.encoder.layers.2.layer_norm1.bias": "model-00003-of-00004.safetensors",
382
+ "vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
383
+ "vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
384
+ "vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
385
+ "vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
386
+ "vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
387
+ "vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
388
+ "vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
389
+ "vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
390
+ "vision_tower.vision_model.encoder.layers.2.layer_norm2.weight": "model-00003-of-00004.safetensors",
391
+ "vision_tower.vision_model.encoder.layers.2.layer_norm2.bias": "model-00003-of-00004.safetensors",
392
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc1.weight": "model-00003-of-00004.safetensors",
393
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc1.bias": "model-00003-of-00004.safetensors",
394
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc2.weight": "model-00003-of-00004.safetensors",
395
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc2.bias": "model-00003-of-00004.safetensors",
396
+ "vision_tower.vision_model.encoder.layers.3.layer_norm1.weight": "model-00003-of-00004.safetensors",
397
+ "vision_tower.vision_model.encoder.layers.3.layer_norm1.bias": "model-00003-of-00004.safetensors",
398
+ "vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
399
+ "vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
400
+ "vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
401
+ "vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
402
+ "vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
403
+ "vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
404
+ "vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
405
+ "vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
406
+ "vision_tower.vision_model.encoder.layers.3.layer_norm2.weight": "model-00003-of-00004.safetensors",
407
+ "vision_tower.vision_model.encoder.layers.3.layer_norm2.bias": "model-00003-of-00004.safetensors",
408
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc1.weight": "model-00003-of-00004.safetensors",
409
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc1.bias": "model-00003-of-00004.safetensors",
410
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc2.weight": "model-00003-of-00004.safetensors",
411
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc2.bias": "model-00003-of-00004.safetensors",
412
+ "vision_tower.vision_model.encoder.layers.4.layer_norm1.weight": "model-00003-of-00004.safetensors",
413
+ "vision_tower.vision_model.encoder.layers.4.layer_norm1.bias": "model-00003-of-00004.safetensors",
414
+ "vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
415
+ "vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
416
+ "vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
417
+ "vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
418
+ "vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
419
+ "vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
420
+ "vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
421
+ "vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
422
+ "vision_tower.vision_model.encoder.layers.4.layer_norm2.weight": "model-00003-of-00004.safetensors",
423
+ "vision_tower.vision_model.encoder.layers.4.layer_norm2.bias": "model-00003-of-00004.safetensors",
424
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc1.weight": "model-00003-of-00004.safetensors",
425
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc1.bias": "model-00003-of-00004.safetensors",
426
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc2.weight": "model-00003-of-00004.safetensors",
427
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc2.bias": "model-00003-of-00004.safetensors",
428
+ "vision_tower.vision_model.encoder.layers.5.layer_norm1.weight": "model-00003-of-00004.safetensors",
429
+ "vision_tower.vision_model.encoder.layers.5.layer_norm1.bias": "model-00003-of-00004.safetensors",
430
+ "vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
431
+ "vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
432
+ "vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.layer_norm2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.layer_norm2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc2.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc2.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.layer_norm1.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.layer_norm1.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.weight": "model-00004-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.bias": "model-00004-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.layer_norm2.weight": "model-00004-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.layer_norm2.bias": "model-00004-of-00004.safetensors",
728
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc1.weight": "model-00004-of-00004.safetensors",
729
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc1.bias": "model-00004-of-00004.safetensors",
730
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc2.weight": "model-00004-of-00004.safetensors",
731
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc2.bias": "model-00004-of-00004.safetensors",
732
+ "vision_tower.vision_model.encoder.layers.24.layer_norm1.weight": "model-00004-of-00004.safetensors",
733
+ "vision_tower.vision_model.encoder.layers.24.layer_norm1.bias": "model-00004-of-00004.safetensors",
734
+ "vision_tower.vision_model.encoder.layers.24.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
735
+ "vision_tower.vision_model.encoder.layers.24.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
736
+ "vision_tower.vision_model.encoder.layers.24.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
737
+ "vision_tower.vision_model.encoder.layers.24.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
738
+ "vision_tower.vision_model.encoder.layers.24.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
739
+ "vision_tower.vision_model.encoder.layers.24.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
740
+ "vision_tower.vision_model.encoder.layers.24.self_attn.out_proj.weight": "model-00004-of-00004.safetensors",
741
+ "vision_tower.vision_model.encoder.layers.24.self_attn.out_proj.bias": "model-00004-of-00004.safetensors",
742
+ "vision_tower.vision_model.encoder.layers.24.layer_norm2.weight": "model-00004-of-00004.safetensors",
743
+ "vision_tower.vision_model.encoder.layers.24.layer_norm2.bias": "model-00004-of-00004.safetensors",
744
+ "vision_tower.vision_model.encoder.layers.24.mlp.fc1.weight": "model-00004-of-00004.safetensors",
745
+ "vision_tower.vision_model.encoder.layers.24.mlp.fc1.bias": "model-00004-of-00004.safetensors",
746
+ "vision_tower.vision_model.encoder.layers.24.mlp.fc2.weight": "model-00004-of-00004.safetensors",
747
+ "vision_tower.vision_model.encoder.layers.24.mlp.fc2.bias": "model-00004-of-00004.safetensors",
748
+ "vision_tower.vision_model.encoder.layers.25.layer_norm1.weight": "model-00004-of-00004.safetensors",
749
+ "vision_tower.vision_model.encoder.layers.25.layer_norm1.bias": "model-00004-of-00004.safetensors",
750
+ "vision_tower.vision_model.encoder.layers.25.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
751
+ "vision_tower.vision_model.encoder.layers.25.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
752
+ "vision_tower.vision_model.encoder.layers.25.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
753
+ "vision_tower.vision_model.encoder.layers.25.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
754
+ "vision_tower.vision_model.encoder.layers.25.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
755
+ "vision_tower.vision_model.encoder.layers.25.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
756
+ "vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.weight": "model-00004-of-00004.safetensors",
757
+ "vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.bias": "model-00004-of-00004.safetensors",
758
+ "vision_tower.vision_model.encoder.layers.25.layer_norm2.weight": "model-00004-of-00004.safetensors",
759
+ "vision_tower.vision_model.encoder.layers.25.layer_norm2.bias": "model-00004-of-00004.safetensors",
760
+ "vision_tower.vision_model.encoder.layers.25.mlp.fc1.weight": "model-00004-of-00004.safetensors",
761
+ "vision_tower.vision_model.encoder.layers.25.mlp.fc1.bias": "model-00004-of-00004.safetensors",
762
+ "vision_tower.vision_model.encoder.layers.25.mlp.fc2.weight": "model-00004-of-00004.safetensors",
763
+ "vision_tower.vision_model.encoder.layers.25.mlp.fc2.bias": "model-00004-of-00004.safetensors",
764
+ "vision_tower.vision_model.encoder.layers.26.layer_norm1.weight": "model-00004-of-00004.safetensors",
765
+ "vision_tower.vision_model.encoder.layers.26.layer_norm1.bias": "model-00004-of-00004.safetensors",
766
+ "vision_tower.vision_model.encoder.layers.26.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
767
+ "vision_tower.vision_model.encoder.layers.26.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
768
+ "vision_tower.vision_model.encoder.layers.26.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
769
+ "vision_tower.vision_model.encoder.layers.26.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
770
+ "vision_tower.vision_model.encoder.layers.26.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
771
+ "vision_tower.vision_model.encoder.layers.26.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
772
+ "vision_tower.vision_model.encoder.layers.26.self_attn.out_proj.weight": "model-00004-of-00004.safetensors",
773
+ "vision_tower.vision_model.encoder.layers.26.self_attn.out_proj.bias": "model-00004-of-00004.safetensors",
774
+ "vision_tower.vision_model.encoder.layers.26.layer_norm2.weight": "model-00004-of-00004.safetensors",
775
+ "vision_tower.vision_model.encoder.layers.26.layer_norm2.bias": "model-00004-of-00004.safetensors",
776
+ "vision_tower.vision_model.encoder.layers.26.mlp.fc1.weight": "model-00004-of-00004.safetensors",
777
+ "vision_tower.vision_model.encoder.layers.26.mlp.fc1.bias": "model-00004-of-00004.safetensors",
778
+ "vision_tower.vision_model.encoder.layers.26.mlp.fc2.weight": "model-00004-of-00004.safetensors",
779
+ "vision_tower.vision_model.encoder.layers.26.mlp.fc2.bias": "model-00004-of-00004.safetensors",
780
+ "vision_tower.vision_model.post_layernorm.weight": "model-00004-of-00004.safetensors",
781
+ "vision_tower.vision_model.post_layernorm.bias": "model-00004-of-00004.safetensors",
782
+ "mm_projector.layers.1.bias": "model-00004-of-00004.safetensors",
783
+ "mm_projector.layers.1.weight": "model-00004-of-00004.safetensors",
784
+ "mm_projector.layers.2.bias": "model-00004-of-00004.safetensors",
785
+ "mm_projector.layers.2.weight": "model-00004-of-00004.safetensors",
786
+ "mm_projector.layers.4.bias": "model-00004-of-00004.safetensors",
787
+ "mm_projector.layers.4.weight": "model-00004-of-00004.safetensors",
788
+ "mm_projector.layers.5.bias": "model-00004-of-00004.safetensors",
789
+ "mm_projector.layers.5.weight": "model-00004-of-00004.safetensors",
790
+ "mm_projector.layers.7.bias": "model-00004-of-00004.safetensors",
791
+ "mm_projector.layers.7.weight": "model-00004-of-00004.safetensors"
792
+ }
793
+ }
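The index above maps each parameter name to the shard file that stores it. A single layer can straddle two shards: the `k_proj` weights of vision encoder layer 23 are in shard 3, while its `v_proj` weights are in shard 4. A minimal sketch of inspecting such a map, using a small excerpt copied from the entries above; the helper names `group_by_shard` and `layers_spanning_shards` are illustrative and not part of this repository:

```python
from collections import defaultdict

# Excerpt of the weight map shown above (keys and shard names copied verbatim).
weight_map = {
    "vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
    "vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
    "vision_tower.vision_model.post_layernorm.weight": "model-00004-of-00004.safetensors",
    "mm_projector.layers.1.weight": "model-00004-of-00004.safetensors",
}

# Hypothetical helper: not an API of any library.
def group_by_shard(weight_map):
    """Return {shard filename: sorted list of parameter names stored in it}."""
    shards = defaultdict(list)
    for name, shard in weight_map.items():
        shards[shard].append(name)
    return {shard: sorted(names) for shard, names in shards.items()}

# Hypothetical helper: the default prefix matches the key layout seen above.
def layers_spanning_shards(weight_map, prefix="vision_tower.vision_model.encoder.layers."):
    """Return sorted layer indices whose parameters live in more than one shard."""
    per_layer = defaultdict(set)
    for name, shard in weight_map.items():
        if name.startswith(prefix):
            layer_index = int(name[len(prefix):].split(".")[0])
            per_layer[layer_index].add(shard)
    return sorted(i for i, files in per_layer.items() if len(files) > 1)

shards = group_by_shard(weight_map)
split_layers = layers_spanning_shards(weight_map)
```

Run against the full index file, the same helpers would show which shards must be present before a given layer can be loaded.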
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "bos_token": {
+     "content": "[BOS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "image_token": "<image>",
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sentinel_token": "<vila/sentinel>",
+   "video_token": "<vila/video>"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,96 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "[BOS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<vila/sentinel>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<image>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<vila/video>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "auto_map": {
+     "AutoProcessor": "processing_nvila.NVILAProcessor"
+   },
+   "bos_token": "[BOS]",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": {
+     "image_token": "<image>",
+     "sentinel_token": "<vila/sentinel>",
+     "video_token": "<vila/video>"
+   },
+   "image_token": "<image>",
+   "legacy": false,
+   "model_max_length": 40960,
+   "pad_token": "[PAD]",
+   "padding_side": "left",
+   "processor_class": "NVILAProcessor",
+   "sentinel_token": "<vila/sentinel>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null,
+   "video_token": "<vila/video>"
+ }
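In `tokenizer_config.json`, `added_tokens_decoder` is keyed by token id (as a string), while fields such as `eos_token` and `video_token` name tokens by their text. A minimal sketch of resolving a named field to its integer id, using only the standard `json` module on an excerpt of the config above; the helper `named_token_id` is hypothetical and is not how the tokenizer library itself performs the lookup:

```python
import json

# Excerpt of tokenizer_config.json as shown above (per-token flags trimmed for brevity).
config_text = """
{
  "added_tokens_decoder": {
    "151645": {"content": "<|im_end|>", "special": true},
    "151646": {"content": "[BOS]", "special": true},
    "151647": {"content": "[PAD]", "special": true},
    "151649": {"content": "<image>", "special": true},
    "151650": {"content": "<vila/video>", "special": true}
  },
  "bos_token": "[BOS]",
  "eos_token": "<|im_end|>",
  "pad_token": "[PAD]",
  "image_token": "<image>",
  "video_token": "<vila/video>"
}
"""

config = json.loads(config_text)

# Invert the decoder table: token text -> integer id.
token_to_id = {
    entry["content"]: int(token_id)
    for token_id, entry in config["added_tokens_decoder"].items()
}

# Hypothetical helper: resolve a named field such as "eos_token" to its id.
def named_token_id(config, field):
    """Look up the id of the token that `field` names in the config."""
    return token_to_id[config[field]]

eos_id = named_token_id(config, "eos_token")
video_id = named_token_id(config, "video_token")
```

The same inversion applied to the full file would also cover `<|endoftext|>`, `<|im_start|>`, and `<vila/sentinel>`.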
vocab.json ADDED
The diff for this file is too large to render. See raw diff