TYTTYTTYT committed on
Commit
5a89fd6
·
verified ·
1 Parent(s): 1fe1dae

adapt to transformers==4.57.6

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,199 @@
1
+ ---
2
+ library_name: transformers
3
+ tags: []
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+ This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
added_tokens.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "</think>": 151668,
3
+ "</tool_call>": 151658,
4
+ "</tool_response>": 151666,
5
+ "<think>": 151667,
6
+ "<tool_call>": 151657,
7
+ "<tool_response>": 151665,
8
+ "<|box_end|>": 151649,
9
+ "<|box_start|>": 151648,
10
+ "<|endoftext|>": 151643,
11
+ "<|file_sep|>": 151664,
12
+ "<|fim_middle|>": 151660,
13
+ "<|fim_pad|>": 151662,
14
+ "<|fim_prefix|>": 151659,
15
+ "<|fim_suffix|>": 151661,
16
+ "<|im_end|>": 151645,
17
+ "<|im_start|>": 151644,
18
+ "<|image_pad|>": 151655,
19
+ "<|object_ref_end|>": 151647,
20
+ "<|object_ref_start|>": 151646,
21
+ "<|quad_end|>": 151651,
22
+ "<|quad_start|>": 151650,
23
+ "<|repo_name|>": 151663,
24
+ "<|video_pad|>": 151656,
25
+ "<|vision_end|>": 151653,
26
+ "<|vision_pad|>": 151654,
27
+ "<|vision_start|>": 151652
28
+ }
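Review note: the ids in `added_tokens.json` appear to form a single block directly after the base vocabulary. The sketch below is one way to sanity-check that; the helper name `summarize_added_tokens` and the `ID_TO_TOKEN` lookup are hypothetical and not part of this commit.

```python
# Token-to-id mapping copied from the added_tokens.json diff above.
ADDED_TOKENS = {
    "</think>": 151668, "</tool_call>": 151658, "</tool_response>": 151666,
    "<think>": 151667, "<tool_call>": 151657, "<tool_response>": 151665,
    "<|box_end|>": 151649, "<|box_start|>": 151648, "<|endoftext|>": 151643,
    "<|file_sep|>": 151664, "<|fim_middle|>": 151660, "<|fim_pad|>": 151662,
    "<|fim_prefix|>": 151659, "<|fim_suffix|>": 151661, "<|im_end|>": 151645,
    "<|im_start|>": 151644, "<|image_pad|>": 151655, "<|object_ref_end|>": 151647,
    "<|object_ref_start|>": 151646, "<|quad_end|>": 151651, "<|quad_start|>": 151650,
    "<|repo_name|>": 151663, "<|video_pad|>": 151656, "<|vision_end|>": 151653,
    "<|vision_pad|>": 151654, "<|vision_start|>": 151652,
}


def summarize_added_tokens(mapping):
    """Return (lowest id, highest id, True if the ids form one gap-free range)."""
    ids = sorted(mapping.values())
    contiguous = ids == list(range(ids[0], ids[-1] + 1))
    return ids[0], ids[-1], contiguous


# Reverse lookup, handy when inspecting generated token ids.
ID_TO_TOKEN = {token_id: token for token, token_id in ADDED_TOKENS.items()}

print(summarize_added_tokens(ADDED_TOKENS))  # (151643, 151668, True)
print(ID_TO_TOKEN[151655])                   # <|image_pad|>
```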
chat_template.jinja ADDED
@@ -0,0 +1,120 @@
1
+ {%- if tools %}
2
+ {{- '<|im_start|>system\n' }}
3
+ {%- if messages[0].role == 'system' %}
4
+ {%- if messages[0].content is string %}
5
+ {{- messages[0].content }}
6
+ {%- else %}
7
+ {%- for content in messages[0].content %}
8
+ {%- if 'text' in content %}
9
+ {{- content.text }}
10
+ {%- endif %}
11
+ {%- endfor %}
12
+ {%- endif %}
13
+ {{- '\n\n' }}
14
+ {%- endif %}
15
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
16
+ {%- for tool in tools %}
17
+ {{- "\n" }}
18
+ {{- tool | tojson }}
19
+ {%- endfor %}
20
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
21
+ {%- else %}
22
+ {%- if messages[0].role == 'system' %}
23
+ {{- '<|im_start|>system\n' }}
24
+ {%- if messages[0].content is string %}
25
+ {{- messages[0].content }}
26
+ {%- else %}
27
+ {%- for content in messages[0].content %}
28
+ {%- if 'text' in content %}
29
+ {{- content.text }}
30
+ {%- endif %}
31
+ {%- endfor %}
32
+ {%- endif %}
33
+ {{- '<|im_end|>\n' }}
34
+ {%- endif %}
35
+ {%- endif %}
36
+ {%- set image_count = namespace(value=0) %}
37
+ {%- set video_count = namespace(value=0) %}
38
+ {%- for message in messages %}
39
+ {%- if message.role == "user" %}
40
+ {{- '<|im_start|>' + message.role + '\n' }}
41
+ {%- if message.content is string %}
42
+ {{- message.content }}
43
+ {%- else %}
44
+ {%- for content in message.content %}
45
+ {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
46
+ {%- set image_count.value = image_count.value + 1 %}
47
+ {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
48
+ <|vision_start|><|image_pad|><|vision_end|>
49
+ {%- elif content.type == 'video' or 'video' in content %}
50
+ {%- set video_count.value = video_count.value + 1 %}
51
+ {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
52
+ <|vision_start|><|video_pad|><|vision_end|>
53
+ {%- elif 'text' in content %}
54
+ {{- content.text }}
55
+ {%- endif %}
56
+ {%- endfor %}
57
+ {%- endif %}
58
+ {{- '<|im_end|>\n' }}
59
+ {%- elif message.role == "assistant" %}
60
+ {{- '<|im_start|>' + message.role + '\n' }}
61
+ {%- if message.content is string %}
62
+ {{- message.content }}
63
+ {%- else %}
64
+ {%- for content_item in message.content %}
65
+ {%- if 'text' in content_item %}
66
+ {{- content_item.text }}
67
+ {%- endif %}
68
+ {%- endfor %}
69
+ {%- endif %}
70
+ {%- if message.tool_calls %}
71
+ {%- for tool_call in message.tool_calls %}
72
+ {%- if (loop.first and message.content) or (not loop.first) %}
73
+ {{- '\n' }}
74
+ {%- endif %}
75
+ {%- if tool_call.function %}
76
+ {%- set tool_call = tool_call.function %}
77
+ {%- endif %}
78
+ {{- '<tool_call>\n{"name": "' }}
79
+ {{- tool_call.name }}
80
+ {{- '", "arguments": ' }}
81
+ {%- if tool_call.arguments is string %}
82
+ {{- tool_call.arguments }}
83
+ {%- else %}
84
+ {{- tool_call.arguments | tojson }}
85
+ {%- endif %}
86
+ {{- '}\n</tool_call>' }}
87
+ {%- endfor %}
88
+ {%- endif %}
89
+ {{- '<|im_end|>\n' }}
90
+ {%- elif message.role == "tool" %}
91
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
92
+ {{- '<|im_start|>user' }}
93
+ {%- endif %}
94
+ {{- '\n<tool_response>\n' }}
95
+ {%- if message.content is string %}
96
+ {{- message.content }}
97
+ {%- else %}
98
+ {%- for content in message.content %}
99
+ {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
100
+ {%- set image_count.value = image_count.value + 1 %}
101
+ {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
102
+ <|vision_start|><|image_pad|><|vision_end|>
103
+ {%- elif content.type == 'video' or 'video' in content %}
104
+ {%- set video_count.value = video_count.value + 1 %}
105
+ {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
106
+ <|vision_start|><|video_pad|><|vision_end|>
107
+ {%- elif 'text' in content %}
108
+ {{- content.text }}
109
+ {%- endif %}
110
+ {%- endfor %}
111
+ {%- endif %}
112
+ {{- '\n</tool_response>' }}
113
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
114
+ {{- '<|im_end|>\n' }}
115
+ {%- endif %}
116
+ {%- endif %}
117
+ {%- endfor %}
118
+ {%- if add_generation_prompt %}
119
+ {{- '<|im_start|>assistant\n' }}
120
+ {%- endif %}
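Review note: to make the template's output easier to reason about, here is a plain-Python sketch of its no-tools path. It is a simplified re-implementation and does not replace the Jinja template: `render_chat` and `_render_parts` are hypothetical names, and the `tools` branch, `tool_calls`, and `tool` messages are deliberately left out.

```python
VISION_IMAGE = "<|vision_start|><|image_pad|><|vision_end|>"
VISION_VIDEO = "<|vision_start|><|video_pad|><|vision_end|>"


def _render_parts(content, counters, add_vision_id, text_only=False):
    """Render a message's content, which is either a string or a list of dicts."""
    if isinstance(content, str):
        return content
    out = []
    for part in content:
        is_image = part.get("type") == "image" or "image" in part or "image_url" in part
        is_video = part.get("type") == "video" or "video" in part
        if not text_only and is_image:
            counters["image"] += 1
            if add_vision_id:
                out.append(f"Picture {counters['image']}: ")
            out.append(VISION_IMAGE)
        elif not text_only and is_video:
            counters["video"] += 1
            if add_vision_id:
                out.append(f"Video {counters['video']}: ")
            out.append(VISION_VIDEO)
        elif "text" in part:
            out.append(part["text"])
    return "".join(out)


def render_chat(messages, add_generation_prompt=False, add_vision_id=False):
    """Sketch of the template's output when no tools are passed."""
    counters = {"image": 0, "video": 0}
    out = []
    # A system message is only rendered when it is the first message.
    if messages and messages[0]["role"] == "system":
        text = _render_parts(messages[0]["content"], counters, False, text_only=True)
        out.append("<|im_start|>system\n" + text + "<|im_end|>\n")
    for message in messages:
        if message["role"] == "user":
            body = _render_parts(message["content"], counters, add_vision_id)
            out.append("<|im_start|>user\n" + body + "<|im_end|>\n")
        elif message["role"] == "assistant":
            body = _render_parts(message["content"], counters, False, text_only=True)
            out.append("<|im_start|>assistant\n" + body + "<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)


example = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is this?"}]},
]
print(render_chat(example, add_generation_prompt=True))
```

Each image or video entry becomes a single placeholder token wrapped in `<|vision_start|>` and `<|vision_end|>`. The template itself does not expand the placeholder to match the image grid.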
image_processing_qwen2_vl.py ADDED
@@ -0,0 +1,501 @@
1
+ import math
2
+ from typing import Optional, Union
3
+
4
+ import numpy as np
5
+
6
+ from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
7
+ from transformers.image_transforms import (
8
+ convert_to_rgb,
9
+ resize,
10
+ to_channel_dimension_format,
11
+ )
12
+ from transformers.image_utils import (
13
+ OPENAI_CLIP_MEAN,
14
+ OPENAI_CLIP_STD,
15
+ ChannelDimension,
16
+ ImageInput,
17
+ PILImageResampling,
18
+ get_image_size,
19
+ infer_channel_dimension_format,
20
+ is_scaled_image,
21
+ make_flat_list_of_images,
22
+ to_numpy_array,
23
+ valid_images,
24
+ validate_preprocess_arguments,
25
+ )
26
+ from transformers.utils import TensorType, logging
27
+ from transformers.video_utils import VideoInput, make_batched_videos
28
+
29
+
30
+ logger = logging.get_logger(__name__)
31
+
32
+
33
+ def smart_resize(
34
+ height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
35
+ ):
36
+ """Rescales the image so that the following conditions are met:
37
+
38
+ 1. Both dimensions (height and width) are divisible by 'factor'.
39
+
40
+ 2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
41
+
42
+ 3. The aspect ratio of the image is maintained as closely as possible.
43
+
44
+ """
45
+ if max(height, width) / min(height, width) > 200:
46
+ raise ValueError(
47
+ f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
48
+ )
49
+ h_bar = round(height / factor) * factor
50
+ w_bar = round(width / factor) * factor
51
+ if h_bar * w_bar > max_pixels:
52
+ beta = math.sqrt((height * width) / max_pixels)
53
+ h_bar = max(factor, math.floor(height / beta / factor) * factor)
54
+ w_bar = max(factor, math.floor(width / beta / factor) * factor)
55
+ elif h_bar * w_bar < min_pixels:
56
+ beta = math.sqrt(min_pixels / (height * width))
57
+ h_bar = math.ceil(height * beta / factor) * factor
58
+ w_bar = math.ceil(width * beta / factor) * factor
59
+ return h_bar, w_bar
60
+
61
+
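Review note: a worked example of `smart_resize` may help when checking the new `factor=patch_size * merge_size * focus_size` call further down. The function body below is the one from this diff with indentation restored and the error message shortened. The input sizes are arbitrary illustrations, and `factor=56` assumes the default values `14 * 2 * 2`.

```python
import math


def smart_resize(height, width, factor=28, min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    # Same logic as the function in the diff.
    if max(height, width) / min(height, width) > 200:
        raise ValueError("absolute aspect ratio must be smaller than 200")
    # Snap both sides to the nearest multiple of `factor`.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too large: shrink by beta and round down so the pixel budget is respected.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge by beta and round up so the minimum is reached.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


print(smart_resize(1080, 1920))             # (728, 1316)
print(smart_resize(1080, 1920, factor=56))  # (728, 1288)
print(smart_resize(30, 40))                 # (56, 84)
```

With the larger factor, the width snaps to a coarser grid (1288 instead of 1316). This keeps both sides divisible by `patch_size * merge_size * focus_size`.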
62
+ class Qwen2VLImageProcessor(BaseImageProcessor):
63
+ r"""
64
+ Constructs a Qwen2-VL image processor that dynamically resizes images based on the original images.
65
+
66
+ Args:
67
+ do_resize (`bool`, *optional*, defaults to `True`):
68
+ Whether to resize the image's (height, width) dimensions.
69
+ size (`dict[str, int]`, *optional*, defaults to `{"shortest_edge": 56 * 56, "longest_edge": 28 * 28 * 1280}`):
70
+ Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present.
71
+ resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
72
+ Resampling filter to use when resizing the image.
73
+ do_rescale (`bool`, *optional*, defaults to `True`):
74
+ Whether to rescale the image by the specified scale `rescale_factor`.
75
+ rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
76
+ Scale factor to use if rescaling the image.
77
+ do_normalize (`bool`, *optional*, defaults to `True`):
78
+ Whether to normalize the image.
79
+ image_mean (`float` or `list[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
80
+ Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
81
+ image_std (`float` or `list[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
82
+ Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
83
+ do_convert_rgb (`bool`, *optional*, defaults to `True`):
84
+ Whether to convert the image to RGB.
85
+ min_pixels (`int`, *optional*, defaults to `56 * 56`):
86
+ Minimum total number of pixels (height * width) allowed in the resized image.
87
+ max_pixels (`int`, *optional*, defaults to `28 * 28 * 1280`):
88
+ Maximum total number of pixels (height * width) allowed in the resized image.
89
+ patch_size (`int`, *optional*, defaults to 14):
90
+ The spatial patch size of the vision encoder.
91
+ temporal_patch_size (`int`, *optional*, defaults to 2):
92
+ The temporal patch size of the vision encoder.
93
+ merge_size (`int`, *optional*, defaults to 2):
94
+ The merge size of the vision encoder to llm encoder.
95
+ """
96
+
97
+ model_input_names = ["pixel_values", "image_grid_thw", "pixel_values_videos", "video_grid_thw"]
98
+
99
+ def __init__(
100
+ self,
101
+ do_resize: bool = True,
102
+ size: Optional[dict[str, int]] = None,
103
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
104
+ do_rescale: bool = True,
105
+ rescale_factor: Union[int, float] = 1 / 255,
106
+ do_normalize: bool = True,
107
+ image_mean: Optional[Union[float, list[float]]] = None,
108
+ image_std: Optional[Union[float, list[float]]] = None,
109
+ do_convert_rgb: bool = True,
110
+ min_pixels: Optional[int] = None,
111
+ max_pixels: Optional[int] = None,
112
+ patch_size: int = 14,
113
+ temporal_patch_size: int = 2,
114
+ merge_size: int = 2,
115
+ focus_size: int = 2,
116
+ **kwargs,
117
+ ) -> None:
118
+ super().__init__(**kwargs)
119
+ if size is not None and ("shortest_edge" not in size or "longest_edge" not in size):
120
+ raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
121
+ elif size is None:
122
+ size = {"shortest_edge": 56 * 56, "longest_edge": 28 * 28 * 1280}
123
+ # backward compatibility: override size with min_pixels and max_pixels if they are provided
124
+ if min_pixels is not None:
125
+ size["shortest_edge"] = min_pixels
126
+ if max_pixels is not None:
127
+ size["longest_edge"] = max_pixels
128
+ self.min_pixels = size["shortest_edge"]
129
+ self.max_pixels = size["longest_edge"]
130
+ self.size = size
131
+
132
+ self.do_resize = do_resize
133
+ self.resample = resample
134
+ self.do_rescale = do_rescale
135
+ self.rescale_factor = rescale_factor
136
+ self.do_normalize = do_normalize
137
+ self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
138
+ self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
139
+
140
+ self.patch_size = patch_size
141
+ self.temporal_patch_size = temporal_patch_size
142
+ self.merge_size = merge_size
143
+ self.focus_size = focus_size
144
+ self.do_convert_rgb = do_convert_rgb
145
+
146
+ def _preprocess(
147
+ self,
148
+ images: Union[ImageInput, VideoInput],
149
+ do_resize: Optional[bool] = None,
150
+ size: Optional[dict[str, int]] = None,
151
+ resample: Optional[PILImageResampling] = None,
152
+ do_rescale: Optional[bool] = None,
153
+ rescale_factor: Optional[float] = None,
154
+ do_normalize: Optional[bool] = None,
155
+ image_mean: Optional[Union[float, list[float]]] = None,
156
+ image_std: Optional[Union[float, list[float]]] = None,
157
+ patch_size: Optional[int] = None,
158
+ temporal_patch_size: Optional[int] = None,
159
+ merge_size: Optional[int] = None,
160
+ focus_size: Optional[int] = None,
161
+ do_convert_rgb: Optional[bool] = None,
162
+ data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
163
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
164
+ ):
165
+ """
166
+ Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.
167
+
168
+ Args:
169
+ images (`ImageInput`):
170
+ Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
171
+ vision_info (`list[Dict]`, *optional*):
172
+ Optional list of dictionaries containing additional information about vision inputs.
173
+ do_resize (`bool`, *optional*, defaults to `self.do_resize`):
174
+ Whether to resize the image.
175
+ size (`dict[str, int]`, *optional*, defaults to `self.size`):
176
+ Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present.
177
+ resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
178
+ Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
179
+ do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
180
+ Whether to rescale the image.
181
+ rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
182
+ Scale factor to use if rescaling the image.
183
+ do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
184
+ Whether to normalize the image.
185
+ image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
186
+ Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
187
+ image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
188
+ Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
189
+ patch_size (`int`, *optional*, defaults to `self.patch_size`):
190
+ The spatial patch size of the vision encoder.
191
+ temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
192
+ The temporal patch size of the vision encoder.
193
+ merge_size (`int`, *optional*, defaults to `self.merge_size`):
194
+ The merge size of the vision encoder to llm encoder.
195
+ do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
196
+ Whether to convert the image to RGB.
197
+ data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
198
+ The channel dimension format for the output image. Can be one of:
199
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
200
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
201
+ - Unset: Use the channel dimension format of the input image.
202
+ input_data_format (`ChannelDimension` or `str`, *optional*):
203
+ The channel dimension format for the input image. Can be one of:
204
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
205
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
206
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
207
+ """
208
+ images = make_flat_list_of_images(images)
209
+
210
+ if do_convert_rgb:
211
+ images = [convert_to_rgb(image) for image in images]
212
+
213
+ # All transformations expect numpy arrays.
214
+ images = [to_numpy_array(image) for image in images]
215
+
216
+ if do_rescale and is_scaled_image(images[0]):
217
+ logger.warning_once(
218
+ "It looks like you are trying to rescale already rescaled images. If the input"
219
+ " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
220
+ )
221
+ if input_data_format is None:
222
+ # We assume that all images have the same channel dimension format.
223
+ input_data_format = infer_channel_dimension_format(images[0])
224
+
225
+ height, width = get_image_size(images[0], channel_dim=input_data_format)
226
+ resized_height, resized_width = height, width
227
+ processed_images = []
228
+ for image in images:
229
+ if do_resize:
230
+ resized_height, resized_width = smart_resize(
231
+ height,
232
+ width,
233
+ factor=patch_size * merge_size * focus_size,
234
+ min_pixels=size["shortest_edge"],
235
+ max_pixels=size["longest_edge"],
236
+ )
237
+ image = resize(
238
+ image, size=(resized_height, resized_width), resample=resample, input_data_format=input_data_format
239
+ )
240
+
241
+ if do_rescale:
242
+ image = self.rescale(image, scale=rescale_factor, input_data_format=input_data_format)
243
+
244
+ if do_normalize:
245
+ image = self.normalize(
246
+ image=image, mean=image_mean, std=image_std, input_data_format=input_data_format
247
+ )
248
+
249
+ image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
250
+ processed_images.append(image)
251
+
252
+ patches = np.array(processed_images)
253
+ if data_format == ChannelDimension.LAST:
254
+ patches = patches.transpose(0, 3, 1, 2)
255
+ if patches.shape[0] % temporal_patch_size != 0:
256
+ repeats = np.repeat(
257
+ patches[-1][np.newaxis], temporal_patch_size - (patches.shape[0] % temporal_patch_size), axis=0
258
+ )
259
+ patches = np.concatenate([patches, repeats], axis=0)
260
+ channel = patches.shape[1]
261
+ grid_t = patches.shape[0] // temporal_patch_size
262
+ grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
263
+ patches = patches.reshape(
264
+ grid_t,
265
+ temporal_patch_size,
266
+ channel,
267
+ grid_h // merge_size,
268
+ merge_size,
269
+ patch_size,
270
+ grid_w // merge_size,
271
+ merge_size,
272
+ patch_size,
273
+ )
274
+ patches = patches.transpose(0, 3, 6, 4, 7, 2, 1, 5, 8)
275
+ flatten_patches = patches.reshape(
276
+ grid_t * grid_h * grid_w, channel * temporal_patch_size * patch_size * patch_size
277
+ )
278
+
279
+ return flatten_patches, (grid_t, grid_h, grid_w)
280
+
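Review note: the reshape and transpose at the end of `_preprocess` are easy to misread, so here is a standalone NumPy sketch of that step under the default sizes. `flatten_patches` is a hypothetical helper name. It assumes the frames are already resized, channels-first, and that height and width are multiples of `patch_size * merge_size`.

```python
import numpy as np


def flatten_patches(frames, patch_size=14, temporal_patch_size=2, merge_size=2):
    """frames: array of shape (T, C, H, W). Returns (flat_patches, (grid_t, grid_h, grid_w))."""
    # Pad by repeating the last frame until T is a multiple of temporal_patch_size.
    remainder = frames.shape[0] % temporal_patch_size
    if remainder:
        pad = np.repeat(frames[-1][np.newaxis], temporal_patch_size - remainder, axis=0)
        frames = np.concatenate([frames, pad], axis=0)
    _, channel, height, width = frames.shape
    grid_t = frames.shape[0] // temporal_patch_size
    grid_h, grid_w = height // patch_size, width // patch_size
    patches = frames.reshape(
        grid_t, temporal_patch_size, channel,
        grid_h // merge_size, merge_size, patch_size,
        grid_w // merge_size, merge_size, patch_size,
    )
    # Reorder so that the patches of one merge_size x merge_size group are adjacent rows.
    patches = patches.transpose(0, 3, 6, 4, 7, 2, 1, 5, 8)
    flat = patches.reshape(
        grid_t * grid_h * grid_w, channel * temporal_patch_size * patch_size * patch_size
    )
    return flat, (grid_t, grid_h, grid_w)


# One 3-channel 56x84 image; values are just a running index for easy inspection.
frames = np.arange(3 * 56 * 84, dtype=np.float32).reshape(1, 3, 56, 84)
flat, grid = flatten_patches(frames)
print(flat.shape, grid)  # (24, 1176) (1, 4, 6)
```

Each row holds one patch laid out as (channel, temporal, patch row, patch column). Rows are ordered by merge group first, so row 2 is the patch directly below row 0 and not the third patch along the top edge.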
281
+ def preprocess(
282
+ self,
283
+ images: ImageInput,
284
+ videos: Optional[VideoInput] = None,
285
+ do_resize: Optional[bool] = None,
286
+ size: Optional[dict[str, int]] = None,
287
+ min_pixels: Optional[int] = None,
288
+ max_pixels: Optional[int] = None,
289
+ resample: Optional[PILImageResampling] = None,
290
+ do_rescale: Optional[bool] = None,
291
+ rescale_factor: Optional[float] = None,
292
+ do_normalize: Optional[bool] = None,
293
+ image_mean: Optional[Union[float, list[float]]] = None,
294
+ image_std: Optional[Union[float, list[float]]] = None,
295
+ patch_size: Optional[int] = None,
296
+ temporal_patch_size: Optional[int] = None,
297
+ merge_size: Optional[int] = None,
298
+ do_convert_rgb: Optional[bool] = None,
299
+ return_tensors: Optional[Union[str, TensorType]] = None,
300
+ data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
301
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
302
+ ):
303
+ """
304
+ Args:
305
+ images (`ImageInput`):
306
+ Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
307
+ passing in images with pixel values between 0 and 1, set `do_rescale=False`.
308
+ videos (`VideoInput`):
309
+ Video to preprocess. Expects a single or batch of videos with pixel values ranging from 0 to 255. If
310
+ passing in videos with pixel values between 0 and 1, set `do_rescale=False`.
311
+ do_resize (`bool`, *optional*, defaults to `self.do_resize`):
312
+ Whether to resize the image.
313
+ size (`dict[str, int]`, *optional*, defaults to `self.size`):
314
+ Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present; `_preprocess`
315
+ passes them to `smart_resize` as the minimum and maximum total pixel counts.
316
+ resample (`int`, *optional*, defaults to `self.resample`):
317
+ Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
318
+ has an effect if `do_resize` is set to `True`.
319
+ do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
320
+ Whether to rescale the image.
321
+ rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
322
+ Rescale factor to rescale the image by if `do_rescale` is set to `True`.
323
+ do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
324
+ Whether to normalize the image.
325
+ image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
326
+ Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
327
+ image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
328
+ Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
329
+ `True`.
330
+ min_pixels (`int`, *optional*, defaults to `self.min_pixels`):
331
+ Minimum total number of pixels (height * width) allowed in the resized image.
332
+ max_pixels (`int`, *optional*, defaults to `self.max_pixels`):
333
+ Maximum total number of pixels (height * width) allowed in the resized image.
334
+ patch_size (`int`, *optional*, defaults to `self.patch_size`):
335
+ The spatial patch size of the vision encoder.
336
+ temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
337
+ The temporal patch size of the vision encoder.
338
+ merge_size (`int`, *optional*, defaults to `self.merge_size`):
339
+ The merge size of the vision encoder to llm encoder.
340
+ do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
341
+ Whether to convert the image to RGB.
342
+ return_tensors (`str` or `TensorType`, *optional*):
343
+ The type of tensors to return. Can be one of:
344
+ - Unset: Return a list of `np.ndarray`.
345
+ - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
346
+ - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
347
+ - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
348
+ - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
349
+ data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
350
+ The channel dimension format for the output image. Can be one of:
351
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
352
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
353
+ - Unset: Use the channel dimension format of the input image.
354
+ input_data_format (`ChannelDimension` or `str`, *optional*):
355
+ The channel dimension format for the input image. If unset, the channel dimension format is inferred
356
+ from the input image. Can be one of:
357
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
358
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
359
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
360
+
361
+ """
362
+ min_pixels = min_pixels if min_pixels is not None else self.min_pixels
363
+ max_pixels = max_pixels if max_pixels is not None else self.max_pixels
364
+
365
+ if size is not None:
366
+ if "shortest_edge" not in size or "longest_edge" not in size:
367
+ raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
368
+ min_pixels = size["shortest_edge"]
369
+ elif min_pixels is not None and max_pixels is not None:
370
+ # backward compatibility: override size with min_pixels and max_pixels if they are provided
371
+ size = {"shortest_edge": min_pixels, "longest_edge": max_pixels}
372
+ else:
373
+ size = {**self.size}
374
+
375
+ do_resize = do_resize if do_resize is not None else self.do_resize
376
+
377
+ resample = resample if resample is not None else self.resample
378
+ do_rescale = do_rescale if do_rescale is not None else self.do_rescale
379
+ rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
380
+ do_normalize = do_normalize if do_normalize is not None else self.do_normalize
381
+ image_mean = image_mean if image_mean is not None else self.image_mean
382
+ image_std = image_std if image_std is not None else self.image_std
383
+ patch_size = patch_size if patch_size is not None else self.patch_size
384
+ temporal_patch_size = temporal_patch_size if temporal_patch_size is not None else self.temporal_patch_size
385
+ merge_size = merge_size if merge_size is not None else self.merge_size
386
+ do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
387
+
388
+ if images is not None:
389
+ images = self.fetch_images(images)
390
+ images = make_flat_list_of_images(images)
391
+
392
+ if images is not None and not valid_images(images):
393
+ raise ValueError(
394
+ "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
395
+ "torch.Tensor, tf.Tensor or jax.ndarray."
396
+ )
397
+
398
+ validate_preprocess_arguments(
399
+ rescale_factor=rescale_factor,
400
+ do_normalize=do_normalize,
401
+ image_mean=image_mean,
402
+ image_std=image_std,
403
+ do_resize=do_resize,
404
+ size=size,
405
+ resample=resample,
406
+ )
407
+
408
+ data = {}
409
+ if images is not None:
410
+ pixel_values, vision_grid_thws = [], []
411
+ for image in images:
412
+ patches, image_grid_thw = self._preprocess(
413
+ image,
414
+ do_resize=do_resize,
415
+ size=size,
416
+ resample=resample,
417
+ do_rescale=do_rescale,
418
+ rescale_factor=rescale_factor,
419
+ do_normalize=do_normalize,
420
+ image_mean=image_mean,
421
+ image_std=image_std,
422
+ patch_size=patch_size,
423
+ temporal_patch_size=temporal_patch_size,
424
+ merge_size=merge_size,
425
+ data_format=data_format,
426
+ do_convert_rgb=do_convert_rgb,
427
+ input_data_format=input_data_format,
428
+ )
429
+ pixel_values.extend(patches)
430
+ vision_grid_thws.append(image_grid_thw)
431
+ pixel_values = np.array(pixel_values)
432
+ vision_grid_thws = np.array(vision_grid_thws)
433
+ data.update({"pixel_values": pixel_values, "image_grid_thw": vision_grid_thws})
434
+
435
+ # kept for BC only and should be removed after v5.0
436
+ if videos is not None:
437
+ logger.warning(
438
+ "`Qwen2VLImageProcessor` works only with image inputs and doesn't process videos anymore. "
439
+ "This is a deprecated behavior and will be removed in v5.0. "
440
+ "Your videos should be forwarded to `Qwen2VLVideoProcessor`. "
441
+ )
442
+ videos = make_batched_videos(videos)
443
+ pixel_values_videos, vision_grid_thws_videos = [], []
444
+ for images in videos:
445
+ patches, video_grid_thw = self._preprocess(
446
+ images,
447
+ do_resize=do_resize,
448
+ size=size,
449
+ resample=resample,
450
+ do_rescale=do_rescale,
451
+ rescale_factor=rescale_factor,
452
+ do_normalize=do_normalize,
453
+ image_mean=image_mean,
454
+ image_std=image_std,
455
+ patch_size=patch_size,
456
+ temporal_patch_size=temporal_patch_size,
457
+ merge_size=merge_size,
458
+ data_format=data_format,
459
+ do_convert_rgb=do_convert_rgb,
460
+ input_data_format=input_data_format,
461
+ )
462
+ pixel_values_videos.extend(patches)
463
+ vision_grid_thws_videos.append(video_grid_thw)
464
+ data.update(
465
+ {
466
+ "pixel_values_videos": np.array(pixel_values_videos),
467
+ "video_grid_thw": np.array(vision_grid_thws_videos),
468
+ }
469
+ )
470
+
471
+ return BatchFeature(data=data, tensor_type=return_tensors)
472
+
473
+ def get_number_of_image_patches(self, height: int, width: int, images_kwargs=None):
474
+ """
475
+ A utility that returns number of image patches for a given image size.
476
+
477
+ Args:
478
+ height (`int`):
479
+ Height of the input image.
480
+ width (`int`):
481
+ Width of the input image.
482
+ images_kwargs (`dict`, *optional*):
483
+ Any kwargs to override defaults of the image processor.
484
+ Returns:
485
+ `int`: Number of image patches per image.
486
+ """
487
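+ images_kwargs = images_kwargs or {}  # the signature defaults to None, which would break the lookups below
+ min_pixels = images_kwargs["min_pixels"] if "min_pixels" in images_kwargs else self.size["shortest_edge"]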
488
+ max_pixels = images_kwargs["max_pixels"] if "max_pixels" in images_kwargs else self.size["longest_edge"]
489
+ patch_size = images_kwargs.get("patch_size", self.patch_size)
490
+ merge_size = images_kwargs.get("merge_size", self.merge_size)
491
+ focus_size = images_kwargs.get("focus_size", self.focus_size)
492
+
493
+ factor = patch_size * merge_size * focus_size
494
+ resized_height, resized_width = smart_resize(
495
+ height, width, factor, min_pixels=min_pixels, max_pixels=max_pixels
496
+ )
497
+ grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
498
+ return grid_h * grid_w
499
+
500
+
501
+ __all__ = ["Qwen2VLImageProcessor"]
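The patch-count logic in `get_number_of_image_patches` can be sketched in isolation. This is a minimal sketch, assuming `smart_resize` behaves like the upstream Qwen2-VL helper (round each side to a multiple of `factor`, then rescale so the area stays between `min_pixels` and `max_pixels`); the helper used by this repository is defined outside this excerpt and may differ. `approx_smart_resize` and `count_image_patches` are hypothetical names, and the default values are borrowed from the fast processor's class attributes.

```python
import math


def approx_smart_resize(height, width, factor, min_pixels, max_pixels):
    """Assumed behaviour: snap both sides to multiples of `factor` within a pixel budget."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def count_image_patches(
    height,
    width,
    patch_size=14,
    merge_size=2,
    focus_size=2,
    min_pixels=56 * 56,
    max_pixels=28 * 28 * 1280,
):
    """Mirror of the grid computation: the resize factor includes focus_size."""
    factor = patch_size * merge_size * focus_size
    resized_h, resized_w = approx_smart_resize(height, width, factor, min_pixels, max_pixels)
    grid_h, grid_w = resized_h // patch_size, resized_w // patch_size
    return grid_h * grid_w


if __name__ == "__main__":
    patches = count_image_patches(448, 672)
    print(patches, patches // (2 ** 2))  # patches, then placeholder tokens after merging
```

With these defaults the factor is 14 * 2 * 2 = 56, so a 448 x 672 input is already aligned and is left at its original size, giving a 32 x 48 grid.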
image_processing_qwen2_vl_fast.py ADDED
@@ -0,0 +1,288 @@
1
+ from typing import Optional, Union
2
+
3
+ import torch
4
+ from torchvision.transforms.v2 import functional as F
5
+
6
+ from transformers.image_processing_utils import BatchFeature
7
+ from transformers.image_processing_utils_fast import (
8
+ BaseImageProcessorFast,
9
+ DefaultFastImageProcessorKwargs,
10
+ group_images_by_shape,
11
+ reorder_images,
12
+ )
13
+ from transformers.image_utils import (
14
+ OPENAI_CLIP_MEAN,
15
+ OPENAI_CLIP_STD,
16
+ ChannelDimension,
17
+ ImageInput,
18
+ PILImageResampling,
19
+ SizeDict,
20
+ )
21
+ from transformers.processing_utils import Unpack
22
+ from transformers.utils import (
23
+ TensorType,
24
+ auto_docstring,
25
+ logging,
26
+ )
27
+ from transformers.video_utils import VideoInput, make_batched_videos
28
+ from .image_processing_qwen2_vl import smart_resize
29
+
30
+
31
+ logger = logging.get_logger(__name__)
32
+
33
+
34
+ class Qwen2VLFastImageProcessorKwargs(DefaultFastImageProcessorKwargs):
35
+ """
36
+ min_pixels (`int`, *optional*, defaults to `56 * 56`):
37
+ The min pixels of the image to resize the image.
38
+ max_pixels (`int`, *optional*, defaults to `28 * 28 * 1280`):
39
+ The max pixels of the image to resize the image.
40
+ patch_size (`int`, *optional*, defaults to 14):
41
+ The spatial patch size of the vision encoder.
42
+ temporal_patch_size (`int`, *optional*, defaults to 2):
43
+ The temporal patch size of the vision encoder.
44
+ merge_size (`int`, *optional*, defaults to 2):
45
+ The merge size of the vision encoder to the LLM encoder.
46
+ """
47
+
48
+ min_pixels: Optional[int]
49
+ max_pixels: Optional[int]
50
+ patch_size: Optional[int]
51
+ temporal_patch_size: Optional[int]
52
+ merge_size: Optional[int]
53
+ focus_size: Optional[int]
54
+
55
+
56
+ @auto_docstring
57
+ class ZFQwen2VLImageProcessorFast(BaseImageProcessorFast):
58
+ do_resize = True
59
+ resample = PILImageResampling.BICUBIC
60
+ size = {"shortest_edge": 56 * 56, "longest_edge": 28 * 28 * 1280}
61
+ do_rescale = True
62
+ do_normalize = True
63
+ image_mean = OPENAI_CLIP_MEAN
64
+ image_std = OPENAI_CLIP_STD
65
+ do_convert_rgb = True
66
+ patch_size = 14
67
+ temporal_patch_size = 2
68
+ merge_size = 2
69
+ focus_size = 2
70
+ min_pixels = None
71
+ max_pixels = None
72
+ valid_kwargs = Qwen2VLFastImageProcessorKwargs
73
+ model_input_names = ["pixel_values", "image_grid_thw", "pixel_values_videos", "video_grid_thw"]
74
+
75
+ def __init__(self, **kwargs: Unpack[Qwen2VLFastImageProcessorKwargs]):
76
+ size = kwargs.pop("size", None)
77
+ min_pixels = kwargs.pop("min_pixels", None)
78
+ max_pixels = kwargs.pop("max_pixels", None)
79
+ # backward compatibility: override size with min_pixels and max_pixels if they are provided
80
+ size = {**self.size} if size is None else size  # copy so the class-level default dict is not mutated below
81
+ if min_pixels is not None:
82
+ size["shortest_edge"] = min_pixels
83
+ size.pop("min_pixels", None)
84
+ if max_pixels is not None:
85
+ size["longest_edge"] = max_pixels
86
+ size.pop("max_pixels", None)
87
+ if "shortest_edge" not in size or "longest_edge" not in size:
88
+ raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
89
+
90
+ super().__init__(size=size, min_pixels=min_pixels, max_pixels=max_pixels, **kwargs)
91
+
92
+ def _further_process_kwargs(
93
+ self,
94
+ size: Optional[SizeDict] = None,
95
+ min_pixels: Optional[int] = None,
96
+ max_pixels: Optional[int] = None,
97
+ **kwargs,
98
+ ) -> dict:
99
+ """
100
+ Update kwargs that need further processing before being validated.
101
+ Can be overridden by subclasses to customize the processing of kwargs.
102
+ """
103
+ if min_pixels is not None and max_pixels is not None:
104
+ size = {"shortest_edge": min_pixels, "longest_edge": max_pixels}
105
+ elif size is not None:
106
+ if "shortest_edge" not in size or "longest_edge" not in size:
107
+ raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
108
+ min_pixels = size["shortest_edge"]
109
+ max_pixels = size["longest_edge"]
110
+ else:
111
+ size = {**self.size}
112
+
113
+ return super()._further_process_kwargs(size=size, min_pixels=min_pixels, max_pixels=max_pixels, **kwargs)
114
+
115
+ @auto_docstring
116
+ def preprocess(
117
+ self,
118
+ images: ImageInput,
119
+ videos: Optional[VideoInput] = None,
120
+ **kwargs: Unpack[Qwen2VLFastImageProcessorKwargs],
121
+ ) -> BatchFeature:
122
+ return super().preprocess(images, videos, **kwargs)
123
+
124
+ def _preprocess_image_like_inputs(
125
+ self,
126
+ images: ImageInput,
127
+ videos: VideoInput,
128
+ do_convert_rgb: bool,
129
+ input_data_format: ChannelDimension,
130
+ device: Optional[Union[str, "torch.device"]] = None,
131
+ **kwargs: Unpack[DefaultFastImageProcessorKwargs],
132
+ ) -> BatchFeature:
133
+ """
134
+ Preprocess image-like inputs.
135
+ To be overridden by subclasses when image-like inputs other than images should be processed.
136
+ It can be used for segmentation maps, depth maps, etc.
137
+ """
138
+ # Prepare input images
139
+ batch_feature = BatchFeature()
140
+ if images is not None:
141
+ images = self._prepare_image_like_inputs(
142
+ images=images, do_convert_rgb=do_convert_rgb, input_data_format=input_data_format, device=device
143
+ )
144
+ batch_feature = self._preprocess(images, **kwargs)
145
+ if videos is not None:
146
+ logger.warning(
147
+ "`Qwen2VLImageProcessorFast` works only with image inputs and doesn't process videos anymore. "
148
+ "This is a deprecated behavior and will be removed in v5.0. "
149
+ "Your videos should be forwarded to `Qwen2VLVideoProcessor`. "
150
+ )
151
+ # Can't change _prepare_images_structure to work with videos because it also needs to work with images.
152
+ videos = make_batched_videos(videos)
153
+ videos = [
154
+ torch.stack(self._prepare_image_like_inputs(video, do_convert_rgb, input_data_format, device))
155
+ for video in videos
156
+ ]
157
+ video_outputs = self._preprocess(videos, **kwargs)
158
+ batch_feature.update(
159
+ {"pixel_values_videos": video_outputs.pixel_values, "video_grid_thw": video_outputs.image_grid_thw}
160
+ )
161
+ return batch_feature
162
+
163
+ def _preprocess(
164
+ self,
165
+ images: list["torch.Tensor"],
166
+ do_resize: bool,
167
+ size: SizeDict,
168
+ interpolation: Optional["F.InterpolationMode"],
169
+ do_rescale: bool,
170
+ rescale_factor: float,
171
+ do_normalize: bool,
172
+ image_mean: Optional[Union[float, list[float]]],
173
+ image_std: Optional[Union[float, list[float]]],
174
+ patch_size: int,
175
+ temporal_patch_size: int,
176
+ merge_size: int,
177
+ focus_size: int,
178
+ disable_grouping: Optional[bool],
179
+ return_tensors: Optional[Union[str, TensorType]],
180
+ **kwargs,
181
+ ):
182
+ # Group images by size for batched resizing
183
+ grouped_images, grouped_images_index = group_images_by_shape(images, disable_grouping=disable_grouping)
184
+ resized_images_grouped = {}
185
+ for shape, stacked_images in grouped_images.items():
186
+ height, width = stacked_images.shape[-2:]
187
+ if do_resize:
188
+ resized_height, resized_width = smart_resize(
189
+ height,
190
+ width,
191
+ factor=patch_size * merge_size * focus_size,
192
+ min_pixels=size["shortest_edge"],
193
+ max_pixels=size["longest_edge"],
194
+ )
195
+ stacked_images = self.resize(
196
+ image=stacked_images,
197
+ size=SizeDict(height=resized_height, width=resized_width),
198
+ interpolation=interpolation,
199
+ )
200
+ resized_images_grouped[shape] = stacked_images
201
+ resized_images = reorder_images(resized_images_grouped, grouped_images_index)
202
+
203
+ # Group images by size for further processing
204
+ # Needed in case do_resize is False, or resize returns images with different sizes
205
+ grouped_images, grouped_images_index = group_images_by_shape(resized_images, disable_grouping=disable_grouping)
206
+ processed_images_grouped = {}
207
+ processed_grids = {}
208
+ for shape, stacked_images in grouped_images.items():
209
+ resized_height, resized_width = stacked_images.shape[-2:]
210
+ # Fused rescale and normalize
211
+ patches = self.rescale_and_normalize(
212
+ stacked_images, do_rescale, rescale_factor, do_normalize, image_mean, image_std
213
+ )
214
+ if patches.ndim == 4:
215
+ # add a temporal dimension if we have images
216
+ patches = patches.unsqueeze(1)
217
+ if patches.shape[1] % temporal_patch_size != 0:
218
+ repeats = patches[:, -1:].repeat(1, temporal_patch_size - 1, 1, 1, 1)
219
+ patches = torch.cat([patches, repeats], dim=1)
220
+ batch_size, grid_t, channel = patches.shape[:3]
221
+ grid_t = grid_t // temporal_patch_size
222
+ grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
223
+
224
+ patches = patches.view(
225
+ batch_size,
226
+ grid_t,
227
+ temporal_patch_size,
228
+ channel,
229
+ grid_h // merge_size,
230
+ merge_size,
231
+ patch_size,
232
+ grid_w // merge_size,
233
+ merge_size,
234
+ patch_size,
235
+ )
236
+ # Reorder dimensions to group grid and patch information for subsequent flattening.
237
+ # (batch, grid_t, grid_h, grid_w, merge_h, merge_w, channel, temp_patch_size, patch_h, patch_w)
238
+ patches = patches.permute(0, 1, 4, 7, 5, 8, 3, 2, 6, 9)
239
+ flatten_patches = patches.reshape(
240
+ batch_size,
241
+ grid_t * grid_h * grid_w,
242
+ channel * temporal_patch_size * patch_size * patch_size,
243
+ )
244
+
245
+ processed_images_grouped[shape] = flatten_patches
246
+ processed_grids[shape] = [[grid_t, grid_h, grid_w]] * batch_size
247
+
248
+ processed_images = reorder_images(processed_images_grouped, grouped_images_index)
249
+ processed_grids = reorder_images(processed_grids, grouped_images_index)
250
+ pixel_values = torch.cat(processed_images, dim=0)
251
+ image_grid_thw = torch.tensor(processed_grids)
252
+
253
+ return BatchFeature(
254
+ data={"pixel_values": pixel_values, "image_grid_thw": image_grid_thw}, tensor_type=return_tensors
255
+ )
256
+
257
+ def get_number_of_image_patches(self, height: int, width: int, images_kwargs=None):
258
+ """
259
+ A utility that returns number of image patches for a given image size.
260
+
261
+ Note: Do not remove this method! It is used by vLLM to infer the number of patches and placeholders
262
+ without an image input.
263
+
264
+ Args:
265
+ height (`int`):
266
+ Height of the input image.
267
+ width (`int`):
268
+ Width of the input image.
269
+ images_kwargs (`dict`, *optional*):
270
+ Any kwargs to override defaults of the image processor.
271
+ Returns:
272
+ `int`: Number of image patches per image.
273
+ """
274
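+ images_kwargs = images_kwargs or {}  # the signature defaults to None, which would break the lookups below
+ min_pixels = images_kwargs["min_pixels"] if "min_pixels" in images_kwargs else self.size["shortest_edge"]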
275
+ max_pixels = images_kwargs["max_pixels"] if "max_pixels" in images_kwargs else self.size["longest_edge"]
276
+ patch_size = images_kwargs.get("patch_size", self.patch_size)
277
+ merge_size = images_kwargs.get("merge_size", self.merge_size)
278
+ focus_size = images_kwargs.get("focus_size", self.focus_size)
279
+
280
+ factor = patch_size * merge_size * focus_size
281
+ resized_height, resized_width = smart_resize(
282
+ height, width, factor, min_pixels=min_pixels, max_pixels=max_pixels
283
+ )
284
+ grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
285
+ return grid_h * grid_w
286
+
287
+
288
+ __all__ = ["ZFQwen2VLImageProcessorFast"]
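The shape bookkeeping in `_preprocess` (temporal padding, grid sizes, and the final flattening) can be followed without tensors. This is a rough sketch in plain Python; `flattened_patch_shape` is a hypothetical helper, not part of the repository, and it only reproduces the arithmetic, not the `view`/`permute`/`reshape` calls themselves.

```python
def flattened_patch_shape(
    batch_size,
    num_frames,
    channels,
    height,
    width,
    patch_size=14,
    temporal_patch_size=2,
    merge_size=2,
):
    """Return ((grid_t, grid_h, grid_w), flattened_shape) for already-resized inputs."""
    block = patch_size * merge_size
    if height % block or width % block:
        raise ValueError("height and width must be multiples of patch_size * merge_size")

    # As in the diff, the last frame is repeated (temporal_patch_size - 1) times
    # when the frame count is not divisible by temporal_patch_size.
    if num_frames % temporal_patch_size != 0:
        num_frames += temporal_patch_size - 1

    grid_t = num_frames // temporal_patch_size
    grid_h = height // patch_size
    grid_w = width // patch_size

    seq_len = grid_t * grid_h * grid_w
    feature_dim = channels * temporal_patch_size * patch_size * patch_size
    return (grid_t, grid_h, grid_w), (batch_size, seq_len, feature_dim)


if __name__ == "__main__":
    grid, shape = flattened_patch_shape(1, 1, 3, 448, 672)
    print(grid, shape)
```

A single still image counts as one frame, so it is duplicated once to fill a temporal patch of size 2 and ends up with `grid_t = 1`.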
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoImageProcessor": "image_processing_qwen2_vl_fast.ZFQwen2VLImageProcessorFast",
4
+ "AutoProcessor": "processing_qwen3_vl.ZFQwen3VLProcessor"
5
+ },
6
+ "crop_size": null,
7
+ "data_format": "channels_first",
8
+ "default_to_square": true,
9
+ "device": null,
10
+ "disable_grouping": null,
11
+ "do_center_crop": null,
12
+ "do_convert_rgb": true,
13
+ "do_normalize": true,
14
+ "do_pad": null,
15
+ "do_rescale": true,
16
+ "do_resize": true,
17
+ "focus_size": 2,
18
+ "image_mean": [
19
+ 0.5,
20
+ 0.5,
21
+ 0.5
22
+ ],
23
+ "image_processor_type": "ZFQwen2VLImageProcessorFast",
24
+ "image_std": [
25
+ 0.5,
26
+ 0.5,
27
+ 0.5
28
+ ],
29
+ "input_data_format": null,
30
+ "max_pixels": null,
31
+ "merge_size": 2,
32
+ "min_pixels": null,
33
+ "pad_size": null,
34
+ "patch_size": 16,
35
+ "processor_class": "ZFQwen3VLProcessor",
36
+ "resample": 3,
37
+ "rescale_factor": 0.00392156862745098,
38
+ "return_tensors": null,
39
+ "size": {
40
+ "longest_edge": 16777216,
41
+ "shortest_edge": 65536
42
+ },
43
+ "temporal_patch_size": 2
44
+ }
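The numeric values in this config are easier to read once expanded. The following is a small sketch of the arithmetic only, assuming the usual rescale-then-normalize order; `normalize_pixel` and the constant names are hypothetical and simply copy values from the JSON above.

```python
# Values copied from preprocessor_config.json above.
RESCALE_FACTOR = 0.00392156862745098
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5
PATCH_SIZE = 16
MERGE_SIZE = 2
FOCUS_SIZE = 2
SHORTEST_EDGE = 65536
LONGEST_EDGE = 16777216


def normalize_pixel(value, mean=IMAGE_MEAN, std=IMAGE_STD, rescale_factor=RESCALE_FACTOR):
    """Rescale an 8-bit pixel value to [0, 1], then normalize with mean and std."""
    return (value * rescale_factor - mean) / std


if __name__ == "__main__":
    # rescale_factor is 1/255, so 8-bit pixels end up roughly in [-1, 1].
    print(normalize_pixel(0), normalize_pixel(255))
    # Both size bounds are pixel budgets (areas), not edge lengths.
    print(SHORTEST_EDGE == 256 * 256, LONGEST_EDGE == 4096 * 4096)
    # Resized sides are multiples of patch_size * merge_size * focus_size.
    print(PATCH_SIZE * MERGE_SIZE * FOCUS_SIZE)
```

Despite their names, `shortest_edge` and `longest_edge` are used by the processors in this commit as `min_pixels` and `max_pixels`, which is why they hold squared values.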
processing_qwen3_vl.py ADDED
@@ -0,0 +1,320 @@
1
+ from typing import Optional, Union
2
+
3
+ import numpy as np
4
+
5
+ from transformers.feature_extraction_utils import BatchFeature
6
+ from transformers.image_utils import ImageInput
7
+ from transformers.processing_utils import ImagesKwargs, MultiModalData, ProcessingKwargs, ProcessorMixin, Unpack, VideosKwargs
8
+ from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
9
+ from transformers.utils import logging
10
+ from transformers.video_utils import VideoInput
11
+
12
+
13
+ logger = logging.get_logger(__name__)
14
+
15
+
16
+ class Qwen3VLVideosProcessorKwargs(VideosKwargs, total=False):
17
+ focus_size: Optional[int]
18
+
19
+
20
+ class Qwen3VLImagesKwargs(ImagesKwargs):
21
+ min_pixels: Optional[int]
22
+ max_pixels: Optional[int]
23
+ patch_size: Optional[int]
24
+ temporal_patch_size: Optional[int]
25
+ merge_size: Optional[int]
26
+ focus_size: Optional[int]
27
+
28
+
29
+ class Qwen3VLProcessorKwargs(ProcessingKwargs, total=False):
30
+ images_kwargs: Qwen3VLImagesKwargs
31
+ videos_kwargs: Qwen3VLVideosProcessorKwargs
32
+ _defaults = {
33
+ "text_kwargs": {
34
+ "padding": False,
35
+ "return_token_type_ids": False,
36
+ "return_mm_token_type_ids": False,
37
+ },
38
+ "videos_kwargs": {"return_metadata": True},
39
+ }
40
+
41
+
42
+ class ZFQwen3VLProcessor(ProcessorMixin):
43
+ r"""
44
+ Constructs a Qwen3VL processor which wraps a Qwen3VL image processor and a Qwen2 tokenizer into a single processor.
45
+ [`Qwen3VLProcessor`] offers all the functionalities of [`Qwen2VLImageProcessor`] and [`Qwen2TokenizerFast`]. See the
46
+ [`~Qwen3VLProcessor.__call__`] and [`~Qwen3VLProcessor.decode`] for more information.
47
+ Args:
48
+ image_processor ([`Qwen2VLImageProcessor`], *optional*):
49
+ The image processor is a required input.
50
+ tokenizer ([`Qwen2TokenizerFast`], *optional*):
51
+ The tokenizer is a required input.
52
+ video_processor ([`Qwen3VLVideoProcessor`], *optional*):
53
+ The video processor is a required input.
54
+ chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
55
+ in a chat into a tokenizable string.
56
+ """
57
+
58
+ attributes = ["image_processor", "tokenizer", "video_processor"]
59
+ image_processor_class = "AutoImageProcessor"
60
+ video_processor_class = "AutoVideoProcessor"
61
+ tokenizer_class = ("Qwen2Tokenizer", "Qwen2TokenizerFast")
62
+
63
+ def __init__(self, image_processor=None, tokenizer=None, video_processor=None, chat_template=None, **kwargs):
64
+ super().__init__(image_processor, tokenizer, video_processor, chat_template=chat_template)
65
+ self.image_token = "<|image_pad|>" if not hasattr(tokenizer, "image_token") else tokenizer.image_token
66
+ self.video_token = "<|video_pad|>" if not hasattr(tokenizer, "video_token") else tokenizer.video_token
67
+ self.image_token_id = (
68
+ tokenizer.image_token_id
69
+ if getattr(tokenizer, "image_token_id", None)
70
+ else tokenizer.convert_tokens_to_ids(self.image_token)
71
+ )
72
+ self.video_token_id = (
73
+ tokenizer.video_token_id
74
+ if getattr(tokenizer, "video_token_id", None)
75
+ else tokenizer.convert_tokens_to_ids(self.video_token)
76
+ )
77
+ self.vision_start_token = (
78
+ "<|vision_start|>" if not hasattr(tokenizer, "vision_start_token") else tokenizer.vision_start_token
79
+ )
80
+ self.vision_end_token = (
81
+ "<|vision_end|>" if not hasattr(tokenizer, "vision_end_token") else tokenizer.vision_end_token
82
+ )
83
+ self.vision_start_token_id = (
84
+ tokenizer.vision_start_token_id
85
+ if getattr(tokenizer, "vision_start_token_id", None)
86
+ else tokenizer.convert_tokens_to_ids(self.vision_start_token)
87
+ )
88
+ self.vision_end_token_id = (
89
+ tokenizer.vision_end_token_id
90
+ if getattr(tokenizer, "vision_end_token_id", None)
91
+ else tokenizer.convert_tokens_to_ids(self.vision_end_token)
92
+ )
93
+
94
+ def __call__(
95
+ self,
96
+ images: ImageInput = None,
97
+ text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]] = None,
98
+ videos: VideoInput = None,
99
+ **kwargs: Unpack[Qwen3VLProcessorKwargs],
100
+ ) -> BatchFeature:
101
+ """
102
+ Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
103
+ and `kwargs` arguments to Qwen2TokenizerFast's [`~Qwen2TokenizerFast.__call__`] if `text` is not `None` to encode
104
+ the text. To prepare the vision inputs, this method forwards the `images` and `kwargs` arguments to
105
+ Qwen2VLImageProcessor's [`~Qwen2VLImageProcessor.__call__`] if `images` is not `None`.
106
+
107
+ Args:
108
+ images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`):
109
+ The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
110
+ tensor. Both channels-first and channels-last formats are supported.
111
+ text (`str`, `list[str]`, `list[list[str]]`):
112
+ The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
113
+ (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
114
+ `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
115
+ videos (`np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`):
116
+ The video or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch
117
+ tensor, or a nested list of 3D frames. Both channels-first and channels-last formats are supported.
118
+ return_tensors (`str` or [`~utils.TensorType`], *optional*):
119
+ If set, will return tensors of a particular framework. Acceptable values are:
120
+ - `'tf'`: Return TensorFlow `tf.constant` objects.
121
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
122
+ - `'np'`: Return NumPy `np.ndarray` objects.
123
+ - `'jax'`: Return JAX `jnp.ndarray` objects.
124
+
125
+ Returns:
126
+ [`BatchFeature`]: A [`BatchFeature`] with the following fields:
127
+
128
+ - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
129
+ - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
130
+ `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
131
+ `None`).
132
+ - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
133
+ - **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
134
+ - **image_grid_thw** -- List of image 3D grid in LLM. Returned when `images` is not `None`.
135
+ - **video_grid_thw** -- List of video 3D grid in LLM. Returned when `videos` is not `None`.
136
+ """
137
+ output_kwargs = self._merge_kwargs(
138
+ Qwen3VLProcessorKwargs,
139
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
140
+ **kwargs,
141
+ )
142
+ if images is not None:
143
+ image_inputs = self.image_processor(images=images, **output_kwargs["images_kwargs"])
144
+ image_grid_thw = image_inputs["image_grid_thw"]
145
+ else:
146
+ image_inputs = {}
147
+ image_grid_thw = None
148
+
149
+ if videos is not None:
150
+ videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
151
+ video_grid_thw = videos_inputs["video_grid_thw"]
152
+ # If user has not requested video metadata, pop it
153
+ if "return_metadata" not in kwargs:
154
+ video_metadata = videos_inputs.pop("video_metadata")
155
+ else:
156
+ video_metadata = videos_inputs["video_metadata"]
157
+ video_grid_thw = videos_inputs["video_grid_thw"]
158
+ else:
159
+ videos_inputs = {}
160
+ video_grid_thw = None
161
+
162
+ if not isinstance(text, list):
163
+ text = [text]
164
+
165
+ text = text.copy() # below lines change text in-place
166
+ if image_grid_thw is not None:
167
+ merge_length = self.image_processor.merge_size**2
168
+ index = 0
169
+ for i in range(len(text)):
170
+ while self.image_token in text[i]:
171
+ num_image_tokens = image_grid_thw[index].prod() // merge_length
172
+ text[i] = text[i].replace(self.image_token, "<|placeholder|>" * num_image_tokens, 1)
173
+ index += 1
174
+ text[i] = text[i].replace("<|placeholder|>", self.image_token)
175
+
176
+ if video_grid_thw is not None:
177
+ merge_length = self.video_processor.merge_size**2
178
+ index = 0
179
+ for i in range(len(text)):
180
+ while self.video_token in text[i]:
181
+ metadata = video_metadata[index]
182
+ if metadata.fps is None:
183
+ logger.warning_once(
184
+ "Qwen3VL requires frame timestamps to construct prompts, but the `fps` of the input video could not be inferred. "
185
+ "Probably `video_metadata` was missing from inputs and you passed pre-sampled frames. "
186
+ "Defaulting to `fps=24`. Please provide `video_metadata` for more accurate results."
187
+ )
188
+ metadata.fps = 24 if metadata.fps is None else metadata.fps
189
+
190
+ # if timestamps are not provided, calculate them
191
+ curr_timestamp = self._calculate_timestamps(
192
+ metadata.frames_indices,
193
+ metadata.fps,
194
+ self.video_processor.merge_size,
195
+ self.video_processor.focus_size,
196
+ )
197
+
198
+ logger.debug("video timestamps (%d): %s", len(curr_timestamp), curr_timestamp)
199
+ video_placeholder = ""
200
+ frame_seqlen = video_grid_thw[index][1:].prod() // merge_length
201
+ logger.debug("video_grid_thw: %s", video_grid_thw)
202
+ for frame_idx in range(video_grid_thw[index][0]):
203
+ curr_time = curr_timestamp[frame_idx]
204
+ video_placeholder += f"<{curr_time:.1f} seconds>"
205
+ video_placeholder += (
206
+ self.vision_start_token + "<|placeholder|>" * frame_seqlen + self.vision_end_token
207
+ )
208
+ if f"{self.vision_start_token}{self.video_token}{self.vision_end_token}" in text[i]:
209
+ text[i] = text[i].replace(
210
+ f"{self.vision_start_token}{self.video_token}{self.vision_end_token}", video_placeholder, 1
211
+ )
212
+ else:
213
+ # vllm may input video token directly
214
+ text[i] = text[i].replace(self.video_token, video_placeholder, 1)
215
+ index += 1
216
+
217
+ text[i] = text[i].replace("<|placeholder|>", self.video_token)
218
+
219
+ return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
220
+ return_mm_token_type_ids = output_kwargs["text_kwargs"].pop("return_mm_token_type_ids", None)
221
+ text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
222
+ self._check_special_mm_tokens(text, text_inputs, modalities=["image", "video"])
223
+
224
+ if return_mm_token_type_ids:
225
+ array_ids = np.array(text_inputs["input_ids"])
226
+ mm_token_type_ids = np.zeros_like(text_inputs["input_ids"])
227
+ mm_token_type_ids[array_ids == self.image_token_id] = 1
228
+ text_inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()
229
+
230
+ return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs}, tensor_type=return_tensors)
231
+
232
+ def _get_num_multimodal_tokens(self, image_sizes=None, video_sizes=None, **kwargs):
233
+ """
234
+ Computes the number of placeholder tokens needed for multimodal inputs with the given sizes.
235
+ Args:
236
+ image_sizes (`list[list[int]]`, *optional*):
237
+ The input sizes formatted as (height, width) per each image.
238
+ video_sizes (`list[list[int]]`, *optional*):
239
+ The input sizes formatted as (num_frames, height, width) per each video.
240
+ Returns:
241
+ `MultiModalData`: A `MultiModalData` object holding number of tokens per each of the provided
242
+ input modalities, along with other useful data.
243
+ """
244
+
245
+ vision_data = {}
246
+ if image_sizes is not None:
247
+ images_kwargs = Qwen3VLProcessorKwargs._defaults.get("images_kwargs", {})
248
+ images_kwargs.update(kwargs)
249
+ merge_size = images_kwargs.get("merge_size", None) or self.image_processor.merge_size
250
+
251
+ num_image_patches = [
252
+ self.image_processor.get_number_of_image_patches(*image_size, images_kwargs)
253
+ for image_size in image_sizes
254
+ ]
255
+ num_image_tokens = [(num_patches // merge_size**2) for num_patches in num_image_patches]
256
+ vision_data.update({"num_image_tokens": num_image_tokens, "num_image_patches": num_image_patches})
257
+
258
+ if video_sizes is not None:
259
+ videos_kwargs = dict(Qwen3VLProcessorKwargs._defaults.get("videos_kwargs", {}))  # copy: do not mutate class defaults
260
+ videos_kwargs.update(kwargs)
261
+ num_video_patches = [
262
+ self.video_processor.get_number_of_video_patches(*video_size, videos_kwargs)
263
+ for video_size in video_sizes
264
+ ]
265
+ num_video_tokens = [(num_patches // (videos_kwargs.get("merge_size") or self.video_processor.merge_size) ** 2) for num_patches in num_video_patches]
266
+ vision_data["num_video_tokens"] = num_video_tokens
267
+
268
+ return MultiModalData(**vision_data)
269
+
270
+ def post_process_image_text_to_text(
271
+ self, generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False, **kwargs
272
+ ):
273
+ """
274
+ Post-process the output of the model to decode the text.
275
+
276
+ Args:
277
+ generated_outputs (`torch.Tensor` or `np.ndarray`):
278
+ The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
279
+ or `(sequence_length,)`.
280
+ skip_special_tokens (`bool`, *optional*, defaults to `True`):
281
+ Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.
282
+ clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
283
+ Whether or not to clean up the tokenization spaces. Argument passed to the tokenizer's `batch_decode` method.
284
+ **kwargs:
285
+ Additional arguments to be passed to the tokenizer's `batch_decode` method.
286
+
287
+ Returns:
288
+ `list[str]`: The decoded text.
289
+ """
290
+ return self.tokenizer.batch_decode(
291
+ generated_outputs,
292
+ skip_special_tokens=skip_special_tokens,
293
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
294
+ **kwargs,
295
+ )
296
+
297
+ def _calculate_timestamps(
298
+ self,
299
+ indices: Union[list[int], np.ndarray],
300
+ video_fps: float,
301
+ merge_size: int = 2,
302
+ focus_size: int = 2,
303
+ ):
304
+ if not isinstance(indices, list):
305
+ indices = indices.tolist()
307
+ b_size = merge_size * focus_size
308
+ if len(indices) % b_size != 0:
309
+ indices.extend(indices[-1] for _ in range(b_size - len(indices) % b_size))
311
+ timestamps = [idx / video_fps for idx in indices]
312
+ # @JJJYmmm frames are merged by self.merge_size, \
313
+ # so we need to average the timestamps between the first/last frame within the temporal patch
314
+ timestamps = [
315
+ (timestamps[i] + timestamps[i + merge_size - 1]) / 2 for i in range(0, len(timestamps), merge_size)
316
+ ]
317
+ return timestamps
318
+
319
+
320
+ __all__ = ["ZFQwen3VLProcessor"]
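Reviewer note: the padding and averaging done by `_calculate_timestamps` can be checked in isolation. The sketch below restates that logic as a free function, assuming the `print` calls in the method are debugging leftovers rather than intended behavior. The name `calculate_timestamps` and the sample indices are illustrative and not part of this commit.

```python
from typing import Sequence


def calculate_timestamps(
    indices: Sequence[int],
    video_fps: float,
    merge_size: int = 2,
    focus_size: int = 2,
) -> list[float]:
    """Illustrative restatement of the timestamp logic (not the committed method)."""
    indices = list(indices)  # copy so the caller's sequence is left untouched
    block = merge_size * focus_size
    remainder = len(indices) % block
    if remainder:
        # Repeat the last frame index until the length is a multiple of the block size.
        indices.extend([indices[-1]] * (block - remainder))
    timestamps = [idx / video_fps for idx in indices]
    # One timestamp per group of `merge_size` frames: midpoint of its first and last frame.
    return [
        (timestamps[i] + timestamps[i + merge_size - 1]) / 2
        for i in range(0, len(timestamps), merge_size)
    ]


# Five frames taken every 12 source frames from a 24 fps video.
# The five indices are padded to eight, then averaged pairwise.
example = calculate_timestamps([0, 12, 24, 36, 48], video_fps=24.0)
```

With the default `merge_size=2` and `focus_size=2`, the padded list always has a length divisible by 4, so the result always contains an even number of timestamps.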
processor_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_qwen3_vl.ZFQwen3VLProcessor"
4
+ },
5
+ "processor_class": "ZFQwen3VLProcessor"
6
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
3
+ size 11422654
tokenizer_config.json ADDED
@@ -0,0 +1,243 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<tool_response>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "151666": {
190
+ "content": "</tool_response>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": false
196
+ },
197
+ "151667": {
198
+ "content": "<think>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": false
204
+ },
205
+ "151668": {
206
+ "content": "</think>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": false
212
+ }
213
+ },
214
+ "additional_special_tokens": [
215
+ "<|im_start|>",
216
+ "<|im_end|>",
217
+ "<|object_ref_start|>",
218
+ "<|object_ref_end|>",
219
+ "<|box_start|>",
220
+ "<|box_end|>",
221
+ "<|quad_start|>",
222
+ "<|quad_end|>",
223
+ "<|vision_start|>",
224
+ "<|vision_end|>",
225
+ "<|vision_pad|>",
226
+ "<|image_pad|>",
227
+ "<|video_pad|>"
228
+ ],
229
+ "auto_map": {
230
+ "AutoProcessor": "processing_qwen3_vl.ZFQwen3VLProcessor"
231
+ },
232
+ "bos_token": null,
233
+ "clean_up_tokenization_spaces": false,
234
+ "eos_token": "<|im_end|>",
235
+ "errors": "replace",
236
+ "extra_special_tokens": {},
237
+ "model_max_length": 262144,
238
+ "pad_token": "<|endoftext|>",
239
+ "processor_class": "ZFQwen3VLProcessor",
240
+ "split_special_tokens": false,
241
+ "tokenizer_class": "Qwen2Tokenizer",
242
+ "unk_token": null
243
+ }
video_preprocessor_config.json ADDED
@@ -0,0 +1,46 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_qwen3_vl.ZFQwen3VLProcessor",
4
+ "AutoVideoProcessor": "video_processing_qwen3_vl.ZFQwen3VLVideoProcessor"
5
+ },
6
+ "crop_size": null,
7
+ "data_format": "channels_first",
8
+ "default_to_square": true,
9
+ "device": null,
10
+ "do_center_crop": null,
11
+ "do_convert_rgb": true,
12
+ "do_normalize": true,
13
+ "do_rescale": true,
14
+ "do_resize": true,
15
+ "do_sample_frames": true,
16
+ "focus_size": 2,
17
+ "fps": 2,
18
+ "image_mean": [
19
+ 0.5,
20
+ 0.5,
21
+ 0.5
22
+ ],
23
+ "image_std": [
24
+ 0.5,
25
+ 0.5,
26
+ 0.5
27
+ ],
28
+ "input_data_format": null,
29
+ "max_frames": 3600,
30
+ "merge_size": 2,
31
+ "min_frames": 4,
32
+ "num_frames": null,
33
+ "pad_size": null,
34
+ "patch_size": 16,
35
+ "processor_class": "ZFQwen3VLProcessor",
36
+ "resample": 3,
37
+ "rescale_factor": 0.00392156862745098,
38
+ "return_metadata": false,
39
+ "size": {
40
+ "longest_edge": 235929600,
41
+ "shortest_edge": 4096
42
+ },
43
+ "temporal_patch_size": 2,
44
+ "video_metadata": null,
45
+ "video_processor_type": "ZFQwen3VLVideoProcessor"
46
+ }
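Reviewer note: the `fps`, `min_frames`, and `max_frames` values in this config drive how many frames are sampled. The sketch below mirrors the arithmetic of `sample_frames` in `video_processing_qwen3_vl.py` as a free function, so the effect of the config values can be checked by hand. The function name `sample_frame_indices` and the clip lengths are illustrative assumptions; the defaults are copied from the config above.

```python
import numpy as np


def sample_frame_indices(
    total_num_frames: int,
    source_fps: float,
    target_fps: float = 2.0,   # "fps" in the config
    min_frames: int = 4,       # "min_frames" in the config
    max_frames: int = 3600,    # "max_frames" in the config
) -> np.ndarray:
    """Illustrative restatement of the fps-based sampling rule."""
    # Frames wanted at the target rate, given the clip duration.
    num_frames = int(total_num_frames / source_fps * target_fps)
    # Clamp to [min_frames, max_frames], but never request more frames than exist.
    num_frames = min(min(max(num_frames, min_frames), max_frames), total_num_frames)
    # Evenly spaced indices, including the first and last frame.
    return np.linspace(0, total_num_frames - 1, num_frames).round().astype(int)


# A 10 second clip at 24 fps yields 20 indices at the target rate of 2 fps.
indices = sample_frame_indices(total_num_frames=240, source_fps=24.0)

# A 1 second clip would yield only 2 frames, so `min_frames` raises it to 4.
short_clip = sample_frame_indices(total_num_frames=30, source_fps=30.0)
```

Note that the class-level default in `ZFQwen3VLVideoProcessor` is `max_frames = 768`; the value 3600 used here comes from this config file.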
video_processing_qwen3_vl.py ADDED
@@ -0,0 +1,264 @@
1
+ import math
2
+ from typing import Optional, Union
3
+
4
+ import numpy as np
5
+ import torch
6
+
7
+ from transformers.feature_extraction_utils import BatchFeature
8
+ from transformers.image_utils import ChannelDimension, PILImageResampling, SizeDict, get_image_size
9
+ from transformers.processing_utils import Unpack, VideosKwargs
10
+ from transformers.utils import TensorType, add_start_docstrings, logging
11
+ from transformers.video_processing_utils import BASE_VIDEO_PROCESSOR_DOCSTRING, BaseVideoProcessor
12
+ from transformers.video_utils import VideoMetadata, group_videos_by_shape, reorder_videos
13
+
14
+
15
+ logger = logging.get_logger(__name__)
16
+
17
+
18
+ def smart_resize(
19
+ num_frames: int,
20
+ height: int,
21
+ width: int,
22
+ temporal_factor: int = 2,
23
+ factor: int = 32,
24
+ min_pixels: int = 128 * 128,
25
+ max_pixels: int = 16 * 16 * 2 * 2 * 2 * 6144,
26
+ ):
27
+ if num_frames < temporal_factor:
28
+ raise ValueError(f"t:{num_frames} must be larger than temporal_factor:{temporal_factor}")
29
+ if height < factor or width < factor:
30
+ raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
31
+ elif max(height, width) / min(height, width) > 200:
32
+ raise ValueError(
33
+ f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
34
+ )
35
+ h_bar = round(height / factor) * factor
36
+ w_bar = round(width / factor) * factor
37
+ t_bar = round(num_frames / temporal_factor) * temporal_factor
38
+
39
+ if t_bar * h_bar * w_bar > max_pixels:
40
+ beta = math.sqrt((num_frames * height * width) / max_pixels)
41
+ h_bar = max(factor, math.floor(height / beta / factor) * factor)
42
+ w_bar = max(factor, math.floor(width / beta / factor) * factor)
43
+ elif t_bar * h_bar * w_bar < min_pixels:
44
+ beta = math.sqrt(min_pixels / (num_frames * height * width))
45
+ h_bar = math.ceil(height * beta / factor) * factor
46
+ w_bar = math.ceil(width * beta / factor) * factor
47
+
48
+ return h_bar, w_bar
49
+
50
+
51
+ class Qwen3VLVideoProcessorInitKwargs(VideosKwargs):
52
+ patch_size: Optional[int]
53
+ temporal_patch_size: Optional[int]
54
+ merge_size: Optional[int]
55
+ focus_size: Optional[int]
56
+ min_frames: Optional[int]
57
+ max_frames: Optional[int]
58
+
59
+
60
+ @add_start_docstrings(
61
+ "Constructs a fast Qwen3-VL video processor that dynamically resizes videos based on the original videos.",
62
+ BASE_VIDEO_PROCESSOR_DOCSTRING,
63
+ """
64
+ patch_size (`int`, *optional*, defaults to 16):
65
+ The spatial patch size of the vision encoder.
66
+ temporal_patch_size (`int`, *optional*, defaults to 2):
67
+ The temporal patch size of the vision encoder.
68
+ merge_size (`int`, *optional*, defaults to 2):
69
+ The merge size of the vision encoder to LLM encoder.
70
+ """,
71
+ )
72
+ class ZFQwen3VLVideoProcessor(BaseVideoProcessor):
73
+ resample = PILImageResampling.BICUBIC
74
+ size = {"shortest_edge": 128 * 32 * 32, "longest_edge": 32 * 32 * 768}
75
+ image_mean = [0.5, 0.5, 0.5]
76
+ image_std = [0.5, 0.5, 0.5]
77
+ do_resize = True
78
+ do_rescale = True
79
+ do_normalize = True
80
+ do_convert_rgb = True
81
+ patch_size = 16
82
+ temporal_patch_size = 2
83
+ merge_size = 2
84
+ focus_size = 2
85
+ fps = 2
86
+ min_frames = 4
87
+ max_frames = 768
88
+ do_sample_frames = True
89
+ valid_kwargs = Qwen3VLVideoProcessorInitKwargs
90
+ model_input_names = ["pixel_values_videos", "video_grid_thw"]
91
+
92
+ def __init__(self, **kwargs: Unpack[Qwen3VLVideoProcessorInitKwargs]):
93
+ super().__init__(**kwargs)
94
+ if self.size is not None and (
95
+ self.size.get("shortest_edge", None) is None or self.size.get("longest_edge", None) is None
96
+ ):
97
+ raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
98
+
99
+ def _further_process_kwargs(
100
+ self,
101
+ size: Optional[SizeDict] = None,
102
+ **kwargs,
103
+ ) -> dict:
104
+ """
105
+ Update kwargs that need further processing before being validated
106
+ Can be overridden by subclasses to customize the processing of kwargs.
107
+ """
108
+ if size is not None and ("shortest_edge" not in size or "longest_edge" not in size):
109
+ raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
110
+
111
+ return super()._further_process_kwargs(size=size, **kwargs)
112
+
113
+ def sample_frames(
114
+ self,
115
+ metadata: VideoMetadata,
116
+ num_frames: Optional[int] = None,
117
+ fps: Optional[Union[int, float]] = None,
118
+ **kwargs,
119
+ ):
120
+ """
121
+ Default sampling function which uniformly samples the desired number of frames between 0 and total number of frames.
122
+ If `fps` is passed along with metadata, `fps` frames per second are sampled uniformly. Arguments `num_frames`
123
+ and `fps` are mutually exclusive.
124
+
125
+ Args:
128
+ metadata (`VideoMetadata`):
129
+ Metadata of the video containing information about total duration, fps and total number of frames.
130
+ num_frames (`int`, *optional*):
131
+ Maximum number of frames to sample. Defaults to `self.num_frames`.
132
+ fps (`int` or `float`, *optional*):
133
+ Target frames to sample per second. Defaults to `self.fps`.
134
+ Returns:
135
+ `np.ndarray`:
136
+ Indices of the sampled frames.
137
+ """
138
+ if fps is not None and num_frames is not None:
139
+ raise ValueError("`num_frames` and `fps` are mutually exclusive arguments, please use only one!")
140
+
141
+ total_num_frames = metadata.total_num_frames
142
+ fps = fps if fps is not None else self.fps
143
+
144
+ # If num_frames is not given but fps is, calculate num_frames from fps
145
+ if num_frames is None and fps is not None:
146
+ if metadata.fps is None:
147
+ metadata.fps = 24
148
+ logger.warning_once(
149
+ "Asked to sample `fps` frames per second but no video metadata was provided which is required when sampling with `fps`. "
150
+ "Defaulting to `fps=24`. Please provide `video_metadata` for more accurate results."
151
+ )
152
+ num_frames = int(total_num_frames / metadata.fps * fps)
153
+ num_frames = min(min(max(num_frames, self.min_frames), self.max_frames), total_num_frames)
154
+
155
+ if num_frames is None:
156
+ num_frames = min(max(total_num_frames, self.min_frames), self.max_frames)
157
+
158
+ indices = np.linspace(0, total_num_frames - 1, num_frames).round().astype(int)
159
+
160
+ return indices
161
+
162
+ def _preprocess(
163
+ self,
164
+ videos: list[torch.Tensor],
165
+ do_convert_rgb: bool = True,
166
+ do_resize: bool = True,
167
+ size: Optional[SizeDict] = None,
168
+ interpolation: PILImageResampling = PILImageResampling.BICUBIC,
169
+ do_rescale: bool = True,
170
+ rescale_factor: float = 1 / 255.0,
171
+ do_normalize: bool = True,
172
+ image_mean: Optional[Union[float, list[float]]] = None,
173
+ image_std: Optional[Union[float, list[float]]] = None,
174
+ patch_size: Optional[int] = None,
175
+ temporal_patch_size: Optional[int] = None,
176
+ merge_size: Optional[int] = None,
177
+ focus_size: Optional[int] = None,
178
+ return_tensors: Optional[Union[str, TensorType]] = None,
179
+ **kwargs,
180
+ ):
181
+ grouped_videos, grouped_videos_index = group_videos_by_shape(videos)
182
+ resized_videos_grouped = {}
183
+
184
+ for shape, stacked_videos in grouped_videos.items():
185
+ B, T, C, H, W = stacked_videos.shape
186
+ num_frames, height, width = T, H, W
187
+ if do_resize:
188
+ resized_height, resized_width = smart_resize(
189
+ num_frames=num_frames,
190
+ height=height,
191
+ width=width,
192
+ temporal_factor=temporal_patch_size,
193
+ factor=patch_size * merge_size * focus_size,
194
+ min_pixels=size.shortest_edge,
195
+ max_pixels=size.longest_edge,
196
+ )
197
+ stacked_videos = stacked_videos.view(B * T, C, H, W)
198
+ stacked_videos = self.resize(
199
+ stacked_videos,
200
+ size=SizeDict(height=resized_height, width=resized_width),
201
+ interpolation=interpolation,
202
+ )
203
+ stacked_videos = stacked_videos.view(B, T, C, resized_height, resized_width)
204
+ resized_videos_grouped[shape] = stacked_videos
205
+ resized_videos = reorder_videos(resized_videos_grouped, grouped_videos_index)
206
+
207
+ # Group videos by size for further processing
208
+ # Needed in case do_resize is False, or resize returns videos with different sizes
209
+ grouped_videos, grouped_videos_index = group_videos_by_shape(resized_videos)
210
+ processed_videos_grouped = {}
211
+ processed_grids = {}
212
+ for shape, stacked_videos in grouped_videos.items():
213
+ resized_height, resized_width = get_image_size(stacked_videos[0], channel_dim=ChannelDimension.FIRST)
214
+
215
+ # Fused rescale and normalize
216
+ stacked_videos = self.rescale_and_normalize(
217
+ stacked_videos, do_rescale, rescale_factor, do_normalize, image_mean, image_std
218
+ )
219
+ patches = stacked_videos
220
+
221
+ temporal_focus_size = temporal_patch_size * focus_size
222
+ # Pad by repeating the last frame so that `num_frames` is divisible by `temporal_patch_size * focus_size`
223
+ if res := patches.shape[1] % temporal_focus_size:
224
+ repeats = patches[:, -1:].repeat(1, temporal_focus_size - res, 1, 1, 1)
225
+ patches = torch.cat([patches, repeats], dim=1)
226
+ batch_size, grid_t, channel = patches.shape[:3]
227
+ grid_t = grid_t // temporal_patch_size
228
+ grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
229
+
230
+ patches = patches.view(
231
+ batch_size,
232
+ grid_t,
233
+ temporal_patch_size,
234
+ channel,
235
+ grid_h // merge_size,
236
+ merge_size,
237
+ patch_size,
238
+ grid_w // merge_size,
239
+ merge_size,
240
+ patch_size,
241
+ )
242
+ patches = patches.permute(0, 1, 4, 7, 5, 8, 3, 2, 6, 9)
243
+ flatten_patches = patches.reshape(
244
+ batch_size,
245
+ grid_t * grid_h * grid_w,
246
+ channel * temporal_patch_size * patch_size * patch_size,
247
+ )
248
+
249
+ processed_videos_grouped[shape] = flatten_patches
250
+ processed_grids[shape] = [[grid_t, grid_h, grid_w]] * batch_size
251
+
252
+ processed_videos = reorder_videos(processed_videos_grouped, grouped_videos_index)
253
+ processed_grids = reorder_videos(processed_grids, grouped_videos_index)
254
+ pixel_values_videos = torch.cat(processed_videos, dim=0)
255
+ video_grid_thw = torch.tensor(processed_grids)
256
+ data = {
257
+ "pixel_values_videos": pixel_values_videos,
258
+ "video_grid_thw": video_grid_thw,
259
+ }
260
+
261
+ return BatchFeature(data=data, tensor_type=return_tensors)
262
+
263
+
264
+ __all__ = ["ZFQwen3VLVideoProcessor"]
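Reviewer note: the shapes produced by `_preprocess` follow from a few integer divisions, which the sketch below reproduces without touching any tensors. It assumes the video has already been resized so that height and width are multiples of `patch_size * merge_size * focus_size`, as `smart_resize` is asked to ensure. The function name `video_grid` and the example sizes are illustrative. The final token count applies only the `merge_size**2` division seen in `_get_num_multimodal_tokens`; whether `focus_size` reduces it further downstream is not visible in this diff.

```python
def video_grid(
    num_frames: int,
    height: int,
    width: int,
    patch_size: int = 16,
    temporal_patch_size: int = 2,
    merge_size: int = 2,
    focus_size: int = 2,
    channels: int = 3,
):
    """Illustrative shape arithmetic for one resized video."""
    # Frames are padded (by repeating the last one) to a multiple of this value.
    temporal_focus_size = temporal_patch_size * focus_size
    remainder = num_frames % temporal_focus_size
    padded_frames = num_frames + (temporal_focus_size - remainder if remainder else 0)

    grid_t = padded_frames // temporal_patch_size
    grid_h = height // patch_size
    grid_w = width // patch_size

    num_patches = grid_t * grid_h * grid_w
    patch_dim = channels * temporal_patch_size * patch_size * patch_size
    num_tokens = num_patches // merge_size**2
    return (grid_t, grid_h, grid_w), (num_patches, patch_dim), num_tokens


# 10 frames of 128x192 pixels: frames are padded to 12 before patching.
grid_thw, flat_shape, num_tokens = video_grid(num_frames=10, height=128, width=192)
```

Here `grid_thw` corresponds to one row of `video_grid_thw`, and `flat_shape` to the per-video slice of `pixel_values_videos`.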
vocab.json ADDED
The diff for this file is too large to render. See raw diff