TYTTYTTYT committed · verified
Commit 8e9cbfb · 1 Parent(s): 087a9a0

Fixed bug in resize logic
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for use of the model without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for use of the model when fine-tuned for a task, or when plugged into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here. -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly. -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and BibTeX information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
chat_template.jinja ADDED
@@ -0,0 +1,120 @@
+ {%- if tools %}
+     {{- '<|im_start|>system\n' }}
+     {%- if messages[0].role == 'system' %}
+         {%- if messages[0].content is string %}
+             {{- messages[0].content }}
+         {%- else %}
+             {%- for content in messages[0].content %}
+                 {%- if 'text' in content %}
+                     {{- content.text }}
+                 {%- endif %}
+             {%- endfor %}
+         {%- endif %}
+         {{- '\n\n' }}
+     {%- endif %}
+     {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+     {%- for tool in tools %}
+         {{- "\n" }}
+         {{- tool | tojson }}
+     {%- endfor %}
+     {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+     {%- if messages[0].role == 'system' %}
+         {{- '<|im_start|>system\n' }}
+         {%- if messages[0].content is string %}
+             {{- messages[0].content }}
+         {%- else %}
+             {%- for content in messages[0].content %}
+                 {%- if 'text' in content %}
+                     {{- content.text }}
+                 {%- endif %}
+             {%- endfor %}
+         {%- endif %}
+         {{- '<|im_end|>\n' }}
+     {%- endif %}
+ {%- endif %}
+ {%- set image_count = namespace(value=0) %}
+ {%- set video_count = namespace(value=0) %}
+ {%- for message in messages %}
+     {%- if message.role == "user" %}
+         {{- '<|im_start|>' + message.role + '\n' }}
+         {%- if message.content is string %}
+             {{- message.content }}
+         {%- else %}
+             {%- for content in message.content %}
+                 {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                     {%- set image_count.value = image_count.value + 1 %}
+                     {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                     <|vision_start|><|image_pad|><|vision_end|>
+                 {%- elif content.type == 'video' or 'video' in content %}
+                     {%- set video_count.value = video_count.value + 1 %}
+                     {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                     <|vision_start|><|video_pad|><|vision_end|>
+                 {%- elif 'text' in content %}
+                     {{- content.text }}
+                 {%- endif %}
+             {%- endfor %}
+         {%- endif %}
+         {{- '<|im_end|>\n' }}
+     {%- elif message.role == "assistant" %}
+         {{- '<|im_start|>' + message.role + '\n' }}
+         {%- if message.content is string %}
+             {{- message.content }}
+         {%- else %}
+             {%- for content_item in message.content %}
+                 {%- if 'text' in content_item %}
+                     {{- content_item.text }}
+                 {%- endif %}
+             {%- endfor %}
+         {%- endif %}
+         {%- if message.tool_calls %}
+             {%- for tool_call in message.tool_calls %}
+                 {%- if (loop.first and message.content) or (not loop.first) %}
+                     {{- '\n' }}
+                 {%- endif %}
+                 {%- if tool_call.function %}
+                     {%- set tool_call = tool_call.function %}
+                 {%- endif %}
+                 {{- '<tool_call>\n{"name": "' }}
+                 {{- tool_call.name }}
+                 {{- '", "arguments": ' }}
+                 {%- if tool_call.arguments is string %}
+                     {{- tool_call.arguments }}
+                 {%- else %}
+                     {{- tool_call.arguments | tojson }}
+                 {%- endif %}
+                 {{- '}\n</tool_call>' }}
+             {%- endfor %}
+         {%- endif %}
+         {{- '<|im_end|>\n' }}
+     {%- elif message.role == "tool" %}
+         {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+             {{- '<|im_start|>user' }}
+         {%- endif %}
+         {{- '\n<tool_response>\n' }}
+         {%- if message.content is string %}
+             {{- message.content }}
+         {%- else %}
+             {%- for content in message.content %}
+                 {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                     {%- set image_count.value = image_count.value + 1 %}
+                     {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                     <|vision_start|><|image_pad|><|vision_end|>
+                 {%- elif content.type == 'video' or 'video' in content %}
+                     {%- set video_count.value = video_count.value + 1 %}
+                     {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                     <|vision_start|><|video_pad|><|vision_end|>
+                 {%- elif 'text' in content %}
+                     {{- content.text }}
+                 {%- endif %}
+             {%- endfor %}
+         {%- endif %}
+         {{- '\n</tool_response>' }}
+         {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+             {{- '<|im_end|>\n' }}
+         {%- endif %}
+     {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+     {{- '<|im_start|>assistant\n' }}
+ {%- endif %}
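The assistant branch of the template serializes each tool call into a `<tool_call>` JSON block, unwrapping OpenAI-style `{"function": {...}}` wrappers and passing string arguments through as-is. A minimal Python sketch of that formatting logic (the helper name `format_tool_call` is ours, not part of the repo):

```python
import json

def format_tool_call(tool_call: dict) -> str:
    # Unwrap OpenAI-style {"function": {...}} wrappers, as the template does.
    if "function" in tool_call:
        tool_call = tool_call["function"]
    arguments = tool_call["arguments"]
    # Arguments may already be a JSON string; otherwise serialize them.
    if not isinstance(arguments, str):
        arguments = json.dumps(arguments)
    # Emit the same <tool_call> block the template produces.
    return '<tool_call>\n{"name": "%s", "arguments": %s}\n</tool_call>' % (
        tool_call["name"],
        arguments,
    )

call = {"function": {"name": "get_weather", "arguments": {"city": "Paris"}}}
block = format_tool_call(call)
```

This is a sketch for understanding the template's output shape; in practice the rendering is done by `tokenizer.apply_chat_template`, not by hand.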
image_processing_qwen2_vl.py ADDED
@@ -0,0 +1,474 @@
+ import math
+
+ import numpy as np
+
+ from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
+ from transformers.image_transforms import (
+     convert_to_rgb,
+     resize,
+     to_channel_dimension_format,
+ )
+ from transformers.image_utils import (
+     OPENAI_CLIP_MEAN,
+     OPENAI_CLIP_STD,
+     ChannelDimension,
+     ImageInput,
+     PILImageResampling,
+     get_image_size,
+     infer_channel_dimension_format,
+     is_scaled_image,
+     make_flat_list_of_images,
+     to_numpy_array,
+     valid_images,
+     validate_preprocess_arguments,
+ )
+ from transformers.processing_utils import ImagesKwargs
+ from transformers.utils import TensorType, logging
+ from transformers.video_utils import VideoInput
+
+
+ logger = logging.get_logger(__name__)
+
+
+ class Qwen2VLImageProcessorKwargs(ImagesKwargs, total=False):
+     r"""
+     min_pixels (`int`, *optional*, defaults to `56 * 56`):
+         The minimum number of pixels to resize the image to.
+     max_pixels (`int`, *optional*, defaults to `28 * 28 * 1280`):
+         The maximum number of pixels to resize the image to.
+     patch_size (`int`, *optional*, defaults to 14):
+         The spatial patch size of the vision encoder.
+     temporal_patch_size (`int`, *optional*, defaults to 2):
+         The temporal patch size of the vision encoder.
+     merge_size (`int`, *optional*, defaults to 2):
+         The merge size of the vision encoder to llm encoder.
+     """
+
+     min_pixels: int
+     max_pixels: int
+     patch_size: int
+     temporal_patch_size: int
+     merge_size: int
+
+
+ def smart_resize(
+     height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
+ ):
+     """Rescales the image so that the following conditions are met:
+
+     1. Both dimensions (height and width) are divisible by 'factor'.
+
+     2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
+
+     3. The aspect ratio of the image is maintained as closely as possible.
+     """
+     if max(height, width) / min(height, width) > 200:
+         raise ValueError(
+             f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
+         )
+     h_bar = round(height / factor) * factor
+     w_bar = round(width / factor) * factor
+     if h_bar * w_bar > max_pixels:
+         beta = math.sqrt((height * width) / max_pixels)
+         h_bar = max(factor, math.floor(height / beta / factor) * factor)
+         w_bar = max(factor, math.floor(width / beta / factor) * factor)
+     elif h_bar * w_bar < min_pixels:
+         beta = math.sqrt(min_pixels / (height * width))
+         h_bar = math.ceil(height * beta / factor) * factor
+         w_bar = math.ceil(width * beta / factor) * factor
+     return h_bar, w_bar
+
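The resize rule can be checked numerically. Below is a standalone copy of `smart_resize` (no transformers dependency, added here only for illustration) applied to a 1000×2000 image: rounding to the 28-pixel factor overshoots `max_pixels`, so both sides are scaled down by `beta` and floored back onto the factor grid.

```python
import math

def smart_resize(height, width, factor=28, min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    # Standalone copy of the function above, for experimentation.
    if max(height, width) / min(height, width) > 200:
        raise ValueError("absolute aspect ratio must be smaller than 200")
    # Snap both sides to the nearest multiple of `factor`.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too many pixels: shrink both sides by beta, flooring onto the grid.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Too few pixels: grow both sides by beta, ceiling onto the grid.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

h, w = smart_resize(1000, 2000)  # both divisible by 28, product <= max_pixels
```

A 1000×2000 input comes back as 700×1400: the 2:1 aspect ratio is preserved exactly, and 700 · 1400 = 980,000 pixels sits just under the 1,003,520-pixel cap.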
83
+ class ZFQwen2VLImageProcessor(BaseImageProcessor):
84
+ r"""
85
+ Constructs a Qwen2-VL image processor that dynamically resizes images based on the original images.
86
+
87
+ Args:
88
+ do_resize (`bool`, *optional*, defaults to `True`):
89
+ Whether to resize the image's (height, width) dimensions.
90
+ size (`dict[str, int]`, *optional*, defaults to `{"shortest_edge": 56 * 56, "longest_edge": 28 * 28 * 1280}`):
91
+ Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present.
92
+ resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
93
+ Resampling filter to use when resizing the image.
94
+ do_rescale (`bool`, *optional*, defaults to `True`):
95
+ Whether to rescale the image by the specified scale `rescale_factor`.
96
+ rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
97
+ Scale factor to use if rescaling the image.
98
+ do_normalize (`bool`, *optional*, defaults to `True`):
99
+ Whether to normalize the image.
100
+ image_mean (`float` or `list[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
101
+ Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
102
+ image_std (`float` or `list[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
103
+ Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
104
+ do_convert_rgb (`bool`, *optional*, defaults to `True`):
105
+ Whether to convert the image to RGB.
106
+ min_pixels (`int`, *optional*, defaults to `56 * 56`):
107
+ The min pixels of the image to resize the image.
108
+ max_pixels (`int`, *optional*, defaults to `28 * 28 * 1280`):
109
+ The max pixels of the image to resize the image.
110
+ patch_size (`int`, *optional*, defaults to 14):
111
+ The spatial patch size of the vision encoder.
112
+ temporal_patch_size (`int`, *optional*, defaults to 2):
113
+ The temporal patch size of the vision encoder.
114
+ merge_size (`int`, *optional*, defaults to 2):
115
+ The merge size of the vision encoder to llm encoder.
116
+ """
117
+
118
+ model_input_names = ["pixel_values", "image_grid_thw"]
119
+ valid_kwargs = Qwen2VLImageProcessorKwargs
120
+
121
+ def __init__(
122
+ self,
123
+ do_resize: bool = True,
124
+ size: dict[str, int] | None = None,
125
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
126
+ do_rescale: bool = True,
127
+ rescale_factor: int | float = 1 / 255,
128
+ do_normalize: bool = True,
129
+ image_mean: float | list[float] | None = None,
130
+ image_std: float | list[float] | None = None,
131
+ do_convert_rgb: bool = True,
132
+ min_pixels: int | None = None,
133
+ max_pixels: int | None = None,
134
+ patch_size: int = 14,
135
+ temporal_patch_size: int = 2,
136
+ merge_size: int = 2,
137
+ **kwargs,
138
+ ) -> None:
139
+ super().__init__(**kwargs)
140
+ if size is not None:
141
+ if "shortest_edge" not in size or "longest_edge" not in size:
142
+ raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
143
+ else:
144
+ size = {"shortest_edge": 56 * 56, "longest_edge": 28 * 28 * 1280}
145
+ # backward compatibility: override size with min_pixels and max_pixels if they are provided
146
+ if min_pixels is not None:
147
+ size["shortest_edge"] = min_pixels
148
+ if max_pixels is not None:
149
+ size["longest_edge"] = max_pixels
150
+ self.min_pixels = size["shortest_edge"]
151
+ self.max_pixels = size["longest_edge"]
152
+ self.size = size
153
+
154
+ self.do_resize = do_resize
155
+ self.resample = resample
156
+ self.do_rescale = do_rescale
157
+ self.rescale_factor = rescale_factor
158
+ self.do_normalize = do_normalize
159
+ self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
160
+ self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
161
+
162
+ self.patch_size = patch_size
163
+ self.temporal_patch_size = temporal_patch_size
164
+ self.merge_size = merge_size
165
+ self.do_convert_rgb = do_convert_rgb
166
+
167
+ def _preprocess(
168
+ self,
169
+ images: ImageInput | VideoInput,
170
+ do_resize: bool | None = None,
171
+ size: dict[str, int] | None = None,
172
+ resample: PILImageResampling | None = None,
173
+ do_rescale: bool | None = None,
174
+ rescale_factor: float | None = None,
175
+ do_normalize: bool | None = None,
176
+ image_mean: float | list[float] | None = None,
177
+ image_std: float | list[float] | None = None,
178
+ patch_size: int | None = None,
179
+ temporal_patch_size: int | None = None,
180
+ merge_size: int | None = None,
181
+ do_convert_rgb: bool | None = None,
182
+ data_format: ChannelDimension | None = ChannelDimension.FIRST,
183
+ input_data_format: str | ChannelDimension | None = None,
184
+ ):
185
+ """
186
+ Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.
187
+
188
+ Args:
189
+ images (`ImageInput`):
190
+ Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
191
+ vision_info (`list[Dict]`, *optional*):
192
+ Optional list of dictionaries containing additional information about vision inputs.
193
+ do_resize (`bool`, *optional*, defaults to `self.do_resize`):
194
+ Whether to resize the image.
195
+ size (`dict[str, int]`, *optional*, defaults to `self.size`):
196
+ Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present.
197
+ resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
198
+ Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
199
+ do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
200
+ Whether to rescale the image.
201
+ rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
202
+ Scale factor to use if rescaling the image.
203
+ do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
204
+ Whether to normalize the image.
205
+ image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
206
+ Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
207
+ image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
208
+ Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
209
+ patch_size (`int`, *optional*, defaults to `self.patch_size`):
210
+ The spatial patch size of the vision encoder.
211
+ temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
212
+ The temporal patch size of the vision encoder.
213
+ merge_size (`int`, *optional*, defaults to `self.merge_size`):
214
+ The merge size of the vision encoder to llm encoder.
215
+ do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
216
+ Whether to convert the image to RGB.
217
+ data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
218
+ The channel dimension format for the output image. Can be one of:
219
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
220
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
221
+ - Unset: Use the channel dimension format of the input image.
222
+ input_data_format (`ChannelDimension` or `str`, *optional*):
223
+ The channel dimension format for the input image. Can be one of:
224
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
225
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
226
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
227
+ """
228
+ images = make_flat_list_of_images(images)
229
+
230
+ if do_convert_rgb:
231
+ images = [convert_to_rgb(image) for image in images]
232
+
233
+ # All transformations expect numpy arrays.
234
+ images = [to_numpy_array(image) for image in images]
235
+
236
+ if do_rescale and is_scaled_image(images[0]):
237
+ logger.warning_once(
238
+ "It looks like you are trying to rescale already rescaled images. If the input"
239
+ " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
240
+ )
241
+ if input_data_format is None:
242
+ # We assume that all images have the same channel dimension format.
243
+ input_data_format = infer_channel_dimension_format(images[0])
244
+
245
+ height, width = get_image_size(images[0], channel_dim=input_data_format)
246
+ resized_height, resized_width = height, width
247
+ processed_images = []
248
+ for image in images:
249
+ if do_resize:
250
+ resized_height, resized_width = smart_resize(
251
+ height,
252
+ width,
253
+ factor=patch_size * merge_size,
254
+ min_pixels=size["shortest_edge"],
255
+ max_pixels=size["longest_edge"],
256
+ )
257
+ image = resize(
258
+ image, size=(resized_height, resized_width), resample=resample, input_data_format=input_data_format
259
+ )
260
+
261
+ if do_rescale:
262
+ image = self.rescale(image, scale=rescale_factor, input_data_format=input_data_format)
263
+
264
+ if do_normalize:
265
+ image = self.normalize(
266
+ image=image, mean=image_mean, std=image_std, input_data_format=input_data_format
267
+ )
268
+
269
+ image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
270
+ processed_images.append(image)
271
+
272
+ patches = np.array(processed_images)
273
+ if data_format == ChannelDimension.LAST:
274
+ patches = patches.transpose(0, 3, 1, 2)
275
+ if patches.shape[0] % temporal_patch_size != 0:
276
+ repeats = np.repeat(
277
+ patches[-1][np.newaxis], temporal_patch_size - (patches.shape[0] % temporal_patch_size), axis=0
278
+ )
279
+ patches = np.concatenate([patches, repeats], axis=0)
280
+ channel = patches.shape[1]
281
+ grid_t = patches.shape[0] // temporal_patch_size
282
+ grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
283
+ patches = patches.reshape(
284
+ grid_t,
285
+ temporal_patch_size,
286
+ channel,
287
+ grid_h // merge_size,
288
+ merge_size,
289
+ patch_size,
290
+ grid_w // merge_size,
291
+ merge_size,
292
+ patch_size,
293
+ )
294
+ patches = patches.transpose(0, 3, 6, 4, 7, 2, 1, 5, 8)
295
+ flatten_patches = patches.reshape(
296
+ grid_t * grid_h * grid_w, channel * temporal_patch_size * patch_size * patch_size
297
+ )
298
+
299
+ return flatten_patches, (grid_t, grid_h, grid_w)
300
+
301
+ def preprocess(
302
+ self,
303
+ images: ImageInput,
304
+ do_resize: bool | None = None,
305
+ size: dict[str, int] | None = None,
306
+ min_pixels: int | None = None,
307
+ max_pixels: int | None = None,
308
+ resample: PILImageResampling | None = None,
309
+ do_rescale: bool | None = None,
310
+ rescale_factor: float | None = None,
311
+ do_normalize: bool | None = None,
312
+ image_mean: float | list[float] | None = None,
313
+ image_std: float | list[float] | None = None,
314
+ patch_size: int | None = None,
315
+ temporal_patch_size: int | None = None,
316
+ merge_size: int | None = None,
317
+ do_convert_rgb: bool | None = None,
318
+ return_tensors: str | TensorType | None = None,
319
+ data_format: ChannelDimension | None = ChannelDimension.FIRST,
320
+ input_data_format: str | ChannelDimension | None = None,
321
+ ):
322
+ """
323
+ Args:
324
+ images (`ImageInput`):
325
+ Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
326
+ passing in images with pixel values between 0 and 1, set `do_rescale=False`.
327
+ do_resize (`bool`, *optional*, defaults to `self.do_resize`):
328
+ Whether to resize the image.
329
+ size (`dict[str, int]`, *optional*, defaults to `self.size`):
330
+ Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
331
+ the longest edge resized to keep the input aspect ratio.
332
+ resample (`int`, *optional*, defaults to `self.resample`):
333
+ Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
334
+ has an effect if `do_resize` is set to `True`.
335
+ do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
336
+ Whether to rescale the image.
337
+ rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
338
+ Rescale factor to rescale the image by if `do_rescale` is set to `True`.
339
+ do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
340
+ Whether to normalize the image.
341
+ image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
342
+ Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
343
+ image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
344
+ Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
345
+ `True`.
346
+ min_pixels (`int`, *optional*, defaults to `self.min_pixels`):
347
+ The min pixels of the image to resize the image.
348
+ max_pixels (`int`, *optional*, defaults to `self.max_pixels`):
349
+ The max pixels of the image to resize the image.
350
+ patch_size (`int`, *optional*, defaults to `self.patch_size`):
351
+ The spatial patch size of the vision encoder.
352
+ temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
353
+ The temporal patch size of the vision encoder.
354
+ merge_size (`int`, *optional*, defaults to `self.merge_size`):
355
+ The merge size of the vision encoder to llm encoder.
356
+ do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
357
+ Whether to convert the image to RGB.
358
+ return_tensors (`str` or `TensorType`, *optional*):
359
+ The type of tensors to return. Can be one of:
360
+ - Unset: Return a list of `np.ndarray`.
361
+ - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
362
+ - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
363
+ data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
364
+ The channel dimension format for the output image. Can be one of:
365
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
366
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
367
+ - Unset: Use the channel dimension format of the input image.
368
+ input_data_format (`ChannelDimension` or `str`, *optional*):
369
+ The channel dimension format for the input image. If unset, the channel dimension format is inferred
370
+ from the input image. Can be one of:
371
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
372
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
373
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
374
+
375
+ """
+        min_pixels = min_pixels if min_pixels is not None else self.min_pixels
+        max_pixels = max_pixels if max_pixels is not None else self.max_pixels
+
+        if size is not None:
+            if "shortest_edge" not in size or "longest_edge" not in size:
+                raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
+            min_pixels = size["shortest_edge"]
+        elif min_pixels is not None and max_pixels is not None:
+            # backward compatibility: override size with min_pixels and max_pixels if they are provided
+            size = {"shortest_edge": min_pixels, "longest_edge": max_pixels}
+        else:
+            size = {**self.size}
+
+        do_resize = do_resize if do_resize is not None else self.do_resize
+        resample = resample if resample is not None else self.resample
+        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
+        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
+        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
+        image_mean = image_mean if image_mean is not None else self.image_mean
+        image_std = image_std if image_std is not None else self.image_std
+        patch_size = patch_size if patch_size is not None else self.patch_size
+        temporal_patch_size = temporal_patch_size if temporal_patch_size is not None else self.temporal_patch_size
+        merge_size = merge_size if merge_size is not None else self.merge_size
+        do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
+
+        if images is not None:
+            images = self.fetch_images(images)
+            images = make_flat_list_of_images(images)
+
+        if images is not None and not valid_images(images):
+            raise ValueError("Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, or torch.Tensor")
+
+        validate_preprocess_arguments(
+            rescale_factor=rescale_factor,
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_resize=do_resize,
+            size=size,
+            resample=resample,
+        )
+
+        data = {}
+        pixel_values, vision_grid_thws = [], []
+        for image in images:
+            patches, image_grid_thw = self._preprocess(
+                image,
+                do_resize=do_resize,
+                size=size,
+                resample=resample,
+                do_rescale=do_rescale,
+                rescale_factor=rescale_factor,
+                do_normalize=do_normalize,
+                image_mean=image_mean,
+                image_std=image_std,
+                patch_size=patch_size,
+                temporal_patch_size=temporal_patch_size,
+                merge_size=merge_size,
+                data_format=data_format,
+                do_convert_rgb=do_convert_rgb,
+                input_data_format=input_data_format,
+            )
+            pixel_values.extend(patches)
+            vision_grid_thws.append(image_grid_thw)
+        pixel_values = np.array(pixel_values)
+        vision_grid_thws = np.array(vision_grid_thws)
+        data.update({"pixel_values": pixel_values, "image_grid_thw": vision_grid_thws})
+
+        return BatchFeature(data=data, tensor_type=return_tensors)
+
+    def get_number_of_image_patches(self, height: int, width: int, images_kwargs=None):
+        """
+        A utility that returns number of image patches for a given image size.
+
+        Args:
+            height (`int`):
+                Height of the input image.
+            width (`int`):
+                Width of the input image.
+            images_kwargs (`dict`, *optional*):
+                Any kwargs to override defaults of the image processor.
+        Returns:
+            `int`: Number of image patches per image.
+        """
+        images_kwargs = images_kwargs if images_kwargs is not None else {}
+        min_pixels = images_kwargs.get("min_pixels", self.size["shortest_edge"])
+        max_pixels = images_kwargs.get("max_pixels", self.size["longest_edge"])
+        patch_size = images_kwargs.get("patch_size", self.patch_size)
+        merge_size = images_kwargs.get("merge_size", self.merge_size)
+
+        factor = patch_size * merge_size
+        resized_height, resized_width = smart_resize(
+            height, width, factor, min_pixels=min_pixels, max_pixels=max_pixels
+        )
+        grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
+        return grid_h * grid_w
+
+
+__all__ = ["ZFQwen2VLImageProcessor"]
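The `get_number_of_image_patches` utility above delegates its resize math to `smart_resize`, which is defined earlier in this module but not shown in this diff hunk. As a hedged standalone sketch — assuming the function matches the released Qwen2-VL implementation, with defaults mirroring this processor's `factor = patch_size * merge_size = 28`, `min_pixels = 56 * 56`, and `max_pixels = 28 * 28 * 1280` — the 2D variant behaves like:

```python
import math

def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 56 * 56, max_pixels: int = 28 * 28 * 1280):
    """Round (height, width) to multiples of `factor`, then rescale so the
    total pixel count lands inside [min_pixels, max_pixels]."""
    if height < factor or width < factor:
        raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # too many pixels: scale down, flooring to the next multiple of factor
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # too few pixels: scale up, ceiling to the next multiple of factor
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

print(smart_resize(1000, 1000))  # (980, 980): 1008x1008 would exceed max_pixels
print(smart_resize(500, 500))    # (504, 504): nearest multiple of 28 already fits
```

With a 1000×1000 input this yields 980×980, i.e. `(980 // 14) ** 2 = 4900` patches at the default `patch_size = 14`.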
image_processing_qwen2_vl_fast.py ADDED
@@ -0,0 +1,240 @@
+from typing import Optional, Union
+
+import torch
+import torchvision.transforms.v2.functional as tvF
+
+from transformers.image_processing_utils import BatchFeature
+from transformers.image_processing_utils_fast import (
+    BaseImageProcessorFast,
+    group_images_by_shape,
+    reorder_images,
+)
+from transformers.image_utils import (
+    OPENAI_CLIP_MEAN,
+    OPENAI_CLIP_STD,
+    ChannelDimension,
+    ImageInput,
+    PILImageResampling,
+    SizeDict,
+)
+from transformers.processing_utils import Unpack
+from transformers.utils import (
+    TensorType,
+    auto_docstring,
+    logging,
+)
+from .image_processing_qwen2_vl import Qwen2VLImageProcessorKwargs, smart_resize
+
+
+logger = logging.get_logger(__name__)
+
+
+@auto_docstring
+class ZFQwen2VLImageProcessorFast(BaseImageProcessorFast):
+    do_resize = True
+    resample = PILImageResampling.BICUBIC
+    size = {"shortest_edge": 56 * 56, "longest_edge": 28 * 28 * 1280}
+    do_rescale = True
+    do_normalize = True
+    image_mean = OPENAI_CLIP_MEAN
+    image_std = OPENAI_CLIP_STD
+    do_convert_rgb = True
+    patch_size = 14
+    temporal_patch_size = 2
+    merge_size = 2
+    valid_kwargs = Qwen2VLImageProcessorKwargs
+    model_input_names = ["pixel_values", "image_grid_thw"]
+
+    def __init__(self, **kwargs: Unpack[Qwen2VLImageProcessorKwargs]):
+        size = kwargs.pop("size", None)
+        min_pixels = kwargs.pop("min_pixels", None)
+        max_pixels = kwargs.pop("max_pixels", None)
+        # backward compatibility: override size with min_pixels and max_pixels if they are provided
+        size = self.size if size is None else size
+        if min_pixels is not None:
+            size["shortest_edge"] = min_pixels
+            size.pop("min_pixels", None)
+        if max_pixels is not None:
+            size["longest_edge"] = max_pixels
+            size.pop("max_pixels", None)
+        if "shortest_edge" not in size or "longest_edge" not in size:
+            raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
+
+        super().__init__(size=size, **kwargs)
+
+    def _further_process_kwargs(
+        self,
+        size: SizeDict | None = None,
+        min_pixels: int | None = None,
+        max_pixels: int | None = None,
+        **kwargs,
+    ) -> dict:
+        """
+        Update kwargs that need further processing before being validated.
+        Can be overridden by subclasses to customize the processing of kwargs.
+        """
+        if min_pixels is not None and max_pixels is not None:
+            size = {"shortest_edge": min_pixels, "longest_edge": max_pixels}
+        elif size is not None:
+            if "shortest_edge" not in size or "longest_edge" not in size:
+                raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
+            min_pixels = size["shortest_edge"]
+            max_pixels = size["longest_edge"]
+        else:
+            size = {**self.size}
+
+        return super()._further_process_kwargs(size=size, **kwargs)
+
+    @auto_docstring
+    def preprocess(
+        self,
+        images: ImageInput,
+        **kwargs: Unpack[Qwen2VLImageProcessorKwargs],
+    ) -> BatchFeature:
+        return super().preprocess(images, **kwargs)
+
+    def _preprocess_image_like_inputs(
+        self,
+        images: ImageInput,
+        do_convert_rgb: bool,
+        input_data_format: ChannelDimension,
+        device: Union[str, "torch.device"] | None = None,
+        **kwargs: Unpack[Qwen2VLImageProcessorKwargs],
+    ) -> BatchFeature:
+        """
+        Preprocess image-like inputs.
+        To be overridden by subclasses when image-like inputs other than images should be processed.
+        It can be used for segmentation maps, depth maps, etc.
+        """
+        # Prepare input images
+        images = self._prepare_image_like_inputs(
+            images=images, do_convert_rgb=do_convert_rgb, input_data_format=input_data_format, device=device
+        )
+        batch_feature = self._preprocess(images, **kwargs)
+        return batch_feature
+
+    def _preprocess(
+        self,
+        images: list["torch.Tensor"],
+        do_resize: bool,
+        size: SizeDict,
+        interpolation: Optional["tvF.InterpolationMode"],
+        do_rescale: bool,
+        rescale_factor: float,
+        do_normalize: bool,
+        image_mean: float | list[float] | None,
+        image_std: float | list[float] | None,
+        patch_size: int,
+        temporal_patch_size: int,
+        merge_size: int,
+        disable_grouping: bool | None,
+        return_tensors: str | TensorType | None,
+        **kwargs,
+    ):
+        # Group images by size for batched resizing
+        grouped_images, grouped_images_index = group_images_by_shape(images, disable_grouping=disable_grouping)
+        resized_images_grouped = {}
+        for shape, stacked_images in grouped_images.items():
+            height, width = stacked_images.shape[-2:]
+            if do_resize:
+                resized_height, resized_width = smart_resize(
+                    height,
+                    width,
+                    factor=patch_size * merge_size,
+                    min_pixels=size["shortest_edge"],
+                    max_pixels=size["longest_edge"],
+                )
+                stacked_images = self.resize(
+                    image=stacked_images,
+                    size=SizeDict(height=resized_height, width=resized_width),
+                    interpolation=interpolation,
+                )
+            resized_images_grouped[shape] = stacked_images
+        resized_images = reorder_images(resized_images_grouped, grouped_images_index)
+
+        # Group images by size for further processing
+        # Needed in case do_resize is False, or resize returns images with different sizes
+        grouped_images, grouped_images_index = group_images_by_shape(resized_images, disable_grouping=disable_grouping)
+        processed_images_grouped = {}
+        processed_grids = {}
+        for shape, stacked_images in grouped_images.items():
+            resized_height, resized_width = stacked_images.shape[-2:]
+            # Fused rescale and normalize
+            patches = self.rescale_and_normalize(
+                stacked_images, do_rescale, rescale_factor, do_normalize, image_mean, image_std
+            )
+            if patches.ndim == 4:
+                # add a temporal dimension if we have images
+                patches = patches.unsqueeze(1)
+            if patches.shape[1] % temporal_patch_size != 0:
+                repeats = patches[:, -1:].repeat(1, temporal_patch_size - 1, 1, 1, 1)
+                patches = torch.cat([patches, repeats], dim=1)
+            batch_size, grid_t, channel = patches.shape[:3]
+            grid_t = grid_t // temporal_patch_size
+            grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
+
+            patches = patches.view(
+                batch_size,
+                grid_t,
+                temporal_patch_size,
+                channel,
+                grid_h // merge_size,
+                merge_size,
+                patch_size,
+                grid_w // merge_size,
+                merge_size,
+                patch_size,
+            )
+            # Reorder dimensions to group grid and patch information for subsequent flattening.
+            # (batch, grid_t, grid_h, grid_w, merge_h, merge_w, channel, temp_patch_size, patch_h, patch_w)
+            patches = patches.permute(0, 1, 4, 7, 5, 8, 3, 2, 6, 9)
+            flatten_patches = patches.reshape(
+                batch_size,
+                grid_t * grid_h * grid_w,
+                channel * temporal_patch_size * patch_size * patch_size,
+            )
+
+            processed_images_grouped[shape] = flatten_patches
+            processed_grids[shape] = [[grid_t, grid_h, grid_w]] * batch_size
+
+        processed_images = reorder_images(processed_images_grouped, grouped_images_index)
+        processed_grids = reorder_images(processed_grids, grouped_images_index)
+        pixel_values = torch.cat(processed_images, dim=0)
+        image_grid_thw = torch.tensor(processed_grids)
+
+        return BatchFeature(
+            data={"pixel_values": pixel_values, "image_grid_thw": image_grid_thw}, tensor_type=return_tensors
+        )
+
+    def get_number_of_image_patches(self, height: int, width: int, images_kwargs=None):
+        """
+        A utility that returns number of image patches for a given image size.
+
+        Note: Do not remove this method! It is used by vLLM to infer the number of patches and placeholders
+        without an image input.
+
+        Args:
+            height (`int`):
+                Height of the input image.
+            width (`int`):
+                Width of the input image.
+            images_kwargs (`dict`, *optional*):
+                Any kwargs to override defaults of the image processor.
+        Returns:
+            `int`: Number of image patches per image.
+        """
+        images_kwargs = images_kwargs if images_kwargs is not None else {}
+        min_pixels = images_kwargs.get("min_pixels", self.size["shortest_edge"])
+        max_pixels = images_kwargs.get("max_pixels", self.size["longest_edge"])
+        patch_size = images_kwargs.get("patch_size", self.patch_size)
+        merge_size = images_kwargs.get("merge_size", self.merge_size)
+
+        factor = patch_size * merge_size
+        resized_height, resized_width = smart_resize(
+            height, width, factor, min_pixels=min_pixels, max_pixels=max_pixels
+        )
+        grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
+        return grid_h * grid_w
+
+
+__all__ = ["ZFQwen2VLImageProcessorFast"]
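The view/permute/reshape sequence in `_preprocess` above can be checked shape-wise with plain NumPy. The toy sizes below are hypothetical, chosen only to keep the arrays small; the dimension bookkeeping is the same as in the processor:

```python
import numpy as np

# Toy configuration (hypothetical small values, not the model defaults)
patch_size, temporal_patch_size, merge_size = 2, 2, 2
batch_size, channel = 1, 3
grid_t, grid_h, grid_w = 1, 4, 4

# A batch of "videos": (batch, frames, channels, height, width)
patches = np.arange(
    batch_size * grid_t * temporal_patch_size * channel * grid_h * patch_size * grid_w * patch_size
).reshape(batch_size, grid_t * temporal_patch_size, channel, grid_h * patch_size, grid_w * patch_size)

# Same view/permute/reshape sequence as ZFQwen2VLImageProcessorFast._preprocess
patches = patches.reshape(
    batch_size, grid_t, temporal_patch_size, channel,
    grid_h // merge_size, merge_size, patch_size,
    grid_w // merge_size, merge_size, patch_size,
)
# -> (batch, grid_t, grid_h/m, grid_w/m, merge_h, merge_w, channel, temporal, patch_h, patch_w)
patches = patches.transpose(0, 1, 4, 7, 5, 8, 3, 2, 6, 9)
flatten_patches = patches.reshape(
    batch_size, grid_t * grid_h * grid_w, channel * temporal_patch_size * patch_size * patch_size
)
print(flatten_patches.shape)  # (1, 16, 24): 16 = grid_t*grid_h*grid_w, 24 = C*tps*ps*ps
```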
processing_qwen3_vl.py ADDED
@@ -0,0 +1,249 @@
+import numpy as np
+
+from transformers.feature_extraction_utils import BatchFeature
+from transformers.image_utils import ImageInput
+from transformers.processing_utils import MultiModalData, ProcessingKwargs, ProcessorMixin, Unpack
+from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
+from transformers.utils import logging
+from transformers.utils.auto_docstring import auto_docstring
+from transformers.video_utils import VideoInput
+
+logger = logging.get_logger(__name__)
+
+
+class Qwen3VLProcessorKwargs(ProcessingKwargs, total=False):
+    _defaults = {  # type: ignore
+        "text_kwargs": {
+            "padding": False,
+            "return_token_type_ids": False,
+            "return_mm_token_type_ids": False,
+        },
+        "videos_kwargs": {"return_metadata": True},
+    }
+
+
+@auto_docstring
+class ZFQwen3VLProcessor(ProcessorMixin):
+    def __init__(self, image_processor=None, tokenizer=None, video_processor=None, chat_template=None, **kwargs):
+        self.image_token = "<|image_pad|>" if not hasattr(tokenizer, "image_token") else tokenizer.image_token
+        self.video_token = "<|video_pad|>" if not hasattr(tokenizer, "video_token") else tokenizer.video_token
+        self.image_token_id = (
+            tokenizer.image_token_id
+            if getattr(tokenizer, "image_token_id", None)
+            else tokenizer.convert_tokens_to_ids(self.image_token)
+        )
+        self.video_token_id = (
+            tokenizer.video_token_id
+            if getattr(tokenizer, "video_token_id", None)
+            else tokenizer.convert_tokens_to_ids(self.video_token)
+        )
+        super().__init__(image_processor, tokenizer, video_processor, chat_template=chat_template)
+        self.vision_start_token = (
+            "<|vision_start|>" if not hasattr(tokenizer, "vision_start_token") else tokenizer.vision_start_token
+        )
+        self.vision_end_token = (
+            "<|vision_end|>" if not hasattr(tokenizer, "vision_end_token") else tokenizer.vision_end_token
+        )
+        self.vision_start_token_id = (
+            tokenizer.vision_start_token_id
+            if getattr(tokenizer, "vision_start_token_id", None)
+            else tokenizer.convert_tokens_to_ids(self.vision_start_token)
+        )
+        self.vision_end_token_id = (
+            tokenizer.vision_end_token_id
+            if getattr(tokenizer, "vision_end_token_id", None)
+            else tokenizer.convert_tokens_to_ids(self.vision_end_token)
+        )
+
+    @auto_docstring
+    def __call__(
+        self,
+        images: ImageInput = None,
+        text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] = None,
+        videos: VideoInput = None,
+        **kwargs: Unpack[Qwen3VLProcessorKwargs],
+    ) -> BatchFeature:
+        r"""
+        Returns:
+            [`BatchFeature`]: A [`BatchFeature`] with the following fields:
+
+            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
+            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
+              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
+              `None`).
+            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
+            - **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
+            - **image_grid_thw** -- List of image 3D grid in LLM. Returned when `images` is not `None`.
+            - **video_grid_thw** -- List of video 3D grid in LLM. Returned when `videos` is not `None`.
+        """
+        output_kwargs = self._merge_kwargs(
+            Qwen3VLProcessorKwargs,
+            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
+            **kwargs,
+        )
+        if images is not None:
+            image_inputs = self.image_processor(images=images, **output_kwargs["images_kwargs"])
+            image_grid_thw = image_inputs["image_grid_thw"]
+        else:
+            image_inputs = {}
+            image_grid_thw = None
+
+        if videos is not None:
+            videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
+            video_grid_thw = videos_inputs["video_grid_thw"]
+            # If the user has not requested video metadata, pop it
+            if not kwargs.get("return_metadata"):
+                video_metadata = videos_inputs.pop("video_metadata")
+            else:
+                video_metadata = videos_inputs["video_metadata"]
+        else:
+            videos_inputs = {}
+            video_grid_thw = None
+
+        if not isinstance(text, list):
+            text = [text]
+
+        text = text.copy()  # the lines below change text in-place
+        if image_grid_thw is not None:
+            merge_length = self.image_processor.merge_size**2
+            index = 0
+            for i in range(len(text)):
+                while self.image_token in text[i]:
+                    num_image_tokens = image_grid_thw[index].prod() // merge_length
+                    text[i] = text[i].replace(self.image_token, "<|placeholder|>" * num_image_tokens, 1)
+                    index += 1
+                text[i] = text[i].replace("<|placeholder|>", self.image_token)
+
+        if video_grid_thw is not None:
+            merge_length = self.video_processor.merge_size**2
+            index = 0
+            for i in range(len(text)):
+                while self.video_token in text[i]:
+                    metadata = video_metadata[index]
+                    if metadata.fps is None:
+                        logger.warning_once(
+                            "Qwen3VL requires frame timestamps to construct prompts, but the `fps` of the input video could not be inferred. "
+                            "Probably `video_metadata` was missing from inputs and you passed pre-sampled frames. "
+                            "Defaulting to `fps=24`. Please provide `video_metadata` for more accurate results."
+                        )
+                    metadata.fps = 24 if metadata.fps is None else metadata.fps
+
+                    # if timestamps are not provided, calculate them
+                    curr_timestamp = self._calculate_timestamps(
+                        metadata.frames_indices,
+                        metadata.fps,
+                        self.video_processor.merge_size,
+                    )
+
+                    video_placeholder = ""
+                    frame_seqlen = video_grid_thw[index][1:].prod() // merge_length
+                    for frame_idx in range(video_grid_thw[index][0]):
+                        curr_time = curr_timestamp[frame_idx]
+                        video_placeholder += f"<{curr_time:.1f} seconds>"
+                        video_placeholder += (
+                            self.vision_start_token + "<|placeholder|>" * frame_seqlen + self.vision_end_token
+                        )
+                    if f"{self.vision_start_token}{self.video_token}{self.vision_end_token}" in text[i]:
+                        text[i] = text[i].replace(
+                            f"{self.vision_start_token}{self.video_token}{self.vision_end_token}", video_placeholder, 1
+                        )
+                    else:
+                        # vLLM may pass the video token directly
+                        text[i] = text[i].replace(self.video_token, video_placeholder, 1)
+                    index += 1
+
+                text[i] = text[i].replace("<|placeholder|>", self.video_token)
+
+        return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
+        return_mm_token_type_ids = output_kwargs["text_kwargs"].pop("return_mm_token_type_ids", None)
+        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
+        self._check_special_mm_tokens(text, text_inputs, modalities=["image", "video"])
+
+        if return_mm_token_type_ids:
+            array_ids = np.array(text_inputs["input_ids"])
+            mm_token_type_ids = np.zeros_like(text_inputs["input_ids"])
+            mm_token_type_ids[array_ids == self.image_token_id] = 1
+            text_inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()
+
+        return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs}, tensor_type=return_tensors)
+
+    def _get_num_multimodal_tokens(self, image_sizes=None, video_sizes=None, **kwargs):
+        """
+        Computes the number of placeholder tokens needed for multimodal inputs with the given sizes.
+        Args:
+            image_sizes (`list[list[int]]`, *optional*):
+                The input sizes formatted as (height, width) per each image.
+            video_sizes (`list[list[int]]`, *optional*):
+                The input sizes formatted as (num_frames, height, width) per each video.
+        Returns:
+            `MultiModalData`: A `MultiModalData` object holding number of tokens per each of the provided
+            input modalities, along with other useful data.
+        """
+
+        vision_data = {}
+        if image_sizes is not None:
+            images_kwargs = Qwen3VLProcessorKwargs._defaults.get("images_kwargs", {})
+            images_kwargs.update(kwargs)
+            merge_size = images_kwargs.get("merge_size", None) or self.image_processor.merge_size
+
+            num_image_patches = [
+                self.image_processor.get_number_of_image_patches(*image_size, images_kwargs)
+                for image_size in image_sizes
+            ]
+            num_image_tokens = [(num_patches // merge_size**2) for num_patches in num_image_patches]
+            vision_data.update({"num_image_tokens": num_image_tokens, "num_image_patches": num_image_patches})
+
+        if video_sizes is not None:
+            videos_kwargs = Qwen3VLProcessorKwargs._defaults.get("videos_kwargs", {})
+            videos_kwargs.update(kwargs)
+            # resolve merge_size here as well: the image branch above may not have run
+            merge_size = videos_kwargs.get("merge_size", None) or self.video_processor.merge_size
+            num_video_patches = [
+                self.video_processor.get_number_of_video_patches(*video_size, videos_kwargs)
+                for video_size in video_sizes
+            ]
+            num_video_tokens = [(num_patches // merge_size**2) for num_patches in num_video_patches]
+            vision_data["num_video_tokens"] = num_video_tokens
+
+        return MultiModalData(**vision_data)
+
+    def post_process_image_text_to_text(
+        self, generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False, **kwargs
+    ):
+        """
+        Post-process the output of the model to decode the text.
+
+        Args:
+            generated_outputs (`torch.Tensor` or `np.ndarray`):
+                The output of the model's `generate` function. The output is expected to be a tensor of shape
+                `(batch_size, sequence_length)` or `(sequence_length,)`.
+            skip_special_tokens (`bool`, *optional*, defaults to `True`):
+                Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.
+            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
+                Whether or not to clean up the tokenization spaces. Argument passed to the tokenizer's `batch_decode` method.
+            **kwargs:
+                Additional arguments to be passed to the tokenizer's `batch_decode` method.
+
+        Returns:
+            `list[str]`: The decoded text.
+        """
+        return self.tokenizer.batch_decode(
+            generated_outputs,
+            skip_special_tokens=skip_special_tokens,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            **kwargs,
+        )
+
+    def _calculate_timestamps(self, indices: list[int] | np.ndarray, video_fps: float, merge_size: int = 2):
+        if not isinstance(indices, list):
+            indices = indices.tolist()
+        if len(indices) % merge_size != 0:
+            indices.extend(indices[-1] for _ in range(merge_size - len(indices) % merge_size))
+        timestamps = [idx / video_fps for idx in indices]
+        # @JJJYmmm frames are merged by self.merge_size,
+        # so we average the timestamps of the first/last frame within each temporal patch
+        timestamps = [
+            (timestamps[i] + timestamps[i + merge_size - 1]) / 2 for i in range(0, len(timestamps), merge_size)
+        ]
+        return timestamps
+
+
+__all__ = ["ZFQwen3VLProcessor"]
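The `_calculate_timestamps` helper above is pure Python, so its pad-and-average behavior is easy to sanity-check standalone. A sketch with hypothetical frame indices sampled from a 24 fps video:

```python
def calculate_timestamps(indices, video_fps, merge_size=2):
    """Mirror of ZFQwen3VLProcessor._calculate_timestamps, without the class."""
    indices = list(indices)
    # Pad so the frame count is a multiple of merge_size, repeating the last index
    if len(indices) % merge_size != 0:
        indices.extend(indices[-1] for _ in range(merge_size - len(indices) % merge_size))
    timestamps = [idx / video_fps for idx in indices]
    # Average the first/last timestamp inside each temporal patch
    return [
        (timestamps[i] + timestamps[i + merge_size - 1]) / 2
        for i in range(0, len(timestamps), merge_size)
    ]

print(calculate_timestamps([0, 12, 24, 36], video_fps=24))  # [0.25, 1.25]
print(calculate_timestamps([0, 12, 24], video_fps=24))      # [0.25, 1.0] (last index repeated)
```

These are the `<t seconds>` values interpolated into the prompt before each per-frame run of video tokens.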
processor_config.json ADDED
@@ -0,0 +1,74 @@
+{
+  "auto_map": {
+    "AutoProcessor": "processing_qwen3_vl.ZFQwen3VLProcessor"
+  },
+  "image_processor": {
+    "auto_map": {
+      "AutoImageProcessor": "image_processing_qwen2_vl_fast.ZFQwen2VLImageProcessorFast",
+      "AutoProcessor": "processing_qwen3_vl.ZFQwen3VLProcessor"
+    },
+    "data_format": "channels_first",
+    "do_convert_rgb": true,
+    "do_normalize": true,
+    "do_rescale": true,
+    "do_resize": true,
+    "image_mean": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "image_processor_type": "ZFQwen2VLImageProcessorFast",
+    "image_std": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "merge_size": 2,
+    "patch_size": 16,
+    "resample": 3,
+    "rescale_factor": 0.00392156862745098,
+    "size": {
+      "longest_edge": 16777216,
+      "shortest_edge": 65536
+    },
+    "temporal_patch_size": 2
+  },
+  "processor_class": "ZFQwen3VLProcessor",
+  "video_processor": {
+    "auto_map": {
+      "AutoProcessor": "processing_qwen3_vl.ZFQwen3VLProcessor",
+      "AutoVideoProcessor": "video_processing_qwen3_vl.ZFQwen3VLVideoProcessor"
+    },
+    "data_format": "channels_first",
+    "default_to_square": true,
+    "do_convert_rgb": true,
+    "do_normalize": true,
+    "do_rescale": true,
+    "do_resize": true,
+    "do_sample_frames": true,
+    "fps": 2,
+    "image_mean": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "image_std": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "max_frames": 768,
+    "merge_size": 2,
+    "min_frames": 4,
+    "patch_size": 16,
+    "resample": 3,
+    "rescale_factor": 0.00392156862745098,
+    "return_metadata": false,
+    "size": {
+      "longest_edge": 25165824,
+      "shortest_edge": 4096
+    },
+    "temporal_patch_size": 2,
+    "video_processor_type": "ZFQwen3VLVideoProcessor"
+  }
+}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:be75606093db2094d7cd20f3c2f385c212750648bd6ea4fb2bf507a6a4c55506
+size 11422650
tokenizer_config.json ADDED
@@ -0,0 +1,33 @@
+{
+  "add_prefix_space": false,
+  "auto_map": {
+    "AutoProcessor": "processing_qwen3_vl.ZFQwen3VLProcessor"
+  },
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "is_local": true,
+  "model_max_length": 262144,
+  "pad_token": "<|endoftext|>",
+  "processor_class": "ZFQwen3VLProcessor",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}
video_processing_qwen3_vl.py ADDED
@@ -0,0 +1,260 @@
+import math
+
+import numpy as np
+import torch
+
+from transformers.feature_extraction_utils import BatchFeature
+from transformers.image_utils import ChannelDimension, PILImageResampling, SizeDict, get_image_size
+from transformers.processing_utils import Unpack, VideosKwargs
+from transformers.utils import logging
+from transformers.utils.generic import TensorType
+from transformers.utils.doc import add_start_docstrings
+from transformers.video_processing_utils import BASE_VIDEO_PROCESSOR_DOCSTRING, BaseVideoProcessor
+from transformers.video_utils import VideoMetadata, group_videos_by_shape, reorder_videos
+
+
+logger = logging.get_logger(__name__)
+
+
+def smart_resize(
+    num_frames: int,
+    height: int,
+    width: int,
+    temporal_factor: int = 2,
+    factor: int = 32,
+    min_pixels: int = 128 * 128,
+    max_pixels: int = 16 * 16 * 2 * 2 * 2 * 6144,
+):
+    if height < factor or width < factor:
+        raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
+    elif max(height, width) / min(height, width) > 200:
+        raise ValueError(
+            f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
+        )
+    h_bar = round(height / factor) * factor
+    w_bar = round(width / factor) * factor
+    t_bar = math.ceil(num_frames / temporal_factor) * temporal_factor
+
+    if t_bar * h_bar * w_bar > max_pixels:
+        beta = math.sqrt((num_frames * height * width) / max_pixels)
+        h_bar = max(factor, math.floor(height / beta / factor) * factor)
+        w_bar = max(factor, math.floor(width / beta / factor) * factor)
+    elif t_bar * h_bar * w_bar < min_pixels:
+        beta = math.sqrt(min_pixels / (num_frames * height * width))
+        h_bar = math.ceil(height * beta / factor) * factor
+        w_bar = math.ceil(width * beta / factor) * factor
+
+    return h_bar, w_bar
+
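Since `smart_resize` above is pure Python, it can be exercised standalone. The sketch below copies the function body verbatim with the class defaults (`factor=32`, `temporal_factor=2`) and checks two simple cases:

```python
import math

def smart_resize(num_frames, height, width, temporal_factor=2, factor=32,
                 min_pixels=128 * 128, max_pixels=16 * 16 * 2 * 2 * 2 * 6144):
    # Verbatim copy of the video smart_resize above, for standalone testing
    if height < factor or width < factor:
        raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
    elif max(height, width) / min(height, width) > 200:
        raise ValueError(
            f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
        )
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    t_bar = math.ceil(num_frames / temporal_factor) * temporal_factor
    if t_bar * h_bar * w_bar > max_pixels:
        beta = math.sqrt((num_frames * height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif t_bar * h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (num_frames * height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

print(smart_resize(8, 224, 224))  # (224, 224): already within the pixel budget
print(smart_resize(2, 32, 32))    # (96, 96): upscaled to reach min_pixels
```

Note the budget is checked against `t_bar * h_bar * w_bar`, so short clips can afford a larger spatial resolution than long ones.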
+
+class Qwen3VLVideoProcessorInitKwargs(VideosKwargs, total=False):
+    patch_size: int
+    temporal_patch_size: int
+    merge_size: int
+    min_frames: int
+    max_frames: int
+
+
+@add_start_docstrings(
+    "Constructs a fast Qwen3-VL video processor that dynamically resizes videos based on the original videos.",
+    BASE_VIDEO_PROCESSOR_DOCSTRING,
+    """
+        patch_size (`int`, *optional*, defaults to 16):
+            The spatial patch size of the vision encoder.
+        temporal_patch_size (`int`, *optional*, defaults to 2):
+            The temporal patch size of the vision encoder.
+        merge_size (`int`, *optional*, defaults to 2):
+            The merge size of the vision encoder to llm encoder.
+    """,
+)
+class ZFQwen3VLVideoProcessor(BaseVideoProcessor):
+    resample = PILImageResampling.BICUBIC
+    size = {"shortest_edge": 128 * 32 * 32, "longest_edge": 32 * 32 * 768}
+    image_mean = [0.5, 0.5, 0.5]
+    image_std = [0.5, 0.5, 0.5]
+    do_resize = True
+    do_rescale = True
+    do_normalize = True
+    do_convert_rgb = True
+    patch_size = 16
+    temporal_patch_size = 2
+    merge_size = 2
+    fps = 2
+    min_frames = 4
+    max_frames = 768
+    do_sample_frames = True
+    valid_kwargs = Qwen3VLVideoProcessorInitKwargs
+    model_input_names = ["pixel_values_videos", "video_grid_thw"]
+
+    def __init__(self, **kwargs: Unpack[Qwen3VLVideoProcessorInitKwargs]):
+        super().__init__(**kwargs)
+        if self.size is not None and (
+            self.size.get("shortest_edge", None) is None or self.size.get("longest_edge", None) is None
+        ):
+            raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
+
+    def _further_process_kwargs(
+        self,
+        size: SizeDict | None = None,
+        **kwargs,
+    ) -> dict:
+        """
+        Update kwargs that need further processing before being validated.
+        Can be overridden by subclasses to customize the processing of kwargs.
+        """
+        if size is not None and ("shortest_edge" not in size or "longest_edge" not in size):
+            raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
+
+        return super()._further_process_kwargs(size=size, **kwargs)
+
110
+    def sample_frames(
+        self,
+        metadata: VideoMetadata,
+        num_frames: int | None = None,
+        fps: int | float | None = None,
+        **kwargs,
+    ):
+        """
+        Default sampling function which uniformly samples the desired number of frames between 0 and the total number
+        of frames. If `fps` is passed along with metadata, `fps` frames per second are sampled uniformly. The arguments
+        `num_frames` and `fps` are mutually exclusive.
+
+        Args:
+            metadata (`VideoMetadata`):
+                Metadata of the video containing information about total duration, fps and total number of frames.
+            num_frames (`int`, *optional*):
+                Maximum number of frames to sample. Defaults to `self.num_frames`.
+            fps (`int` or `float`, *optional*):
+                Target frames to sample per second. Defaults to `self.fps`.
+
+        Returns:
+            np.ndarray:
+                Indices of the sampled frames.
+        """
+        if fps is not None and num_frames is not None:
+            raise ValueError("`num_frames` and `fps` are mutually exclusive arguments, please use only one!")
+
+        total_num_frames = metadata.total_num_frames
+        fps = fps if fps is not None else self.fps
+
+        # If `num_frames` is not given but `fps` is, calculate `num_frames` from `fps`
+        if num_frames is None and fps is not None:
+            if metadata.fps is None:
+                metadata.fps = 24
+                logger.warning_once(
+                    "Asked to sample `fps` frames per second but no video metadata was provided, which is required "
+                    "when sampling with `fps`. Defaulting to `fps=24`. Please provide `video_metadata` for more "
+                    "accurate results."
+                )
+            num_frames = int(total_num_frames / metadata.fps * fps)
+            num_frames = min(max(num_frames, self.min_frames), self.max_frames, total_num_frames)
+
+        if num_frames is None:
+            num_frames = min(max(total_num_frames, self.min_frames), self.max_frames)
+
+        indices = np.linspace(0, total_num_frames - 1, num_frames).round().astype(int)
+
+        return indices
+
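The fps-based branch of `sample_frames` above can be summarized as a small standalone helper (the function name and signature here are illustrative, not part of the class): derive the frame count from the clip duration and target fps, clamp it, then pick uniformly spaced indices.

```python
import numpy as np


def sample_indices(total_num_frames, video_fps, target_fps=2,
                   min_frames=4, max_frames=768):
    # Hypothetical sketch of the sampling logic: number of frames to keep
    # is duration (total / fps) times the target fps, clamped to
    # [min_frames, max_frames] and to the frames actually available.
    num_frames = int(total_num_frames / video_fps * target_fps)
    num_frames = min(max(num_frames, min_frames), max_frames, total_num_frames)
    # Uniformly spaced indices over the whole clip, endpoints included.
    return np.linspace(0, total_num_frames - 1, num_frames).round().astype(int)
```

For a 10-second clip at 24 fps with the default `target_fps=2`, this keeps 20 frames spanning the full clip; very short clips are padded up to `min_frames` indices.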
+    def _preprocess(
+        self,
+        videos: list[torch.Tensor],
+        do_convert_rgb: bool = True,
+        do_resize: bool = True,
+        size: SizeDict | None = None,
+        interpolation: PILImageResampling = PILImageResampling.BICUBIC,
+        do_rescale: bool = True,
+        rescale_factor: float = 1 / 255.0,
+        do_normalize: bool = True,
+        image_mean: float | list[float] | None = None,
+        image_std: float | list[float] | None = None,
+        patch_size: int | None = None,
+        temporal_patch_size: int | None = None,
+        merge_size: int | None = None,
+        return_tensors: str | TensorType | None = None,
+        **kwargs,
+    ):
+        grouped_videos, grouped_videos_index = group_videos_by_shape(videos)
+        resized_videos_grouped = {}
+
+        for shape, stacked_videos in grouped_videos.items():
+            B, T, C, H, W = stacked_videos.shape
+            num_frames, height, width = T, H, W
+            if do_resize:
+                resized_height, resized_width = smart_resize(
+                    num_frames=num_frames,
+                    height=height,
+                    width=width,
+                    temporal_factor=temporal_patch_size,
+                    factor=patch_size * merge_size,
+                    min_pixels=size.shortest_edge,
+                    max_pixels=size.longest_edge,
+                )
+                stacked_videos = stacked_videos.view(B * T, C, H, W)
+                stacked_videos = self.resize(
+                    stacked_videos,
+                    size=SizeDict(height=resized_height, width=resized_width),
+                    interpolation=interpolation,
+                )
+                stacked_videos = stacked_videos.view(B, T, C, resized_height, resized_width)
+            resized_videos_grouped[shape] = stacked_videos
+        resized_videos = reorder_videos(resized_videos_grouped, grouped_videos_index)
+
+        # Group videos by size for further processing
+        # Needed in case do_resize is False, or resize returns videos with different sizes
+        grouped_videos, grouped_videos_index = group_videos_by_shape(resized_videos)
+        processed_videos_grouped = {}
+        processed_grids = {}
+        for shape, stacked_videos in grouped_videos.items():
+            resized_height, resized_width = get_image_size(stacked_videos[0], channel_dim=ChannelDimension.FIRST)
+
+            # Fused rescale and normalize
+            stacked_videos = self.rescale_and_normalize(
+                stacked_videos, do_rescale, rescale_factor, do_normalize, image_mean, image_std
+            )
+            patches = stacked_videos
+
+            # Pad by repeating the last frame so that `num_frames` is divisible by `temporal_patch_size`
+            T = patches.shape[1]
+            if pad := -T % temporal_patch_size:
+                repeats = patches[:, -1:].expand(-1, pad, -1, -1, -1)
+                patches = torch.cat((patches, repeats), dim=1)
+            batch_size, grid_t, channel = patches.shape[:3]
+            grid_t = grid_t // temporal_patch_size
+            grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
+
+            patches = patches.view(
+                batch_size,
+                grid_t,
+                temporal_patch_size,
+                channel,
+                grid_h // merge_size,
+                merge_size,
+                patch_size,
+                grid_w // merge_size,
+                merge_size,
+                patch_size,
+            )
+            patches = patches.permute(0, 1, 4, 7, 5, 8, 3, 2, 6, 9)
+            flatten_patches = patches.reshape(
+                batch_size,
+                grid_t * grid_h * grid_w,
+                channel * temporal_patch_size * patch_size * patch_size,
+            )
+
+            processed_videos_grouped[shape] = flatten_patches
+            processed_grids[shape] = [[grid_t, grid_h, grid_w]] * batch_size
+
+        processed_videos = reorder_videos(processed_videos_grouped, grouped_videos_index)
+        processed_grids = reorder_videos(processed_grids, grouped_videos_index)
+        pixel_values_videos = torch.cat(processed_videos, dim=0)
+        video_grid_thw = torch.tensor(processed_grids)
+        data = {
+            "pixel_values_videos": pixel_values_videos,
+            "video_grid_thw": video_grid_thw,
+        }
+
+        return BatchFeature(data=data, tensor_type=return_tensors)
+
+
+__all__ = ["ZFQwen3VLVideoProcessor"]
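The reshape/permute in `_preprocess` turns each video into `grid_t * grid_h * grid_w` flattened patch vectors. The shape bookkeeping behind `video_grid_thw` can be sketched with a small standalone helper (name and signature are illustrative, assuming RGB input and the class defaults above):

```python
def video_grid(num_frames, height, width, channels=3,
               patch_size=16, temporal_patch_size=2, merge_size=2):
    # Frames are padded (by repeating the last frame) up to a multiple of
    # temporal_patch_size; height and width are assumed to already be
    # multiples of patch_size * merge_size after smart_resize.
    padded_t = num_frames + (-num_frames % temporal_patch_size)
    grid_t = padded_t // temporal_patch_size
    grid_h = height // patch_size
    grid_w = width // patch_size
    num_patches = grid_t * grid_h * grid_w
    # Each flattened patch holds one temporal block of spatial patches.
    patch_dim = channels * temporal_patch_size * patch_size * patch_size
    return (grid_t, grid_h, grid_w), num_patches, patch_dim
```

So a 9-frame 640x1280 clip pads to 10 frames and yields a 5x40x80 grid, i.e. 16000 patch vectors of 1536 values each, matching the `pixel_values_videos` layout produced above.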