Commit dff609e · verified · 1 parent: d7c7af6
Committed by Cyril666

Upload folder using huggingface_hub

LICENSE.txt ADDED
@@ -0,0 +1,289 @@
+ Tencent is pleased to support the open source community by making Penguin-VL 高效视觉语言理解模型 available.
+
+ Copyright (C) 2026 Tencent. All rights reserved.
+
+ The open-source software and/or models included in this distribution may have been modified by Tencent (“Tencent Modifications”). All Tencent Modifications are Copyright (C) Tencent.
+
+ Penguin-VL 高效视觉语言理解模型 is licensed under the License Terms of Penguin-VL 高效视觉语言理解模型, except for the third-party components listed below, which remain licensed under their respective original terms. Penguin-VL 高效视觉语言理解模型 does not impose any additional restrictions beyond those specified in the original licenses of these third-party components. Users are required to comply with all applicable terms and conditions of the original licenses and to ensure that the use of these third-party components conforms to all relevant laws and regulations.
+
+ For the avoidance of doubt, Penguin-VL 高效视觉语言理解模型 refers solely to training code, inference code, parameters, and weights made publicly available by Tencent in accordance with the License Terms of Penguin-VL 高效视觉语言理解模型.
+
+ Terms of the License Terms of Penguin-VL 高效视觉语言理解模型:
+ --------------------------------------------------------------------
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 0. Additional Territorial Limitation ** Penguin-VL 高效视觉语言理解模型 IS NOT INTENDED FOR USE WITHIN THE EUROPEAN UNION.** IN THE EVENT OF ANY CONFLICT, THIS CLAUSE SHALL PREVAIL.
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
+
+ You must give any other recipients of the Work or Derivative Works a copy of this License; and
+
+ You must cause any modified files to carry prominent notices stating that You changed the files; and
+
+ You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
+
+ If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
+
+
+ END OF TERMS AND CONDITIONS
+
+
+
+ Dependencies and Licenses:
+
+ This open-source project, Penguin-VL 高效视觉语言理解模型, builds upon the following open-source models and/or software components, each of which remains licensed under its original license. Certain models or software may include modifications made by Tencent (“Tencent Modifications”), which are Copyright (C) Tencent.
+
+ In case you believe there have been errors in the attribution below, you may submit the concerns to us for review and correction.
+
+
+ Open Source Model Licensed under the Apache License 2.0:
+ --------------------------------------------------------------------
+ 1. Qwen3
+ Copyright 2024 Alibaba Cloud
+ Please note this model has been modified by Tencent in this distribution.
+
+
+
+ Terms of the Apache License 2.0:
+ --------------------------------------------------------------------
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+
README.md ADDED
@@ -0,0 +1,112 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ metrics:
+ - accuracy
+ base_model:
+ - Qwen/Qwen3-0.6B
+ library_name: transformers
+ tags:
+ - multi-modal
+ - large-language-model
+ - vision-language-model
+ - vision-encoder
+ ---
+
+ <p align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
+ </p>
+
+
+ <h2 align="center">Vision Encoder of Penguin-VL</h2>
+ <h4 align="center">
+ Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
+ </h4>
+
+ <h4 align="center">
+ <b>Project Page:</b> <a href="https://penguin-vl.github.io">penguin-vl.github.io</a> |
+ <b>GitHub:</b> <a href="https://github.com/tencent-ailab/Penguin-VL">tencent-ailab/Penguin-VL</a> |
+ <b>arXiv:</b> <a href="https://arxiv.org/abs/2603.06569">2603.06569</a>
+ <br><br>
+ <a href="https://penguin-vl.github.io"><img src="https://img.shields.io/badge/Project-Page-green?logo=github" alt="Project Page"></a>
+ <a href="https://github.com/tencent-ailab/Penguin-VL"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub Badge"></a>
+ <a href="https://huggingface.co/spaces/tencent/Penguin-VL"><img src="https://img.shields.io/badge/HuggingFace-Spaces-yellow?logo=huggingface" alt="Hugging Face Spaces"></a>
+ <a href="https://arxiv.org/abs/2603.06569"><img src="https://img.shields.io/badge/arXiv-2603.06569-b31b1b.svg?logo=arxiv" alt="arXiv"></a>
+ </h4>
+
+ ---
+
+ ## 📰 News
+
+ * **2026.03** — PenguinVL-Encoder is now available for general use.
+ * **2026.03** — Released PenguinVL-2B and PenguinVL-8B.
+
+ ---
+
+ ## 🌟 Model Overview
+
+ PenguinVL is a compact vision-language model designed to explore the efficiency limits of small-scale VLMs.
+
+ Unlike most existing VLMs, which rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), Penguin-VL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
+
+ ### Key Characteristics
+
+ - 🧠 **LLM-based Vision Encoder**
+   The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
+   This provides strong semantic priors and native compatibility with the downstream LLM.
+
+ ---
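The encoder description above names bidirectional attention and 2D-RoPE without showing either, and `modeling_penguinvl_encoder.py` is not part of the text shown here. As a rough illustration only, the sketch below uses one common formulation: patches are indexed by (row, column), half of each head's rotary pairs are driven by the row index and half by the column index, and every token may attend to every other token. The function names are hypothetical; `patch_size=14`, `head_dim=128` and `rope_theta=1000000` are borrowed from `config.json`, and the even row/column split is an assumption rather than the repository's confirmed design.

```python
def patch_grid_positions(height_px, width_px, patch_size=14):
    """Return (row, col) for every patch, in row-major order."""
    rows, cols = height_px // patch_size, width_px // patch_size
    return [(r, c) for r in range(rows) for c in range(cols)]


def rope_2d_angles(row, col, head_dim=128, theta=1_000_000.0):
    """Rotation angles for one patch.

    Assumption: head_dim // 2 rotary pairs, split evenly so the first
    half is driven by the row index and the second half by the column.
    """
    pairs_per_axis = head_dim // 4
    freqs = [theta ** (-i / pairs_per_axis) for i in range(pairs_per_axis)]
    return [row * f for f in freqs] + [col * f for f in freqs]


def attention_mask(num_tokens, bidirectional=True):
    """mask[i][j] is True where query i may attend to key j."""
    return [
        [bidirectional or j <= i for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


positions = patch_grid_positions(28, 42)   # a 2 x 3 patch grid
angles = rope_2d_angles(*positions[-1])    # last patch sits at row 1, col 2
causal = attention_mask(3, bidirectional=False)
full = attention_mask(3)
```

With a causal mask the first patch could only see itself; the bidirectional mask lets each patch use context from the whole image, which is the usual reason for dropping causality in a vision encoder.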
+
+ ## 🧪 Quick Start — Transformers Inference
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoImageProcessor
+ from transformers.image_utils import load_image
+
+ model_name = "tencent/Penguin-Encoder"
+ image_path = "your_img.jpg"
+ images = load_image(image_path)
+
+ model = AutoModel.from_pretrained(
+     model_name,
+     trust_remote_code=True,
+     device_map="auto",
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+ )
+ processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)
+
+ inputs = processor(images=images, merge_size=1)
+ inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
+ if "pixel_values" in inputs:
+     inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
+ image_features = model(**inputs)
+ ```
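The quick start above hard-codes `attn_implementation="flash_attention_2"` and `.cuda()`, which fails on machines without a CUDA device or without the `flash_attn` package. A minimal sketch of a fallback follows, using only the standard library. The helper names are hypothetical, `"sdpa"` is a value that Transformers accepts for `attn_implementation`, and whether this repository's remote code supports it is an assumption that has not been checked.

```python
import importlib.util


def pick_attn_implementation(has_cuda: bool) -> str:
    """Choose a value for the `attn_implementation` argument.

    FlashAttention 2 needs a CUDA device and the `flash_attn` package;
    otherwise fall back to PyTorch scaled-dot-product attention.
    """
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if has_cuda and flash_installed:
        return "flash_attention_2"
    return "sdpa"


def pick_device(has_cuda: bool) -> str:
    """Device string to use in place of the hard-coded `.cuda()` call."""
    return "cuda" if has_cuda else "cpu"
```

In the quick start, `has_cuda` would come from `torch.cuda.is_available()`, the result of `pick_attn_implementation` would be passed to `from_pretrained`, and `.cuda()` would become `.to(pick_device(has_cuda))`.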
+
+ ## 🌎 Model Zoo
+ | Model | Base Model | HF Link |
+ | -------------------- | ------------ | ------------------------------------------------------------ |
+ | PenguinVL-8B | Qwen3-8B | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
+ | PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
+ | PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |
+
+ ## 🚀 Main Results
+ Ablation Study:
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/JOSRpV_qEbTqdbYwH-hJr.png)
+
+ For the main results, see the ablation section of our paper.
+
+ ## Citation
+
+ If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:
+ ```bibtex
+ @article{Penguin-VL,
+   title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
+   author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
+   journal={arXiv preprint arXiv:2603.06569},
+   year={2026}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "architectures": [
+     "PenguinVLVisionEncoderModel"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_penguinvl_encoder.PenguinVLVisionEncoderConfig",
+     "AutoModel": "modeling_penguinvl_encoder.PenguinVLVisionEncoderModel"
+   },
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 151643,
+   "eos_token_id": 151645,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-06,
+   "max_position_embeddings": 40960,
+   "max_window_layers": 28,
+   "model_type": "penguinvl_vision_encoder",
+   "num_attention_heads": 16,
+   "num_channels": 3,
+   "num_hidden_layers": 28,
+   "num_key_value_heads": 8,
+   "patch_size": 14,
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": null,
+   "rope_theta": 1000000,
+   "sliding_window": null,
+   "tie_word_embeddings": true,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.51.3",
+   "use_cache": true,
+   "use_sliding_window": false,
+   "vocab_size": 151936
+ }
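A few of the values in `config.json` only make sense together, so a short worked calculation may help. The sketch below copies six fields from the file and derives the attention projection widths and the grouped-query ratio, assuming the encoder keeps the Qwen3 convention that the query projection is `num_attention_heads * head_dim` wide. The flattened patch length assumes a plain `channels x patch x patch` layout with no temporal dimension; neither assumption is confirmed by the files shown here, and `derived_shapes` is a hypothetical helper.

```python
import json

# Subset of config.json, copied verbatim from the file above.
CONFIG_JSON = """
{
  "hidden_size": 1024,
  "head_dim": 128,
  "num_attention_heads": 16,
  "num_key_value_heads": 8,
  "num_channels": 3,
  "patch_size": 14
}
"""


def derived_shapes(cfg):
    """Compute sizes implied by the config (assumes Qwen3-style attention)."""
    return {
        # Query projection width: heads x per-head dimension.
        "q_proj_out": cfg["num_attention_heads"] * cfg["head_dim"],
        # Key/value projection width: fewer heads under grouped-query attention.
        "kv_proj_out": cfg["num_key_value_heads"] * cfg["head_dim"],
        # How many query heads share each key/value head.
        "queries_per_kv_head": cfg["num_attention_heads"] // cfg["num_key_value_heads"],
        # Values in one flattened RGB patch (no temporal axis assumed).
        "flattened_patch_len": cfg["num_channels"] * cfg["patch_size"] ** 2,
    }


shapes = derived_shapes(json.loads(CONFIG_JSON))
```

Under these assumptions the query projection is 2048 wide, twice `hidden_size`, because `head_dim` is set explicitly rather than derived as `hidden_size / num_attention_heads`.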
configuration_penguinvl_encoder.py ADDED
@@ -0,0 +1,33 @@
+ """PenguinVL vision encoder model configuration."""
+
+ from transformers import Qwen3Config
+
+
+ class PenguinVLVisionEncoderConfig(Qwen3Config):
+
+     model_type = "penguinvl_vision_encoder"
+
+     def __init__(
+         self,
+         hidden_size=1536,
+         intermediate_size=8960,
+         num_hidden_layers=12,
+         num_attention_heads=12,
+         num_channels=3,
+         patch_size=14,
+         layer_norm_eps=1e-6,
+         attention_dropout=0.0,
+         num_key_value_heads=2,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.num_channels = num_channels
+         self.patch_size = patch_size
+         self.attention_dropout = attention_dropout
+         self.num_key_value_heads = num_key_value_heads
+         self.layer_norm_eps = layer_norm_eps
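The constructor defaults above (`hidden_size=1536`, 12 layers, 2 key/value heads) differ from the values stored in `config.json` (1024, 28 layers, 8 key/value heads). When a config is built from the JSON file, the stored values arrive as keyword arguments and take precedence, so the defaults only apply to keys that are absent. The sketch below reproduces that pattern with hypothetical stand-in classes so it runs without `transformers`; `ToyBaseConfig` is not `Qwen3Config` and only mimics the "store every keyword argument" behaviour.

```python
class ToyBaseConfig:
    """Stand-in for the parent config: stores every keyword argument it gets."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)


class ToyEncoderConfig(ToyBaseConfig):
    """Mirrors the subclass pattern: named defaults, then explicit assignment."""

    def __init__(self, hidden_size=1536, num_hidden_layers=12,
                 num_key_value_heads=2, **kwargs):
        # Named arguments are not forwarded; the parent only sees the rest.
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_key_value_heads = num_key_value_heads


# No arguments: the class defaults win.
from_defaults = ToyEncoderConfig()

# Values as they appear in config.json: they override the defaults,
# and unknown keys such as head_dim fall through to the parent.
from_json = ToyEncoderConfig(
    hidden_size=1024, num_hidden_layers=28, num_key_value_heads=8, head_dim=128
)
```

One consequence of this pattern is that instantiating the class directly, without the JSON file, yields a differently sized model from the released checkpoint.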
image_processing_penguinvl.py ADDED
@@ -0,0 +1,548 @@
1
+ # Adapted from https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py.
2
+ # Below is the original copyright:
3
+ # Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
4
+ #
5
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
6
+ # and OPT implementations in this library. It has been modified from its
7
+ # original forms to accommodate minor architectural differences compared
8
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
9
+ #
10
+ # Licensed under the Apache License, Version 2.0 (the "License");
11
+ # you may not use this file except in compliance with the License.
12
+ # You may obtain a copy of the License at
13
+ #
14
+ # http://www.apache.org/licenses/LICENSE-2.0
15
+ #
16
+ # Unless required by applicable law or agreed to in writing, software
17
+ # distributed under the License is distributed on an "AS IS" BASIS,
18
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
19
+ # See the License for the specific language governing permissions and
20
+ # limitations under the License.
21
+ """Image processor class for PenguinVL."""
22
+
23
+ import math
24
+ from typing import Dict, List, Optional, Union
25
+
26
+ import numpy as np
27
+
28
+ import torch
29
+ from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
30
+ from transformers.image_utils import ImageInput
31
+ from transformers.image_transforms import (
32
+ convert_to_rgb,
33
+ resize,
34
+ to_channel_dimension_format,
35
+ )
36
+ from transformers.image_utils import (
37
+ OPENAI_CLIP_MEAN,
38
+ OPENAI_CLIP_STD,
39
+ ChannelDimension,
40
+ ImageInput,
41
+ PILImageResampling,
42
+ get_image_size,
43
+ infer_channel_dimension_format,
44
+ is_scaled_image,
45
+ is_valid_image,
46
+ make_list_of_images,
47
+ to_numpy_array,
48
+ )
49
+ try:
50
+ from transformers.image_utils import VideoInput
51
+ except ImportError:
52
+ from transformers.video_utils import VideoInput
53
+ from transformers.utils import TensorType, is_vision_available, logging
54
+
55
+
56
+ logger = logging.get_logger(__name__)
57
+
58
+
59
+ if is_vision_available():
60
+ from PIL import Image
61
+
62
+
63
+ def is_valid_video(video) -> bool:
64
+ if isinstance(video, (list, tuple)):
65
+ return all(is_valid_image(frame) for frame in video)
66
+ elif isinstance(video, np.ndarray):
67
+ return video.ndim == 4
68
+ elif isinstance(video, torch.Tensor):
69
+ return video.ndim == 4
70
+ return False
71
+
72
+
73
+ def make_batched_images(images) -> List[List[ImageInput]]:
74
+ """
75
+ Normalize visual inputs to ``List[List[ImageInput]]`` – a list of *clips*,
76
+ where each clip is a list of frames.
77
+
78
+ Supported input formats::
79
+
80
+ Nested clips : [[image], [f1, f2, ...], ...] → returned as-is
81
+ Flat frames : [f1, f2, ...] → [[f1, f2, ...]]
82
+ Single image : image → [[image]]
83
+
84
+ Returns:
85
+ List of clips, where each clip is a list of valid images / frames.
86
+ """
87
+ if isinstance(images, (list, tuple)) and len(images) > 0:
88
+ if isinstance(images[0], (list, tuple)):
89
+ return [list(clip) for clip in images]
90
+ if all(is_valid_image(f) for f in images):
91
+ return [list(images)]
92
+ if is_valid_image(images):
93
+ return [[images]]
94
+ raise ValueError(f"Could not make batched images from {images}")
95
+
96
+
97
+ def simple_batched_resize(
98
+ images,
99
+ factor: int = 28,
100
+ min_tokens: int = 4 * 4,
101
+ max_tokens: int = 16384,
102
+ input_data_format: str = None,
103
+ frame_types=None
104
+ ):
105
+ """
106
+ Compute per-frame target (h, w) for a video frame list under a token budget (key/intermediate may differ).
107
+
108
+ Uses the Temporal Redundancy-Aware (TRA) token compression strategy: key and intermediate frames
109
+ can have different target areas (e.g. 1:16 ratio when compressing) to stay within max_tokens.
110
+
111
+ Args:
112
+ images: List of video frames (each PIL Image or ndarray).
113
+ factor: Alignment granularity (height and width are multiples of factor), default 28.
114
+ min_tokens: Minimum tokens per frame (used to derive min_pixels), default 16.
115
+ max_tokens: Token cap for total pixel budget, default 16384.
116
+ input_data_format: Channel format when not PIL, e.g. "channels_first".
117
+ frame_types: Per-frame type list, 0=key, 1=intermediate; None means all key.
118
+
119
+ Returns:
120
+ image_sizes: List of (h, w) per frame, one-to-one with images.
121
+ """
122
+ min_pixels = min_tokens * factor * factor * 1.5
123
+ max_pixels = max_tokens * factor * factor * 0.95
124
+
125
+ # --- Base info ---
126
+ first_image = images[0]
127
+ if isinstance(first_image, Image.Image):
128
+ width, height = first_image.size
129
+ else:
130
+ height, width = get_image_size(first_image, channel_dim=input_data_format)
131
+
132
+ aspect_ratio = height / width
133
+ raw_area = height * width
134
+
135
+ num_frames = len(images)
136
+ if frame_types is not None:
137
+ ft_list = frame_types.tolist() if hasattr(frame_types, 'tolist') else frame_types
138
+ num_intermediate = ft_list.count(1)
139
+ num_key = ft_list.count(0)
140
+ else:
141
+ num_key = num_frames
142
+ num_intermediate = 0
143
+ ft_list = [0] * num_frames
144
+
145
+ def get_dims_from_area(target_area, ar, fac):
146
+ """Compute aligned (h, w) from target area and aspect ratio; area = w²·ar => w = sqrt(area/ar)."""
147
+ w_new = math.sqrt(target_area / ar)
148
+ h_new = w_new * ar
149
+
150
+ h_bar = round(h_new / fac) * fac
151
+ w_bar = round(w_new / fac) * fac
152
+ h_bar = max(h_bar, fac)
153
+ w_bar = max(w_bar, fac)
154
+
155
+ return h_bar, w_bar
156
+
157
+ # --- Stage 1: No-downscale check ---
158
+ # If total pixels within budget, keep original size for both key and intermediate frames.
159
+ total_raw_pixels = num_frames * raw_area
160
+ target_key_area = raw_area
161
+ target_intermediate_area = raw_area
162
+
163
+ if total_raw_pixels > max_pixels:
164
+ # --- Stage 2: Sync compression ---
165
+ # Over budget: compress with 1:16 area ratio, intermediate_area = key_area / 16.
166
+ # Constraint: N_key·A_key + N_intermediate·(A_key/16) = max_pixels => A_key = max_pixels / (N_key + N_intermediate/16).
167
+ effective_count = num_key + (num_intermediate / 16.0)
168
+ calc_key_area = max_pixels / effective_count
169
+ calc_intermediate_area = calc_key_area / 16.0
170
+
171
+ # --- Stage 3: Intermediate-frame floor ---
172
+ # If computed intermediate area is below min_pixels, pin intermediate to min_pixels and give remaining budget to key.
173
+ if calc_intermediate_area >= min_pixels:
174
+ target_key_area = calc_key_area
175
+ target_intermediate_area = calc_intermediate_area
176
+ else:
177
+ target_intermediate_area = min_pixels
178
+ pixels_taken_by_intermediate = num_intermediate * min_pixels
179
+ remaining_for_key = max_pixels - pixels_taken_by_intermediate
180
+ target_key_area = remaining_for_key / num_key
181
+
182
+ # --- Stage 4: Key-frame hard floor ---
183
+ if target_key_area < min_pixels:
184
+ target_key_area = min_pixels
185
+
186
+ # --- Area to aligned dimensions ---
187
+ k_h, k_w = get_dims_from_area(target_key_area, aspect_ratio, factor)
188
+ if num_intermediate > 0:
189
+ i_h, i_w = get_dims_from_area(target_intermediate_area, aspect_ratio, factor)
190
+ else:
191
+ i_h, i_w = 0, 0
192
+
193
+ def ensure_min_hw(h, w, min_p, raw_ar):
194
+ """If area still below min_pixels after alignment (rounding), recompute from min area and align upward."""
195
+ if h * w < min_p:
196
+ w = math.sqrt(min_p / raw_ar)
197
+ h = w * raw_ar
198
+ h = math.ceil(h / factor) * factor
199
+ w = math.ceil(w / factor) * factor
200
+ return h, w
201
+
202
+ k_h, k_w = ensure_min_hw(k_h, k_w, min_pixels, aspect_ratio)
203
+ if num_intermediate > 0:
204
+ i_h, i_w = ensure_min_hw(i_h, i_w, min_pixels, aspect_ratio)
205
+
206
+ image_sizes = [
207
+ (i_h, i_w) if ft_list[i] == 1 else (k_h, k_w)
208
+ for i in range(num_frames)
209
+ ]
210
+ return image_sizes
211
+
212
+
213
+ class PenguinVLImageProcessor(BaseImageProcessor):
214
+ r"""
215
+ Constructs a PenguinVL image processor that dynamically resizes images and video frames from their original size to fit a visual-token budget.
216
+
217
+ Args:
218
+ do_resize (`bool`, *optional*, defaults to `True`):
219
+ Whether to resize the image's (height, width) dimensions.
220
+ resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
221
+ Resampling filter to use when resizing the image.
222
+ do_rescale (`bool`, *optional*, defaults to `True`):
223
+ Whether to rescale the image by the specified scale `rescale_factor`.
224
+ rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
225
+ Scale factor to use if rescaling the image.
226
+ do_normalize (`bool`, *optional*, defaults to `True`):
227
+ Whether to normalize the image.
228
+ image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
229
+ Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
230
+ image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
231
+ Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
232
+ do_convert_rgb (`bool`, *optional*, defaults to `True`):
233
+ Whether to convert the image to RGB.
234
+ min_tokens (`int`, *optional*, defaults to `4 * 4`):
235
+ Minimum number of visual tokens per frame; used to derive the per-frame pixel floor.
236
+ max_tokens (`int`, *optional*, defaults to `16384`):
237
+ Maximum number of visual tokens in total; used to derive the overall pixel budget.
238
+ patch_size (`int`, *optional*, defaults to 14):
239
+ The spatial patch size of the vision encoder.
240
+ """
241
+
242
+ model_input_names = ["pixel_values", "grid_sizes", "merge_sizes"]
243
+
244
+ def __init__(
245
+ self,
246
+ do_resize: bool = True,
247
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
248
+ do_rescale: bool = True,
249
+ rescale_factor: Union[int, float] = 1 / 255,
250
+ do_normalize: bool = True,
251
+ image_mean: Optional[Union[float, List[float]]] = None,
252
+ image_std: Optional[Union[float, List[float]]] = None,
253
+ do_convert_rgb: bool = True,
254
+ min_tokens: int = 4 * 4,
255
+ max_tokens: int = 16384,
256
+ patch_size: int = 14,
257
+ **kwargs,
258
+ ) -> None:
259
+ super().__init__(**kwargs)
260
+ self.do_resize = do_resize
261
+ self.resample = resample
262
+ self.do_rescale = do_rescale
263
+ self.rescale_factor = rescale_factor
264
+ self.do_normalize = do_normalize
265
+ self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
266
+ self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
267
+ self.min_tokens = min_tokens
268
+ self.max_tokens = max_tokens
269
+ self.patch_size = patch_size
270
+ self.do_convert_rgb = do_convert_rgb
271
+
272
+ def _allocate_token_budget(self, clips, clip_merge_sizes, input_data_format):
273
+ """Distribute self.max_tokens across clips proportionally to their raw token counts."""
274
+ clip_raw_tokens = []
275
+ for clip, ms in zip(clips, clip_merge_sizes):
276
+ first_frame = clip[0]
277
+ if isinstance(first_frame, Image.Image):
278
+ w, h = first_frame.size
279
+ else:
280
+ h, w = get_image_size(first_frame, channel_dim=input_data_format)
281
+ factor = self.patch_size * ms
282
+ clip_raw_tokens.append(len(clip) * h * w / (factor * factor))
283
+
284
+ total_raw_tokens = sum(clip_raw_tokens)
285
+ if total_raw_tokens <= self.max_tokens:
286
+ return [self.max_tokens] * len(clips)
287
+
288
+ return [
289
+ max(self.min_tokens * len(clip), raw * self.max_tokens / total_raw_tokens)
290
+ for clip, raw in zip(clips, clip_raw_tokens)
291
+ ]
292
+
293
+ def _preprocess(
294
+ self,
295
+ images: Union[ImageInput, VideoInput],
296
+ target_size: List[int],
297
+ merge_size: int = 1,
298
+ do_resize: bool = None,
299
+ resample: PILImageResampling = None,
300
+ do_rescale: bool = None,
301
+ rescale_factor: float = None,
302
+ do_normalize: bool = None,
303
+ image_mean: Optional[Union[float, List[float]]] = None,
304
+ image_std: Optional[Union[float, List[float]]] = None,
305
+ do_convert_rgb: bool = None,
306
+ data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
307
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
308
+ ):
309
+ """
310
+ Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.
311
+
312
+ Args:
313
+ images (`ImageInput`):
314
+ Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
315
+ target_size (`List[int]`):
316
+ The target size to resize the image to. Should be a list of two integers: [target_height, target_width].
317
+ merge_size (`int`, *optional*, defaults to `1`):
318
+ The merge size after the vision encoder.
319
+ do_resize (`bool`, *optional*, defaults to `self.do_resize`):
320
+ Whether to resize the image.
321
+ resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
322
+ Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
323
+ do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
324
+ Whether to rescale the image.
325
+ rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
326
+ Scale factor to use if rescaling the image.
327
+ do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
328
+ Whether to normalize the image.
329
+ image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
330
+ Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
331
+ image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
332
+ Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
333
+ do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
334
+ Whether to convert the image to RGB.
335
+ data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
336
+ The channel dimension format for the output image. Can be one of:
337
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
338
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
339
+ - Unset: Use the channel dimension format of the input image.
340
+ input_data_format (`ChannelDimension` or `str`, *optional*):
341
+ The channel dimension format for the input image. Can be one of:
342
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
343
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
344
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
345
+ """
346
+ images = make_list_of_images(images)
347
+
348
+ if do_convert_rgb:
349
+ images = [convert_to_rgb(image) for image in images]
350
+
351
+ # All transformations expect numpy arrays.
352
+ images = [to_numpy_array(image) for image in images]
353
+
354
+ if is_scaled_image(images[0]) and do_rescale:
355
+ logger.warning_once(
356
+ "It looks like you are trying to rescale already rescaled images. If the input"
357
+ " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
358
+ )
359
+ if input_data_format is None:
360
+ # We assume that all images have the same channel dimension format.
361
+ input_data_format = infer_channel_dimension_format(images[0])
362
+
363
+ height, width = get_image_size(images[0], channel_dim=input_data_format)
364
+ resized_height, resized_width = height, width
365
+ processed_images = []
366
+ for image in images:
367
+ if do_resize:
368
+ resized_height, resized_width = target_size
369
+ image = resize(
370
+ image, size=(resized_height, resized_width), resample=resample, input_data_format=input_data_format
371
+ )
372
+
373
+ if do_rescale:
374
+ image = self.rescale(image, scale=rescale_factor, input_data_format=input_data_format)
375
+
376
+ if do_normalize:
377
+ image = self.normalize(
378
+ image=image, mean=image_mean, std=image_std, input_data_format=input_data_format
379
+ )
380
+
381
+ image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
382
+ processed_images.append(image)
383
+
384
+ patches = np.array(processed_images)
385
+ if data_format == ChannelDimension.LAST:
386
+ patches = patches.transpose(0, 3, 1, 2)
387
+ t = patches.shape[0]
388
+ channel = patches.shape[1]
389
+ grid_h, grid_w = resized_height // self.patch_size, resized_width // self.patch_size
390
+ patches = patches.reshape(
391
+ t,
392
+ channel,
393
+ grid_h // merge_size,
394
+ merge_size,
395
+ self.patch_size,
396
+ grid_w // merge_size,
397
+ merge_size,
398
+ self.patch_size,
399
+ )
400
+ patches = patches.transpose(0, 2, 5, 3, 6, 1, 4, 7)
401
+ flatten_patches = patches.reshape(
402
+ t * grid_h * grid_w, channel * self.patch_size * self.patch_size
403
+ )
404
+
405
+ return flatten_patches, (t, grid_h, grid_w)
406
+
407
+ def preprocess(
408
+ self,
409
+ images: ImageInput,
410
+ do_resize: bool = None,
411
+ resample: PILImageResampling = None,
412
+ do_rescale: bool = None,
413
+ rescale_factor: float = None,
414
+ do_normalize: bool = None,
415
+ image_mean: Optional[Union[float, List[float]]] = None,
416
+ image_std: Optional[Union[float, List[float]]] = None,
417
+ do_convert_rgb: bool = None,
418
+ merge_size: Optional[Union[int, List[int]]] = None,
419
+ frame_types: Optional[Union[int, List[int]]] = None,
420
+ return_tensors: Optional[Union[str, TensorType]] = None,
421
+ data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
422
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
423
+ ):
424
+ """
425
+ Args:
426
+ images (`ImageInput`):
427
+ Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
428
+ passing in images with pixel values between 0 and 1, set `do_rescale=False`.
429
+ do_resize (`bool`, *optional*, defaults to `self.do_resize`):
430
+ Whether to resize the image.
431
+ resample (`int`, *optional*, defaults to `self.resample`):
432
+ Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
433
+ has an effect if `do_resize` is set to `True`.
434
+ do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
435
+ Whether to rescale the image.
436
+ rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
437
+ Rescale factor to rescale the image by if `do_rescale` is set to `True`.
438
+ do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
439
+ Whether to normalize the image.
440
+ image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
441
+ Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
442
+ image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
443
+ Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
444
+ `True`.
445
+ do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
446
+ Whether to convert the image to RGB.
447
+ return_tensors (`str` or `TensorType`, *optional*):
448
+ The type of tensors to return. Can be one of:
449
+ - Unset: Return a list of `np.ndarray`.
450
+ - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
451
+ - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
452
+ - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
453
+ - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
454
+ data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
455
+ The channel dimension format for the output image. Can be one of:
456
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
457
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
458
+ - Unset: Use the channel dimension format of the input image.
459
+ input_data_format (`ChannelDimension` or `str`, *optional*):
460
+ The channel dimension format for the input image. If unset, the channel dimension format is inferred
461
+ from the input image. Can be one of:
462
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
463
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
464
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
465
+
466
+ """
467
+ do_resize = do_resize if do_resize is not None else self.do_resize
468
+ resample = resample if resample is not None else self.resample
469
+ do_rescale = do_rescale if do_rescale is not None else self.do_rescale
470
+ rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
471
+ do_normalize = do_normalize if do_normalize is not None else self.do_normalize
472
+ image_mean = image_mean if image_mean is not None else self.image_mean
473
+ image_std = image_std if image_std is not None else self.image_std
474
+ merge_size = merge_size if merge_size is not None else self.merge_size
475
+ do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
476
+
477
+ clips = make_batched_images(images)
478
+ num_clips = len(clips)
479
+
480
+ if isinstance(merge_size, (list, tuple)):
481
+ assert len(merge_size) == num_clips, (
482
+ f"merge_size length ({len(merge_size)}) must match number of clips ({num_clips})"
483
+ )
484
+ clip_merge_sizes = list(merge_size)
485
+ else:
486
+ clip_merge_sizes = [merge_size] * num_clips
487
+
488
+ if frame_types is None:
489
+ clip_frame_types = [None] * num_clips
490
+ elif isinstance(frame_types, (list, tuple)) and len(frame_types) > 0:
491
+ if isinstance(frame_types[0], (list, tuple)) or frame_types[0] is None:
492
+ assert len(frame_types) == num_clips, (
493
+ f"frame_types length ({len(frame_types)}) must match number of clips ({num_clips})"
494
+ )
495
+ clip_frame_types = list(frame_types)
496
+ else:
497
+ assert num_clips == 1, "Flat frame_types is only supported for a single clip"
498
+ clip_frame_types = [frame_types]
499
+ else:
500
+ clip_frame_types = [None] * num_clips
501
+
502
+ pixel_values, grid_sizes, per_frame_merge_sizes = [], [], []
503
+
504
+ clip_max_tokens_list = self._allocate_token_budget(
505
+ clips, clip_merge_sizes, input_data_format,
506
+ )
507
+
508
+ for clip, ms, ft, clip_max_tokens in zip(clips, clip_merge_sizes, clip_frame_types, clip_max_tokens_list):
509
+ target_sizes = simple_batched_resize(
510
+ clip,
511
+ factor=self.patch_size * ms,
512
+ min_tokens=self.min_tokens,
513
+ max_tokens=clip_max_tokens,
514
+ input_data_format=input_data_format,
515
+ frame_types=ft,
516
+ )
517
+
518
+ for frame, target_size in zip(clip, target_sizes):
519
+ patches, grid_size = self._preprocess(
520
+ frame,
521
+ target_size=target_size,
522
+ merge_size=ms,
523
+ do_resize=do_resize,
524
+ resample=resample,
525
+ do_rescale=do_rescale,
526
+ rescale_factor=rescale_factor,
527
+ do_normalize=do_normalize,
528
+ image_mean=image_mean,
529
+ image_std=image_std,
530
+ data_format=data_format,
531
+ do_convert_rgb=do_convert_rgb,
532
+ input_data_format=input_data_format,
533
+ )
534
+ pixel_values.append(patches)
535
+ grid_sizes.append(grid_size)
536
+ per_frame_merge_sizes.append(ms)
537
+
538
+ pixel_values = np.concatenate(pixel_values, axis=0)
539
+ grid_sizes = np.array(grid_sizes)
540
+ merge_sizes = np.array(per_frame_merge_sizes)
541
+
542
+ data = {
543
+ "pixel_values": pixel_values,
544
+ "grid_sizes": grid_sizes,
545
+ "merge_sizes": merge_sizes,
546
+ }
547
+
548
+ return BatchFeature(data=data, tensor_type=return_tensors)
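The budget arithmetic in `simple_batched_resize` (Stages 1 to 4) can be sketched on its own, in pixel areas and before alignment to the patch grid. `tra_areas` and its `ratio` parameter are hypothetical names introduced for illustration. The sketch assumes at least one key frame. Because the diff view drops indentation, it also assumes the key-frame floor applies only in the over-budget branch.

```python
def tra_areas(num_key, num_intermediate, raw_area, max_pixels, min_pixels, ratio=16.0):
    """Return (key_area, intermediate_area) in pixels, before grid alignment.

    Assumes num_key >= 1, as the original computation divides by it.
    """
    # Stage 1: if all frames fit at their original size, keep them unchanged.
    total_raw = (num_key + num_intermediate) * raw_area
    if total_raw <= max_pixels:
        return float(raw_area), float(raw_area)

    # Stage 2: solve N_key * A + N_int * (A / ratio) = max_pixels for A.
    key_area = max_pixels / (num_key + num_intermediate / ratio)
    inter_area = key_area / ratio

    # Stage 3: pin intermediate frames to the floor and give the rest to key frames.
    if inter_area < min_pixels:
        inter_area = float(min_pixels)
        key_area = (max_pixels - num_intermediate * min_pixels) / num_key

    # Stage 4: key frames never drop below the floor either.
    return max(key_area, float(min_pixels)), inter_area


# Two key frames and 32 intermediate frames sharing a 4000-pixel budget.
balanced = tra_areas(2, 32, raw_area=1000, max_pixels=4000, min_pixels=10)
floored = tra_areas(2, 32, raw_area=1000, max_pixels=4000, min_pixels=100)
print(balanced, floored)
```

In the first call the 1:16 split uses the budget exactly (2 × 1000 + 32 × 62.5 = 4000). In the second, the intermediate floor takes 3200 pixels and the key frames split the remaining 800.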
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8c42f10b42add06635ad0d38092114ad07ec24931290aaf12105fa4503a74ab2
3
+ size 1764318128
modeling_penguinvl_encoder.py ADDED
@@ -0,0 +1,548 @@
1
+ from torch import nn
2
+ import torch
3
+ import math
4
+ import warnings
5
+ from functools import partial
6
+ from .configuration_penguinvl_encoder import PenguinVLVisionEncoderConfig
7
+ from transformers.modeling_utils import PreTrainedModel
8
+ from transformers.models.qwen3.modeling_qwen3 import Qwen3Model, Qwen3Attention, rotate_half, Qwen3DecoderLayer
9
+ from typing import List, Optional, Tuple, Union
10
+ from transformers.modeling_outputs import BaseModelOutputWithPast
11
+ from transformers.processing_utils import Unpack
12
+ from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
13
+ from transformers.cache_utils import Cache, DynamicCache
14
+ from transformers.utils import logging, is_flash_attn_greater_or_equal_2_10, is_flash_attn_2_available
15
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
16
+ from torch.nn.init import _calculate_fan_in_and_fan_out
17
+ import torch.nn.functional as F
18
+ if is_flash_attn_2_available():
19
+ from transformers.modeling_flash_attention_utils import _flash_attention_forward
20
+ from flash_attn import flash_attn_varlen_func
21
+
22
+ logger = logging.get_logger(__name__)
23
+
24
+ class PenguinVLVisionEncoderEmbeddings(nn.Module):
25
+
26
+ def __init__(self, config: PenguinVLVisionEncoderConfig):
27
+ super().__init__()
28
+ self.config = config
29
+ self.embed_dim = config.hidden_size
30
+ self.patch_size = config.patch_size
31
+
32
+ self.patch_embedding = nn.Conv2d(
33
+ in_channels=config.num_channels,
34
+ out_channels=self.embed_dim,
35
+ kernel_size=self.patch_size,
36
+ stride=self.patch_size,
37
+ padding="valid",
38
+ )
39
+
40
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
41
+ hidden_states = hidden_states.view(
42
+ -1, self.config.num_channels, self.patch_size, self.patch_size
43
+ )
44
+ patch_embeds = self.patch_embedding(hidden_states)
45
+ embeddings = patch_embeds.view(-1, self.embed_dim)
46
+
47
+ return embeddings
48
+
49
+
50
+ # Adapted from Qwen2VLRotaryEmbedding in transformers/models/qwen2_vl/modeling_qwen2_vl.py
51
+ class VisualRotaryEmbedding(nn.Module):
52
+ def __init__(
53
+ self,
54
+ dim=None,
55
+ max_position_embeddings=2048,
56
+ base=10000,
57
+ device=None,
58
+ scaling_factor=1.0,
59
+ rope_type="default",
60
+ config = None,
61
+ ):
62
+ super().__init__()
63
+ # TODO (joao): remove the `if` below, only used for BC
64
+ self.rope_kwargs = {}
65
+ if config is None:
66
+ logger.warning_once(
67
+ "`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the "
68
+ "`config` argument. All other arguments will be removed in v4.46"
69
+ )
70
+ self.rope_kwargs = {
71
+ "rope_type": rope_type,
72
+ "factor": scaling_factor,
73
+ "dim": dim,
74
+ "base": base,
75
+ "max_position_embeddings": max_position_embeddings,
76
+ }
77
+ self.rope_type = rope_type
78
+ self.max_seq_len_cached = max_position_embeddings
79
+ self.original_max_seq_len = max_position_embeddings
80
+ else:
81
+ # BC: "rope_type" was originally "type"
82
+ if config.rope_scaling is not None:
83
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
84
+ else:
85
+ self.rope_type = "default"
86
+ self.max_seq_len_cached = config.max_position_embeddings
87
+ self.original_max_seq_len = config.max_position_embeddings
88
+
89
+ self.config = config
90
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
91
+
92
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, **self.rope_kwargs)
93
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
94
+ self.original_inv_freq = self.inv_freq
95
+
96
+ def _dynamic_frequency_update(self, position_ids, device):
97
+ """
98
+ dynamic RoPE layers should recompute `inv_freq` in the following situations:
99
+ 1 - growing beyond the cached sequence length (allow scaling)
100
+ 2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
101
+ """
102
+ seq_len = torch.max(position_ids) + 1
103
+ if seq_len > self.max_seq_len_cached: # growth
104
+ inv_freq, self.attention_scaling = self.rope_init_fn(
105
+ self.config, device, seq_len=seq_len, **self.rope_kwargs
106
+ )
107
+ self.register_buffer("inv_freq", inv_freq, persistent=False) # TODO joao: may break with compilation
108
+ self.max_seq_len_cached = seq_len
109
+
110
+ if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len: # reset
111
+ self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
112
+ self.max_seq_len_cached = self.original_max_seq_len
113
+
114
+ @torch.no_grad()
115
+ def forward(self, x, position_ids):
116
+ if "dynamic" in self.rope_type:
117
+ self._dynamic_frequency_update(position_ids, device=x.device)
118
+
119
+ inv_freq_expanded = self.inv_freq[None, None, :, None].float().expand(2, position_ids.shape[1], -1, 1)
120
+ position_ids_expanded = position_ids[:, :, None, :].float() # shape (2, bs, 1, positions)
121
+ # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
122
+ device_type = x.device.type
123
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
124
+ with torch.autocast(device_type=device_type, enabled=False):
125
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(2, 3)
126
+ emb = torch.cat((freqs, freqs), dim=-1)
127
+ cos = emb.cos()
128
+ sin = emb.sin()
129
+
130
+ # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
131
+ cos = cos * self.attention_scaling
132
+ sin = sin * self.attention_scaling
133
+
134
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
135
+
136
+
+ def apply_multimodal_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
+     # cos/sin have shape (2, bs, positions, head_dim); index 0 holds the height positions
+     # and index 1 the width positions. The first half of the channels takes its angles
+     # from the height axis and the second half from the width axis.
+     rope_section = [cos.shape[-1] // 2, cos.shape[-1] // 2]
+     cos = torch.cat([m[i % 2] for i, m in enumerate(cos.split(rope_section, dim=-1))], dim=-1).unsqueeze(unsqueeze_dim)
+     sin = torch.cat([m[i % 2] for i, m in enumerate(sin.split(rope_section, dim=-1))], dim=-1).unsqueeze(unsqueeze_dim)
+
+     q_embed = (q * cos) + (rotate_half(q) * sin)
+     k_embed = (k * cos) + (rotate_half(k) * sin)
+     return q_embed, k_embed
+
+
+ class PenguinVLAttention(Qwen3Attention):
+     """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+     # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
+     def __init__(self, *args, **kwargs):
+         super().__init__(*args, **kwargs)
+         self.is_causal = False
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         position_embeddings: Tuple[torch.Tensor, torch.Tensor],
+         attention_mask: Optional[torch.Tensor],
+         past_key_value: Optional[Cache] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         cu_seqlens: Optional[torch.Tensor] = None,
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         input_shape = hidden_states.shape[:-1]
+         hidden_shape = (*input_shape, -1, self.head_dim)
+
+         query_states = self.q_norm(self.q_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
+         key_states = self.k_norm(self.k_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
+         value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+
+         cos, sin = position_embeddings
+         query_states, key_states = apply_multimodal_rotary_pos_emb(query_states, key_states, cos, sin)
+
+         if past_key_value is not None:
+             # sin and cos are specific to RoPE models; cache_position needed for the static cache
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+             key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+         # This is before the transpose
+         seq_len = query_states.shape[2]
+
+         # In PEFT, layer norms are usually cast to float32 for training stability, so the input
+         # hidden states are silently cast to float32 as well. We therefore need to cast them back
+         # to the correct dtype to be sure everything works as expected. This might slow down
+         # training and inference, so it is recommended not to cast the LayerNorms to fp32
+         # (our RMSNorm modules usually handle this correctly).
+         target_dtype = None
+         if query_states.dtype == torch.float32:
+             if torch.is_autocast_enabled():
+                 target_dtype = torch.get_autocast_gpu_dtype()
+             # Handle the case where the model is quantized
+             elif hasattr(self.config, "_pre_quantization_dtype"):
+                 target_dtype = self.config._pre_quantization_dtype
+             else:
+                 target_dtype = next(layer for layer in self.modules() if isinstance(layer, torch.nn.Linear)).weight.dtype
+
+         # FA2 always relies on the value set in the module, so remove it if present in kwargs to avoid passing it twice
+         kwargs.pop("is_causal", None)
+
+         # Reshape to the layout expected by Flash Attention
+         query_states = query_states.transpose(1, 2).squeeze(0)
+         key_states = key_states.transpose(1, 2).squeeze(0)
+         value_states = value_states.transpose(1, 2).squeeze(0)
+
+         max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()
+         attn_output = flash_attn_varlen_func(
+             query_states,
+             key_states,
+             value_states,
+             cu_seqlens_q=cu_seqlens,
+             cu_seqlens_k=cu_seqlens,
+             max_seqlen_q=max_seqlen,
+             max_seqlen_k=max_seqlen,
+             dropout_p=0.0 if not self.training else self.attention_dropout,
+             causal=self.is_causal
+         )
+
+         attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+         attn_output = self.o_proj(attn_output)
+         return attn_output, None
+
+
+ class PenguinVLDecoderLayer(Qwen3DecoderLayer):
+     def __init__(self, config: PenguinVLVisionEncoderConfig, layer_idx: int):
+         super(PenguinVLDecoderLayer, self).__init__(config, layer_idx)
+         self.self_attn = PenguinVLAttention(config, layer_idx)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Cache] = None,
+         output_attentions: Optional[bool] = False,
+         use_cache: Optional[bool] = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
+         cu_seqlens: Optional[torch.Tensor] = None,
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+         residual = hidden_states
+
+         hidden_states = self.input_layernorm(hidden_states)
+
+         # Self Attention
+         hidden_states, self_attn_weights = self.self_attn(
+             hidden_states=hidden_states,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_value=past_key_value,
+             output_attentions=output_attentions,
+             use_cache=use_cache,
+             cache_position=cache_position,
+             position_embeddings=position_embeddings,
+             cu_seqlens=cu_seqlens,
+             **kwargs,
+         )
+         hidden_states = residual + hidden_states
+
+         # Fully Connected
+         residual = hidden_states
+         hidden_states = self.post_attention_layernorm(hidden_states)
+         hidden_states = self.mlp(hidden_states)
+         hidden_states = residual + hidden_states
+
+         outputs = (hidden_states,)
+         if output_attentions:
+             outputs += (self_attn_weights,)
+
+         return outputs
+
+
+ class PenguinVLVisionEncoderFromQwen3Model(Qwen3Model):
+     def __init__(self, config: PenguinVLVisionEncoderConfig):
+         super().__init__(config)
+         self.layers = nn.ModuleList(
+             [PenguinVLDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+         )
+         self.rotary_emb = VisualRotaryEmbedding(config=config)
+         del self.embed_tokens
+
+     @staticmethod
+     def _prepare_4d_causal_attention_mask_with_cache_position(
+         attention_mask: torch.Tensor,
+         sequence_length: int,
+         target_length: int,
+         dtype: torch.dtype,
+         device: torch.device,
+         cache_position: torch.Tensor,
+         batch_size: int,
+         config: PenguinVLVisionEncoderConfig,
+         past_key_values: Cache,
+     ):
+         """
+         Override the original causal mask method to create full attention mask instead.
+         Creates a full attention 4D mask of shape `(batch_size, 1, query_length, key_value_length)`
+         from a 2D mask of shape `(batch_size, key_value_length)`.
+
+         For vision encoding, we want full attention between all patches, not causal attention.
+         """
+         if attention_mask is not None and attention_mask.dim() == 4:
+             # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
+             full_attention_mask = attention_mask
+         else:
+             # Create full attention mask (all zeros, meaning attend to all positions)
+             # We only mask based on the provided attention_mask for padding
+             if attention_mask is not None:
+                 # Use the provided attention_mask to handle padding
+                 min_dtype = torch.finfo(dtype).min
+                 full_attention_mask = torch.zeros(
+                     (sequence_length, target_length), dtype=dtype, device=device
+                 )
+                 # Expand to 4D
+                 full_attention_mask = full_attention_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
+
+                 # Apply padding mask if provided
+                 full_attention_mask = full_attention_mask.clone()  # copy to contiguous memory for in-place edit
+                 if attention_mask.shape[-1] > target_length:
+                     attention_mask = attention_mask[:, :target_length]
+                 mask_length = attention_mask.shape[-1]
+                 padding_mask = attention_mask[:, None, None, :] == 0
+                 full_attention_mask[:, :, :, :mask_length] = full_attention_mask[:, :, :, :mask_length].masked_fill(
+                     padding_mask, min_dtype
+                 )
+             else:
+                 # No attention mask provided, create all-zeros mask (full attention)
+                 full_attention_mask = torch.zeros(
+                     (batch_size, 1, sequence_length, target_length), dtype=dtype, device=device
+                 )
+         return full_attention_mask
+
+     def get_rope_index(self, grid_sizes, merge_sizes, position_ids):
+         position_ids = position_ids.contiguous()
+         batch_size = grid_sizes.shape[0]
+
+         # Vision Part: Generate 2D position indices for vision tokens
+         vision_pos_ids = []
+         for (t, h, w), merge_size in zip(grid_sizes, merge_sizes):
+             # Generate height position indices
+             hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w).to(position_ids.device)
+             hpos_ids = hpos_ids.reshape(
+                 h // merge_size,
+                 merge_size,
+                 w // merge_size,
+                 merge_size,
+             )
+             hpos_ids = hpos_ids.permute(0, 2, 1, 3)
+             hpos_ids = hpos_ids.flatten()
+
+             # Generate width position indices
+             wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1).to(position_ids.device)
+             wpos_ids = wpos_ids.reshape(
+                 h // merge_size,
+                 merge_size,
+                 w // merge_size,
+                 merge_size,
+             )
+             wpos_ids = wpos_ids.permute(0, 2, 1, 3)
+             wpos_ids = wpos_ids.flatten()
+
+             # Stack height and width to create 2D positions
+             vision_pos_ids.append(torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1))
+
+         num_start_idx = 0
+         for batch_idx in range(batch_size):
+             pos_len = vision_pos_ids[batch_idx].shape[0]
+             position_ids[:, 0, num_start_idx : num_start_idx + pos_len] = vision_pos_ids[batch_idx].permute(1, 0)
+             num_start_idx += pos_len
+
+         return position_ids
+
+
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[Cache] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         grid_sizes: Optional[torch.Tensor] = None,
+         merge_sizes: Optional[torch.Tensor] = None,
+         **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
+     ) -> BaseModelOutputWithPast:
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+         if (input_ids is None) ^ (inputs_embeds is not None):
+             raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+         if self.gradient_checkpointing and self.training and use_cache:
+             logger.warning_once(
+                 "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
+             )
+             use_cache = False
+
+         # TODO (joao): remove this exception in v4.56 -- it exists for users that try to pass a legacy cache
+         if not isinstance(past_key_values, (type(None), Cache)):
+             raise ValueError("The `past_key_values` should be either a `Cache` object or `None`.")
+
+         if inputs_embeds is None:
+             inputs_embeds = self.embed_tokens(input_ids)
+
+         if use_cache and past_key_values is None:
+             past_key_values = DynamicCache()
+
+         if cache_position is None:
+             past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+             cache_position = torch.arange(
+                 past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+             )
+
+         # the hard-coded `2` is for the two spatial axes (height and width).
+         if position_ids is None:
+             position_ids = cache_position.view(1, 1, -1).expand(2, inputs_embeds.shape[0], -1)
+         elif position_ids.dim() == 2:
+             position_ids = position_ids[None, ...].expand(2, position_ids.shape[0], -1)
+         position_ids = self.get_rope_index(grid_sizes, merge_sizes, position_ids)
+
+         causal_mask = None
+
+         hidden_states = inputs_embeds
+
+         # create position embeddings to be shared across the decoder layers
+         position_embeddings = self.rotary_emb(hidden_states, position_ids)
+
+         # decoder layers
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attns = () if output_attentions else None
+
+         # Calculate cumulative sequence lengths for the grid sizes
+         cu_seqlens = torch.repeat_interleave(grid_sizes[:, 1] * grid_sizes[:, 2], grid_sizes[:, 0]).cumsum(dim=0, dtype=torch.int32)
+         cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)
+
+         for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+             if output_hidden_states:
+                 all_hidden_states += (hidden_states,)
+
+             if self.gradient_checkpointing and self.training:
+                 layer_outputs = self._gradient_checkpointing_func(
+                     partial(decoder_layer.__call__, **flash_attn_kwargs),
+                     hidden_states,
+                     causal_mask,
+                     position_ids,
+                     past_key_values,
+                     output_attentions,
+                     use_cache,
+                     cache_position,
+                     position_embeddings,
+                     cu_seqlens,
+                 )
+             else:
+                 layer_outputs = decoder_layer(
+                     hidden_states,
+                     attention_mask=causal_mask,
+                     position_ids=position_ids,
+                     past_key_value=past_key_values,
+                     output_attentions=output_attentions,
+                     use_cache=use_cache,
+                     cache_position=cache_position,
+                     position_embeddings=position_embeddings,
+                     cu_seqlens=cu_seqlens,
+                     **flash_attn_kwargs,
+                 )
+
+             hidden_states = layer_outputs[0]
+
+             if output_attentions:
+                 all_self_attns += (layer_outputs[1],)
+
+         hidden_states = self.norm(hidden_states)
+
+         # add hidden states from the last decoder layer
+         if output_hidden_states:
+             all_hidden_states += (hidden_states,)
+
+         return BaseModelOutputWithPast(
+             last_hidden_state=hidden_states,
+             past_key_values=past_key_values if use_cache else None,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attns,
+         )
+
+
+ class PenguinVLVisionEncoderModel(PreTrainedModel):
+
+     config_class = PenguinVLVisionEncoderConfig
+     base_model_prefix = "penguinvl_vision_encoder"
+     main_input_name = "pixel_values"
+     supports_gradient_checkpointing = True
+     _no_split_modules = [
+         "PenguinVLVisionEncoderEmbeddings",
+     ]
+     _supports_flash_attn_2 = True
+     _supports_sdpa = True
+
+     def __init__(self, config: PenguinVLVisionEncoderConfig):
+         super().__init__(config=config)
+         self.embeddings = PenguinVLVisionEncoderEmbeddings(config)
+         self.encoder = PenguinVLVisionEncoderFromQwen3Model(config)
+
+         self.post_init()
+
+
+     def forward(self, pixel_values, grid_sizes, merge_sizes=None) -> torch.Tensor:
+         if merge_sizes is None:
+             # The code below iterates over merge_sizes, so None cannot be passed through.
+             # Assumption: fall back to a factor of 1, i.e. no spatial downsampling.
+             merge_sizes = torch.ones_like(grid_sizes[:, 0])
+
+         hidden_states = self.embeddings(pixel_values)
+         encoder_output = self.encoder(
+             inputs_embeds=hidden_states[None, ...],
+             grid_sizes=grid_sizes,
+             merge_sizes=merge_sizes,
+             output_hidden_states=True,
+         )
+         hidden_states = encoder_output.hidden_states
+         hidden_states = hidden_states[-1].squeeze(0)
+
+         hidden_states_chunks = hidden_states.split(grid_sizes.prod(dim=1).tolist(), dim=0)
+         outputs = []
+
+         for hidden_states, grid_size, merge_size in zip(hidden_states_chunks, grid_sizes, merge_sizes):
+             # NOTE: previous implementation, which supports downsampling with any factor
+             c = hidden_states.shape[-1]
+             hidden_states = hidden_states.view(
+                 grid_size[0], grid_size[1] // merge_size, grid_size[2] // merge_size, merge_size, merge_size, c
+             ).permute(0, 1, 3, 2, 4, 5)
+             hidden_states = hidden_states.reshape(
+                 grid_size[0], grid_size[1], grid_size[2], c
+             ).permute(0, 3, 1, 2)
+             hidden_states = torch.nn.functional.interpolate(
+                 hidden_states,
+                 size=(grid_size[1] // merge_size, grid_size[2] // merge_size),
+                 mode='bilinear'
+             )
+             hidden_states = hidden_states.permute(0, 2, 3, 1).view(-1, c)
+
+             # NOTE: simplified implementation, which only supports downsampling with integer factor
+             # NOTE: this implementation is mathematically equivalent to the previous one when merge_size is 1 or 2,
+             # but may give slightly different numerical results
+             # hidden_states = hidden_states.view(-1, merge_size * merge_size, hidden_states.size(-1))
+             # hidden_states = hidden_states.mean(dim=1)
+
+             outputs.append(hidden_states)
+         return torch.cat(outputs, dim=0)
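Review note: the encoder relies on two bookkeeping steps that are easy to misread in tensor form. `cu_seqlens` marks one attention segment per frame, and `get_rope_index` lists patch positions one merge window at a time rather than in plain row-major order. The sketch below restates both in plain Python, as an illustration only: the helper names `cu_seqlens_from_grids` and `window_ordered_positions` are hypothetical and do not exist in this repository, and it assumes `h` and `w` are divisible by `merge_size`, as the `reshape` calls in the file require.

```python
from itertools import accumulate


def cu_seqlens_from_grids(grid_sizes):
    """Cumulative segment boundaries, one segment of h * w patches per frame.

    Plain-Python counterpart of
    repeat_interleave(h * w, t).cumsum(0) followed by padding a leading 0.
    """
    per_frame = []
    for t, h, w in grid_sizes:
        per_frame.extend([h * w] * t)
    return [0] + list(accumulate(per_frame))


def window_ordered_positions(h, w, merge_size):
    """(row, col) of every patch, visiting one merge window at a time.

    Matches reshape(h // m, m, w // m, m).permute(0, 2, 1, 3).flatten()
    applied to the row-index and column-index grids.
    """
    positions = []
    for block_row in range(h // merge_size):
        for block_col in range(w // merge_size):
            for dr in range(merge_size):
                for dc in range(merge_size):
                    positions.append(
                        (block_row * merge_size + dr, block_col * merge_size + dc)
                    )
    return positions


# One 4x4 image followed by a two-frame 2x2 clip.
print(cu_seqlens_from_grids([(1, 4, 4), (2, 2, 2)]))  # [0, 16, 20, 24]
# The first two 2x2 windows of a 4x4 grid.
print(window_ordered_positions(4, 4, 2)[:8])
# [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
```

Because each frame gets its own segment, `flash_attn_varlen_func` lets patches attend only to patches of the same frame, and the window ordering keeps the `merge_size * merge_size` patches that are later pooled together adjacent in the sequence.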
preprocessor_config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "auto_map": {
+     "AutoImageProcessor": "image_processing_penguinvl.PenguinVLImageProcessor"
+   },
+   "do_convert_rgb": true,
+   "do_normalize": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "image_mean": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "image_processor_type": "PenguinVLImageProcessor",
+   "image_std": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "max_tokens": 16384,
+   "min_tokens": 16,
+   "patch_size": 14,
+   "resample": 3,
+   "rescale_factor": 0.00392156862745098
+ }
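Review note: `rescale_factor` is 1/255, and with a mean and standard deviation of 0.5 per channel the two steps together map 8-bit pixel values onto roughly [-1, 1]. The sketch below shows that arithmetic on a single RGB triple. It assumes the usual image-processor order of rescaling first and normalizing second. `PenguinVLImageProcessor` itself is not part of this hunk, so that order is an assumption, and `preprocess_pixel` is a hypothetical helper written only for this illustration.

```python
# Values copied from preprocessor_config.json.
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255
IMAGE_MEAN = (0.5, 0.5, 0.5)
IMAGE_STD = (0.5, 0.5, 0.5)


def preprocess_pixel(rgb):
    """Rescale an 8-bit RGB triple to [0, 1], then normalize each channel.

    Assumed order: rescale, then (x - mean) / std.
    """
    return tuple(
        (value * RESCALE_FACTOR - mean) / std
        for value, mean, std in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    )


# Black maps to about -1, white to about +1, mid-grey to about 0.
print(preprocess_pixel((0, 255, 128)))
```

The remaining keys are not exercised here. `resample: 3` corresponds to bicubic resampling in Pillow's filter numbering, and how `min_tokens`, `max_tokens`, and `patch_size` bound the resized image depends on the processor code, which this hunk does not show.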