lvyufeng committed on
Commit 629b298 · verified · 1 Parent(s): 6cc7e1d

Upload folder using huggingface_hub

.gitignore ADDED
@@ -0,0 +1,6 @@
+ *.swp
+
+ *.bak
+ *.bak*
+
+ bak/
LICENSE ADDED
@@ -0,0 +1,78 @@
+ TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT
+ Tencent HunyuanOCR Release Date: November 25, 2025
+ THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION, UNITED KINGDOM AND SOUTH KOREA AND IS EXPRESSLY LIMITED TO THE TERRITORY, AS DEFINED BELOW.
+ By clicking to agree or by using, reproducing, modifying, distributing, performing or displaying any portion or element of the Tencent Hunyuan Works, including via any Hosted Service, You will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
+ 1. DEFINITIONS.
+ a. “Acceptable Use Policy” shall mean the policy made available by Tencent as set forth in the Exhibit A.
+ b. “Agreement” shall mean the terms and conditions for use, reproduction, distribution, modification, performance and displaying of Tencent Hunyuan Works or any portion or element thereof set forth herein.
+ c. “Documentation” shall mean the specifications, manuals and documentation for Tencent Hunyuan made publicly available by Tencent.
+ d. “Hosted Service” shall mean a hosted service offered via an application programming interface (API), web access, or any other electronic or remote means.
+ e. “Licensee,” “You” or “Your” shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Tencent Hunyuan Works for any purpose and in any field of use.
+ f. “Materials” shall mean, collectively, Tencent’s proprietary Tencent Hunyuan and Documentation (and any portion thereof) as made available by Tencent under this Agreement.
+ g. “Model Derivatives” shall mean all: (i) modifications to Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; (ii) works based on Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of Tencent Hunyuan or any Model Derivative of Tencent Hunyuan, to that model in order to cause that model to perform similarly to Tencent Hunyuan or a Model Derivative of Tencent Hunyuan, including distillation methods, methods that use intermediate data representations, or methods based on the generation of synthetic data Outputs by Tencent Hunyuan or a Model Derivative of Tencent Hunyuan for training that model. For clarity, Outputs by themselves are not deemed Model Derivatives.
+ h. “Output” shall mean the information and/or content output of Tencent Hunyuan or a Model Derivative that results from operating or otherwise using Tencent Hunyuan or a Model Derivative, including via a Hosted Service.
+ i. “Tencent,” “We” or “Us” shall mean the applicable entity or entities in the Tencent corporate family that own(s) intellectual property or other rights embodied in or utilized by the Materials.
+ j. “Tencent Hunyuan” shall mean the large language models, text/image/video/audio/3D generation models, and multimodal large language models and their software and algorithms, including trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing made publicly available by Us, including, without limitation to, Tencent HunyuanOCR released at [https://huggingface.co/tencent/HunyuanOCR].
+ k. “Tencent Hunyuan Works” shall mean: (i) the Materials; (ii) Model Derivatives; and (iii) all derivative works thereof.
+ l. “Territory” shall mean the worldwide territory, excluding the territory of the European Union, United Kingdom and South Korea.
+ m. “Third Party” or “Third Parties” shall mean individuals or legal entities that are not under common control with Us or You.
+ n. “including” shall mean including but not limited to.
+ 2. GRANT OF RIGHTS.
+ We grant You, for the Territory only, a non-exclusive, non-transferable and royalty-free limited license under Tencent’s intellectual property or other rights owned by Us embodied in or utilized by the Materials to use, reproduce, distribute, create derivative works of (including Model Derivatives), and make modifications to the Materials, only in accordance with the terms of this Agreement and the Acceptable Use Policy, and You must not violate (or encourage or permit anyone else to violate) any term of this Agreement or the Acceptable Use Policy.
+ 3. DISTRIBUTION.
+ You may, subject to Your compliance with this Agreement, distribute or make available to Third Parties the Tencent Hunyuan Works, exclusively in the Territory, provided that You meet all of the following conditions:
+ a. You must provide all such Third Party recipients of the Tencent Hunyuan Works or products or services using them a copy of this Agreement;
+ b. You must cause any modified files to carry prominent notices stating that You changed the files;
+ c. You are encouraged to: (i) publish at least one technology introduction blogpost or one public statement expressing Your experience of using the Tencent Hunyuan Works; and (ii) mark the products or services developed by using the Tencent Hunyuan Works to indicate that the product/service is “Powered by Tencent Hunyuan”; and
+ d. All distributions to Third Parties (other than through a Hosted Service) must be accompanied by a “Notice” text file that contains the following notice: “Tencent Hunyuan is licensed under the Tencent Hunyuan Community License Agreement, Copyright © 2025 Tencent. All Rights Reserved. The trademark rights of “Tencent Hunyuan” are owned by Tencent or its affiliate.”
+ e. In the event that You use, integrate, implement, or otherwise deploy the Tencent Hunyuan Works, in whole or in part, to provide, enable, or support any service, product, or functionality to third parties, You shall clearly, accurately, and prominently disclose to all end users the full legal name and entity of the actual provider of such service, product, or functionality. You shall expressly and conspicuously state that Tencent is not affiliated with, associated with, sponsoring, or endorsing any such service, product, or functionality. You shall not use or display any name, logo, trademark, trade name, or other indicia of Tencent in any manner that could be construed as, or be likely to create, confusion, deception, or a false impression regarding any relationship, affiliation, sponsorship, or endorsement by Tencent.
+ You may add Your own copyright statement to Your modifications and, except as set forth in this Section and in Section 5, may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Model Derivatives as a whole, provided Your use, reproduction, modification, distribution, performance and display of the work otherwise complies with the terms and conditions of this Agreement (including as regards the Territory). If You receive Tencent Hunyuan Works from a Licensee as part of an integrated end user product, then this Section 3 of this Agreement will not apply to You.
+ 4. ADDITIONAL COMMERCIAL TERMS.
+ If, on the Tencent Hunyuan version release date, the monthly active users of all products or services made available by or for Licensee is greater than 100 million monthly active users in the preceding calendar month, You must request a license from Tencent, which Tencent may grant to You in its sole discretion, and You are not authorized to exercise any of the rights under this Agreement unless or until Tencent otherwise expressly grants You such rights.
+ 5. RULES OF USE.
+ a. Your use of the Tencent Hunyuan Works must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Tencent Hunyuan Works, which is hereby incorporated by reference into this Agreement. You must include the use restrictions referenced in these Sections 5(a) and 5(b) as an enforceable provision in any agreement (e.g., license agreement, terms of use, etc.) governing the use and/or distribution of Tencent Hunyuan Works and You must provide notice to subsequent users to whom You distribute that Tencent Hunyuan Works are subject to the use restrictions in these Sections 5(a) and 5(b).
+ b. You must not use the Tencent Hunyuan Works or any Output or results of the Tencent Hunyuan Works to improve any other AI model (other than Tencent Hunyuan or Model Derivatives thereof).
+ c. You must not use, reproduce, modify, distribute, or display the Tencent Hunyuan Works, Output or results of the Tencent Hunyuan Works outside the Territory. Any such use outside the Territory is unlicensed and unauthorized under this Agreement.
+ 6. INTELLECTUAL PROPERTY.
+ a. Subject to Tencent’s ownership of Tencent Hunyuan Works made by or for Tencent and intellectual property rights therein, conditioned upon Your compliance with the terms and conditions of this Agreement, as between You and Tencent, You will be the owner of any derivative works and modifications of the Materials and any Model Derivatives that are made by or for You.
+ b. No trademark licenses are granted under this Agreement, and in connection with the Tencent Hunyuan Works, Licensee may not use any name or mark owned by or associated with Tencent or any of its affiliates, except as required for reasonable and customary use in describing and distributing the Tencent Hunyuan Works. Tencent hereby grants You a license to use “Tencent Hunyuan” (the “Mark”) in the Territory solely as required to comply with the provisions of Section 3(c), provided that You comply with any applicable laws related to trademark protection. All goodwill arising out of Your use of the Mark will inure to the benefit of Tencent.
+ c. If You commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any person or entity alleging that the Materials or any Output, or any portion of any of the foregoing, infringe any intellectual property or other right owned or licensable by You, then all licenses granted to You under this Agreement shall terminate as of the date such lawsuit or other proceeding is filed. You will defend, indemnify and hold harmless Us from and against any claim by any Third Party arising out of or related to Your or the Third Party’s use or distribution of the Tencent Hunyuan Works.
+ d. Tencent claims no rights in Outputs You generate. You and Your users are solely responsible for Outputs and their subsequent uses.
+ 7. DISCLAIMERS OF WARRANTY AND LIMITATIONS OF LIABILITY.
+ a. We are not obligated to support, update, provide training for, or develop any further version of the Tencent Hunyuan Works or to grant any license thereto.
+ b. UNLESS AND ONLY TO THE EXTENT REQUIRED BY APPLICABLE LAW, THE TENCENT HUNYUAN WORKS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED “AS IS” WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES OF ANY KIND INCLUDING ANY WARRANTIES OF TITLE, MERCHANTABILITY, NONINFRINGEMENT, COURSE OF DEALING, USAGE OF TRADE, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING, REPRODUCING, MODIFYING, PERFORMING, DISPLAYING OR DISTRIBUTING ANY OF THE TENCENT HUNYUAN WORKS OR OUTPUTS AND ASSUME ANY AND ALL RISKS ASSOCIATED WITH YOUR OR A THIRD PARTY’S USE OR DISTRIBUTION OF ANY OF THE TENCENT HUNYUAN WORKS OR OUTPUTS AND YOUR EXERCISE OF RIGHTS AND PERMISSIONS UNDER THIS AGREEMENT.
+ c. TO THE FULLEST EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL TENCENT OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, FOR ANY DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, CONSEQUENTIAL OR PUNITIVE DAMAGES, OR LOST PROFITS OF ANY KIND ARISING FROM THIS AGREEMENT OR RELATED TO ANY OF THE TENCENT HUNYUAN WORKS OR OUTPUTS, EVEN IF TENCENT OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
+ 8. SURVIVAL AND TERMINATION.
+ a. The term of this Agreement shall commence upon Your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
+ b. We may terminate this Agreement if You breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, You must promptly delete and cease use of the Tencent Hunyuan Works. Sections 6(a), 6(c), 7 and 9 shall survive the termination of this Agreement.
+ 9. GOVERNING LAW AND JURISDICTION.
+ a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of the Hong Kong Special Administrative Region of the People’s Republic of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
+ b. Exclusive jurisdiction and venue for any dispute arising out of or relating to this Agreement will be a court of competent jurisdiction in the Hong Kong Special Administrative Region of the People’s Republic of China, and Tencent and Licensee consent to the exclusive jurisdiction of such court with respect to any such dispute.
+
+ EXHIBIT A
+ ACCEPTABLE USE POLICY
+
+ Tencent reserves the right to update this Acceptable Use Policy from time to time.
+ Last modified: November 5, 2024
+
+ Tencent endeavors to promote safe and fair use of its tools and features, including Tencent Hunyuan. You agree not to use Tencent Hunyuan or Model Derivatives:
+ 1. Outside the Territory;
+ 2. In any way that violates any applicable national, federal, state, local, international or any other law or regulation;
+ 3. To harm Yourself or others;
+ 4. To repurpose or distribute output from Tencent Hunyuan or any Model Derivatives to harm Yourself or others;
+ 5. To override or circumvent the safety guardrails and safeguards We have put in place;
+ 6. For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
+ 7. To generate or disseminate verifiably false information and/or content with the purpose of harming others or influencing elections;
+ 8. To generate or facilitate false online engagement, including fake reviews and other means of fake online engagement;
+ 9. To intentionally defame, disparage or otherwise harass others;
+ 10. To generate and/or disseminate malware (including ransomware) or any other content to be used for the purpose of harming electronic systems;
+ 11. To generate or disseminate personal identifiable information with the purpose of harming others;
+ 12. To generate or disseminate information (including images, code, posts, articles), and place the information in any public context (including through the use of bot-generated tweets), without expressly and conspicuously identifying that the information and/or content is machine generated;
+ 13. To impersonate another individual without consent, authorization, or legal right;
+ 14. To make high-stakes automated decisions in domains that affect an individual’s safety, rights or wellbeing (e.g., law enforcement, migration, medicine/health, management of critical infrastructure, safety components of products, essential services, credit, employment, housing, education, social scoring, or insurance);
+ 15. In a manner that violates or disrespects the social ethics and moral standards of other countries or regions;
+ 16. To perform, facilitate, threaten, incite, plan, promote or encourage violent extremism or terrorism;
+ 17. For any use intended to discriminate against or harm individuals or groups based on protected characteristics or categories, online or offline social behavior or known or predicted personal or personality characteristics;
+ 18. To intentionally exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
+ 19. For military purposes;
+ 20. To engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or other professional practices.
README.md ADDED
@@ -0,0 +1,243 @@
+ ---
+ license: other
+ language:
+ - zh
+ - en
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ ---
+
+ <p align="center">
+ <img src="https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/assets/hyocr-head-img.png?raw=true" width="80%"/> <br>
+ </p>
+
+
+ <p align="center">
+ <a href="https://huggingface.co/spaces/tencent/HunyuanOCR"><b>🎯 Demo</b></a> |
+ <a href="https://huggingface.co/tencent/HunyuanOCR"><b>📥 Model Download</b></a> |
+ <a href="https://arxiv.org/abs/2511.19575"><b>📄 Technical Report</b></a> |
+ <a href="https://github.com/Tencent-Hunyuan/HunyuanOCR"><b>🌟 Github</b></a>
+ </p>
+
+ <h2>
+ <p align="center">
+ <a href="https://arxiv.org/abs/2511.19575">HunyuanOCR</a>
+ </p>
+ </h2>
+
+
+ ## Notice
+
+ The official repo of [HunyuanOCR](https://huggingface.co/tencent/HunyuanOCR) does not support any official `transformers` release and only provides a specific commit to install from. We modified the official implementation as `remote_code` so that it works with official `transformers` releases, which means you can use [HunyuanOCR](https://huggingface.co/tencent/HunyuanOCR) with the latest version of `transformers` easily.
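
For orientation, a minimal sketch of what `remote_code` loading does here (the class name comes from this repo's `config.json` `auto_map`; treat this as illustrative rather than an official API guarantee):

```python
from transformers import AutoModel

# trust_remote_code=True makes transformers import the modeling files shipped
# in this repo (per config.json "auto_map") instead of a built-in class.
model = AutoModel.from_pretrained("lvyufeng/HunyuanOCR", trust_remote_code=True)
print(type(model).__name__)  # expected: HunYuanVLForConditionalGeneration
```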
+
+ ## 📖 Introduction
+ **HunyuanOCR** is a leading end-to-end OCR expert VLM powered by Hunyuan's native multimodal architecture. Despite a remarkably lightweight 1B-parameter design, it achieves state-of-the-art results on multiple industry benchmarks. The model handles **complex multilingual document parsing** and excels in practical applications including **text spotting, open-field information extraction, video subtitle extraction, and photo translation**.
+
+
+ ## 🚀 Quick Start with Transformers
+
+ ### Installation
+
+ #### Use PyTorch + Transformers
+
+ ```bash
+ pip install transformers==4.57.3
+ ```
+
+ #### Use MindSpore + MindNLP
+
+ ```bash
+ pip install transformers==4.57.3
+ pip install git+https://github.com/mindspore-lab/mindnlp
+ ```
+
+
+ ### Model Inference
+
+ #### MindSpore + MindNLP
+
+ ```python
+ import mindtorch
+ import mindnlp
+ from transformers import AutoProcessor, AutoModel
+ from PIL import Image
+
+ def clean_repeated_substrings(text):
+     """Clean repeated substrings in text: if it ends with ten or more
+     back-to-back copies of the same short substring, keep only one copy."""
+     n = len(text)
+     if n < 8000:
+         return text
+     for length in range(2, n // 10 + 1):
+         candidate = text[-length:]
+         count = 0
+         i = n - length
+         # Count how many consecutive copies of `candidate` end the string.
+         while i >= 0 and text[i:i + length] == candidate:
+             count += 1
+             i -= length
+         if count >= 10:
+             return text[:n - length * (count - 1)]
+     return text
+
+ model_name_or_path = "lvyufeng/HunyuanOCR"
+ processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False, trust_remote_code=True)
+ img_path = "image_ocr.jpg"
+ image_inputs = Image.open(img_path)
+ messages1 = [
+     {"role": "system", "content": ""},
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": img_path},
+             # "Detect and recognize text in the image, and output the text coordinates in a formatted manner."
+             {"type": "text", "text": "检测并识别图片中的文字,将文本坐标格式化输出。"},
+         ],
+     },
+ ]
+ messages = [messages1]
+ texts = [
+     processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
+     for msg in messages
+ ]
+
+ inputs = processor(
+     text=texts,
+     images=image_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ model = AutoModel.from_pretrained(
+     model_name_or_path,
+     attn_implementation="eager",
+     dtype=mindtorch.float16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ with mindtorch.no_grad():
+     device = next(model.parameters()).device
+     inputs = inputs.to(device)
+     generated_ids = model.generate(**inputs, max_new_tokens=16384, do_sample=False)
+     input_ids = inputs.input_ids if "input_ids" in inputs else inputs.inputs
+     # Drop the prompt tokens so only the newly generated text is decoded.
+     generated_ids_trimmed = [
+         out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
+     ]
+     # Apply the repetition cleanup per decoded string, not to the list itself.
+     output_texts = [
+         clean_repeated_substrings(text)
+         for text in processor.batch_decode(
+             generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+         )
+     ]
+     print(output_texts)
+ ```
+
+ #### PyTorch + Transformers
+
+ ```python
+ import torch
+ from transformers import AutoProcessor, AutoModel
+ from PIL import Image
+
+ def clean_repeated_substrings(text):
+     """Clean repeated substrings in text: if it ends with ten or more
+     back-to-back copies of the same short substring, keep only one copy."""
+     n = len(text)
+     if n < 8000:
+         return text
+     for length in range(2, n // 10 + 1):
+         candidate = text[-length:]
+         count = 0
+         i = n - length
+         # Count how many consecutive copies of `candidate` end the string.
+         while i >= 0 and text[i:i + length] == candidate:
+             count += 1
+             i -= length
+         if count >= 10:
+             return text[:n - length * (count - 1)]
+     return text
+
+ model_name_or_path = "lvyufeng/HunyuanOCR"
+ processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False, trust_remote_code=True)
+ img_path = "image_ocr.jpg"
+ image_inputs = Image.open(img_path)
+ messages1 = [
+     {"role": "system", "content": ""},
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": img_path},
+             # "Detect and recognize text in the image, and output the text coordinates in a formatted manner."
+             {"type": "text", "text": "检测并识别图片中的文字,将文本坐标格式化输出。"},
+         ],
+     },
+ ]
+ messages = [messages1]
+ texts = [
+     processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
+     for msg in messages
+ ]
+
+ inputs = processor(
+     text=texts,
+     images=image_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ model = AutoModel.from_pretrained(
+     model_name_or_path,
+     attn_implementation="eager",
+     dtype=torch.float16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ with torch.no_grad():
+     device = next(model.parameters()).device
+     inputs = inputs.to(device)
+     generated_ids = model.generate(**inputs, max_new_tokens=16384, do_sample=False)
+     input_ids = inputs.input_ids if "input_ids" in inputs else inputs.inputs
+     # Drop the prompt tokens so only the newly generated text is decoded.
+     generated_ids_trimmed = [
+         out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
+     ]
+     # Apply the repetition cleanup per decoded string, not to the list itself.
+     output_texts = [
+         clean_repeated_substrings(text)
+         for text in processor.batch_decode(
+             generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+         )
+     ]
+     print(output_texts)
+ ```
+
+ ## 💬 Application-oriented Prompts
+
+ | Task | English | Chinese |
+ |------|---------|---------|
+ | **Spotting** | Detect and recognize text in the image, and output the text coordinates in a formatted manner. | 检测并识别图片中的文字,将文本坐标格式化输出。 |
+ | **Parsing** | • Identify the formula in the image and represent it using LaTeX format.<br><br>• Parse the table in the image into HTML.<br><br>• Parse the chart in the image; use Mermaid format for flowcharts and Markdown for other charts.<br><br>• Extract all information from the main body of the document image and represent it in markdown format, ignoring headers and footers. Tables should be expressed in HTML format, formulas in the document should be represented using LaTeX format, and the parsing should be organized according to the reading order. | • 识别图片中的公式,用 LaTeX 格式表示。<br><br>• 把图中的表格解析为 HTML。<br><br>• 解析图中的图表,对于流程图使用 Mermaid 格式表示,其他图表使用 Markdown 格式表示。<br><br>• 提取文档图片中正文的所有信息用 markdown 格式表示,其中页眉、页脚部分忽略,表格用 html 格式表达,文档中公式用 latex 格式表示,按照阅读顺序组织进行解析。 |
+ | **Information Extraction** | • Output the value of Key.<br><br>• Extract the content of the fields: ['key1','key2', ...] from the image and return it in JSON format.<br><br>• Extract the subtitles from the image. | • 输出 Key 的值。<br><br>• 提取图片中的: ['key1','key2', ...] 的字段内容,并按照 JSON 格式返回。<br><br>• 提取图片中的字幕。 |
+ | **Translation** | First extract the text, then translate the text content into English. If it is a document, ignore the header and footer. Formulas should be represented in LaTeX format, and tables should be represented in HTML format. | 先提取文字,再将文字内容翻译为英文。若是文档,则其中页眉、页脚忽略。公式用latex格式表示,表格用html格式表示。 |
+
+
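
Each prompt above drops straight into the `messages` structure from the Quick Start; a minimal sketch using the table's parsing prompt (reusing `img_path` from the examples above, only the text entry changes):

```python
# Only the prompt string varies per task; the rest of the pipeline is identical.
prompt = "Parse the table in the image into HTML."  # any row from the table above
messages = [[
    {"role": "system", "content": ""},
    {"role": "user", "content": [
        {"type": "image", "image": img_path},
        {"type": "text", "text": prompt},
    ]},
]]
```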
+ ## 📚 Citation
+ ```
+ @misc{hunyuanvisionteam2025hunyuanocrtechnicalreport,
+       title={HunyuanOCR Technical Report},
+       author={Hunyuan Vision Team and Pengyuan Lyu and Xingyu Wan and Gengluo Li and Shangpin Peng and Weinong Wang and Liang Wu and Huawen Shen and Yu Zhou and Canhui Tang and Qi Yang and Qiming Peng and Bin Luo and Hower Yang and Xinsong Zhang and Jinnian Zhang and Houwen Peng and Hongming Yang and Senhao Xie and Longsha Zhou and Ge Pei and Binghong Wu and Kan Wu and Jieneng Yang and Bochao Wang and Kai Liu and Jianchen Zhu and Jie Jiang and Linus and Han Hu and Chengquan Zhang},
+       year={2025},
+       journal={arXiv preprint arXiv:2511.19575},
+       url={https://arxiv.org/abs/2511.19575},
+ }
+ ```
+
+ ## 🙏 Acknowledgements
+ We would like to thank [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), [MinerU](https://github.com/opendatalab/MinerU), [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR), [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR), [dots.ocr](https://github.com/rednote-hilab/dots.ocr) for their valuable models and ideas.
+ We also appreciate the benchmarks: [OmniDocBench](https://github.com/opendatalab/OmniDocBench), [OCRBench](https://github.com/Yuliang-Liu/MultimodalOCR/tree/main/OCRBench), [DoTA](https://github.com/liangyupu/DIMTDA).
+
config.json ADDED
@@ -0,0 +1,85 @@
+ {
+   "architectures": [
+     "HunYuanVLForConditionalGeneration"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_hunyuan_vl.HunYuanVLConfig",
+     "AutoModel": "modeling_hunyuan_vl.HunYuanVLForConditionalGeneration",
+     "AutoModelForSeq2SeqLM": "modeling_hunyuan_vl.HunYuanVLForConditionalGeneration"
+   },
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "attention_head_dim": 128,
+   "bos_token_id": 120000,
+   "eod_token_id": 120020,
+   "eos_token_id": 120020,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "image_start_token_id": 120118,
+   "image_end_token_id": 120119,
+   "image_token_id": 120120,
+   "image_newline_token_id": 120121,
+   "initializer_range": 0.02,
+   "intermediate_size": 3584,
+   "max_position_embeddings": 32768,
+   "mlp_bias": false,
+   "model_type": "hunyuan_vl",
+   "norm_type": "rms",
+   "num_attention_heads": 16,
+   "num_experts": 1,
+   "num_hidden_layers": 24,
+   "num_key_value_heads": 8,
+   "org_vocab_size": 120818,
+   "pad_id": 120002,
+   "pad_token_id": -1,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": {
+     "alpha": 1000.0,
+     "beta_fast": 32,
+     "beta_slow": 1,
+     "factor": 1.0,
+     "mscale": 1.0,
+     "mscale_all_dim": 1.0,
+     "type": "xdrope",
+     "xdrope_section": [
+       16,
+       16,
+       16,
+       16
+     ]
+   },
+   "rope_theta": 10000.0,
+   "routed_scaling_factor": 1.0,
+   "sep_token_id": 0,
+   "text_end_id": 8,
+   "text_start_id": 7,
+   "tie_word_embeddings": true,
+   "dtype": "bfloat16",
+   "transformers_version": "4.49.0",
+   "use_cache": true,
+   "use_qk_norm": true,
+   "use_cla": false,
+   "vision_config": {
+     "add_patchemb_bias": true,
+     "attention_dropout": 0.0,
+     "cat_extra_token": 1,
+     "hidden_act": "gelu",
+     "hidden_dropout": 0.0,
+     "hidden_size": 1152,
+     "img_max_token_num": 4096,
+     "intermediate_size": 4304,
+     "interpolate_mode": "bilinear",
+     "max_image_size": 2048,
+     "max_vit_seq_len": 16384,
+     "num_attention_heads": 16,
+     "num_channels": 3,
+     "num_hidden_layers": 27,
+     "out_hidden_size": 1024,
+     "patch_size": 16,
+     "rms_norm_eps": 1e-05,
+     "spatial_merge_size": 2
+   },
+   "vocab_size": 120818
+ }
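
A small consistency check on the vision settings above — one way to read the numbers, not documented behavior: `img_max_token_num` matches what the patching parameters imply for a maximal image.

```python
patch_size = 16        # vision_config.patch_size
merge_size = 2         # vision_config.spatial_merge_size
max_image_size = 2048  # vision_config.max_image_size

# Each merged token covers a (patch_size * merge_size)^2 pixel block, so a
# maximal 2048x2048 image yields (2048 / 32)^2 = 64^2 tokens.
tokens_per_side = max_image_size // (patch_size * merge_size)
assert tokens_per_side ** 2 == 4096  # matches vision_config.img_max_token_num
```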
configuration_hunyuan_vl.py ADDED
@@ -0,0 +1,323 @@
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+ # This file was automatically generated from src/transformers/models/hunyuan_vl/modular_hunyuan_vl.py.
+ # Do NOT edit this file manually as any edits will be overwritten by the generation of
+ # the file from the modular. If any change should be done, please apply the change to the
+ # modular_hunyuan_vl.py file directly. One of our CI enforces this.
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+ # coding=utf-8
+ # Copyright (C) 2025 THL A29 Limited, a Tencent company and the HuggingFace Inc. team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ from transformers.configuration_utils import PretrainedConfig
+
+
+ class HunYuanVLVisionConfig(PretrainedConfig):
+     model_type = "hunyuan_vl"
+     base_config_key = "vision_config"
+
+     def __init__(
+         self,
+         hidden_act="gelu",
+         hidden_size=1152,
+         intermediate_size=4304,
+         interpolate_mode="bilinear",
+         rms_norm_eps=1e-05,
+         learnable_mlp_pooling_size=0,
+         num_attention_heads=16,
+         num_key_value_heads=None,
+         num_channels=3,
+         num_hidden_layers=27,
+         out_hidden_size=4096,
+         patch_size=16,
+         remove_prenorm=True,
+         spatial_merge_size=2,
+         temporal_patch_size=1,
+         resize_resolution=2048,
+         img_max_token_num=4096,
+         max_image_size=2048,
+         video_max_image_size=768,
+         video_min_image_size=256,
+         min_image_size=512,
+         anyres_vit_max_image_size=2048,
+         max_vit_seq_len=16384,
+         text_hidden_size=3072,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+
+         self.hidden_act = hidden_act
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.interpolate_mode = interpolate_mode
+         self.learnable_mlp_pooling_size = learnable_mlp_pooling_size
+         self.num_attention_heads = num_attention_heads
+         # Fall back to MHA when the number of KV heads is not given.
+         if not num_key_value_heads:
+             self.num_key_value_heads = num_attention_heads
+         else:
+             self.num_key_value_heads = num_key_value_heads
+         self.num_channels = num_channels
+         self.num_hidden_layers = num_hidden_layers
+         self.out_hidden_size = out_hidden_size
+         self.patch_size = patch_size
+         self.remove_prenorm = remove_prenorm
+         self.spatial_merge_size = spatial_merge_size
+         self.temporal_patch_size = temporal_patch_size
+         self.rms_norm_eps = rms_norm_eps
+
+         self.resize_resolution = resize_resolution
+         self.img_max_token_num = img_max_token_num
+         self.max_image_size = max_image_size
+         self.min_image_size = min_image_size
+         self.video_max_image_size = video_max_image_size
+         self.video_min_image_size = video_min_image_size
+         self.anyres_vit_max_image_size = anyres_vit_max_image_size
+         self.max_vit_seq_len = max_vit_seq_len
+         self.text_hidden_size = text_hidden_size
+
+
+ class HunYuanVLTextConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`HunYuanVLTextConfig`]. It is used to instantiate a
+     HunYuan model according to the specified arguments, defining the model architecture. Instantiating a configuration
+     with the defaults will yield a similar configuration to that of HunYuan-7B, e.g.
+     [tencent/Hunyuan-7B-Instruct](https://huggingface.co/tencent/Hunyuan-7B-Instruct).
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 290943):
+             Vocabulary size of the HunYuan model. Defines the number of different tokens that can be represented by the
+             `inputs_ids` passed when calling [`HunYuanVLTextConfig`]
+         hidden_size (`int`, *optional*, defaults to 4096):
+             Dimension of the hidden representations.
+         intermediate_size (`int`, *optional*, defaults to 11008):
+             Dimension of the MLP representations or shared MLP representations.
+         num_hidden_layers (`int`, *optional*, defaults to 32):
+             Number of hidden layers in the Transformer decoder.
+         num_attention_heads (`int`, *optional*, defaults to 32):
+             Number of attention heads for each attention layer in the Transformer decoder.
+         num_key_value_heads (`int`, *optional*):
+             This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+             `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+             `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
+             converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+             by meanpooling all the original heads within that group. For more details, check out [this
+             paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to
+             `num_attention_heads`.
+         hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+             The non-linear activation function (function or string) in the decoder.
+         max_position_embeddings (`int`, *optional*, defaults to 2048):
+             The maximum sequence length that this model might ever be used with.
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         rms_norm_eps (`float`, *optional*, defaults to 1e-05):
+             The epsilon used by the rms normalization layers.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether or not the model should return the last key/values attentions (not used by all models). Only
+             relevant if `config.is_decoder=True`.
+         pad_token_id (`int`, *optional*, defaults to 0):
+             Padding token id.
+         bos_token_id (`int`, *optional*, defaults to 1):
+             Beginning of stream token id.
+         eos_token_id (`int`, *optional*, defaults to 2):
+             End of stream token id.
+         eod_token_id (`int`, *optional*, defaults to 3):
+             Token ID representing the end-of-document marker. Used to indicate the termination of a text sequence.
+             Example: In multi-document processing, this token helps the model distinguish between separate documents.
+         pretraining_tp (`int`, *optional*, defaults to 1):
+             Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
+             document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
+             necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
+             issue](https://github.com/pytorch/pytorch/issues/76232).
+         tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+             Whether to tie weight embeddings.
+         rope_theta (`float`, *optional*, defaults to 10000.0):
+             The base period of the RoPE embeddings.
+         rope_scaling (`Dict`, *optional*):
+             Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+             strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
+             `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+             `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+             these scaling strategies behave:
+             https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+             experimental feature, subject to breaking API changes in future versions.
+         attention_bias (`bool`, *optional*, defaults to `False`):
+             Whether to use a bias in the query, key, value and output projection layers during self-attention.
+         attention_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout ratio for the attention probabilities.
+         head_dim (`int`, *optional*, defaults to 128):
+             The attention head dimension.
+     """
+
+     model_type = "hunyuan_vl_text"
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         vocab_size=290943,
+         hidden_size=4096,
+         intermediate_size: int = 11008,
+         num_hidden_layers=32,
+         num_attention_heads=32,
+         num_key_value_heads=None,
+         hidden_act="silu",
+         max_position_embeddings=2048,
+         initializer_range=0.02,
+         rms_norm_eps=1e-5,
+         use_cache=True,
+         pad_token_id=0,
+         bos_token_id=1,
+         eos_token_id=2,
+         eod_token_id=3,
+         pretraining_tp=1,
+         tie_word_embeddings=False,
+         rope_theta=10000.0,
+         rope_scaling=None,
+         attention_bias=False,
+         attention_dropout=0.0,
+         head_dim=None,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.head_dim = head_dim
+         # for backward compatibility
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.pretraining_tp = pretraining_tp
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+         self.rope_scaling = rope_scaling
+         # self._rope_scaling_validation()  # TODO: Need validation?
+         self.attention_bias = attention_bias
+         self.attention_dropout = attention_dropout
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+
+     def _rope_scaling_validation(self):
+         """
+         Validate the `rope_scaling` configuration.
+         """
+         if self.rope_scaling is None:
+             return
+
+         if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
+             raise ValueError(
+                 "`rope_scaling` must be a dictionary with two fields, `type` and `factor` or `type` and `alpha`, "
+                 f"got {self.rope_scaling}"
+             )
+         rope_scaling_type = self.rope_scaling.get("type", None)
+         rope_scaling_factor = self.rope_scaling.get("factor", None)
+         rope_scaling_alpha = self.rope_scaling.get("alpha", None)
+         if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
+             raise ValueError(
+                 f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
+             )
+         if rope_scaling_factor is None and rope_scaling_alpha is None:
+             raise ValueError("`rope_scaling` must have either a `factor` or an `alpha` field; got neither")
+         if rope_scaling_factor is not None:
+             if not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
+                 raise ValueError(f"`rope_scaling`'s factor field must be a float > 1.0, got {rope_scaling_factor}")
+         if rope_scaling_alpha is not None:
+             if not isinstance(rope_scaling_alpha, float) or rope_scaling_alpha <= 1.0:
+                 raise ValueError(f"`rope_scaling`'s alpha field must be a float > 1.0, got {rope_scaling_alpha}")
+
+
+ class HunYuanVLConfig(PretrainedConfig):
+     model_type = "hunyuan_vl"
+     sub_configs = {"vision_config": HunYuanVLVisionConfig, "text_config": HunYuanVLTextConfig}
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         text_config=None,
+         vision_config=None,
+         im_start_id=120118,
+         im_end_id=120119,
+         image_token_id=120120,
+         im_newline_id=120121,
+         video_start_id=120122,
+         video_end_id=120123,
+         **kwargs,
+     ):
+         # We need to init super() here so that it does not reset values
+         # that are in text config to the BaseClass defaults. The Base
+         # config has many text related defaults and not all defaults are same as for `HunYuanVLTextConfig`
+         super().__init__(**kwargs)
+
+         if isinstance(vision_config, dict):
+             self.vision_config = self.sub_configs["vision_config"](**vision_config)
+         elif vision_config is None:
+             self.vision_config = self.sub_configs["vision_config"]()
+
+         if isinstance(text_config, dict):
+             self.text_config = self.sub_configs["text_config"](**text_config)
+         elif text_config is None:
+             # For BC use all kwargs to init `TextConfig`
+             self.text_config = self.sub_configs["text_config"](**kwargs)
+
+         self.image_token_id = image_token_id
+         self.im_start_id = im_start_id
+         self.im_end_id = im_end_id
+         self.im_newline_id = im_newline_id
+         self.video_start_id = video_start_id
+         self.video_end_id = video_end_id
+
+         self.vision_config.text_hidden_size = self.text_config.hidden_size
+
+         # Attention implementation to use. It sets it recursively on sub-configs so we call it again in the end
+         self._attn_implementation = kwargs.pop("attn_implementation", None)
+
+     def __setattr__(self, key, value):
+         if (
+             (text_config := super().__getattribute__("__dict__").get("text_config")) is not None
+             and key not in ["dtype", "_attn_implementation_internal"]
+             and key in text_config.__dict__
+         ):
+             setattr(text_config, key, value)
+         else:
+             super().__setattr__(key, value)
+
+     def __getattribute__(self, key):
+         if "text_config" in super().__getattribute__("__dict__") and key not in [
+             "_name_or_path",
+             "model_type",
+             "dtype",
+             "_attn_implementation_internal",
+         ]:
+             text_config = super().__getattribute__("text_config")
+             if key in text_config.__dict__:
+                 return getattr(text_config, key)
+
+         return super().__getattribute__(key)
+
+
+ __all__ = ["HunYuanVLConfig", "HunYuanVLVisionConfig", "HunYuanVLTextConfig"]
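
One behavior worth calling out from the class above: `HunYuanVLConfig.__getattribute__` falls through to `text_config`, so text hyperparameters read directly off the top-level config. A minimal sketch, assuming the file is importable from the working directory:

```python
from configuration_hunyuan_vl import HunYuanVLConfig

cfg = HunYuanVLConfig(text_config={"hidden_size": 1024, "num_hidden_layers": 24})
print(cfg.hidden_size)                     # 1024, delegated to cfg.text_config
print(cfg.vision_config.text_hidden_size)  # 1024, wired up in __init__
```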
generation_config.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "bos_token_id": 120000,
+   "pad_token_id": 120002,
+   "do_sample": true,
+   "eos_token_id": [
+     120007,
+     120020
+   ],
+   "repetition_penalty": 1.03,
+   "top_k": 1,
+   "top_p": 1.0,
+   "temperature": 0.0
+ }
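
Note the combination above: `do_sample` is `true`, but `top_k: 1` with `temperature: 0.0` is effectively greedy decoding, and the README examples simply pass `do_sample=False`. A sketch of overriding these defaults per call (call-time `generate()` kwargs take precedence over `generation_config.json`; `model` and `inputs` as in the Quick Start):

```python
generated_ids = model.generate(
    **inputs,
    do_sample=False,          # greedy, as in the README examples
    max_new_tokens=16384,
    repetition_penalty=1.03,  # keep the shipped repetition penalty
)
```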
image_processing_hunyuan_vl.py ADDED
@@ -0,0 +1,475 @@
+ """Image processor class for HunYuanVLV1."""
+
+ import math
+ from typing import Optional, Union
+
+ import numpy as np
+ import torchvision.transforms as transforms
+
+ from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
+ from transformers.image_transforms import (
+     convert_to_rgb,
+     resize,
+     to_channel_dimension_format,
+ )
+ from transformers.image_utils import (
+     OPENAI_CLIP_MEAN,
+     OPENAI_CLIP_STD,
+     ChannelDimension,
+     ImageInput,
+     PILImageResampling,
+     get_image_size,
+     infer_channel_dimension_format,
+     is_scaled_image,
+     make_flat_list_of_images,
+     make_list_of_images,
+     to_numpy_array,
+     valid_images,
+     validate_preprocess_arguments,
+ )
+ from transformers.utils import TensorType, logging
+ from transformers.video_utils import VideoInput, make_batched_videos
+
+
+ logger = logging.get_logger(__name__)
+
+
+ def smart_resize(
+     height: int, width: int, factor: int = 16, min_pixels: int = 512 * 512, max_pixels: int = 2048 * 2048
+ ):
+     """Rescales the image so that the following conditions are met:
+
+     1. Both dimensions (height and width) are divisible by 'factor'.
+
+     2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
+
+     3. The aspect ratio of the image is maintained as closely as possible.
+
+     """
+     if max(height, width) / min(height, width) > 200:
+         raise ValueError(
+             f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
+         )
+     # Round each side to the nearest multiple of `factor`.
+     h_bar = round(height / factor) * factor
+     w_bar = round(width / factor) * factor
+     if h_bar * w_bar > max_pixels:
+         # Too many pixels: shrink both sides by the same ratio, rounding down.
+         beta = math.sqrt((height * width) / max_pixels)
+         h_bar = max(factor, math.floor(height / beta / factor) * factor)
+         w_bar = max(factor, math.floor(width / beta / factor) * factor)
+     elif h_bar * w_bar < min_pixels:
+         # Too few pixels: grow both sides by the same ratio, rounding up.
+         beta = math.sqrt(min_pixels / (height * width))
+         h_bar = math.ceil(height * beta / factor) * factor
+         w_bar = math.ceil(width * beta / factor) * factor
+     return h_bar, w_bar
+
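
A quick worked example of `smart_resize` under this model's settings (the processor calls it with `factor = patch_size * merge_size = 32`; the numbers below were checked by hand):

```python
# 1000 and 700 round to the nearest multiples of 32 (992 and 704); the pixel
# count 992 * 704 = 698368 already sits inside [512*512, 2048*2048], so
# neither rescaling branch fires.
h, w = smart_resize(height=1000, width=700, factor=32)
assert (h, w) == (992, 704)
```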
65
+
66
+ class HunYuanVLImageProcessor(BaseImageProcessor):
67
+ r"""
68
+ Constructs a HunYuanVLV1 image processor that dynamically resizes images based on the original images.
69
+
70
+ Args:
71
+ do_resize (`bool`, *optional*, defaults to `True`):
72
+ Whether to resize the image's (height, width) dimensions.
73
+ size (`dict[str, int]`, *optional*, defaults to `{"shortest_edge": 56 * 56, "longest_edge": 28 * 28 * 1280}`):
74
+ Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present.
75
+ resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
76
+ Resampling filter to use when resizing the image.
77
+ do_rescale (`bool`, *optional*, defaults to `True`):
78
+ Whether to rescale the image by the specified scale `rescale_factor`.
79
+ rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
80
+ Scale factor to use if rescaling the image.
81
+ do_normalize (`bool`, *optional*, defaults to `True`):
82
+ Whether to normalize the image.
83
+ image_mean (`float` or `list[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
84
+ Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
85
+ image_std (`float` or `list[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
86
+ Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
87
+ do_convert_rgb (`bool`, *optional*, defaults to `True`):
88
+ Whether to convert the image to RGB.
89
+ min_pixels (`int`, *optional*, defaults to `512 * 512`):
90
+ The min pixels of the image to resize the image.
91
+ max_pixels (`int`, *optional*, defaults to `2048 * 2048`):
92
+ The max pixels of the image to resize the image.
93
+ patch_size (`int`, *optional*, defaults to 14):
94
+ The spatial patch size of the vision encoder.
95
+ temporal_patch_size (`int`, *optional*, defaults to 2):
96
+ The temporal patch size of the vision encoder.
97
+ merge_size (`int`, *optional*, defaults to 2):
98
+ The merge size of the vision encoder to llm encoder.
99
+ """
100
+
101
+ model_input_names = ["pixel_values", "image_grid_thw", "pixel_values_videos", "video_grid_thw"]
102
+
103
+ def __init__(
104
+ self,
105
+ do_resize: bool = True,
106
+ size: Optional[dict[str, int]] = None,
107
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
108
+ do_rescale: bool = True,
109
+ rescale_factor: Union[int, float] = 1 / 255,
110
+ do_normalize: bool = True,
111
+ image_mean: Optional[Union[float, list[float]]] = None,
112
+ image_std: Optional[Union[float, list[float]]] = None,
113
+ do_convert_rgb: bool = True,
114
+ min_pixels: Optional[int] = None,
115
+ max_pixels: Optional[int] = None,
116
+ patch_size: int = 16,
117
+ temporal_patch_size: int = 2,
118
+ merge_size: int = 2,
119
+ **kwargs,
120
+ ) -> None:
121
+ super().__init__(**kwargs)
122
+ if size is not None and ("shortest_edge" not in size or "longest_edge" not in size):
123
+ raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
124
+ else:
125
+ size = {"shortest_edge": 512*512, "longest_edge": 2048*2048}
126
+ # backward compatibility: override size with min_pixels and max_pixels if they are provided
127
+ if min_pixels is not None:
128
+ size["shortest_edge"] = min_pixels
129
+ if max_pixels is not None:
130
+ size["longest_edge"] = max_pixels
131
+ self.min_pixels = size["shortest_edge"]
132
+ self.max_pixels = size["longest_edge"]
133
+ self.size = size
134
+
135
+ self.do_resize = do_resize
136
+ self.resample = resample
137
+ self.do_rescale = do_rescale
138
+ self.rescale_factor = rescale_factor
139
+ self.do_normalize = do_normalize
140
+ self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
141
+ self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
142
+
143
+ self.patch_size = patch_size
144
+ self.temporal_patch_size = temporal_patch_size
145
+ self.merge_size = merge_size
146
+ self.do_convert_rgb = do_convert_rgb
147
+
148
+ # hard-code
149
+
150
+ def _preprocess(
151
+ self,
152
+ images: Union[ImageInput, VideoInput],
153
+ do_resize: Optional[bool] = None,
154
+ size: Optional[dict[str, int]] = None,
155
+ resample: PILImageResampling = None,
156
+ do_rescale: Optional[bool] = None,
157
+ rescale_factor: Optional[float] = None,
158
+ do_normalize: Optional[bool] = None,
159
+ image_mean: Optional[Union[float, list[float]]] = None,
160
+ image_std: Optional[Union[float, list[float]]] = None,
161
+ patch_size: Optional[int] = None,
162
+ temporal_patch_size: Optional[int] = None,
163
+ merge_size: Optional[int] = None,
164
+ do_convert_rgb: Optional[bool] = None,
165
+ data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
166
+         input_data_format: Optional[Union[str, ChannelDimension]] = None,
+     ):
+         """
+         Preprocess an image or a batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.
+
+         Args:
+             images (`ImageInput`):
+                 Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values
+                 range from 0 to 1, set `do_rescale=False`.
+             vision_info (`list[Dict]`, *optional*):
+                 Optional list of dictionaries containing additional information about vision inputs.
+             do_resize (`bool`, *optional*, defaults to `self.do_resize`):
+                 Whether to resize the image.
+             size (`dict[str, int]`, *optional*, defaults to `self.size`):
+                 Size of the image after resizing. The `shortest_edge` and `longest_edge` keys must be present.
+             resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
+                 Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
+             do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
+                 Whether to rescale the image.
+             rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
+                 Scale factor to use if rescaling the image.
+             do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
+                 Whether to normalize the image.
+             image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
+                 Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number
+                 of channels in the image.
+             image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
+                 Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding
+                 to the number of channels in the image.
+             patch_size (`int`, *optional*, defaults to `self.patch_size`):
+                 The spatial patch size of the vision encoder.
+             temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
+                 The temporal patch size of the vision encoder.
+             merge_size (`int`, *optional*, defaults to `self.merge_size`):
+                 The merge size between the vision encoder and the LLM.
+             do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
+                 Whether to convert the image to RGB.
+             data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
+                 The channel dimension format for the output image. Can be one of:
+                 - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                 - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                 - Unset: Use the channel dimension format of the input image.
+             input_data_format (`ChannelDimension` or `str`, *optional*):
+                 The channel dimension format for the input image. Can be one of:
+                 - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                 - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                 - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+         """
+         images = make_list_of_images(images)
+
+         if do_convert_rgb:
+             images = [convert_to_rgb(image) for image in images]
+
+         width, height = images[0].width, images[0].height
+         resized_width, resized_height = width, height
+         processed_images = []
+         for image in images:
+             if do_resize:
+                 resized_width, resized_height = smart_resize(
+                     width,
+                     height,
+                     factor=patch_size * merge_size,
+                     min_pixels=size["shortest_edge"],
+                     max_pixels=size["longest_edge"],
+                 )
+                 image = image.resize((resized_width, resized_height))
+
+             if do_normalize:
+                 image = transforms.Compose([
+                     transforms.ToTensor(),
+                     transforms.Normalize(image_mean, image_std),  # use the resolved arguments, not self.*
+                 ])(image)
+             processed_images.append(image)
+
+         patches = np.array(processed_images)
+         channel = patches.shape[1]
+         grid_t = patches.shape[0] // temporal_patch_size
+         grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
+         patches = patches.reshape(
+             1,
+             channel,
+             grid_h // merge_size,
+             merge_size,
+             patch_size,
+             grid_w // merge_size,
+             merge_size,
+             patch_size,
+         )
+         patches = patches.transpose(0, 2, 3, 5, 6, 1, 4, 7)
+         flatten_patches = patches.reshape(1 * grid_h * grid_w, channel * patch_size * patch_size)
+
+         return flatten_patches, (grid_t, grid_h, grid_w)
+
+     def preprocess(
+         self,
+         images: ImageInput,
+         videos: Optional[VideoInput] = None,
+         do_resize: Optional[bool] = None,
+         size: Optional[dict[str, int]] = None,
+         min_pixels: Optional[int] = None,
+         max_pixels: Optional[int] = None,
+         resample: Optional[PILImageResampling] = None,
+         do_rescale: Optional[bool] = None,
+         rescale_factor: Optional[float] = None,
+         do_normalize: Optional[bool] = None,
+         image_mean: Optional[Union[float, list[float]]] = None,
+         image_std: Optional[Union[float, list[float]]] = None,
+         patch_size: Optional[int] = None,
+         temporal_patch_size: Optional[int] = None,
+         merge_size: Optional[int] = None,
+         do_convert_rgb: Optional[bool] = None,
+         return_tensors: Optional[Union[str, TensorType]] = None,
+         data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
+         input_data_format: Optional[Union[str, ChannelDimension]] = None,
+     ):
+         """
+         Args:
+             images (`ImageInput`):
+                 Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+                 passing in images with pixel values between 0 and 1, set `do_rescale=False`.
+             videos (`VideoInput`):
+                 Video to preprocess. Expects a single or batch of videos with pixel values ranging from 0 to 255. If
+                 passing in videos with pixel values between 0 and 1, set `do_rescale=False`.
+             do_resize (`bool`, *optional*, defaults to `self.do_resize`):
+                 Whether to resize the image.
+             size (`dict[str, int]`, *optional*, defaults to `self.size`):
+                 Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
+                 the longest edge resized to keep the input aspect ratio.
+             resample (`int`, *optional*, defaults to `self.resample`):
+                 Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
+                 has an effect if `do_resize` is set to `True`.
+             do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
+                 Whether to rescale the image.
+             rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
+                 Rescale factor to rescale the image by if `do_rescale` is set to `True`.
+             do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
+                 Whether to normalize the image.
+             image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
+                 Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
+             image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
+                 Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
+                 `True`.
+             min_pixels (`int`, *optional*, defaults to `self.min_pixels`):
+                 The minimum number of pixels for the resized image.
+             max_pixels (`int`, *optional*, defaults to `self.max_pixels`):
+                 The maximum number of pixels for the resized image.
+             patch_size (`int`, *optional*, defaults to `self.patch_size`):
+                 The spatial patch size of the vision encoder.
+             temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
+                 The temporal patch size of the vision encoder.
+             merge_size (`int`, *optional*, defaults to `self.merge_size`):
+                 The merge size between the vision encoder and the LLM.
+             do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
+                 Whether to convert the image to RGB.
+             return_tensors (`str` or `TensorType`, *optional*):
+                 The type of tensors to return. Can be one of:
+                 - Unset: Return a list of `np.ndarray`.
+                 - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                 - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                 - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                 - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+             data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+                 The channel dimension format for the output image. Can be one of:
+                 - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                 - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                 - Unset: Use the channel dimension format of the input image.
+             input_data_format (`ChannelDimension` or `str`, *optional*):
+                 The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                 from the input image. Can be one of:
+                 - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                 - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                 - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+         """
+         min_pixels = min_pixels if min_pixels is not None else self.min_pixels
+         max_pixels = max_pixels if max_pixels is not None else self.max_pixels
+
+         if size is not None:
+             if "shortest_edge" not in size or "longest_edge" not in size:
+                 raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
+             min_pixels = size["shortest_edge"]
+         elif min_pixels is not None and max_pixels is not None:
+             # backward compatibility: override size with min_pixels and max_pixels if they are provided
+             size = {"shortest_edge": min_pixels, "longest_edge": max_pixels}
+         else:
+             size = {**self.size}
+
+         do_resize = do_resize if do_resize is not None else self.do_resize
+         resample = resample if resample is not None else self.resample
+         do_rescale = do_rescale if do_rescale is not None else self.do_rescale
+         rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
+         do_normalize = do_normalize if do_normalize is not None else self.do_normalize
+         image_mean = image_mean if image_mean is not None else self.image_mean
+         image_std = image_std if image_std is not None else self.image_std
+         patch_size = patch_size if patch_size is not None else self.patch_size
+         temporal_patch_size = temporal_patch_size if temporal_patch_size is not None else self.temporal_patch_size
+         merge_size = merge_size if merge_size is not None else self.merge_size
+         do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
+
+         if images is not None:
+             images = make_flat_list_of_images(images)
+
+         if images is not None and not valid_images(images):
+             raise ValueError(
+                 "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+                 "torch.Tensor, tf.Tensor or jax.ndarray."
+             )
+
+         validate_preprocess_arguments(
+             do_rescale=do_rescale,  # pass alongside rescale_factor so the rescale check actually runs
+             rescale_factor=rescale_factor,
+             do_normalize=do_normalize,
+             image_mean=image_mean,
+             image_std=image_std,
+             do_resize=do_resize,
+             size=size,
+             resample=resample,
+         )
+
+         data = {}
+         if images is not None:
+             pixel_values, vision_grid_thws = [], []
+             for image in images:
+                 patches, image_grid_thw = self._preprocess(
+                     image,
+                     do_resize=do_resize,
+                     size=size,
+                     resample=resample,
+                     do_rescale=do_rescale,
+                     rescale_factor=rescale_factor,
+                     do_normalize=do_normalize,
+                     image_mean=image_mean,
+                     image_std=image_std,
+                     patch_size=patch_size,
+                     temporal_patch_size=temporal_patch_size,
+                     merge_size=merge_size,
+                     data_format=data_format,
+                     do_convert_rgb=do_convert_rgb,
+                     input_data_format=input_data_format,
+                 )
+                 pixel_values.extend(patches)
+                 vision_grid_thws.append(image_grid_thw)
+             pixel_values = np.array(pixel_values)
+             vision_grid_thws = np.array(vision_grid_thws)
+             data.update({"pixel_values": pixel_values, "image_grid_thw": vision_grid_thws})
+
+         # kept for BC only and should be removed after v5.0
+         if videos is not None:
+             logger.warning(
+                 "`HunYuanVLV1ImageProcessor` works only with image inputs and doesn't process videos anymore. "
+                 "This is a deprecated behavior and will be removed in v5.0. "
+                 "Your videos should be forwarded to `HunYuanVLV1VideoProcessor`."
+             )
+             videos = make_batched_videos(videos)
+             pixel_values_videos, vision_grid_thws_videos = [], []
+             for images in videos:
+                 patches, video_grid_thw = self._preprocess(
+                     images,
+                     do_resize=do_resize,
+                     size=size,
+                     resample=resample,
+                     do_rescale=do_rescale,
+                     rescale_factor=rescale_factor,
+                     do_normalize=do_normalize,
+                     image_mean=image_mean,
+                     image_std=image_std,
+                     patch_size=patch_size,
+                     temporal_patch_size=temporal_patch_size,
+                     merge_size=merge_size,
+                     data_format=data_format,
+                     do_convert_rgb=do_convert_rgb,
+                     input_data_format=input_data_format,
+                 )
+                 pixel_values_videos.extend(patches)
+                 vision_grid_thws_videos.append(video_grid_thw)
+             data.update(
+                 {
+                     "pixel_values_videos": np.array(pixel_values_videos),
+                     "video_grid_thw": np.array(vision_grid_thws_videos),
+                 }
+             )
+
+         return BatchFeature(data=data, tensor_type=return_tensors)
+
+     def get_number_of_image_patches(self, height: int, width: int, images_kwargs=None):
+         """
+         A utility that returns the number of image patches for a given image size.
+
+         Args:
+             height (`int`):
+                 Height of the input image.
+             width (`int`):
+                 Width of the input image.
+             images_kwargs (`dict`, *optional*):
+                 Any kwargs to override defaults of the image processor.
+         Returns:
+             `int`: Number of image patches per image.
+         """
+         images_kwargs = images_kwargs if images_kwargs is not None else {}  # guard: the default is None
+         min_pixels = images_kwargs.get("min_pixels", self.size["shortest_edge"])
+         max_pixels = images_kwargs.get("max_pixels", self.size["longest_edge"])
+         patch_size = images_kwargs.get("patch_size", self.patch_size)
+         merge_size = images_kwargs.get("merge_size", self.merge_size)
+
+         factor = patch_size * merge_size
+         resized_height, resized_width = smart_resize(
+             height, width, factor, min_pixels=min_pixels, max_pixels=max_pixels
+         )
+         grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
+         return grid_h * (grid_w + 1) + 2
+
+
+ __all__ = ["HunYuanVLImageProcessor"]
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e7a0f4cb7fdfe4dc2686f8554310a34b4859ae464ec948f89d954318e382382d
+ size 439600816
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fbfd70bed291d7920c65aacf4f07c8ea55e60dda253a529860880c5a7e4c00bd
+ size 453346288
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:53d0b9f9a85aa21b3454f16f19845294fff7bab8e13aeaf3f7992b85fd35c473
+ size 461590008
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fd82d09583ee16037f04532808d0f00332301fc6ed18aa0b75b902fa014402aa
+ size 637958736
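Each of the four `.safetensors` entries above is a Git LFS pointer stanza: `oid` is the SHA-256 of the shard that LFS fetches on checkout, and `size` is its byte count. The `model.safetensors.index.json` that follows maps every tensor name to the shard holding it. A minimal sketch of resolving a tensor through that index, assuming the shards and the index sit in the current directory (`get_tensor` is an illustrative helper, not part of the repo):

    import json

    from safetensors.numpy import load_file

    with open("model.safetensors.index.json") as f:
        index = json.load(f)

    weight_map = index["weight_map"]  # tensor name -> shard filename

    shards = {}  # cache: shard filename -> dict of numpy arrays

    def get_tensor(name):
        shard_file = weight_map[name]
        if shard_file not in shards:
            shards[shard_file] = load_file(shard_file)  # loads one shard from disk
        return shards[shard_file][name]

    # Per the weight map, the token embeddings live in the fourth shard.
    embed = get_tensor("model.embed_tokens.weight")

In practice `transformers` performs this resolution automatically when loading the repo; the sketch only shows what the index encodes. Note that `metadata.total_size` (1,992,416,224 bytes) is slightly less than the sum of the four `size` fields above (1,992,495,848 bytes): the file sizes also include each shard's safetensors header.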
model.safetensors.index.json ADDED
@@ -0,0 +1,720 @@
+ {
+   "metadata": {
+     "total_size": 1992416224
+   },
+   "weight_map": {
+     "model.embed_tokens.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.10.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.key_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.query_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.query_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.14.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.14.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.14.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.15.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.16.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.16.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.17.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.key_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.19.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.query_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.key_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.20.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.23.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.3.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.3.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.3.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.3.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.3.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.3.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.4.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.query_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.5.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.5.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.5.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.5.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.5.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.6.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.7.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.7.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.7.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.7.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.8.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.9.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.9.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.key_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.norm.weight": "model-00004-of-00004.safetensors",
+     "vit.perceive.after_rms.weight": "model-00004-of-00004.safetensors",
+     "vit.perceive.before_rms.weight": "model-00003-of-00004.safetensors",
+     "vit.perceive.image_begin": "model-00003-of-00004.safetensors",
+     "vit.perceive.image_end": "model-00003-of-00004.safetensors",
+     "vit.perceive.image_newline": "model-00003-of-00004.safetensors",
+     "vit.perceive.image_sep": "model-00003-of-00004.safetensors",
+     "vit.perceive.mlp.bias": "model-00004-of-00004.safetensors",
+     "vit.perceive.mlp.weight": "model-00003-of-00004.safetensors",
+     "vit.perceive.proj.0.bias": "model-00004-of-00004.safetensors",
+     "vit.perceive.proj.0.weight": "model-00003-of-00004.safetensors",
+     "vit.perceive.proj.2.bias": "model-00004-of-00004.safetensors",
+     "vit.perceive.proj.2.weight": "model-00003-of-00004.safetensors",
+     "vit.embeddings.patch_embedding.bias": "model-00004-of-00004.safetensors",
+     "vit.embeddings.patch_embedding.weight": "model-00003-of-00004.safetensors",
+     "vit.embeddings.position_embedding.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.0.input_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.input_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.10.input_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.10.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.10.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.10.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.10.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.10.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.10.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.10.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.11.input_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.input_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.13.input_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.13.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.13.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.13.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.13.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.13.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.14.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.16.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.mlp.dense_4h_to_h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.18.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.mlp.dense_h_to_4h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.18.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.19.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.mlp.dense_4h_to_h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.mlp.dense_h_to_4h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.2.input_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.2.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.2.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.2.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.20.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.20.mlp.dense_4h_to_h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.mlp.dense_h_to_4h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.21.input_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.21.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.input_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.input_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.23.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.23.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.24.input_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.24.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.24.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
562
+ "vit.layers.24.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
563
+ "vit.layers.24.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
564
+ "vit.layers.24.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
565
+ "vit.layers.24.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
566
+ "vit.layers.24.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
567
+ "vit.layers.24.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
568
+ "vit.layers.24.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
569
+ "vit.layers.24.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
570
+ "vit.layers.24.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
571
+ "vit.layers.24.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
572
+ "vit.layers.24.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
573
+ "vit.layers.24.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
574
+ "vit.layers.24.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
575
+ "vit.layers.25.input_layernorm.bias": "model-00004-of-00004.safetensors",
576
+ "vit.layers.25.input_layernorm.weight": "model-00002-of-00004.safetensors",
577
+ "vit.layers.25.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
578
+ "vit.layers.25.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
579
+ "vit.layers.25.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
580
+ "vit.layers.25.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
581
+ "vit.layers.25.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
582
+ "vit.layers.25.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
583
+ "vit.layers.25.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
584
+ "vit.layers.25.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
585
+ "vit.layers.25.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
586
+ "vit.layers.25.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
587
+ "vit.layers.25.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
588
+ "vit.layers.25.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
589
+ "vit.layers.25.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
590
+ "vit.layers.25.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
591
+ "vit.layers.26.input_layernorm.bias": "model-00004-of-00004.safetensors",
592
+ "vit.layers.26.input_layernorm.weight": "model-00002-of-00004.safetensors",
593
+ "vit.layers.26.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
594
+ "vit.layers.26.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
595
+ "vit.layers.26.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
596
+ "vit.layers.26.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
597
+ "vit.layers.26.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
598
+ "vit.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
599
+ "vit.layers.26.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
600
+ "vit.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
601
+ "vit.layers.26.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
602
+ "vit.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
603
+ "vit.layers.26.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
604
+ "vit.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
605
+ "vit.layers.26.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
606
+ "vit.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
607
+ "vit.layers.3.input_layernorm.bias": "model-00001-of-00004.safetensors",
608
+ "vit.layers.3.input_layernorm.weight": "model-00002-of-00004.safetensors",
609
+ "vit.layers.3.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
610
+ "vit.layers.3.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
611
+ "vit.layers.3.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
612
+ "vit.layers.3.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
613
+ "vit.layers.3.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
614
+ "vit.layers.3.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
615
+ "vit.layers.3.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
616
+ "vit.layers.3.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
617
+ "vit.layers.3.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
618
+ "vit.layers.3.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
619
+ "vit.layers.3.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
620
+ "vit.layers.3.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
621
+ "vit.layers.3.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
622
+ "vit.layers.3.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
623
+ "vit.layers.4.input_layernorm.bias": "model-00001-of-00004.safetensors",
624
+ "vit.layers.4.input_layernorm.weight": "model-00002-of-00004.safetensors",
625
+ "vit.layers.4.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
626
+ "vit.layers.4.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
627
+ "vit.layers.4.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
628
+ "vit.layers.4.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
629
+ "vit.layers.4.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
630
+ "vit.layers.4.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
631
+ "vit.layers.4.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
632
+ "vit.layers.4.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
633
+ "vit.layers.4.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
634
+ "vit.layers.4.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
635
+ "vit.layers.4.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
636
+ "vit.layers.4.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
637
+ "vit.layers.4.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
638
+ "vit.layers.4.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
639
+ "vit.layers.5.input_layernorm.bias": "model-00001-of-00004.safetensors",
640
+ "vit.layers.5.input_layernorm.weight": "model-00002-of-00004.safetensors",
641
+ "vit.layers.5.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
642
+ "vit.layers.5.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
643
+ "vit.layers.5.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
644
+ "vit.layers.5.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
645
+ "vit.layers.5.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
646
+ "vit.layers.5.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
647
+ "vit.layers.5.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
648
+ "vit.layers.5.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
649
+ "vit.layers.5.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
650
+ "vit.layers.5.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
651
+ "vit.layers.5.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
652
+ "vit.layers.5.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
653
+ "vit.layers.5.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
654
+ "vit.layers.5.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
655
+ "vit.layers.6.input_layernorm.bias": "model-00001-of-00004.safetensors",
656
+ "vit.layers.6.input_layernorm.weight": "model-00003-of-00004.safetensors",
657
+ "vit.layers.6.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
658
+ "vit.layers.6.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
659
+ "vit.layers.6.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
660
+ "vit.layers.6.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
661
+ "vit.layers.6.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
662
+ "vit.layers.6.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
663
+ "vit.layers.6.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
664
+ "vit.layers.6.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
665
+ "vit.layers.6.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
666
+ "vit.layers.6.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
667
+ "vit.layers.6.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
668
+ "vit.layers.6.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
669
+ "vit.layers.6.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
670
+ "vit.layers.6.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
671
+ "vit.layers.7.input_layernorm.bias": "model-00002-of-00004.safetensors",
672
+ "vit.layers.7.input_layernorm.weight": "model-00003-of-00004.safetensors",
673
+ "vit.layers.7.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
674
+ "vit.layers.7.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
675
+ "vit.layers.7.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
676
+ "vit.layers.7.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
677
+ "vit.layers.7.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
678
+ "vit.layers.7.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
679
+ "vit.layers.7.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
680
+ "vit.layers.7.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
681
+ "vit.layers.7.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
682
+ "vit.layers.7.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
683
+ "vit.layers.7.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
684
+ "vit.layers.7.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
685
+ "vit.layers.7.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
686
+ "vit.layers.7.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
687
+ "vit.layers.8.input_layernorm.bias": "model-00002-of-00004.safetensors",
688
+ "vit.layers.8.input_layernorm.weight": "model-00004-of-00004.safetensors",
689
+ "vit.layers.8.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
690
+ "vit.layers.8.mlp.dense_4h_to_h.weight": "model-00004-of-00004.safetensors",
691
+ "vit.layers.8.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
692
+ "vit.layers.8.mlp.dense_h_to_4h.weight": "model-00004-of-00004.safetensors",
693
+ "vit.layers.8.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
694
+ "vit.layers.8.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
695
+ "vit.layers.8.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
696
+ "vit.layers.8.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
697
+ "vit.layers.8.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
698
+ "vit.layers.8.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
699
+ "vit.layers.8.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
700
+ "vit.layers.8.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
701
+ "vit.layers.8.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
702
+ "vit.layers.8.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
703
+ "vit.layers.9.input_layernorm.bias": "model-00002-of-00004.safetensors",
704
+ "vit.layers.9.input_layernorm.weight": "model-00004-of-00004.safetensors",
705
+ "vit.layers.9.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
706
+ "vit.layers.9.mlp.dense_4h_to_h.weight": "model-00004-of-00004.safetensors",
707
+ "vit.layers.9.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
708
+ "vit.layers.9.mlp.dense_h_to_4h.weight": "model-00004-of-00004.safetensors",
709
+ "vit.layers.9.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
710
+ "vit.layers.9.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
711
+ "vit.layers.9.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
712
+ "vit.layers.9.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
713
+ "vit.layers.9.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
714
+ "vit.layers.9.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
715
+ "vit.layers.9.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
716
+ "vit.layers.9.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
717
+ "vit.layers.9.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
718
+ "vit.layers.9.self_attn.v_proj.weight": "model-00004-of-00004.safetensors"
719
+ }
720
+ }
modeling_hunyuan_vl.py ADDED
@@ -0,0 +1,1058 @@
1
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
2
+ # This file was automatically generated from src/transformers/models/hunyuan_vl/modular_hunyuan_vl.py.
3
+ # Do NOT edit this file manually as any edits will be overwritten by the generation of
4
+ # the file from the modular. If any change should be done, please apply the change to the
5
+ # modular_hunyuan_vl.py file directly. One of our CI enforces this.
6
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
7
+ # coding=utf-8
8
+ # Copyright (C) 2025 THL A29 Limited, a Tencent company and the HuggingFace Inc. team. All rights reserved.
9
+ #
10
+ # Licensed under the Apache License, Version 2.0 (the "License");
11
+ # you may not use this file except in compliance with the License.
12
+ # You may obtain a copy of the License at
13
+ #
14
+ # http://www.apache.org/licenses/LICENSE-2.0
15
+ #
16
+ # Unless required by applicable law or agreed to in writing, software
17
+ # distributed under the License is distributed on an "AS IS" BASIS,
18
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
19
+ # See the License for the specific language governing permissions and
20
+ # limitations under the License.
21
+
22
+ from typing import Callable, Optional, Union
23
+
24
+ import torch
25
+ from torch import nn
26
+
27
+ from transformers.activations import ACT2FN
28
+ from transformers.cache_utils import Cache, DynamicCache
29
+ from transformers.generation import GenerationMixin
30
+ from transformers.integrations import use_kernel_forward_from_hub
31
+ from transformers.masking_utils import create_causal_mask
32
+ from transformers.modeling_layers import GradientCheckpointingLayer
33
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
34
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
35
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
36
+ from transformers.processing_utils import Unpack
37
+ from transformers.utils import TransformersKwargs, auto_docstring, can_return_tuple
38
+ from transformers.utils.deprecation import deprecate_kwarg
39
+ from .configuration_hunyuan_vl import HunYuanVLConfig, HunYuanVLTextConfig, HunYuanVLVisionConfig
40
+
41
+
42
+ class HunYuanVisionMLP(nn.Module):
43
+ def __init__(self, config: HunYuanVLConfig):
44
+ super().__init__()
45
+ self.config = config
46
+ self.hidden_size = config.hidden_size
47
+ self.intermediate_size = config.intermediate_size
48
+ self.act_fn = ACT2FN[config.hidden_act]
49
+ self.dense_h_to_4h = nn.Linear(self.hidden_size, self.intermediate_size, bias=True)
50
+ self.dense_4h_to_h = nn.Linear(self.intermediate_size, self.hidden_size, bias=True)
51
+
52
+ def forward(self, x):
53
+ intermediate = self.dense_h_to_4h(x)
54
+ intermediate = self.act_fn(intermediate)
55
+ output = self.dense_4h_to_h(intermediate)
56
+ return output
57
+
58
+
59
+ @use_kernel_forward_from_hub("RMSNorm")
60
+ class HunYuanVLRMSNorm(nn.Module):
61
+ def __init__(self, hidden_size, eps=1e-6):
62
+ """
63
+ HunYuanVLRMSNorm is equivalent to T5LayerNorm
64
+ """
65
+ super().__init__()
66
+ self.weight = nn.Parameter(torch.ones(hidden_size))
67
+ self.variance_epsilon = eps
68
+
69
+ def forward(self, hidden_states):
70
+ input_dtype = hidden_states.dtype
71
+ hidden_states = hidden_states.to(torch.float32)
72
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
73
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
74
+ return self.weight * hidden_states.to(input_dtype)
75
+
76
+ def extra_repr(self):
77
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
78
+
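+ # Minimal numeric sketch of the RMSNorm above (hypothetical values,
+ # hidden size 4, default weight of ones):
+ # >>> import torch
+ # >>> x = torch.tensor([1.0, 2.0, 3.0, 4.0])
+ # >>> rms = torch.sqrt(x.pow(2).mean() + 1e-6)
+ # >>> norm = HunYuanVLRMSNorm(4)
+ # >>> torch.allclose(norm(x), x / rms, atol=1e-5)
+ # True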
79
+
80
+ class HunYuanVLMLP(nn.Module):
81
+ def __init__(self, config: HunYuanVLConfig, layer_idx=None, is_shared_mlp=False):
82
+ super().__init__()
83
+ self.config = config
84
+ self.hidden_size = config.hidden_size
85
+ self.intermediate_size = config.intermediate_size
86
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
87
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
88
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
89
+ self.act_fn = ACT2FN[config.hidden_act]
90
+ self.layer_idx = layer_idx
91
+
92
+ def forward(self, x):
93
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
94
+ return down_proj
95
+
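+ # The gated MLP above computes down_proj(act(gate_proj(x)) * up_proj(x));
+ # when hidden_act is SiLU this is the SwiGLU form
+ # y = W_down(silu(W_gate x) * (W_up x)).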
96
+
97
+ class HunYuanVisionPatchEmbed(nn.Module):
98
+ def __init__(self, config: HunYuanVLVisionConfig):
99
+ super().__init__()
100
+
101
+ self.config = config
102
+ self.embed_dim = config.hidden_size
103
+ self.patch_size = config.patch_size
104
+ self.num_channels = config.num_channels
105
+ self.spatial_merge_size = config.spatial_merge_size
106
+ self.interpolate_mode = config.interpolate_mode
107
+
108
+ self.patch_embedding = nn.Conv2d(
109
+ in_channels=config.num_channels,
110
+ out_channels=self.embed_dim,
111
+ kernel_size=self.patch_size,
112
+ stride=self.patch_size,
113
+ bias=True,
114
+ )
115
+
116
+ self.max_num_patches = (config.max_image_size // self.patch_size) ** 2
117
+ self.num_positions = self.max_num_patches + 1
118
+ self.position_edge = int(self.num_positions**0.5)
119
+ # first token is cls token, skip it
120
+ self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
121
+
122
+ self.patch_pos_embed = None
123
+
124
+ def forward(self, pixel_values: torch.Tensor, grid_thw: list[list[int]]) -> torch.Tensor:
125
+ num_patches, hidden_size = pixel_values.shape
126
+ pixel_values = pixel_values.reshape(num_patches, self.num_channels, self.patch_size, self.patch_size)
127
+
128
+ patch_embeds = self.patch_embedding(pixel_values)
129
+ patch_embeds = patch_embeds.squeeze(-1).squeeze(-1).unsqueeze(0)
130
+
131
+ if self.patch_pos_embed is None:
132
+ patch_pos_shape = (1, self.position_edge, self.position_edge, self.embed_dim)
133
+ self.patch_pos_embed = (
134
+ self.position_embedding.weight[1:, :].reshape(patch_pos_shape).permute(0, 3, 1, 2).float()
135
+ )
136
+
137
+ patch_pos_embed_list = []
138
+ for grid in grid_thw:
139
+ _, h0, w0 = grid
140
+ # we add a small number to avoid floating point error in the interpolation
141
+ # see discussion at https://github.com/facebookresearch/dino/issues/8
142
+ h0, w0 = h0 + 0.1, w0 + 0.1
143
+ patch_pos_embed = nn.functional.interpolate(
144
+ self.patch_pos_embed,
145
+ scale_factor=((h0 / self.position_edge).item(), (w0 / self.position_edge).item()),
146
+ mode=self.interpolate_mode,
147
+ align_corners=False,
148
+ )
149
+
150
+ patch_pos_embed = (
151
+ patch_pos_embed.reshape(self.embed_dim, -1).transpose(0, 1).unsqueeze(0).to(patch_embeds.dtype)
152
+ )
153
+ patch_pos_embed_list.append(patch_pos_embed)
154
+
155
+ patch_pos_embed = torch.cat(patch_pos_embed_list, dim=1)
156
+ embeddings = patch_embeds + patch_pos_embed
157
+
158
+ return embeddings
159
+
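+ # Interpolation sketch: the pretrained (edge x edge) position grid is
+ # resized per image to its (h, w) patch grid. Assuming hypothetical values
+ # edge=32 and a 20x44 grid, the scale factors are (20.1/32, 44.1/32), so
+ # the interpolated grid is exactly (1, embed_dim, 20, 44) before being
+ # flattened; the 0.1 offset guards against floor() rounding the size down.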
160
+
161
+ class HunYuanVisionPatchMerger(nn.Module):
162
+ def __init__(
163
+ self,
164
+ in_channels,
165
+ out_channels,
166
+ spatial_merge_size,
167
+ rms_norm_eps,
168
+ **kwargs,
169
+ ):
170
+ super().__init__()
171
+
172
+ embed_std = out_channels**-0.5
173
+ self.spatial_merge_size = spatial_merge_size
174
+ self.proj = nn.Sequential(
175
+ nn.Conv2d(in_channels, in_channels * 2, kernel_size=spatial_merge_size, stride=spatial_merge_size),
176
+ nn.GELU(),
177
+ nn.Conv2d(in_channels * 2, in_channels * 4, kernel_size=1),
178
+ )
179
+ self.mlp = nn.Linear(in_channels * 4, out_channels)
180
+ self.image_newline = nn.Parameter(torch.randn(in_channels * 4) * embed_std)
181
+ self.image_begin = nn.Parameter(torch.randn(out_channels) * embed_std)
182
+ self.image_end = nn.Parameter(torch.randn(out_channels) * embed_std)
183
+ self.image_sep = nn.Parameter(torch.randn(out_channels) * embed_std)
184
+
185
+ self.before_rms = HunYuanVLRMSNorm(in_channels, eps=rms_norm_eps)
186
+ self.after_rms = HunYuanVLRMSNorm(out_channels, eps=rms_norm_eps)
187
+
188
+ def forward(self, x, size=(16, 16)):
189
+ x = self.before_rms(x)
190
+ h, w = size
191
+ dtype = x.dtype
192
+ x = x.permute(0, 2, 1).reshape(x.shape[0], -1, int(h.item()), int(w.item()))
193
+ x = self.proj(x) # b,c,h,w
194
+ b, c, h, w = x.shape
195
+ x = torch.cat(
196
+ [x, self.image_newline.reshape(1, c, 1, 1).expand(b, c, h, 1).to(dtype, non_blocking=True)], dim=-1
197
+ )
198
+ x = x.reshape(b, c, -1).permute(0, 2, 1)
199
+ x = self.mlp(x)
200
+
201
+ begin = self.image_begin.reshape(1, 1, -1).expand(b, 1, x.shape[-1]).to(dtype, non_blocking=True)
202
+ end = self.image_end.reshape(1, 1, -1).expand(b, 1, x.shape[-1]).to(dtype, non_blocking=True)
203
+ x = torch.cat([begin, x, end], dim=1)
204
+
205
+ return self.after_rms(x)
206
+
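+ # Shape sketch for the merger (hypothetical in_channels c, merge size m,
+ # and an h x w patch grid):
+ #   (b, h*w, c)  -> conv stack            -> (b, 4c, h/m, w/m)
+ #   + newline column per row              -> (b, 4c, h/m, w/m + 1)
+ #   flatten + linear projection           -> (b, (h/m)*(w/m + 1), out_channels)
+ #   + begin/end tokens                    -> (b, (h/m)*(w/m + 1) + 2, out_channels)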
207
+
208
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
209
+ """
210
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
211
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
212
+ """
213
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
214
+ if n_rep == 1:
215
+ return hidden_states
216
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
217
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
218
+
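+ # Shape sketch, assuming 2 KV heads shared by 8 query heads (n_rep = 4):
+ # >>> kv = torch.randn(1, 2, 5, 64)  # (batch, num_kv_heads, seq, head_dim)
+ # >>> repeat_kv(kv, 4).shape
+ # torch.Size([1, 8, 5, 64])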
219
+
220
+ def eager_attention_forward(
221
+ module: nn.Module,
222
+ query: torch.Tensor,
223
+ key: torch.Tensor,
224
+ value: torch.Tensor,
225
+ attention_mask: Optional[torch.Tensor],
226
+ scaling: float,
227
+ dropout: float = 0.0,
228
+ **kwargs: Unpack[TransformersKwargs],
229
+ ):
230
+ key_states = repeat_kv(key, module.num_key_value_groups)
231
+ value_states = repeat_kv(value, module.num_key_value_groups)
232
+
233
+ attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
234
+ if attention_mask is not None:
235
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
236
+ attn_weights = attn_weights + causal_mask
237
+
238
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
239
+ attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
240
+ attn_output = torch.matmul(attn_weights, value_states)
241
+ attn_output = attn_output.transpose(1, 2).contiguous()
242
+
243
+ return attn_output, attn_weights
244
+
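+ # Standalone sketch of the eager path (hypothetical sizes; `module` only
+ # needs the two attributes read above):
+ # >>> import types
+ # >>> mod = types.SimpleNamespace(num_key_value_groups=4, training=False)
+ # >>> q = torch.randn(1, 8, 5, 64)
+ # >>> k, v = torch.randn(1, 2, 5, 64), torch.randn(1, 2, 5, 64)
+ # >>> out, w = eager_attention_forward(mod, q, k, v, None, scaling=64**-0.5)
+ # >>> out.shape, w.shape
+ # (torch.Size([1, 5, 8, 64]), torch.Size([1, 8, 5, 5]))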
245
+
246
+ class HunYuanVisionAttention(nn.Module):
247
+ def __init__(self, config: HunYuanVLConfig):
248
+ super().__init__()
249
+ self.config = config
250
+ self.is_causal = False # used in flash_attention
251
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
252
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
253
+ self.scaling = self.head_dim**-0.5
254
+ self.attention_dropout = config.attention_dropout
255
+ self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=True)
256
+ self.k_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=True)
257
+ self.v_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=True)
258
+ self.o_proj = nn.Linear(config.num_attention_heads * self.head_dim, config.hidden_size, bias=True)
259
+
260
+ def forward(
261
+ self,
262
+ hidden_states: torch.Tensor,
263
+ position_embeddings: tuple[torch.Tensor, torch.Tensor],
264
+ position_ids: Optional[torch.LongTensor] = None,
265
+ attention_mask: Optional[torch.Tensor] = None,
266
+ past_key_values: Optional[Cache] = None,
267
+ cache_position: Optional[torch.LongTensor] = None,
268
+ **kwargs: Unpack[TransformersKwargs],
269
+ ) -> tuple[torch.Tensor, torch.Tensor]:
270
+ input_shape = hidden_states.shape[:-1]
271
+ hidden_shape = (*input_shape, -1, self.head_dim)
272
+
273
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
274
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
275
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
276
+
277
+ attention_interface: Callable = eager_attention_forward
278
+ if self.config._attn_implementation != "eager":
279
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
280
+
281
+ attn_output, attn_weights = attention_interface(
282
+ self,
283
+ query_states,
284
+ key_states,
285
+ value_states,
286
+ attention_mask,
287
+ dropout=0.0 if not self.training else self.attention_dropout,
288
+ scaling=self.scaling,
289
+ **kwargs,
290
+ )
291
+
292
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
293
+ attn_output = self.o_proj(attn_output)
294
+ return attn_output, attn_weights
295
+
296
+
297
+ class HunYuanVisionBlock(GradientCheckpointingLayer):
298
+ def __init__(self, config: HunYuanVLVisionConfig):
299
+ super().__init__()
300
+ self.hidden_size = config.hidden_size
301
+ self.self_attn = HunYuanVisionAttention(config)
302
+ self.mlp = HunYuanVisionMLP(config)
303
+ self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
304
+ self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
305
+
306
+ def forward(
307
+ self,
308
+ hidden_states: torch.Tensor,
309
+ attention_mask: Optional[torch.Tensor] = None,
310
+ position_ids: Optional[torch.LongTensor] = None,
311
+ past_key_values: Optional[Cache] = None,
312
+ use_cache: Optional[bool] = False,
313
+ cache_position: Optional[torch.LongTensor] = None,
314
+ position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
315
+ **kwargs: Unpack[TransformersKwargs],
316
+ ) -> torch.Tensor:
317
+ residual = hidden_states
318
+ hidden_states = self.input_layernorm(hidden_states)
319
+ # Self Attention
320
+ hidden_states, _ = self.self_attn(
321
+ hidden_states=hidden_states,
322
+ attention_mask=attention_mask,
323
+ position_ids=position_ids,
324
+ past_key_values=past_key_values,
325
+ use_cache=use_cache,
326
+ cache_position=cache_position,
327
+ position_embeddings=position_embeddings,
328
+ **kwargs,
329
+ )
330
+ hidden_states = residual + hidden_states
331
+
332
+ # Fully Connected
333
+ residual = hidden_states
334
+ hidden_states = self.post_attention_layernorm(hidden_states)
335
+ hidden_states = self.mlp(hidden_states)
336
+ hidden_states = residual + hidden_states
337
+ return hidden_states
338
+
339
+
340
+ class HunYuanVisionTransformer(nn.Module):
341
+ config: HunYuanVLVisionConfig
342
+ _no_split_modules = ["HunYuanVisionBlock"]
343
+
344
+ def __init__(self, config: HunYuanVLVisionConfig):
345
+ super().__init__()
346
+ self.config = config
347
+ self.embeddings = HunYuanVisionPatchEmbed(config)
348
+ self.layers = nn.ModuleList([HunYuanVisionBlock(config) for _ in range(config.num_hidden_layers)])
349
+ self.perceive = HunYuanVisionPatchMerger(
350
+ self.config.hidden_size,
351
+ self.config.text_hidden_size,
352
+ self.config.spatial_merge_size,
353
+ self.config.rms_norm_eps,
354
+ )
355
+
356
+ def get_activation_function(self, act_name: str):
357
+ act_map = {
358
+ "gelu": nn.GELU(),
359
+ "relu": nn.ReLU(),
360
+ "silu": nn.SiLU(),
361
+ }
362
+ return act_map.get(act_name.lower(), nn.GELU()) # default GELU
363
+
364
+ # @auto_docstring
365
+ def forward(
366
+ self,
367
+ x: torch.Tensor,
368
+ grid_thw: list[list[int]],
369
+ ) -> torch.Tensor:
371
+ r"""
372
+ grid_thw (`torch.LongTensor` of shape `(num_images, 3)`):
373
+ The temporal, height and width dimensions of feature shape for each image. Each row contains [t, h, w] values.
374
+ """
375
+ hidden_states = self.embeddings(x, grid_thw)
376
+ for layer in self.layers:
377
+ hidden_states = layer(hidden_states)
378
+
379
+ cu_seqlens: list = [0]
380
+ for t, h, w in grid_thw:
381
+ cu_seqlens.append((h * w).item())
382
+
383
+ cu_seqlens = torch.tensor(cu_seqlens, dtype=torch.int32)
384
+ cu_seqlens = torch.cumsum(cu_seqlens, dim=0, dtype=torch.int32)
385
+ split_lengths = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
386
+ split_items = torch.split(hidden_states, split_lengths, dim=1)
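+ # e.g. for grid_thw [[1, 20, 44], [1, 16, 16]] (hypothetical), cu_seqlens
+ # becomes [0, 880, 1136] and split_lengths [880, 256], one chunk per image.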
387
+
388
+ processed_items = []
389
+ for grid, item in zip(grid_thw, split_items):
390
+ t, h, w = grid
391
+ processed = self.perceive(item, size=(h, w))
392
+ processed_items.append(processed)
393
+
394
+ hidden_states = torch.cat(processed_items, dim=1)
395
+
396
+ return hidden_states
397
+
398
+
399
+ class HunYuanVLRotaryEmbedding(nn.Module):
400
+ inv_freq: torch.Tensor # fix linting for `register_buffer`
401
+
402
+ def __init__(self, config: HunYuanVLConfig, device=None):
403
+ super().__init__()
404
+ # BC: "rope_type" was originally "type"
405
+ if hasattr(config, "rope_scaling") and isinstance(config.rope_scaling, dict):
406
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
407
+ else:
408
+ self.rope_type = "default"
409
+ self.max_seq_len_cached = config.max_position_embeddings
410
+ self.original_max_seq_len = config.max_position_embeddings
411
+
412
+ self.config = config
413
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type if self.rope_type != "xdrope" else "dynamic"]
414
+ if self.rope_type in ["xdrope", "dynamic"] and config.rope_scaling["alpha"]:
415
+ # DynamicNTKAlphaRotary
416
+ self.dim = config.head_dim
417
+ base = config.rope_theta * config.rope_scaling.get("alpha") ** (self.dim / (self.dim - 2))
418
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
419
+ self.attention_scaling = 1.0
420
+ else:
421
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
422
+
423
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
424
+ self.original_inv_freq = self.inv_freq
425
+ self._set_cos_sin_cache(
426
+ seq_len=config.max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
427
+ )
428
+
429
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
430
+ self.max_seq_len_cached = seq_len
431
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
432
+ freqs = torch.outer(t, self.inv_freq)
433
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
434
+ emb = torch.cat((freqs, freqs), dim=-1).float()
435
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
436
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
437
+
438
+ def forward(self, x, seq_len: Optional[int] = None):
439
+ # x: [bs, num_attention_heads, seq_len, head_size]
440
+ if seq_len is not None and seq_len > self.max_seq_len_cached:
441
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
442
+
443
+ return (
444
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
445
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
446
+ )
447
+
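+ # Worked example of the NTK-alpha scaling above, assuming hypothetical
+ # values rope_theta=10000, alpha=1000, head_dim=128:
+ # >>> dim, theta, alpha = 128, 10000.0, 1000.0
+ # >>> base = theta * alpha ** (dim / (dim - 2))
+ # >>> base > theta  # alpha > 1 enlarges the base, slowing the rotation
+ # True
+ # The enlarged base lowers every inv_freq term, which is what extends the
+ # usable context length without retraining.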
448
+
449
+ def rotate_half(x):
450
+ """Rotates half the hidden dims of the input."""
451
+ x1 = x[..., : x.shape[-1] // 2]
452
+ x2 = x[..., x.shape[-1] // 2 :]
453
+ return torch.cat((-x2, x1), dim=-1)
454
+
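+ # Worked example for rotate_half on a hypothetical 4-dim vector:
+ # >>> rotate_half(torch.tensor([1.0, 2.0, 3.0, 4.0]))
+ # tensor([-3., -4.,  1.,  2.])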
455
+
456
+ def apply_rotary_pos_emb_xdrope(q, k, cos, sin, position_ids, xdrope_section, output_size=None):
457
+ """Applies XD Rotary Position Embedding to the query and key tensors.
458
+
459
+ Args:
460
+ q (`torch.Tensor`): The query tensor.
461
+ k (`torch.Tensor`): The key tensor.
462
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
463
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
464
+ position_ids (`torch.Tensor`): The position IDs for the tokens.
465
+ xdrope_section (`list`): The section ratios for XD RoPE.
466
+ output_size (`tuple`, optional): The output size of the tensors. Defaults to None.
467
+ bf16 (bool, optional): Whether to use bfloat16 precision. Defaults to False.
468
+
469
+ Returns:
470
+ `tuple(torch.Tensor)`: The query and key tensors rotated using the XD Rotary Position Embedding.
471
+ """
472
+ x_dim = len(xdrope_section)
473
+ cos = cos[position_ids, ...].permute(0, 2, 1, 3).reshape(output_size[0], output_size[2], x_dim, -1).contiguous()
474
+ sin = sin[position_ids, ...].permute(0, 2, 1, 3).reshape(output_size[0], output_size[2], x_dim, -1).contiguous()
475
+
476
+ xdrope_section = xdrope_section * 2
477
+
478
+ # for xd concat
479
+ assert sum(xdrope_section) == cos.shape[-1], "Illegal partition for xd rope"
480
+ cos = torch.cat([m[:, :, i % x_dim, :] for i, m in enumerate(cos.split(xdrope_section, dim=-1))], dim=-1)
481
+ sin = torch.cat([m[:, :, i % x_dim, :] for i, m in enumerate(sin.split(xdrope_section, dim=-1))], dim=-1)
482
+
483
+ # for head repeat
484
+ cos = cos.view(output_size[0], 1, output_size[2], -1) # .repeat(1, output_size[1], 1, 1)
485
+ sin = sin.view(output_size[0], 1, output_size[2], -1) # .repeat(1, output_size[1], 1, 1)
486
+
487
+ origin_dtype = q.dtype
488
+ q, k = q.float(), k.float()
489
+ cos, sin = cos.float(), sin.float()
490
+ q_out, k_out = (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
491
+
492
+ return q_out.to(origin_dtype), k_out.to(origin_dtype)
493
+
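+ # Partition sketch: with a hypothetical xdrope_section of [16, 24, 24]
+ # (x_dim = 3) and head_dim = 128, the doubled list [16, 24, 24, 16, 24, 24]
+ # sums to cos.shape[-1] = 128, and chunk i draws its cos/sin values from
+ # positional axis i % 3, interleaving the three position dimensions.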
494
+
495
+ def apply_rotary_pos_emb(
496
+ q: torch.Tensor,
497
+ k: torch.Tensor,
498
+ cos: torch.Tensor,
499
+ sin: torch.Tensor,
500
+ position_ids: Optional[torch.Tensor] = None,
501
+ unsqueeze_dim: int = 1,
502
+ ):
503
+ """Applies Rotary Position Embedding to the query and key tensors.
504
+
505
+ Args:
506
+ q (`torch.Tensor`): The query tensor.
507
+ k (`torch.Tensor`): The key tensor.
508
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
509
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
510
+ position_ids (`torch.Tensor`, *optional*):
511
+ Deprecated and unused.
512
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
513
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
514
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
515
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
516
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
517
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
518
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
519
+ Returns:
520
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
521
+ """
522
+ if position_ids is not None:
523
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
524
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
525
+ else:
526
+ cos = cos.unsqueeze(0).unsqueeze(unsqueeze_dim)
527
+ sin = sin.unsqueeze(0).unsqueeze(unsqueeze_dim)
528
+ q_embed = (q * cos) + (rotate_half(q) * sin)
529
+ k_embed = (k * cos) + (rotate_half(k) * sin)
530
+ return q_embed, k_embed
531
+
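+ # Usage sketch (hypothetical shapes; `emb` is assumed to be a
+ # HunYuanVLRotaryEmbedding built from a text config with head_dim=64):
+ # >>> q, k = torch.randn(1, 8, 5, 64), torch.randn(1, 2, 5, 64)
+ # >>> cos, sin = emb(k, seq_len=5)  # each of shape (5, 64)
+ # >>> q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)
+ # >>> q_rot.shape, k_rot.shape
+ # (torch.Size([1, 8, 5, 64]), torch.Size([1, 2, 5, 64]))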
532
+
533
+ class HunYuanVLAttention(nn.Module):
534
+ def __init__(self, config, layer_idx: int):
535
+ super().__init__()
536
+ self.config = config
537
+ self.layer_idx = layer_idx
538
+ self.is_causal = True # used in flash_attention
539
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
540
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
541
+ self.scaling = self.head_dim**-0.5
542
+ self.attention_dropout = config.attention_dropout
543
+ self.q_proj = nn.Linear(
544
+ config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
545
+ )
546
+ self.k_proj = nn.Linear(
547
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
548
+ )
549
+ self.v_proj = nn.Linear(
550
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
551
+ )
552
+ self.o_proj = nn.Linear(
553
+ config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
554
+ )
555
+
556
+ self.query_layernorm = HunYuanVLRMSNorm(self.head_dim, eps=config.rms_norm_eps)
557
+ self.key_layernorm = HunYuanVLRMSNorm(self.head_dim, eps=config.rms_norm_eps)
558
+
559
+ self.rotary_emb = HunYuanVLRotaryEmbedding(config=config)
560
+ self.xdrope_section = config.rope_scaling["xdrope_section"]
561
+
562
+ @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
563
+ def forward(
564
+ self,
565
+ hidden_states: torch.Tensor,
566
+ position_embeddings: tuple[torch.Tensor, torch.Tensor],
567
+ position_ids: Optional[torch.LongTensor] = None,
568
+ attention_mask: Optional[torch.Tensor] = None,
569
+ past_key_values: Optional[Cache] = None,
570
+ cache_position: Optional[torch.LongTensor] = None,
571
+ **kwargs: Unpack[TransformersKwargs],
572
+ ) -> tuple[torch.Tensor, torch.Tensor]:
573
+ input_shape = hidden_states.shape[:-1]
574
+ hidden_shape = (*input_shape, -1, self.head_dim)
575
+
576
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
577
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
578
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
579
+
580
+ kv_seq_len = key_states.shape[-2]
581
+ origin_kv_seq_len = key_states.shape[-2]
582
+ if past_key_values is not None:
583
+ kv_seq_len += past_key_values.get_seq_length(self.layer_idx)
584
+
585
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
586
+ if self.xdrope_section is not None:
587
+ if past_key_values is None or past_key_values.get_seq_length() == 0:
588
+ output_size = (
589
+ query_states.size(0),
590
+ query_states.size(1),
591
+ query_states.size(2),
592
+ key_states.size(2),
593
+ )
594
+ query_states, key_states = apply_rotary_pos_emb_xdrope(
595
+ query_states, key_states, cos, sin, position_ids, self.xdrope_section, output_size
596
+ )
597
+ else:
598
+ position_ids = (
599
+ torch.ones(position_ids.shape[0], 1, dtype=torch.long, device=position_ids.device)
600
+ * past_key_values.get_seq_length()
601
+ )
602
+ cos, sin = cos[-origin_kv_seq_len:, :], sin[-origin_kv_seq_len:, :]
603
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
604
+ else:
605
+ position_ids = torch.ones(
606
+ position_ids.shape[0], 1, dtype=torch.long, device=position_ids.device
607
+ ) * past_key_values.get_seq_length(self.layer_idx)
608
+ cos, sin = cos[-origin_kv_seq_len:, :], sin[-origin_kv_seq_len:, :]
609
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
610
+
611
+ query_states = self.query_layernorm(query_states)
612
+ key_states = self.key_layernorm(key_states)
613
+
614
+ if past_key_values is not None:
615
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
616
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
617
+ key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
618
+
619
+ attention_interface: Callable = eager_attention_forward
620
+ if self.config._attn_implementation != "eager":
621
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
622
+
623
+ attn_output, attn_weights = attention_interface(
624
+ self,
625
+ query_states,
626
+ key_states,
627
+ value_states,
628
+ attention_mask,
629
+ dropout=0.0 if not self.training else self.attention_dropout,
630
+ scaling=self.scaling,
631
+ **kwargs,
632
+ )
633
+
634
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
635
+ attn_output = self.o_proj(attn_output)
636
+ return attn_output, attn_weights
637
+
638
+
639
+ class HunYuanVLDecoderLayer(GradientCheckpointingLayer):
640
+ def __init__(self, config: Union[HunYuanVLVisionConfig, HunYuanVLTextConfig], layer_idx: int):
641
+ super().__init__()
642
+ self.hidden_size = config.hidden_size
643
+
644
+ self.self_attn = HunYuanVLAttention(config=config, layer_idx=layer_idx)
645
+
646
+ self.mlp = HunYuanVLMLP(config)
649
+ self.layer_idx = layer_idx
650
+ if config.norm_type == "hf_rms" or config.norm_type == "rms":
651
+ self.input_layernorm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
652
+ self.post_attention_layernorm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
653
+ elif config.norm_type == "fused" or config.norm_type == "torch_nn":
654
+ self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
655
+ self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
656
+ else:
657
+ raise NotImplementedError(f"norm_type '{config.norm_type}' is not supported")
658
+
659
+ @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
660
+ def forward(
661
+ self,
662
+ hidden_states: torch.Tensor,
663
+ attention_mask: Optional[torch.Tensor] = None,
664
+ position_ids: Optional[torch.LongTensor] = None,
665
+ past_key_values: Optional[Cache] = None,
666
+ use_cache: Optional[bool] = False,
667
+ cache_position: Optional[torch.LongTensor] = None,
668
+ position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
669
+ **kwargs: Unpack[TransformersKwargs],
670
+ ) -> torch.Tensor:
671
+ residual = hidden_states
672
+ hidden_states = self.input_layernorm(hidden_states)
673
+ # Self Attention
674
+ hidden_states, _ = self.self_attn(
675
+ hidden_states=hidden_states,
676
+ attention_mask=attention_mask,
677
+ position_ids=position_ids,
678
+ past_key_values=past_key_values,
679
+ use_cache=use_cache,
680
+ cache_position=cache_position,
681
+ position_embeddings=position_embeddings,
682
+ **kwargs,
683
+ )
684
+ hidden_states = residual + hidden_states
685
+
686
+ # Fully Connected
687
+ residual = hidden_states
688
+ hidden_states = self.post_attention_layernorm(hidden_states)
689
+ hidden_states = self.mlp(hidden_states)
690
+ hidden_states = residual + hidden_states
691
+ return hidden_states
692
+
693
+
694
+ @auto_docstring
695
+ class HunYuanVLPreTrainedModel(PreTrainedModel):
696
+ config: HunYuanVLConfig
697
+ base_model_prefix = "model"
698
+ supports_gradient_checkpointing = True
699
+ _no_split_modules = ["HunYuanVLDecoderLayer"]
700
+ _skip_keys_device_placement = ["past_key_values"]
701
+ _supports_flash_attn = True
702
+ _supports_sdpa = True
703
+ _supports_flex_attn = True
704
+
705
+ _can_compile_fullgraph = True
706
+ _supports_attention_backend = True
707
+ _can_record_outputs = {
708
+ "hidden_states": HunYuanVLDecoderLayer,
709
+ "attentions": HunYuanVLAttention,
710
+ }
711
+
712
+ def _init_weights(self, module):
713
+ std = self.config.initializer_range
714
+ if isinstance(module, nn.Linear):
715
+ module.weight.data.normal_(mean=0.0, std=std)
716
+ if module.bias is not None:
717
+ module.bias.data.zero_()
718
+ elif isinstance(module, nn.Embedding):
719
+ module.weight.data.normal_(mean=0.0, std=std)
720
+ if module.padding_idx is not None:
721
+ module.weight.data[module.padding_idx].zero_()
722
+
723
+
724
+ @auto_docstring
725
+ class HunYuanVLModel(HunYuanVLPreTrainedModel):
726
+ def __init__(self, config: Union[HunYuanVLConfig, HunYuanVLTextConfig]):
727
+ super().__init__(config)
728
+ self.padding_idx = config.pad_token_id
729
+ self.vocab_size = config.vocab_size
730
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
731
+ self.layers = nn.ModuleList(
732
+ [HunYuanVLDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
733
+ )
734
+ self.norm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
735
+ self.gradient_checkpointing = False
736
+ self.post_init()
737
+
738
+ # @auto_docstring # TODO Fix this
739
+ def forward(
740
+ self,
741
+ input_ids: Optional[torch.LongTensor] = None,
742
+ attention_mask: Optional[torch.Tensor] = None,
743
+ position_ids: Optional[torch.LongTensor] = None,
744
+ past_key_values: Optional[Cache] = None,
745
+ inputs_embeds: Optional[torch.FloatTensor] = None,
746
+ cache_position: Optional[torch.LongTensor] = None,
747
+ use_cache: Optional[bool] = None,
748
+ **kwargs: Unpack[TransformersKwargs],
749
+ ) -> BaseModelOutputWithPast:
750
+ if (input_ids is None) ^ (inputs_embeds is not None):
751
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
752
+
753
+ if inputs_embeds is None:
754
+ inputs_embeds: torch.Tensor = self.embed_tokens(input_ids)
755
+
756
+ if use_cache and past_key_values is None:
757
+ past_key_values = DynamicCache(config=self.config)
758
+
759
+ if cache_position is None:
760
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
761
+ cache_position: torch.Tensor = torch.arange(
762
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
763
+ )
764
+
765
+ if position_ids is None:
766
+ position_ids = cache_position.unsqueeze(0)
767
+
768
+ causal_mask = create_causal_mask(
769
+ config=self.config,
770
+ input_embeds=inputs_embeds,
771
+ attention_mask=attention_mask,
772
+ cache_position=cache_position,
773
+ past_key_values=past_key_values,
774
+ position_ids=position_ids,
775
+ )
776
+ hidden_states = inputs_embeds
777
+ for decoder_layer in self.layers[: self.config.num_hidden_layers]:
778
+ hidden_states = decoder_layer(
779
+ hidden_states,
780
+ attention_mask=causal_mask,
781
+ position_ids=position_ids,
782
+ past_key_values=past_key_values,
783
+ cache_position=cache_position,
784
+ **kwargs,
785
+ )
786
+
787
+ hidden_states = self.norm(hidden_states)
788
+ return BaseModelOutputWithPast(
789
+ last_hidden_state=hidden_states,
790
+ past_key_values=past_key_values,
791
+ )
792
+
793
+
794
+ @auto_docstring
795
+ class HunYuanVLForCausalLM(HunYuanVLPreTrainedModel, GenerationMixin):
796
+ _tied_weights_keys = ["lm_head.weight"]
797
+ _tp_plan = {"lm_head": "colwise_rep"}
798
+ _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
799
+
800
+ def __init__(self, config):
801
+ super().__init__(config)
802
+ self.model = HunYuanVLModel(config)
803
+ self.vocab_size = config.vocab_size
804
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
805
+
806
+ # Initialize weights and apply final processing
807
+ self.post_init()
808
+
809
+ @can_return_tuple
810
+ @auto_docstring
811
+ def forward(
812
+ self,
813
+ input_ids: Optional[torch.LongTensor] = None,
814
+ attention_mask: Optional[torch.Tensor] = None,
815
+ position_ids: Optional[torch.LongTensor] = None,
816
+ past_key_values: Optional[Cache] = None,
817
+ inputs_embeds: Optional[torch.FloatTensor] = None,
818
+ labels: Optional[torch.LongTensor] = None,
819
+ use_cache: Optional[bool] = None,
820
+ cache_position: Optional[torch.LongTensor] = None,
821
+ logits_to_keep: Union[int, torch.Tensor] = 0,
822
+ **kwargs: Unpack[TransformersKwargs],
823
+ ) -> CausalLMOutputWithPast:
824
+ r"""
825
+ Example:
826
+
827
+ ```python
828
+ >>> from transformers import AutoTokenizer, HunYuanVLForCausalLM
829
+
830
+ >>> model = HunYuanVLForCausalLM.from_pretrained("tencent/HunyuanOCR")
832
+ >>> tokenizer = AutoTokenizer.from_pretrained("tencent/HunyuanOCR")
832
+
833
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
834
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
835
+
836
+ >>> # Generate
837
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
838
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
839
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
840
+ ```"""
841
+ outputs: BaseModelOutputWithPast = self.model(
842
+ input_ids=input_ids,
843
+ attention_mask=attention_mask,
844
+ position_ids=position_ids,
845
+ past_key_values=past_key_values,
846
+ inputs_embeds=inputs_embeds,
847
+ use_cache=use_cache,
848
+ cache_position=cache_position,
849
+ **kwargs,
850
+ )
851
+
852
+ hidden_states = outputs.last_hidden_state
853
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
854
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
855
+ logits = self.lm_head(hidden_states[:, slice_indices, :])
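+ # e.g. logits_to_keep=1 during generation yields slice(-1, None), so only
+ # the final position is projected through lm_head.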
856
+
857
+ loss = None
858
+ if labels is not None:
859
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
860
+
861
+ return CausalLMOutputWithPast(
862
+ loss=loss,
863
+ logits=logits,
864
+ past_key_values=outputs.past_key_values,
865
+ hidden_states=outputs.hidden_states,
866
+ attentions=outputs.attentions,
867
+ )
868
+
869
+
870
+ class HunYuanVLForConditionalGeneration(HunYuanVLPreTrainedModel, GenerationMixin):
871
+ _tied_weights_keys = ["lm_head.weight"]
872
+ config: HunYuanVLConfig
873
+
874
+ def __init__(self, config: HunYuanVLConfig):
875
+ super().__init__(config)
876
+ self.model = HunYuanVLModel(config)
877
+ self.vocab_size = config.vocab_size
878
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
879
+ self.vit = HunYuanVisionTransformer(config.vision_config)
880
+ self.config = config
881
+ self.post_init()
882
+
883
+ def set_decoder(self, decoder):
884
+ self.model = decoder
885
+
886
+ def get_decoder(self):
887
+ return self.model
888
+
889
+ @can_return_tuple
890
+ @auto_docstring
891
+ def forward(
892
+ self,
893
+ input_ids: Optional[torch.LongTensor] = None,
894
+ attention_mask: Optional[torch.Tensor] = None,
895
+ position_ids: Optional[torch.LongTensor] = None,
896
+ past_key_values: Optional[Cache] = None,
897
+ inputs_embeds: Optional[torch.FloatTensor] = None,
898
+ labels: Optional[torch.LongTensor] = None,
899
+ use_cache: Optional[bool] = None,
900
+ cache_position: Optional[torch.LongTensor] = None,
901
+ logits_to_keep: Union[int, torch.Tensor] = 0,
902
+ **kwargs: Unpack[TransformersKwargs],
903
+ ) -> CausalLMOutputWithPast:
904
+ r"""
905
+ Example:
906
+
907
+ ```python
908
+ >>> from transformers import AutoProcessor, HunYuanVLForConditionalGeneration
909
+ >>> from PIL import Image
910
+ >>> import torch
911
+
912
+ >>> model_name_or_path = "tencent/HunyuanOCR"
913
+ >>> processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False)
914
+ >>> model = HunYuanVLForConditionalGeneration.from_pretrained(
915
+ ... model_name_or_path,
916
+ ... attn_implementation="eager",
917
+ ... torch_dtype=torch.bfloat16,
918
+ ... device_map="auto",
919
+ ... )
920
+
921
+ >>> img_path = "path/to/your/image.jpg"
922
+ >>> image = Image.open(img_path).convert("RGB")
923
+
924
+ >>> messages = [
925
+ ... {
926
+ ... "role": "user",
927
+ ... "content": [
928
+ ... {"type": "image", "image": img_path},
929
+ ... {"type": "text", "text": "Extract the text from the image."},
930
+ ... ],
931
+ ... }
932
+ ... ]
933
+ >>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
934
+ >>> inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to(model.device)
935
+
936
+ >>> with torch.no_grad():
937
+ ... generated_ids = model.generate(**inputs, max_new_tokens=1024)
938
+ >>> generated_ids_trimmed = generated_ids[0][len(inputs["input_ids"][0]):]
939
+ >>> output = processor.decode(generated_ids_trimmed, skip_special_tokens=True)
940
+
941
+ >>> print(output)
942
+
943
+ ```"""
944
+ outputs: BaseModelOutputWithPast = self.model(
945
+ input_ids=input_ids,
946
+ attention_mask=attention_mask,
947
+ position_ids=position_ids,
948
+ past_key_values=past_key_values,
949
+ inputs_embeds=inputs_embeds,
950
+ use_cache=use_cache,
951
+ cache_position=cache_position,
952
+ **kwargs,
953
+ )
954
+
955
+ hidden_states = outputs.last_hidden_state
956
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
957
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
958
+ logits = self.lm_head(hidden_states[:, slice_indices, :])
959
+
960
+ loss = None
961
+ if labels is not None:
962
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
963
+
964
+ return CausalLMOutputWithPast(
965
+ loss=loss,
966
+ logits=logits,
967
+ past_key_values=outputs.past_key_values,
968
+ hidden_states=outputs.hidden_states,
969
+ attentions=outputs.attentions,
970
+ )
971
+
972
+ # def prepare_inputs_for_generation(
973
+ # self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
974
+ # ):
975
+ # inputs = super().prepare_inputs_for_generation(
976
+ # input_ids,
977
+ # past_key_values=past_key_values,
978
+ # attention_mask=attention_mask,
979
+ # inputs_embeds=inputs_embeds,
980
+ # **kwargs,
981
+ # )
982
+ # return inputs
983
+
984
+ @torch.no_grad()
985
+ def generate(
986
+ self,
987
+ input_ids: Optional[torch.Tensor] = None,
988
+ attention_mask: Optional[torch.Tensor] = None,
989
+ position_ids: Optional[torch.LongTensor] = None,
990
+ imgs: Optional[list[torch.FloatTensor]] = None,
991
+ imgs_pos: Optional[list[int]] = None,
992
+ token_type_ids: Optional[torch.LongTensor] = None,
993
+ pixel_values: Optional[torch.FloatTensor] = None,
994
+ image_grid_thw: Optional[list[int]] = None,
995
+ **kwargs,
996
+ ) -> torch.LongTensor:
997
+ if "inputs_embeds" in kwargs:
998
+ raise NotImplementedError("`inputs_embeds` is not supported")
999
+
1000
+ inputs_embeds = self.model.embed_tokens(input_ids)
1001
+
1002
+ if self.vit is not None and pixel_values is not None:
1003
+ pixel_values = pixel_values.to(torch.bfloat16)
1004
+ image_embeds = self.vit(pixel_values, image_grid_thw)
1005
+
1006
+ # The ViT may sit on different GPUs than the language model because of accelerate's auto device mapping.
1007
+ image_embeds = image_embeds.to(input_ids.device, non_blocking=True)
1008
+
1009
+ image_mask, _ = self.get_placeholder_mask(
1010
+ input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds
1011
+ )
1012
+ inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
1013
+
1014
+ return super().generate(
1015
+ inputs=input_ids,
1016
+ position_ids=position_ids,
1017
+ attention_mask=attention_mask,
1018
+ inputs_embeds=inputs_embeds,
1019
+ # eos_token_id=self.config.eod_token_id,
1020
+ **kwargs,
1021
+ )
1022
+
1023
+ # Copied from transformers.models.llava.modeling_llava.LlavaModel.get_placeholder_mask
1024
+ def get_placeholder_mask(
1025
+ self,
1026
+ input_ids: torch.LongTensor,
1027
+ inputs_embeds: torch.FloatTensor,
1028
+ image_features: Optional[torch.FloatTensor] = None,
1029
+ ):
1030
+ """
1031
+ Obtains multimodal placeholder mask from `input_ids` or `inputs_embeds`, and checks that the placeholder token count is
1032
+ equal to the length of multimodal features. If the lengths are different, an error is raised.
1033
+ """
1034
+ if input_ids is None:
1035
+ special_image_mask = inputs_embeds == self.get_input_embeddings()(
1036
+ torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
1037
+ )
1038
+ special_image_mask = special_image_mask.all(-1)
1039
+ else:
1040
+ special_image_mask = input_ids == self.config.image_token_id
1041
+
1042
+ n_image_tokens = special_image_mask.sum()
1043
+ special_image_mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
1044
+ if image_features is not None and inputs_embeds[special_image_mask].numel() != image_features.numel():
1045
+ raise ValueError(
1046
+ f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {image_features.shape[0]}"
1047
+ )
1048
+
1049
+ return special_image_mask, None
1050
+
1051
+
1052
+ __all__ = [
1053
+ "HunYuanVLForConditionalGeneration",
1054
+ "HunYuanVLForCausalLM",
1055
+ "HunYuanVLModel",
1056
+ "HunYuanVLPreTrainedModel",
1057
+ "HunYuanVLTextModel",
1058
+ ]
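The `logits_to_keep` handling in the `forward` methods above projects only a trailing slice of the hidden states through `lm_head`, avoiding a full `[batch, seq, vocab]` logits tensor during generation. A minimal sketch of the slicing semantics (toy tensors; the sizes here are illustrative, not the model's):

```python
import torch

hidden = torch.randn(2, 10, 8)                    # [batch, seq, hidden]
lm_head = torch.nn.Linear(8, 32, bias=False)      # toy vocab projection

logits_to_keep = 1                                # int -> keep the last N positions
slice_indices = slice(-logits_to_keep, None)
assert lm_head(hidden[:, slice_indices, :]).shape == (2, 1, 32)

keep = torch.tensor([3, 7])                       # tensor -> keep explicit positions
assert lm_head(hidden[:, keep, :]).shape == (2, 2, 32)
```

Since `slice(-0, None)` equals `slice(0, None)`, the default `logits_to_keep=0` computes logits for every position.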
modular_hunyuan_vl.py ADDED
@@ -0,0 +1,1042 @@
1
+ # coding=utf-8
2
+ # Copyright (C) 2025 THL A29 Limited, a Tencent company and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """PyTorch HunYuanVL model."""
16
+
17
+ from typing import Callable, Optional, Tuple, Union, List, Dict
18
+
19
+ import torch
20
+ import torch.utils.checkpoint
21
+ from torch import nn
22
+
23
+
24
+ from transformers.activations import ACT2FN
+ from transformers.configuration_utils import PretrainedConfig
25
+ from transformers.cache_utils import Cache, DynamicCache
26
+ from transformers.generation import GenerationMixin
27
+ from transformers.masking_utils import create_causal_mask
28
+ from transformers.modeling_layers import GradientCheckpointingLayer
29
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
30
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
31
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
32
+ from transformers.processing_utils import Unpack
33
+ from transformers.utils import (
34
+ TransformersKwargs,
35
+ auto_docstring,
36
+ can_return_tuple,
37
+ logging,
38
+ )
39
+ from transformers.utils.deprecation import deprecate_kwarg
40
+ from transformers.utils.generic import check_model_inputs
41
+
42
+ from transformers.models.hunyuan_v1_dense.configuration_hunyuan_v1_dense import HunYuanDenseV1Config
43
+ from transformers.models.hunyuan_v1_dense.modeling_hunyuan_v1_dense import (
44
+ HunYuanDenseV1Attention,
45
+ HunYuanDenseV1DecoderLayer,
46
+ HunYuanDenseV1MLP,
47
+ HunYuanDenseV1Model,
48
+ HunYuanDenseV1PreTrainedModel,
49
+ HunYuanDenseV1RMSNorm,
50
+ HunYuanDenseV1RotaryEmbedding,
51
+ HunYuanDenseV1ForCausalLM
52
+ )
53
+
54
+ from transformers.models.llama.modeling_llama import (
55
+ LlamaAttention,
56
+ LlamaDecoderLayer,
57
+ LlamaForCausalLM,
58
+ LlamaForSequenceClassification,
59
+ LlamaMLP,
60
+ LlamaModel,
61
+ LlamaPreTrainedModel,
62
+ LlamaRMSNorm,
63
+ rotate_half,
64
+ repeat_kv,
65
+ eager_attention_forward
66
+ )
67
+
68
+
69
+ import json
70
+ import types
71
+ import math
73
+ from torch import Tensor, nn
74
+ import torch.nn.functional as F
76
+ from contextlib import contextmanager
77
+ from transformers.modeling_attn_mask_utils import (
78
+ _prepare_4d_causal_attention_mask_for_sdpa,
80
+ _prepare_4d_causal_attention_mask,
81
+ )
82
+ from transformers.modeling_outputs import BaseModelOutputWithPooling
83
+
84
+ logger = logging.get_logger(__name__)
85
+
86
+
87
+ class HunYuanVLVisionConfig(PretrainedConfig):
88
+ model_type = "hunyuan_vl"
89
+ base_config_key = "vision_config"
90
+
91
+ def __init__(
92
+ self,
93
+ hidden_act='gelu',
94
+ hidden_size=1152,
95
+ intermediate_size=4304,
96
+ interpolate_mode='bilinear',
97
+ rms_norm_eps=1e-05,
98
+ learnable_mlp_pooling_size=0,
99
+ num_attention_heads=16,
100
+ num_key_value_heads=None,
101
+ num_channels=3,
102
+ num_hidden_layers=27,
103
+ out_hidden_size=4096,
104
+ patch_size=16,
105
+ remove_prenorm=True,
106
+ spatial_merge_size=2,
107
+ temporal_patch_size=1,
108
+ resize_resolution=2048,
109
+ img_max_token_num=4096,
110
+ max_image_size=2048,
111
+ video_max_image_size=768,
112
+ video_min_image_size=256,
113
+ min_image_size=512,
114
+ anyres_vit_max_image_size=2048,
115
+ max_vit_seq_len=16384,
116
+ text_hidden_size=3072,
117
+ **kwargs,
118
+ ):
119
+ super().__init__(**kwargs)
120
+
121
+ self.hidden_act = hidden_act
122
+ self.hidden_size = hidden_size
123
+ self.intermediate_size = intermediate_size
124
+ self.interpolate_mode = interpolate_mode
125
+ self.learnable_mlp_pooling_size = learnable_mlp_pooling_size
126
+ self.num_attention_heads = num_attention_heads
127
+ if not num_key_value_heads:
128
+ self.num_key_value_heads = num_attention_heads
129
+ else:
130
+ self.num_key_value_heads = num_key_value_heads
131
+ self.num_channels = num_channels
132
+ self.num_hidden_layers = num_hidden_layers
133
+ self.out_hidden_size = out_hidden_size
134
+ self.patch_size = patch_size
135
+ self.remove_prenorm = remove_prenorm
136
+ self.spatial_merge_size = spatial_merge_size
137
+ self.temporal_patch_size = temporal_patch_size
138
+ self.rms_norm_eps = rms_norm_eps
139
+
140
+ self.resize_resolution = resize_resolution
141
+ self.img_max_token_num = img_max_token_num
142
+ self.max_image_size = max_image_size
143
+ self.min_image_size = min_image_size
144
+ self.video_max_image_size = video_max_image_size
145
+ self.video_min_image_size = video_min_image_size
146
+ self.anyres_vit_max_image_size = anyres_vit_max_image_size
147
+ self.max_vit_seq_len = max_vit_seq_len
148
+ self.text_hidden_size = text_hidden_size
149
+
150
+
151
+ class HunYuanVLTextConfig(HunYuanDenseV1Config):
152
+ model_type = "hunyuan_vl_text"
153
+ keys_to_ignore_at_inference = ["past_key_values"]
154
+
155
+
156
+ class HunYuanVLConfig(PretrainedConfig):
157
+ model_type = "hunyuan_vl"
158
+ sub_configs = {"vision_config": HunYuanVLVisionConfig, "text_config": HunYuanVLTextConfig}
159
+ keys_to_ignore_at_inference = ["past_key_values"]
160
+
161
+ def __init__(
162
+ self,
163
+ text_config=None,
164
+ vision_config=None,
165
+ im_start_id=120118,
166
+ im_end_id=120119,
167
+ image_token_id=120120,
168
+ im_newline_id=120121,
169
+ video_start_id=120122,
170
+ video_end_id=120123,
171
+ **kwargs,
172
+ ):
173
+ # We need to init super() here so that it does not reset values
174
+ # that are in text config to the BaseClass defaults. The Base
175
+ # config has many text related defaults and not all defaults are same as for `HunYuanVLTextConfig`
176
+ super().__init__(**kwargs)
177
+
178
+ if isinstance(vision_config, dict):
179
+ self.vision_config = self.sub_configs["vision_config"](**vision_config)
180
+ elif vision_config is None:
181
+ self.vision_config = self.sub_configs["vision_config"]()
182
+
183
+ if isinstance(text_config, dict):
184
+ self.text_config = self.sub_configs["text_config"](**text_config)
185
+ elif text_config is None:
186
+ # For BC use all kwargs to init `TextConfig`
187
+ self.text_config = self.sub_configs["text_config"](**kwargs)
188
+
189
+ self.image_token_id = image_token_id
190
+ self.im_start_id = im_start_id
191
+ self.im_end_id = im_end_id
192
+ self.im_newline_id = im_newline_id
193
+ self.video_start_id = video_start_id
194
+ self.video_end_id = video_end_id
195
+
196
+ self.vision_config.text_hidden_size = self.text_config.hidden_size
197
+
198
+ # Attention implementation to use. It sets it recursively on sub-configs so we call it again in the end
199
+ self._attn_implementation = kwargs.pop("attn_implementation", None)
200
+
201
+ def __setattr__(self, key, value):
202
+ if (
203
+ (text_config := super().__getattribute__("__dict__").get("text_config")) is not None
204
+ and key not in ["dtype", "_attn_implementation_internal"]
205
+ and key in text_config.__dict__
206
+ ):
207
+ setattr(text_config, key, value)
208
+ else:
209
+ super().__setattr__(key, value)
210
+
211
+ def __getattribute__(self, key):
212
+ if "text_config" in super().__getattribute__("__dict__") and key not in [
213
+ "_name_or_path",
214
+ "model_type",
215
+ "dtype",
216
+ "_attn_implementation_internal",
217
+ ]:
218
+ text_config = super().__getattribute__("text_config")
219
+ if key in text_config.__dict__:
220
+ return getattr(text_config, key)
221
+
222
+ return super().__getattribute__(key)
223
+
224
+
225
+ class HunYuanVisionMLP(nn.Module):
226
+ def __init__(self, config: HunYuanVLVisionConfig):
227
+ super().__init__()
228
+ self.config = config
229
+ self.hidden_size = config.hidden_size
230
+ self.intermediate_size = config.intermediate_size
231
+ self.act_fn = ACT2FN[config.hidden_act]
232
+ self.dense_h_to_4h = nn.Linear(self.hidden_size, self.intermediate_size, bias=True)
233
+ self.dense_4h_to_h = nn.Linear(self.intermediate_size, self.hidden_size, bias=True)
234
+
235
+ def forward(self, x):
236
+ intermediate = self.dense_h_to_4h(x)
237
+ intermediate = self.act_fn(intermediate)
238
+ output = self.dense_4h_to_h(intermediate)
239
+ return output
240
+
241
+
242
+ class HunYuanVLRMSNorm(LlamaRMSNorm):
243
+ pass
244
+
245
+ class HunYuanVLMLP(HunYuanDenseV1MLP):
246
+ pass
247
+
248
+ class HunYuanVisionPatchEmbed(nn.Module):
249
+ def __init__(self, config: HunYuanVLVisionConfig):
250
+ super().__init__()
251
+
252
+ self.config = config
253
+ self.embed_dim = config.hidden_size
254
+ self.patch_size = config.patch_size
255
+ self.num_channels = config.num_channels
256
+ self.spatial_merge_size = config.spatial_merge_size
257
+ self.interpolate_mode = config.interpolate_mode
258
+
259
+ self.patch_embedding = nn.Conv2d(
260
+ in_channels=config.num_channels,
261
+ out_channels=self.embed_dim,
262
+ kernel_size=self.patch_size,
263
+ stride=self.patch_size,
264
+ bias=True,
265
+ )
266
+
267
+ self.max_num_patches = (config.max_image_size // self.patch_size) ** 2
268
+ self.num_positions = self.max_num_patches + 1
269
+ self.position_edge = int(self.num_positions ** 0.5)
270
+ # first token is cls token, skip it
271
+ self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
272
+
273
+ self.patch_pos_embed = None
274
+
275
+ def forward(self, pixel_values: torch.Tensor, grid_thw: list[list[int]]) -> torch.Tensor:
276
+ num_patches, hidden_size = pixel_values.shape
277
+ pixel_values = pixel_values.reshape(num_patches, self.num_channels, self.patch_size, self.patch_size)
278
+
279
+ patch_embeds = self.patch_embedding(pixel_values)
280
+ patch_embeds = patch_embeds.squeeze(-1).squeeze(-1).unsqueeze(0)
281
+
282
+ if self.patch_pos_embed is None:
283
+ patch_pos_shape = (1, self.position_edge, self.position_edge, self.embed_dim)
284
+ self.patch_pos_embed = (
285
+ self.position_embedding.weight[1:, :].reshape(patch_pos_shape).permute(0, 3, 1, 2).float()
286
+ )
287
+
288
+ patch_pos_embed_list = []
289
+ for grid in grid_thw:
290
+ _, h0, w0 = grid
291
+ # we add a small number to avoid floating point error in the interpolation
292
+ # see discussion at https://github.com/facebookresearch/dino/issues/8
293
+ h0, w0 = h0 + 0.1, w0 + 0.1
294
+ patch_pos_embed = nn.functional.interpolate(
295
+ self.patch_pos_embed,
296
+ scale_factor=((h0 / self.position_edge).item(), (w0 / self.position_edge).item()),
297
+ mode=self.interpolate_mode,
298
+ align_corners=False,
299
+ )
300
+
301
+ patch_pos_embed = (
302
+ patch_pos_embed.reshape(self.embed_dim, -1).transpose(0, 1).unsqueeze(0).to(patch_embeds.dtype)
303
+ )
304
+ patch_pos_embed_list.append(patch_pos_embed)
305
+
306
+ patch_pos_embed = torch.cat(patch_pos_embed_list, dim=1)
307
+ embeddings = patch_embeds + patch_pos_embed
308
+
309
+ return embeddings
310
+
311
+
312
+ class HunYuanVisionPatchMerger(nn.Module):
313
+ def __init__(
314
+ self,
315
+ in_channels,
316
+ out_channels,
317
+ spatial_merge_size,
318
+ rms_norm_eps,
319
+ **kwargs,
320
+ ):
321
+ super().__init__()
322
+
323
+ embed_std = out_channels ** -0.5
324
+ self.spatial_merge_size = spatial_merge_size
325
+ self.proj = nn.Sequential(
326
+ nn.Conv2d(in_channels, in_channels * 2, kernel_size=spatial_merge_size, stride=spatial_merge_size),
327
+ nn.GELU(),
328
+ nn.Conv2d(in_channels * 2, in_channels * 4, kernel_size=1),
329
+ )
330
+ self.mlp = nn.Linear(in_channels * 4, out_channels)
331
+ self.image_newline = nn.Parameter(torch.randn(in_channels * 4) * embed_std)
332
+ self.image_begin = nn.Parameter(torch.randn(out_channels) * embed_std)
333
+ self.image_end = nn.Parameter(torch.randn(out_channels) * embed_std)
334
+ self.image_sep = nn.Parameter(torch.randn(out_channels) * embed_std)
335
+
336
+ self.before_rms = HunYuanVLRMSNorm(in_channels, eps=rms_norm_eps)
337
+ self.after_rms = HunYuanVLRMSNorm(out_channels, eps=rms_norm_eps)
338
+
339
+ def forward(self, x, size=(16, 16)):
340
+ x = self.before_rms(x)
341
+ h, w = size
342
+ dtype = x.dtype
343
+ x = x.permute(0, 2, 1).reshape(x.shape[0], -1, int(h.item()), int(w.item()))
344
+ x = self.proj(x) # b,c,h,w
345
+ b, c, h, w = x.shape
346
+ x = torch.cat(
347
+ [x, self.image_newline.reshape(1, c, 1, 1).expand(b, c, h, 1).to(dtype, non_blocking=True)], dim=-1
348
+ )
349
+ x = x.reshape(b, c, -1).permute(0, 2, 1)
350
+ x = self.mlp(x)
351
+
352
+ begin = self.image_begin.reshape(1, 1, -1).expand(b, 1, x.shape[-1]).to(dtype, non_blocking=True)
353
+ end = self.image_end.reshape(1, 1, -1).expand(b, 1, x.shape[-1]).to(dtype, non_blocking=True)
354
+ x = torch.cat([begin, x, end], dim=1)
355
+
356
+ return self.after_rms(x)
357
+
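A shape walkthrough of the merger above, using standalone tensor ops that mirror the module (toy channel sizes; the real model maps 1152 ViT channels to the 3072-wide text embedding):

```python
import torch
from torch import nn

b, c_in, h, w, c_out = 1, 8, 4, 6, 16
x = torch.randn(b, h * w, c_in)                          # ViT tokens for one image

x = x.permute(0, 2, 1).reshape(b, c_in, h, w)            # back onto the 2-D grid
proj = nn.Sequential(
    nn.Conv2d(c_in, c_in * 2, kernel_size=2, stride=2),  # 2x2 spatial merge
    nn.GELU(),
    nn.Conv2d(c_in * 2, c_in * 4, kernel_size=1),
)
x = proj(x)                                              # [b, 4*c_in, h//2, w//2]
newline = torch.randn(c_in * 4).reshape(1, -1, 1, 1).expand(b, c_in * 4, h // 2, 1)
x = torch.cat([x, newline], dim=-1)                      # newline column per row
x = nn.Linear(c_in * 4, c_out)(x.reshape(b, c_in * 4, -1).permute(0, 2, 1))
begin, end = torch.randn(b, 1, c_out), torch.randn(b, 1, c_out)
x = torch.cat([begin, x, end], dim=1)                    # begin/end sentinels
assert x.shape == (b, (h // 2) * (w // 2 + 1) + 2, c_out)
```

The final count, `patch_h * (patch_w + 1) + 2`, is exactly what the processor reserves per image (see `processing_hunyuan_vl.py` below).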
358
+
359
+ class HunYuanVisionAttention(nn.Module):
360
+ def __init__(self, config: HunYuanVLVisionConfig):
361
+ super().__init__()
362
+ self.config = config
363
+ self.is_causal = False # used in flash_attention
364
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
365
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
366
+ self.scaling = self.head_dim**-0.5
367
+ self.attention_dropout = config.attention_dropout
368
+ self.q_proj = nn.Linear(
369
+ config.hidden_size, config.num_attention_heads * self.head_dim, bias=True
370
+ )
371
+ self.k_proj = nn.Linear(
372
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=True
373
+ )
374
+ self.v_proj = nn.Linear(
375
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=True
376
+ )
377
+ self.o_proj = nn.Linear(
378
+ config.num_attention_heads * self.head_dim, config.hidden_size, bias=True
379
+ )
380
+
381
+ def forward(
382
+ self,
383
+ hidden_states: torch.Tensor,
384
+ position_embeddings: tuple[torch.Tensor, torch.Tensor],
385
+ position_ids: Optional[torch.LongTensor] = None,
386
+ attention_mask: Optional[torch.Tensor] = None,
387
+ past_key_values: Optional[Cache] = None,
388
+ cache_position: Optional[torch.LongTensor] = None,
389
+ **kwargs: Unpack[TransformersKwargs],
390
+ ) -> tuple[torch.Tensor, torch.Tensor]:
391
+ input_shape = hidden_states.shape[:-1]
392
+ hidden_shape = (*input_shape, -1, self.head_dim)
393
+
394
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
395
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
396
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
397
+
398
+ attention_interface: Callable = eager_attention_forward
399
+ if self.config._attn_implementation != "eager":
400
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
401
+
402
+ attn_output, attn_weights = attention_interface(
403
+ self,
404
+ query_states,
405
+ key_states,
406
+ value_states,
407
+ attention_mask,
408
+ dropout=0.0 if not self.training else self.attention_dropout,
409
+ scaling=self.scaling,
410
+ **kwargs,
411
+ )
412
+
413
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
414
+ attn_output = self.o_proj(attn_output)
415
+ return attn_output, attn_weights
416
+
417
+
418
+ class HunYuanVisionBlock(GradientCheckpointingLayer):
419
+ def __init__(self, config: HunYuanVLVisionConfig):
420
+ super().__init__()
421
+ self.hidden_size = config.hidden_size
422
+ self.self_attn = HunYuanVisionAttention(config)
423
+ self.mlp = HunYuanVisionMLP(config)
424
+ self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
425
+ self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
426
+
427
+ def forward(
428
+ self,
429
+ hidden_states: torch.Tensor,
430
+ attention_mask: Optional[torch.Tensor] = None,
431
+ position_ids: Optional[torch.LongTensor] = None,
432
+ past_key_values: Optional[Cache] = None,
433
+ use_cache: Optional[bool] = False,
434
+ cache_position: Optional[torch.LongTensor] = None,
435
+ position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
436
+ **kwargs: Unpack[TransformersKwargs],
437
+ ) -> torch.Tensor:
438
+ residual = hidden_states
439
+ hidden_states = self.input_layernorm(hidden_states)
440
+ # Self Attention
441
+ hidden_states, _ = self.self_attn(
442
+ hidden_states=hidden_states,
443
+ attention_mask=attention_mask,
444
+ position_ids=position_ids,
445
+ past_key_values=past_key_values,
446
+ use_cache=use_cache,
447
+ cache_position=cache_position,
448
+ position_embeddings=position_embeddings,
449
+ **kwargs,
450
+ )
451
+ hidden_states = residual + hidden_states
452
+
453
+ # Fully Connected
454
+ residual = hidden_states
455
+ hidden_states = self.post_attention_layernorm(hidden_states)
456
+ hidden_states = self.mlp(hidden_states)
457
+ hidden_states = residual + hidden_states
458
+ return hidden_states
459
+
460
+
461
+ class HunYuanVisionTransformer(nn.Module):
462
+ config: HunYuanVLVisionConfig
463
+ _no_split_modules = ["HunYuanVLVisionBlock"]
464
+
465
+ def __init__(self, config: HunYuanVLVisionConfig):
466
+ super().__init__()
467
+ self.config = config
468
+ self.embeddings = HunYuanVisionPatchEmbed(config)
469
+ self.layers = nn.ModuleList(
470
+ [HunYuanVisionBlock(config) for _ in range(config.num_hidden_layers)]
471
+ )
472
+ self.perceive = HunYuanVisionPatchMerger(
473
+ self.config.hidden_size,
474
+ self.config.text_hidden_size,
475
+ self.config.spatial_merge_size,
476
+ self.config.rms_norm_eps,
477
+ )
478
+
479
+ def get_activation_function(self, act_name: str):
480
+ act_map = {
481
+ "gelu": nn.GELU(),
482
+ "relu": nn.ReLU(),
483
+ "silu": nn.SiLU(),
484
+ }
485
+ return act_map.get(act_name.lower(), nn.GELU()) # default GELU
486
+
487
+ # @auto_docstring
488
+ def forward(
489
+ self,
490
+ x: torch.Tensor,
491
+ grid_thw: list[list[int]],
492
+ ) -> torch.Tensor:
494
+ r"""
495
+ grid_thw (`torch.LongTensor` of shape `(num_images, 3)`):
496
+ The temporal, height and width dimensions of feature shape for each image. Each row contains [t, h, w] values.
497
+ """
498
+ hidden_states = self.embeddings(x, grid_thw)
499
+ for layer in self.layers:
500
+ hidden_states = layer(hidden_states)
501
+
502
+ cu_seqlens: list = [0]
503
+ for t, h, w in grid_thw:
504
+ cu_seqlens.append((h * w).item())
505
+
506
+ cu_seqlens = torch.tensor(cu_seqlens, dtype=torch.int32)
507
+ cu_seqlens = torch.cumsum(cu_seqlens, dim=0, dtype=torch.int32)
508
+ split_lengths = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
509
+ split_items = torch.split(hidden_states, split_lengths, dim=1)
510
+
511
+ processed_items = []
512
+ for grid, item in zip(grid_thw, split_items):
513
+ t, h, w = grid
514
+ processed = self.perceive(item, size=(h, w))
515
+ processed_items.append(processed)
516
+
517
+ hidden_states = torch.cat(processed_items, dim=1)
518
+
519
+ return hidden_states
520
+
521
+
522
+ def apply_rotary_pos_emb_xdrope(q, k, cos, sin, position_ids, xdrope_section, output_size=None):
523
+ """Applies XD Rotary Position Embedding to the query and key tensors.
524
+
525
+ Args:
526
+ q (`torch.Tensor`): The query tensor.
527
+ k (`torch.Tensor`): The key tensor.
528
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
529
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
530
+ position_ids (`torch.Tensor`): The position IDs for the tokens.
531
+ xdrope_section (`list`): The section ratios for XD RoPE.
532
+ output_size (`tuple`, optional): The output size of the tensors. Defaults to None.
533
+ bf16 (bool, optional): Whether to use bfloat16 precision. Defaults to False.
534
+
535
+ Returns:
536
+ `tuple(torch.Tensor)`: The query and key tensors rotated using the XD Rotary Position Embedding.
537
+ """
538
+ x_dim = len(xdrope_section)
539
+ cos = cos[position_ids, ...].permute(0, 2, 1, 3).reshape(output_size[0], output_size[2], x_dim, -1).contiguous()
540
+ sin = sin[position_ids, ...].permute(0, 2, 1, 3).reshape(output_size[0], output_size[2], x_dim, -1).contiguous()
541
+
542
+ xdrope_section = xdrope_section * 2
543
+
544
+ # for xd concat
545
+ assert sum(xdrope_section) == cos.shape[-1], "Illegal partition for xd rope"
546
+ cos = torch.cat([m[:, :, i % x_dim, :] for i, m in enumerate(cos.split(xdrope_section, dim=-1))], dim=-1)
547
+ sin = torch.cat([m[:, :, i % x_dim, :] for i, m in enumerate(sin.split(xdrope_section, dim=-1))], dim=-1)
548
+
549
+ # for head repeat
550
+ cos = cos.view(output_size[0], 1, output_size[2], -1) # .repeat(1, output_size[1], 1, 1)
551
+ sin = sin.view(output_size[0], 1, output_size[2], -1) # .repeat(1, output_size[1], 1, 1)
552
+
553
+ origin_dtype = q.dtype
554
+ q, k = q.float(), k.float()
555
+ cos, sin = cos.float(), sin.float()
556
+ q_out, k_out = (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
557
+
558
+ return q_out.to(origin_dtype), k_out.to(origin_dtype)
559
+
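In the XD-RoPE above, each of the `x_dim` position streams (sequence index, image column, image row, time) rotates its own group of head channels: `cos`/`sin` are split along the channel axis by `xdrope_section` (doubled, because the caches hold the frequencies concatenated twice), and chunk `i` takes the stream `i % x_dim`. A toy illustration of the channel partition, with assumed sizes rather than the model's real section split:

```python
import torch

x_dim = 4
xdrope_section = [2, 2, 2, 2]            # per-half split; head_dim = 2 * sum = 16
seq, head_dim = 5, 16
cos = torch.randn(seq, x_dim, head_dim)  # one cos row per position stream

sections = xdrope_section * 2            # caches store (freqs, freqs) concatenated
chunks = cos.split(sections, dim=-1)     # 8 chunks of 2 channels each
mixed = torch.cat([m[:, i % x_dim, :] for i, m in enumerate(chunks)], dim=-1)
assert mixed.shape == (seq, head_dim)    # channels 0-1 follow stream 0, 2-3 stream 1, ...
```

This is why the function asserts `sum(xdrope_section) == cos.shape[-1]` after the doubling: the sections must tile the head dimension exactly.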
560
+
561
+ def apply_rotary_pos_emb(
562
+ q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, position_ids: Optional[torch.Tensor]=None, unsqueeze_dim: int=1):
563
+ """Applies Rotary Position Embedding to the query and key tensors.
564
+
565
+ Args:
566
+ q (`torch.Tensor`): The query tensor.
567
+ k (`torch.Tensor`): The key tensor.
568
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
569
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
570
+ position_ids (`torch.Tensor`, *optional*):
571
+ Optional position indices used to gather the `cos`/`sin` caches; if `None`, the caches are broadcast directly.
572
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
573
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
574
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
575
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
576
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
577
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
578
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
579
+ Returns:
580
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
581
+ """
582
+ if position_ids is not None:
583
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
584
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
585
+ else:
586
+ cos = cos.unsqueeze(0).unsqueeze(unsqueeze_dim)
587
+ sin = sin.unsqueeze(0).unsqueeze(unsqueeze_dim)
588
+ q_embed = (q * cos) + (rotate_half(q) * sin)
589
+ k_embed = (k * cos) + (rotate_half(k) * sin)
590
+ return q_embed, k_embed
591
+
592
+ class HunYuanVLRotaryEmbedding(nn.Module):
593
+ inv_freq: torch.Tensor # fix linting for `register_buffer`
594
+
595
+ def __init__(self, config: HunYuanVLConfig, device=None):
596
+ super().__init__()
597
+ # BC: "rope_type" was originally "type"
598
+ if hasattr(config, "rope_scaling") and isinstance(config.rope_scaling, dict):
599
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
600
+ else:
601
+ self.rope_type = "default"
602
+ self.max_seq_len_cached = config.max_position_embeddings
603
+ self.original_max_seq_len = config.max_position_embeddings
604
+
605
+ self.config = config
606
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type if self.rope_type != "xdrope" else "dynamic"]
607
+ if self.rope_type in ["xdrope", "dynamic"] and config.rope_scaling["alpha"]:
608
+ # DynamicNTKAlphaRotary
609
+ self.dim = config.head_dim
610
+ base = config.rope_theta * config.rope_scaling.get("alpha") ** (self.dim / (self.dim - 2))
611
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
612
+ self.attention_scaling = 1.0
613
+ else:
614
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
615
+
616
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
617
+ self.original_inv_freq = self.inv_freq
618
+ self._set_cos_sin_cache(
619
+ seq_len=config.max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
620
+ )
621
+
622
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
623
+ self.max_seq_len_cached = seq_len
624
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
625
+ freqs = torch.outer(t, self.inv_freq)
626
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
627
+ emb = torch.cat((freqs, freqs), dim=-1).float()
628
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
629
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
630
+
631
+ def forward(self, x, seq_len: Optional[int]=None):
632
+ # x: [bs, num_attention_heads, seq_len, head_size]
633
+ if seq_len is not None and seq_len > self.max_seq_len_cached:
634
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
635
+
636
+ return (
637
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
638
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
639
+ )
640
+
641
+
642
+ class HunYuanVLAttention(nn.Module):
643
+
644
+ def __init__(self, config, layer_idx: int):
645
+ super().__init__()
646
+ self.config = config
647
+ self.layer_idx = layer_idx
648
+ self.is_causal = True # used in flash_attention
649
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
650
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
651
+ self.scaling = self.head_dim**-0.5
652
+ self.attention_dropout = config.attention_dropout
653
+ self.q_proj = nn.Linear(
654
+ config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
655
+ )
656
+ self.k_proj = nn.Linear(
657
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
658
+ )
659
+ self.v_proj = nn.Linear(
660
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
661
+ )
662
+ self.o_proj = nn.Linear(
663
+ config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
664
+ )
665
+
666
+ self.query_layernorm = HunYuanVLRMSNorm(self.head_dim, eps=config.rms_norm_eps)
667
+ self.key_layernorm = HunYuanVLRMSNorm(self.head_dim, eps=config.rms_norm_eps)
668
+
669
+ self.rotary_emb = HunYuanVLRotaryEmbedding(config=config)
670
+ self.xdrope_section = config.rope_scaling['xdrope_section']
671
+
672
+ @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
673
+ def forward(
674
+ self,
675
+ hidden_states: torch.Tensor,
676
+ position_embeddings: tuple[torch.Tensor, torch.Tensor],
677
+ position_ids: Optional[torch.LongTensor] = None,
678
+ attention_mask: Optional[torch.Tensor] = None,
679
+ past_key_values: Optional[Cache] = None,
680
+ cache_position: Optional[torch.LongTensor] = None,
681
+ **kwargs: Unpack[TransformersKwargs],
682
+ ) -> tuple[torch.Tensor, torch.Tensor]:
683
+ input_shape = hidden_states.shape[:-1]
684
+ hidden_shape = (*input_shape, -1, self.head_dim)
685
+
686
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
687
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
688
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
689
+
690
+ kv_seq_len = key_states.shape[-2]
691
+ origin_kv_seq_len = key_states.shape[-2]
692
+ if past_key_values is not None:
693
+ kv_seq_len += past_key_values.get_seq_length(self.layer_idx)
694
+
695
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
696
+ if self.xdrope_section is not None:
697
+ if past_key_values is None or past_key_values.get_seq_length() == 0:
698
+ output_size = (
699
+ query_states.size(0),
700
+ query_states.size(1),
701
+ query_states.size(2),
702
+ key_states.size(2),
703
+ )
704
+ query_states, key_states = apply_rotary_pos_emb_xdrope(
705
+ query_states, key_states, cos, sin, position_ids, self.xdrope_section, output_size
706
+ )
707
+ else:
708
+ position_ids = (
709
+ torch.ones(position_ids.shape[0], 1, dtype=torch.long, device=position_ids.device)
710
+ * past_key_values.get_seq_length()
711
+ )
712
+ cos, sin = cos[-origin_kv_seq_len:, :], sin[-origin_kv_seq_len:, :]
713
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
714
+ else:
715
+ position_ids = torch.ones(
716
+ position_ids.shape[0], 1, dtype=torch.long, device=position_ids.device
717
+ ) * past_key_values.get_seq_length(self.layer_idx)
718
+ cos, sin = cos[-origin_kv_seq_len:, :], sin[-origin_kv_seq_len:, :]
719
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
720
+
721
+ query_states = self.query_layernorm(query_states)
722
+ key_states = self.key_layernorm(key_states)
723
+
724
+ if past_key_values is not None:
725
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
726
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
727
+ key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
728
+
729
+ attention_interface: Callable = eager_attention_forward
730
+ if self.config._attn_implementation != "eager":
731
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
732
+
733
+ attn_output, attn_weights = attention_interface(
734
+ self,
735
+ query_states,
736
+ key_states,
737
+ value_states,
738
+ attention_mask,
739
+ dropout=0.0 if not self.training else self.attention_dropout,
740
+ scaling=self.scaling,
741
+ **kwargs,
742
+ )
743
+
744
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
745
+ attn_output = self.o_proj(attn_output)
746
+ return attn_output, attn_weights
747
+
748
+ class HunYuanVLDecoderLayer(LlamaDecoderLayer):
749
+ def __init__(
750
+ self,
751
+ config: Union[HunYuanVLVisionConfig, HunYuanVLTextConfig],
752
+ layer_idx: int):
753
+ super().__init__(config, layer_idx)
754
+ self.layer_idx = layer_idx
755
+ if config.norm_type == 'hf_rms' or config.norm_type == 'rms':
756
+ self.input_layernorm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
757
+ self.post_attention_layernorm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
758
+ elif config.norm_type == 'fused' or config.norm_type == 'torch_nn':
759
+ self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
760
+ self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
761
+ else:
762
+ assert False, "other norm_type are not supported"
763
+
764
+
765
+ class HunYuanVLPreTrainedModel(LlamaPreTrainedModel):
766
+ def _init_weights(self, module):
767
+ std = self.config.initializer_range
768
+ if isinstance(module, nn.Linear):
769
+ module.weight.data.normal_(mean=0.0, std=std)
770
+ if module.bias is not None:
771
+ module.bias.data.zero_()
772
+ elif isinstance(module, nn.Embedding):
773
+ module.weight.data.normal_(mean=0.0, std=std)
774
+ if module.padding_idx is not None:
775
+ module.weight.data[module.padding_idx].zero_()
776
+
777
+
778
+ @auto_docstring
779
+ class HunYuanVLModel(HunYuanVLPreTrainedModel):
780
+ def __init__(self, config: Union[HunYuanVLConfig, HunYuanVLTextConfig]):
781
+ super().__init__(config)
782
+ self.padding_idx = config.pad_token_id
783
+ self.vocab_size = config.vocab_size
784
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
785
+ self.layers = nn.ModuleList(
786
+ [HunYuanVLDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
787
+ )
788
+ self.norm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
789
+ self.gradient_checkpointing = False
790
+ self.post_init()
791
+
792
+ @check_model_inputs
793
+ # @auto_docstring # TODO Fix this
794
+ def forward(
795
+ self,
796
+ input_ids: Optional[torch.LongTensor] = None,
797
+ attention_mask: Optional[torch.Tensor] = None,
798
+ position_ids: Optional[torch.LongTensor] = None,
799
+ past_key_values: Optional[Cache] = None,
800
+ inputs_embeds: Optional[torch.FloatTensor] = None,
801
+ cache_position: Optional[torch.LongTensor] = None,
802
+ use_cache: Optional[bool] = None,
803
+ **kwargs: Unpack[TransformersKwargs],
804
+ ) -> BaseModelOutputWithPast:
805
+ if (input_ids is None) ^ (inputs_embeds is not None):
806
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
807
+
808
+ if inputs_embeds is None:
809
+ inputs_embeds: torch.Tensor = self.embed_tokens(input_ids)
810
+
811
+ if use_cache and past_key_values is None:
812
+ past_key_values = DynamicCache(config=self.config)
813
+
814
+ if cache_position is None:
815
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
816
+ cache_position: torch.Tensor = torch.arange(
817
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
818
+ )
819
+
820
+ if position_ids is None:
821
+ position_ids = cache_position.unsqueeze(0)
822
+
823
+ causal_mask = create_causal_mask(
824
+ config=self.config,
825
+ input_embeds=inputs_embeds,
826
+ attention_mask=attention_mask,
827
+ cache_position=cache_position,
828
+ past_key_values=past_key_values,
829
+ position_ids=position_ids,
830
+ )
831
+ hidden_states = inputs_embeds
832
+ for decoder_layer in self.layers[: self.config.num_hidden_layers]:
833
+ hidden_states = decoder_layer(
834
+ hidden_states,
835
+ attention_mask=causal_mask,
836
+ position_ids=position_ids,
837
+ past_key_values=past_key_values,
838
+ cache_position=cache_position,
839
+ **kwargs,
840
+ )
841
+
842
+ hidden_states = self.norm(hidden_states)
843
+ return BaseModelOutputWithPast(
844
+ last_hidden_state=hidden_states,
845
+ past_key_values=past_key_values,
846
+ )
847
+
848
+ class HunYuanVLForCausalLM(LlamaForCausalLM):
849
+ pass
850
+
851
+ class HunYuanVLForConditionalGeneration(HunYuanVLPreTrainedModel, GenerationMixin):
852
+ _tied_weights_keys = ["lm_head.weight"]
853
+ config: HunYuanVLConfig
854
+
855
+ def __init__(self, config: HunYuanVLConfig):
856
+ super().__init__(config)
857
+ self.model = HunYuanVLModel(config)
858
+ self.vocab_size = config.vocab_size
859
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
860
+ self.vit = HunYuanVisionTransformer(config.vision_config)
861
+ self.config = config
862
+ self.post_init()
863
+
864
+ def set_decoder(self, decoder):
865
+ self.model = decoder
866
+
867
+ def get_decoder(self):
868
+ return self.model
869
+
870
+ @can_return_tuple
871
+ @auto_docstring
872
+ def forward(
873
+ self,
874
+ input_ids: Optional[torch.LongTensor] = None,
875
+ attention_mask: Optional[torch.Tensor] = None,
876
+ position_ids: Optional[torch.LongTensor] = None,
877
+ past_key_values: Optional[Cache] = None,
878
+ inputs_embeds: Optional[torch.FloatTensor] = None,
879
+ labels: Optional[torch.LongTensor] = None,
880
+ use_cache: Optional[bool] = None,
881
+ cache_position: Optional[torch.LongTensor] = None,
882
+ logits_to_keep: Union[int, torch.Tensor] = 0,
883
+ **kwargs: Unpack[TransformersKwargs],
884
+ ) -> CausalLMOutputWithPast:
885
+ r"""
886
+ Example:
887
+
888
+ ```python
889
+ >>> from transformers import AutoProcessor, HunYuanVLForConditionalGeneration
890
+ >>> from PIL import Image
891
+ >>> import torch
892
+
893
+ >>> model_name_or_path = "tencent/HunyuanOCR"
894
+ >>> processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False)
895
+ >>> model = HunYuanVLForConditionalGeneration.from_pretrained(
896
+ ... model_name_or_path,
897
+ ... attn_implementation="eager",
898
+ ... torch_dtype=torch.bfloat16,
899
+ ... device_map="auto",
900
+ ... )
901
+
902
+ >>> img_path = "path/to/your/image.jpg"
903
+ >>> image = Image.open(img_path).convert("RGB")
904
+
905
+ >>> messages = [
906
+ ... {
907
+ ... "role": "user",
908
+ ... "content": [
909
+ ... {"type": "image", "image": img_path},
910
+ ... {"type": "text", "text": "Extract the text from the image."},
911
+ ... ],
912
+ ... }
913
+ ... ]
914
+ >>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
915
+ >>> inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to(model.device)
916
+
917
+ >>> with torch.no_grad():
918
+ ... generated_ids = model.generate(**inputs, max_new_tokens=1024)
919
+ >>> generated_ids_trimmed = generated_ids[0][len(inputs["input_ids"][0]):]
920
+ >>> output = processor.decode(generated_ids_trimmed, skip_special_tokens=True)
921
+
922
+ >>> print(output)
923
+
924
+ ```"""
925
+ outputs: BaseModelOutputWithPast = self.model(
926
+ input_ids=input_ids,
927
+ attention_mask=attention_mask,
928
+ position_ids=position_ids,
929
+ past_key_values=past_key_values,
930
+ inputs_embeds=inputs_embeds,
931
+ use_cache=use_cache,
932
+ cache_position=cache_position,
933
+ **kwargs,
934
+ )
935
+
936
+ hidden_states = outputs.last_hidden_state
937
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
938
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
939
+ logits = self.lm_head(hidden_states[:, slice_indices, :])
940
+
941
+ loss = None
942
+ if labels is not None:
943
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
944
+
945
+ return CausalLMOutputWithPast(
946
+ loss=loss,
947
+ logits=logits,
948
+ past_key_values=outputs.past_key_values,
949
+ hidden_states=outputs.hidden_states,
950
+ attentions=outputs.attentions,
951
+ )
952
+
953
+ # def prepare_inputs_for_generation(
954
+ # self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
955
+ # ):
956
+ # inputs = super().prepare_inputs_for_generation(
957
+ # input_ids,
958
+ # past_key_values=past_key_values,
959
+ # attention_mask=attention_mask,
960
+ # inputs_embeds=inputs_embeds,
961
+ # **kwargs,
962
+ # )
963
+ # return inputs
964
+
965
+ @torch.no_grad()
966
+ def generate(
967
+ self,
968
+ input_ids: Optional[torch.Tensor] = None,
969
+ attention_mask: Optional[torch.Tensor] = None,
970
+ position_ids: Optional[torch.LongTensor] = None,
971
+ imgs: Optional[list[torch.FloatTensor]] = None,
972
+ imgs_pos: Optional[list[int]] = None,
973
+ token_type_ids: Optional[torch.LongTensor] = None,
974
+ pixel_values: Optional[torch.FloatTensor] = None,
975
+ image_grid_thw: Optional[list[int]] = None,
976
+ **kwargs,
977
+ ) -> torch.LongTensor:
978
+ if "inputs_embeds" in kwargs:
979
+ raise NotImplementedError("`inputs_embeds` is not supported")
980
+
981
+ inputs_embeds = self.model.embed_tokens(input_ids)
982
+
983
+ if self.vit is not None and pixel_values is not None:
984
+ pixel_values = pixel_values.to(torch.bfloat16)
985
+ image_embeds = self.vit(pixel_values, image_grid_thw)
986
+
987
+ # The ViT may sit on different GPUs than the language model because of accelerate's auto device mapping.
988
+ image_embeds = image_embeds.to(input_ids.device, non_blocking=True)
989
+
990
+ image_mask, _ = self.get_placeholder_mask(
991
+ input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds
992
+ )
993
+ inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
994
+
995
+ return super().generate(
996
+ inputs=input_ids,
997
+ position_ids=position_ids,
998
+ attention_mask=attention_mask,
999
+ inputs_embeds=inputs_embeds,
1000
+ # eos_token_id=self.config.eod_token_id,
1001
+ **kwargs,
1002
+ )
1003
+
1004
+ # Copied from transformers.models.llava.modeling_llava.LlavaModel.get_placeholder_mask
1005
+ def get_placeholder_mask(
1006
+ self,
1007
+ input_ids: torch.LongTensor,
1008
+ inputs_embeds: torch.FloatTensor,
1009
+ image_features: Optional[torch.FloatTensor] = None
1010
+ ):
1011
+ """
1012
+ Obtains multimodal placeholder mask from `input_ids` or `inputs_embeds`, and checks that the placeholder token count is
1013
+ equal to the length of multimodal features. If the lengths are different, an error is raised.
1014
+ """
1015
+ if input_ids is None:
1016
+ special_image_mask = inputs_embeds == self.get_input_embeddings()(
1017
+ torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
1018
+ )
1019
+ special_image_mask = special_image_mask.all(-1)
1020
+ else:
1021
+ special_image_mask = input_ids == self.config.image_token_id
1022
+
1023
+ n_image_tokens = special_image_mask.sum()
1024
+ special_image_mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
1025
+ if image_features is not None and inputs_embeds[special_image_mask].numel() != image_features.numel():
1026
+ raise ValueError(
1027
+ f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {image_features.shape[0]}"
1028
+ )
1029
+
1030
+ return special_image_mask, None
1031
+
1032
+
1033
+ __all__ = [
1034
+ "HunYuanVLConfig",
1035
+ "HunYuanVLVisionConfig",
1036
+ "HunYuanVLTextConfig",
1037
+ "HunYuanVLForConditionalGeneration",
1038
+ "HunYuanVLForCausalLM",
1039
+ "HunYuanVLModel",
1040
+ "HunYuanVLPreTrainedModel",
1041
+ "HunYuanVLTextModel"
1042
+ ]
preprocessor_config.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "min_pixels": 262144,
3
+ "max_pixels": 4194304,
4
+ "patch_size": 16,
5
+ "resample": 1,
6
+ "temporal_patch_size": 1,
7
+ "merge_size": 2,
8
+ "image_mean": [
9
+ 0.48145466,
10
+ 0.4578275,
11
+ 0.40821073
12
+ ],
13
+ "image_std": [
14
+ 0.26862954,
15
+ 0.26130258,
16
+ 0.27577711
17
+ ],
18
+ "image_processor_type": "HunYuanVLImageProcessor",
19
+ "processor_class": "HunYuanVLProcessor",
20
+ "auto_map": {
21
+ "AutoProcessor": "processing_hunyuan_vl.HunYuanVLProcessor",
22
+ "AutoImageProcessor": "image_processing_hunyuan_vl.HunYuanVLImageProcessor"
23
+ }
24
+ }
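These numbers are mutually consistent: `min_pixels = 262144 = 512²` and `max_pixels = 4194304 = 2048²` match the vision config's `min_image_size`/`max_image_size`, and with `patch_size = 16` and `merge_size = 2` each merged token covers a 32×32 pixel area. A quick arithmetic check, assuming a square maximum-size input:

```python
patch_size, merge_size = 16, 2
side = 2048                                 # sqrt(max_pixels)
grid = side // (patch_size * merge_size)    # 64 merged patches per side
assert grid * grid == 4096                  # == img_max_token_num grid tokens
assert grid * (grid + 1) + 2 == 4162        # with per-row newline + begin/end tokens
```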
processing_hunyuan_vl.py ADDED
@@ -0,0 +1,194 @@
1
+ import os
2
+ from typing import Union
3
+ import torch
4
+ import numpy as np
5
+
6
+ from transformers.feature_extraction_utils import BatchFeature
7
+ from transformers.image_utils import ImageInput
8
+ from transformers.video_utils import VideoInput
9
+ from transformers.processing_utils import ProcessorMixin
10
+ from transformers.tokenization_utils_base import PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
11
+ from transformers.utils import TensorType, logging
12
+
13
+
14
+ logger = logging.get_logger(__name__)
15
+
16
+
17
+ class HunYuanVLProcessor(ProcessorMixin):
18
+ attributes = ['image_processor', 'tokenizer']
19
+ valid_kwargs = ["chat_template"]
20
+ image_processor_class = "AutoImageProcessor"
21
+ tokenizer_class = "AutoTokenizer" # ("AutoTokenizer", None)
22
+
23
+ def __init__(self, image_processor=None, tokenizer=None, video_processor=None, chat_template=None, **kwargs):
24
+ # TODO Fix the init
25
+ self.tokenizer = tokenizer
26
+ self.image_token_id = 120120 # self.tokenizer.image_token_id
27
+ self.image_token = self.tokenizer.convert_ids_to_tokens(self.image_token_id)
28
+ self.im_start_token_id = 120118 # self.tokenizer.im_start_id
29
+ self.im_start_token = self.tokenizer.convert_ids_to_tokens(self.im_start_token_id)
30
+ self.im_end_token_id = 120119 # self.tokenizer.im_end_id
31
+ self.im_end_token = self.tokenizer.convert_ids_to_tokens(self.im_end_token_id)
32
+ self.placeholder_token = self.tokenizer.convert_ids_to_tokens(self.tokenizer.vocab_size - 1)
33
+ self.pad_id = 120002 #self.tokenizer.pad_token_id
34
+
35
+ super().__init__(image_processor, tokenizer, video_processor, chat_template=chat_template)
36
+
37
+ def __call__(
38
+ self,
39
+ images: ImageInput = None,
40
+ text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]] = None,
41
+ videos: VideoInput = None,
42
+ **kwargs
43
+ ) -> BatchFeature:
44
+ image_inputs, videos_inputs = {}, {}
45
+ if images is not None:
46
+ image_inputs = self.image_processor(images=images)
47
+ image_grid_thw = image_inputs["image_grid_thw"]
48
+
49
+ if not isinstance(text, list):
50
+ text = [text]
51
+
52
+ text = text.copy() # below lines change text in-place
53
+
54
+ image_tokens_cumsum = [0]
55
+ if images is not None:
56
+ index = 0
57
+ for i in range(len(text)):
58
+ while self.image_token in text[i]:
59
+ grid_h, grid_w = image_grid_thw[index][-2:]
60
+ patch_h = grid_h // self.image_processor.merge_size
61
+ patch_w = grid_w // self.image_processor.merge_size
62
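+ # one image-newline token per merged row, plus the begin/end tokens appended by HunYuanVisionPatchMerger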
+ num_image_tokens = patch_h * (patch_w + 1) + 2
63
+ image_tokens_cumsum.append(image_tokens_cumsum[-1] + num_image_tokens)
64
+ # text[i] = text[i].replace(self.image_token, self.im_start_token + self.placeholder_token * num_image_tokens + self.im_end_token, 1)
65
+ text[i] = text[i].replace(self.image_token, self.placeholder_token * num_image_tokens, 1)
66
+ index += 1
67
+ text[i] = text[i].replace(self.placeholder_token, self.image_token)
68
+ # text[i] = self.tokenizer.bos_token + text[i]
69
+
70
+ text_inputs = self.tokenizer(text, add_special_tokens=False, **kwargs)
71
+ self._check_special_mm_tokens(text, text_inputs, modalities=["image"])
72
+
73
+ input_ids = text_inputs['input_ids']
74
+ position_ids = torch.arange(len(input_ids[0]))
75
+ position_ids_w = torch.arange(len(input_ids[0]))
76
+ position_ids_h = torch.arange(len(input_ids[0]))
77
+ position_ids_t = torch.arange(len(input_ids[0]))
78
+
79
+ if images is not None:
80
+ image_token_pos_indices = torch.where(input_ids[0] == self.image_token_id)[0]
81
+ for i in range(len(image_grid_thw)):
82
+ grid_h, grid_w = image_grid_thw[i][-2:]
83
+ patch_h = grid_h // self.image_processor.merge_size
84
+ patch_w = grid_w // self.image_processor.merge_size
85
+ start_pos = image_token_pos_indices[image_tokens_cumsum[i]].item() + 1
86
+ replace_num = (patch_w + 1) * patch_h
87
+ position_ids_w[start_pos: start_pos + replace_num] = torch.tensor(list(range(patch_w + 1)) * patch_h, dtype=torch.int64)
88
+ patch_h_list = []
89
+ for h in range(patch_h):
90
+ patch_h_list += [h] * (patch_w+1)
91
+ position_ids_h[start_pos: start_pos + replace_num] = torch.tensor(patch_h_list, dtype=torch.int64)
92
+ position_ids_t[start_pos: start_pos + replace_num] = 0
93
+
94
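+ # four position streams (sequence index, image column, image row, time) consumed by apply_rotary_pos_emb_xdrope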
+ position_ids = torch.stack([position_ids, position_ids_w, position_ids_h, position_ids_t]).unsqueeze(0)
95
+ text_inputs['position_ids'] = position_ids
96
+
97
+ attention_mask = input_ids.ne(self.pad_id)
98
+ text_inputs["attention_mask"] = attention_mask
99
+ text_inputs["imgs_pos"] = [self.get_imgs_pos(input_ids)]
100
+ # image_inputs["imgs"] = [[image_inputs["pixel_values"]]]
101
+
102
+ return_tensors = kwargs.pop("return_tensors", None)
103
+ return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs}, tensor_type=return_tensors)
104
+
105
+ def batch_decode(self, *args, **kwargs):
106
+ return self.tokenizer.batch_decode(*args, **kwargs)
107
+
108
+ def decode(self, *args, **kwargs):
109
+ return self.tokenizer.decode(*args, **kwargs)
110
+
111
+ def post_process_image_text_to_text(
112
+ self, generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False, **kwargs
113
+ ):
114
+ raise NotImplementedError
115
+
116
+ def apply_chat_template(self, *args, **kwargs):
117
+ token_ids = self.tokenizer.apply_chat_template(*args, **kwargs)
118
+ return token_ids
119
+
120
+ def get_imgs_pos(self, doc_ids):
121
+ doc_ids = np.array(doc_ids, dtype=np.int64)
122
+ img_begin_index = np.where(doc_ids == self.im_start_token_id)[0]
123
+ img_end_index = np.where(doc_ids == self.im_end_token_id)[0]
124
+ imgs_pos = np.concatenate((np.reshape(img_begin_index + 1, (-1, 1)), np.reshape(img_end_index, (-1, 1))), axis=-1).tolist()
125
+ return imgs_pos
126
+
127
+ @property
128
+ def model_input_names(self):
129
+ tokenizer_input_names = self.tokenizer.model_input_names
130
+ image_processor_input_names = self.image_processor.model_input_names
131
+ return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
132
+
133
+
134
+ def split_image_into_patch_blocks(
135
+ pixel_values: torch.Tensor, # shape: [batch_size, 3, H, W]
136
+ patch_size: int = 16, # e.g. 16
137
+ adaptor_patch_div: int = 4, # e.g. 4 --> each patch_size patch is split into 4x4 sub-regions, i.e. patch_size // 4
138
+ ) -> torch.Tensor:
139
+ """
140
+ Split the input image tensor (supporting batch) into large patches of size `patch_size`,
141
+ and then further divide each large patch into smaller regions of size
142
+ (patch_size // adaptor_patch_div) x (patch_size // adaptor_patch_div).
143
+ Each small region is extracted as a tensor of shape [3, patch_size, patch_size].
144
+ The final output contains all such small region tensors.
145
+
146
+ Args:
147
+ pixel_values: Input image tensor of shape [batch_size, 3, H, W].
148
+ patch_size: Size of the large patch, e.g., 16.
149
+ adaptor_patch_div: Each large patch is divided into
150
+ (patch_size // adaptor_patch_div) x (patch_size // adaptor_patch_div)
151
+ smaller regions.
152
+
153
+ Returns:
154
+ patches: A tensor of shape [N, 3, patch_size, patch_size],
155
+ where N = batch_size * (H // patch_size) * (W // patch_size).
156
+ Each element in the batch corresponds to one small image region.
157
+ """
158
+ batch_size, channels, height, width = pixel_values.shape
159
+ assert channels == 3, "Pixel values must have 3 channels in dim=1"
160
+ assert height % patch_size == 0 and width % patch_size == 0, "H and W must be divisible by patch_size"
161
+
162
+ patch_height_num = height // patch_size
163
+ patch_width_num = width // patch_size
164
+ small_regions_per_patch = (patch_size // adaptor_patch_div) ** 2
165
+
166
+ # Reshape to [B, 3, ph, ps, pw, ps]
167
+ img = pixel_values.reshape(
168
+ batch_size, 3,
169
+ patch_height_num, patch_size,
170
+ patch_width_num, patch_size
171
+ )
172
+
173
+ # Further split each psxps patch into (ps//aps)x(ps//aps) small regions
174
+ img = img.reshape(
175
+ batch_size, 3,
176
+ patch_height_num,
177
+ patch_size // adaptor_patch_div, # ps // aps
178
+ adaptor_patch_div,
179
+ patch_width_num,
180
+ patch_size // adaptor_patch_div, # ps // aps
181
+ adaptor_patch_div
182
+ )
183
+
184
+ # Permute to group the small regions: [B, ph, pw, ps//aps, ps//aps, 3, aps, aps]
185
+ img = img.permute(0, 2, 5, 3, 6, 1, 4, 7)
186
+
187
+ # Reshape into [B * ph * pw * (ps//aps)^2, 3, patch_size, patch_size]
188
+ patches = img.reshape(-1, 3, patch_size, patch_size)
189
+
190
+ return patches
191
+
192
+
193
+
194
+ __all__ = ["HunYuanVLProcessor"]
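`get_imgs_pos` above pairs each image-start/image-end marker into an exclusive span of image-token positions. A toy check (hypothetical token ids, using the hard-coded 120118/120119 marker ids):

```python
import numpy as np

im_start, im_end = 120118, 120119
doc_ids = np.array([1, im_start, 7, 7, 7, im_end, 2])
begin = np.where(doc_ids == im_start)[0]
end = np.where(doc_ids == im_end)[0]
spans = np.concatenate((begin.reshape(-1, 1) + 1, end.reshape(-1, 1)), axis=-1).tolist()
assert spans == [[2, 5]]   # image tokens occupy positions 2..4
```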
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "bos_token": "<|hy_begin▁of▁sentence|>",
3
+ "eos_token": "<|hy_place▁holder▁no▁2|>",
4
+ "pad_token": "<|hy_▁pad▁|>"
5
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff