lvyufeng committed on
Commit 629b298 · verified · 1 Parent(s): 6cc7e1d

Upload folder using huggingface_hub

.gitignore ADDED
@@ -0,0 +1,6 @@
+ *.swp
+
+ *.bak
+ *.bak*
+
+ bak/
LICENSE ADDED
@@ -0,0 +1,78 @@
+ TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT
+ Tencent HunyuanOCR Release Date: November 25, 2025
+ THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION, UNITED KINGDOM AND SOUTH KOREA AND IS EXPRESSLY LIMITED TO THE TERRITORY, AS DEFINED BELOW.
+ By clicking to agree or by using, reproducing, modifying, distributing, performing or displaying any portion or element of the Tencent Hunyuan Works, including via any Hosted Service, You will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
+ 1. DEFINITIONS.
+ a. “Acceptable Use Policy” shall mean the policy made available by Tencent as set forth in the Exhibit A.
+ b. “Agreement” shall mean the terms and conditions for use, reproduction, distribution, modification, performance and displaying of Tencent Hunyuan Works or any portion or element thereof set forth herein.
+ c. “Documentation” shall mean the specifications, manuals and documentation for Tencent Hunyuan made publicly available by Tencent.
+ d. “Hosted Service” shall mean a hosted service offered via an application programming interface (API), web access, or any other electronic or remote means.
+ e. “Licensee,” “You” or “Your” shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Tencent Hunyuan Works for any purpose and in any field of use.
+ f. “Materials” shall mean, collectively, Tencent’s proprietary Tencent Hunyuan and Documentation (and any portion thereof) as made available by Tencent under this Agreement.
+ g. “Model Derivatives” shall mean all: (i) modifications to Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; (ii) works based on Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of Tencent Hunyuan or any Model Derivative of Tencent Hunyuan, to that model in order to cause that model to perform similarly to Tencent Hunyuan or a Model Derivative of Tencent Hunyuan, including distillation methods, methods that use intermediate data representations, or methods based on the generation of synthetic data Outputs by Tencent Hunyuan or a Model Derivative of Tencent Hunyuan for training that model. For clarity, Outputs by themselves are not deemed Model Derivatives.
+ h. “Output” shall mean the information and/or content output of Tencent Hunyuan or a Model Derivative that results from operating or otherwise using Tencent Hunyuan or a Model Derivative, including via a Hosted Service.
+ i. “Tencent,” “We” or “Us” shall mean the applicable entity or entities in the Tencent corporate family that own(s) intellectual property or other rights embodied in or utilized by the Materials.
+ j. “Tencent Hunyuan” shall mean the large language models, text/image/video/audio/3D generation models, and multimodal large language models and their software and algorithms, including trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing made publicly available by Us, including, without limitation to, Tencent HunyuanOCR released at [https://huggingface.co/tencent/HunyuanOCR].
+ k. “Tencent Hunyuan Works” shall mean: (i) the Materials; (ii) Model Derivatives; and (iii) all derivative works thereof.
+ l. “Territory” shall mean the worldwide territory, excluding the territory of the European Union, United Kingdom and South Korea.
+ m. “Third Party” or “Third Parties” shall mean individuals or legal entities that are not under common control with Us or You.
+ n. “including” shall mean including but not limited to.
+ 2. GRANT OF RIGHTS.
+ We grant You, for the Territory only, a non-exclusive, non-transferable and royalty-free limited license under Tencent’s intellectual property or other rights owned by Us embodied in or utilized by the Materials to use, reproduce, distribute, create derivative works of (including Model Derivatives), and make modifications to the Materials, only in accordance with the terms of this Agreement and the Acceptable Use Policy, and You must not violate (or encourage or permit anyone else to violate) any term of this Agreement or the Acceptable Use Policy.
+ 3. DISTRIBUTION.
+ You may, subject to Your compliance with this Agreement, distribute or make available to Third Parties the Tencent Hunyuan Works, exclusively in the Territory, provided that You meet all of the following conditions:
+ a. You must provide all such Third Party recipients of the Tencent Hunyuan Works or products or services using them a copy of this Agreement;
+ b. You must cause any modified files to carry prominent notices stating that You changed the files;
+ c. You are encouraged to: (i) publish at least one technology introduction blogpost or one public statement expressing Your experience of using the Tencent Hunyuan Works; and (ii) mark the products or services developed by using the Tencent Hunyuan Works to indicate that the product/service is “Powered by Tencent Hunyuan”; and
+ d. All distributions to Third Parties (other than through a Hosted Service) must be accompanied by a “Notice” text file that contains the following notice: “Tencent Hunyuan is licensed under the Tencent Hunyuan Community License Agreement, Copyright © 2025 Tencent. All Rights Reserved. The trademark rights of “Tencent Hunyuan” are owned by Tencent or its affiliate.”
+ e. In the event that You use, integrate, implement, or otherwise deploy the Tencent Hunyuan Works, in whole or in part, to provide, enable, or support any service, product, or functionality to third parties, You shall clearly, accurately, and prominently disclose to all end users the full legal name and entity of the actual provider of such service, product, or functionality. You shall expressly and conspicuously state that Tencent is not affiliated with, associated with, sponsoring, or endorsing any such service, product, or functionality. You shall not use or display any name, logo, trademark, trade name, or other indicia of Tencent in any manner that could be construed as, or be likely to create, confusion, deception, or a false impression regarding any relationship, affiliation, sponsorship, or endorsement by Tencent.
+ You may add Your own copyright statement to Your modifications and, except as set forth in this Section and in Section 5, may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Model Derivatives as a whole, provided Your use, reproduction, modification, distribution, performance and display of the work otherwise complies with the terms and conditions of this Agreement (including as regards the Territory). If You receive Tencent Hunyuan Works from a Licensee as part of an integrated end user product, then this Section 3 of this Agreement will not apply to You.
+ 4. ADDITIONAL COMMERCIAL TERMS.
+ If, on the Tencent Hunyuan version release date, the monthly active users of all products or services made available by or for Licensee is greater than 100 million monthly active users in the preceding calendar month, You must request a license from Tencent, which Tencent may grant to You in its sole discretion, and You are not authorized to exercise any of the rights under this Agreement unless or until Tencent otherwise expressly grants You such rights.
+ 5. RULES OF USE.
+ a. Your use of the Tencent Hunyuan Works must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Tencent Hunyuan Works, which is hereby incorporated by reference into this Agreement. You must include the use restrictions referenced in these Sections 5(a) and 5(b) as an enforceable provision in any agreement (e.g., license agreement, terms of use, etc.) governing the use and/or distribution of Tencent Hunyuan Works and You must provide notice to subsequent users to whom You distribute that Tencent Hunyuan Works are subject to the use restrictions in these Sections 5(a) and 5(b).
+ b. You must not use the Tencent Hunyuan Works or any Output or results of the Tencent Hunyuan Works to improve any other AI model (other than Tencent Hunyuan or Model Derivatives thereof).
+ c. You must not use, reproduce, modify, distribute, or display the Tencent Hunyuan Works, Output or results of the Tencent Hunyuan Works outside the Territory. Any such use outside the Territory is unlicensed and unauthorized under this Agreement.
+ 6. INTELLECTUAL PROPERTY.
+ a. Subject to Tencent’s ownership of Tencent Hunyuan Works made by or for Tencent and intellectual property rights therein, conditioned upon Your compliance with the terms and conditions of this Agreement, as between You and Tencent, You will be the owner of any derivative works and modifications of the Materials and any Model Derivatives that are made by or for You.
+ b. No trademark licenses are granted under this Agreement, and in connection with the Tencent Hunyuan Works, Licensee may not use any name or mark owned by or associated with Tencent or any of its affiliates, except as required for reasonable and customary use in describing and distributing the Tencent Hunyuan Works. Tencent hereby grants You a license to use “Tencent Hunyuan” (the “Mark”) in the Territory solely as required to comply with the provisions of Section 3(c), provided that You comply with any applicable laws related to trademark protection. All goodwill arising out of Your use of the Mark will inure to the benefit of Tencent.
+ c. If You commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any person or entity alleging that the Materials or any Output, or any portion of any of the foregoing, infringe any intellectual property or other right owned or licensable by You, then all licenses granted to You under this Agreement shall terminate as of the date such lawsuit or other proceeding is filed. You will defend, indemnify and hold harmless Us from and against any claim by any Third Party arising out of or related to Your or the Third Party’s use or distribution of the Tencent Hunyuan Works.
+ d. Tencent claims no rights in Outputs You generate. You and Your users are solely responsible for Outputs and their subsequent uses.
+ 7. DISCLAIMERS OF WARRANTY AND LIMITATIONS OF LIABILITY.
+ a. We are not obligated to support, update, provide training for, or develop any further version of the Tencent Hunyuan Works or to grant any license thereto.
+ b. UNLESS AND ONLY TO THE EXTENT REQUIRED BY APPLICABLE LAW, THE TENCENT HUNYUAN WORKS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED “AS IS” WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES OF ANY KIND INCLUDING ANY WARRANTIES OF TITLE, MERCHANTABILITY, NONINFRINGEMENT, COURSE OF DEALING, USAGE OF TRADE, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING, REPRODUCING, MODIFYING, PERFORMING, DISPLAYING OR DISTRIBUTING ANY OF THE TENCENT HUNYUAN WORKS OR OUTPUTS AND ASSUME ANY AND ALL RISKS ASSOCIATED WITH YOUR OR A THIRD PARTY’S USE OR DISTRIBUTION OF ANY OF THE TENCENT HUNYUAN WORKS OR OUTPUTS AND YOUR EXERCISE OF RIGHTS AND PERMISSIONS UNDER THIS AGREEMENT.
+ c. TO THE FULLEST EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL TENCENT OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, FOR ANY DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, CONSEQUENTIAL OR PUNITIVE DAMAGES, OR LOST PROFITS OF ANY KIND ARISING FROM THIS AGREEMENT OR RELATED TO ANY OF THE TENCENT HUNYUAN WORKS OR OUTPUTS, EVEN IF TENCENT OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
+ 8. SURVIVAL AND TERMINATION.
+ a. The term of this Agreement shall commence upon Your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
+ b. We may terminate this Agreement if You breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, You must promptly delete and cease use of the Tencent Hunyuan Works. Sections 6(a), 6(c), 7 and 9 shall survive the termination of this Agreement.
+ 9. GOVERNING LAW AND JURISDICTION.
+ a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of the Hong Kong Special Administrative Region of the People’s Republic of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
+ b. Exclusive jurisdiction and venue for any dispute arising out of or relating to this Agreement will be a court of competent jurisdiction in the Hong Kong Special Administrative Region of the People’s Republic of China, and Tencent and Licensee consent to the exclusive jurisdiction of such court with respect to any such dispute.
+
+ EXHIBIT A
+ ACCEPTABLE USE POLICY
+
+ Tencent reserves the right to update this Acceptable Use Policy from time to time.
+ Last modified: November 5, 2024
+
+ Tencent endeavors to promote safe and fair use of its tools and features, including Tencent Hunyuan. You agree not to use Tencent Hunyuan or Model Derivatives:
+ 1. Outside the Territory;
+ 2. In any way that violates any applicable national, federal, state, local, international or any other law or regulation;
+ 3. To harm Yourself or others;
+ 4. To repurpose or distribute output from Tencent Hunyuan or any Model Derivatives to harm Yourself or others;
+ 5. To override or circumvent the safety guardrails and safeguards We have put in place;
+ 6. For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
+ 7. To generate or disseminate verifiably false information and/or content with the purpose of harming others or influencing elections;
+ 8. To generate or facilitate false online engagement, including fake reviews and other means of fake online engagement;
+ 9. To intentionally defame, disparage or otherwise harass others;
+ 10. To generate and/or disseminate malware (including ransomware) or any other content to be used for the purpose of harming electronic systems;
+ 11. To generate or disseminate personal identifiable information with the purpose of harming others;
+ 12. To generate or disseminate information (including images, code, posts, articles), and place the information in any public context (including through the use of bot-generated tweets), without expressly and conspicuously identifying that the information and/or content is machine generated;
+ 13. To impersonate another individual without consent, authorization, or legal right;
+ 14. To make high-stakes automated decisions in domains that affect an individual’s safety, rights or wellbeing (e.g., law enforcement, migration, medicine/health, management of critical infrastructure, safety components of products, essential services, credit, employment, housing, education, social scoring, or insurance);
+ 15. In a manner that violates or disrespects the social ethics and moral standards of other countries or regions;
+ 16. To perform, facilitate, threaten, incite, plan, promote or encourage violent extremism or terrorism;
+ 17. For any use intended to discriminate against or harm individuals or groups based on protected characteristics or categories, online or offline social behavior or known or predicted personal or personality characteristics;
+ 18. To intentionally exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
+ 19. For military purposes;
+ 20. To engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or other professional practices.
README.md ADDED
@@ -0,0 +1,243 @@
+ ---
+ license: other
+ language:
+ - zh
+ - en
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ ---
+
+ <p align="center">
+ <img src="https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/assets/hyocr-head-img.png?raw=true" width="80%"/> <br>
+ </p>
+
+
+ <p align="center">
+ <a href="https://huggingface.co/spaces/tencent/HunyuanOCR"><b>🎯 Demo</b></a> |
+ <a href="https://huggingface.co/tencent/HunyuanOCR"><b>📥 Model Download</b></a> |
+ <a href="https://arxiv.org/abs/2511.19575"><b>📄 Technical Report</b></a> |
+ <a href="https://github.com/Tencent-Hunyuan/HunyuanOCR"><b>🌟 Github</b></a>
+ </p>
+
+ <h2>
+ <p align="center">
+ <a href="https://arxiv.org/abs/2511.19575">HunyuanOCR</a>
+ </p>
+ </h2>
+
+
+ ## Notice
+
+ The official repo of [HunyuanOCR](https://huggingface.co/tencent/HunyuanOCR) does not support any official `transformers` release and only provides a specific commit to install from. We modified the official implementation as `remote_code` so that it works with official `transformers` releases, which means you can use [HunyuanOCR](https://huggingface.co/tencent/HunyuanOCR) with the latest version of `transformers` easily.
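
For orientation, a minimal sketch of what `remote_code` loading does here (the class name comes from this repo's `config.json` `auto_map`; treat this as illustrative rather than an official API guarantee):

```python
from transformers import AutoModel

# trust_remote_code=True makes transformers import the modeling files shipped
# in this repo (per config.json "auto_map") instead of a built-in class.
model = AutoModel.from_pretrained("lvyufeng/HunyuanOCR", trust_remote_code=True)
print(type(model).__name__)  # expected: HunYuanVLForConditionalGeneration
```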
+
+ ## 📖 Introduction
+ **HunyuanOCR** is a leading end-to-end OCR expert VLM powered by Hunyuan's native multimodal architecture. Despite a remarkably lightweight 1B-parameter design, it achieves state-of-the-art results on multiple industry benchmarks. The model handles **complex multilingual document parsing** and excels in practical applications including **text spotting, open-field information extraction, video subtitle extraction, and photo translation**.
+
+
+ ## 🚀 Quick Start with Transformers
+
+ ### Installation
+
+ #### Use PyTorch + Transformers
+
+ ```bash
+ pip install transformers==4.57.3
+ ```
+
+ #### Use MindSpore + MindNLP
+
+ ```bash
+ pip install transformers==4.57.3
+ pip install git+https://github.com/mindspore-lab/mindnlp
+ ```
+
+
+ ### Model Inference
+
+ #### MindSpore + MindNLP
+
+ ```python
+ import mindtorch
+ import mindnlp
+ from transformers import AutoProcessor, AutoModel
+ from PIL import Image
+
+ def clean_repeated_substrings(text):
+     """Clean repeated substrings in text: if it ends with ten or more
+     back-to-back copies of the same short substring, keep only one copy."""
+     n = len(text)
+     if n < 8000:
+         return text
+     for length in range(2, n // 10 + 1):
+         candidate = text[-length:]
+         count = 0
+         i = n - length
+         # Count how many consecutive copies of `candidate` end the string.
+         while i >= 0 and text[i:i + length] == candidate:
+             count += 1
+             i -= length
+         if count >= 10:
+             return text[:n - length * (count - 1)]
+     return text
+
+ model_name_or_path = "lvyufeng/HunyuanOCR"
+ processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False, trust_remote_code=True)
+ img_path = "image_ocr.jpg"
+ image_inputs = Image.open(img_path)
+ messages1 = [
+     {"role": "system", "content": ""},
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": img_path},
+             # "Detect and recognize text in the image, and output the text coordinates in a formatted manner."
+             {"type": "text", "text": "检测并识别图片中的文字,将文本坐标格式化输出。"},
+         ],
+     },
+ ]
+ messages = [messages1]
+ texts = [
+     processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
+     for msg in messages
+ ]
+
+ inputs = processor(
+     text=texts,
+     images=image_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ model = AutoModel.from_pretrained(
+     model_name_or_path,
+     attn_implementation="eager",
+     dtype=mindtorch.float16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ with mindtorch.no_grad():
+     device = next(model.parameters()).device
+     inputs = inputs.to(device)
+     generated_ids = model.generate(**inputs, max_new_tokens=16384, do_sample=False)
+     input_ids = inputs.input_ids if "input_ids" in inputs else inputs.inputs
+     # Drop the prompt tokens so only the newly generated text is decoded.
+     generated_ids_trimmed = [
+         out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
+     ]
+     # Apply the repetition cleanup per decoded string, not to the list itself.
+     output_texts = [
+         clean_repeated_substrings(text)
+         for text in processor.batch_decode(
+             generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+         )
+     ]
+     print(output_texts)
+ ```
+
+ #### PyTorch + Transformers
+
+ ```python
+ import torch
+ from transformers import AutoProcessor, AutoModel
+ from PIL import Image
+
+ def clean_repeated_substrings(text):
+     """Clean repeated substrings in text: if it ends with ten or more
+     back-to-back copies of the same short substring, keep only one copy."""
+     n = len(text)
+     if n < 8000:
+         return text
+     for length in range(2, n // 10 + 1):
+         candidate = text[-length:]
+         count = 0
+         i = n - length
+         # Count how many consecutive copies of `candidate` end the string.
+         while i >= 0 and text[i:i + length] == candidate:
+             count += 1
+             i -= length
+         if count >= 10:
+             return text[:n - length * (count - 1)]
+     return text
+
+ model_name_or_path = "lvyufeng/HunyuanOCR"
+ processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False, trust_remote_code=True)
+ img_path = "image_ocr.jpg"
+ image_inputs = Image.open(img_path)
+ messages1 = [
+     {"role": "system", "content": ""},
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": img_path},
+             # "Detect and recognize text in the image, and output the text coordinates in a formatted manner."
+             {"type": "text", "text": "检测并识别图片中的文字,将文本坐标格式化输出。"},
+         ],
+     },
+ ]
+ messages = [messages1]
+ texts = [
+     processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
+     for msg in messages
+ ]
+
+ inputs = processor(
+     text=texts,
+     images=image_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ model = AutoModel.from_pretrained(
+     model_name_or_path,
+     attn_implementation="eager",
+     dtype=torch.float16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ with torch.no_grad():
+     device = next(model.parameters()).device
+     inputs = inputs.to(device)
+     generated_ids = model.generate(**inputs, max_new_tokens=16384, do_sample=False)
+     input_ids = inputs.input_ids if "input_ids" in inputs else inputs.inputs
+     # Drop the prompt tokens so only the newly generated text is decoded.
+     generated_ids_trimmed = [
+         out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
+     ]
+     # Apply the repetition cleanup per decoded string, not to the list itself.
+     output_texts = [
+         clean_repeated_substrings(text)
+         for text in processor.batch_decode(
+             generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+         )
+     ]
+     print(output_texts)
+ ```
+
+ ## 💬 Application-oriented Prompts
+
+ | Task | English | Chinese |
+ |------|---------|---------|
+ | **Spotting** | Detect and recognize text in the image, and output the text coordinates in a formatted manner. | 检测并识别图片中的文字,将文本坐标格式化输出。 |
+ | **Parsing** | • Identify the formula in the image and represent it using LaTeX format.<br><br>• Parse the table in the image into HTML.<br><br>• Parse the chart in the image; use Mermaid format for flowcharts and Markdown for other charts.<br><br>• Extract all information from the main body of the document image and represent it in markdown format, ignoring headers and footers. Tables should be expressed in HTML format, formulas in the document should be represented using LaTeX format, and the parsing should be organized according to the reading order. | • 识别图片中的公式,用 LaTeX 格式表示。<br><br>• 把图中的表格解析为 HTML。<br><br>• 解析图中的图表,对于流程图使用 Mermaid 格式表示,其他图表使用 Markdown 格式表示。<br><br>• 提取文档图片中正文的所有信息用 markdown 格式表示,其中页眉、页脚部分忽略,表格用 html 格式表达,文档中公式用 latex 格式表示,按照阅读顺序组织进行解析。 |
+ | **Information Extraction** | • Output the value of Key.<br><br>• Extract the content of the fields: ['key1','key2', ...] from the image and return it in JSON format.<br><br>• Extract the subtitles from the image. | • 输出 Key 的值。<br><br>• 提取图片中的: ['key1','key2', ...] 的字段内容,并按照 JSON 格式返回。<br><br>• 提取图片中的字幕。 |
+ | **Translation** | First extract the text, then translate the text content into English. If it is a document, ignore the header and footer. Formulas should be represented in LaTeX format, and tables should be represented in HTML format. | 先提取文字,再将文字内容翻译为英文。若是文档,则其中页眉、页脚忽略。公式用latex格式表示,表格用html格式表示。 |
+
+
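
Each prompt above drops straight into the `messages` structure from the Quick Start; a minimal sketch using the table's parsing prompt (reusing `img_path` from the examples above, only the text entry changes):

```python
# Only the prompt string varies per task; the rest of the pipeline is identical.
prompt = "Parse the table in the image into HTML."  # any row from the table above
messages = [[
    {"role": "system", "content": ""},
    {"role": "user", "content": [
        {"type": "image", "image": img_path},
        {"type": "text", "text": prompt},
    ]},
]]
```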
+ ## 📚 Citation
+ ```
+ @misc{hunyuanvisionteam2025hunyuanocrtechnicalreport,
+       title={HunyuanOCR Technical Report},
+       author={Hunyuan Vision Team and Pengyuan Lyu and Xingyu Wan and Gengluo Li and Shangpin Peng and Weinong Wang and Liang Wu and Huawen Shen and Yu Zhou and Canhui Tang and Qi Yang and Qiming Peng and Bin Luo and Hower Yang and Xinsong Zhang and Jinnian Zhang and Houwen Peng and Hongming Yang and Senhao Xie and Longsha Zhou and Ge Pei and Binghong Wu and Kan Wu and Jieneng Yang and Bochao Wang and Kai Liu and Jianchen Zhu and Jie Jiang and Linus and Han Hu and Chengquan Zhang},
+       year={2025},
+       journal={arXiv preprint arXiv:2511.19575},
+       url={https://arxiv.org/abs/2511.19575},
+ }
+ ```
+
+ ## 🙏 Acknowledgements
+ We would like to thank [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), [MinerU](https://github.com/opendatalab/MinerU), [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR), [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR), [dots.ocr](https://github.com/rednote-hilab/dots.ocr) for their valuable models and ideas.
+ We also appreciate the benchmarks: [OmniDocBench](https://github.com/opendatalab/OmniDocBench), [OCRBench](https://github.com/Yuliang-Liu/MultimodalOCR/tree/main/OCRBench), [DoTA](https://github.com/liangyupu/DIMTDA).
+
config.json ADDED
@@ -0,0 +1,85 @@
+ {
+   "architectures": [
+     "HunYuanVLForConditionalGeneration"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_hunyuan_vl.HunYuanVLConfig",
+     "AutoModel": "modeling_hunyuan_vl.HunYuanVLForConditionalGeneration",
+     "AutoModelForSeq2SeqLM": "modeling_hunyuan_vl.HunYuanVLForConditionalGeneration"
+   },
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "attention_head_dim": 128,
+   "bos_token_id": 120000,
+   "eod_token_id": 120020,
+   "eos_token_id": 120020,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "image_start_token_id": 120118,
+   "image_end_token_id": 120119,
+   "image_token_id": 120120,
+   "image_newline_token_id": 120121,
+   "initializer_range": 0.02,
+   "intermediate_size": 3584,
+   "max_position_embeddings": 32768,
+   "mlp_bias": false,
+   "model_type": "hunyuan_vl",
+   "norm_type": "rms",
+   "num_attention_heads": 16,
+   "num_experts": 1,
+   "num_hidden_layers": 24,
+   "num_key_value_heads": 8,
+   "org_vocab_size": 120818,
+   "pad_id": 120002,
+   "pad_token_id": -1,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": {
+     "alpha": 1000.0,
+     "beta_fast": 32,
+     "beta_slow": 1,
+     "factor": 1.0,
+     "mscale": 1.0,
+     "mscale_all_dim": 1.0,
+     "type": "xdrope",
+     "xdrope_section": [
+       16,
+       16,
+       16,
+       16
+     ]
+   },
+   "rope_theta": 10000.0,
+   "routed_scaling_factor": 1.0,
+   "sep_token_id": 0,
+   "text_end_id": 8,
+   "text_start_id": 7,
+   "tie_word_embeddings": true,
+   "dtype": "bfloat16",
+   "transformers_version": "4.49.0",
+   "use_cache": true,
+   "use_qk_norm": true,
+   "use_cla": false,
+   "vision_config": {
+     "add_patchemb_bias": true,
+     "attention_dropout": 0.0,
+     "cat_extra_token": 1,
+     "hidden_act": "gelu",
+     "hidden_dropout": 0.0,
+     "hidden_size": 1152,
+     "img_max_token_num": 4096,
+     "intermediate_size": 4304,
+     "interpolate_mode": "bilinear",
+     "max_image_size": 2048,
+     "max_vit_seq_len": 16384,
+     "num_attention_heads": 16,
+     "num_channels": 3,
+     "num_hidden_layers": 27,
+     "out_hidden_size": 1024,
+     "patch_size": 16,
+     "rms_norm_eps": 1e-05,
+     "spatial_merge_size": 2
+   },
+   "vocab_size": 120818
+ }
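
A small consistency check on the vision settings above — one way to read the numbers, not documented behavior: `img_max_token_num` matches what the patching parameters imply for a maximal image.

```python
patch_size = 16        # vision_config.patch_size
merge_size = 2         # vision_config.spatial_merge_size
max_image_size = 2048  # vision_config.max_image_size

# Each merged token covers a (patch_size * merge_size)^2 pixel block, so a
# maximal 2048x2048 image yields (2048 / 32)^2 = 64^2 tokens.
tokens_per_side = max_image_size // (patch_size * merge_size)
assert tokens_per_side ** 2 == 4096  # matches vision_config.img_max_token_num
```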
configuration_hunyuan_vl.py ADDED
@@ -0,0 +1,323 @@
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+ # This file was automatically generated from src/transformers/models/hunyuan_vl/modular_hunyuan_vl.py.
+ # Do NOT edit this file manually as any edits will be overwritten by the generation of
+ # the file from the modular. If any change should be done, please apply the change to the
+ # modular_hunyuan_vl.py file directly. One of our CI enforces this.
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+ # coding=utf-8
+ # Copyright (C) 2025 THL A29 Limited, a Tencent company and the HuggingFace Inc. team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ from transformers.configuration_utils import PretrainedConfig
+
+
+ class HunYuanVLVisionConfig(PretrainedConfig):
+     model_type = "hunyuan_vl"
+     base_config_key = "vision_config"
+
+     def __init__(
+         self,
+         hidden_act="gelu",
+         hidden_size=1152,
+         intermediate_size=4304,
+         interpolate_mode="bilinear",
+         rms_norm_eps=1e-05,
+         learnable_mlp_pooling_size=0,
+         num_attention_heads=16,
+         num_key_value_heads=None,
+         num_channels=3,
+         num_hidden_layers=27,
+         out_hidden_size=4096,
+         patch_size=16,
+         remove_prenorm=True,
+         spatial_merge_size=2,
+         temporal_patch_size=1,
+         resize_resolution=2048,
+         img_max_token_num=4096,
+         max_image_size=2048,
+         video_max_image_size=768,
+         video_min_image_size=256,
+         min_image_size=512,
+         anyres_vit_max_image_size=2048,
+         max_vit_seq_len=16384,
+         text_hidden_size=3072,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+
+         self.hidden_act = hidden_act
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.interpolate_mode = interpolate_mode
+         self.learnable_mlp_pooling_size = learnable_mlp_pooling_size
+         self.num_attention_heads = num_attention_heads
+         # Fall back to MHA when the number of KV heads is not given.
+         if not num_key_value_heads:
+             self.num_key_value_heads = num_attention_heads
+         else:
+             self.num_key_value_heads = num_key_value_heads
+         self.num_channels = num_channels
+         self.num_hidden_layers = num_hidden_layers
+         self.out_hidden_size = out_hidden_size
+         self.patch_size = patch_size
+         self.remove_prenorm = remove_prenorm
+         self.spatial_merge_size = spatial_merge_size
+         self.temporal_patch_size = temporal_patch_size
+         self.rms_norm_eps = rms_norm_eps
+
+         self.resize_resolution = resize_resolution
+         self.img_max_token_num = img_max_token_num
+         self.max_image_size = max_image_size
+         self.min_image_size = min_image_size
+         self.video_max_image_size = video_max_image_size
+         self.video_min_image_size = video_min_image_size
+         self.anyres_vit_max_image_size = anyres_vit_max_image_size
+         self.max_vit_seq_len = max_vit_seq_len
+         self.text_hidden_size = text_hidden_size
+
+
+ class HunYuanVLTextConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`HunYuanVLTextConfig`]. It is used to instantiate a
+     HunYuan model according to the specified arguments, defining the model architecture. Instantiating a configuration
+     with the defaults will yield a similar configuration to that of HunYuan-7B, e.g.
+     [tencent/Hunyuan-7B-Instruct](https://huggingface.co/tencent/Hunyuan-7B-Instruct).
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 290943):
+             Vocabulary size of the HunYuan model. Defines the number of different tokens that can be represented by the
+             `inputs_ids` passed when calling [`HunYuanVLTextConfig`]
+         hidden_size (`int`, *optional*, defaults to 4096):
+             Dimension of the hidden representations.
+         intermediate_size (`int`, *optional*, defaults to 11008):
+             Dimension of the MLP representations or shared MLP representations.
+         num_hidden_layers (`int`, *optional*, defaults to 32):
+             Number of hidden layers in the Transformer decoder.
+         num_attention_heads (`int`, *optional*, defaults to 32):
+             Number of attention heads for each attention layer in the Transformer decoder.
+         num_key_value_heads (`int`, *optional*):
+             This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+             `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+             `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
+             converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+             by meanpooling all the original heads within that group. For more details, check out [this
+             paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to
+             `num_attention_heads`.
+         hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+             The non-linear activation function (function or string) in the decoder.
+         max_position_embeddings (`int`, *optional*, defaults to 2048):
+             The maximum sequence length that this model might ever be used with.
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         rms_norm_eps (`float`, *optional*, defaults to 1e-05):
+             The epsilon used by the rms normalization layers.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether or not the model should return the last key/values attentions (not used by all models). Only
+             relevant if `config.is_decoder=True`.
+         pad_token_id (`int`, *optional*, defaults to 0):
+             Padding token id.
+         bos_token_id (`int`, *optional*, defaults to 1):
+             Beginning of stream token id.
+         eos_token_id (`int`, *optional*, defaults to 2):
+             End of stream token id.
+         eod_token_id (`int`, *optional*, defaults to 3):
+             Token ID representing the end-of-document marker. Used to indicate the termination of a text sequence.
+             Example: In multi-document processing, this token helps the model distinguish between separate documents.
+         pretraining_tp (`int`, *optional*, defaults to 1):
+             Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
+             document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
+             necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
+             issue](https://github.com/pytorch/pytorch/issues/76232).
+         tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+             Whether to tie weight embeddings.
+         rope_theta (`float`, *optional*, defaults to 10000.0):
+             The base period of the RoPE embeddings.
+         rope_scaling (`Dict`, *optional*):
+             Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+             strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
+             `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+             `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+             these scaling strategies behave:
+             https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+             experimental feature, subject to breaking API changes in future versions.
+         attention_bias (`bool`, *optional*, defaults to `False`):
+             Whether to use a bias in the query, key, value and output projection layers during self-attention.
+         attention_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout ratio for the attention probabilities.
+         head_dim (`int`, *optional*, defaults to 128):
+             The attention head dimension.
+     """
+
+     model_type = "hunyuan_vl_text"
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         vocab_size=290943,
+         hidden_size=4096,
+         intermediate_size: int = 11008,
+         num_hidden_layers=32,
+         num_attention_heads=32,
+         num_key_value_heads=None,
+         hidden_act="silu",
+         max_position_embeddings=2048,
+         initializer_range=0.02,
+         rms_norm_eps=1e-5,
+         use_cache=True,
+         pad_token_id=0,
+         bos_token_id=1,
+         eos_token_id=2,
+         eod_token_id=3,
+         pretraining_tp=1,
+         tie_word_embeddings=False,
+         rope_theta=10000.0,
+         rope_scaling=None,
+         attention_bias=False,
+         attention_dropout=0.0,
+         head_dim=None,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.head_dim = head_dim
+         # for backward compatibility
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.pretraining_tp = pretraining_tp
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+         self.rope_scaling = rope_scaling
+         # self._rope_scaling_validation()  # TODO: Need validation?
+         self.attention_bias = attention_bias
+         self.attention_dropout = attention_dropout
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+
+     def _rope_scaling_validation(self):
+         """
+         Validate the `rope_scaling` configuration.
+         """
+         if self.rope_scaling is None:
+             return
+
+         if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
+             raise ValueError(
+                 "`rope_scaling` must be a dictionary with two fields, `type` and `factor` or `type` and `alpha`, "
+                 f"got {self.rope_scaling}"
+             )
+         rope_scaling_type = self.rope_scaling.get("type", None)
+         rope_scaling_factor = self.rope_scaling.get("factor", None)
+         rope_scaling_alpha = self.rope_scaling.get("alpha", None)
+         if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
+             raise ValueError(
+                 f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
+             )
+         if rope_scaling_factor is None and rope_scaling_alpha is None:
+             raise ValueError("`rope_scaling` must have either a `factor` or an `alpha` field; got neither")
+         if rope_scaling_factor is not None:
+             if not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
+                 raise ValueError(f"`rope_scaling`'s factor field must be a float > 1.0, got {rope_scaling_factor}")
+         if rope_scaling_alpha is not None:
+             if not isinstance(rope_scaling_alpha, float) or rope_scaling_alpha <= 1.0:
+                 raise ValueError(f"`rope_scaling`'s alpha field must be a float > 1.0, got {rope_scaling_alpha}")
+
+
+ class HunYuanVLConfig(PretrainedConfig):
+     model_type = "hunyuan_vl"
+     sub_configs = {"vision_config": HunYuanVLVisionConfig, "text_config": HunYuanVLTextConfig}
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         text_config=None,
+         vision_config=None,
+         im_start_id=120118,
+         im_end_id=120119,
+         image_token_id=120120,
+         im_newline_id=120121,
+         video_start_id=120122,
+         video_end_id=120123,
+         **kwargs,
+     ):
+         # We need to init super() here so that it does not reset values
+         # that are in text config to the BaseClass defaults. The Base
+         # config has many text related defaults and not all defaults are same as for `HunYuanVLTextConfig`
+         super().__init__(**kwargs)
+
+         if isinstance(vision_config, dict):
+             self.vision_config = self.sub_configs["vision_config"](**vision_config)
+         elif vision_config is None:
+             self.vision_config = self.sub_configs["vision_config"]()
+
+         if isinstance(text_config, dict):
+             self.text_config = self.sub_configs["text_config"](**text_config)
+         elif text_config is None:
+             # For BC use all kwargs to init `TextConfig`
+             self.text_config = self.sub_configs["text_config"](**kwargs)
+
+         self.image_token_id = image_token_id
+         self.im_start_id = im_start_id
+         self.im_end_id = im_end_id
+         self.im_newline_id = im_newline_id
+         self.video_start_id = video_start_id
+         self.video_end_id = video_end_id
+
+         self.vision_config.text_hidden_size = self.text_config.hidden_size
+
+         # Attention implementation to use. It sets it recursively on sub-configs so we call it again in the end
+         self._attn_implementation = kwargs.pop("attn_implementation", None)
+
+     def __setattr__(self, key, value):
+         if (
+             (text_config := super().__getattribute__("__dict__").get("text_config")) is not None
+             and key not in ["dtype", "_attn_implementation_internal"]
+             and key in text_config.__dict__
+         ):
+             setattr(text_config, key, value)
+         else:
+             super().__setattr__(key, value)
+
+     def __getattribute__(self, key):
+         if "text_config" in super().__getattribute__("__dict__") and key not in [
+             "_name_or_path",
+             "model_type",
+             "dtype",
+             "_attn_implementation_internal",
+         ]:
+             text_config = super().__getattribute__("text_config")
+             if key in text_config.__dict__:
+                 return getattr(text_config, key)
+
+         return super().__getattribute__(key)
+
+
+ __all__ = ["HunYuanVLConfig", "HunYuanVLVisionConfig", "HunYuanVLTextConfig"]
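
One behavior worth calling out from the class above: `HunYuanVLConfig.__getattribute__` falls through to `text_config`, so text hyperparameters read directly off the top-level config. A minimal sketch, assuming the file is importable from the working directory:

```python
from configuration_hunyuan_vl import HunYuanVLConfig

cfg = HunYuanVLConfig(text_config={"hidden_size": 1024, "num_hidden_layers": 24})
print(cfg.hidden_size)                     # 1024, delegated to cfg.text_config
print(cfg.vision_config.text_hidden_size)  # 1024, wired up in __init__
```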
generation_config.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "bos_token_id": 120000,
+   "pad_token_id": 120002,
+   "do_sample": true,
+   "eos_token_id": [
+     120007,
+     120020
+   ],
+   "repetition_penalty": 1.03,
+   "top_k": 1,
+   "top_p": 1.0,
+   "temperature": 0.0
+ }
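
Note the combination above: `do_sample` is `true`, but `top_k: 1` with `temperature: 0.0` is effectively greedy decoding, and the README examples simply pass `do_sample=False`. A sketch of overriding these defaults per call (call-time `generate()` kwargs take precedence over `generation_config.json`; `model` and `inputs` as in the Quick Start):

```python
generated_ids = model.generate(
    **inputs,
    do_sample=False,          # greedy, as in the README examples
    max_new_tokens=16384,
    repetition_penalty=1.03,  # keep the shipped repetition penalty
)
```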
image_processing_hunyuan_vl.py ADDED
@@ -0,0 +1,475 @@
+ """Image processor class for HunYuanVLV1."""
+
+ import math
+ from typing import Optional, Union
+
+ import numpy as np
+ import torchvision.transforms as transforms
+
+ from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
+ from transformers.image_transforms import (
+     convert_to_rgb,
+     resize,
+     to_channel_dimension_format,
+ )
+ from transformers.image_utils import (
+     OPENAI_CLIP_MEAN,
+     OPENAI_CLIP_STD,
+     ChannelDimension,
+     ImageInput,
+     PILImageResampling,
+     get_image_size,
+     infer_channel_dimension_format,
+     is_scaled_image,
+     make_flat_list_of_images,
+     make_list_of_images,
+     to_numpy_array,
+     valid_images,
+     validate_preprocess_arguments,
+ )
+ from transformers.utils import TensorType, logging
+ from transformers.video_utils import VideoInput, make_batched_videos
+
+
+ logger = logging.get_logger(__name__)
+
+
+ def smart_resize(
+     height: int, width: int, factor: int = 16, min_pixels: int = 512 * 512, max_pixels: int = 2048 * 2048
+ ):
+     """Rescales the image so that the following conditions are met:
+
+     1. Both dimensions (height and width) are divisible by 'factor'.
+
+     2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
+
+     3. The aspect ratio of the image is maintained as closely as possible.
+
+     """
+     if max(height, width) / min(height, width) > 200:
+         raise ValueError(
+             f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
+         )
+     # Round each side to the nearest multiple of `factor`.
+     h_bar = round(height / factor) * factor
+     w_bar = round(width / factor) * factor
+     if h_bar * w_bar > max_pixels:
+         # Too many pixels: shrink both sides by the same ratio, rounding down.
+         beta = math.sqrt((height * width) / max_pixels)
+         h_bar = max(factor, math.floor(height / beta / factor) * factor)
+         w_bar = max(factor, math.floor(width / beta / factor) * factor)
+     elif h_bar * w_bar < min_pixels:
+         # Too few pixels: grow both sides by the same ratio, rounding up.
+         beta = math.sqrt(min_pixels / (height * width))
+         h_bar = math.ceil(height * beta / factor) * factor
+         w_bar = math.ceil(width * beta / factor) * factor
+     return h_bar, w_bar
+
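
A quick worked example of `smart_resize` under this model's settings (the processor calls it with `factor = patch_size * merge_size = 32`; the numbers below were checked by hand):

```python
# 1000 and 700 round to the nearest multiples of 32 (992 and 704); the pixel
# count 992 * 704 = 698368 already sits inside [512*512, 2048*2048], so
# neither rescaling branch fires.
h, w = smart_resize(height=1000, width=700, factor=32)
assert (h, w) == (992, 704)
```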
65
+
66
+ class HunYuanVLImageProcessor(BaseImageProcessor):
67
+ r"""
68
+ Constructs a HunYuanVLV1 image processor that dynamically resizes images based on the original images.
69
+
70
+ Args:
71
+ do_resize (`bool`, *optional*, defaults to `True`):
72
+ Whether to resize the image's (height, width) dimensions.
73
+ size (`dict[str, int]`, *optional*, defaults to `{"shortest_edge": 56 * 56, "longest_edge": 28 * 28 * 1280}`):
74
+ Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present.
75
+ resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
76
+ Resampling filter to use when resizing the image.
77
+ do_rescale (`bool`, *optional*, defaults to `True`):
78
+ Whether to rescale the image by the specified scale `rescale_factor`.
79
+ rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
80
+ Scale factor to use if rescaling the image.
81
+ do_normalize (`bool`, *optional*, defaults to `True`):
82
+ Whether to normalize the image.
83
+ image_mean (`float` or `list[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
84
+ Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
85
+ image_std (`float` or `list[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
86
+ Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
87
+ do_convert_rgb (`bool`, *optional*, defaults to `True`):
88
+ Whether to convert the image to RGB.
89
+ min_pixels (`int`, *optional*, defaults to `512 * 512`):
90
+ The min pixels of the image to resize the image.
91
+ max_pixels (`int`, *optional*, defaults to `2048 * 2048`):
92
+ The max pixels of the image to resize the image.
93
+ patch_size (`int`, *optional*, defaults to 14):
94
+ The spatial patch size of the vision encoder.
95
+ temporal_patch_size (`int`, *optional*, defaults to 2):
96
+ The temporal patch size of the vision encoder.
97
+ merge_size (`int`, *optional*, defaults to 2):
98
+ The merge size of the vision encoder to llm encoder.
99
+ """
100
+
101
+ model_input_names = ["pixel_values", "image_grid_thw", "pixel_values_videos", "video_grid_thw"]
102
+
103
+ def __init__(
104
+ self,
105
+ do_resize: bool = True,
106
+ size: Optional[dict[str, int]] = None,
107
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
108
+ do_rescale: bool = True,
109
+ rescale_factor: Union[int, float] = 1 / 255,
110
+ do_normalize: bool = True,
111
+ image_mean: Optional[Union[float, list[float]]] = None,
112
+ image_std: Optional[Union[float, list[float]]] = None,
113
+ do_convert_rgb: bool = True,
114
+ min_pixels: Optional[int] = None,
115
+ max_pixels: Optional[int] = None,
116
+ patch_size: int = 16,
117
+ temporal_patch_size: int = 2,
118
+ merge_size: int = 2,
119
+ **kwargs,
120
+ ) -> None:
121
+ super().__init__(**kwargs)
122
+ if size is not None and ("shortest_edge" not in size or "longest_edge" not in size):
123
+ raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
124
+ else:
125
+ size = {"shortest_edge": 512*512, "longest_edge": 2048*2048}
126
+ # backward compatibility: override size with min_pixels and max_pixels if they are provided
127
+ if min_pixels is not None:
128
+ size["shortest_edge"] = min_pixels
129
+ if max_pixels is not None:
130
+ size["longest_edge"] = max_pixels
131
+ self.min_pixels = size["shortest_edge"]
132
+ self.max_pixels = size["longest_edge"]
133
+ self.size = size
134
+
135
+ self.do_resize = do_resize
136
+ self.resample = resample
137
+ self.do_rescale = do_rescale
138
+ self.rescale_factor = rescale_factor
139
+ self.do_normalize = do_normalize
140
+ self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
141
+ self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
142
+
143
+ self.patch_size = patch_size
144
+ self.temporal_patch_size = temporal_patch_size
145
+ self.merge_size = merge_size
146
+ self.do_convert_rgb = do_convert_rgb
147
+
148
+ # hard-code
149
+
150
+ def _preprocess(
151
+ self,
152
+ images: Union[ImageInput, VideoInput],
153
+ do_resize: Optional[bool] = None,
154
+ size: Optional[dict[str, int]] = None,
155
+ resample: PILImageResampling = None,
156
+ do_rescale: Optional[bool] = None,
157
+ rescale_factor: Optional[float] = None,
158
+ do_normalize: Optional[bool] = None,
159
+ image_mean: Optional[Union[float, list[float]]] = None,
160
+ image_std: Optional[Union[float, list[float]]] = None,
161
+ patch_size: Optional[int] = None,
162
+ temporal_patch_size: Optional[int] = None,
163
+ merge_size: Optional[int] = None,
164
+ do_convert_rgb: Optional[bool] = None,
165
+ data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
166
+         input_data_format: Optional[Union[str, ChannelDimension]] = None,
+     ):
+         """
+         Preprocess an image or a batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.
+
+         Args:
+             images (`ImageInput`):
+                 Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values
+                 range from 0 to 1, set `do_rescale=False`.
+             vision_info (`list[Dict]`, *optional*):
+                 Optional list of dictionaries containing additional information about vision inputs.
+             do_resize (`bool`, *optional*, defaults to `self.do_resize`):
+                 Whether to resize the image.
+             size (`dict[str, int]`, *optional*, defaults to `self.size`):
+                 Size of the image after resizing. The `shortest_edge` and `longest_edge` keys must be present.
+             resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
+                 Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
+             do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
+                 Whether to rescale the image.
+             rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
+                 Scale factor to use if rescaling the image.
+             do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
+                 Whether to normalize the image.
+             image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
+                 Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number
+                 of channels in the image.
+             image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
+                 Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding
+                 to the number of channels in the image.
+             patch_size (`int`, *optional*, defaults to `self.patch_size`):
+                 The spatial patch size of the vision encoder.
+             temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
+                 The temporal patch size of the vision encoder.
+             merge_size (`int`, *optional*, defaults to `self.merge_size`):
+                 The merge size between the vision encoder and the LLM.
+             do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
+                 Whether to convert the image to RGB.
+             data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
+                 The channel dimension format for the output image. Can be one of:
+                 - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                 - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                 - Unset: Use the channel dimension format of the input image.
+             input_data_format (`ChannelDimension` or `str`, *optional*):
+                 The channel dimension format for the input image. Can be one of:
+                 - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                 - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                 - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+         """
+         images = make_list_of_images(images)
+
+         if do_convert_rgb:
+             images = [convert_to_rgb(image) for image in images]
+
+         width, height = images[0].width, images[0].height
+         resized_width, resized_height = width, height
+         processed_images = []
+         for image in images:
+             if do_resize:
+                 resized_width, resized_height = smart_resize(
+                     width,
+                     height,
+                     factor=patch_size * merge_size,
+                     min_pixels=size["shortest_edge"],
+                     max_pixels=size["longest_edge"],
+                 )
+                 image = image.resize((resized_width, resized_height))
+
+             if do_normalize:
+                 image = transforms.Compose([
+                     transforms.ToTensor(),
+                     transforms.Normalize(image_mean, image_std),  # use the resolved arguments, not self.*
+                 ])(image)
+             processed_images.append(image)
+
+         patches = np.array(processed_images)
+         channel = patches.shape[1]
+         grid_t = patches.shape[0] // temporal_patch_size
+         grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
+         patches = patches.reshape(
+             1,
+             channel,
+             grid_h // merge_size,
+             merge_size,
+             patch_size,
+             grid_w // merge_size,
+             merge_size,
+             patch_size,
+         )
+         patches = patches.transpose(0, 2, 3, 5, 6, 1, 4, 7)
+         flatten_patches = patches.reshape(1 * grid_h * grid_w, channel * patch_size * patch_size)
+
+         return flatten_patches, (grid_t, grid_h, grid_w)
+
+     def preprocess(
+         self,
+         images: ImageInput,
+         videos: Optional[VideoInput] = None,
+         do_resize: Optional[bool] = None,
+         size: Optional[dict[str, int]] = None,
+         min_pixels: Optional[int] = None,
+         max_pixels: Optional[int] = None,
+         resample: Optional[PILImageResampling] = None,
+         do_rescale: Optional[bool] = None,
+         rescale_factor: Optional[float] = None,
+         do_normalize: Optional[bool] = None,
+         image_mean: Optional[Union[float, list[float]]] = None,
+         image_std: Optional[Union[float, list[float]]] = None,
+         patch_size: Optional[int] = None,
+         temporal_patch_size: Optional[int] = None,
+         merge_size: Optional[int] = None,
+         do_convert_rgb: Optional[bool] = None,
+         return_tensors: Optional[Union[str, TensorType]] = None,
+         data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
+         input_data_format: Optional[Union[str, ChannelDimension]] = None,
+     ):
+         """
+         Args:
+             images (`ImageInput`):
+                 Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+                 passing in images with pixel values between 0 and 1, set `do_rescale=False`.
+             videos (`VideoInput`):
+                 Video to preprocess. Expects a single or batch of videos with pixel values ranging from 0 to 255. If
+                 passing in videos with pixel values between 0 and 1, set `do_rescale=False`.
+             do_resize (`bool`, *optional*, defaults to `self.do_resize`):
+                 Whether to resize the image.
+             size (`dict[str, int]`, *optional*, defaults to `self.size`):
+                 Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
+                 the longest edge resized to keep the input aspect ratio.
+             resample (`int`, *optional*, defaults to `self.resample`):
+                 Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
+                 has an effect if `do_resize` is set to `True`.
+             do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
+                 Whether to rescale the image.
+             rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
+                 Rescale factor to rescale the image by if `do_rescale` is set to `True`.
+             do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
+                 Whether to normalize the image.
+             image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
+                 Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
+             image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
+                 Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
+                 `True`.
+             min_pixels (`int`, *optional*, defaults to `self.min_pixels`):
+                 The minimum number of pixels for the resized image.
+             max_pixels (`int`, *optional*, defaults to `self.max_pixels`):
+                 The maximum number of pixels for the resized image.
+             patch_size (`int`, *optional*, defaults to `self.patch_size`):
+                 The spatial patch size of the vision encoder.
+             temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
+                 The temporal patch size of the vision encoder.
+             merge_size (`int`, *optional*, defaults to `self.merge_size`):
+                 The merge size between the vision encoder and the LLM.
+             do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
+                 Whether to convert the image to RGB.
+             return_tensors (`str` or `TensorType`, *optional*):
+                 The type of tensors to return. Can be one of:
+                 - Unset: Return a list of `np.ndarray`.
+                 - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                 - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                 - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                 - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+             data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+                 The channel dimension format for the output image. Can be one of:
+                 - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                 - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                 - Unset: Use the channel dimension format of the input image.
+             input_data_format (`ChannelDimension` or `str`, *optional*):
+                 The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                 from the input image. Can be one of:
+                 - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                 - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                 - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+         """
+         min_pixels = min_pixels if min_pixels is not None else self.min_pixels
+         max_pixels = max_pixels if max_pixels is not None else self.max_pixels
+
+         if size is not None:
+             if "shortest_edge" not in size or "longest_edge" not in size:
+                 raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
+             min_pixels = size["shortest_edge"]
+         elif min_pixels is not None and max_pixels is not None:
+             # backward compatibility: override size with min_pixels and max_pixels if they are provided
+             size = {"shortest_edge": min_pixels, "longest_edge": max_pixels}
+         else:
+             size = {**self.size}
+
+         do_resize = do_resize if do_resize is not None else self.do_resize
+         resample = resample if resample is not None else self.resample
+         do_rescale = do_rescale if do_rescale is not None else self.do_rescale
+         rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
+         do_normalize = do_normalize if do_normalize is not None else self.do_normalize
+         image_mean = image_mean if image_mean is not None else self.image_mean
+         image_std = image_std if image_std is not None else self.image_std
+         patch_size = patch_size if patch_size is not None else self.patch_size
+         temporal_patch_size = temporal_patch_size if temporal_patch_size is not None else self.temporal_patch_size
+         merge_size = merge_size if merge_size is not None else self.merge_size
+         do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
+
+         if images is not None:
+             images = make_flat_list_of_images(images)
+
+         if images is not None and not valid_images(images):
+             raise ValueError(
+                 "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+                 "torch.Tensor, tf.Tensor or jax.ndarray."
+             )
+
+         validate_preprocess_arguments(
+             do_rescale=do_rescale,  # pass alongside rescale_factor so the rescale check actually runs
+             rescale_factor=rescale_factor,
+             do_normalize=do_normalize,
+             image_mean=image_mean,
+             image_std=image_std,
+             do_resize=do_resize,
+             size=size,
+             resample=resample,
+         )
+
+         data = {}
+         if images is not None:
+             pixel_values, vision_grid_thws = [], []
+             for image in images:
+                 patches, image_grid_thw = self._preprocess(
+                     image,
+                     do_resize=do_resize,
+                     size=size,
+                     resample=resample,
+                     do_rescale=do_rescale,
+                     rescale_factor=rescale_factor,
+                     do_normalize=do_normalize,
+                     image_mean=image_mean,
+                     image_std=image_std,
+                     patch_size=patch_size,
+                     temporal_patch_size=temporal_patch_size,
+                     merge_size=merge_size,
+                     data_format=data_format,
+                     do_convert_rgb=do_convert_rgb,
+                     input_data_format=input_data_format,
+                 )
+                 pixel_values.extend(patches)
+                 vision_grid_thws.append(image_grid_thw)
+             pixel_values = np.array(pixel_values)
+             vision_grid_thws = np.array(vision_grid_thws)
+             data.update({"pixel_values": pixel_values, "image_grid_thw": vision_grid_thws})
+
+         # kept for BC only and should be removed after v5.0
+         if videos is not None:
+             logger.warning(
+                 "`HunYuanVLV1ImageProcessor` works only with image inputs and doesn't process videos anymore. "
+                 "This is a deprecated behavior and will be removed in v5.0. "
+                 "Your videos should be forwarded to `HunYuanVLV1VideoProcessor`."
+             )
+             videos = make_batched_videos(videos)
+             pixel_values_videos, vision_grid_thws_videos = [], []
+             for images in videos:
+                 patches, video_grid_thw = self._preprocess(
+                     images,
+                     do_resize=do_resize,
+                     size=size,
+                     resample=resample,
+                     do_rescale=do_rescale,
+                     rescale_factor=rescale_factor,
+                     do_normalize=do_normalize,
+                     image_mean=image_mean,
+                     image_std=image_std,
+                     patch_size=patch_size,
+                     temporal_patch_size=temporal_patch_size,
+                     merge_size=merge_size,
+                     data_format=data_format,
+                     do_convert_rgb=do_convert_rgb,
+                     input_data_format=input_data_format,
+                 )
+                 pixel_values_videos.extend(patches)
+                 vision_grid_thws_videos.append(video_grid_thw)
+             data.update(
+                 {
+                     "pixel_values_videos": np.array(pixel_values_videos),
+                     "video_grid_thw": np.array(vision_grid_thws_videos),
+                 }
+             )
+
+         return BatchFeature(data=data, tensor_type=return_tensors)
+
+     def get_number_of_image_patches(self, height: int, width: int, images_kwargs=None):
+         """
+         A utility that returns the number of image patches for a given image size.
+
+         Args:
+             height (`int`):
+                 Height of the input image.
+             width (`int`):
+                 Width of the input image.
+             images_kwargs (`dict`, *optional*):
+                 Any kwargs to override defaults of the image processor.
+         Returns:
+             `int`: Number of image patches per image.
+         """
+         images_kwargs = images_kwargs if images_kwargs is not None else {}  # guard: the default is None
+         min_pixels = images_kwargs.get("min_pixels", self.size["shortest_edge"])
+         max_pixels = images_kwargs.get("max_pixels", self.size["longest_edge"])
+         patch_size = images_kwargs.get("patch_size", self.patch_size)
+         merge_size = images_kwargs.get("merge_size", self.merge_size)
+
+         factor = patch_size * merge_size
+         resized_height, resized_width = smart_resize(
+             height, width, factor, min_pixels=min_pixels, max_pixels=max_pixels
+         )
+         grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
+         return grid_h * (grid_w + 1) + 2
+
+
+ __all__ = ["HunYuanVLImageProcessor"]
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e7a0f4cb7fdfe4dc2686f8554310a34b4859ae464ec948f89d954318e382382d
+ size 439600816
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fbfd70bed291d7920c65aacf4f07c8ea55e60dda253a529860880c5a7e4c00bd
+ size 453346288
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:53d0b9f9a85aa21b3454f16f19845294fff7bab8e13aeaf3f7992b85fd35c473
+ size 461590008
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fd82d09583ee16037f04532808d0f00332301fc6ed18aa0b75b902fa014402aa
+ size 637958736
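Each of the four `.safetensors` entries above is a Git LFS pointer stanza: `oid` is the SHA-256 of the shard that LFS fetches on checkout, and `size` is its byte count. The `model.safetensors.index.json` that follows maps every tensor name to the shard holding it. A minimal sketch of resolving a tensor through that index, assuming the shards and the index sit in the current directory (`get_tensor` is an illustrative helper, not part of the repo):

    import json

    from safetensors.numpy import load_file

    with open("model.safetensors.index.json") as f:
        index = json.load(f)

    weight_map = index["weight_map"]  # tensor name -> shard filename

    shards = {}  # cache: shard filename -> dict of numpy arrays

    def get_tensor(name):
        shard_file = weight_map[name]
        if shard_file not in shards:
            shards[shard_file] = load_file(shard_file)  # loads one shard from disk
        return shards[shard_file][name]

    # Per the weight map, the token embeddings live in the fourth shard.
    embed = get_tensor("model.embed_tokens.weight")

In practice `transformers` performs this resolution automatically when loading the repo; the sketch only shows what the index encodes. Note that `metadata.total_size` (1,992,416,224 bytes) is slightly less than the sum of the four `size` fields above (1,992,495,848 bytes): the file sizes also include each shard's safetensors header.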
model.safetensors.index.json ADDED
@@ -0,0 +1,720 @@
+ {
+   "metadata": {
+     "total_size": 1992416224
+   },
+   "weight_map": {
+     "model.embed_tokens.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.0.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.1.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.10.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.key_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.query_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.query_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.13.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.14.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.14.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.14.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.14.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.15.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.15.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.16.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.16.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.16.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.17.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.key_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.19.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.query_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.key_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.20.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.23.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.23.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.3.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.3.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.3.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.3.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.3.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.3.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.4.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.key_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.query_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.4.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.5.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.5.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.5.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.5.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.5.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.5.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.6.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.key_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.query_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.6.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.7.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.7.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.7.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.7.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.7.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.8.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.key_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.query_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.8.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "model.layers.9.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "model.layers.9.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.key_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.query_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.norm.weight": "model-00004-of-00004.safetensors",
+     "vit.perceive.after_rms.weight": "model-00004-of-00004.safetensors",
+     "vit.perceive.before_rms.weight": "model-00003-of-00004.safetensors",
+     "vit.perceive.image_begin": "model-00003-of-00004.safetensors",
+     "vit.perceive.image_end": "model-00003-of-00004.safetensors",
+     "vit.perceive.image_newline": "model-00003-of-00004.safetensors",
+     "vit.perceive.image_sep": "model-00003-of-00004.safetensors",
+     "vit.perceive.mlp.bias": "model-00004-of-00004.safetensors",
+     "vit.perceive.mlp.weight": "model-00003-of-00004.safetensors",
+     "vit.perceive.proj.0.bias": "model-00004-of-00004.safetensors",
+     "vit.perceive.proj.0.weight": "model-00003-of-00004.safetensors",
+     "vit.perceive.proj.2.bias": "model-00004-of-00004.safetensors",
+     "vit.perceive.proj.2.weight": "model-00003-of-00004.safetensors",
+     "vit.embeddings.patch_embedding.bias": "model-00004-of-00004.safetensors",
+     "vit.embeddings.patch_embedding.weight": "model-00003-of-00004.safetensors",
+     "vit.embeddings.position_embedding.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.0.input_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.input_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.10.input_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.10.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.10.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.10.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.10.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.10.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.10.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.10.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.10.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.11.input_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.11.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.11.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.input_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.12.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.12.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.13.input_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.13.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.13.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.13.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.13.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.13.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "vit.layers.13.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.14.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.14.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.15.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.16.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.16.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.17.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.mlp.dense_4h_to_h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.18.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.mlp.dense_h_to_4h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.18.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.18.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "vit.layers.19.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.mlp.dense_4h_to_h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.mlp.dense_4h_to_h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.mlp.dense_h_to_4h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.19.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.19.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.2.input_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.2.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.2.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.2.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+     "vit.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.20.input_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.20.mlp.dense_4h_to_h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.mlp.dense_h_to_4h.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.mlp.dense_h_to_4h.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.20.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "vit.layers.20.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.21.input_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.input_layernorm.weight": "model-00004-of-00004.safetensors",
+     "vit.layers.21.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.21.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.21.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.input_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.mlp.dense_4h_to_h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.mlp.dense_h_to_4h.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.22.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.22.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.input_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.23.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.23.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.23.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.23.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "vit.layers.24.input_layernorm.bias": "model-00004-of-00004.safetensors",
+     "vit.layers.24.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "vit.layers.24.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
562
+ "vit.layers.24.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
563
+ "vit.layers.24.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
564
+ "vit.layers.24.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
565
+ "vit.layers.24.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
566
+ "vit.layers.24.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
567
+ "vit.layers.24.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
568
+ "vit.layers.24.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
569
+ "vit.layers.24.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
570
+ "vit.layers.24.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
571
+ "vit.layers.24.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
572
+ "vit.layers.24.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
573
+ "vit.layers.24.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
574
+ "vit.layers.24.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
575
+ "vit.layers.25.input_layernorm.bias": "model-00004-of-00004.safetensors",
576
+ "vit.layers.25.input_layernorm.weight": "model-00002-of-00004.safetensors",
577
+ "vit.layers.25.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
578
+ "vit.layers.25.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
579
+ "vit.layers.25.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
580
+ "vit.layers.25.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
581
+ "vit.layers.25.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
582
+ "vit.layers.25.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
583
+ "vit.layers.25.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
584
+ "vit.layers.25.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
585
+ "vit.layers.25.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
586
+ "vit.layers.25.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
587
+ "vit.layers.25.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
588
+ "vit.layers.25.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
589
+ "vit.layers.25.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
590
+ "vit.layers.25.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
591
+ "vit.layers.26.input_layernorm.bias": "model-00004-of-00004.safetensors",
592
+ "vit.layers.26.input_layernorm.weight": "model-00002-of-00004.safetensors",
593
+ "vit.layers.26.mlp.dense_4h_to_h.bias": "model-00004-of-00004.safetensors",
594
+ "vit.layers.26.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
595
+ "vit.layers.26.mlp.dense_h_to_4h.bias": "model-00004-of-00004.safetensors",
596
+ "vit.layers.26.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
597
+ "vit.layers.26.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
598
+ "vit.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
599
+ "vit.layers.26.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
600
+ "vit.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
601
+ "vit.layers.26.self_attn.o_proj.bias": "model-00004-of-00004.safetensors",
602
+ "vit.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
603
+ "vit.layers.26.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
604
+ "vit.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
605
+ "vit.layers.26.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
606
+ "vit.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
607
+ "vit.layers.3.input_layernorm.bias": "model-00001-of-00004.safetensors",
608
+ "vit.layers.3.input_layernorm.weight": "model-00002-of-00004.safetensors",
609
+ "vit.layers.3.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
610
+ "vit.layers.3.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
611
+ "vit.layers.3.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
612
+ "vit.layers.3.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
613
+ "vit.layers.3.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
614
+ "vit.layers.3.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
615
+ "vit.layers.3.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
616
+ "vit.layers.3.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
617
+ "vit.layers.3.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
618
+ "vit.layers.3.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
619
+ "vit.layers.3.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
620
+ "vit.layers.3.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
621
+ "vit.layers.3.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
622
+ "vit.layers.3.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
623
+ "vit.layers.4.input_layernorm.bias": "model-00001-of-00004.safetensors",
624
+ "vit.layers.4.input_layernorm.weight": "model-00002-of-00004.safetensors",
625
+ "vit.layers.4.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
626
+ "vit.layers.4.mlp.dense_4h_to_h.weight": "model-00002-of-00004.safetensors",
627
+ "vit.layers.4.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
628
+ "vit.layers.4.mlp.dense_h_to_4h.weight": "model-00002-of-00004.safetensors",
629
+ "vit.layers.4.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
630
+ "vit.layers.4.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
631
+ "vit.layers.4.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
632
+ "vit.layers.4.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
633
+ "vit.layers.4.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
634
+ "vit.layers.4.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
635
+ "vit.layers.4.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
636
+ "vit.layers.4.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
637
+ "vit.layers.4.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
638
+ "vit.layers.4.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
639
+ "vit.layers.5.input_layernorm.bias": "model-00001-of-00004.safetensors",
640
+ "vit.layers.5.input_layernorm.weight": "model-00002-of-00004.safetensors",
641
+ "vit.layers.5.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
642
+ "vit.layers.5.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
643
+ "vit.layers.5.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
644
+ "vit.layers.5.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
645
+ "vit.layers.5.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
646
+ "vit.layers.5.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
647
+ "vit.layers.5.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
648
+ "vit.layers.5.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
649
+ "vit.layers.5.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
650
+ "vit.layers.5.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
651
+ "vit.layers.5.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
652
+ "vit.layers.5.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
653
+ "vit.layers.5.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
654
+ "vit.layers.5.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
655
+ "vit.layers.6.input_layernorm.bias": "model-00001-of-00004.safetensors",
656
+ "vit.layers.6.input_layernorm.weight": "model-00003-of-00004.safetensors",
657
+ "vit.layers.6.mlp.dense_4h_to_h.bias": "model-00001-of-00004.safetensors",
658
+ "vit.layers.6.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
659
+ "vit.layers.6.mlp.dense_h_to_4h.bias": "model-00001-of-00004.safetensors",
660
+ "vit.layers.6.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
661
+ "vit.layers.6.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
662
+ "vit.layers.6.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
663
+ "vit.layers.6.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
664
+ "vit.layers.6.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
665
+ "vit.layers.6.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
666
+ "vit.layers.6.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
667
+ "vit.layers.6.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
668
+ "vit.layers.6.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
669
+ "vit.layers.6.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
670
+ "vit.layers.6.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
671
+ "vit.layers.7.input_layernorm.bias": "model-00002-of-00004.safetensors",
672
+ "vit.layers.7.input_layernorm.weight": "model-00003-of-00004.safetensors",
673
+ "vit.layers.7.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
674
+ "vit.layers.7.mlp.dense_4h_to_h.weight": "model-00003-of-00004.safetensors",
675
+ "vit.layers.7.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
676
+ "vit.layers.7.mlp.dense_h_to_4h.weight": "model-00003-of-00004.safetensors",
677
+ "vit.layers.7.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
678
+ "vit.layers.7.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
679
+ "vit.layers.7.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
680
+ "vit.layers.7.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
681
+ "vit.layers.7.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
682
+ "vit.layers.7.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
683
+ "vit.layers.7.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
684
+ "vit.layers.7.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
685
+ "vit.layers.7.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
686
+ "vit.layers.7.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
687
+ "vit.layers.8.input_layernorm.bias": "model-00002-of-00004.safetensors",
688
+ "vit.layers.8.input_layernorm.weight": "model-00004-of-00004.safetensors",
689
+ "vit.layers.8.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
690
+ "vit.layers.8.mlp.dense_4h_to_h.weight": "model-00004-of-00004.safetensors",
691
+ "vit.layers.8.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
692
+ "vit.layers.8.mlp.dense_h_to_4h.weight": "model-00004-of-00004.safetensors",
693
+ "vit.layers.8.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
694
+ "vit.layers.8.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
695
+ "vit.layers.8.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
696
+ "vit.layers.8.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
697
+ "vit.layers.8.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
698
+ "vit.layers.8.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
699
+ "vit.layers.8.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
700
+ "vit.layers.8.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
701
+ "vit.layers.8.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
702
+ "vit.layers.8.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
703
+ "vit.layers.9.input_layernorm.bias": "model-00002-of-00004.safetensors",
704
+ "vit.layers.9.input_layernorm.weight": "model-00004-of-00004.safetensors",
705
+ "vit.layers.9.mlp.dense_4h_to_h.bias": "model-00002-of-00004.safetensors",
706
+ "vit.layers.9.mlp.dense_4h_to_h.weight": "model-00004-of-00004.safetensors",
707
+ "vit.layers.9.mlp.dense_h_to_4h.bias": "model-00002-of-00004.safetensors",
708
+ "vit.layers.9.mlp.dense_h_to_4h.weight": "model-00004-of-00004.safetensors",
709
+ "vit.layers.9.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
710
+ "vit.layers.9.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
711
+ "vit.layers.9.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
712
+ "vit.layers.9.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
713
+ "vit.layers.9.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
714
+ "vit.layers.9.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
715
+ "vit.layers.9.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
716
+ "vit.layers.9.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
717
+ "vit.layers.9.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
718
+ "vit.layers.9.self_attn.v_proj.weight": "model-00004-of-00004.safetensors"
719
+ }
720
+ }
modeling_hunyuan_vl.py ADDED
@@ -0,0 +1,1058 @@
1
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
2
+ # This file was automatically generated from src/transformers/models/hunyuan_vl/modular_hunyuan_vl.py.
3
+ # Do NOT edit this file manually as any edits will be overwritten by the generation of
4
+ # the file from the modular. If any change should be done, please apply the change to the
5
+ # modular_hunyuan_vl.py file directly. One of our CI enforces this.
6
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
7
+ # coding=utf-8
8
+ # Copyright (C) 2025 THL A29 Limited, a Tencent company and the HuggingFace Inc. team. All rights reserved.
9
+ #
10
+ # Licensed under the Apache License, Version 2.0 (the "License");
11
+ # you may not use this file except in compliance with the License.
12
+ # You may obtain a copy of the License at
13
+ #
14
+ # http://www.apache.org/licenses/LICENSE-2.0
15
+ #
16
+ # Unless required by applicable law or agreed to in writing, software
17
+ # distributed under the License is distributed on an "AS IS" BASIS,
18
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
19
+ # See the License for the specific language governing permissions and
20
+ # limitations under the License.
21
+
22
+ from typing import Callable, Optional, Union
23
+
24
+ import torch
25
+ from torch import nn
26
+
27
+ from transformers.activations import ACT2FN
28
+ from transformers.cache_utils import Cache, DynamicCache
29
+ from transformers.generation import GenerationMixin
30
+ from transformers.integrations import use_kernel_forward_from_hub
31
+ from transformers.masking_utils import create_causal_mask
32
+ from transformers.modeling_layers import GradientCheckpointingLayer
33
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
34
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
35
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
36
+ from transformers.processing_utils import Unpack
37
+ from transformers.utils import TransformersKwargs, auto_docstring, can_return_tuple
38
+ from transformers.utils.deprecation import deprecate_kwarg
39
+ from .configuration_hunyuan_vl import HunYuanVLConfig, HunYuanVLTextConfig, HunYuanVLVisionConfig
40
+
41
+
42
+ class HunYuanVisionMLP(nn.Module):
43
+ def __init__(self, config: HunYuanVLConfig):
44
+ super().__init__()
45
+ self.config = config
46
+ self.hidden_size = config.hidden_size
47
+ self.intermediate_size = config.intermediate_size
48
+ self.act_fn = ACT2FN[config.hidden_act]
49
+ self.dense_h_to_4h = nn.Linear(self.hidden_size, self.intermediate_size, bias=True)
50
+ self.dense_4h_to_h = nn.Linear(self.intermediate_size, self.hidden_size, bias=True)
51
+
52
+ def forward(self, x):
53
+ intermediate = self.dense_h_to_4h(x)
54
+ intermediate = self.act_fn(intermediate)
55
+ output = self.dense_4h_to_h(intermediate)
56
+ return output
57
+
58
+
59
+ @use_kernel_forward_from_hub("RMSNorm")
60
+ class HunYuanVLRMSNorm(nn.Module):
61
+ def __init__(self, hidden_size, eps=1e-6):
62
+ """
63
+ HunYuanVLRMSNorm is equivalent to T5LayerNorm
64
+ """
65
+ super().__init__()
66
+ self.weight = nn.Parameter(torch.ones(hidden_size))
67
+ self.variance_epsilon = eps
68
+
69
+ def forward(self, hidden_states):
70
+ input_dtype = hidden_states.dtype
71
+ hidden_states = hidden_states.to(torch.float32)
72
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
73
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
74
+ return self.weight * hidden_states.to(input_dtype)
75
+
76
+ def extra_repr(self):
77
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
78
+
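+ # Minimal numeric sketch of the RMSNorm above (hypothetical values,
+ # hidden size 4, default weight of ones):
+ # >>> import torch
+ # >>> x = torch.tensor([1.0, 2.0, 3.0, 4.0])
+ # >>> rms = torch.sqrt(x.pow(2).mean() + 1e-6)
+ # >>> norm = HunYuanVLRMSNorm(4)
+ # >>> torch.allclose(norm(x), x / rms, atol=1e-5)
+ # True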
79
+
80
+ class HunYuanVLMLP(nn.Module):
81
+ def __init__(self, config: HunYuanVLConfig, layer_idx=None, is_shared_mlp=False):
82
+ super().__init__()
83
+ self.config = config
84
+ self.hidden_size = config.hidden_size
85
+ self.intermediate_size = config.intermediate_size
86
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
87
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
88
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
89
+ self.act_fn = ACT2FN[config.hidden_act]
90
+ self.layer_idx = layer_idx
91
+
92
+ def forward(self, x):
93
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
94
+ return down_proj
95
+
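+ # The gated MLP above computes down_proj(act(gate_proj(x)) * up_proj(x));
+ # when hidden_act is SiLU this is the SwiGLU form
+ # y = W_down(silu(W_gate x) * (W_up x)).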
96
+
97
+ class HunYuanVisionPatchEmbed(nn.Module):
98
+ def __init__(self, config: HunYuanVLVisionConfig):
99
+ super().__init__()
100
+
101
+ self.config = config
102
+ self.embed_dim = config.hidden_size
103
+ self.patch_size = config.patch_size
104
+ self.num_channels = config.num_channels
105
+ self.spatial_merge_size = config.spatial_merge_size
106
+ self.interpolate_mode = config.interpolate_mode
107
+
108
+ self.patch_embedding = nn.Conv2d(
109
+ in_channels=config.num_channels,
110
+ out_channels=self.embed_dim,
111
+ kernel_size=self.patch_size,
112
+ stride=self.patch_size,
113
+ bias=True,
114
+ )
115
+
116
+ self.max_num_patches = (config.max_image_size // self.patch_size) ** 2
117
+ self.num_positions = self.max_num_patches + 1
118
+ self.position_edge = int(self.num_positions**0.5)
119
+ # first token is cls token, skip it
120
+ self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
121
+
122
+ self.patch_pos_embed = None
123
+
124
+ def forward(self, pixel_values: torch.Tensor, grid_thw: list[list[int]]) -> torch.Tensor:
125
+ num_patches, hidden_size = pixel_values.shape
126
+ pixel_values = pixel_values.reshape(num_patches, self.num_channels, self.patch_size, self.patch_size)
127
+
128
+ patch_embeds = self.patch_embedding(pixel_values)
129
+ patch_embeds = patch_embeds.squeeze(-1).squeeze(-1).unsqueeze(0)
130
+
131
+ if self.patch_pos_embed is None:
132
+ patch_pos_shape = (1, self.position_edge, self.position_edge, self.embed_dim)
133
+ self.patch_pos_embed = (
134
+ self.position_embedding.weight[1:, :].reshape(patch_pos_shape).permute(0, 3, 1, 2).float()
135
+ )
136
+
137
+ patch_pos_embed_list = []
138
+ for grid in grid_thw:
139
+ _, h0, w0 = grid
140
+ # we add a small number to avoid floating point error in the interpolation
141
+ # see discussion at https://github.com/facebookresearch/dino/issues/8
142
+ h0, w0 = h0 + 0.1, w0 + 0.1
143
+ patch_pos_embed = nn.functional.interpolate(
144
+ self.patch_pos_embed,
145
+ scale_factor=((h0 / self.position_edge).item(), (w0 / self.position_edge).item()),
146
+ mode=self.interpolate_mode,
147
+ align_corners=False,
148
+ )
149
+
150
+ patch_pos_embed = (
151
+ patch_pos_embed.reshape(self.embed_dim, -1).transpose(0, 1).unsqueeze(0).to(patch_embeds.dtype)
152
+ )
153
+ patch_pos_embed_list.append(patch_pos_embed)
154
+
155
+ patch_pos_embed = torch.cat(patch_pos_embed_list, dim=1)
156
+ embeddings = patch_embeds + patch_pos_embed
157
+
158
+ return embeddings
159
+
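+ # Interpolation sketch: the pretrained (edge x edge) position grid is
+ # resized per image to its (h, w) patch grid. Assuming hypothetical values
+ # edge=32 and a 20x44 grid, the scale factors are (20.1/32, 44.1/32), so
+ # the interpolated grid is exactly (1, embed_dim, 20, 44) before being
+ # flattened; the 0.1 offset guards against floor() rounding the size down.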
160
+
161
+ class HunYuanVisionPatchMerger(nn.Module):
162
+ def __init__(
163
+ self,
164
+ in_channels,
165
+ out_channels,
166
+ spatial_merge_size,
167
+ rms_norm_eps,
168
+ **kwargs,
169
+ ):
170
+ super().__init__()
171
+
172
+ embed_std = out_channels**-0.5
173
+ self.spatial_merge_size = spatial_merge_size
174
+ self.proj = nn.Sequential(
175
+ nn.Conv2d(in_channels, in_channels * 2, kernel_size=spatial_merge_size, stride=spatial_merge_size),
176
+ nn.GELU(),
177
+ nn.Conv2d(in_channels * 2, in_channels * 4, kernel_size=1),
178
+ )
179
+ self.mlp = nn.Linear(in_channels * 4, out_channels)
180
+ self.image_newline = nn.Parameter(torch.randn(in_channels * 4) * embed_std)
181
+ self.image_begin = nn.Parameter(torch.randn(out_channels) * embed_std)
182
+ self.image_end = nn.Parameter(torch.randn(out_channels) * embed_std)
183
+ self.image_sep = nn.Parameter(torch.randn(out_channels) * embed_std)
184
+
185
+ self.before_rms = HunYuanVLRMSNorm(in_channels, eps=rms_norm_eps)
186
+ self.after_rms = HunYuanVLRMSNorm(out_channels, eps=rms_norm_eps)
187
+
188
+ def forward(self, x, size=(16, 16)):
189
+ x = self.before_rms(x)
190
+ h, w = size
191
+ dtype = x.dtype
192
+ x = x.permute(0, 2, 1).reshape(x.shape[0], -1, int(h.item()), int(w.item()))
193
+ x = self.proj(x) # b,c,h,w
194
+ b, c, h, w = x.shape
195
+ x = torch.cat(
196
+ [x, self.image_newline.reshape(1, c, 1, 1).expand(b, c, h, 1).to(dtype, non_blocking=True)], dim=-1
197
+ )
198
+ x = x.reshape(b, c, -1).permute(0, 2, 1)
199
+ x = self.mlp(x)
200
+
201
+ begin = self.image_begin.reshape(1, 1, -1).expand(b, 1, x.shape[-1]).to(dtype, non_blocking=True)
202
+ end = self.image_end.reshape(1, 1, -1).expand(b, 1, x.shape[-1]).to(dtype, non_blocking=True)
203
+ x = torch.cat([begin, x, end], dim=1)
204
+
205
+ return self.after_rms(x)
206
+
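+ # Shape sketch for the merger (hypothetical in_channels c, merge size m,
+ # and an h x w patch grid):
+ #   (b, h*w, c)  -> conv stack            -> (b, 4c, h/m, w/m)
+ #   + newline column per row              -> (b, 4c, h/m, w/m + 1)
+ #   flatten + linear projection           -> (b, (h/m)*(w/m + 1), out_channels)
+ #   + begin/end tokens                    -> (b, (h/m)*(w/m + 1) + 2, out_channels)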
207
+
208
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
209
+ """
210
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
211
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
212
+ """
213
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
214
+ if n_rep == 1:
215
+ return hidden_states
216
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
217
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
218
+
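+ # Shape sketch, assuming 2 KV heads shared by 8 query heads (n_rep = 4):
+ # >>> kv = torch.randn(1, 2, 5, 64)  # (batch, num_kv_heads, seq, head_dim)
+ # >>> repeat_kv(kv, 4).shape
+ # torch.Size([1, 8, 5, 64])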
219
+
220
+ def eager_attention_forward(
221
+ module: nn.Module,
222
+ query: torch.Tensor,
223
+ key: torch.Tensor,
224
+ value: torch.Tensor,
225
+ attention_mask: Optional[torch.Tensor],
226
+ scaling: float,
227
+ dropout: float = 0.0,
228
+ **kwargs: Unpack[TransformersKwargs],
229
+ ):
230
+ key_states = repeat_kv(key, module.num_key_value_groups)
231
+ value_states = repeat_kv(value, module.num_key_value_groups)
232
+
233
+ attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
234
+ if attention_mask is not None:
235
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
236
+ attn_weights = attn_weights + causal_mask
237
+
238
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
239
+ attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
240
+ attn_output = torch.matmul(attn_weights, value_states)
241
+ attn_output = attn_output.transpose(1, 2).contiguous()
242
+
243
+ return attn_output, attn_weights
244
+
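+ # Standalone sketch of the eager path (hypothetical sizes; `module` only
+ # needs the two attributes read above):
+ # >>> import types
+ # >>> mod = types.SimpleNamespace(num_key_value_groups=4, training=False)
+ # >>> q = torch.randn(1, 8, 5, 64)
+ # >>> k, v = torch.randn(1, 2, 5, 64), torch.randn(1, 2, 5, 64)
+ # >>> out, w = eager_attention_forward(mod, q, k, v, None, scaling=64**-0.5)
+ # >>> out.shape, w.shape
+ # (torch.Size([1, 5, 8, 64]), torch.Size([1, 8, 5, 5]))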
245
+
246
+ class HunYuanVisionAttention(nn.Module):
247
+ def __init__(self, config: HunYuanVLConfig):
248
+ super().__init__()
249
+ self.config = config
250
+ self.is_causal = False # used in flash_attention
251
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
252
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
253
+ self.scaling = self.head_dim**-0.5
254
+ self.attention_dropout = config.attention_dropout
255
+ self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=True)
256
+ self.k_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=True)
257
+ self.v_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=True)
258
+ self.o_proj = nn.Linear(config.num_attention_heads * self.head_dim, config.hidden_size, bias=True)
259
+
260
+ def forward(
261
+ self,
262
+ hidden_states: torch.Tensor,
263
+ position_embeddings: tuple[torch.Tensor, torch.Tensor],
264
+ position_ids: Optional[torch.LongTensor] = None,
265
+ attention_mask: Optional[torch.Tensor] = None,
266
+ past_key_values: Optional[Cache] = None,
267
+ cache_position: Optional[torch.LongTensor] = None,
268
+ **kwargs: Unpack[TransformersKwargs],
269
+ ) -> tuple[torch.Tensor, torch.Tensor]:
270
+ input_shape = hidden_states.shape[:-1]
271
+ hidden_shape = (*input_shape, -1, self.head_dim)
272
+
273
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
274
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
275
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
276
+
277
+ attention_interface: Callable = eager_attention_forward
278
+ if self.config._attn_implementation != "eager":
279
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
280
+
281
+ attn_output, attn_weights = attention_interface(
282
+ self,
283
+ query_states,
284
+ key_states,
285
+ value_states,
286
+ attention_mask,
287
+ dropout=0.0 if not self.training else self.attention_dropout,
288
+ scaling=self.scaling,
289
+ **kwargs,
290
+ )
291
+
292
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
293
+ attn_output = self.o_proj(attn_output)
294
+ return attn_output, attn_weights
295
+
296
+
297
+ class HunYuanVisionBlock(GradientCheckpointingLayer):
298
+ def __init__(self, config: HunYuanVLVisionConfig):
299
+ super().__init__()
300
+ self.hidden_size = config.hidden_size
301
+ self.self_attn = HunYuanVisionAttention(config)
302
+ self.mlp = HunYuanVisionMLP(config)
303
+ self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
304
+ self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
305
+
306
+ def forward(
307
+ self,
308
+ hidden_states: torch.Tensor,
309
+ attention_mask: Optional[torch.Tensor] = None,
310
+ position_ids: Optional[torch.LongTensor] = None,
311
+ past_key_values: Optional[Cache] = None,
312
+ use_cache: Optional[bool] = False,
313
+ cache_position: Optional[torch.LongTensor] = None,
314
+ position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
315
+ **kwargs: Unpack[TransformersKwargs],
316
+ ) -> torch.Tensor:
317
+ residual = hidden_states
318
+ hidden_states = self.input_layernorm(hidden_states)
319
+ # Self Attention
320
+ hidden_states, _ = self.self_attn(
321
+ hidden_states=hidden_states,
322
+ attention_mask=attention_mask,
323
+ position_ids=position_ids,
324
+ past_key_values=past_key_values,
325
+ use_cache=use_cache,
326
+ cache_position=cache_position,
327
+ position_embeddings=position_embeddings,
328
+ **kwargs,
329
+ )
330
+ hidden_states = residual + hidden_states
331
+
332
+ # Fully Connected
333
+ residual = hidden_states
334
+ hidden_states = self.post_attention_layernorm(hidden_states)
335
+ hidden_states = self.mlp(hidden_states)
336
+ hidden_states = residual + hidden_states
337
+ return hidden_states
338
+
339
+
340
+ class HunYuanVisionTransformer(nn.Module):
341
+ config: HunYuanVLVisionConfig
342
+ _no_split_modules = ["HunYuanVisionBlock"]
343
+
344
+ def __init__(self, config: HunYuanVLVisionConfig):
345
+ super().__init__()
346
+ self.config = config
347
+ self.embeddings = HunYuanVisionPatchEmbed(config)
348
+ self.layers = nn.ModuleList([HunYuanVisionBlock(config) for _ in range(config.num_hidden_layers)])
349
+ self.perceive = HunYuanVisionPatchMerger(
350
+ self.config.hidden_size,
351
+ self.config.text_hidden_size,
352
+ self.config.spatial_merge_size,
353
+ self.config.rms_norm_eps,
354
+ )
355
+
356
+ def get_activation_function(self, act_name: str):
357
+ act_map = {
358
+ "gelu": nn.GELU(),
359
+ "relu": nn.ReLU(),
360
+ "silu": nn.SiLU(),
361
+ }
362
+ return act_map.get(act_name.lower(), nn.GELU()) # default GELU
363
+
364
+ # @auto_docstring
365
+ def forward(
366
+ self,
367
+ x: torch.Tensor,
368
+ grid_thw: list[list[int]],
369
+ ) -> torch.Tensor:
371
+ r"""
372
+ grid_thw (`torch.LongTensor` of shape `(num_images, 3)`):
373
+ The temporal, height and width dimensions of feature shape for each image. Each row contains [t, h, w] values.
374
+ """
375
+ hidden_states = self.embeddings(x, grid_thw)
376
+ for layer in self.layers:
377
+ hidden_states = layer(hidden_states)
378
+
379
+ cu_seqlens: list = [0]
380
+ for t, h, w in grid_thw:
381
+ cu_seqlens.append((h * w).item())
382
+
383
+ cu_seqlens = torch.tensor(cu_seqlens, dtype=torch.int32)
384
+ cu_seqlens = torch.cumsum(cu_seqlens, dim=0, dtype=torch.int32)
385
+ split_lengths = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
386
+ split_items = torch.split(hidden_states, split_lengths, dim=1)
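+ # e.g. for grid_thw [[1, 20, 44], [1, 16, 16]] (hypothetical), cu_seqlens
+ # becomes [0, 880, 1136] and split_lengths [880, 256], one chunk per image.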
387
+
388
+ processed_items = []
389
+ for grid, item in zip(grid_thw, split_items):
390
+ t, h, w = grid
391
+ processed = self.perceive(item, size=(h, w))
392
+ processed_items.append(processed)
393
+
394
+ hidden_states = torch.cat(processed_items, dim=1)
395
+
396
+ return hidden_states
397
+
398
+
399
+ class HunYuanVLRotaryEmbedding(nn.Module):
400
+ inv_freq: torch.Tensor # fix linting for `register_buffer`
401
+
402
+ def __init__(self, config: HunYuanVLConfig, device=None):
403
+ super().__init__()
404
+ # BC: "rope_type" was originally "type"
405
+ if hasattr(config, "rope_scaling") and isinstance(config.rope_scaling, dict):
406
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
407
+ else:
408
+ self.rope_type = "default"
409
+ self.max_seq_len_cached = config.max_position_embeddings
410
+ self.original_max_seq_len = config.max_position_embeddings
411
+
412
+ self.config = config
413
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type if self.rope_type != "xdrope" else "dynamic"]
414
+ if self.rope_type in ["xdrope", "dynamic"] and config.rope_scaling["alpha"]:
415
+ # DynamicNTKAlphaRotary
416
+ self.dim = config.head_dim
417
+ base = config.rope_theta * config.rope_scaling.get("alpha") ** (self.dim / (self.dim - 2))
418
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
419
+ self.attention_scaling = 1.0
420
+ else:
421
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
422
+
423
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
424
+ self.original_inv_freq = self.inv_freq
425
+ self._set_cos_sin_cache(
426
+ seq_len=config.max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
427
+ )
428
+
429
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
430
+ self.max_seq_len_cached = seq_len
431
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
432
+ freqs = torch.outer(t, self.inv_freq)
433
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
434
+ emb = torch.cat((freqs, freqs), dim=-1).float()
435
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
436
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
437
+
438
+ def forward(self, x, seq_len: Optional[int] = None):
439
+ # x: [bs, num_attention_heads, seq_len, head_size]
440
+ if seq_len is not None and seq_len > self.max_seq_len_cached:
441
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
442
+
443
+ return (
444
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
445
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
446
+ )
447
+
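+ # Worked example of the NTK-alpha scaling above, assuming hypothetical
+ # values rope_theta=10000, alpha=1000, head_dim=128:
+ # >>> dim, theta, alpha = 128, 10000.0, 1000.0
+ # >>> base = theta * alpha ** (dim / (dim - 2))
+ # >>> base > theta  # alpha > 1 enlarges the base, slowing the rotation
+ # True
+ # The enlarged base lowers every inv_freq term, which is what extends the
+ # usable context length without retraining.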
448
+
449
+ def rotate_half(x):
450
+ """Rotates half the hidden dims of the input."""
451
+ x1 = x[..., : x.shape[-1] // 2]
452
+ x2 = x[..., x.shape[-1] // 2 :]
453
+ return torch.cat((-x2, x1), dim=-1)
454
+
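+ # Worked example for rotate_half on a hypothetical 4-dim vector:
+ # >>> rotate_half(torch.tensor([1.0, 2.0, 3.0, 4.0]))
+ # tensor([-3., -4.,  1.,  2.])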
455
+
456
+ def apply_rotary_pos_emb_xdrope(q, k, cos, sin, position_ids, xdrope_section, output_size=None):
457
+ """Applies XD Rotary Position Embedding to the query and key tensors.
458
+
459
+ Args:
460
+ q (`torch.Tensor`): The query tensor.
461
+ k (`torch.Tensor`): The key tensor.
462
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
463
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
464
+ position_ids (`torch.Tensor`): The position IDs for the tokens.
465
+ xdrope_section (`list`): The section ratios for XD RoPE.
466
+ output_size (`tuple`, optional): The output size of the tensors. Defaults to None.
467
+ bf16 (bool, optional): Whether to use bfloat16 precision. Defaults to False.
468
+
469
+ Returns:
470
+ `tuple(torch.Tensor)`: The query and key tensors rotated using the XD Rotary Position Embedding.
471
+ """
472
+ x_dim = len(xdrope_section)
473
+ cos = cos[position_ids, ...].permute(0, 2, 1, 3).reshape(output_size[0], output_size[2], x_dim, -1).contiguous()
474
+ sin = sin[position_ids, ...].permute(0, 2, 1, 3).reshape(output_size[0], output_size[2], x_dim, -1).contiguous()
475
+
476
+ xdrope_section = xdrope_section * 2
477
+
478
+ # for xd concat
479
+ assert sum(xdrope_section) == cos.shape[-1], "Illegal partition for xd rope"
480
+ cos = torch.cat([m[:, :, i % x_dim, :] for i, m in enumerate(cos.split(xdrope_section, dim=-1))], dim=-1)
481
+ sin = torch.cat([m[:, :, i % x_dim, :] for i, m in enumerate(sin.split(xdrope_section, dim=-1))], dim=-1)
482
+
483
+ # for head repeat
484
+ cos = cos.view(output_size[0], 1, output_size[2], -1) # .repeat(1, output_size[1], 1, 1)
485
+ sin = sin.view(output_size[0], 1, output_size[2], -1) # .repeat(1, output_size[1], 1, 1)
486
+
487
+ origin_dtype = q.dtype
488
+ q, k = q.float(), k.float()
489
+ cos, sin = cos.float(), sin.float()
490
+ q_out, k_out = (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
491
+
492
+ return q_out.to(origin_dtype), k_out.to(origin_dtype)
493
+
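+ # Partition sketch: with a hypothetical xdrope_section of [16, 24, 24]
+ # (x_dim = 3) and head_dim = 128, the doubled list [16, 24, 24, 16, 24, 24]
+ # sums to cos.shape[-1] = 128, and chunk i draws its cos/sin values from
+ # positional axis i % 3, interleaving the three position dimensions.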
494
+
495
+ def apply_rotary_pos_emb(
496
+ q: torch.Tensor,
497
+ k: torch.Tensor,
498
+ cos: torch.Tensor,
499
+ sin: torch.Tensor,
500
+ position_ids: Optional[torch.Tensor] = None,
501
+ unsqueeze_dim: int = 1,
502
+ ):
503
+ """Applies Rotary Position Embedding to the query and key tensors.
504
+
505
+ Args:
506
+ q (`torch.Tensor`): The query tensor.
507
+ k (`torch.Tensor`): The key tensor.
508
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
509
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
510
+ position_ids (`torch.Tensor`, *optional*):
511
+ Deprecated and unused.
512
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
513
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
514
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
515
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
516
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
517
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
518
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
519
+ Returns:
520
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
521
+ """
522
+ if position_ids is not None:
523
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
524
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
525
+ else:
526
+ cos = cos.unsqueeze(0).unsqueeze(unsqueeze_dim)
527
+ sin = sin.unsqueeze(0).unsqueeze(unsqueeze_dim)
528
+ q_embed = (q * cos) + (rotate_half(q) * sin)
529
+ k_embed = (k * cos) + (rotate_half(k) * sin)
530
+ return q_embed, k_embed
531
+
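+ # Usage sketch (hypothetical shapes; `emb` is assumed to be a
+ # HunYuanVLRotaryEmbedding built from a text config with head_dim=64):
+ # >>> q, k = torch.randn(1, 8, 5, 64), torch.randn(1, 2, 5, 64)
+ # >>> cos, sin = emb(k, seq_len=5)  # each of shape (5, 64)
+ # >>> q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)
+ # >>> q_rot.shape, k_rot.shape
+ # (torch.Size([1, 8, 5, 64]), torch.Size([1, 2, 5, 64]))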
532
+
533
+ class HunYuanVLAttention(nn.Module):
534
+ def __init__(self, config, layer_idx: int):
535
+ super().__init__()
536
+ self.config = config
537
+ self.layer_idx = layer_idx
538
+ self.is_causal = True # used in flash_attention
539
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
540
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
541
+ self.scaling = self.head_dim**-0.5
542
+ self.attention_dropout = config.attention_dropout
543
+ self.q_proj = nn.Linear(
544
+ config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
545
+ )
546
+ self.k_proj = nn.Linear(
547
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
548
+ )
549
+ self.v_proj = nn.Linear(
550
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
551
+ )
552
+ self.o_proj = nn.Linear(
553
+ config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
554
+ )
555
+
556
+ self.query_layernorm = HunYuanVLRMSNorm(self.head_dim, eps=config.rms_norm_eps)
557
+ self.key_layernorm = HunYuanVLRMSNorm(self.head_dim, eps=config.rms_norm_eps)
558
+
559
+ self.rotary_emb = HunYuanVLRotaryEmbedding(config=config)
560
+ self.xdrope_section = config.rope_scaling["xdrope_section"]
561
+
562
+ @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
563
+ def forward(
564
+ self,
565
+ hidden_states: torch.Tensor,
566
+ position_embeddings: tuple[torch.Tensor, torch.Tensor],
567
+ position_ids: Optional[torch.LongTensor] = None,
568
+ attention_mask: Optional[torch.Tensor] = None,
569
+ past_key_values: Optional[Cache] = None,
570
+ cache_position: Optional[torch.LongTensor] = None,
571
+ **kwargs: Unpack[TransformersKwargs],
572
+ ) -> tuple[torch.Tensor, torch.Tensor]:
573
+ input_shape = hidden_states.shape[:-1]
574
+ hidden_shape = (*input_shape, -1, self.head_dim)
575
+
576
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
577
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
578
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
579
+
580
+ kv_seq_len = key_states.shape[-2]
581
+ origin_kv_seq_len = key_states.shape[-2]
582
+ if past_key_values is not None:
583
+ kv_seq_len += past_key_values.get_seq_length(self.layer_idx)
584
+
585
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
586
+ if self.xdrope_section is not None:
587
+ if past_key_values is None or past_key_values.get_seq_length() == 0:
588
+ output_size = (
589
+ query_states.size(0),
590
+ query_states.size(1),
591
+ query_states.size(2),
592
+ key_states.size(2),
593
+ )
594
+ query_states, key_states = apply_rotary_pos_emb_xdrope(
595
+ query_states, key_states, cos, sin, position_ids, self.xdrope_section, output_size
596
+ )
597
+ else:
598
+ position_ids = (
599
+ torch.ones(position_ids.shape[0], 1, dtype=torch.long, device=position_ids.device)
600
+ * past_key_values.get_seq_length()
601
+ )
602
+ cos, sin = cos[-origin_kv_seq_len:, :], sin[-origin_kv_seq_len:, :]
603
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
604
+ else:
605
+ position_ids = torch.ones(
606
+ position_ids.shape[0], 1, dtype=torch.long, device=position_ids.device
607
+ ) * past_key_values.get_seq_length(self.layer_idx)
608
+ cos, sin = cos[-origin_kv_seq_len:, :], sin[-origin_kv_seq_len:, :]
609
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
610
+
611
+ query_states = self.query_layernorm(query_states)
612
+ key_states = self.key_layernorm(key_states)
613
+
614
+ if past_key_values is not None:
615
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
616
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
617
+ key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
618
+
619
+ attention_interface: Callable = eager_attention_forward
620
+ if self.config._attn_implementation != "eager":
621
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
622
+
623
+ attn_output, attn_weights = attention_interface(
624
+ self,
625
+ query_states,
626
+ key_states,
627
+ value_states,
628
+ attention_mask,
629
+ dropout=0.0 if not self.training else self.attention_dropout,
630
+ scaling=self.scaling,
631
+ **kwargs,
632
+ )
633
+
634
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
635
+ attn_output = self.o_proj(attn_output)
636
+ return attn_output, attn_weights
637
+
638
+
639
+ class HunYuanVLDecoderLayer(GradientCheckpointingLayer):
640
+ def __init__(self, config: Union[HunYuanVLVisionConfig, HunYuanVLTextConfig], layer_idx: int):
641
+ super().__init__()
642
+ self.hidden_size = config.hidden_size
643
+
644
+ self.self_attn = HunYuanVLAttention(config=config, layer_idx=layer_idx)
645
+
646
+ self.mlp = HunYuanVLMLP(config)
649
+ self.layer_idx = layer_idx
650
+ if config.norm_type == "hf_rms" or config.norm_type == "rms":
651
+ self.input_layernorm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
652
+ self.post_attention_layernorm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
653
+ elif config.norm_type == "fused" or config.norm_type == "torch_nn":
654
+ self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
655
+ self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
656
+ else:
657
+ raise NotImplementedError(f"norm_type '{config.norm_type}' is not supported")
658
+
659
+ @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
660
+ def forward(
661
+ self,
662
+ hidden_states: torch.Tensor,
663
+ attention_mask: Optional[torch.Tensor] = None,
664
+ position_ids: Optional[torch.LongTensor] = None,
665
+ past_key_values: Optional[Cache] = None,
666
+ use_cache: Optional[bool] = False,
667
+ cache_position: Optional[torch.LongTensor] = None,
668
+ position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
669
+ **kwargs: Unpack[TransformersKwargs],
670
+ ) -> torch.Tensor:
671
+ residual = hidden_states
672
+ hidden_states = self.input_layernorm(hidden_states)
673
+ # Self Attention
674
+ hidden_states, _ = self.self_attn(
675
+ hidden_states=hidden_states,
676
+ attention_mask=attention_mask,
677
+ position_ids=position_ids,
678
+ past_key_values=past_key_values,
679
+ use_cache=use_cache,
680
+ cache_position=cache_position,
681
+ position_embeddings=position_embeddings,
682
+ **kwargs,
683
+ )
684
+ hidden_states = residual + hidden_states
685
+
686
+ # Fully Connected
687
+ residual = hidden_states
688
+ hidden_states = self.post_attention_layernorm(hidden_states)
689
+ hidden_states = self.mlp(hidden_states)
690
+ hidden_states = residual + hidden_states
691
+ return hidden_states
692
+
693
+
694
+ @auto_docstring
695
+ class HunYuanVLPreTrainedModel(PreTrainedModel):
696
+ config: HunYuanVLConfig
697
+ base_model_prefix = "model"
698
+ supports_gradient_checkpointing = True
699
+ _no_split_modules = ["HunYuanVLDecoderLayer"]
700
+ _skip_keys_device_placement = ["past_key_values"]
701
+ _supports_flash_attn = True
702
+ _supports_sdpa = True
703
+ _supports_flex_attn = True
704
+
705
+ _can_compile_fullgraph = True
706
+ _supports_attention_backend = True
707
+ _can_record_outputs = {
708
+ "hidden_states": HunYuanVLDecoderLayer,
709
+ "attentions": HunYuanVLAttention,
710
+ }
711
+
712
+ def _init_weights(self, module):
713
+ std = self.config.initializer_range
714
+ if isinstance(module, nn.Linear):
715
+ module.weight.data.normal_(mean=0.0, std=std)
716
+ if module.bias is not None:
717
+ module.bias.data.zero_()
718
+ elif isinstance(module, nn.Embedding):
719
+ module.weight.data.normal_(mean=0.0, std=std)
720
+ if module.padding_idx is not None:
721
+ module.weight.data[module.padding_idx].zero_()
722
+
723
+
724
+ @auto_docstring
725
+ class HunYuanVLModel(HunYuanVLPreTrainedModel):
726
+ def __init__(self, config: Union[HunYuanVLConfig, HunYuanVLTextConfig]):
727
+ super().__init__(config)
728
+ self.padding_idx = config.pad_token_id
729
+ self.vocab_size = config.vocab_size
730
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
731
+ self.layers = nn.ModuleList(
732
+ [HunYuanVLDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
733
+ )
734
+ self.norm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
735
+ self.gradient_checkpointing = False
736
+ self.post_init()
737
+
738
+ # @auto_docstring # TODO Fix this
739
+ def forward(
740
+ self,
741
+ input_ids: Optional[torch.LongTensor] = None,
742
+ attention_mask: Optional[torch.Tensor] = None,
743
+ position_ids: Optional[torch.LongTensor] = None,
744
+ past_key_values: Optional[Cache] = None,
745
+ inputs_embeds: Optional[torch.FloatTensor] = None,
746
+ cache_position: Optional[torch.LongTensor] = None,
747
+ use_cache: Optional[bool] = None,
748
+ **kwargs: Unpack[TransformersKwargs],
749
+ ) -> BaseModelOutputWithPast:
750
+ if (input_ids is None) ^ (inputs_embeds is not None):
751
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
752
+
753
+ if inputs_embeds is None:
754
+ inputs_embeds: torch.Tensor = self.embed_tokens(input_ids)
755
+
756
+ if use_cache and past_key_values is None:
757
+ past_key_values = DynamicCache(config=self.config)
758
+
759
+ if cache_position is None:
760
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
761
+ cache_position: torch.Tensor = torch.arange(
762
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
763
+ )
764
+
765
+ if position_ids is None:
766
+ position_ids = cache_position.unsqueeze(0)
767
+
768
+ causal_mask = create_causal_mask(
769
+ config=self.config,
770
+ input_embeds=inputs_embeds,
771
+ attention_mask=attention_mask,
772
+ cache_position=cache_position,
773
+ past_key_values=past_key_values,
774
+ position_ids=position_ids,
775
+ )
776
+ hidden_states = inputs_embeds
777
+ for decoder_layer in self.layers[: self.config.num_hidden_layers]:
778
+ hidden_states = decoder_layer(
779
+ hidden_states,
780
+ attention_mask=causal_mask,
781
+ position_ids=position_ids,
782
+ past_key_values=past_key_values,
783
+ cache_position=cache_position,
784
+ **kwargs,
785
+ )
786
+
787
+ hidden_states = self.norm(hidden_states)
788
+ return BaseModelOutputWithPast(
789
+ last_hidden_state=hidden_states,
790
+ past_key_values=past_key_values,
791
+ )
792
+
793
+
794
+ @auto_docstring
795
+ class HunYuanVLForCausalLM(HunYuanVLPreTrainedModel, GenerationMixin):
796
+ _tied_weights_keys = ["lm_head.weight"]
797
+ _tp_plan = {"lm_head": "colwise_rep"}
798
+ _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
799
+
800
+ def __init__(self, config):
801
+ super().__init__(config)
802
+ self.model = HunYuanVLModel(config)
803
+ self.vocab_size = config.vocab_size
804
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
805
+
806
+ # Initialize weights and apply final processing
807
+ self.post_init()
808
+
809
+ @can_return_tuple
810
+ @auto_docstring
811
+ def forward(
812
+ self,
813
+ input_ids: Optional[torch.LongTensor] = None,
814
+ attention_mask: Optional[torch.Tensor] = None,
815
+ position_ids: Optional[torch.LongTensor] = None,
816
+ past_key_values: Optional[Cache] = None,
817
+ inputs_embeds: Optional[torch.FloatTensor] = None,
818
+ labels: Optional[torch.LongTensor] = None,
819
+ use_cache: Optional[bool] = None,
820
+ cache_position: Optional[torch.LongTensor] = None,
821
+ logits_to_keep: Union[int, torch.Tensor] = 0,
822
+ **kwargs: Unpack[TransformersKwargs],
823
+ ) -> CausalLMOutputWithPast:
824
+ r"""
825
+ Example:
826
+
827
+ ```python
828
+ >>> from transformers import AutoTokenizer, HunYuanVLForCausalLM
829
+
830
+ >>> model = HunYuanVLForCausalLM.from_pretrained("tencent/HunyuanOCR")
832
+ >>> tokenizer = AutoTokenizer.from_pretrained("tencent/HunyuanOCR")
832
+
833
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
834
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
835
+
836
+ >>> # Generate
837
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
838
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
839
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
840
+ ```"""
841
+ outputs: BaseModelOutputWithPast = self.model(
842
+ input_ids=input_ids,
843
+ attention_mask=attention_mask,
844
+ position_ids=position_ids,
845
+ past_key_values=past_key_values,
846
+ inputs_embeds=inputs_embeds,
847
+ use_cache=use_cache,
848
+ cache_position=cache_position,
849
+ **kwargs,
850
+ )
851
+
852
+ hidden_states = outputs.last_hidden_state
853
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
854
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
855
+ logits = self.lm_head(hidden_states[:, slice_indices, :])
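+ # e.g. logits_to_keep=1 during generation yields slice(-1, None), so only
+ # the final position is projected through lm_head.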
856
+
857
+ loss = None
858
+ if labels is not None:
859
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
860
+
861
+ return CausalLMOutputWithPast(
862
+ loss=loss,
863
+ logits=logits,
864
+ past_key_values=outputs.past_key_values,
865
+ hidden_states=outputs.hidden_states,
866
+ attentions=outputs.attentions,
867
+ )
868
+
869
+
870
+ class HunYuanVLForConditionalGeneration(HunYuanVLPreTrainedModel, GenerationMixin):
871
+ _tied_weights_keys = ["lm_head.weight"]
872
+ config: HunYuanVLConfig
873
+
874
+ def __init__(self, config: HunYuanVLConfig):
875
+ super().__init__(config)
876
+ self.model = HunYuanVLModel(config)
877
+ self.vocab_size = config.vocab_size
878
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
879
+ self.vit = HunYuanVisionTransformer(config.vision_config)
880
+ self.config = config
881
+ self.post_init()
882
+
883
+ def set_decoder(self, decoder):
884
+ self.model = decoder
885
+
886
+ def get_decoder(self):
887
+ return self.model
888
+
889
+ @can_return_tuple
890
+ @auto_docstring
891
+ def forward(
892
+ self,
893
+ input_ids: Optional[torch.LongTensor] = None,
894
+ attention_mask: Optional[torch.Tensor] = None,
895
+ position_ids: Optional[torch.LongTensor] = None,
896
+ past_key_values: Optional[Cache] = None,
897
+ inputs_embeds: Optional[torch.FloatTensor] = None,
898
+ labels: Optional[torch.LongTensor] = None,
899
+ use_cache: Optional[bool] = None,
900
+ cache_position: Optional[torch.LongTensor] = None,
901
+ logits_to_keep: Union[int, torch.Tensor] = 0,
902
+ **kwargs: Unpack[TransformersKwargs],
903
+ ) -> CausalLMOutputWithPast:
904
+ r"""
905
+ Example:
906
+
907
+ ```python
908
+ >>> from transformers import AutoProcessor, HunYuanVLForConditionalGeneration
909
+ >>> from PIL import Image
910
+ >>> import torch
911
+
912
+ >>> model_name_or_path = "tencent/HunyuanOCR"
913
+ >>> processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False)
914
+ >>> model = HunYuanVLForConditionalGeneration.from_pretrained(
915
+ ... model_name_or_path,
916
+ ... attn_implementation="eager",
917
+ ... torch_dtype=torch.bfloat16,
918
+ ... device_map="auto",
919
+ ... )
920
+
921
+ >>> img_path = "path/to/your/image.jpg"
922
+ >>> image = Image.open(img_path).convert("RGB")
923
+
924
+ >>> messages = [
925
+ ... {
926
+ ... "role": "user",
927
+ ... "content": [
928
+ ... {"type": "image", "image": img_path},
929
+ ... {"type": "text", "text": "Extract the text from the image."},
930
+ ... ],
931
+ ... }
932
+ ... ]
933
+ >>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
934
+ >>> inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to(model.device)
935
+
936
+ >>> with torch.no_grad():
937
+ ... generated_ids = model.generate(**inputs, max_new_tokens=1024)
938
+ >>> generated_ids_trimmed = generated_ids[0][len(inputs["input_ids"][0]):]
939
+ >>> output = processor.decode(generated_ids_trimmed, skip_special_tokens=True)
940
+
941
+ >>> print(output)
942
+
943
+ ```"""
944
+ outputs: BaseModelOutputWithPast = self.model(
945
+ input_ids=input_ids,
946
+ attention_mask=attention_mask,
947
+ position_ids=position_ids,
948
+ past_key_values=past_key_values,
949
+ inputs_embeds=inputs_embeds,
950
+ use_cache=use_cache,
951
+ cache_position=cache_position,
952
+ **kwargs,
953
+ )
954
+
955
+ hidden_states = outputs.last_hidden_state
956
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
957
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
958
+ logits = self.lm_head(hidden_states[:, slice_indices, :])
959
+
960
+ loss = None
961
+ if labels is not None:
962
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
963
+
964
+ return CausalLMOutputWithPast(
965
+ loss=loss,
966
+ logits=logits,
967
+ past_key_values=outputs.past_key_values,
968
+ hidden_states=outputs.hidden_states,
969
+ attentions=outputs.attentions,
970
+ )
971
+
972
+ # def prepare_inputs_for_generation(
973
+ # self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
974
+ # ):
975
+ # inputs = super().prepare_inputs_for_generation(
976
+ # input_ids,
977
+ # past_key_values=past_key_values,
978
+ # attention_mask=attention_mask,
979
+ # inputs_embeds=inputs_embeds,
980
+ # **kwargs,
981
+ # )
982
+ # return inputs
983
+
984
+ @torch.no_grad()
985
+ def generate(
986
+ self,
987
+ input_ids: Optional[torch.Tensor] = None,
988
+ attention_mask: Optional[torch.Tensor] = None,
989
+ position_ids: Optional[torch.LongTensor] = None,
990
+ imgs: Optional[list[torch.FloatTensor]] = None,
991
+ imgs_pos: Optional[list[int]] = None,
992
+ token_type_ids: Optional[torch.LongTensor] = None,
993
+ pixel_values: Optional[torch.FloatTensor] = None,
994
+ image_grid_thw: Optional[list[int]] = None,
995
+ **kwargs,
996
+ ) -> torch.LongTensor:
997
+ if "inputs_embeds" in kwargs:
998
+ raise NotImplementedError("`inputs_embeds` is not supported")
999
+
1000
+ inputs_embeds = self.model.embed_tokens(input_ids)
1001
+
1002
+ if self.vit is not None and pixel_values is not None:
1003
+ pixel_values = pixel_values.to(torch.bfloat16)
1004
+ image_embeds = self.vit(pixel_values, image_grid_thw)
1005
+
1006
+ # The ViT may sit on different GPUs than the language model because of accelerate's auto device mapping.
1007
+ image_embeds = image_embeds.to(input_ids.device, non_blocking=True)
1008
+
1009
+ image_mask, _ = self.get_placeholder_mask(
1010
+ input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds
1011
+ )
1012
+ inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
1013
+
1014
+ return super().generate(
1015
+ inputs=input_ids,
1016
+ position_ids=position_ids,
1017
+ attention_mask=attention_mask,
1018
+ inputs_embeds=inputs_embeds,
1019
+ # eos_token_id=self.config.eod_token_id,
1020
+ **kwargs,
1021
+ )
1022
+
1023
+ # Copied from transformers.models.llava.modeling_llava.LlavaModel.get_placeholder_mask
1024
+ def get_placeholder_mask(
1025
+ self,
1026
+ input_ids: torch.LongTensor,
1027
+ inputs_embeds: torch.FloatTensor,
1028
+ image_features: Optional[torch.FloatTensor] = None,
1029
+ ):
1030
+ """
1031
+ Obtains multimodal placeholder mask from `input_ids` or `inputs_embeds`, and checks that the placeholder token count is
1032
+ equal to the length of multimodal features. If the lengths are different, an error is raised.
1033
+ """
1034
+ if input_ids is None:
1035
+ special_image_mask = inputs_embeds == self.get_input_embeddings()(
1036
+ torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
1037
+ )
1038
+ special_image_mask = special_image_mask.all(-1)
1039
+ else:
1040
+ special_image_mask = input_ids == self.config.image_token_id
1041
+
1042
+ n_image_tokens = special_image_mask.sum()
1043
+ special_image_mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
1044
+ if image_features is not None and inputs_embeds[special_image_mask].numel() != image_features.numel():
1045
+ raise ValueError(
1046
+ f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {image_features.shape[0]}"
1047
+ )
1048
+
1049
+ return special_image_mask, None
1050
+
1051
+
1052
+ __all__ = [
1053
+ "HunYuanVLForConditionalGeneration",
1054
+ "HunYuanVLForCausalLM",
1055
+ "HunYuanVLModel",
1056
+ "HunYuanVLPreTrainedModel",
1057
+ "HunYuanVLTextModel",
1058
+ ]
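The `logits_to_keep` handling in the `forward` methods above projects only a trailing slice of the hidden states through `lm_head`, avoiding a full `[batch, seq, vocab]` logits tensor during generation. A minimal sketch of the slicing semantics (toy tensors; the sizes here are illustrative, not the model's):

```python
import torch

hidden = torch.randn(2, 10, 8)                    # [batch, seq, hidden]
lm_head = torch.nn.Linear(8, 32, bias=False)      # toy vocab projection

logits_to_keep = 1                                # int -> keep the last N positions
slice_indices = slice(-logits_to_keep, None)
assert lm_head(hidden[:, slice_indices, :]).shape == (2, 1, 32)

keep = torch.tensor([3, 7])                       # tensor -> keep explicit positions
assert lm_head(hidden[:, keep, :]).shape == (2, 2, 32)
```

Since `slice(-0, None)` equals `slice(0, None)`, the default `logits_to_keep=0` computes logits for every position.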
modular_hunyuan_vl.py ADDED
@@ -0,0 +1,1042 @@
1
+ # coding=utf-8
2
+ # Copyright (C) 2025 THL A29 Limited, a Tencent company and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """PyTorch HunYuanVL model."""
16
+
17
+ from typing import Callable, Optional, Tuple, Union, List, Dict
18
+
19
+ import torch
20
+ import torch.utils.checkpoint
21
+ from torch import nn
22
+
23
+
24
+ from transformers.activations import ACT2FN
+ from transformers.configuration_utils import PretrainedConfig
25
+ from transformers.cache_utils import Cache, DynamicCache
26
+ from transformers.generation import GenerationMixin
27
+ from transformers.masking_utils import create_causal_mask
28
+ from transformers.modeling_layers import GradientCheckpointingLayer
29
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
30
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
31
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
32
+ from transformers.processing_utils import Unpack
33
+ from transformers.utils import (
34
+ TransformersKwargs,
35
+ auto_docstring,
36
+ can_return_tuple,
37
+ logging,
38
+ )
39
+ from transformers.utils.deprecation import deprecate_kwarg
40
+ from transformers.utils.generic import check_model_inputs
41
+
42
+ from transformers.models.hunyuan_v1_dense.configuration_hunyuan_v1_dense import HunYuanDenseV1Config
43
+ from transformers.models.hunyuan_v1_dense.modeling_hunyuan_v1_dense import (
44
+ HunYuanDenseV1Attention,
45
+ HunYuanDenseV1DecoderLayer,
46
+ HunYuanDenseV1MLP,
47
+ HunYuanDenseV1Model,
48
+ HunYuanDenseV1PreTrainedModel,
49
+ HunYuanDenseV1RMSNorm,
50
+ HunYuanDenseV1RotaryEmbedding,
51
+ HunYuanDenseV1ForCausalLM
52
+ )
53
+
54
+ from transformers.models.llama.modeling_llama import (
55
+ LlamaAttention,
56
+ LlamaDecoderLayer,
57
+ LlamaForCausalLM,
58
+ LlamaForSequenceClassification,
59
+ LlamaMLP,
60
+ LlamaModel,
61
+ LlamaPreTrainedModel,
62
+ LlamaRMSNorm,
63
+ rotate_half,
64
+ repeat_kv,
65
+ eager_attention_forward
66
+ )
67
+
68
+
69
+ import json
70
+ import types
71
+ import math
73
+ from torch import Tensor, nn
74
+ import torch.nn.functional as F
76
+ from contextlib import contextmanager
77
+ from transformers.modeling_attn_mask_utils import (
78
+ _prepare_4d_causal_attention_mask_for_sdpa,
80
+ _prepare_4d_causal_attention_mask,
81
+ )
82
+ from transformers.modeling_outputs import BaseModelOutputWithPooling
83
+
84
+ logger = logging.get_logger(__name__)
85
+
86
+
87
+ class HunYuanVLVisionConfig(PretrainedConfig):
88
+ model_type = "hunyuan_vl"
89
+ base_config_key = "vision_config"
90
+
91
+ def __init__(
92
+ self,
93
+ hidden_act='gelu',
94
+ hidden_size=1152,
95
+ intermediate_size=4304,
96
+ interpolate_mode='bilinear',
97
+ rms_norm_eps=1e-05,
98
+ learnable_mlp_pooling_size=0,
99
+ num_attention_heads=16,
100
+ num_key_value_heads=None,
101
+ num_channels=3,
102
+ num_hidden_layers=27,
103
+ out_hidden_size=4096,
104
+ patch_size=16,
105
+ remove_prenorm=True,
106
+ spatial_merge_size=2,
107
+ temporal_patch_size=1,
108
+ resize_resolution=2048,
109
+ img_max_token_num=4096,
110
+ max_image_size=2048,
111
+ video_max_image_size=768,
112
+ video_min_image_size=256,
113
+ min_image_size=512,
114
+ anyres_vit_max_image_size=2048,
115
+ max_vit_seq_len=16384,
116
+ text_hidden_size=3072,
117
+ **kwargs,
118
+ ):
119
+ super().__init__(**kwargs)
120
+
121
+ self.hidden_act = hidden_act
122
+ self.hidden_size = hidden_size
123
+ self.intermediate_size = intermediate_size
124
+ self.interpolate_mode = interpolate_mode
125
+ self.learnable_mlp_pooling_size = learnable_mlp_pooling_size
126
+ self.num_attention_heads = num_attention_heads
127
+ if not num_key_value_heads:
128
+ self.num_key_value_heads = num_attention_heads
129
+ else:
130
+ self.num_key_value_heads = num_key_value_heads
131
+ self.num_channels = num_channels
132
+ self.num_hidden_layers = num_hidden_layers
133
+ self.out_hidden_size = out_hidden_size
134
+ self.patch_size = patch_size
135
+ self.remove_prenorm = remove_prenorm
136
+ self.spatial_merge_size = spatial_merge_size
137
+ self.temporal_patch_size = temporal_patch_size
138
+ self.rms_norm_eps = rms_norm_eps
139
+
140
+ self.resize_resolution = resize_resolution
141
+ self.img_max_token_num = img_max_token_num
142
+ self.max_image_size = max_image_size
143
+ self.min_image_size = min_image_size
144
+ self.video_max_image_size = video_max_image_size
145
+ self.video_min_image_size = video_min_image_size
146
+ self.anyres_vit_max_image_size = anyres_vit_max_image_size
147
+ self.max_vit_seq_len = max_vit_seq_len
148
+ self.text_hidden_size = text_hidden_size
149
+
150
+
151
+ class HunYuanVLTextConfig(HunYuanDenseV1Config):
152
+ model_type = "hunyuan_vl_text"
153
+ keys_to_ignore_at_inference = ["past_key_values"]
154
+
155
+
156
+ class HunYuanVLConfig(PretrainedConfig):
157
+ model_type = "hunyuan_vl"
158
+ sub_configs = {"vision_config": HunYuanVLVisionConfig, "text_config": HunYuanVLTextConfig}
159
+ keys_to_ignore_at_inference = ["past_key_values"]
160
+
161
+ def __init__(
162
+ self,
163
+ text_config=None,
164
+ vision_config=None,
165
+ im_start_id=120118,
166
+ im_end_id=120119,
167
+ image_token_id=120120,
168
+ im_newline_id=120121,
169
+ video_start_id=120122,
170
+ video_end_id=120123,
171
+ **kwargs,
172
+ ):
173
+ # We need to init super() here so that it does not reset values
174
+ # that are in text config to the BaseClass defaults. The Base
175
+ # config has many text related defaults and not all defaults are same as for `HunYuanVLTextConfig`
176
+ super().__init__(**kwargs)
177
+
178
+ if isinstance(vision_config, dict):
179
+ self.vision_config = self.sub_configs["vision_config"](**vision_config)
180
+ elif vision_config is None:
181
+ self.vision_config = self.sub_configs["vision_config"]()
182
+
183
+ if isinstance(text_config, dict):
184
+ self.text_config = self.sub_configs["text_config"](**text_config)
185
+ elif text_config is None:
186
+ # For BC use all kwargs to init `TextConfig`
187
+ self.text_config = self.sub_configs["text_config"](**kwargs)
188
+
189
+ self.image_token_id = image_token_id
190
+ self.im_start_id = im_start_id
191
+ self.im_end_id = im_end_id
192
+ self.im_newline_id = im_newline_id
193
+ self.video_start_id = video_start_id
194
+ self.video_end_id = video_end_id
195
+
196
+ self.vision_config.text_hidden_size = self.text_config.hidden_size
197
+
198
+ # Attention implementation to use. It sets it recursively on sub-configs so we call it again in the end
199
+ self._attn_implementation = kwargs.pop("attn_implementation", None)
200
+
201
+ def __setattr__(self, key, value):
202
+ if (
203
+ (text_config := super().__getattribute__("__dict__").get("text_config")) is not None
204
+ and key not in ["dtype", "_attn_implementation_internal"]
205
+ and key in text_config.__dict__
206
+ ):
207
+ setattr(text_config, key, value)
208
+ else:
209
+ super().__setattr__(key, value)
210
+
211
+ def __getattribute__(self, key):
212
+ if "text_config" in super().__getattribute__("__dict__") and key not in [
213
+ "_name_or_path",
214
+ "model_type",
215
+ "dtype",
216
+ "_attn_implementation_internal",
217
+ ]:
218
+ text_config = super().__getattribute__("text_config")
219
+ if key in text_config.__dict__:
220
+ return getattr(text_config, key)
221
+
222
+ return super().__getattribute__(key)
223
+
224
+
225
+ class HunYuanVisionMLP(nn.Module):
226
+ def __init__(self, config: HunYuanVLVisionConfig):
227
+ super().__init__()
228
+ self.config = config
229
+ self.hidden_size = config.hidden_size
230
+ self.intermediate_size = config.intermediate_size
231
+ self.act_fn = ACT2FN[config.hidden_act]
232
+ self.dense_h_to_4h = nn.Linear(self.hidden_size, self.intermediate_size, bias=True)
233
+ self.dense_4h_to_h = nn.Linear(self.intermediate_size, self.hidden_size, bias=True)
234
+
235
+ def forward(self, x):
236
+ intermediate = self.dense_h_to_4h(x)
237
+ intermediate = self.act_fn(intermediate)
238
+ output = self.dense_4h_to_h(intermediate)
239
+ return output
240
+
241
+
242
+ class HunYuanVLRMSNorm(LlamaRMSNorm):
243
+ pass
244
+
245
+ class HunYuanVLMLP(HunYuanDenseV1MLP):
246
+ pass
247
+
248
+ class HunYuanVisionPatchEmbed(nn.Module):
249
+ def __init__(self, config: HunYuanVLVisionConfig):
250
+ super().__init__()
251
+
252
+ self.config = config
253
+ self.embed_dim = config.hidden_size
254
+ self.patch_size = config.patch_size
255
+ self.num_channels = config.num_channels
256
+ self.spatial_merge_size = config.spatial_merge_size
257
+ self.interpolate_mode = config.interpolate_mode
258
+
259
+ self.patch_embedding = nn.Conv2d(
260
+ in_channels=config.num_channels,
261
+ out_channels=self.embed_dim,
262
+ kernel_size=self.patch_size,
263
+ stride=self.patch_size,
264
+ bias=True,
265
+ )
266
+
267
+ self.max_num_patches = (config.max_image_size // self.patch_size) ** 2
268
+ self.num_positions = self.max_num_patches + 1
269
+ self.position_edge = int(self.num_positions ** 0.5)
270
+ # first token is cls token, skip it
271
+ self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
272
+
273
+ self.patch_pos_embed = None
274
+
275
+ def forward(self, pixel_values: torch.Tensor, grid_thw: list[list[int]]) -> torch.Tensor:
276
+ num_patches, hidden_size = pixel_values.shape
277
+ pixel_values = pixel_values.reshape(num_patches, self.num_channels, self.patch_size, self.patch_size)
278
+
279
+ patch_embeds = self.patch_embedding(pixel_values)
280
+ patch_embeds = patch_embeds.squeeze(-1).squeeze(-1).unsqueeze(0)
281
+
282
+ if self.patch_pos_embed is None:
283
+ patch_pos_shape = (1, self.position_edge, self.position_edge, self.embed_dim)
284
+ self.patch_pos_embed = (
285
+ self.position_embedding.weight[1:, :].reshape(patch_pos_shape).permute(0, 3, 1, 2).float()
286
+ )
287
+
288
+ patch_pos_embed_list = []
289
+ for grid in grid_thw:
290
+ _, h0, w0 = grid
291
+ # we add a small number to avoid floating point error in the interpolation
292
+ # see discussion at https://github.com/facebookresearch/dino/issues/8
293
+ h0, w0 = h0 + 0.1, w0 + 0.1
294
+ patch_pos_embed = nn.functional.interpolate(
295
+ self.patch_pos_embed,
296
+ scale_factor=((h0 / self.position_edge).item(), (w0 / self.position_edge).item()),
297
+ mode=self.interpolate_mode,
298
+ align_corners=False,
299
+ )
300
+
301
+ patch_pos_embed = (
302
+ patch_pos_embed.reshape(self.embed_dim, -1).transpose(0, 1).unsqueeze(0).to(patch_embeds.dtype)
303
+ )
304
+ patch_pos_embed_list.append(patch_pos_embed)
305
+
306
+ patch_pos_embed = torch.cat(patch_pos_embed_list, dim=1)
307
+ embeddings = patch_embeds + patch_pos_embed
308
+
309
+ return embeddings
310
+
311
+
312
+ class HunYuanVisionPatchMerger(nn.Module):
313
+ def __init__(
314
+ self,
315
+ in_channels,
316
+ out_channels,
317
+ spatial_merge_size,
318
+ rms_norm_eps,
319
+ **kwargs,
320
+ ):
321
+ super().__init__()
322
+
323
+ embed_std = out_channels ** -0.5
324
+ self.spatial_merge_size = spatial_merge_size
325
+ self.proj = nn.Sequential(
326
+ nn.Conv2d(in_channels, in_channels * 2, kernel_size=spatial_merge_size, stride=spatial_merge_size),
327
+ nn.GELU(),
328
+ nn.Conv2d(in_channels * 2, in_channels * 4, kernel_size=1),
329
+ )
330
+ self.mlp = nn.Linear(in_channels * 4, out_channels)
331
+ self.image_newline = nn.Parameter(torch.randn(in_channels * 4) * embed_std)
332
+ self.image_begin = nn.Parameter(torch.randn(out_channels) * embed_std)
333
+ self.image_end = nn.Parameter(torch.randn(out_channels) * embed_std)
334
+ self.image_sep = nn.Parameter(torch.randn(out_channels) * embed_std)
335
+
336
+ self.before_rms = HunYuanVLRMSNorm(in_channels, eps=rms_norm_eps)
337
+ self.after_rms = HunYuanVLRMSNorm(out_channels, eps=rms_norm_eps)
338
+
339
+ def forward(self, x, size=(16, 16)):
340
+ x = self.before_rms(x)
341
+ h, w = size
342
+ dtype = x.dtype
343
+ x = x.permute(0, 2, 1).reshape(x.shape[0], -1, int(h.item()), int(w.item()))
344
+ x = self.proj(x) # b,c,h,w
345
+ b, c, h, w = x.shape
346
+ x = torch.cat(
347
+ [x, self.image_newline.reshape(1, c, 1, 1).expand(b, c, h, 1).to(dtype, non_blocking=True)], dim=-1
348
+ )
349
+ x = x.reshape(b, c, -1).permute(0, 2, 1)
350
+ x = self.mlp(x)
351
+
352
+ begin = self.image_begin.reshape(1, 1, -1).expand(b, 1, x.shape[-1]).to(dtype, non_blocking=True)
353
+ end = self.image_end.reshape(1, 1, -1).expand(b, 1, x.shape[-1]).to(dtype, non_blocking=True)
354
+ x = torch.cat([begin, x, end], dim=1)
355
+
356
+ return self.after_rms(x)
357
+
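A shape walkthrough of the merger above, using standalone tensor ops that mirror the module (toy channel sizes; the real model maps 1152 ViT channels to the 3072-wide text embedding):

```python
import torch
from torch import nn

b, c_in, h, w, c_out = 1, 8, 4, 6, 16
x = torch.randn(b, h * w, c_in)                          # ViT tokens for one image

x = x.permute(0, 2, 1).reshape(b, c_in, h, w)            # back onto the 2-D grid
proj = nn.Sequential(
    nn.Conv2d(c_in, c_in * 2, kernel_size=2, stride=2),  # 2x2 spatial merge
    nn.GELU(),
    nn.Conv2d(c_in * 2, c_in * 4, kernel_size=1),
)
x = proj(x)                                              # [b, 4*c_in, h//2, w//2]
newline = torch.randn(c_in * 4).reshape(1, -1, 1, 1).expand(b, c_in * 4, h // 2, 1)
x = torch.cat([x, newline], dim=-1)                      # newline column per row
x = nn.Linear(c_in * 4, c_out)(x.reshape(b, c_in * 4, -1).permute(0, 2, 1))
begin, end = torch.randn(b, 1, c_out), torch.randn(b, 1, c_out)
x = torch.cat([begin, x, end], dim=1)                    # begin/end sentinels
assert x.shape == (b, (h // 2) * (w // 2 + 1) + 2, c_out)
```

The final count, `patch_h * (patch_w + 1) + 2`, is exactly what the processor reserves per image (see `processing_hunyuan_vl.py` below).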
358
+
359
+ class HunYuanVisionAttention(nn.Module):
360
+ def __init__(self, config: HunYuanVLVisionConfig):
361
+ super().__init__()
362
+ self.config = config
363
+ self.is_causal = False # used in flash_attention
364
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
365
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
366
+ self.scaling = self.head_dim**-0.5
367
+ self.attention_dropout = config.attention_dropout
368
+ self.q_proj = nn.Linear(
369
+ config.hidden_size, config.num_attention_heads * self.head_dim, bias=True
370
+ )
371
+ self.k_proj = nn.Linear(
372
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=True
373
+ )
374
+ self.v_proj = nn.Linear(
375
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=True
376
+ )
377
+ self.o_proj = nn.Linear(
378
+ config.num_attention_heads * self.head_dim, config.hidden_size, bias=True
379
+ )
380
+
381
+ def forward(
382
+ self,
383
+ hidden_states: torch.Tensor,
384
+ position_embeddings: tuple[torch.Tensor, torch.Tensor],
385
+ position_ids: Optional[torch.LongTensor] = None,
386
+ attention_mask: Optional[torch.Tensor] = None,
387
+ past_key_values: Optional[Cache] = None,
388
+ cache_position: Optional[torch.LongTensor] = None,
389
+ **kwargs: Unpack[TransformersKwargs],
390
+ ) -> tuple[torch.Tensor, torch.Tensor]:
391
+ input_shape = hidden_states.shape[:-1]
392
+ hidden_shape = (*input_shape, -1, self.head_dim)
393
+
394
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
395
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
396
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
397
+
398
+ attention_interface: Callable = eager_attention_forward
399
+ if self.config._attn_implementation != "eager":
400
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
401
+
402
+ attn_output, attn_weights = attention_interface(
403
+ self,
404
+ query_states,
405
+ key_states,
406
+ value_states,
407
+ attention_mask,
408
+ dropout=0.0 if not self.training else self.attention_dropout,
409
+ scaling=self.scaling,
410
+ **kwargs,
411
+ )
412
+
413
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
414
+ attn_output = self.o_proj(attn_output)
415
+ return attn_output, attn_weights
416
+
417
+
418
+ class HunYuanVisionBlock(GradientCheckpointingLayer):
419
+ def __init__(self, config: HunYuanVLVisionConfig):
420
+ super().__init__()
421
+ self.hidden_size = config.hidden_size
422
+ self.self_attn = HunYuanVisionAttention(config)
423
+ self.mlp = HunYuanVisionMLP(config)
424
+ self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
425
+ self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
426
+
427
+ def forward(
428
+ self,
429
+ hidden_states: torch.Tensor,
430
+ attention_mask: Optional[torch.Tensor] = None,
431
+ position_ids: Optional[torch.LongTensor] = None,
432
+ past_key_values: Optional[Cache] = None,
433
+ use_cache: Optional[bool] = False,
434
+ cache_position: Optional[torch.LongTensor] = None,
435
+ position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
436
+ **kwargs: Unpack[TransformersKwargs],
437
+ ) -> torch.Tensor:
438
+ residual = hidden_states
439
+ hidden_states = self.input_layernorm(hidden_states)
440
+ # Self Attention
441
+ hidden_states, _ = self.self_attn(
442
+ hidden_states=hidden_states,
443
+ attention_mask=attention_mask,
444
+ position_ids=position_ids,
445
+ past_key_values=past_key_values,
446
+ use_cache=use_cache,
447
+ cache_position=cache_position,
448
+ position_embeddings=position_embeddings,
449
+ **kwargs,
450
+ )
451
+ hidden_states = residual + hidden_states
452
+
453
+ # Fully Connected
454
+ residual = hidden_states
455
+ hidden_states = self.post_attention_layernorm(hidden_states)
456
+ hidden_states = self.mlp(hidden_states)
457
+ hidden_states = residual + hidden_states
458
+ return hidden_states
459
+
460
+
461
+ class HunYuanVisionTransformer(nn.Module):
462
+ config: HunYuanVLVisionConfig
463
+ _no_split_modules = ["HunYuanVLVisionBlock"]
464
+
465
+ def __init__(self, config: HunYuanVLVisionConfig):
466
+ super().__init__()
467
+ self.config = config
468
+ self.embeddings = HunYuanVisionPatchEmbed(config)
469
+ self.layers = nn.ModuleList(
470
+ [HunYuanVisionBlock(config) for _ in range(config.num_hidden_layers)]
471
+ )
472
+ self.perceive = HunYuanVisionPatchMerger(
473
+ self.config.hidden_size,
474
+ self.config.text_hidden_size,
475
+ self.config.spatial_merge_size,
476
+ self.config.rms_norm_eps,
477
+ )
478
+
479
+ def get_activation_function(self, act_name: str):
480
+ act_map = {
481
+ "gelu": nn.GELU(),
482
+ "relu": nn.ReLU(),
483
+ "silu": nn.SiLU(),
484
+ }
485
+ return act_map.get(act_name.lower(), nn.GELU()) # default GELU
486
+
487
+ # @auto_docstring
488
+ def forward(
489
+ self,
490
+ x: torch.Tensor,
491
+ grid_thw: list[list[int]],
492
+ ) -> torch.Tensor:
494
+ r"""
495
+ grid_thw (`torch.LongTensor` of shape `(num_images, 3)`):
496
+ The temporal, height and width dimensions of feature shape for each image. Each row contains [t, h, w] values.
497
+ """
498
+ hidden_states = self.embeddings(x, grid_thw)
499
+ for layer in self.layers:
500
+ hidden_states = layer(hidden_states)
501
+
502
+ cu_seqlens: list = [0]
503
+ for t, h, w in grid_thw:
504
+ cu_seqlens.append((h * w).item())
505
+
506
+ cu_seqlens = torch.tensor(cu_seqlens, dtype=torch.int32)
507
+ cu_seqlens = torch.cumsum(cu_seqlens, dim=0, dtype=torch.int32)
508
+ split_lengths = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
509
+ split_items = torch.split(hidden_states, split_lengths, dim=1)
510
+
511
+ processed_items = []
512
+ for grid, item in zip(grid_thw, split_items):
513
+ t, h, w = grid
514
+ processed = self.perceive(item, size=(h, w))
515
+ processed_items.append(processed)
516
+
517
+ hidden_states = torch.cat(processed_items, dim=1)
518
+
519
+ return hidden_states
520
+
521
+
522
+ def apply_rotary_pos_emb_xdrope(q, k, cos, sin, position_ids, xdrope_section, output_size=None):
523
+ """Applies XD Rotary Position Embedding to the query and key tensors.
524
+
525
+ Args:
526
+ q (`torch.Tensor`): The query tensor.
527
+ k (`torch.Tensor`): The key tensor.
528
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
529
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
530
+ position_ids (`torch.Tensor`): The position IDs for the tokens.
531
+ xdrope_section (`list`): The section ratios for XD RoPE.
532
+ output_size (`tuple`, optional): The output size of the tensors. Defaults to None.
533
+ bf16 (bool, optional): Whether to use bfloat16 precision. Defaults to False.
534
+
535
+ Returns:
536
+ `tuple(torch.Tensor)`: The query and key tensors rotated using the XD Rotary Position Embedding.
537
+ """
538
+ x_dim = len(xdrope_section)
539
+ cos = cos[position_ids, ...].permute(0, 2, 1, 3).reshape(output_size[0], output_size[2], x_dim, -1).contiguous()
540
+ sin = sin[position_ids, ...].permute(0, 2, 1, 3).reshape(output_size[0], output_size[2], x_dim, -1).contiguous()
541
+
542
+ xdrope_section = xdrope_section * 2
543
+
544
+ # for xd concat
545
+ assert sum(xdrope_section) == cos.shape[-1], "Illegal partition for xd rope"
546
+ cos = torch.cat([m[:, :, i % x_dim, :] for i, m in enumerate(cos.split(xdrope_section, dim=-1))], dim=-1)
547
+ sin = torch.cat([m[:, :, i % x_dim, :] for i, m in enumerate(sin.split(xdrope_section, dim=-1))], dim=-1)
548
+
549
+ # for head repeat
550
+ cos = cos.view(output_size[0], 1, output_size[2], -1) # .repeat(1, output_size[1], 1, 1)
551
+ sin = sin.view(output_size[0], 1, output_size[2], -1) # .repeat(1, output_size[1], 1, 1)
552
+
553
+ origin_dtype = q.dtype
554
+ q, k = q.float(), k.float()
555
+ cos, sin = cos.float(), sin.float()
556
+ q_out, k_out = (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
557
+
558
+ return q_out.to(origin_dtype), k_out.to(origin_dtype)
559
+
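In the XD-RoPE above, each of the `x_dim` position streams (sequence index, image column, image row, time) rotates its own group of head channels: `cos`/`sin` are split along the channel axis by `xdrope_section` (doubled, because the caches hold the frequencies concatenated twice), and chunk `i` takes the stream `i % x_dim`. A toy illustration of the channel partition, with assumed sizes rather than the model's real section split:

```python
import torch

x_dim = 4
xdrope_section = [2, 2, 2, 2]            # per-half split; head_dim = 2 * sum = 16
seq, head_dim = 5, 16
cos = torch.randn(seq, x_dim, head_dim)  # one cos row per position stream

sections = xdrope_section * 2            # caches store (freqs, freqs) concatenated
chunks = cos.split(sections, dim=-1)     # 8 chunks of 2 channels each
mixed = torch.cat([m[:, i % x_dim, :] for i, m in enumerate(chunks)], dim=-1)
assert mixed.shape == (seq, head_dim)    # channels 0-1 follow stream 0, 2-3 stream 1, ...
```

This is why the function asserts `sum(xdrope_section) == cos.shape[-1]` after the doubling: the sections must tile the head dimension exactly.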
560
+
561
+ def apply_rotary_pos_emb(
562
+ q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, position_ids: Optional[torch.Tensor]=None, unsqueeze_dim: int=1):
563
+ """Applies Rotary Position Embedding to the query and key tensors.
564
+
565
+ Args:
566
+ q (`torch.Tensor`): The query tensor.
567
+ k (`torch.Tensor`): The key tensor.
568
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
569
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
570
+ position_ids (`torch.Tensor`, *optional*):
571
+ Optional position indices used to gather the `cos`/`sin` caches; if `None`, the caches are broadcast directly.
572
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
573
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
574
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
575
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
576
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
577
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
578
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
579
+ Returns:
580
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
581
+ """
582
+ if position_ids is not None:
583
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
584
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
585
+ else:
586
+ cos = cos.unsqueeze(0).unsqueeze(unsqueeze_dim)
587
+ sin = sin.unsqueeze(0).unsqueeze(unsqueeze_dim)
588
+ q_embed = (q * cos) + (rotate_half(q) * sin)
589
+ k_embed = (k * cos) + (rotate_half(k) * sin)
590
+ return q_embed, k_embed
591
+
592
+ class HunYuanVLRotaryEmbedding(nn.Module):
593
+ inv_freq: torch.Tensor # fix linting for `register_buffer`
594
+
595
+ def __init__(self, config: HunYuanVLConfig, device=None):
596
+ super().__init__()
597
+ # BC: "rope_type" was originally "type"
598
+ if hasattr(config, "rope_scaling") and isinstance(config.rope_scaling, dict):
599
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
600
+ else:
601
+ self.rope_type = "default"
602
+ self.max_seq_len_cached = config.max_position_embeddings
603
+ self.original_max_seq_len = config.max_position_embeddings
604
+
605
+ self.config = config
606
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type if self.rope_type != "xdrope" else "dynamic"]
607
+ if self.rope_type in ["xdrope", "dynamic"] and config.rope_scaling["alpha"]:
608
+ # DynamicNTKAlphaRotary
609
+ self.dim = config.head_dim
610
+ base = config.rope_theta * config.rope_scaling.get("alpha") ** (self.dim / (self.dim - 2))
611
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
612
+ self.attention_scaling = 1.0
613
+ else:
614
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
615
+
616
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
617
+ self.original_inv_freq = self.inv_freq
618
+ self._set_cos_sin_cache(
619
+ seq_len=config.max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
620
+ )
621
+
622
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
623
+ self.max_seq_len_cached = seq_len
624
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
625
+ freqs = torch.outer(t, self.inv_freq)
626
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
627
+ emb = torch.cat((freqs, freqs), dim=-1).float()
628
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
629
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
630
+
631
+ def forward(self, x, seq_len: Optional[int]=None):
632
+ # x: [bs, num_attention_heads, seq_len, head_size]
633
+ if seq_len is not None and seq_len > self.max_seq_len_cached:
634
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
635
+
636
+ return (
637
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
638
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
639
+ )
640
+
641
+
642
+ class HunYuanVLAttention(nn.Module):
643
+
644
+ def __init__(self, config, layer_idx: int):
645
+ super().__init__()
646
+ self.config = config
647
+ self.layer_idx = layer_idx
648
+ self.is_causal = True # used in flash_attention
649
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
650
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
651
+ self.scaling = self.head_dim**-0.5
652
+ self.attention_dropout = config.attention_dropout
653
+ self.q_proj = nn.Linear(
654
+ config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
655
+ )
656
+ self.k_proj = nn.Linear(
657
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
658
+ )
659
+ self.v_proj = nn.Linear(
660
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
661
+ )
662
+ self.o_proj = nn.Linear(
663
+ config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
664
+ )
665
+
666
+ self.query_layernorm = HunYuanVLRMSNorm(self.head_dim, eps=config.rms_norm_eps)
667
+ self.key_layernorm = HunYuanVLRMSNorm(self.head_dim, eps=config.rms_norm_eps)
668
+
669
+ self.rotary_emb = HunYuanVLRotaryEmbedding(config=config)
670
+ self.xdrope_section = config.rope_scaling['xdrope_section']
671
+
672
+ @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
673
+ def forward(
674
+ self,
675
+ hidden_states: torch.Tensor,
676
+ position_embeddings: tuple[torch.Tensor, torch.Tensor],
677
+ position_ids: Optional[torch.LongTensor] = None,
678
+ attention_mask: Optional[torch.Tensor] = None,
679
+ past_key_values: Optional[Cache] = None,
680
+ cache_position: Optional[torch.LongTensor] = None,
681
+ **kwargs: Unpack[TransformersKwargs],
682
+ ) -> tuple[torch.Tensor, torch.Tensor]:
683
+ input_shape = hidden_states.shape[:-1]
684
+ hidden_shape = (*input_shape, -1, self.head_dim)
685
+
686
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
687
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
688
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
689
+
690
+ kv_seq_len = key_states.shape[-2]
691
+ origin_kv_seq_len = key_states.shape[-2]
692
+ if past_key_values is not None:
693
+ kv_seq_len += past_key_values.get_seq_length(self.layer_idx)
694
+
695
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
696
+ if self.xdrope_section is not None:
697
+ if past_key_values is None or past_key_values.get_seq_length() == 0:
698
+ output_size = (
699
+ query_states.size(0),
700
+ query_states.size(1),
701
+ query_states.size(2),
702
+ key_states.size(2),
703
+ )
704
+ query_states, key_states = apply_rotary_pos_emb_xdrope(
705
+ query_states, key_states, cos, sin, position_ids, self.xdrope_section, output_size
706
+ )
707
+ else:
708
+ position_ids = (
709
+ torch.ones(position_ids.shape[0], 1, dtype=torch.long, device=position_ids.device)
710
+ * past_key_values.get_seq_length()
711
+ )
712
+ cos, sin = cos[-origin_kv_seq_len:, :], sin[-origin_kv_seq_len:, :]
713
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
714
+ else:
715
+ position_ids = torch.ones(
716
+ position_ids.shape[0], 1, dtype=torch.long, device=position_ids.device
717
+ ) * past_key_values.get_seq_length(self.layer_idx)
718
+ cos, sin = cos[-origin_kv_seq_len:, :], sin[-origin_kv_seq_len:, :]
719
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
720
+
721
+ query_states = self.query_layernorm(query_states)
722
+ key_states = self.key_layernorm(key_states)
723
+
724
+ if past_key_values is not None:
725
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
726
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
727
+ key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
728
+
729
+ attention_interface: Callable = eager_attention_forward
730
+ if self.config._attn_implementation != "eager":
731
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
732
+
733
+ attn_output, attn_weights = attention_interface(
734
+ self,
735
+ query_states,
736
+ key_states,
737
+ value_states,
738
+ attention_mask,
739
+ dropout=0.0 if not self.training else self.attention_dropout,
740
+ scaling=self.scaling,
741
+ **kwargs,
742
+ )
743
+
744
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
745
+ attn_output = self.o_proj(attn_output)
746
+ return attn_output, attn_weights
747
+
748
+ class HunYuanVLDecoderLayer(LlamaDecoderLayer):
749
+ def __init__(
750
+ self,
751
+ config: Union[HunYuanVLVisionConfig, HunYuanVLTextConfig],
752
+ layer_idx: int):
753
+ super().__init__(config, layer_idx)
754
+ self.layer_idx = layer_idx
755
+ if config.norm_type == 'hf_rms' or config.norm_type == 'rms':
756
+ self.input_layernorm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
757
+ self.post_attention_layernorm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
758
+ elif config.norm_type == 'fused' or config.norm_type == 'torch_nn':
759
+ self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
760
+ self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
761
+ else:
762
+ assert False, "other norm_type are not supported"
763
+
764
+
765
+ class HunYuanVLPreTrainedModel(LlamaPreTrainedModel):
766
+ def _init_weights(self, module):
767
+ std = self.config.initializer_range
768
+ if isinstance(module, nn.Linear):
769
+ module.weight.data.normal_(mean=0.0, std=std)
770
+ if module.bias is not None:
771
+ module.bias.data.zero_()
772
+ elif isinstance(module, nn.Embedding):
773
+ module.weight.data.normal_(mean=0.0, std=std)
774
+ if module.padding_idx is not None:
775
+ module.weight.data[module.padding_idx].zero_()
776
+
777
+
778
+ @auto_docstring
779
+ class HunYuanVLModel(HunYuanVLPreTrainedModel):
780
+ def __init__(self, config: Union[HunYuanVLConfig, HunYuanVLTextConfig]):
781
+ super().__init__(config)
782
+ self.padding_idx = config.pad_token_id
783
+ self.vocab_size = config.vocab_size
784
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
785
+ self.layers = nn.ModuleList(
786
+ [HunYuanVLDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
787
+ )
788
+ self.norm = HunYuanVLRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
789
+ self.gradient_checkpointing = False
790
+ self.post_init()
791
+
792
+ @check_model_inputs
793
+ # @auto_docstring # TODO Fix this
794
+ def forward(
795
+ self,
796
+ input_ids: Optional[torch.LongTensor] = None,
797
+ attention_mask: Optional[torch.Tensor] = None,
798
+ position_ids: Optional[torch.LongTensor] = None,
799
+ past_key_values: Optional[Cache] = None,
800
+ inputs_embeds: Optional[torch.FloatTensor] = None,
801
+ cache_position: Optional[torch.LongTensor] = None,
802
+ use_cache: Optional[bool] = None,
803
+ **kwargs: Unpack[TransformersKwargs],
804
+ ) -> BaseModelOutputWithPast:
805
+ if (input_ids is None) ^ (inputs_embeds is not None):
806
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
807
+
808
+ if inputs_embeds is None:
809
+ inputs_embeds: torch.Tensor = self.embed_tokens(input_ids)
810
+
811
+ if use_cache and past_key_values is None:
812
+ past_key_values = DynamicCache(config=self.config)
813
+
814
+ if cache_position is None:
815
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
816
+ cache_position: torch.Tensor = torch.arange(
817
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
818
+ )
819
+
820
+ if position_ids is None:
821
+ position_ids = cache_position.unsqueeze(0)
822
+
823
+ causal_mask = create_causal_mask(
824
+ config=self.config,
825
+ input_embeds=inputs_embeds,
826
+ attention_mask=attention_mask,
827
+ cache_position=cache_position,
828
+ past_key_values=past_key_values,
829
+ position_ids=position_ids,
830
+ )
831
+ hidden_states = inputs_embeds
832
+ for decoder_layer in self.layers[: self.config.num_hidden_layers]:
833
+ hidden_states = decoder_layer(
834
+ hidden_states,
835
+ attention_mask=causal_mask,
836
+ position_ids=position_ids,
837
+ past_key_values=past_key_values,
838
+ cache_position=cache_position,
839
+ **kwargs,
840
+ )
841
+
842
+ hidden_states = self.norm(hidden_states)
843
+ return BaseModelOutputWithPast(
844
+ last_hidden_state=hidden_states,
845
+ past_key_values=past_key_values,
846
+ )
847
+
848
+ class HunYuanVLForCausalLM(LlamaForCausalLM):
849
+ pass
850
+
851
+ class HunYuanVLForConditionalGeneration(HunYuanVLPreTrainedModel, GenerationMixin):
852
+ _tied_weights_keys = ["lm_head.weight"]
853
+ config: HunYuanVLConfig
854
+
855
+ def __init__(self, config: HunYuanVLConfig):
856
+ super().__init__(config)
857
+ self.model = HunYuanVLModel(config)
858
+ self.vocab_size = config.vocab_size
859
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
860
+ self.vit = HunYuanVisionTransformer(config.vision_config)
861
+ self.config = config
862
+ self.post_init()
863
+
864
+ def set_decoder(self, decoder):
865
+ self.model = decoder
866
+
867
+ def get_decoder(self):
868
+ return self.model
869
+
870
+ @can_return_tuple
871
+ @auto_docstring
872
+ def forward(
873
+ self,
874
+ input_ids: Optional[torch.LongTensor] = None,
875
+ attention_mask: Optional[torch.Tensor] = None,
876
+ position_ids: Optional[torch.LongTensor] = None,
877
+ past_key_values: Optional[Cache] = None,
878
+ inputs_embeds: Optional[torch.FloatTensor] = None,
879
+ labels: Optional[torch.LongTensor] = None,
880
+ use_cache: Optional[bool] = None,
881
+ cache_position: Optional[torch.LongTensor] = None,
882
+ logits_to_keep: Union[int, torch.Tensor] = 0,
883
+ **kwargs: Unpack[TransformersKwargs],
884
+ ) -> CausalLMOutputWithPast:
885
+ r"""
886
+ Example:
887
+
888
+ ```python
889
+ >>> from transformers import AutoProcessor, HunYuanVLForConditionalGeneration
890
+ >>> from PIL import Image
891
+ >>> import torch
892
+
893
+ >>> model_name_or_path = "tencent/HunyuanOCR"
894
+ >>> processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False)
895
+ >>> model = HunYuanVLForConditionalGeneration.from_pretrained(
896
+ ... model_name_or_path,
897
+ ... attn_implementation="eager",
898
+ ... torch_dtype=torch.bfloat16,
899
+ ... device_map="auto",
900
+ ... )
901
+
902
+ >>> img_path = "path/to/your/image.jpg"
903
+ >>> image = Image.open(img_path).convert("RGB")
904
+
905
+ >>> messages = [
906
+ ... {
907
+ ... "role": "user",
908
+ ... "content": [
909
+ ... {"type": "image", "image": img_path},
910
+ ... {"type": "text", "text": "Extract the text from the image."},
911
+ ... ],
912
+ ... }
913
+ ... ]
914
+ >>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
915
+ >>> inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to(model.device)
916
+
917
+ >>> with torch.no_grad():
918
+ ... generated_ids = model.generate(**inputs, max_new_tokens=1024)
919
+ >>> generated_ids_trimmed = generated_ids[0][len(inputs["input_ids"][0]):]
920
+ >>> output = processor.decode(generated_ids_trimmed, skip_special_tokens=True)
921
+
922
+ >>> print(output)
923
+
924
+ ```"""
925
+ outputs: BaseModelOutputWithPast = self.model(
926
+ input_ids=input_ids,
927
+ attention_mask=attention_mask,
928
+ position_ids=position_ids,
929
+ past_key_values=past_key_values,
930
+ inputs_embeds=inputs_embeds,
931
+ use_cache=use_cache,
932
+ cache_position=cache_position,
933
+ **kwargs,
934
+ )
935
+
936
+ hidden_states = outputs.last_hidden_state
937
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
938
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
939
+ logits = self.lm_head(hidden_states[:, slice_indices, :])
940
+
941
+ loss = None
942
+ if labels is not None:
943
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
944
+
945
+ return CausalLMOutputWithPast(
946
+ loss=loss,
947
+ logits=logits,
948
+ past_key_values=outputs.past_key_values,
949
+ hidden_states=outputs.hidden_states,
950
+ attentions=outputs.attentions,
951
+ )
952
+
953
+ # def prepare_inputs_for_generation(
954
+ # self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
955
+ # ):
956
+ # inputs = super().prepare_inputs_for_generation(
957
+ # input_ids,
958
+ # past_key_values=past_key_values,
959
+ # attention_mask=attention_mask,
960
+ # inputs_embeds=inputs_embeds,
961
+ # **kwargs,
962
+ # )
963
+ # return inputs
964
+
965
+ @torch.no_grad()
966
+ def generate(
967
+ self,
968
+ input_ids: Optional[torch.Tensor] = None,
969
+ attention_mask: Optional[torch.Tensor] = None,
970
+ position_ids: Optional[torch.LongTensor] = None,
971
+ imgs: Optional[list[torch.FloatTensor]] = None,
972
+ imgs_pos: Optional[list[int]] = None,
973
+ token_type_ids: Optional[torch.LongTensor] = None,
974
+ pixel_values: Optional[torch.FloatTensor] = None,
975
+ image_grid_thw: Optional[list[int]] = None,
976
+ **kwargs,
977
+ ) -> torch.LongTensor:
978
+ if "inputs_embeds" in kwargs:
979
+ raise NotImplementedError("`inputs_embeds` is not supported")
980
+
981
+ inputs_embeds = self.model.embed_tokens(input_ids)
982
+
983
+ if self.vit is not None and pixel_values is not None:
984
+ pixel_values = pixel_values.to(torch.bfloat16)
985
+ image_embeds = self.vit(pixel_values, image_grid_thw)
986
+
987
+ # The ViT may sit on different GPUs than the language model because of accelerate's auto device mapping.
988
+ image_embeds = image_embeds.to(input_ids.device, non_blocking=True)
989
+
990
+ image_mask, _ = self.get_placeholder_mask(
991
+ input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds
992
+ )
993
+ inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
994
+
995
+ return super().generate(
996
+ inputs=input_ids,
997
+ position_ids=position_ids,
998
+ attention_mask=attention_mask,
999
+ inputs_embeds=inputs_embeds,
1000
+ # eos_token_id=self.config.eod_token_id,
1001
+ **kwargs,
1002
+ )
1003
+
1004
+ # Copied from transformers.models.llava.modeling_llava.LlavaModel.get_placeholder_mask
1005
+ def get_placeholder_mask(
1006
+ self,
1007
+ input_ids: torch.LongTensor,
1008
+ inputs_embeds: torch.FloatTensor,
1009
+ image_features: Optional[torch.FloatTensor] = None
1010
+ ):
1011
+ """
1012
+ Obtains multimodal placeholder mask from `input_ids` or `inputs_embeds`, and checks that the placeholder token count is
1013
+ equal to the length of multimodal features. If the lengths are different, an error is raised.
1014
+ """
1015
+ if input_ids is None:
1016
+ special_image_mask = inputs_embeds == self.get_input_embeddings()(
1017
+ torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
1018
+ )
1019
+ special_image_mask = special_image_mask.all(-1)
1020
+ else:
1021
+ special_image_mask = input_ids == self.config.image_token_id
1022
+
1023
+ n_image_tokens = special_image_mask.sum()
1024
+ special_image_mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
1025
+ if image_features is not None and inputs_embeds[special_image_mask].numel() != image_features.numel():
1026
+ raise ValueError(
1027
+ f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {image_features.shape[0]}"
1028
+ )
1029
+
1030
+ return special_image_mask, None
1031
+
1032
+
1033
+ __all__ = [
1034
+ "HunYuanVLConfig",
1035
+ "HunYuanVLVisionConfig",
1036
+ "HunYuanVLTextConfig",
1037
+ "HunYuanVLForConditionalGeneration",
1038
+ "HunYuanVLForCausalLM",
1039
+ "HunYuanVLModel",
1040
+ "HunYuanVLPreTrainedModel",
1041
+ "HunYuanVLTextModel"
1042
+ ]
preprocessor_config.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "min_pixels": 262144,
3
+ "max_pixels": 4194304,
4
+ "patch_size": 16,
5
+ "resample": 1,
6
+ "temporal_patch_size": 1,
7
+ "merge_size": 2,
8
+ "image_mean": [
9
+ 0.48145466,
10
+ 0.4578275,
11
+ 0.40821073
12
+ ],
13
+ "image_std": [
14
+ 0.26862954,
15
+ 0.26130258,
16
+ 0.27577711
17
+ ],
18
+ "image_processor_type": "HunYuanVLImageProcessor",
19
+ "processor_class": "HunYuanVLProcessor",
20
+ "auto_map": {
21
+ "AutoProcessor": "processing_hunyuan_vl.HunYuanVLProcessor",
22
+ "AutoImageProcessor": "image_processing_hunyuan_vl.HunYuanVLImageProcessor"
23
+ }
24
+ }
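These numbers are mutually consistent: `min_pixels = 262144 = 512²` and `max_pixels = 4194304 = 2048²` match the vision config's `min_image_size`/`max_image_size`, and with `patch_size = 16` and `merge_size = 2` each merged token covers a 32×32 pixel area. A quick arithmetic check, assuming a square maximum-size input:

```python
patch_size, merge_size = 16, 2
side = 2048                                 # sqrt(max_pixels)
grid = side // (patch_size * merge_size)    # 64 merged patches per side
assert grid * grid == 4096                  # == img_max_token_num grid tokens
assert grid * (grid + 1) + 2 == 4162        # with per-row newline + begin/end tokens
```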
processing_hunyuan_vl.py ADDED
@@ -0,0 +1,194 @@
1
+ import os
2
+ from typing import Union
3
+ import torch
4
+ import numpy as np
5
+
6
+ from transformers.feature_extraction_utils import BatchFeature
7
+ from transformers.image_utils import ImageInput
8
+ from transformers.video_utils import VideoInput
9
+ from transformers.processing_utils import ProcessorMixin
10
+ from transformers.tokenization_utils_base import PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
11
+ from transformers.utils import TensorType, logging
12
+
13
+
14
+ logger = logging.get_logger(__name__)
15
+
16
+
17
+ class HunYuanVLProcessor(ProcessorMixin):
18
+ attributes = ['image_processor', 'tokenizer']
19
+ valid_kwargs = ["chat_template"]
20
+ image_processor_class = "AutoImageProcessor"
21
+ tokenizer_class = "AutoTokenizer" # ("AutoTokenizer", None)
22
+
23
+ def __init__(self, image_processor=None, tokenizer=None, video_processor=None, chat_template=None, **kwargs):
24
+ # TODO Fix the init
25
+ self.tokenizer = tokenizer
26
+ self.image_token_id = 120120 # self.tokenizer.image_token_id
27
+ self.image_token = self.tokenizer.convert_ids_to_tokens(self.image_token_id)
28
+ self.im_start_token_id = 120118 # self.tokenizer.im_start_id
29
+ self.im_start_token = self.tokenizer.convert_ids_to_tokens(self.im_start_token_id)
30
+ self.im_end_token_id = 120119 # self.tokenizer.im_end_id
31
+ self.im_end_token = self.tokenizer.convert_ids_to_tokens(self.im_end_token_id)
32
+ self.placeholder_token = self.tokenizer.convert_ids_to_tokens(self.tokenizer.vocab_size - 1)
33
+ self.pad_id = 120002 #self.tokenizer.pad_token_id
34
+
35
+ super().__init__(image_processor, tokenizer, video_processor, chat_template=chat_template)
36
+
37
+ def __call__(
38
+ self,
39
+ images: ImageInput = None,
40
+ text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]] = None,
41
+ videos: VideoInput = None,
42
+ **kwargs
43
+ ) -> BatchFeature:
44
+ image_inputs, videos_inputs = {}, {}
45
+ if images is not None:
46
+ image_inputs = self.image_processor(images=images)
47
+ image_grid_thw = image_inputs["image_grid_thw"]
48
+
49
+ if not isinstance(text, list):
50
+ text = [text]
51
+
52
+ text = text.copy() # below lines change text in-place
53
+
54
+ image_tokens_cumsum = [0]
55
+ if images is not None:
56
+ index = 0
57
+ for i in range(len(text)):
58
+ while self.image_token in text[i]:
59
+ grid_h, grid_w = image_grid_thw[index][-2:]
60
+ patch_h = grid_h // self.image_processor.merge_size
61
+ patch_w = grid_w // self.image_processor.merge_size
62
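+ # one image-newline token per merged row, plus the begin/end tokens appended by HunYuanVisionPatchMerger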
+ num_image_tokens = patch_h * (patch_w + 1) + 2
63
+ image_tokens_cumsum.append(image_tokens_cumsum[-1] + num_image_tokens)
64
+ # text[i] = text[i].replace(self.image_token, self.im_start_token + self.placeholder_token * num_image_tokens + self.im_end_token, 1)
65
+ text[i] = text[i].replace(self.image_token, self.placeholder_token * num_image_tokens, 1)
66
+ index += 1
67
+ text[i] = text[i].replace(self.placeholder_token, self.image_token)
68
+ # text[i] = self.tokenizer.bos_token + text[i]
69
+
70
+ text_inputs = self.tokenizer(text, add_special_tokens=False, **kwargs)
71
+ self._check_special_mm_tokens(text, text_inputs, modalities=["image"])
72
+
73
+ input_ids = text_inputs['input_ids']
74
+ position_ids = torch.arange(len(input_ids[0]))
75
+ position_ids_w = torch.arange(len(input_ids[0]))
76
+ position_ids_h = torch.arange(len(input_ids[0]))
77
+ position_ids_t = torch.arange(len(input_ids[0]))
78
+
79
+ if images is not None:
80
+ image_token_pos_indices = torch.where(input_ids[0] == self.image_token_id)[0]
81
+ for i in range(len(image_grid_thw)):
82
+ grid_h, grid_w = image_grid_thw[i][-2:]
83
+ patch_h = grid_h // self.image_processor.merge_size
84
+ patch_w = grid_w // self.image_processor.merge_size
85
+ start_pos = image_token_pos_indices[image_tokens_cumsum[i]].item() + 1
86
+ replace_num = (patch_w + 1) * patch_h
87
+ position_ids_w[start_pos: start_pos + replace_num] = torch.tensor(list(range(patch_w + 1)) * patch_h, dtype=torch.int64)
88
+ patch_h_list = []
89
+ for h in range(patch_h):
90
+ patch_h_list += [h] * (patch_w+1)
91
+ position_ids_h[start_pos: start_pos + replace_num] = torch.tensor(patch_h_list, dtype=torch.int64)
92
+ position_ids_t[start_pos: start_pos + replace_num] = 0
93
+
94
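+ # four position streams (sequence index, image column, image row, time) consumed by apply_rotary_pos_emb_xdrope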
+ position_ids = torch.stack([position_ids, position_ids_w, position_ids_h, position_ids_t]).unsqueeze(0)
95
+ text_inputs['position_ids'] = position_ids
96
+
97
+ attention_mask = input_ids.ne(self.pad_id)
98
+ text_inputs["attention_mask"] = attention_mask
99
+ text_inputs["imgs_pos"] = [self.get_imgs_pos(input_ids)]
100
+ # image_inputs["imgs"] = [[image_inputs["pixel_values"]]]
101
+
102
+ return_tensors = kwargs.pop("return_tensors", None)
103
+ return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs}, tensor_type=return_tensors)
104
+
105
+ def batch_decode(self, *args, **kwargs):
106
+ return self.tokenizer.batch_decode(*args, **kwargs)
107
+
108
+ def decode(self, *args, **kwargs):
109
+ return self.tokenizer.decode(*args, **kwargs)
110
+
111
+ def post_process_image_text_to_text(
112
+ self, generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False, **kwargs
113
+ ):
114
+ raise NotImplementedError
115
+
116
+ def apply_chat_template(self, *args, **kwargs):
117
+ token_ids = self.tokenizer.apply_chat_template(*args, **kwargs)
118
+ return token_ids
119
+
120
+ def get_imgs_pos(self, doc_ids):
121
+ doc_ids = np.array(doc_ids, dtype=np.int64)
122
+ img_begin_index = np.where(doc_ids == self.im_start_token_id)[0]
123
+ img_end_index = np.where(doc_ids == self.im_end_token_id)[0]
124
+ imgs_pos = np.concatenate((np.reshape(img_begin_index + 1, (-1, 1)), np.reshape(img_end_index, (-1, 1))), axis=-1).tolist()
125
+ return imgs_pos
126
+
127
+ @property
128
+ def model_input_names(self):
129
+ tokenizer_input_names = self.tokenizer.model_input_names
130
+ image_processor_input_names = self.image_processor.model_input_names
131
+ return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
132
+
133
+
134
+ def split_image_into_patch_blocks(
135
+ pixel_values: torch.Tensor, # shape: [batch_size, 3, H, W]
136
+ patch_size: int = 16, # e.g. 16
137
+ adaptor_patch_div: int = 4, # e.g. 4 --> each patch_size patch is split into 4x4 sub-regions, i.e. patch_size // 4
138
+ ) -> torch.Tensor:
139
+ """
140
+ Split the input image tensor (supporting batch) into large patches of size `patch_size`,
141
+ and then further divide each large patch into smaller regions of size
142
+ (patch_size // adaptor_patch_div) x (patch_size // adaptor_patch_div).
143
+ Each small region is extracted as a tensor of shape [3, patch_size, patch_size].
144
+ The final output contains all such small region tensors.
145
+
146
+ Args:
147
+ pixel_values: Input image tensor of shape [batch_size, 3, H, W].
148
+ patch_size: Size of the large patch, e.g., 16.
149
+ adaptor_patch_div: Each large patch is divided into
150
+ (patch_size // adaptor_patch_div) x (patch_size // adaptor_patch_div)
151
+ smaller regions.
152
+
153
+ Returns:
154
+ patches: A tensor of shape [N, 3, patch_size, patch_size],
155
+ where N = batch_size * (H // patch_size) * (W // patch_size).
156
+ Each element in the batch corresponds to one small image region.
157
+ """
158
+ batch_size, channels, height, width = pixel_values.shape
159
+ assert channels == 3, "Pixel values must have 3 channels in dim=1"
160
+ assert height % patch_size == 0 and width % patch_size == 0, "H and W must be divisible by patch_size"
161
+
162
+ patch_height_num = height // patch_size
163
+ patch_width_num = width // patch_size
164
+ small_regions_per_patch = (patch_size // adaptor_patch_div) ** 2
165
+
166
+ # Reshape to [B, 3, ph, ps, pw, ps]
167
+ img = pixel_values.reshape(
168
+ batch_size, 3,
169
+ patch_height_num, patch_size,
170
+ patch_width_num, patch_size
171
+ )
172
+
173
+ # Further split each psxps patch into (ps//aps)x(ps//aps) small regions
174
+ img = img.reshape(
175
+ batch_size, 3,
176
+ patch_height_num,
177
+ patch_size // adaptor_patch_div, # ps // aps
178
+ adaptor_patch_div,
179
+ patch_width_num,
180
+ patch_size // adaptor_patch_div, # ps // aps
181
+ adaptor_patch_div
182
+ )
183
+
184
+ # Permute to group the small regions: [B, ph, pw, ps//aps, ps//aps, 3, aps, aps]
185
+ img = img.permute(0, 2, 5, 3, 6, 1, 4, 7)
186
+
187
+ # Reshape into [B * ph * pw * (ps//aps)^2, 3, patch_size, patch_size]
188
+ patches = img.reshape(-1, 3, patch_size, patch_size)
189
+
190
+ return patches
191
+
192
+
193
+
194
+ __all__ = ["HunYuanVLProcessor"]
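`get_imgs_pos` above pairs each image-start/image-end marker into an exclusive span of image-token positions. A toy check (hypothetical token ids, using the hard-coded 120118/120119 marker ids):

```python
import numpy as np

im_start, im_end = 120118, 120119
doc_ids = np.array([1, im_start, 7, 7, 7, im_end, 2])
begin = np.where(doc_ids == im_start)[0]
end = np.where(doc_ids == im_end)[0]
spans = np.concatenate((begin.reshape(-1, 1) + 1, end.reshape(-1, 1)), axis=-1).tolist()
assert spans == [[2, 5]]   # image tokens occupy positions 2..4
```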
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "bos_token": "<|hy_begin▁of▁sentence|>",
3
+ "eos_token": "<|hy_place▁holder▁no▁2|>",
4
+ "pad_token": "<|hy_▁pad▁|>"
5
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff