alexmarques committed on
Commit 0b7ec27 · verified · 1 Parent(s): 4a02336

Add files using upload-large-folder tool
README.md ADDED
@@ -0,0 +1,297 @@
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ tags:
6
+ - multimodal
7
+ - vision-language
8
+ - reasoning
9
+ - math
10
+ - ocr
11
+ - gui-grounding
12
+ - computer-use
13
+ - chain-of-thought
14
+ base_model: microsoft/Phi-4-reasoning
15
+ pipeline_tag: image-text-to-text
16
+ model-index:
17
+ - name: Phi-4-Reasoning-Vision-15B
18
+ results:
19
+ - task:
20
+ type: visual-question-answering
21
+ dataset:
22
+ name: AI2D
23
+ type: ai2d
24
+ metrics:
25
+ - type: accuracy
26
+ value: 84.8
27
+ - task:
28
+ type: visual-question-answering
29
+ dataset:
30
+ name: ChartQA
31
+ type: chartqa
32
+ metrics:
33
+ - type: accuracy
34
+ value: 83.3
35
+ - task:
36
+ type: visual-question-answering
37
+ dataset:
38
+ name: MathVista (MINI)
39
+ type: mathvista
40
+ metrics:
41
+ - type: accuracy
42
+ value: 75.2
43
+ - task:
44
+ type: visual-question-answering
45
+ dataset:
46
+ name: MMMU
47
+ type: mmmu
48
+ metrics:
49
+ - type: accuracy
50
+ value: 54.3
51
+ - task:
52
+ type: visual-question-answering
53
+ dataset:
54
+ name: OCRBench
55
+ type: ocrbench
56
+ metrics:
57
+ - type: accuracy
58
+ value: 76.0
59
+ - task:
60
+ type: visual-question-answering
61
+ dataset:
62
+ name: ScreenSpot-V2
63
+ type: screenspot-v2
64
+ metrics:
65
+ - type: accuracy
66
+ value: 88.2
67
+ ---
68
+
69
+ # Phi-4-Reasoning-Vision-15B
70
+
71
+ [![Microsoft](https://img.shields.io/badge/Microsoft-Project-0078D4?logo=microsoft)](https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/)
72
+ [![Foundry](https://img.shields.io/badge/Azure-Foundry-0089D6)](https://aka.ms/Phi-4-r-v-foundry)
73
+ [![Github](https://img.shields.io/badge/Github-181717?logo=github&logoColor=white)](https://github.com/microsoft/phi-4-reasoning-vision-15B)
74
+ [![Paper](https://img.shields.io/badge/Paper-2511.19663-red)](https://aka.ms/Phi-4-reasoning-vision-15B-TR)
75
+
76
+ [Official Microsoft Blog](https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/)<br>
77
+ [Technical Report](https://aka.ms/Phi-4-reasoning-vision-15B-TR)<br>
78
+ [Github](https://github.com/microsoft/phi-4-reasoning-vision-15B)<br>
79
+ [Try Phi-4-Reasoning-Vision-15B on Microsoft Foundry](https://aka.ms/Phi-4-r-v-foundry)<br>
80
+
81
+ **Developer:** Microsoft Corporation
82
+ **Authorized Representative:** Microsoft Ireland Operations Limited, 70 Sir John Rogerson's Quay, Dublin 2, D02 R296, Ireland
83
+ **Release Date:** March 4, 2026
84
+ **License:** [MIT](https://opensource.org/licenses/MIT)
85
+ **Parameters:** 15B
86
+ **Context Length:** 16,384 tokens
87
+ **Inputs:** Text and Images
88
+ **Outputs:** Text
89
+ **Training GPUs:** 240 B200s
90
+ **Training Time:** 4 days
91
+ **Training Dates:** February 3, 2025 – February 21, 2026
92
+ **Model Dependencies:** [Phi-4-Reasoning](https://huggingface.co/microsoft/Phi-4-reasoning)
93
+
94
+ ---
95
+
96
+ ## 1. Model Overview
97
+
98
+ Phi-4-Reasoning-Vision-15B is a compact open-weight multimodal reasoning model built on the Phi-4-Reasoning language model backbone and the SigLIP-2 vision encoder, using a mid-fusion architecture. In this architecture, the vision encoder first converts images into visual tokens, which are then projected into the language model's embedding space and injected into the pretrained language model. This approach leverages the strengths of both pretrained components while keeping training and inference costs manageable. The model employs a dynamic resolution vision encoder with up to 3,600 visual tokens, enabling high-resolution image understanding critical for tasks such as GUI grounding and fine-grained document analysis. Bidirectional attention is applied within images (intra-image) to improve spatial reasoning without the overfitting risks observed with broader bidirectional schemes.
99
+
100
+ Phi-4-Reasoning-Vision-15B is trained with Supervised Fine-Tuning (SFT) on a carefully curated mixture of reasoning and non-reasoning data. Rather than training separate models for each mode, the model operates as a single system that can invoke extended chain-of-thought reasoning (using `<think>...</think>` blocks) for tasks like mathematical and scientific reasoning, or default to direct inference (tagged with `<nothink>`) for perception-focused tasks such as captioning, object detection, and grounding. The training data consists primarily of meticulously filtered and improved open-source vision-language datasets, supplemented by high-quality domain-specific data from internal Microsoft teams and targeted data acquisitions. This data-centric approach, combined with moderate training compute requirements (240 NVIDIA B200 GPUs for 4 days), distinguishes Phi-4-Reasoning-Vision-15B from models that rely on substantially more training data and compute.
101
+
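+ The encoder and context limits described above surface directly in this repository's `config.json` (for example `max_num_patches` and `tokenizer_model_max_length`). As a small orientation sketch, assuming the Hugging Face repo id used elsewhere in this card, they can be read through `AutoConfig`:
+ 
+ ```python
+ from transformers import AutoConfig
+ 
+ # Hedged sketch: inspect the vision/text settings shipped with the checkpoint.
+ cfg = AutoConfig.from_pretrained(
+     "microsoft/Phi-4-Reasoning-Vision-15B", trust_remote_code=True
+ )
+ print(cfg.mm_vision_tower)                       # google/siglip2-so400m-patch16-naflex
+ print(cfg.min_num_patches, cfg.max_num_patches)  # 256 to 3600 visual tokens per image
+ print(cfg.hidden_size, cfg.num_hidden_layers)    # 5120, 40 (language backbone)
+ print(cfg.tokenizer_model_max_length)            # 16,384-token context
+ ```
+ 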
102
+ ### 1.1 Alignment Approach
103
+
104
+ Phi-4-Reasoning-Vision-15B has adopted a safety post-training approach leveraging a combination of open-source and in-house generated synthetic datasets. The safety alignment is achieved through Supervised Fine-Tuning (SFT) using data that includes both helpfulness and harmlessness examples, as well as targeted questions and answers across multiple safety categories. The model's training data explicitly includes safety-oriented samples designed to teach appropriate refusal behavior for harmful content categories including hate speech, violence, self-harm content, and sexually explicit material. Automated red teaming was performed on Azure to assess safety risks including groundedness, jailbreak susceptibility, harmful content generation, and copyright violations for protected material.
105
+
106
+ ---
107
+
108
+ ## 2. Usage
109
+
110
+ ### 2.1 Primary Use Cases
111
+
112
+ Phi-4-Reasoning-Vision-15B is designed for general-purpose multimodal AI systems and applications that require vision-language understanding with selective reasoning capabilities, particularly in memory- or compute-constrained environments. The model excels in two primary domains:
113
+
114
+ - **Scientific and mathematical reasoning over visual inputs:** such as solving math problems presented as handwritten equations or diagrams, extracting and reasoning over quantitative information in documents, charts, and tables, and supporting multi-step reasoning in educational or scientific analysis contexts.
115
+ - **Computer-use agent (CUA) tasks:** such as interpreting screen content, localizing interactive GUI elements, and selecting actions within graphical user interfaces.
116
+
117
+ The model is also capable of general multimodal tasks including image captioning, visual question answering, optical character recognition, object localization, and grounding. Its hybrid reasoning design allows it to produce fast, direct responses for perception-focused tasks while engaging in structured chain-of-thought reasoning when the task benefits from it, making it suitable as a building block for generative AI-powered features across a range of applications.
118
+
119
+ ### 2.2 Out-of-Scope Use Cases
120
+
121
+ Phi-4-Reasoning-Vision-15B is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of vision-language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios.
122
+
123
+ The model is trained primarily on English text and image-text pairs. Languages other than English may experience degraded performance. The model should not be used in scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (e.g., housing, employment, credit) without further assessments and additional debiasing techniques. It is not suitable for providing medical diagnoses, legal advice, or financial planning. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
124
+
125
+ ### 2.3 Distribution Channels
126
+
127
+ Some of Phi-4-Reasoning-Vision-15B's distribution channels include:
128
+
129
+ - Public access through open-source repositories: [Hugging Face](https://huggingface.co/microsoft/Phi-4-Reasoning-Vision-15B)
130
+ - Public access through open-source code repositories: [GitHub](https://github.com/microsoft/Phi-4-vision)
131
+ - Enterprise or subscription-based access through [Azure AI Foundry](https://ai.azure.com)
132
+
133
+ ### 2.4 Input Formats
134
+
135
+ Given the nature of the training data, always use the chat template and system prompt for inference. For example, for the prompt "Please describe the image", the fully formatted, chat-templated prompt is the following:
136
+
137
+ ```
138
+ <|im_start|>system<|im_sep|>You are Phi, a multimodal model trained by Microsoft to help users. Your role as an assistant is to provide accurate, coherent, and actionable responses, adapting your reasoning mode ("NOTHINK" vs "THINK") automatically based on the complexity, clarity, and confidence of each task.
139
+
140
+ #### NOTHINK Mode
141
+ Use this mode when the task is clear, factual, low-complexity, or can be confidently answered immediately without iterative reasoning. Such as when the input is clear and unambiguous or visual recognition or text comprehension is straightforward, and where a factual, numeric, or short procedural answer is sufficient. Provide a concise, accurate, and confident answer. Please structure your response into one section: using the specified format: <nothink> {Solution section}. In the Solution section, present the final solution that you deem correct. The Solution section should be logical, accurate, and concise.
142
+
143
+ #### THINK Mode
144
+ This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Use this mode when multiple modalities must be integrated, the task involves analysis, inference, design, or planning, the query is ambiguous, multi-step, or requires judgment. Think through the visual and textual context before responding. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion.
145
+
146
+ Now, try to solve the following question through the above guidelines:<|im_end|><|im_start|>user<|im_sep|>Please describe the image<|im_end|><|im_start|>assistant<|im_sep|>
147
+ ```
148
+
149
+ To force a thinking response, append the `<think>` token to the generation template:
150
+
151
+ ```
152
+ <|im_start|>assistant<|im_sep|><think>
153
+ ```
154
+
155
+ To force a non-thinking response, append the `<nothink>` token to the generation template:
156
+
157
+ ```
158
+ <|im_start|>assistant<|im_sep|><nothink>
159
+ ```
160
+
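+ The snippet below is a minimal sketch of driving these templates from Python with `transformers`. The custom processor's exact interface is not documented here, so the `images=` keyword, `processor.tokenizer`, and `processor.decode` calls are assumptions based on common multimodal-processor conventions; consult the repository's processing code for the released API.
+ 
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor
+ 
+ model_id = "microsoft/Phi-4-Reasoning-Vision-15B"
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
+ )
+ 
+ image = Image.open("example.png")  # any local image (hypothetical path)
+ messages = [{"role": "user", "content": "Please describe the image"}]
+ 
+ # chat_template.jinja injects the full system prompt and the assistant header.
+ prompt = processor.tokenizer.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ prompt += "<think>"  # or "<nothink>" to force a direct, non-reasoning answer
+ 
+ inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
+ output = model.generate(
+     **inputs, max_new_tokens=2048,
+     do_sample=True, temperature=0.8, top_p=0.95,  # defaults from generation_config.json
+ )
+ print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
+ ```
+ 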
161
+ ### 2.5 Technical Requirements and Integration Guidance
162
+
163
+ The following software packages are required for running Phi-4-Reasoning-Vision-15B:
164
+
165
+ - `torch >= 2.7.1`
166
+ - `transformers >= 4.57.1`
167
+ - `vllm >= 0.15.2` (only required if using vLLM)
168
+
169
+ Phi-4-Reasoning-Vision-15B has been tested on NVIDIA A6000, A100, H100, and B200 GPUs with the Ubuntu 22.04.5 LTS operating system. In principle, other GPU architectures with enough memory to fit the model could suffice, but these have not been tested. It is recommended that users host Phi-4-Reasoning-Vision-15B on a vLLM server using bf16 precision.
170
+
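+ As a concrete illustration of that deployment path, the sketch below queries a locally hosted vLLM server through its OpenAI-compatible endpoint. The server launch flags and the multimodal message format are generic vLLM/OpenAI conventions rather than instructions specific to this model, so treat them as assumptions and adjust to your environment.
+ 
+ ```python
+ # Start the server first, e.g. (assumed flags; see the vLLM docs):
+ #   vllm serve microsoft/Phi-4-Reasoning-Vision-15B --dtype bfloat16 --trust-remote-code
+ import base64
+ from openai import OpenAI
+ 
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ 
+ with open("chart.png", "rb") as f:  # hypothetical local image
+     image_b64 = base64.b64encode(f.read()).decode()
+ 
+ response = client.chat.completions.create(
+     model="microsoft/Phi-4-Reasoning-Vision-15B",
+     messages=[{
+         "role": "user",
+         "content": [
+             {"type": "text", "text": "What is the highest value in this chart?"},
+             {"type": "image_url",
+              "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
+         ],
+     }],
+     temperature=0.8,
+     top_p=0.95,
+     max_tokens=1024,
+ )
+ print(response.choices[0].message.content)
+ ```
+ 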
171
+ ### 2.6 Responsible AI Considerations
172
+
173
+ Like other models, Phi-4-Reasoning-Vision-15B can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
174
+
175
+ - **Quality of Service:** The model is trained primarily on English text. Languages other than English may experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. Phi-4-Reasoning-Vision-15B is not intended to support multilingual use.
176
+ - **Representation of Harms & Perpetuation of Stereotypes:** The model may over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
177
+ - **Inappropriate or Offensive Content:** The model may produce inappropriate or offensive content, which may make it inappropriate to deploy in sensitive contexts without additional mitigations specific to the use case.
178
+ - **Information Reliability:** Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
179
+
180
+ Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g., privacy, trade, etc.). Using safety services like Azure AI Content Safety that have advanced guardrails is highly recommended. Important areas for consideration include:
181
+
182
+ - **Allocation:** Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (e.g., housing, employment, credit) without further assessments and additional debiasing techniques.
183
+ - **High-Risk Scenarios:** Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable, or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (e.g., legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
184
+ - **Misinformation:** Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
185
+ - **Generation of Harmful Content:** Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
186
+ - **Misuse:** Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
187
+
188
+ ---
189
+
190
+ ## 3. Quality and Performance Evaluation
191
+
192
+ Phi-4-Reasoning-Vision-15B was evaluated across a broad range of public benchmarks spanning multimodal reasoning, mathematical problem solving, document and chart understanding, visual perception, OCR, and computer-use grounding tasks. Two evaluation frameworks were used: Microsoft's Eureka ML Insights for internal development benchmarks, and VLMEvalKit for standardized community benchmarks. Evaluation logs will be released publicly.
193
+
194
+ The model was evaluated on the following benchmarks via VLMEvalKit: AI2D (diagram understanding), BLINK (core visual perception), ChartQA (chart reasoning), DocVQA (document question answering), HallusionBench (hallucination and visual illusion detection), MathVerse (visual math with varying multimodal information), MathVision (competition-level mathematical reasoning), MathVista (math reasoning in visual contexts), MMMU (multi-discipline multimodal understanding), MMStar (vision-indispensable multimodal evaluation), OCRBench (OCR capabilities), ScreenSpot-V2 for Desktop, Mobile, and Web (GUI element localization), WeMath (human-like mathematical reasoning process evaluation), WildVision (real-world human preference evaluation), and ZeroBench (challenging visual reasoning). During development, additional benchmarks including MMMU-CoT, ScreenSpot-Pro, and V*Bench were evaluated using Eureka ML Insights.
195
+
196
+ ### Table 1: Accuracy Comparisons Relative to Popular Open-Weight, Non-Thinking Models
197
+
198
+ | Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force nothink | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-8B-Instruct-32K | Qwen3-VL-32B-Instruct-4K | Qwen3-VL-32B-Instruct-32K |
199
+ |---|---|---|---|---|---|---|---|---|---|
200
+ | AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 83 | 84.8 | 85 |
201
+ | ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 83.2 | 84.3 | 84 |
202
+ | HallusionBench | 64.4 | 63.1 | 56 | 65.2 | 65.3 | 73.5 | 74.1 | 74.4 | 74.9 |
203
+ | MathVerse_MINI | 44.9 | 43.8 | 32.4 | 41.7 | 29.8 | 54.5 | 57.4 | 64.2 | 64.2 |
204
+ | MathVision_MINI | 36.2 | 34.2 | 20 | 28.3 | 31.9 | 45.7 | 50 | 54.3 | 60.5 |
205
+ | MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 76.4 | 82.5 | 81.8 |
206
+ | MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6 |
207
+ | MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3 |
208
+ | OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5 |
209
+ | ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9 |
210
+
211
+ ### Table 2: Accuracy Comparisons Relative to Popular Open-Weight, Thinking Models
212
+
213
+ | Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B - force thinking | Kimi-VL-A3B-Thinking | gemma3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K |
214
+ |---|---|---|---|---|---|---|---|---|
215
+ | AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2 |
216
+ | ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1 |
217
+ | HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6 |
218
+ | MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2 |
219
+ | MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6 |
220
+ | MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8 |
221
+ | MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2 |
222
+ | MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7 |
223
+ | OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85 |
224
+ | ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1 |
225
+
226
+ ### 3.1 Safety Evaluation and Red-Teaming
227
+
228
+ Phi-4-Reasoning-Vision-15B was trained on a mixture of public safety data and internally generated examples of tasks it ought to refuse under Microsoft's Responsible AI Policy.
229
+
230
+ Phi-4-Reasoning-Vision-15B's safety was evaluated using both quantitative and qualitative approaches prior to release. Automated red teaming was performed on Azure to assess safety risks across multiple risk categories, including disallowed content (sexual, violent, hateful, or self-harm content), copyright content and intellectual property, and jailbreak susceptibility. The evaluation assessed the model's groundedness and its tendency to generate fabricated or misleading information.
231
+
232
+ The safety evaluation built upon the established practices from the Phi-4-Reasoning model's safety assessment. The model's training data included explicit safety-oriented samples across both reasoning and non-reasoning modes, designed to teach appropriate refusal and harm-avoidance behaviors. The multimodal nature of the model introduces additional safety considerations around visual content interpretation, and evaluations were conducted to assess the model's behavior when presented with potentially harmful or misleading visual inputs.
233
+
234
+ | Evaluation | Description | Defect Rate |
235
+ |---|---|---|
236
+ | Text to Text Safety | Automated content safety evaluation measuring safety policies | 1.4% |
237
+ | Image to Text Safety | Automated content safety evaluation measuring safety policies | 4.5% |
238
+
239
+ ---
240
+
241
+ ## 4. Data Overview
242
+
243
+ ### 4.1 Training, Testing, and Validation Datasets
244
+
245
+ To learn more about the training data used for Phi-4-Reasoning-Vision-15B please refer to the full data card: RRRR_nnnn_Data Card for Foundation+Frontier Models.
246
+
247
+ ### 4.2 List of Data Sources
248
+
249
+ To learn more about the training data used for Phi-4-Reasoning-Vision-15B please refer to the full data card: RRRR_nnnn_Data Card for Foundation+Frontier Models.
250
+
251
+ ---
252
+
253
+ ## 5. Contact
254
+
255
+ Requests for additional information can be directed to [MSFTAIActRequest@microsoft.com](mailto:MSFTAIActRequest@microsoft.com).
256
+
257
+ Authorized representative: Microsoft Ireland Operations Limited, 70 Sir John Rogerson's Quay, Dublin 2, D02 R296, Ireland
258
+
259
+ ---
260
+
261
+ ## 6. Appendix
262
+
263
+ ### A. Benchmarking Methodology
264
+
265
+ Phi-4-Reasoning-Vision-15B was evaluated using two complementary open-source evaluation frameworks:
266
+
267
+ **1. [Eureka ML Insights](https://github.com/microsoft/eureka-ml-insights)**
268
+
269
+ Used during development for internal benchmarks and ablation studies. The following benchmarks were evaluated through this framework:
270
+
271
+ - **MathVista:** Mathematical reasoning over visual inputs including diagrams, charts, and figures
272
+ - **MMMU-CoT:** Multi-discipline multimodal understanding with chain-of-thought reasoning
273
+ - **ScreenSpot / ScreenSpot-V2:** GUI element localization on desktop and mobile screenshots
274
+ - **ScreenSpot-Pro:** High-resolution professional GUI grounding tasks
275
+ - **V\*Bench:** Visual reasoning benchmark
276
+
277
+ **2. [VLMEvalKit](https://github.com/open-compass/VLMEvalKit)**
278
+
279
+ Used for standardized community benchmark evaluation. The following benchmarks were evaluated through this framework:
280
+
281
+ - **AI2D (TEST split):** Diagram understanding over ~5K illustrative diagrams from grade school natural sciences, evaluating the ability to interpret diagrammatic elements, relationships, and structure.
282
+ - **BLINK:** Core visual perception benchmark with 3,807 multiple-choice questions spanning 14 classic computer vision tasks including relative depth estimation, visual correspondence, and multi-view reasoning.
283
+ - **ChartQA (TEST split):** Chart understanding and reasoning benchmark with 9,600 human-written questions assessing complex visual and logical reasoning over chart data.
284
+ - **DocVQA (VAL split):** Document visual question answering over 12,000+ document images, evaluating text extraction and comprehension within document layouts.
285
+ - **HallusionBench:** Diagnostic benchmark evaluating image-context reasoning, language hallucination tendencies, and visual illusion susceptibility in vision-language models.
286
+ - **MathVerse (MINI split):** Visual math benchmark with 2,612 multi-subject math problems transformed into six versions offering varying degrees of multimodal information content.
287
+ - **MathVision (MINI split):** 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions, spanning 16 mathematical disciplines across 5 difficulty levels.
288
+ - **MathVista (MINI split):** Mathematical reasoning in visual contexts including geometry, algebra, and data interpretation.
289
+ - **MMMU (DEV_VAL split):** Massive multi-discipline multimodal understanding benchmark with 11.5K questions from college exams covering six core disciplines and 30 subjects.
290
+ - **MMStar:** Vision-indispensable multimodal benchmark with 1,500 carefully curated samples evaluating six core capabilities: coarse perception, fine-grained perception, instance reasoning, logical reasoning, science and technology, and mathematics.
291
+ - **OCRBench:** Comprehensive OCR evaluation with 1,000 question-answer pairs spanning text recognition, scene text VQA, document-oriented VQA, key information extraction, and handwritten mathematical expression recognition.
292
+ - **ScreenSpot-V2 (Desktop, Mobile, Web):** GUI element localization benchmark across desktop, mobile, and web interfaces.
293
+ - **WeMath:** Mathematical reasoning process benchmark with 6.5K visual math problems spanning 67 hierarchical knowledge concepts, evaluating knowledge acquisition and generalization beyond end-to-end performance.
294
+ - **WildVision:** Real-world human preference evaluation benchmark with 500 high-quality samples curated from 8,000 user submissions, using GPT-4o as judge.
295
+ - **ZeroBench:** Challenging visual reasoning benchmark with 100 manually curated questions designed to probe the limits of spatial reasoning, object recognition, and complex visual scene interpretation.
296
+
297
+ Evaluation logs will be released publicly.
chat_template.jinja ADDED
@@ -0,0 +1 @@
1
+ <|im_start|>system<|im_sep|>You are Phi, a multimodal model trained by Microsoft to help users. Your role as an assistant is to provide accurate, coherent, and actionable responses, adapting your reasoning mode (\"NOTHINK\" vs \"THINK\") automatically based on the complexity, clarity, and confidence of each task.\n\n#### NOTHINK Mode\nUse this mode when the task is clear, factual, low-complexity, or can be confidently answered immediately without iterative reasoning. Such as when the input is clear and unambiguous or visual recognition or text comprehension is straightforward, and where a factual, numeric, or short procedural answer is sufficient. Provide a concise, accurate, and confident answer. Please structure your response into one section: using the specified format: <nothink> {Solution section}. In the Solution section, present the final solution that you deem correct. The Solution section should be logical, accurate, and concise.\n\n#### THINK Mode\nThis requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Use this mode when multiple modalities must be integrated, the task involves analysis, inference, design, or planning, the query is ambiguous, multi-step, or requires judgment. Think through the visual and textual context before responding. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion.\n\nNow, try to solve the following question through the above guidelines:<|im_end|>{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>'}}{% generation %}{{message['content'] + '<|im_end|>'}}{% endgeneration %}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
config.json ADDED
@@ -0,0 +1,58 @@
1
+ {
2
+ "architectures": [
3
+ "Phi4ForCausalLMV"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "modeling_phi4_visionr.Phi4VisionR",
7
+ "AutoModelForCausalLM": "modeling_phi4_visionr.Phi4ForCausalLMV",
8
+ "AutoProcessor": "processing_phi4_visionr.Phi4VisionRProcessor"
9
+ },
10
+ "attention_bias": false,
11
+ "attention_dropout": 0.0,
12
+ "bos_token_id": 100257,
13
+ "dtype": "bfloat16",
14
+ "embd_pdrop": 0.0,
15
+ "eos_token_id": 100265,
16
+ "freeze_mm_mlp_adapter": false,
17
+ "hidden_act": "silu",
18
+ "hidden_size": 5120,
19
+ "image_aspect_ratio": "square",
20
+ "initializer_range": 0.02,
21
+ "intermediate_size": 17920,
22
+ "max_num_patches": 3600,
23
+ "max_position_embeddings": 32768,
24
+ "min_num_patches": 256,
25
+ "mm_hidden_size": 1152,
26
+ "mm_projector_lr": null,
27
+ "mm_projector_type": "mlp2x_gelu",
28
+ "mm_vision_tower": "google/siglip2-so400m-patch16-naflex",
29
+ "model_type": "phi4-siglip",
30
+ "num_attention_heads": 40,
31
+ "num_hidden_layers": 40,
32
+ "num_key_value_heads": 10,
33
+ "original_max_position_embeddings": 32768,
34
+ "pad_token_id": 100349,
35
+ "partial_rotary_factor": 1.0,
36
+ "resid_pdrop": 0.0,
37
+ "rms_norm_eps": 1e-05,
38
+ "rope_scaling": null,
39
+ "rope_theta": 500000,
40
+ "sliding_window": null,
41
+ "tie_word_embeddings": false,
42
+ "tokenizer_model_max_length": 16384,
43
+ "tokenizer_padding_side": "right",
44
+ "transformers_version": "4.56.1",
45
+ "tune_mm_mlp_adapter": false,
46
+ "unfreeze_vision_tower": true,
47
+ "use_cache": true,
48
+ "use_mm_proj": true,
49
+ "use_s2": false,
50
+ "vocab_size": 100352,
51
+ "vision_config": {
52
+ "hidden_size": 1152,
53
+ "intermediate_size": 4304,
54
+ "model_type": "siglip2_vision_model",
55
+ "num_attention_heads": 16,
56
+ "num_hidden_layers": 27
57
+ }
58
+ }
generation_config.json ADDED
@@ -0,0 +1,12 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 100257,
4
+ "do_sample": true,
5
+ "eos_token_id": [
6
+ 100265
7
+ ],
8
+ "pad_token_id": 100349,
9
+ "temperature": 0.8,
10
+ "top_p": 0.95,
11
+ "transformers_version": "4.56.1"
12
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
model-00001-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:df55fbc1ae0c0bc05382b542ecbfcc790b64b5a3c8ebde66823a1815fd24c97a
3
+ size 4933656472
model-00002-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:16628c381bfd03d814535ef49d2ae75c083cc07ae679e409081f5fd247ef8525
3
+ size 4954690712
model-00003-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1acfe85da02a69ed35e2ab76e2da29ae9a0c3b040dacc0f63b8497a6ccd4a29a
3
+ size 4902241352
model-00004-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a0cbd06d44679f61afab76f66cae70ca831e4aaeee2a29e36b4db4f5c2df3235
3
+ size 4771169120
model-00005-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fd41f7de7b9a60e1950c56247c8c0d7135c7ab2568507a2db3e6fb09054aded4
3
+ size 4771169120
model-00006-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0942ab288370b0d9bc7f861c4475285ac72790de97bea6bd9b0e93329c85831f
3
+ size 4878604168
model-00007-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:10e987e926e6bbf6f34d54ba8c0f4033a827626398bf52419d139afbef07cea6
3
+ size 1027604608
model.safetensors.index.json ADDED
@@ -0,0 +1,703 @@
1
+ {
2
+ "metadata": {
3
+ "total_parameters": 15119518144,
4
+ "total_size": 30239036288
5
+ },
6
+ "weight_map": {
7
+ "lm_head.weight": "model-00007-of-00007.safetensors",
8
+ "model.embed_tokens.weight": "model-00001-of-00007.safetensors",
9
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00007.safetensors",
10
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
11
+ "model.layers.0.mlp.gate_up_proj.weight": "model-00001-of-00007.safetensors",
12
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
13
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
14
+ "model.layers.0.self_attn.qkv_proj.weight": "model-00001-of-00007.safetensors",
15
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00007.safetensors",
16
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
17
+ "model.layers.1.mlp.gate_up_proj.weight": "model-00001-of-00007.safetensors",
18
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
19
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
20
+ "model.layers.1.self_attn.qkv_proj.weight": "model-00001-of-00007.safetensors",
21
+ "model.layers.10.input_layernorm.weight": "model-00002-of-00007.safetensors",
22
+ "model.layers.10.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
23
+ "model.layers.10.mlp.gate_up_proj.weight": "model-00002-of-00007.safetensors",
24
+ "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
25
+ "model.layers.10.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
26
+ "model.layers.10.self_attn.qkv_proj.weight": "model-00002-of-00007.safetensors",
27
+ "model.layers.11.input_layernorm.weight": "model-00002-of-00007.safetensors",
28
+ "model.layers.11.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
29
+ "model.layers.11.mlp.gate_up_proj.weight": "model-00002-of-00007.safetensors",
30
+ "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
31
+ "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
32
+ "model.layers.11.self_attn.qkv_proj.weight": "model-00002-of-00007.safetensors",
33
+ "model.layers.12.input_layernorm.weight": "model-00002-of-00007.safetensors",
34
+ "model.layers.12.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
35
+ "model.layers.12.mlp.gate_up_proj.weight": "model-00002-of-00007.safetensors",
36
+ "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
37
+ "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
38
+ "model.layers.12.self_attn.qkv_proj.weight": "model-00002-of-00007.safetensors",
39
+ "model.layers.13.input_layernorm.weight": "model-00003-of-00007.safetensors",
40
+ "model.layers.13.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
41
+ "model.layers.13.mlp.gate_up_proj.weight": "model-00003-of-00007.safetensors",
42
+ "model.layers.13.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
43
+ "model.layers.13.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
44
+ "model.layers.13.self_attn.qkv_proj.weight": "model-00003-of-00007.safetensors",
45
+ "model.layers.14.input_layernorm.weight": "model-00003-of-00007.safetensors",
46
+ "model.layers.14.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
47
+ "model.layers.14.mlp.gate_up_proj.weight": "model-00003-of-00007.safetensors",
48
+ "model.layers.14.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
49
+ "model.layers.14.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
50
+ "model.layers.14.self_attn.qkv_proj.weight": "model-00003-of-00007.safetensors",
51
+ "model.layers.15.input_layernorm.weight": "model-00003-of-00007.safetensors",
52
+ "model.layers.15.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
53
+ "model.layers.15.mlp.gate_up_proj.weight": "model-00003-of-00007.safetensors",
54
+ "model.layers.15.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
55
+ "model.layers.15.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
56
+ "model.layers.15.self_attn.qkv_proj.weight": "model-00003-of-00007.safetensors",
57
+ "model.layers.16.input_layernorm.weight": "model-00003-of-00007.safetensors",
58
+ "model.layers.16.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
59
+ "model.layers.16.mlp.gate_up_proj.weight": "model-00003-of-00007.safetensors",
60
+ "model.layers.16.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
61
+ "model.layers.16.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
62
+ "model.layers.16.self_attn.qkv_proj.weight": "model-00003-of-00007.safetensors",
63
+ "model.layers.17.input_layernorm.weight": "model-00003-of-00007.safetensors",
64
+ "model.layers.17.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
65
+ "model.layers.17.mlp.gate_up_proj.weight": "model-00003-of-00007.safetensors",
66
+ "model.layers.17.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
67
+ "model.layers.17.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
68
+ "model.layers.17.self_attn.qkv_proj.weight": "model-00003-of-00007.safetensors",
69
+ "model.layers.18.input_layernorm.weight": "model-00003-of-00007.safetensors",
70
+ "model.layers.18.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
71
+ "model.layers.18.mlp.gate_up_proj.weight": "model-00003-of-00007.safetensors",
72
+ "model.layers.18.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
73
+ "model.layers.18.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
74
+ "model.layers.18.self_attn.qkv_proj.weight": "model-00003-of-00007.safetensors",
75
+ "model.layers.19.input_layernorm.weight": "model-00003-of-00007.safetensors",
76
+ "model.layers.19.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
77
+ "model.layers.19.mlp.gate_up_proj.weight": "model-00003-of-00007.safetensors",
78
+ "model.layers.19.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
79
+ "model.layers.19.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
80
+ "model.layers.19.self_attn.qkv_proj.weight": "model-00003-of-00007.safetensors",
81
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00007.safetensors",
82
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
83
+ "model.layers.2.mlp.gate_up_proj.weight": "model-00001-of-00007.safetensors",
84
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
85
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
86
+ "model.layers.2.self_attn.qkv_proj.weight": "model-00001-of-00007.safetensors",
87
+ "model.layers.20.input_layernorm.weight": "model-00004-of-00007.safetensors",
88
+ "model.layers.20.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
89
+ "model.layers.20.mlp.gate_up_proj.weight": "model-00004-of-00007.safetensors",
90
+ "model.layers.20.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
91
+ "model.layers.20.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
92
+ "model.layers.20.self_attn.qkv_proj.weight": "model-00003-of-00007.safetensors",
93
+ "model.layers.21.input_layernorm.weight": "model-00004-of-00007.safetensors",
94
+ "model.layers.21.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
95
+ "model.layers.21.mlp.gate_up_proj.weight": "model-00004-of-00007.safetensors",
96
+ "model.layers.21.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
97
+ "model.layers.21.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
98
+ "model.layers.21.self_attn.qkv_proj.weight": "model-00004-of-00007.safetensors",
99
+ "model.layers.22.input_layernorm.weight": "model-00004-of-00007.safetensors",
100
+ "model.layers.22.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
101
+ "model.layers.22.mlp.gate_up_proj.weight": "model-00004-of-00007.safetensors",
102
+ "model.layers.22.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
103
+ "model.layers.22.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
104
+ "model.layers.22.self_attn.qkv_proj.weight": "model-00004-of-00007.safetensors",
105
+ "model.layers.23.input_layernorm.weight": "model-00004-of-00007.safetensors",
106
+ "model.layers.23.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
107
+ "model.layers.23.mlp.gate_up_proj.weight": "model-00004-of-00007.safetensors",
108
+ "model.layers.23.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
109
+ "model.layers.23.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
110
+ "model.layers.23.self_attn.qkv_proj.weight": "model-00004-of-00007.safetensors",
111
+ "model.layers.24.input_layernorm.weight": "model-00004-of-00007.safetensors",
112
+ "model.layers.24.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
113
+ "model.layers.24.mlp.gate_up_proj.weight": "model-00004-of-00007.safetensors",
114
+ "model.layers.24.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
115
+ "model.layers.24.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
116
+ "model.layers.24.self_attn.qkv_proj.weight": "model-00004-of-00007.safetensors",
117
+ "model.layers.25.input_layernorm.weight": "model-00004-of-00007.safetensors",
118
+ "model.layers.25.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
119
+ "model.layers.25.mlp.gate_up_proj.weight": "model-00004-of-00007.safetensors",
120
+ "model.layers.25.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
121
+ "model.layers.25.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
122
+ "model.layers.25.self_attn.qkv_proj.weight": "model-00004-of-00007.safetensors",
123
+ "model.layers.26.input_layernorm.weight": "model-00004-of-00007.safetensors",
124
+ "model.layers.26.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
125
+ "model.layers.26.mlp.gate_up_proj.weight": "model-00004-of-00007.safetensors",
126
+ "model.layers.26.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
127
+ "model.layers.26.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
128
+ "model.layers.26.self_attn.qkv_proj.weight": "model-00004-of-00007.safetensors",
129
+ "model.layers.27.input_layernorm.weight": "model-00005-of-00007.safetensors",
130
+ "model.layers.27.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
131
+ "model.layers.27.mlp.gate_up_proj.weight": "model-00005-of-00007.safetensors",
132
+ "model.layers.27.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
133
+ "model.layers.27.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
134
+ "model.layers.27.self_attn.qkv_proj.weight": "model-00004-of-00007.safetensors",
135
+ "model.layers.28.input_layernorm.weight": "model-00005-of-00007.safetensors",
136
+ "model.layers.28.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
137
+ "model.layers.28.mlp.gate_up_proj.weight": "model-00005-of-00007.safetensors",
138
+ "model.layers.28.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
139
+ "model.layers.28.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
140
+ "model.layers.28.self_attn.qkv_proj.weight": "model-00005-of-00007.safetensors",
141
+ "model.layers.29.input_layernorm.weight": "model-00005-of-00007.safetensors",
142
+ "model.layers.29.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
143
+ "model.layers.29.mlp.gate_up_proj.weight": "model-00005-of-00007.safetensors",
144
+ "model.layers.29.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
145
+ "model.layers.29.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
146
+ "model.layers.29.self_attn.qkv_proj.weight": "model-00005-of-00007.safetensors",
147
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00007.safetensors",
148
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
149
+ "model.layers.3.mlp.gate_up_proj.weight": "model-00001-of-00007.safetensors",
150
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
151
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
152
+ "model.layers.3.self_attn.qkv_proj.weight": "model-00001-of-00007.safetensors",
153
+ "model.layers.30.input_layernorm.weight": "model-00005-of-00007.safetensors",
154
+ "model.layers.30.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
155
+ "model.layers.30.mlp.gate_up_proj.weight": "model-00005-of-00007.safetensors",
156
+ "model.layers.30.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
157
+ "model.layers.30.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
158
+ "model.layers.30.self_attn.qkv_proj.weight": "model-00005-of-00007.safetensors",
159
+ "model.layers.31.input_layernorm.weight": "model-00005-of-00007.safetensors",
160
+ "model.layers.31.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
161
+ "model.layers.31.mlp.gate_up_proj.weight": "model-00005-of-00007.safetensors",
162
+ "model.layers.31.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
163
+ "model.layers.31.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
164
+ "model.layers.31.self_attn.qkv_proj.weight": "model-00005-of-00007.safetensors",
165
+ "model.layers.32.input_layernorm.weight": "model-00005-of-00007.safetensors",
166
+ "model.layers.32.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
167
+ "model.layers.32.mlp.gate_up_proj.weight": "model-00005-of-00007.safetensors",
168
+ "model.layers.32.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
169
+ "model.layers.32.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
170
+ "model.layers.32.self_attn.qkv_proj.weight": "model-00005-of-00007.safetensors",
171
+ "model.layers.33.input_layernorm.weight": "model-00005-of-00007.safetensors",
172
+ "model.layers.33.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
173
+ "model.layers.33.mlp.gate_up_proj.weight": "model-00005-of-00007.safetensors",
174
+ "model.layers.33.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
175
+ "model.layers.33.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
176
+ "model.layers.33.self_attn.qkv_proj.weight": "model-00005-of-00007.safetensors",
177
+ "model.layers.34.input_layernorm.weight": "model-00006-of-00007.safetensors",
178
+ "model.layers.34.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
179
+ "model.layers.34.mlp.gate_up_proj.weight": "model-00006-of-00007.safetensors",
180
+ "model.layers.34.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
181
+ "model.layers.34.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
182
+ "model.layers.34.self_attn.qkv_proj.weight": "model-00005-of-00007.safetensors",
183
+ "model.layers.35.input_layernorm.weight": "model-00006-of-00007.safetensors",
184
+ "model.layers.35.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
185
+ "model.layers.35.mlp.gate_up_proj.weight": "model-00006-of-00007.safetensors",
186
+ "model.layers.35.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
187
+ "model.layers.35.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
188
+ "model.layers.35.self_attn.qkv_proj.weight": "model-00006-of-00007.safetensors",
189
+ "model.layers.36.input_layernorm.weight": "model-00006-of-00007.safetensors",
190
+ "model.layers.36.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
191
+ "model.layers.36.mlp.gate_up_proj.weight": "model-00006-of-00007.safetensors",
192
+ "model.layers.36.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
193
+ "model.layers.36.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
194
+ "model.layers.36.self_attn.qkv_proj.weight": "model-00006-of-00007.safetensors",
195
+ "model.layers.37.input_layernorm.weight": "model-00006-of-00007.safetensors",
196
+ "model.layers.37.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
197
+ "model.layers.37.mlp.gate_up_proj.weight": "model-00006-of-00007.safetensors",
198
+ "model.layers.37.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
199
+ "model.layers.37.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
200
+ "model.layers.37.self_attn.qkv_proj.weight": "model-00006-of-00007.safetensors",
201
+ "model.layers.38.input_layernorm.weight": "model-00006-of-00007.safetensors",
202
+ "model.layers.38.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
203
+ "model.layers.38.mlp.gate_up_proj.weight": "model-00006-of-00007.safetensors",
204
+ "model.layers.38.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
205
+ "model.layers.38.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
206
+ "model.layers.38.self_attn.qkv_proj.weight": "model-00006-of-00007.safetensors",
207
+ "model.layers.39.input_layernorm.weight": "model-00006-of-00007.safetensors",
208
+ "model.layers.39.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
209
+ "model.layers.39.mlp.gate_up_proj.weight": "model-00006-of-00007.safetensors",
210
+ "model.layers.39.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
211
+ "model.layers.39.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
212
+ "model.layers.39.self_attn.qkv_proj.weight": "model-00006-of-00007.safetensors",
213
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00007.safetensors",
214
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
215
+ "model.layers.4.mlp.gate_up_proj.weight": "model-00001-of-00007.safetensors",
216
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
217
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
218
+ "model.layers.4.self_attn.qkv_proj.weight": "model-00001-of-00007.safetensors",
219
+ "model.layers.5.input_layernorm.weight": "model-00002-of-00007.safetensors",
220
+ "model.layers.5.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
221
+ "model.layers.5.mlp.gate_up_proj.weight": "model-00001-of-00007.safetensors",
222
+ "model.layers.5.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
223
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
224
+ "model.layers.5.self_attn.qkv_proj.weight": "model-00001-of-00007.safetensors",
225
+ "model.layers.6.input_layernorm.weight": "model-00002-of-00007.safetensors",
226
+ "model.layers.6.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
227
+ "model.layers.6.mlp.gate_up_proj.weight": "model-00002-of-00007.safetensors",
228
+ "model.layers.6.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
229
+ "model.layers.6.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
230
+ "model.layers.6.self_attn.qkv_proj.weight": "model-00002-of-00007.safetensors",
231
+ "model.layers.7.input_layernorm.weight": "model-00002-of-00007.safetensors",
232
+ "model.layers.7.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
233
+ "model.layers.7.mlp.gate_up_proj.weight": "model-00002-of-00007.safetensors",
234
+ "model.layers.7.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
235
+ "model.layers.7.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
236
+ "model.layers.7.self_attn.qkv_proj.weight": "model-00002-of-00007.safetensors",
237
+ "model.layers.8.input_layernorm.weight": "model-00002-of-00007.safetensors",
238
+ "model.layers.8.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
239
+ "model.layers.8.mlp.gate_up_proj.weight": "model-00002-of-00007.safetensors",
240
+ "model.layers.8.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
241
+ "model.layers.8.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
242
+ "model.layers.8.self_attn.qkv_proj.weight": "model-00002-of-00007.safetensors",
243
+ "model.layers.9.input_layernorm.weight": "model-00002-of-00007.safetensors",
244
+ "model.layers.9.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
245
+ "model.layers.9.mlp.gate_up_proj.weight": "model-00002-of-00007.safetensors",
246
+ "model.layers.9.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
247
+ "model.layers.9.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
248
+ "model.layers.9.self_attn.qkv_proj.weight": "model-00002-of-00007.safetensors",
249
+ "model.mm_projector.0.bias": "model-00006-of-00007.safetensors",
250
+ "model.mm_projector.0.weight": "model-00006-of-00007.safetensors",
251
+ "model.mm_projector.2.bias": "model-00006-of-00007.safetensors",
252
+ "model.mm_projector.2.weight": "model-00006-of-00007.safetensors",
253
+ "model.norm.weight": "model-00006-of-00007.safetensors",
254
+ "model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.bias": "model-00006-of-00007.safetensors",
255
+ "model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.weight": "model-00006-of-00007.safetensors",
256
+ "model.vision_tower.vision_tower.vision_model.embeddings.position_embedding.weight": "model-00006-of-00007.safetensors",
257
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm1.bias": "model-00006-of-00007.safetensors",
258
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm1.weight": "model-00006-of-00007.safetensors",
259
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm2.bias": "model-00006-of-00007.safetensors",
260
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm2.weight": "model-00006-of-00007.safetensors",
261
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.mlp.fc1.bias": "model-00006-of-00007.safetensors",
262
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.mlp.fc1.weight": "model-00006-of-00007.safetensors",
263
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.mlp.fc2.bias": "model-00006-of-00007.safetensors",
264
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.mlp.fc2.weight": "model-00006-of-00007.safetensors",
265
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
266
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
267
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
268
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
269
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
270
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
271
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
272
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
273
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.layer_norm1.bias": "model-00006-of-00007.safetensors",
274
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.layer_norm1.weight": "model-00006-of-00007.safetensors",
275
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.layer_norm2.bias": "model-00006-of-00007.safetensors",
276
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.layer_norm2.weight": "model-00006-of-00007.safetensors",
277
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.mlp.fc1.bias": "model-00006-of-00007.safetensors",
278
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.mlp.fc1.weight": "model-00006-of-00007.safetensors",
279
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.mlp.fc2.bias": "model-00006-of-00007.safetensors",
280
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.mlp.fc2.weight": "model-00006-of-00007.safetensors",
281
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
282
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
283
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
284
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
285
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
286
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
287
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
288
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
289
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.layer_norm1.bias": "model-00006-of-00007.safetensors",
290
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.layer_norm1.weight": "model-00006-of-00007.safetensors",
291
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.layer_norm2.bias": "model-00006-of-00007.safetensors",
292
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.layer_norm2.weight": "model-00006-of-00007.safetensors",
293
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.mlp.fc1.bias": "model-00006-of-00007.safetensors",
294
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.mlp.fc1.weight": "model-00006-of-00007.safetensors",
295
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.mlp.fc2.bias": "model-00006-of-00007.safetensors",
296
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.mlp.fc2.weight": "model-00006-of-00007.safetensors",
297
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
298
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
299
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
300
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
301
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
302
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
303
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
304
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
305
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.layer_norm1.bias": "model-00006-of-00007.safetensors",
306
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.layer_norm1.weight": "model-00006-of-00007.safetensors",
307
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.layer_norm2.bias": "model-00006-of-00007.safetensors",
308
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.layer_norm2.weight": "model-00006-of-00007.safetensors",
309
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.mlp.fc1.bias": "model-00006-of-00007.safetensors",
310
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.mlp.fc1.weight": "model-00006-of-00007.safetensors",
311
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.mlp.fc2.bias": "model-00006-of-00007.safetensors",
312
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.mlp.fc2.weight": "model-00006-of-00007.safetensors",
313
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
314
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
315
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
316
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
317
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
318
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
319
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
320
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
321
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.layer_norm1.bias": "model-00006-of-00007.safetensors",
322
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.layer_norm1.weight": "model-00006-of-00007.safetensors",
323
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.layer_norm2.bias": "model-00006-of-00007.safetensors",
324
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.layer_norm2.weight": "model-00006-of-00007.safetensors",
325
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.mlp.fc1.bias": "model-00006-of-00007.safetensors",
326
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.mlp.fc1.weight": "model-00006-of-00007.safetensors",
327
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.mlp.fc2.bias": "model-00006-of-00007.safetensors",
328
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.mlp.fc2.weight": "model-00006-of-00007.safetensors",
329
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
330
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
331
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
332
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
333
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
334
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
335
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
336
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
337
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.layer_norm1.bias": "model-00006-of-00007.safetensors",
338
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.layer_norm1.weight": "model-00006-of-00007.safetensors",
339
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.layer_norm2.bias": "model-00006-of-00007.safetensors",
340
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.layer_norm2.weight": "model-00006-of-00007.safetensors",
341
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.mlp.fc1.bias": "model-00006-of-00007.safetensors",
342
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.mlp.fc1.weight": "model-00006-of-00007.safetensors",
343
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.mlp.fc2.bias": "model-00006-of-00007.safetensors",
344
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.mlp.fc2.weight": "model-00006-of-00007.safetensors",
345
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
346
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
347
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
348
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
349
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
350
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
351
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
352
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
353
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.layer_norm1.bias": "model-00006-of-00007.safetensors",
354
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.layer_norm1.weight": "model-00006-of-00007.safetensors",
355
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.layer_norm2.bias": "model-00006-of-00007.safetensors",
356
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.layer_norm2.weight": "model-00006-of-00007.safetensors",
357
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.mlp.fc1.bias": "model-00006-of-00007.safetensors",
358
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.mlp.fc1.weight": "model-00006-of-00007.safetensors",
359
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.mlp.fc2.bias": "model-00006-of-00007.safetensors",
360
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.mlp.fc2.weight": "model-00006-of-00007.safetensors",
361
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
362
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
363
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
364
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
365
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
366
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
367
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
368
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
369
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.layer_norm1.bias": "model-00006-of-00007.safetensors",
370
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.layer_norm1.weight": "model-00006-of-00007.safetensors",
371
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.layer_norm2.bias": "model-00006-of-00007.safetensors",
372
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.layer_norm2.weight": "model-00006-of-00007.safetensors",
373
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.mlp.fc1.bias": "model-00006-of-00007.safetensors",
374
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.mlp.fc1.weight": "model-00006-of-00007.safetensors",
375
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.mlp.fc2.bias": "model-00006-of-00007.safetensors",
376
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.mlp.fc2.weight": "model-00006-of-00007.safetensors",
377
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
378
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
379
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
380
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
381
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
382
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
383
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
384
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
385
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.layer_norm1.bias": "model-00006-of-00007.safetensors",
386
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.layer_norm1.weight": "model-00006-of-00007.safetensors",
387
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.layer_norm2.bias": "model-00006-of-00007.safetensors",
388
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.layer_norm2.weight": "model-00006-of-00007.safetensors",
389
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.mlp.fc1.bias": "model-00006-of-00007.safetensors",
390
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.mlp.fc1.weight": "model-00006-of-00007.safetensors",
391
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.mlp.fc2.bias": "model-00006-of-00007.safetensors",
392
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.mlp.fc2.weight": "model-00006-of-00007.safetensors",
393
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
394
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
395
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
396
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
397
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
398
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
399
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
400
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
401
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.layer_norm1.bias": "model-00006-of-00007.safetensors",
402
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.layer_norm1.weight": "model-00006-of-00007.safetensors",
403
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.layer_norm2.bias": "model-00006-of-00007.safetensors",
404
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.layer_norm2.weight": "model-00006-of-00007.safetensors",
405
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.mlp.fc1.bias": "model-00006-of-00007.safetensors",
406
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.mlp.fc1.weight": "model-00006-of-00007.safetensors",
407
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.mlp.fc2.bias": "model-00006-of-00007.safetensors",
408
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.mlp.fc2.weight": "model-00006-of-00007.safetensors",
409
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
410
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
411
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
412
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
413
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
414
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
415
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
416
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
417
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.layer_norm1.bias": "model-00006-of-00007.safetensors",
418
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.layer_norm1.weight": "model-00006-of-00007.safetensors",
419
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.layer_norm2.bias": "model-00006-of-00007.safetensors",
420
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.layer_norm2.weight": "model-00006-of-00007.safetensors",
421
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.mlp.fc1.bias": "model-00006-of-00007.safetensors",
422
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.mlp.fc1.weight": "model-00006-of-00007.safetensors",
423
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.mlp.fc2.bias": "model-00006-of-00007.safetensors",
424
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.mlp.fc2.weight": "model-00006-of-00007.safetensors",
425
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
426
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
427
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
428
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
429
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
430
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
431
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
432
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
433
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.layer_norm1.bias": "model-00006-of-00007.safetensors",
434
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.layer_norm1.weight": "model-00006-of-00007.safetensors",
435
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.layer_norm2.bias": "model-00006-of-00007.safetensors",
436
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.layer_norm2.weight": "model-00006-of-00007.safetensors",
437
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.mlp.fc1.bias": "model-00006-of-00007.safetensors",
438
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.mlp.fc1.weight": "model-00006-of-00007.safetensors",
439
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.mlp.fc2.bias": "model-00006-of-00007.safetensors",
440
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.mlp.fc2.weight": "model-00006-of-00007.safetensors",
441
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
442
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
443
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
444
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
445
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
446
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
447
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
448
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
449
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.layer_norm1.bias": "model-00006-of-00007.safetensors",
450
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.layer_norm1.weight": "model-00006-of-00007.safetensors",
451
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.layer_norm2.bias": "model-00006-of-00007.safetensors",
452
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.layer_norm2.weight": "model-00006-of-00007.safetensors",
453
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.mlp.fc1.bias": "model-00006-of-00007.safetensors",
454
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.mlp.fc1.weight": "model-00006-of-00007.safetensors",
455
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.mlp.fc2.bias": "model-00006-of-00007.safetensors",
456
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.mlp.fc2.weight": "model-00006-of-00007.safetensors",
457
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
458
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
459
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
460
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
461
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
462
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
463
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
464
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
465
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.layer_norm1.bias": "model-00006-of-00007.safetensors",
466
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.layer_norm1.weight": "model-00006-of-00007.safetensors",
467
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.layer_norm2.bias": "model-00006-of-00007.safetensors",
468
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.layer_norm2.weight": "model-00006-of-00007.safetensors",
469
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.mlp.fc1.bias": "model-00006-of-00007.safetensors",
470
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.mlp.fc1.weight": "model-00006-of-00007.safetensors",
471
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.mlp.fc2.bias": "model-00006-of-00007.safetensors",
472
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.mlp.fc2.weight": "model-00006-of-00007.safetensors",
473
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
474
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
475
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
476
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
477
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
478
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
479
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
480
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
481
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.layer_norm1.bias": "model-00006-of-00007.safetensors",
482
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.layer_norm1.weight": "model-00006-of-00007.safetensors",
483
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.layer_norm2.bias": "model-00006-of-00007.safetensors",
484
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.layer_norm2.weight": "model-00006-of-00007.safetensors",
485
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.mlp.fc1.bias": "model-00006-of-00007.safetensors",
486
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.mlp.fc1.weight": "model-00006-of-00007.safetensors",
487
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.mlp.fc2.bias": "model-00006-of-00007.safetensors",
488
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.mlp.fc2.weight": "model-00006-of-00007.safetensors",
489
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
490
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
491
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
492
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
493
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
494
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
495
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
496
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
497
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.layer_norm1.bias": "model-00006-of-00007.safetensors",
498
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.layer_norm1.weight": "model-00006-of-00007.safetensors",
499
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.layer_norm2.bias": "model-00006-of-00007.safetensors",
500
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.layer_norm2.weight": "model-00006-of-00007.safetensors",
501
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.mlp.fc1.bias": "model-00006-of-00007.safetensors",
502
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.mlp.fc1.weight": "model-00006-of-00007.safetensors",
503
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.mlp.fc2.bias": "model-00006-of-00007.safetensors",
504
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.mlp.fc2.weight": "model-00006-of-00007.safetensors",
505
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
506
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
507
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
508
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
509
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
510
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
511
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
512
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
513
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.layer_norm1.bias": "model-00006-of-00007.safetensors",
514
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.layer_norm1.weight": "model-00006-of-00007.safetensors",
515
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.layer_norm2.bias": "model-00006-of-00007.safetensors",
516
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.layer_norm2.weight": "model-00006-of-00007.safetensors",
517
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.mlp.fc1.bias": "model-00006-of-00007.safetensors",
518
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.mlp.fc1.weight": "model-00006-of-00007.safetensors",
519
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.mlp.fc2.bias": "model-00006-of-00007.safetensors",
520
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.mlp.fc2.weight": "model-00006-of-00007.safetensors",
521
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
522
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
523
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
524
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
525
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
526
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
527
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
528
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
529
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.layer_norm1.bias": "model-00006-of-00007.safetensors",
530
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.layer_norm1.weight": "model-00006-of-00007.safetensors",
531
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.layer_norm2.bias": "model-00006-of-00007.safetensors",
532
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.layer_norm2.weight": "model-00006-of-00007.safetensors",
533
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.mlp.fc1.bias": "model-00006-of-00007.safetensors",
534
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.mlp.fc1.weight": "model-00006-of-00007.safetensors",
535
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.mlp.fc2.bias": "model-00006-of-00007.safetensors",
536
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.mlp.fc2.weight": "model-00006-of-00007.safetensors",
537
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
538
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
539
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
540
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
541
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
542
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
543
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
544
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
545
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.layer_norm1.bias": "model-00006-of-00007.safetensors",
546
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.layer_norm1.weight": "model-00006-of-00007.safetensors",
547
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.layer_norm2.bias": "model-00006-of-00007.safetensors",
548
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.layer_norm2.weight": "model-00006-of-00007.safetensors",
549
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.mlp.fc1.bias": "model-00006-of-00007.safetensors",
550
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.mlp.fc1.weight": "model-00006-of-00007.safetensors",
551
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.mlp.fc2.bias": "model-00006-of-00007.safetensors",
552
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.mlp.fc2.weight": "model-00006-of-00007.safetensors",
553
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
554
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
555
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
556
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
557
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
558
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
559
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
560
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
561
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.layer_norm1.bias": "model-00006-of-00007.safetensors",
562
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.layer_norm1.weight": "model-00006-of-00007.safetensors",
563
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.layer_norm2.bias": "model-00006-of-00007.safetensors",
564
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.layer_norm2.weight": "model-00006-of-00007.safetensors",
565
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.mlp.fc1.bias": "model-00006-of-00007.safetensors",
566
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.mlp.fc1.weight": "model-00006-of-00007.safetensors",
567
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.mlp.fc2.bias": "model-00006-of-00007.safetensors",
568
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.mlp.fc2.weight": "model-00006-of-00007.safetensors",
569
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
570
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
571
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
572
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
573
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
574
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
575
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
576
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.26.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
577
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.layer_norm1.bias": "model-00006-of-00007.safetensors",
578
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.layer_norm1.weight": "model-00006-of-00007.safetensors",
579
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.layer_norm2.bias": "model-00006-of-00007.safetensors",
580
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.layer_norm2.weight": "model-00006-of-00007.safetensors",
581
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.mlp.fc1.bias": "model-00006-of-00007.safetensors",
582
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.mlp.fc1.weight": "model-00006-of-00007.safetensors",
583
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.mlp.fc2.bias": "model-00006-of-00007.safetensors",
584
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.mlp.fc2.weight": "model-00006-of-00007.safetensors",
585
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
586
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
587
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
588
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
589
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
590
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
591
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
592
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
593
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm1.bias": "model-00006-of-00007.safetensors",
594
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm1.weight": "model-00006-of-00007.safetensors",
595
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm2.bias": "model-00006-of-00007.safetensors",
596
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm2.weight": "model-00006-of-00007.safetensors",
597
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc1.bias": "model-00006-of-00007.safetensors",
598
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc1.weight": "model-00006-of-00007.safetensors",
599
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc2.bias": "model-00006-of-00007.safetensors",
600
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc2.weight": "model-00006-of-00007.safetensors",
601
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
602
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
603
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
604
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
605
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
606
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
607
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
608
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
609
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm1.bias": "model-00006-of-00007.safetensors",
610
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm1.weight": "model-00006-of-00007.safetensors",
611
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm2.bias": "model-00006-of-00007.safetensors",
612
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm2.weight": "model-00006-of-00007.safetensors",
613
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc1.bias": "model-00006-of-00007.safetensors",
614
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc1.weight": "model-00006-of-00007.safetensors",
615
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc2.bias": "model-00006-of-00007.safetensors",
616
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc2.weight": "model-00006-of-00007.safetensors",
617
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
618
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
619
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
620
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
621
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
622
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
623
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
624
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
625
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm1.bias": "model-00006-of-00007.safetensors",
626
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm1.weight": "model-00006-of-00007.safetensors",
627
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm2.bias": "model-00006-of-00007.safetensors",
628
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm2.weight": "model-00006-of-00007.safetensors",
629
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc1.bias": "model-00006-of-00007.safetensors",
630
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc1.weight": "model-00006-of-00007.safetensors",
631
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc2.bias": "model-00006-of-00007.safetensors",
632
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc2.weight": "model-00006-of-00007.safetensors",
633
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
634
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
635
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
636
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
637
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
638
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
639
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
640
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
641
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm1.bias": "model-00006-of-00007.safetensors",
642
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm1.weight": "model-00006-of-00007.safetensors",
643
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm2.bias": "model-00006-of-00007.safetensors",
644
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm2.weight": "model-00006-of-00007.safetensors",
645
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc1.bias": "model-00006-of-00007.safetensors",
646
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc1.weight": "model-00006-of-00007.safetensors",
647
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc2.bias": "model-00006-of-00007.safetensors",
648
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc2.weight": "model-00006-of-00007.safetensors",
649
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
650
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
651
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
652
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
653
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
654
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
655
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
656
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
657
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm1.bias": "model-00006-of-00007.safetensors",
658
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm1.weight": "model-00006-of-00007.safetensors",
659
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm2.bias": "model-00006-of-00007.safetensors",
660
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm2.weight": "model-00006-of-00007.safetensors",
661
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc1.bias": "model-00006-of-00007.safetensors",
662
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc1.weight": "model-00006-of-00007.safetensors",
663
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc2.bias": "model-00006-of-00007.safetensors",
664
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc2.weight": "model-00006-of-00007.safetensors",
665
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
666
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
667
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
668
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
669
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
670
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
671
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
672
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
673
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm1.bias": "model-00006-of-00007.safetensors",
674
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm1.weight": "model-00006-of-00007.safetensors",
675
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm2.bias": "model-00006-of-00007.safetensors",
676
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm2.weight": "model-00006-of-00007.safetensors",
677
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc1.bias": "model-00006-of-00007.safetensors",
678
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc1.weight": "model-00006-of-00007.safetensors",
679
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc2.bias": "model-00006-of-00007.safetensors",
680
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc2.weight": "model-00006-of-00007.safetensors",
681
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.bias": "model-00006-of-00007.safetensors",
682
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
683
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.bias": "model-00006-of-00007.safetensors",
684
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.weight": "model-00006-of-00007.safetensors",
685
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
686
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
687
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
688
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
689
+ "model.vision_tower.vision_tower.vision_model.head.attention.in_proj_bias": "model-00006-of-00007.safetensors",
690
+ "model.vision_tower.vision_tower.vision_model.head.attention.in_proj_weight": "model-00006-of-00007.safetensors",
691
+ "model.vision_tower.vision_tower.vision_model.head.attention.out_proj.bias": "model-00006-of-00007.safetensors",
692
+ "model.vision_tower.vision_tower.vision_model.head.attention.out_proj.weight": "model-00006-of-00007.safetensors",
693
+ "model.vision_tower.vision_tower.vision_model.head.layernorm.bias": "model-00006-of-00007.safetensors",
694
+ "model.vision_tower.vision_tower.vision_model.head.layernorm.weight": "model-00006-of-00007.safetensors",
695
+ "model.vision_tower.vision_tower.vision_model.head.mlp.fc1.bias": "model-00006-of-00007.safetensors",
696
+ "model.vision_tower.vision_tower.vision_model.head.mlp.fc1.weight": "model-00006-of-00007.safetensors",
697
+ "model.vision_tower.vision_tower.vision_model.head.mlp.fc2.bias": "model-00006-of-00007.safetensors",
698
+ "model.vision_tower.vision_tower.vision_model.head.mlp.fc2.weight": "model-00006-of-00007.safetensors",
699
+ "model.vision_tower.vision_tower.vision_model.head.probe": "model-00006-of-00007.safetensors",
700
+ "model.vision_tower.vision_tower.vision_model.post_layernorm.bias": "model-00006-of-00007.safetensors",
701
+ "model.vision_tower.vision_tower.vision_model.post_layernorm.weight": "model-00006-of-00007.safetensors"
702
+ }
703
+ }
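
The `weight_map` above lets a loader open only the shard that actually holds a given parameter instead of reading all seven files. A minimal sketch of that lookup follows; it is not part of this repository's code, it assumes only the standard `json` and `safetensors` APIs, and the parameter and shard names are copied from the index above.

```python
import json
from safetensors import safe_open

# Resolve which shard holds a given parameter via the index's weight_map.
with open("model.safetensors.index.json") as f:
    index = json.load(f)

name = "model.vision_tower.vision_tower.vision_model.post_layernorm.weight"
shard = index["weight_map"][name]  # "model-00006-of-00007.safetensors"

# Open only that shard and read only that tensor.
with safe_open(shard, framework="pt", device="cpu") as reader:
    tensor = reader.get_tensor(name)

print(tensor.shape)
```
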
modeling_phi4_visionr.py ADDED
@@ -0,0 +1,1026 @@
1
+ """
2
+ Minimal self-contained Phi4-Siglip model implementation.
3
+
4
+ This module provides:
5
+ - Phi4VisionR: Configuration class
6
+ - Phi4ForCausalLMV: Main vision-language model
7
+ - SiglipVisionTower: Vision encoder (standard SigLIP)
8
+ - Siglip2VisionTower: Vision encoder with NaFlex (variable token count)
9
+ - MLP Projector: Vision-to-language projection
10
+ """
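+
+ # Rough sketch of the data flow implemented in this module:
+ #   PIL image -> image processor -> pixel inputs (patches + attention mask for NaFlex)
+ #   pixel inputs -> vision tower (frozen SigLIP/SigLIP2, penultimate hidden states)
+ #   vision features -> mm_projector (MLP) -> language-model embedding space
+ #   projected features are spliced into the text embeddings wherever an
+ #   <image> placeholder (IMAGE_TOKEN_INDEX) appears in the prompt.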
11
+
12
+ import logging
13
+ import os
14
+ import re
15
+ import math
16
+ from abc import ABC, abstractmethod
17
+ from typing import List, Optional, Tuple, Union
18
+ from dataclasses import dataclass
19
+
20
+ import torch
21
+ import torch.nn as nn
22
+ from safetensors.torch import load_file
23
+
24
+ logger = logging.getLogger(__name__)
25
+ from transformers import (
26
+ AutoConfig,
27
+ AutoModelForCausalLM,
28
+ Phi3Config,
29
+ Phi3Model,
30
+ Phi3ForCausalLM,
31
+ SiglipVisionModel,
32
+ SiglipVisionConfig,
33
+ SiglipImageProcessor,
34
+ Siglip2VisionModel,
35
+ Siglip2VisionConfig,
36
+ BatchFeature,
37
+ )
38
+ from transformers.modeling_outputs import CausalLMOutputWithPast
39
+ from transformers.processing_utils import ImagesKwargs
40
+ import transformers.models.siglip2.image_processing_siglip2 as siglip2_ips
41
+
42
+
43
+ # =============================================================================
44
+ # Constants
45
+ # =============================================================================
46
+
47
+ IGNORE_INDEX = -100
48
+ IMAGE_TOKEN_INDEX = -200
49
+ DEFAULT_IMAGE_TOKEN = "<image>"
50
+
51
+
52
+ # =============================================================================
53
+ # Model Arguments (simplified dataclass for initialization)
54
+ # =============================================================================
55
+
56
+ @dataclass
57
+ class ModelArguments:
58
+ """Arguments for model initialization."""
59
+ vision_tower: Optional[str] = None
60
+ vision_tower_path: Optional[str] = None
61
+ mm_projector_type: str = "mlp2x_gelu"
62
+ pretrain_mm_mlp_adapter: Optional[str] = None
63
+ use_s2: bool = False
64
+ s2_scales: str = "384,768,1152"
65
+ hf_cache_dir: Optional[str] = None
66
+ # NaFlex-specific
67
+ min_num_patches: int = 256
68
+ max_num_patches: int = 3600
69
+ # Embedded vision config (to avoid network calls)
70
+ vision_config: Optional[dict] = None
71
+
72
+
73
+ # =============================================================================
74
+ # Vision Projector (MLP)
75
+ # =============================================================================
76
+
77
+ def build_vision_projector(config):
78
+ """Build vision-to-language projector based on config."""
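+ # For example, the default 'mlp2x_gelu' yields a two-layer projector:
+ #   Linear(mm_hidden_size -> hidden_size) -> GELU -> Linear(hidden_size -> hidden_size)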
79
+ projector_type = getattr(config, 'mm_projector_type', 'mlp2x_gelu')
80
+
81
+ if projector_type == 'linear':
82
+ return nn.Linear(config.mm_hidden_size, config.hidden_size)
83
+
84
+ elif projector_type.startswith('mlp'):
85
+ mlp_gelu_match = re.match(r'^mlp(\d+)x_gelu$', projector_type)
86
+ if mlp_gelu_match:
87
+ mlp_depth = int(mlp_gelu_match.group(1))
88
+ modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
89
+ for _ in range(1, mlp_depth):
90
+ modules.append(nn.GELU())
91
+ modules.append(nn.Linear(config.hidden_size, config.hidden_size))
92
+ return nn.Sequential(*modules)
93
+
94
+ elif projector_type == 'identity':
95
+ return nn.Identity()
96
+
97
+ raise ValueError(f'Unknown projector type: {projector_type}')
98
+
99
+
100
+ # =============================================================================
101
+ # Vision Encoders - SigLIP
102
+ # =============================================================================
103
+
104
+ class SiglipVisionTower(nn.Module):
105
+ """Standard SigLIP vision encoder with fixed token count."""
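+
+ # Features are read from hidden_states[-2] (select_layer = -2), the usual
+ # LLaVA-style choice of the penultimate encoder output, and the encoder weights
+ # are kept frozen (requires_grad_(False) in load_model).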
106
+
107
+ def __init__(self, vision_tower: str, args: ModelArguments = None, delay_load: bool = False):
108
+ super().__init__()
109
+
110
+ self.is_loaded = False
111
+ self.vision_tower_name = vision_tower
112
+ self.vision_tower_path = None
113
+ self.select_layer = -2
114
+
115
+ self.hf_hub_cache_dir = None
116
+ self.local_files_only = False
117
+
118
+ if args and getattr(args, 'hf_cache_dir', None):
119
+ self.hf_hub_cache_dir = args.hf_cache_dir
120
+ self.local_files_only = True
121
+
122
+ # Load or create vision config once (avoids network calls if embedded config provided)
123
+ vision_config_dict = getattr(args, "vision_config", None) if args else None
124
+ if vision_config_dict is not None:
125
+ self._vision_config = SiglipVisionConfig(**vision_config_dict)
126
+ else:
127
+ self._vision_config = SiglipVisionConfig.from_pretrained(
128
+ self.vision_tower_name,
129
+ local_files_only=self.local_files_only,
130
+ cache_dir=self.hf_hub_cache_dir,
131
+ )
132
+
133
+ if not delay_load:
134
+ self.load_model()
135
+
136
+ def load_model(self):
137
+ if self.is_loaded:
138
+ return
139
+
140
+ # Create image processor
141
+ self.image_processor = SiglipImageProcessor(
142
+ size={"height": self._vision_config.image_size, "width": self._vision_config.image_size},
143
+ )
144
+ self.image_processor.crop_size = self.image_processor.size
145
+
146
+ vision_tower_path = self.vision_tower_path if self.vision_tower_path else self.vision_tower_name
147
+ self.vision_tower = SiglipVisionModel.from_pretrained(
148
+ vision_tower_path,
149
+ config=self._vision_config,
150
+ local_files_only=self.local_files_only,
151
+ cache_dir=self.hf_hub_cache_dir,
152
+ )
153
+
154
+ self.vision_tower.requires_grad_(False)
155
+ self.is_loaded = True
156
+
157
+ def feature_select(self, image_forward_outs):
158
+ return image_forward_outs.hidden_states[self.select_layer]
159
+
160
+ def forward(self, images):
161
+ if isinstance(images, list):
162
+ image_features = []
163
+ for image in images:
164
+ image_forward_out = self.vision_tower(
165
+ image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
166
+ output_hidden_states=True
167
+ )
168
+ image_feature = self.feature_select(image_forward_out).to(image.dtype)
169
+ image_features.append(image_feature)
170
+ else:
171
+ image_forward_outs = self.vision_tower(
172
+ images.to(device=self.device, dtype=self.dtype),
173
+ output_hidden_states=True
174
+ )
175
+ image_features = self.feature_select(image_forward_outs).to(images.dtype)
176
+
177
+ return image_features
178
+
179
+ @property
180
+ def dummy_feature(self):
181
+ return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
182
+
183
+ @property
184
+ def dtype(self):
185
+ return self.vision_tower.dtype
186
+
187
+ @property
188
+ def device(self):
189
+ return self.vision_tower.device
190
+
191
+ @property
192
+ def config(self):
193
+ return self.vision_tower.config if self.is_loaded else self._vision_config
194
+
195
+ @property
196
+ def hidden_size(self):
197
+ return self.config.hidden_size
198
+
199
+ @property
200
+ def num_patches(self):
201
+ return (self.config.image_size // self.config.patch_size) ** 2
202
+
203
+
204
+ # =============================================================================
205
+ # Vision Encoders - SigLIP2 with NaFlex (variable token count)
206
+ # =============================================================================
207
+
208
+ class Siglip2ImageProcessorKwargsNoUpscale(ImagesKwargs, total=False):
209
+ patch_size: int
210
+ max_num_patches: int
211
+ min_num_patches: int
212
+
213
+
214
+ class Siglip2ImageProcessorNoUpscale(siglip2_ips.Siglip2ImageProcessor):
215
+ """Custom SigLIP2 image processor that doesn't upscale small images."""
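+
+ # Unlike the stock Siglip2ImageProcessor, an image whose native patch count already
+ # falls inside [min_num_patches, max_num_patches] keeps roughly its original
+ # resolution (dimensions snapped to multiples of patch_size) instead of being
+ # rescaled. E.g. a 640x480 image with patch_size=16 gives (640//16) * (480//16)
+ # = 40 * 30 = 1200 patches, which sits inside the default range [256, 3600].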
216
+
217
+ model_input_names = ["pixel_values", "pixel_attention_mask", "spatial_shapes"]
218
+ valid_kwargs = Siglip2ImageProcessorKwargsNoUpscale
219
+
220
+ def __init__(
221
+ self,
222
+ do_resize: bool = True,
223
+ resample = siglip2_ips.PILImageResampling.BILINEAR,
224
+ do_rescale: bool = True,
225
+ rescale_factor: float = 1 / 255,
226
+ do_normalize: bool = True,
227
+ image_mean: Optional[Union[float, List[float]]] = None,
228
+ image_std: Optional[Union[float, List[float]]] = None,
229
+ do_convert_rgb: Optional[bool] = None,
230
+ patch_size: int = 16,
231
+ max_num_patches: int = 256,
232
+ min_num_patches: int = 1,
233
+ **kwargs,
234
+ ):
235
+ super().__init__(**kwargs)
236
+
237
+ image_mean = image_mean if image_mean is not None else [0.5, 0.5, 0.5]
238
+ image_std = image_std if image_std is not None else [0.5, 0.5, 0.5]
239
+
240
+ self.do_resize = do_resize
241
+ self.resample = resample
242
+ self.do_rescale = do_rescale
243
+ self.rescale_factor = rescale_factor
244
+ self.do_normalize = do_normalize
245
+ self.image_mean = image_mean
246
+ self.image_std = image_std
247
+ self.do_convert_rgb = do_convert_rgb
248
+ self.patch_size = patch_size
249
+ self.max_num_patches = max_num_patches
250
+ self.min_num_patches = min_num_patches
251
+
252
+ @siglip2_ips.filter_out_non_signature_kwargs()
253
+ def preprocess(
254
+ self,
255
+ images,
256
+ resample=None,
257
+ do_rescale: Optional[bool] = None,
258
+ rescale_factor: Optional[float] = None,
259
+ do_normalize: Optional[bool] = None,
260
+ image_mean: Optional[Union[float, List[float]]] = None,
261
+ image_std: Optional[Union[float, List[float]]] = None,
262
+ return_tensors=None,
263
+ input_data_format=None,
264
+ do_convert_rgb: Optional[bool] = None,
265
+ patch_size: Optional[int] = None,
266
+ max_num_patches: Optional[int] = None,
267
+ min_num_patches: Optional[int] = None,
268
+ ):
269
+ resample = resample if resample is not None else self.resample
270
+ do_rescale = do_rescale if do_rescale is not None else self.do_rescale
271
+ rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
272
+ do_normalize = do_normalize if do_normalize is not None else self.do_normalize
273
+ image_mean = image_mean if image_mean is not None else self.image_mean
274
+ image_std = image_std if image_std is not None else self.image_std
275
+ do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
276
+ patch_size = patch_size if patch_size is not None else self.patch_size
277
+ max_num_patches = max_num_patches if max_num_patches is not None else self.max_num_patches
278
+ min_num_patches = min_num_patches if min_num_patches is not None else self.min_num_patches
279
+
280
+ data_format = siglip2_ips.ChannelDimension.LAST
281
+
282
+ try:
283
+ images = self.fetch_images(images)
284
+ except TypeError:
285
+ pass
286
+ images = siglip2_ips.make_flat_list_of_images(images)
287
+
288
+ if not siglip2_ips.valid_images(images):
289
+ raise ValueError("Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, or torch.Tensor")
290
+
291
+ siglip2_ips.validate_preprocess_arguments(
292
+ do_rescale=do_rescale,
293
+ rescale_factor=rescale_factor,
294
+ do_normalize=do_normalize,
295
+ image_mean=image_mean,
296
+ image_std=image_std,
297
+ )
298
+
299
+ if do_convert_rgb:
300
+ images = [siglip2_ips.convert_to_rgb(image) for image in images]
301
+
302
+ images = [siglip2_ips.to_numpy_array(image) for image in images]
303
+
304
+ if input_data_format is None:
305
+ input_data_format = siglip2_ips.infer_channel_dimension_format(images[0])
306
+
307
+ pixel_masks = []
308
+ pixel_values = []
309
+ spatial_shapes = []
310
+
311
+ for image in images:
312
+ image = siglip2_ips.to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
313
+
314
+ num_patches = max((image.shape[1] // patch_size) * (image.shape[0] // patch_size), 1)
315
+
316
+ # Resize only if image is too large/small
317
+ if num_patches < min_num_patches:
318
+ height, width = siglip2_ips.get_image_size_for_max_num_patches(
319
+ image_height=image.shape[0],
320
+ image_width=image.shape[1],
321
+ patch_size=patch_size,
322
+ max_num_patches=min_num_patches,
323
+ )
324
+ elif num_patches > max_num_patches:
325
+ height, width = siglip2_ips.get_image_size_for_max_num_patches(
326
+ image_height=image.shape[0],
327
+ image_width=image.shape[1],
328
+ patch_size=patch_size,
329
+ max_num_patches=max_num_patches,
330
+ )
331
+ else:
332
+ height, width = siglip2_ips.get_image_size_for_max_num_patches(
333
+ image_height=image.shape[0],
334
+ image_width=image.shape[1],
335
+ patch_size=patch_size,
336
+ max_num_patches=num_patches,
337
+ )
338
+
339
+ image = siglip2_ips.resize(image=image, size=(height, width), resample=resample, input_data_format=data_format)
340
+
341
+ if do_rescale:
342
+ image = self.rescale(image=image, scale=rescale_factor, input_data_format=data_format)
343
+
344
+ if do_normalize:
345
+ image = self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=data_format)
346
+
347
+ patches = siglip2_ips.convert_image_to_patches(image, patch_size)
348
+ patches, mask = siglip2_ips.pad_along_first_dim(patches, max_num_patches)
349
+ num_patches_height = image.shape[0] // patch_size
350
+ num_patches_width = image.shape[1] // patch_size
351
+
352
+ spatial_shapes.append((num_patches_height, num_patches_width))
353
+ pixel_values.append(patches)
354
+ pixel_masks.append(mask)
355
+
356
+ return siglip2_ips.BatchFeature(
357
+ data={
358
+ "pixel_values": pixel_values,
359
+ "pixel_attention_mask": pixel_masks,
360
+ "spatial_shapes": spatial_shapes,
361
+ },
362
+ tensor_type=return_tensors,
363
+ )
364
+
365
+
366
+ class Siglip2VisionTower(nn.Module):
367
+ """SigLIP2 vision encoder with NaFlex (variable token count per image)."""
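+
+ # Each image can contribute a different number of tokens: the processor returns a
+ # per-image pixel_attention_mask, and forward() drops the padded positions so only
+ # real patch embeddings are handed to the projector and the language model.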
368
+
369
+ def __init__(self, vision_tower: str, args: ModelArguments = None, delay_load: bool = False):
370
+ super().__init__()
371
+
372
+ self.is_loaded = False
373
+ self.vision_tower_name = vision_tower
374
+ self.vision_tower_path = None
375
+ self.select_layer = -2
376
+
377
+ self.hf_hub_cache_dir = None
378
+ self.local_files_only = False
379
+
380
+ self.min_num_patches = getattr(args, "min_num_patches", 256) if args else 256
381
+ self.max_num_patches = getattr(args, "max_num_patches", 3600) if args else 3600
382
+
383
+ if args and getattr(args, 'hf_cache_dir', None):
384
+ self.hf_hub_cache_dir = args.hf_cache_dir
385
+ self.local_files_only = True
386
+
387
+ # Load or create vision config once (avoids network calls if embedded config provided)
388
+ vision_config_dict = getattr(args, "vision_config", None) if args else None
389
+ if vision_config_dict is not None:
390
+ # Infer patch_size from model name if not in config
391
+ if 'patch_size' not in vision_config_dict:
392
+ if 'patch14' in self.vision_tower_name.lower():
393
+ vision_config_dict['patch_size'] = 14
394
+ else:
395
+ vision_config_dict['patch_size'] = 16 # default for patch16-naflex
396
+ self._vision_config = Siglip2VisionConfig(**vision_config_dict)
397
+ else:
398
+ self._vision_config = Siglip2VisionConfig.from_pretrained(
399
+ self.vision_tower_name,
400
+ local_files_only=self.local_files_only,
401
+ cache_dir=self.hf_hub_cache_dir,
402
+ )
403
+
404
+ if not delay_load:
405
+ self.load_model()
406
+
407
+ def load_model(self, skip_weights: bool = False):
408
+ """Load the vision tower model.
409
+
410
+ Args:
411
+ skip_weights: If True, only load the architecture without pretrained weights.
412
+ Useful when weights will be loaded from a checkpoint later.
413
+ """
414
+ if self.is_loaded:
415
+ return
416
+
417
+ # Create image processor
418
+ self.image_processor = Siglip2ImageProcessorNoUpscale(
419
+ patch_size=self._vision_config.patch_size,
420
+ max_num_patches=self.max_num_patches,
421
+ min_num_patches=self.min_num_patches,
422
+ )
423
+
424
+ if skip_weights:
425
+ # Load architecture only, no pretrained weights (will load from checkpoint)
426
+ self.vision_tower = Siglip2VisionModel(self._vision_config)
427
+ logger.info("Vision tower initialized without pretrained weights (will load from checkpoint).")
428
+ else:
429
+ vision_tower_path = self.vision_tower_path if self.vision_tower_path else self.vision_tower_name
430
+ self.vision_tower = Siglip2VisionModel.from_pretrained(
431
+ vision_tower_path,
432
+ config=self._vision_config,
433
+ local_files_only=self.local_files_only,
434
+ cache_dir=self.hf_hub_cache_dir,
435
+ )
436
+
437
+ self.vision_tower.config.min_num_patches = self.min_num_patches
438
+ self.vision_tower.config.max_num_patches = self.max_num_patches
439
+
440
+ self.vision_tower.requires_grad_(False)
441
+ self.is_loaded = True
442
+
443
+ def feature_select(self, image_forward_outs):
444
+ return image_forward_outs.hidden_states[self.select_layer]
445
+
446
+ def forward(self, images):
447
+ if isinstance(images, (dict, BatchFeature)):
448
+ images = {
449
+ "pixel_values": images["pixel_values"].to(device=self.device, dtype=self.dtype),
450
+ "pixel_attention_mask": images["pixel_attention_mask"].to(device=self.device, dtype=self.dtype),
451
+ "spatial_shapes": images["spatial_shapes"].cpu().numpy(),
452
+ }
453
+ images_forward_out = self.vision_tower(**images, output_hidden_states=True)
454
+ image_features = self.feature_select(images_forward_out).to(self.dtype)
455
+ # Remove pad tokens
456
+ image_features = [
457
+ feat[images["pixel_attention_mask"][j].bool()]
458
+ for j, feat in enumerate(image_features)
459
+ ]
460
+
461
+ elif isinstance(images, list):
462
+ image_features = []
463
+ for image in images:
464
+ image = {
465
+ "pixel_values": image["pixel_values"].to(device=self.device, dtype=self.dtype),
466
+ "pixel_attention_mask": image["pixel_attention_mask"].to(device=self.device, dtype=self.dtype),
467
+ "spatial_shapes": image["spatial_shapes"].cpu().numpy(),
468
+ }
469
+ image_forward_out = self.vision_tower(**image, output_hidden_states=True)
470
+ image_feature = self.feature_select(image_forward_out).to(self.dtype)
471
+ image_feature = [
472
+ feat[image["pixel_attention_mask"][j].bool()]
473
+ for j, feat in enumerate(image_feature)
474
+ ]
475
+ image_features.append(image_feature)
476
+ else:
477
+ raise ValueError(f"Unsupported image type: {type(images)}")
478
+
479
+ return image_features
480
+
481
+ @property
482
+ def dummy_feature(self):
483
+ return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
484
+
485
+ @property
486
+ def dtype(self):
487
+ return self.vision_tower.dtype
488
+
489
+ @property
490
+ def device(self):
491
+ return self.vision_tower.device
492
+
493
+ @property
494
+ def config(self):
495
+ return self.vision_tower.config if self.is_loaded else self._vision_config
496
+
497
+ @property
498
+ def hidden_size(self):
499
+ return self.config.hidden_size
500
+
501
+
502
+ # =============================================================================
503
+ # Vision Tower Builder
504
+ # =============================================================================
505
+
506
+ def build_vision_tower(config, delay_load: bool = False):
507
+ """Build the appropriate vision tower based on config."""
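+ # Vision tower names containing "naflex" select the variable-token Siglip2VisionTower;
+ # any other SigLIP checkpoint name selects the fixed-token SiglipVisionTower.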
508
+ vision_tower = getattr(config, 'mm_vision_tower', getattr(config, 'vision_tower', None))
509
+
510
+ if vision_tower is None:
511
+ return None
512
+
513
+ # Create a minimal args object from config
514
+ args = ModelArguments(
515
+ vision_tower=vision_tower,
516
+ hf_cache_dir=getattr(config, 'hf_cache_dir', None),
517
+ min_num_patches=getattr(config, 'min_num_patches', 256),
518
+ max_num_patches=getattr(config, 'max_num_patches', 3600),
519
+ vision_config=getattr(config, 'vision_config', None),
520
+ )
521
+
522
+ if 'siglip' in vision_tower.lower():
523
+ if 'naflex' in vision_tower.lower():
524
+ return Siglip2VisionTower(vision_tower, args=args, delay_load=delay_load)
525
+ else:
526
+ return SiglipVisionTower(vision_tower, args=args, delay_load=delay_load)
527
+
528
+ raise ValueError(f'Unknown vision tower: {vision_tower}. Only SigLIP variants are supported.')
529
+
530
+
531
+ # =============================================================================
532
+ # Configuration
533
+ # =============================================================================
534
+
535
+ class Phi4VisionR(Phi3Config):
536
+ """Configuration for Phi4-Siglip model."""
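+ # Extends Phi3Config with the multimodal fields below; registered with AutoConfig
+ # under model_type "phi4-siglip" at the bottom of this file.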
537
+ model_type = "phi4-siglip"
538
+
539
+ def __init__(
540
+ self,
541
+ mm_vision_tower: Optional[str] = None,
542
+ mm_projector_type: str = "mlp2x_gelu",
543
+ mm_hidden_size: int = 1152,
544
+ min_num_patches: int = 256,
545
+ max_num_patches: int = 3600,
546
+ vision_config: Optional[dict] = None,
547
+ **kwargs
548
+ ):
549
+ super().__init__(**kwargs)
550
+ self.mm_vision_tower = mm_vision_tower
551
+ self.mm_projector_type = mm_projector_type
552
+ self.mm_hidden_size = mm_hidden_size
553
+ self.min_num_patches = min_num_patches
554
+ self.max_num_patches = max_num_patches
555
+ self.vision_config = vision_config
556
+
557
+
558
+ # =============================================================================
559
+ # Base Model with Vision Integration
560
+ # =============================================================================
561
+
562
+ class Phi4VisionRModel(Phi3Model):
563
+ """Phi3 model with vision tower and projector."""
564
+ config_class = Phi4VisionR
565
+
566
+ def __init__(self, config: Phi4VisionR):
567
+ super().__init__(config)
568
+
569
+ if hasattr(config, "mm_vision_tower") and config.mm_vision_tower:
570
+ self.vision_tower = build_vision_tower(config, delay_load=not getattr(config, 'continuous_training', False))
571
+ if getattr(config, 'continuous_training', False):
572
+ config.continuous_training = False
573
+ self.mm_projector = build_vision_projector(config)
574
+
575
+ def get_vision_tower(self):
576
+ vision_tower = getattr(self, 'vision_tower', None)
577
+ if isinstance(vision_tower, list):
578
+ vision_tower = vision_tower[0]
579
+ return vision_tower
580
+
581
+ def initialize_vision_modules(self, model_args: ModelArguments):
582
+ """Initialize vision tower and projector from model arguments."""
583
+ vision_tower_name = model_args.vision_tower
584
+
585
+ self.config.mm_vision_tower = vision_tower_name
586
+
587
+ if self.get_vision_tower() is None:
588
+ vision_tower = build_vision_tower(model_args)
589
+ self.vision_tower = vision_tower
590
+ else:
591
+ vision_tower = self.vision_tower
592
+ if model_args.vision_tower_path:
593
+ vision_tower.vision_tower_path = model_args.vision_tower_path
594
+ vision_tower.load_model()
595
+
596
+ self.config.use_mm_proj = True
597
+ self.config.mm_projector_type = model_args.mm_projector_type
598
+ self.config.mm_hidden_size = vision_tower.hidden_size
599
+
600
+ if getattr(self, 'mm_projector', None) is None:
601
+ self.mm_projector = build_vision_projector(self.config)
602
+
603
+ # Ensure projector is trainable
604
+ for p in self.mm_projector.parameters():
605
+ p.requires_grad = True
606
+
607
+ # Load pretrained projector weights if provided
608
+ if model_args.pretrain_mm_mlp_adapter is not None:
609
+ mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location='cpu')
610
+
611
+ def get_w(weights, keyword):
612
+ return {k.split(keyword + '.')[1]: v for k, v in weights.items() if keyword in k}
613
+
614
+ self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector'))
615
+
616
+
617
+ # =============================================================================
618
+ # Causal LM with Multimodal Support
619
+ # =============================================================================
620
+
621
+ class Phi4ForCausalLMV(Phi3ForCausalLM):
622
+ """Phi4-Siglip model for causal language modeling with vision support."""
623
+ config_class = Phi4VisionR
624
+
625
+ # Tell transformers to not warn about vision tower weights - we load them separately
626
+ _keys_to_ignore_on_load_unexpected = [r"model\.vision_tower\.vision_tower\..*"]
627
+
628
+ def __init__(self, config: Phi4VisionR):
629
+ super(Phi3ForCausalLM, self).__init__(config)
630
+ self.model = Phi4VisionRModel(config)
631
+ self.vocab_size = config.vocab_size
632
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
633
+ self.post_init()
634
+
635
+ def get_model(self):
636
+ return self.model
637
+
638
+ def get_vision_tower(self):
639
+ return self.get_model().get_vision_tower()
640
+
641
+ def encode_images(self, images):
642
+ """Encode images through vision tower and projector."""
643
+ image_features = self.get_model().get_vision_tower()(images)
644
+
645
+ # Handle dynamic tokens (NaFlex)
646
+ if isinstance(image_features, list) and isinstance(image_features[0], list):
647
+ image_features = [
648
+ [self.get_model().mm_projector(image) for image in batch]
649
+ for batch in image_features
650
+ ]
651
+ elif isinstance(image_features, list):
652
+ image_features = [self.get_model().mm_projector(image) for image in image_features]
653
+ else:
654
+ image_features = self.get_model().mm_projector(image_features)
655
+
656
+ return image_features
657
+
658
+ def prepare_inputs_labels_for_multimodal(
659
+ self, input_ids, position_ids, attention_mask, past_key_values, labels, images
660
+ ):
661
+ """
662
+ Prepare inputs by replacing image tokens with actual image embeddings.
663
+
664
+ This is the core multimodal integration logic that:
665
+ 1. Encodes images through the vision tower
666
+ 2. Finds IMAGE_TOKEN_INDEX positions in input_ids
667
+ 3. Replaces those positions with image embeddings
668
+ 4. Handles padding and attention masks
669
+ """
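+ # Illustrative example (token ids here are hypothetical): for
+ #   input_ids = [bos, 1037, IMAGE_TOKEN_INDEX, 2572]
+ # and one image that encodes to N feature vectors, the returned embedding
+ # sequence is [E(bos), E(1037), img_0, ..., img_{N-1}, E(2572)], and the N
+ # image positions are labelled IGNORE_INDEX so they contribute no LM loss.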
670
+ vision_tower = self.get_vision_tower()
671
+
672
+ if vision_tower is None or images is None or input_ids.shape[1] == 1:
673
+ # Handle KV cache case during generation
674
+ if past_key_values is not None and vision_tower is not None and images is not None and input_ids.shape[1] == 1:
675
+ target_shape = past_key_values[-1][-1].shape[-2] + 1
676
+ attention_mask = torch.cat((
677
+ attention_mask,
678
+ torch.ones(
679
+ (attention_mask.shape[0], target_shape - attention_mask.shape[1]),
680
+ dtype=attention_mask.dtype,
681
+ device=attention_mask.device
682
+ )
683
+ ), dim=1)
684
+ position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
685
+ return input_ids, position_ids, attention_mask, past_key_values, None, labels
686
+
687
+ # Encode images
688
+ if (isinstance(images, torch.Tensor) and images.ndim == 5) or \
689
+ (isinstance(images, list) and isinstance(images[0], torch.Tensor)):
690
+ images = torch.cat([image for image in images], dim=0)
691
+ image_features = self.encode_images(images).to(self.device)
692
+ elif isinstance(images, list) and isinstance(images[0], (dict, BatchFeature)):
693
+ # NaFlex case
694
+ image_features = self.encode_images(images)
695
+ image_features = [image.to(self.device) for batch in image_features for image in batch]
696
+ elif isinstance(images, (dict, BatchFeature)):
697
+ image_features = self.encode_images(images)
698
+ image_features = [image.to(self.device) for image in image_features]
699
+ else:
700
+ image_features = self.encode_images(images).to(self.device)
701
+
702
+ # Store original values
703
+ _labels = labels
704
+ _position_ids = position_ids
705
+ _attention_mask = attention_mask
706
+
707
+ # Create defaults if not provided
708
+ if attention_mask is None:
709
+ attention_mask = torch.ones_like(input_ids, dtype=torch.bool)
710
+ else:
711
+ attention_mask = attention_mask.bool()
712
+ if position_ids is None:
713
+ position_ids = torch.arange(0, input_ids.shape[1], dtype=torch.long, device=input_ids.device)
714
+ if labels is None:
715
+ labels = torch.full_like(input_ids, IGNORE_INDEX)
716
+
717
+ input_ids_temp = input_ids
718
+
719
+ # Remove padding using attention_mask
720
+ input_ids = [cur_input_ids[cur_attention_mask] for cur_input_ids, cur_attention_mask in
721
+ zip(input_ids, attention_mask)]
722
+ labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)]
723
+
724
+ # Replace IMAGE_TOKEN_INDEX with 0 for compatibility
725
+ input_ids_temp[input_ids_temp == IMAGE_TOKEN_INDEX] = 0
726
+
727
+ new_input_embeds = []
728
+ new_labels = []
729
+ cur_image_idx = 0
730
+
731
+ for batch_idx, cur_input_ids in enumerate(input_ids):
732
+ num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()
733
+
734
+ if num_images == 0:
735
+ # No image tokens - just embed text
736
+ cur_image_features = image_features[cur_image_idx]
737
+ cur_input_embeds_1 = self.get_model().embed_tokens(cur_input_ids)
738
+ cur_input_embeds = torch.cat([cur_input_embeds_1, cur_image_features[0:0]], dim=0)
739
+ new_input_embeds.append(cur_input_embeds)
740
+ new_labels.append(labels[batch_idx])
741
+ cur_image_idx += 1
742
+ continue
743
+
744
+ # Find image token positions
745
+ image_token_indices = [-1] + torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist() + [
746
+ cur_input_ids.shape[0]]
747
+
748
+ cur_input_ids_noim = []
749
+ cur_labels = labels[batch_idx]
750
+ cur_labels_noim = []
751
+
752
+ # Split by image tokens
753
+ for i in range(len(image_token_indices) - 1):
754
+ cur_input_ids_noim.append(cur_input_ids[image_token_indices[i] + 1:image_token_indices[i + 1]])
755
+ cur_labels_noim.append(cur_labels[image_token_indices[i] + 1:image_token_indices[i + 1]])
756
+
757
+ split_sizes = [x.shape[0] for x in cur_labels_noim]
758
+ cur_input_embeds = self.get_model().embed_tokens(torch.cat(cur_input_ids_noim))
759
+ cur_input_embeds_no_im = torch.split(cur_input_embeds, split_sizes, dim=0)
760
+
761
+ cur_new_input_embeds = []
762
+ cur_new_labels = []
763
+
764
+ # Interleave text and image embeddings
765
+ for i in range(num_images + 1):
766
+ cur_new_input_embeds.append(cur_input_embeds_no_im[i])
767
+ cur_new_labels.append(cur_labels_noim[i])
768
+ if i < num_images:
769
+ cur_image_features = image_features[cur_image_idx]
770
+ cur_image_idx += 1
771
+ cur_new_input_embeds.append(cur_image_features)
772
+ cur_new_labels.append(
773
+ torch.full(
774
+ (cur_image_features.shape[0],),
775
+ IGNORE_INDEX,
776
+ device=cur_labels.device,
777
+ dtype=cur_labels.dtype
778
+ )
779
+ )
780
+
781
+ cur_new_input_embeds = torch.cat(cur_new_input_embeds)
782
+ cur_new_labels = torch.cat(cur_new_labels)
783
+
784
+ new_input_embeds.append(cur_new_input_embeds)
785
+ new_labels.append(cur_new_labels)
786
+
787
+ # Truncate to max length
788
+ tokenizer_model_max_length = getattr(self.config, 'tokenizer_model_max_length', None)
789
+ if tokenizer_model_max_length is not None:
790
+ new_input_embeds = [x[:tokenizer_model_max_length] for x in new_input_embeds]
791
+ new_labels = [x[:tokenizer_model_max_length] for x in new_labels]
792
+
793
+ # Pad sequences to same length
794
+ max_len = max(x.shape[0] for x in new_input_embeds)
795
+ batch_size = len(new_input_embeds)
796
+
797
+ new_input_embeds_padded = []
798
+ new_labels_padded = torch.full(
799
+ (batch_size, max_len), IGNORE_INDEX,
800
+ dtype=new_labels[0].dtype, device=new_labels[0].device
801
+ )
802
+ attention_mask = torch.zeros(
803
+ (batch_size, max_len),
804
+ dtype=attention_mask.dtype, device=attention_mask.device
805
+ )
806
+ position_ids = torch.zeros(
807
+ (batch_size, max_len),
808
+ dtype=position_ids.dtype, device=position_ids.device
809
+ )
810
+
811
+ for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)):
812
+ cur_len = cur_new_embed.shape[0]
813
+ padding_side = getattr(self.config, 'tokenizer_padding_side', 'right')
814
+
815
+ if padding_side == "left":
816
+ new_input_embeds_padded.append(torch.cat((
817
+ torch.zeros(
818
+ (max_len - cur_len, cur_new_embed.shape[1]),
819
+ dtype=cur_new_embed.dtype, device=cur_new_embed.device
820
+ ),
821
+ cur_new_embed
822
+ ), dim=0))
823
+ if cur_len > 0:
824
+ new_labels_padded[i, -cur_len:] = cur_new_labels
825
+ attention_mask[i, -cur_len:] = True
826
+ position_ids[i, -cur_len:] = torch.arange(
827
+ 0, cur_len, dtype=position_ids.dtype, device=position_ids.device
828
+ )
829
+ else:
830
+ new_input_embeds_padded.append(torch.cat((
831
+ cur_new_embed,
832
+ torch.zeros(
833
+ (max_len - cur_len, cur_new_embed.shape[1]),
834
+ dtype=cur_new_embed.dtype, device=cur_new_embed.device
835
+ )
836
+ ), dim=0))
837
+ if cur_len > 0:
838
+ new_labels_padded[i, :cur_len] = cur_new_labels
839
+ attention_mask[i, :cur_len] = True
840
+ position_ids[i, :cur_len] = torch.arange(
841
+ 0, cur_len, dtype=position_ids.dtype, device=position_ids.device
842
+ )
843
+
844
+ new_input_embeds = torch.stack(new_input_embeds_padded, dim=0)
845
+
846
+ # Restore None values if originally None
847
+ new_labels = None if _labels is None else new_labels_padded
848
+ attention_mask = None if _attention_mask is None else attention_mask.to(dtype=_attention_mask.dtype)
849
+ position_ids = None if _position_ids is None else position_ids
850
+
851
+ return None, position_ids, attention_mask, past_key_values, new_input_embeds, new_labels
852
+
853
+ def forward(
854
+ self,
855
+ input_ids: torch.LongTensor = None,
856
+ attention_mask: Optional[torch.Tensor] = None,
857
+ position_ids: Optional[torch.LongTensor] = None,
858
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
859
+ inputs_embeds: Optional[torch.FloatTensor] = None,
860
+ labels: Optional[torch.LongTensor] = None,
861
+ use_cache: Optional[bool] = None,
862
+ output_attentions: Optional[bool] = None,
863
+ output_hidden_states: Optional[bool] = None,
864
+ images: Optional[torch.FloatTensor] = None,
865
+ pixel_values: Optional[torch.FloatTensor] = None,
866
+ pixel_attention_mask: Optional[torch.Tensor] = None,
867
+ spatial_shapes: Optional[torch.Tensor] = None,
868
+ return_dict: Optional[bool] = None,
869
+ cache_position: Optional[torch.LongTensor] = None,
870
+ logits_to_keep: Union[int, torch.Tensor] = 0,
871
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
872
+
873
+ # Accept processor output format (pixel_values, pixel_attention_mask, spatial_shapes)
874
+ if images is None and pixel_values is not None:
875
+ images = BatchFeature({
876
+ "pixel_values": pixel_values,
877
+ "pixel_attention_mask": pixel_attention_mask,
878
+ "spatial_shapes": spatial_shapes,
879
+ })
880
+
881
+ if inputs_embeds is None:
882
+ (
883
+ input_ids,
884
+ position_ids,
885
+ attention_mask,
886
+ past_key_values,
887
+ inputs_embeds,
888
+ labels
889
+ ) = self.prepare_inputs_labels_for_multimodal(
890
+ input_ids,
891
+ position_ids,
892
+ attention_mask,
893
+ past_key_values,
894
+ labels,
895
+ images
896
+ )
897
+
898
+ return super().forward(
899
+ input_ids=input_ids,
900
+ attention_mask=attention_mask,
901
+ position_ids=position_ids,
902
+ past_key_values=past_key_values,
903
+ inputs_embeds=inputs_embeds,
904
+ labels=labels,
905
+ use_cache=use_cache,
906
+ output_attentions=output_attentions,
907
+ output_hidden_states=output_hidden_states,
908
+ return_dict=return_dict,
909
+ cache_position=cache_position,
910
+ logits_to_keep=logits_to_keep
911
+ )
912
+
913
+ def prepare_inputs_for_generation(
914
+ self, input_ids, past_key_values=None, inputs_embeds=None, attention_mask=None, **kwargs
915
+ ):
916
+ images = kwargs.pop("images", None)
917
+
918
+ # Also accept processor output format (pixel_values, pixel_attention_mask, spatial_shapes)
919
+ pixel_values = kwargs.pop("pixel_values", None)
920
+ pixel_attention_mask = kwargs.pop("pixel_attention_mask", None)
921
+ spatial_shapes = kwargs.pop("spatial_shapes", None)
922
+
923
+ # If processor output format is provided, package as BatchFeature for the model
924
+ if images is None and pixel_values is not None:
925
+ images = BatchFeature({
926
+ "pixel_values": pixel_values,
927
+ "pixel_attention_mask": pixel_attention_mask,
928
+ "spatial_shapes": spatial_shapes,
929
+ })
930
+
931
+ _inputs = super().prepare_inputs_for_generation(
932
+ input_ids,
933
+ past_key_values=past_key_values,
934
+ inputs_embeds=inputs_embeds,
935
+ attention_mask=attention_mask,
936
+ **kwargs
937
+ )
938
+
939
+ if images is not None:
940
+ _inputs['images'] = images
941
+ return _inputs
942
+
943
+ @classmethod
944
+ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
945
+ """Load model from pretrained weights."""
946
+ # Extract dtype before passing to super() since we need it later
947
+ torch_dtype = kwargs.get("torch_dtype", None)
948
+
949
+ # Check if loading from local checkpoint that contains vision tower weights
950
+ load_vision_from_checkpoint = False
951
+ if os.path.isdir(pretrained_model_name_or_path):
952
+ for file_name in os.listdir(pretrained_model_name_or_path):
953
+ if file_name.endswith("safetensors"):
954
+ fpath = os.path.join(pretrained_model_name_or_path, file_name)
955
+ shard_weights = load_file(fpath)
956
+ if any(k.startswith("model.vision_tower.vision_tower.") for k in shard_weights.keys()):
957
+ load_vision_from_checkpoint = True
958
+ logger.info("Detected vision tower weights in checkpoint - will skip downloading from HuggingFace.")
959
+ break
960
+
961
+ model = super().from_pretrained(pretrained_model_name_or_path, **kwargs)
962
+
963
+ vision_tower = model.get_vision_tower()
964
+
965
+ # Load vision weights if model is a local path
966
+ if vision_tower is not None:
967
+ if not vision_tower.is_loaded:
968
+ # Skip downloading pretrained weights if we'll load from checkpoint
969
+ vision_tower.load_model(skip_weights=load_vision_from_checkpoint)
970
+
971
+ if load_vision_from_checkpoint:
972
+ try:
973
+ vision_weights = {}
974
+ for file_name in os.listdir(pretrained_model_name_or_path):
975
+ if file_name.endswith("safetensors"):
976
+ fpath = os.path.join(pretrained_model_name_or_path, file_name)
977
+ shard_weights = load_file(fpath)
978
+
979
+ # Handle weights with prefix "model.vision_tower.vision_tower."
980
+ # (the nested vision_tower is the actual encoder)
981
+ prefix_nested = "model.vision_tower.vision_tower."
982
+ prefix_simple = "model.vision_tower."
983
+
984
+ for k, v in shard_weights.items():
985
+ if k.startswith(prefix_nested):
986
+ # Strip to get "vision_tower.xxx"
987
+ new_key = k[len("model.vision_tower."):]
988
+ vision_weights[new_key] = v
989
+ elif k.startswith(prefix_simple) and not k.startswith(prefix_nested):
990
+ # Direct vision_tower weights (like image_processor params if saved)
991
+ new_key = k[len(prefix_simple):]
992
+ vision_weights[new_key] = v
993
+
994
+ if vision_weights:
995
+ vision_tower.load_state_dict(vision_weights, strict=False)
996
+ logger.info("Vision tower weights loaded from checkpoint.")
997
+ else:
998
+ logger.warning("No vision tower weights found in checkpoint!")
999
+ except Exception as e:
1000
+ logger.warning(
1001
+ "Vision tower weights NOT loaded from checkpoint. "
1002
+ f"Exception: {e}"
1003
+ )
1004
+
1005
+ vision_tower.to(model.device)
1006
+
1007
+ # Sync dtype
1008
+ dtype = torch_dtype if torch_dtype is not None else model.dtype
1009
+ dtype = model.dtype if dtype == "auto" else dtype
1010
+ model.to(dtype)
1011
+
1012
+ # Fix generation config
1013
+ if isinstance(model.generation_config.eos_token_id, (list, set)):
1014
+ model.generation_config.eos_token_id = model.generation_config.eos_token_id[0]
1015
+ if model.generation_config.pad_token_id is None:
1016
+ model.generation_config.pad_token_id = model.generation_config.eos_token_id
1017
+
1018
+ return model
1019
+
1020
+
1021
+ # =============================================================================
1022
+ # Register with AutoConfig/AutoModel
1023
+ # =============================================================================
1024
+
1025
+ AutoConfig.register("phi4-siglip", Phi4VisionR)
1026
+ AutoModelForCausalLM.register(Phi4VisionR, Phi4ForCausalLMV)
preprocessor_config.json ADDED
@@ -0,0 +1,18 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_phi4_visionr.Phi4VisionRProcessor"
4
+ },
5
+ "do_convert_rgb": true,
6
+ "do_normalize": true,
7
+ "do_rescale": true,
8
+ "do_resize": true,
9
+ "image_mean": [0.5, 0.5, 0.5],
10
+ "image_processor_type": "Siglip2ImageProcessorNoUpscale",
11
+ "image_std": [0.5, 0.5, 0.5],
12
+ "max_num_patches": 3600,
13
+ "min_num_patches": 256,
14
+ "patch_size": 16,
15
+ "processor_class": "Phi4VisionRProcessor",
16
+ "rescale_factor": 0.00392156862745098,
17
+ "resample": 2
18
+ }
processing_phi4_visionr.py ADDED
@@ -0,0 +1,342 @@
1
+ """
2
+ Processor class for Phi4-Siglip.
3
+
4
+ This module provides:
5
+ - Phi4VisionRProcessor: Combined tokenizer and image processor
6
+ - Utility functions for image and text processing
7
+ """
8
+
9
+ from typing import List, Optional, Union
10
+
11
+ import torch
12
+ from PIL import Image
13
+ from transformers import BatchFeature
14
+ from transformers.image_utils import ImageInput
15
+ from transformers.processing_utils import ProcessorMixin
16
+ from transformers.tokenization_utils_base import PaddingStrategy, TextInput, TruncationStrategy
17
+ from transformers.utils import TensorType
18
+
19
+ # Constants (duplicated here to avoid circular imports when running scripts directly)
20
+ IMAGE_TOKEN_INDEX = -200
21
+ DEFAULT_IMAGE_TOKEN = "<image>"
22
+
23
+
24
+ # =============================================================================
25
+ # Image Utilities
26
+ # =============================================================================
27
+
28
+ def process_images(images: List[Image.Image], image_processor, model_cfg=None):
29
+ """
30
+ Process images for the model.
31
+
32
+ Args:
33
+ images: List of PIL images
34
+ image_processor: The image processor (Siglip2ImageProcessorNoUpscale for NaFlex)
35
+ model_cfg: Optional model config (unused, kept for API compatibility)
36
+
37
+ Returns:
38
+ Processed images as a BatchFeature (NaFlex) or a pixel_values tensor (standard SigLIP)
39
+ """
40
+ # Check if NaFlex (has max_num_patches attribute)
41
+ is_naflex = hasattr(image_processor, "max_num_patches")
42
+
43
+ # Process with image processor
44
+ if is_naflex:
45
+ return image_processor(images, return_tensors='pt')
46
+ else:
47
+ return image_processor(images, return_tensors='pt')['pixel_values']
48
+
49
+
50
+ # =============================================================================
51
+ # Tokenizer Utilities
52
+ # =============================================================================
53
+
54
+ def tokenizer_image_token(
55
+ prompt: str,
56
+ tokenizer,
57
+ image_token_index: int = IMAGE_TOKEN_INDEX,
58
+ return_tensors: Optional[str] = None
59
+ ):
60
+ """
61
+ Tokenize a prompt containing <image> tokens.
62
+
63
+ Replaces <image> with IMAGE_TOKEN_INDEX in the token sequence.
64
+
65
+ Args:
66
+ prompt: The text prompt with <image> placeholders
67
+ tokenizer: The tokenizer to use
68
+ image_token_index: The index to use for image tokens
69
+ return_tensors: If 'pt', return as PyTorch tensor
70
+
71
+ Returns:
72
+ List of token ids or tensor
73
+ """
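+ # Example (actual ids depend on the tokenizer, so these are illustrative):
+ #   "hi <image> there" -> ids("hi ") + [-200] + ids(" there"),
+ # with a single BOS kept at the front if the tokenizer prepends one.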
74
+ prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(DEFAULT_IMAGE_TOKEN)]
75
+
76
+ def insert_separator(X, sep):
77
+ return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]
78
+
79
+ input_ids = []
80
+ offset = 0
81
+ if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
82
+ offset = 1
83
+ input_ids.append(prompt_chunks[0][0])
84
+
85
+ for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
86
+ input_ids.extend(x[offset:])
87
+
88
+ if return_tensors is not None:
89
+ if return_tensors == 'pt':
90
+ return torch.tensor(input_ids, dtype=torch.long)
91
+ raise ValueError(f'Unsupported tensor type: {return_tensors}')
92
+ return input_ids
93
+
94
+
95
+ # =============================================================================
96
+ # Main Processor Class
97
+ # =============================================================================
98
+
99
+ class Phi4VisionRProcessor(ProcessorMixin):
100
+ """
101
+ Processor for Phi4-Siglip that wraps an image processor and tokenizer.
102
+
103
+ This processor handles:
104
+ - Image preprocessing (via SigLIP or SigLIP2/NaFlex)
105
+ - Text tokenization with image token insertion
106
+ - Conversation formatting
107
+
108
+ Args:
109
+ image_processor: The image processor (from vision tower)
110
+ tokenizer: The text tokenizer
111
+ """
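+
+ # Typical use (see sample_inference.py in this repository for a runnable example;
+ # prompt_with_image_token and pil_image are placeholder names):
+ #   processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+ #   inputs = processor(text=prompt_with_image_token, images=[pil_image], return_tensors="pt")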
112
+
113
+ attributes = ["image_processor", "tokenizer"]
114
+ image_processor_class = "AutoImageProcessor"
115
+ tokenizer_class = "AutoTokenizer"
116
+
117
+ def __init__(self, image_processor, tokenizer):
118
+ self.image_processor = image_processor
119
+ self.tokenizer = tokenizer
120
+
121
+ def __call__(
122
+ self,
123
+ text: Union[TextInput, List[TextInput]] = None,
124
+ images: ImageInput = None,
125
+ padding: Union[bool, str, PaddingStrategy] = False,
126
+ truncation: Union[bool, str, TruncationStrategy] = None,
127
+ max_length: Optional[int] = None,
128
+ return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
129
+ **kwargs,
130
+ ) -> BatchFeature:
131
+ """
132
+ Process text and images for the model.
133
+
134
+ Args:
135
+ text: The text input(s). Can contain <image> tokens.
136
+ images: The image input(s).
137
+ padding: Padding strategy.
138
+ truncation: Whether to truncate.
139
+ max_length: Maximum sequence length.
140
+ return_tensors: Return type for tensors.
141
+
142
+ Returns:
143
+ BatchFeature with input_ids, attention_mask, and optionally pixel_values.
144
+ """
145
+ # Process images
146
+ if images is not None:
147
+ if not isinstance(images, list):
148
+ images = [images]
149
+ image_inputs = process_images(images, self.image_processor)
150
+ else:
151
+ image_inputs = None
152
+
153
+ # Process text
154
+ if text is not None:
155
+ if isinstance(text, str):
156
+ text = [text]
157
+
158
+ # Check if text contains image tokens
159
+ has_images = any(DEFAULT_IMAGE_TOKEN in t for t in text)
160
+
161
+ if has_images and images is not None:
162
+ # Tokenize with image token handling
163
+ input_ids_list = []
164
+ for t in text:
165
+ ids = tokenizer_image_token(t, self.tokenizer, return_tensors='pt')
166
+ input_ids_list.append(ids)
167
+
168
+ # Pad sequences
169
+ if len(input_ids_list) > 1:
170
+ max_len = max(len(ids) for ids in input_ids_list)
171
+ padded_ids = []
172
+ attention_masks = []
173
+ pad_token_id = self.tokenizer.pad_token_id or 0
174
+
175
+ for ids in input_ids_list:
176
+ pad_len = max_len - len(ids)
177
+ if padding and pad_len > 0:
178
+ padded_ids.append(torch.cat([ids, torch.full((pad_len,), pad_token_id, dtype=torch.long)]))
179
+ attention_masks.append(torch.cat([torch.ones(len(ids)), torch.zeros(pad_len)]))
180
+ else:
181
+ padded_ids.append(ids)
182
+ attention_masks.append(torch.ones(len(ids)))
183
+
184
+ input_ids = torch.stack(padded_ids)
185
+ attention_mask = torch.stack(attention_masks).long()
186
+ else:
187
+ input_ids = input_ids_list[0].unsqueeze(0)
188
+ attention_mask = torch.ones_like(input_ids)
189
+ else:
190
+ # Standard tokenization
191
+ text_inputs = self.tokenizer(
192
+ text,
193
+ padding=padding,
194
+ truncation=truncation,
195
+ max_length=max_length,
196
+ return_tensors=return_tensors,
197
+ )
198
+ input_ids = text_inputs["input_ids"]
199
+ attention_mask = text_inputs["attention_mask"]
200
+ else:
201
+ input_ids = None
202
+ attention_mask = None
203
+
204
+ # Build output
205
+ data = {}
206
+ if input_ids is not None:
207
+ data["input_ids"] = input_ids
208
+ data["attention_mask"] = attention_mask
209
+
210
+ if image_inputs is not None:
211
+ if isinstance(image_inputs, BatchFeature):
212
+ # NaFlex case - merge all fields
213
+ data.update(image_inputs)
214
+ else:
215
+ data["pixel_values"] = image_inputs
216
+
217
+ return BatchFeature(data=data, tensor_type=return_tensors)
218
+
219
+ def batch_decode(self, *args, **kwargs):
220
+ """Decode token ids to text. Forwards to tokenizer."""
221
+ return self.tokenizer.batch_decode(*args, **kwargs)
222
+
223
+ def decode(self, *args, **kwargs):
224
+ """Decode token ids to text. Forwards to tokenizer."""
225
+ return self.tokenizer.decode(*args, **kwargs)
226
+
227
+ @property
228
+ def model_input_names(self):
229
+ """Get model input names from tokenizer and image processor."""
230
+ tokenizer_input_names = self.tokenizer.model_input_names
231
+ image_processor_input_names = getattr(
232
+ self.image_processor,
233
+ 'model_input_names',
234
+ ["pixel_values"]
235
+ )
236
+ return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
237
+
238
+ @classmethod
239
+ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
240
+ """
241
+ Load processor from a pretrained model path.
242
+
243
+ This will load the tokenizer and create the appropriate image processor
244
+ based on the model config.
245
+ """
246
+ from transformers import AutoTokenizer, AutoConfig
247
+
248
+ tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)
249
+
250
+ # Try to load config to determine vision tower type
251
+ try:
252
+ config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
253
+ vision_tower_name = getattr(config, 'mm_vision_tower', None)
254
+ vision_config = getattr(config, 'vision_config', None)
255
+
256
+ if vision_tower_name and 'naflex' in vision_tower_name.lower():
257
+ from .modeling_phi4_visionr import Siglip2ImageProcessorNoUpscale
258
+ # Use embedded vision_config to avoid network calls
259
+ # Infer patch_size from model name if not in config (patch14 vs patch16)
260
+ if vision_config is not None:
261
+ if 'patch_size' in vision_config:
262
+ patch_size = vision_config['patch_size']
263
+ elif 'patch14' in vision_tower_name.lower():
264
+ patch_size = 14
265
+ else:
266
+ patch_size = 16 # default for patch16-naflex
267
+ image_processor = Siglip2ImageProcessorNoUpscale(
268
+ patch_size=patch_size,
269
+ max_num_patches=getattr(config, 'max_num_patches', 3600),
270
+ min_num_patches=getattr(config, 'min_num_patches', 256),
271
+ )
272
+ else:
273
+ image_processor = Siglip2ImageProcessorNoUpscale.from_pretrained(
274
+ vision_tower_name,
275
+ max_num_patches=getattr(config, 'max_num_patches', 3600),
276
+ min_num_patches=getattr(config, 'min_num_patches', 256),
277
+ )
278
+ elif vision_tower_name:
279
+ from transformers import SiglipImageProcessor
280
+ # Use embedded vision_config to avoid network calls
281
+ if vision_config is not None:
282
+ image_processor = SiglipImageProcessor(
283
+ size={"height": vision_config.get('image_size', 384), "width": vision_config.get('image_size', 384)},
284
+ )
285
+ else:
286
+ image_processor = SiglipImageProcessor.from_pretrained(vision_tower_name)
287
+ else:
288
+ image_processor = None
289
+ except Exception:
290
+ image_processor = None
291
+
292
+ return cls(image_processor=image_processor, tokenizer=tokenizer)
293
+
294
+
295
+ # =============================================================================
296
+ # Convenience Functions
297
+ # =============================================================================
298
+
299
+ def prepare_inputs_for_generation(
300
+ prompt: str,
301
+ images: Optional[List[Image.Image]],
302
+ processor: Phi4VisionRProcessor,
303
+ device: str = "cuda",
304
+ dtype: torch.dtype = torch.bfloat16,
305
+ ) -> dict:
306
+ """
307
+ Prepare inputs for model generation.
308
+
309
+ Args:
310
+ prompt: The user prompt (without conversation formatting)
311
+ images: Optional list of PIL images
312
+ processor: The Phi4VisionRProcessor
313
+ device: Device to place tensors on
314
+ dtype: Data type for tensors
315
+
316
+ Returns:
317
+ Dictionary with model inputs
318
+ """
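+ # Sketch of the intended call pattern (max_new_tokens is an arbitrary example value):
+ #   inputs = prepare_inputs_for_generation("Describe this image.", [img], processor)
+ #   output_ids = model.generate(**inputs, max_new_tokens=512)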
319
+ # Add image token to prompt if images provided
320
+ if images:
321
+ prompt = DEFAULT_IMAGE_TOKEN + "\n" + prompt
322
+
323
+ # Use tokenizer's chat_template
324
+ messages = [{"role": "user", "content": prompt}]
325
+ full_prompt = processor.tokenizer.apply_chat_template(
326
+ messages,
327
+ tokenize=False,
328
+ add_generation_prompt=True
329
+ )
330
+
331
+ inputs = processor(
332
+ text=full_prompt,
333
+ images=images,
334
+ return_tensors="pt",
335
+ )
336
+
337
+ # Move to device
338
+ for key in inputs:
339
+ if isinstance(inputs[key], torch.Tensor):
340
+ inputs[key] = inputs[key].to(device=device, dtype=dtype if inputs[key].is_floating_point() else inputs[key].dtype)
341
+
342
+ return inputs
sample_inference.py ADDED
@@ -0,0 +1,99 @@
1
+ """
2
+ Sample inference script for Phi4-Siglip.
3
+
4
+ Usage:
5
+ cd phi4mm
6
+ python sample_inference.py
7
+ """
8
+ from PIL import Image
9
+ import torch
10
+ from transformers import AutoModelForCausalLM, AutoProcessor
11
+
12
+ model_path = "." # change to your model path if not running in the same directory as the model
13
+
14
+ # Take the first command-line argument as an image path; if none is given, fall back to text-only mode.
15
+ import sys
16
+ with_image_mode = False
17
+ if len(sys.argv) > 1:
18
+ with_image_mode = True
19
+ image_path = sys.argv[1]
20
+ print(f"Image path provided: {image_path}")
21
+ else:
22
+ print("No image path provided. Running in text-only mode. To run with an image, provide the image path as an argument:\npython sample_inference.py /path/to/image.jpg")
23
+
24
+ # Load model and processor
25
+ print("Loading model...")
26
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
27
+ model = AutoModelForCausalLM.from_pretrained(
28
+ model_path,
29
+ trust_remote_code=True,
30
+ dtype=torch.bfloat16,
31
+ device_map="cuda",
32
+ ).eval()
33
+
34
+ # Import the <image> placeholder token used when building multimodal prompts
35
+ from processing_phi4_visionr import DEFAULT_IMAGE_TOKEN
36
+
37
+ print(f"Model loaded on {model.device}")
38
+
39
+ #################################################### text-only ####################################################
40
+ print("\n" + "="*60)
41
+ print("TEST: Text-only generation")
42
+ print("="*60)
43
+
44
+ messages = [{"role": "user", "content": "What is the answer to 1+1? Explain it."}]
45
+ prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
46
+
47
+ print(f">>> Prompt\n{prompt}")
48
+ inputs = processor(prompt, images=None, return_tensors="pt").to("cuda:0")
49
+ generate_ids = model.generate(
50
+ **inputs,
51
+ max_new_tokens=4096,
52
+ eos_token_id=processor.tokenizer.eos_token_id,
53
+ do_sample=False,
54
+ )
55
+ generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
56
+ response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
57
+ print(f'>>> Response\n{response}')
58
+
59
+ #################################################### single image ####################################################
60
+ if not with_image_mode:
61
+ print("\n" + "="*60)
62
+ print("No image provided, skipping multimodal test.")
63
+ print("="*60)
64
+ exit(0)
65
+
66
+ print("\n" + "="*60)
67
+ print("TEST: Single image understanding")
68
+ print("="*60)
69
+
70
+ messages = [{"role": "user", "content": DEFAULT_IMAGE_TOKEN + "\nDescribe this image in detail."}]
71
+ prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
72
+
73
+ if with_image_mode:
74
+ print(f">>> Loading image from {image_path}")
75
+ image = Image.open(image_path).convert("RGB")
76
+ print(f"Image size: {image.size}")
77
+ else:
78
+ image = None
79
+
80
+ print(f">>> Prompt\n{prompt}")
81
+
82
+ # Process text and image together using the processor
83
+ inputs = processor(text=prompt, images=[image] if image is not None else None, return_tensors="pt").to("cuda:0")
84
+
85
+ with torch.inference_mode():
86
+ generate_ids = model.generate(
87
+ **inputs,
88
+ max_new_tokens=4096,
89
+ eos_token_id=processor.tokenizer.eos_token_id,
90
+ do_sample=False,
91
+ )
92
+
93
+ generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
94
+ response = processor.tokenizer.decode(generate_ids[0], skip_special_tokens=True)
95
+ print(f'>>> Response\n{response}')
96
+
97
+ print("\n" + "="*60)
98
+ print("All tests completed!")
99
+ print("="*60)
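For long reasoning completions it can help to stream tokens as they are produced instead of waiting for the full 4096-token generation to finish. A small variation on the generation call above using Hugging Face's `TextStreamer` (a sketch; the model, processor, and `inputs` are assumed to be set up exactly as in `sample_inference.py`):

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated; skip echoing the prompt.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)

generate_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    eos_token_id=processor.tokenizer.eos_token_id,
    do_sample=False,
    streamer=streamer,
)
```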
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "bos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": true,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": true,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|dummy_85|>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": true,
+     "single_word": false
+   },
+   "unk_token": "<|endoftext|>"
+ }
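A quick way to confirm these mappings load as intended is to inspect the tokenizer after loading it from the repository (a sketch, assuming the files above sit in the current directory):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
print(tokenizer.bos_token)  # <|endoftext|>
print(tokenizer.eos_token)  # <|im_end|>
print(tokenizer.pad_token)  # <|dummy_85|>
print(tokenizer.unk_token)  # <|endoftext|>
```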
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,782 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "100256": {
5
+ "content": "<|dummy_0|>",
6
+ "lstrip": true,
7
+ "normalized": false,
8
+ "rstrip": true,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "100257": {
13
+ "content": "<|endoftext|>",
14
+ "lstrip": true,
15
+ "normalized": false,
16
+ "rstrip": true,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "100258": {
21
+ "content": "<|fim_prefix|>",
22
+ "lstrip": true,
23
+ "normalized": false,
24
+ "rstrip": true,
25
+ "single_word": false,
26
+ "special": false
27
+ },
28
+ "100259": {
29
+ "content": "<|fim_middle|>",
30
+ "lstrip": true,
31
+ "normalized": false,
32
+ "rstrip": true,
33
+ "single_word": false,
34
+ "special": false
35
+ },
36
+ "100260": {
37
+ "content": "<|fim_suffix|>",
38
+ "lstrip": true,
39
+ "normalized": false,
40
+ "rstrip": true,
41
+ "single_word": false,
42
+ "special": false
43
+ },
44
+ "100261": {
45
+ "content": "<|dummy_1|>",
46
+ "lstrip": true,
47
+ "normalized": false,
48
+ "rstrip": true,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "100262": {
53
+ "content": "<|dummy_2|>",
54
+ "lstrip": true,
55
+ "normalized": false,
56
+ "rstrip": true,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "100263": {
61
+ "content": "<|dummy_3|>",
62
+ "lstrip": true,
63
+ "normalized": false,
64
+ "rstrip": true,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "100264": {
69
+ "content": "<|im_start|>",
70
+ "lstrip": true,
71
+ "normalized": false,
72
+ "rstrip": true,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "100265": {
77
+ "content": "<|im_end|>",
78
+ "lstrip": true,
79
+ "normalized": false,
80
+ "rstrip": true,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "100266": {
85
+ "content": "<|im_sep|>",
86
+ "lstrip": true,
87
+ "normalized": false,
88
+ "rstrip": true,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "100267": {
93
+ "content": "<|dummy_4|>",
94
+ "lstrip": true,
95
+ "normalized": false,
96
+ "rstrip": true,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "100268": {
101
+ "content": "<|dummy_5|>",
102
+ "lstrip": true,
103
+ "normalized": false,
104
+ "rstrip": true,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "100269": {
109
+ "content": "<|dummy_6|>",
110
+ "lstrip": true,
111
+ "normalized": false,
112
+ "rstrip": true,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "100270": {
117
+ "content": "<|dummy_7|>",
118
+ "lstrip": true,
119
+ "normalized": false,
120
+ "rstrip": true,
121
+ "single_word": false,
122
+ "special": true
123
+ },
124
+ "100271": {
125
+ "content": "<|dummy_8|>",
126
+ "lstrip": true,
127
+ "normalized": false,
128
+ "rstrip": true,
129
+ "single_word": false,
130
+ "special": true
131
+ },
132
+ "100272": {
133
+ "content": "<|dummy_9|>",
134
+ "lstrip": true,
135
+ "normalized": false,
136
+ "rstrip": true,
137
+ "single_word": false,
138
+ "special": true
139
+ },
140
+ "100273": {
141
+ "content": "<|dummy_10|>",
142
+ "lstrip": true,
143
+ "normalized": false,
144
+ "rstrip": true,
145
+ "single_word": false,
146
+ "special": true
147
+ },
148
+ "100274": {
149
+ "content": "<|dummy_11|>",
150
+ "lstrip": true,
151
+ "normalized": false,
152
+ "rstrip": true,
153
+ "single_word": false,
154
+ "special": true
155
+ },
156
+ "100275": {
157
+ "content": "<|dummy_12|>",
158
+ "lstrip": true,
159
+ "normalized": false,
160
+ "rstrip": true,
161
+ "single_word": false,
162
+ "special": true
163
+ },
164
+ "100276": {
165
+ "content": "<|endofprompt|>",
166
+ "lstrip": true,
167
+ "normalized": false,
168
+ "rstrip": true,
169
+ "single_word": false,
170
+ "special": true
171
+ },
172
+ "100277": {
173
+ "content": "<|dummy_13|>",
174
+ "lstrip": true,
175
+ "normalized": false,
176
+ "rstrip": true,
177
+ "single_word": false,
178
+ "special": true
179
+ },
180
+ "100278": {
181
+ "content": "<|dummy_14|>",
182
+ "lstrip": true,
183
+ "normalized": false,
184
+ "rstrip": true,
185
+ "single_word": false,
186
+ "special": true
187
+ },
188
+ "100279": {
189
+ "content": "<|dummy_15|>",
190
+ "lstrip": true,
191
+ "normalized": false,
192
+ "rstrip": true,
193
+ "single_word": false,
194
+ "special": true
195
+ },
196
+ "100280": {
197
+ "content": "<|dummy_16|>",
198
+ "lstrip": true,
199
+ "normalized": false,
200
+ "rstrip": true,
201
+ "single_word": false,
202
+ "special": true
203
+ },
204
+ "100281": {
205
+ "content": "<|dummy_17|>",
206
+ "lstrip": true,
207
+ "normalized": false,
208
+ "rstrip": true,
209
+ "single_word": false,
210
+ "special": true
211
+ },
212
+ "100282": {
213
+ "content": "<|dummy_18|>",
214
+ "lstrip": true,
215
+ "normalized": false,
216
+ "rstrip": true,
217
+ "single_word": false,
218
+ "special": true
219
+ },
220
+ "100283": {
221
+ "content": "<|dummy_19|>",
222
+ "lstrip": true,
223
+ "normalized": false,
224
+ "rstrip": true,
225
+ "single_word": false,
226
+ "special": true
227
+ },
228
+ "100284": {
229
+ "content": "<|dummy_20|>",
230
+ "lstrip": true,
231
+ "normalized": false,
232
+ "rstrip": true,
233
+ "single_word": false,
234
+ "special": true
235
+ },
236
+ "100285": {
237
+ "content": "<|dummy_21|>",
238
+ "lstrip": true,
239
+ "normalized": false,
240
+ "rstrip": true,
241
+ "single_word": false,
242
+ "special": true
243
+ },
244
+ "100286": {
245
+ "content": "<|dummy_22|>",
246
+ "lstrip": true,
247
+ "normalized": false,
248
+ "rstrip": true,
249
+ "single_word": false,
250
+ "special": true
251
+ },
252
+ "100287": {
253
+ "content": "<|dummy_23|>",
254
+ "lstrip": true,
255
+ "normalized": false,
256
+ "rstrip": true,
257
+ "single_word": false,
258
+ "special": true
259
+ },
260
+ "100288": {
261
+ "content": "<|dummy_24|>",
262
+ "lstrip": true,
263
+ "normalized": false,
264
+ "rstrip": true,
265
+ "single_word": false,
266
+ "special": true
267
+ },
268
+ "100289": {
269
+ "content": "<|dummy_25|>",
270
+ "lstrip": true,
271
+ "normalized": false,
272
+ "rstrip": true,
273
+ "single_word": false,
274
+ "special": true
275
+ },
276
+ "100290": {
277
+ "content": "<|dummy_26|>",
278
+ "lstrip": true,
279
+ "normalized": false,
280
+ "rstrip": true,
281
+ "single_word": false,
282
+ "special": true
283
+ },
284
+ "100291": {
285
+ "content": "<|dummy_27|>",
286
+ "lstrip": true,
287
+ "normalized": false,
288
+ "rstrip": true,
289
+ "single_word": false,
290
+ "special": true
291
+ },
292
+ "100292": {
293
+ "content": "<|dummy_28|>",
294
+ "lstrip": true,
295
+ "normalized": false,
296
+ "rstrip": true,
297
+ "single_word": false,
298
+ "special": true
299
+ },
300
+ "100293": {
301
+ "content": "<|dummy_29|>",
302
+ "lstrip": true,
303
+ "normalized": false,
304
+ "rstrip": true,
305
+ "single_word": false,
306
+ "special": true
307
+ },
308
+ "100294": {
309
+ "content": "<|dummy_30|>",
310
+ "lstrip": true,
311
+ "normalized": false,
312
+ "rstrip": true,
313
+ "single_word": false,
314
+ "special": true
315
+ },
316
+ "100295": {
317
+ "content": "<|dummy_31|>",
318
+ "lstrip": true,
319
+ "normalized": false,
320
+ "rstrip": true,
321
+ "single_word": false,
322
+ "special": true
323
+ },
324
+ "100296": {
325
+ "content": "<|dummy_32|>",
326
+ "lstrip": true,
327
+ "normalized": false,
328
+ "rstrip": true,
329
+ "single_word": false,
330
+ "special": true
331
+ },
332
+ "100297": {
333
+ "content": "<|dummy_33|>",
334
+ "lstrip": true,
335
+ "normalized": false,
336
+ "rstrip": true,
337
+ "single_word": false,
338
+ "special": true
339
+ },
340
+ "100298": {
341
+ "content": "<|dummy_34|>",
342
+ "lstrip": true,
343
+ "normalized": false,
344
+ "rstrip": true,
345
+ "single_word": false,
346
+ "special": true
347
+ },
348
+ "100299": {
349
+ "content": "<|dummy_35|>",
350
+ "lstrip": true,
351
+ "normalized": false,
352
+ "rstrip": true,
353
+ "single_word": false,
354
+ "special": true
355
+ },
356
+ "100300": {
357
+ "content": "<|dummy_36|>",
358
+ "lstrip": true,
359
+ "normalized": false,
360
+ "rstrip": true,
361
+ "single_word": false,
362
+ "special": true
363
+ },
364
+ "100301": {
365
+ "content": "<|dummy_37|>",
366
+ "lstrip": true,
367
+ "normalized": false,
368
+ "rstrip": true,
369
+ "single_word": false,
370
+ "special": true
371
+ },
372
+ "100302": {
373
+ "content": "<|dummy_38|>",
374
+ "lstrip": true,
375
+ "normalized": false,
376
+ "rstrip": true,
377
+ "single_word": false,
378
+ "special": true
379
+ },
380
+ "100303": {
381
+ "content": "<|dummy_39|>",
382
+ "lstrip": true,
383
+ "normalized": false,
384
+ "rstrip": true,
385
+ "single_word": false,
386
+ "special": true
387
+ },
388
+ "100304": {
389
+ "content": "<|dummy_40|>",
390
+ "lstrip": true,
391
+ "normalized": false,
392
+ "rstrip": true,
393
+ "single_word": false,
394
+ "special": true
395
+ },
396
+ "100305": {
397
+ "content": "<|dummy_41|>",
398
+ "lstrip": true,
399
+ "normalized": false,
400
+ "rstrip": true,
401
+ "single_word": false,
402
+ "special": true
403
+ },
404
+ "100306": {
405
+ "content": "<|dummy_42|>",
406
+ "lstrip": true,
407
+ "normalized": false,
408
+ "rstrip": true,
409
+ "single_word": false,
410
+ "special": true
411
+ },
412
+ "100307": {
413
+ "content": "<|dummy_43|>",
414
+ "lstrip": true,
415
+ "normalized": false,
416
+ "rstrip": true,
417
+ "single_word": false,
418
+ "special": true
419
+ },
420
+ "100308": {
421
+ "content": "<|dummy_44|>",
422
+ "lstrip": true,
423
+ "normalized": false,
424
+ "rstrip": true,
425
+ "single_word": false,
426
+ "special": true
427
+ },
428
+ "100309": {
429
+ "content": "<|dummy_45|>",
430
+ "lstrip": true,
431
+ "normalized": false,
432
+ "rstrip": true,
433
+ "single_word": false,
434
+ "special": true
435
+ },
436
+ "100310": {
437
+ "content": "<|dummy_46|>",
438
+ "lstrip": true,
439
+ "normalized": false,
440
+ "rstrip": true,
441
+ "single_word": false,
442
+ "special": true
443
+ },
444
+ "100311": {
445
+ "content": "<|dummy_47|>",
446
+ "lstrip": true,
447
+ "normalized": false,
448
+ "rstrip": true,
449
+ "single_word": false,
450
+ "special": true
451
+ },
452
+ "100312": {
453
+ "content": "<|dummy_48|>",
454
+ "lstrip": true,
455
+ "normalized": false,
456
+ "rstrip": true,
457
+ "single_word": false,
458
+ "special": true
459
+ },
460
+ "100313": {
461
+ "content": "<|dummy_49|>",
462
+ "lstrip": true,
463
+ "normalized": false,
464
+ "rstrip": true,
465
+ "single_word": false,
466
+ "special": true
467
+ },
468
+ "100314": {
469
+ "content": "<|dummy_50|>",
470
+ "lstrip": true,
471
+ "normalized": false,
472
+ "rstrip": true,
473
+ "single_word": false,
474
+ "special": true
475
+ },
476
+ "100315": {
477
+ "content": "<|dummy_51|>",
478
+ "lstrip": true,
479
+ "normalized": false,
480
+ "rstrip": true,
481
+ "single_word": false,
482
+ "special": true
483
+ },
484
+ "100316": {
485
+ "content": "<|dummy_52|>",
486
+ "lstrip": true,
487
+ "normalized": false,
488
+ "rstrip": true,
489
+ "single_word": false,
490
+ "special": true
491
+ },
492
+ "100317": {
493
+ "content": "<|dummy_53|>",
494
+ "lstrip": true,
495
+ "normalized": false,
496
+ "rstrip": true,
497
+ "single_word": false,
498
+ "special": true
499
+ },
500
+ "100318": {
501
+ "content": "<|dummy_54|>",
502
+ "lstrip": true,
503
+ "normalized": false,
504
+ "rstrip": true,
505
+ "single_word": false,
506
+ "special": true
507
+ },
508
+ "100319": {
509
+ "content": "<|dummy_55|>",
510
+ "lstrip": true,
511
+ "normalized": false,
512
+ "rstrip": true,
513
+ "single_word": false,
514
+ "special": true
515
+ },
516
+ "100320": {
517
+ "content": "<|dummy_56|>",
518
+ "lstrip": true,
519
+ "normalized": false,
520
+ "rstrip": true,
521
+ "single_word": false,
522
+ "special": true
523
+ },
524
+ "100321": {
525
+ "content": "<|dummy_57|>",
526
+ "lstrip": true,
527
+ "normalized": false,
528
+ "rstrip": true,
529
+ "single_word": false,
530
+ "special": true
531
+ },
532
+ "100322": {
533
+ "content": "<|dummy_58|>",
534
+ "lstrip": true,
535
+ "normalized": false,
536
+ "rstrip": true,
537
+ "single_word": false,
538
+ "special": true
539
+ },
540
+ "100323": {
541
+ "content": "<|dummy_59|>",
542
+ "lstrip": true,
543
+ "normalized": false,
544
+ "rstrip": true,
545
+ "single_word": false,
546
+ "special": true
547
+ },
548
+ "100324": {
549
+ "content": "<|dummy_60|>",
550
+ "lstrip": true,
551
+ "normalized": false,
552
+ "rstrip": true,
553
+ "single_word": false,
554
+ "special": true
555
+ },
556
+ "100325": {
557
+ "content": "<|dummy_61|>",
558
+ "lstrip": true,
559
+ "normalized": false,
560
+ "rstrip": true,
561
+ "single_word": false,
562
+ "special": true
563
+ },
564
+ "100326": {
565
+ "content": "<|dummy_62|>",
566
+ "lstrip": true,
567
+ "normalized": false,
568
+ "rstrip": true,
569
+ "single_word": false,
570
+ "special": true
571
+ },
572
+ "100327": {
573
+ "content": "<|dummy_63|>",
574
+ "lstrip": true,
575
+ "normalized": false,
576
+ "rstrip": true,
577
+ "single_word": false,
578
+ "special": true
579
+ },
580
+ "100328": {
581
+ "content": "<|dummy_64|>",
582
+ "lstrip": true,
583
+ "normalized": false,
584
+ "rstrip": true,
585
+ "single_word": false,
586
+ "special": true
587
+ },
588
+ "100329": {
589
+ "content": "<|dummy_65|>",
590
+ "lstrip": true,
591
+ "normalized": false,
592
+ "rstrip": true,
593
+ "single_word": false,
594
+ "special": true
595
+ },
596
+ "100330": {
597
+ "content": "<|dummy_66|>",
598
+ "lstrip": true,
599
+ "normalized": false,
600
+ "rstrip": true,
601
+ "single_word": false,
602
+ "special": true
603
+ },
604
+ "100331": {
605
+ "content": "<|dummy_67|>",
606
+ "lstrip": true,
607
+ "normalized": false,
608
+ "rstrip": true,
609
+ "single_word": false,
610
+ "special": true
611
+ },
612
+ "100332": {
613
+ "content": "<|dummy_68|>",
614
+ "lstrip": true,
615
+ "normalized": false,
616
+ "rstrip": true,
617
+ "single_word": false,
618
+ "special": true
619
+ },
620
+ "100333": {
621
+ "content": "<|dummy_69|>",
622
+ "lstrip": true,
623
+ "normalized": false,
624
+ "rstrip": true,
625
+ "single_word": false,
626
+ "special": true
627
+ },
628
+ "100334": {
629
+ "content": "<|dummy_70|>",
630
+ "lstrip": true,
631
+ "normalized": false,
632
+ "rstrip": true,
633
+ "single_word": false,
634
+ "special": true
635
+ },
636
+ "100335": {
637
+ "content": "<|dummy_71|>",
638
+ "lstrip": true,
639
+ "normalized": false,
640
+ "rstrip": true,
641
+ "single_word": false,
642
+ "special": true
643
+ },
644
+ "100336": {
645
+ "content": "<|dummy_72|>",
646
+ "lstrip": true,
647
+ "normalized": false,
648
+ "rstrip": true,
649
+ "single_word": false,
650
+ "special": true
651
+ },
652
+ "100337": {
653
+ "content": "<|dummy_73|>",
654
+ "lstrip": true,
655
+ "normalized": false,
656
+ "rstrip": true,
657
+ "single_word": false,
658
+ "special": true
659
+ },
660
+ "100338": {
661
+ "content": "<|dummy_74|>",
662
+ "lstrip": true,
663
+ "normalized": false,
664
+ "rstrip": true,
665
+ "single_word": false,
666
+ "special": true
667
+ },
668
+ "100339": {
669
+ "content": "<|dummy_75|>",
670
+ "lstrip": true,
671
+ "normalized": false,
672
+ "rstrip": true,
673
+ "single_word": false,
674
+ "special": true
675
+ },
676
+ "100340": {
677
+ "content": "<|dummy_76|>",
678
+ "lstrip": true,
679
+ "normalized": false,
680
+ "rstrip": true,
681
+ "single_word": false,
682
+ "special": true
683
+ },
684
+ "100341": {
685
+ "content": "<|dummy_77|>",
686
+ "lstrip": true,
687
+ "normalized": false,
688
+ "rstrip": true,
689
+ "single_word": false,
690
+ "special": true
691
+ },
692
+ "100342": {
693
+ "content": "<|dummy_78|>",
694
+ "lstrip": true,
695
+ "normalized": false,
696
+ "rstrip": true,
697
+ "single_word": false,
698
+ "special": true
699
+ },
700
+ "100343": {
701
+ "content": "<|dummy_79|>",
702
+ "lstrip": true,
703
+ "normalized": false,
704
+ "rstrip": true,
705
+ "single_word": false,
706
+ "special": true
707
+ },
708
+ "100344": {
709
+ "content": "<|dummy_80|>",
710
+ "lstrip": true,
711
+ "normalized": false,
712
+ "rstrip": true,
713
+ "single_word": false,
714
+ "special": true
715
+ },
716
+ "100345": {
717
+ "content": "<|dummy_81|>",
718
+ "lstrip": true,
719
+ "normalized": false,
720
+ "rstrip": true,
721
+ "single_word": false,
722
+ "special": true
723
+ },
724
+ "100346": {
725
+ "content": "<|dummy_82|>",
726
+ "lstrip": true,
727
+ "normalized": false,
728
+ "rstrip": true,
729
+ "single_word": false,
730
+ "special": true
731
+ },
732
+ "100347": {
733
+ "content": "<|dummy_83|>",
734
+ "lstrip": true,
735
+ "normalized": false,
736
+ "rstrip": true,
737
+ "single_word": false,
738
+ "special": true
739
+ },
740
+ "100348": {
741
+ "content": "<nothink>",
742
+ "lstrip": true,
743
+ "normalized": false,
744
+ "rstrip": true,
745
+ "single_word": false,
746
+ "special": true
747
+ },
748
+ "100349": {
749
+ "content": "<|dummy_85|>",
750
+ "lstrip": true,
751
+ "normalized": false,
752
+ "rstrip": true,
753
+ "single_word": false,
754
+ "special": true
755
+ },
756
+ "100350": {
757
+ "content": "<think>",
758
+ "lstrip": true,
759
+ "normalized": false,
760
+ "rstrip": true,
761
+ "single_word": false,
762
+ "special": false
763
+ },
764
+ "100351": {
765
+ "content": "</think>",
766
+ "lstrip": true,
767
+ "normalized": false,
768
+ "rstrip": true,
769
+ "single_word": false,
770
+ "special": false
771
+ }
772
+ },
773
+ "bos_token": "<|endoftext|>",
774
+ "clean_up_tokenization_spaces": false,
775
+ "eos_token": "<|im_end|>",
776
+ "extra_special_tokens": {},
777
+ "model_max_length": 16384,
778
+ "pad_token": "<|dummy_85|>",
779
+ "padding_side": "right",
780
+ "tokenizer_class": "GPT2Tokenizer",
781
+ "unk_token": "<|endoftext|>"
782
+ }
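Note that `<think>` and `</think>` are registered with `"special": false`, so `skip_special_tokens=True` does not strip them from decoded output. If the model wraps its reasoning trace in these tokens, the trace can be separated from the final answer after decoding (a sketch; `response` is a decoded string as produced in `sample_inference.py`):

```python
def split_reasoning(response: str):
    # Separate the chain of thought (between <think> and </think>) from the final answer.
    if "</think>" in response:
        trace, answer = response.split("</think>", 1)
        return trace.replace("<think>", "").strip(), answer.strip()
    return "", response.strip()

trace, answer = split_reasoning(response)
print(answer)
```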
vocab.json ADDED
The diff for this file is too large to render. See raw diff