akylbekmaxutov committed on
Commit 93eee3d · verified · 1 Parent(s): f1e4dd0

Initial model upload
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/eval-results-text.png filter=lfs diff=lfs merge=lfs -text
+ assets/eval-results-vision.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,199 @@
---
language:
- kk
- ru
- en
base_model:
- OpenGVLab/InternVL3_5-4B
pipeline_tag: image-text-to-text
---
[Қазақша](#кіріспе)     [English](#introduction)

# Qolda
[![GitHub](https://img.shields.io/badge/GitHub-Qolda--deployment-blue?logo=github)](https://github.com/IS2AI/Qolda-deployment)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)

## Introduction
Built on top of InternVL3.5 and Qwen3, **Qolda** is a small vision-language model designed to operate in Kazakh, Russian, and English. The model has 4.3B parameters and comprises the InternViT-300M vision encoder and MLP projector from [InternVL3.5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B) together with the [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) language model. Training was performed using the [InternVL framework](https://github.com/OpenGVLab/InternVL) 💙

The name "Qolda" reflects both the model's design and its purpose in Kazakh: "in hand" (қолда) for its compact accessibility, and "to support" (қолдау) for its assistive nature.

## Evaluation Results
Evaluation was conducted separately for the text-only and vision-language modalities. Qolda demonstrates significant improvements on Kazakh while maintaining comparable performance on Russian and English.

### Text Benchmarks
![Model performance comparison on language benchmarks](assets/eval-results-text.png)
*Performance comparison on language tasks including MMLU, Winogrande, HellaSwag, ARC, GSM8K, and DROP.*

**Note:** The comparison below presents Qolda's performance against Qwen3-4B on **Kazakh** language benchmarks only. Evaluation results for additional models and for Russian and English will be added later.

| Model | Mode | Avg | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|-------|------|-----|------|------------|-----------|-----|-------|------|
| Qwen3-4B | Direct | 52.00 | 42.43 | 56.88 | 42.04 | 64.77 | 73.62 | 32.27 |
| Qwen3-4B | Think | 57.73 | 52.98 | 51.27 | 41.86 | 79.65 | 64.82 | 55.81 |
| Qolda | Direct | 58.77 | 46.55 | 56.37 | 55.75 | 73.62 | 63.50 | 56.84 |
| Qolda | Think | **71.64** | **64.56** | **70.54** | **57.70** | **89.99** | **79.47** | **67.59** |

### Vision Benchmarks
![Model performance comparison on vision-language benchmarks](assets/eval-results-vision.png)
*Performance comparison on vision-language tasks including AI2D, MMStar, RealWorldQA, and KazakhOCR.*

**Note:** The comparison below presents Qolda's performance against InternVL3.5-4B on **Kazakh** vision-language benchmarks only. Evaluation results for additional models and for Russian and English will be added later.

| Model | Mode | Avg | AI2D | MMStar | RealWorldQA | KazakhOCR |
|-------|------|-----|------|--------|-------------|-----------|
| InternVL3.5-4B | Direct | 42.23 | 52.33 | 47.47 | 38.32 | 30.81 |
| InternVL3.5-4B | Think | 42.58 | 51.42 | 49.33 | 38.74 | 30.81 |
| Qolda | Direct | 59.39 | 66.06 | 55.47 | 54.97 | **61.06** |
| Qolda | Think | **60.44** | **67.62** | **56.53** | **57.07** | 60.54 |

## Model Usage
To run inference with Transformers, please follow the [guidelines](https://huggingface.co/OpenGVLab/InternVL3_5-4B#inference-with-transformers) from InternVL.
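
As a quick orientation, the snippet below is a minimal, untested sketch of loading the checkpoint with Transformers remote code; the repo id `issai/Qolda` is assumed from the lmdeploy command below, and image preprocessing plus the `model.chat` call should follow the InternVL guide linked above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch: load Qolda with the InternVL remote-code classes shipped in this repo.
# The repo id "issai/Qolda" is assumed from the lmdeploy command below.
path = "issai/Qolda"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # uses modeling_internvl_chat.InternVLChatModel from this repo
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# For image inputs, build `pixel_values` with the preprocessing helpers from the
# InternVL guide linked above, then call e.g.:
#   response = model.chat(tokenizer, pixel_values, question, dict(max_new_tokens=1024))
```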

Alternatively, to run the model via an OpenAI-compatible server, you can use lmdeploy:
```bash
pip install "lmdeploy>=0.9.1"

lmdeploy serve api_server issai/Qolda --server-port 23333 --tp 1 --backend pytorch
```

**Note:** Unlike the original InternVL3.5, this model requires the `enable_thinking` parameter to be set explicitly in the `extra_body` of your API calls. Depending on the task's complexity, an empty thinking response may still be generated.

Then, make a standard API call:

```python
import base64
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "./assets/eval-results-text.png"

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{
        'role': 'user',
        'content': [
            {
                'type': 'text',
                'text': 'Берілген диаграмманың сипаттамасын бер.'
            },
            {
                'type': 'image_url',
                'image_url': {
                    'url': f'data:image/png;base64,{encode_image(image_path)}',
                },
            }
        ],
    }],
    max_tokens=8192,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "enable_thinking": True
    },
)

print(response.choices[0].message.content)
```
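
A minimal text-only variation (assuming the same running server and `client` from the example above) that requests a direct answer by setting `enable_thinking` to `False`:

```python
# Illustrative variation of the call above: request a direct (non-thinking) answer.
# Reuses the `client` object created in the previous example.
response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{'role': 'user', 'content': 'Қазақстанның астанасы қай қала?'}],
    max_tokens=1024,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "enable_thinking": False
    },
)
print(response.choices[0].message.content)
```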

## License
This model is licensed under the Apache License 2.0.


## Кіріспе
InternVL3.5 және Qwen3 негізінде жасалған **Qolda** — қазақ, орыс және ағылшын тілдерінде жұмыс істеуге арналған шағын көру-тілдік моделі (vision-language model). Модель 4,3 млрд параметрге ие және [InternVL3.5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B) моделінің InternViT-300M көру энкодері мен MLP проектор компоненттерін, сондай-ақ [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) тілдік моделін қамтиды. Модельді оқыту [InternVL фреймворкі](https://github.com/OpenGVLab/InternVL) көмегімен жүзеге асырылды 💙

"Qolda" атауы модельдің дизайны мен мақсатын қазақ тіліндегі қолда сөзінің қос мағынасы арқылы көрсетеді: біріншісі — шағын әрі қолжетімді болуы үшін "қолда" сөзі арқылы, екіншісі — көмекші табиғаты үшін "қолдау" мағынасы арқылы.

## Бағалау нәтижелері
Мәтіндік және көру-тілдік модальділіктер үшін бағалау бөлек жүргізілді. Qolda орыс және ағылшын тілдеріндегі өзінің бастапқы деңгейін сақтай отырып, қазақ тіліндегі өнімділігін айтарлықтай жақсартты.

### Мәтіндік бенчмарктар
![Тілдік бенчмарктардағы модель өнімділігін салыстыру](assets/eval-results-text.png)
*MMLU, Winogrande, HellaSwag, ARC, GSM8K және DROP сияқты тілдік тапсырмалардағы өнімділікті салыстыру.*

**Ескерту:** Төмендегі кестедегі Qolda және Qwen3-4B модельдерінің салыстырылуы тек **қазақ** тіліндегі бенчмарктар нәтижелерін көрсетеді. Басқа модельдердің өнімділігі, сондай-ақ орыс және ағылшын тілдеріндегі көрсеткіштер кейінірек ұсынылады.

| Model | Mode | Avg | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|-------|------|-----|------|------------|-----------|-----|-------|------|
| Qwen3-4B | Direct | 52.00 | 42.43 | 56.88 | 42.04 | 64.77 | 73.62 | 32.27 |
| Qwen3-4B | Think | 57.73 | 52.98 | 51.27 | 41.86 | 79.65 | 64.82 | 55.81 |
| Qolda | Direct | 58.77 | 46.55 | 56.37 | 55.75 | 73.62 | 63.50 | 56.84 |
| Qolda | Think | **71.64** | **64.56** | **70.54** | **57.70** | **89.99** | **79.47** | **67.59** |

### Көру бенчмарктары
![Көру-тілдік бенчмарктарындағы модель өнімділігін салыстыру](assets/eval-results-vision.png)
*AI2D, MMStar, RealWorldQA және KazakhOCR сияқты көру-тілдік тапсырмаларындағы өнімділікті салыстыру.*

**Ескерту:** Төмендегі кестедегі Qolda және InternVL3.5-4B модельдерінің салыстырылуы тек **қазақ** тіліндегі көру-тілдік бенчмарктар нәтижелерін көрсетеді. Басқа модельдердің өнімділігі, сондай-ақ орыс және ағылшын тілдеріндегі көрсеткіштер кейінірек ұсынылады.

| Model | Mode | Avg | AI2D | MMStar | RealWorldQA | KazakhOCR |
|-------|------|-----|------|--------|-------------|-----------|
| InternVL3.5-4B | Direct | 42.23 | 52.33 | 47.47 | 38.32 | 30.81 |
| InternVL3.5-4B | Think | 42.58 | 51.42 | 49.33 | 38.74 | 30.81 |
| Qolda | Direct | 59.39 | 66.06 | 55.47 | 54.97 | **61.06** |
| Qolda | Think | **60.44** | **67.62** | **56.53** | **57.07** | 60.54 |

## Модельді қолдану
Transformers арқылы инференсті іске қосу үшін InternVL ұсынған [нұсқаулықтарды](https://huggingface.co/OpenGVLab/InternVL3_5-4B#inference-with-transformers) орындаңыз.

Немесе, модельді OpenAI-үйлесімді сервер арқылы іске қосу үшін lmdeploy құралын пайдалануға болады:
```bash
pip install "lmdeploy>=0.9.1"

lmdeploy serve api_server issai/Qolda --server-port 23333 --tp 1 --backend pytorch
```

**Ескерту:** Qolda-ның түпнұсқалық InternVL3.5-тен айырмашылығы, бұл модель API call жасаған кезде `extra_body` бөлігінде `enable_thinking` параметрінің нақты орнатылуын талап етеді. Тапсырманың күрделілігіне байланысты бос thinking жауабы қайтарылуы мүмкін.

Содан соң, стандартты API call жасаңыз:

```python
import base64
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "./assets/eval-results-text.png"

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{
        'role': 'user',
        'content': [
            {
                'type': 'text',
                'text': 'Берілген диаграмманың сипаттамасын бер.'
            },
            {
                'type': 'image_url',
                'image_url': {
                    'url': f'data:image/png;base64,{encode_image(image_path)}',
                },
            }
        ],
    }],
    max_tokens=8192,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "enable_thinking": True
    },
)

print(response.choices[0].message.content)
```

## Лицензия
Бұл модель Apache License 2.0 бойынша лицензияланған.
added_tokens.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "</box>": 151677,
3
+ "</img>": 151670,
4
+ "</quad>": 151673,
5
+ "</ref>": 151675,
6
+ "</think>": 151668,
7
+ "</tool_call>": 151658,
8
+ "</tool_response>": 151666,
9
+ "<IMG_CONTEXT>": 151671,
10
+ "<box>": 151676,
11
+ "<img>": 151669,
12
+ "<quad>": 151672,
13
+ "<ref>": 151674,
14
+ "<think>": 151667,
15
+ "<tool_call>": 151657,
16
+ "<tool_response>": 151665,
17
+ "<|box_end|>": 151649,
18
+ "<|box_start|>": 151648,
19
+ "<|endoftext|>": 151643,
20
+ "<|file_sep|>": 151664,
21
+ "<|fim_middle|>": 151660,
22
+ "<|fim_pad|>": 151662,
23
+ "<|fim_prefix|>": 151659,
24
+ "<|fim_suffix|>": 151661,
25
+ "<|im_end|>": 151645,
26
+ "<|im_start|>": 151644,
27
+ "<|image_pad|>": 151655,
28
+ "<|object_ref_end|>": 151647,
29
+ "<|object_ref_start|>": 151646,
30
+ "<|quad_end|>": 151651,
31
+ "<|quad_start|>": 151650,
32
+ "<|repo_name|>": 151663,
33
+ "<|video_pad|>": 151656,
34
+ "<|vision_end|>": 151653,
35
+ "<|vision_pad|>": 151654,
36
+ "<|vision_start|>": 151652
37
+ }
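
As a sanity check, the special-token ids listed above can be read back from the tokenizer. A small sketch follows; the repo id `issai/Qolda` is assumed from the README's lmdeploy command.

```python
from transformers import AutoTokenizer

# Sketch: confirm a few of the added-token ids above via the tokenizer.
tok = AutoTokenizer.from_pretrained("issai/Qolda", trust_remote_code=True)
for t in ["<IMG_CONTEXT>", "<img>", "</img>", "<think>", "</think>"]:
    print(t, tok.convert_tokens_to_ids(t))  # e.g. <IMG_CONTEXT> -> 151671
```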
assets/eval-results-text.png ADDED

Git LFS Details

  • SHA256: 977a09aae1499267ef35085597644a4ef1586b45ab0efb662014a22f298cc961
  • Pointer size: 131 Bytes
  • Size of remote file: 265 kB
assets/eval-results-vision.png ADDED

Git LFS Details

  • SHA256: a52904c1fb42d0ea8d430c0f46be0215eef0ebe81463198df602f0258ccfa7e1
  • Pointer size: 131 Bytes
  • Size of remote file: 232 kB
chat_template.jinja ADDED
@@ -0,0 +1,89 @@
1
+ {%- if tools %}
2
+ {{- '<|im_start|>system\n' }}
3
+ {%- if messages[0].role == 'system' %}
4
+ {{- messages[0].content + '\n\n' }}
5
+ {%- endif %}
6
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
7
+ {%- for tool in tools %}
8
+ {{- "\n" }}
9
+ {{- tool | tojson }}
10
+ {%- endfor %}
11
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
12
+ {%- else %}
13
+ {%- if messages[0].role == 'system' %}
14
+ {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
15
+ {%- endif %}
16
+ {%- endif %}
17
+ {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
18
+ {%- for message in messages[::-1] %}
19
+ {%- set index = (messages|length - 1) - loop.index0 %}
20
+ {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
21
+ {%- set ns.multi_step_tool = false %}
22
+ {%- set ns.last_query_index = index %}
23
+ {%- endif %}
24
+ {%- endfor %}
25
+ {%- for message in messages %}
26
+ {%- if message.content is string %}
27
+ {%- set content = message.content %}
28
+ {%- else %}
29
+ {%- set content = '' %}
30
+ {%- endif %}
31
+ {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
32
+ {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
33
+ {%- elif message.role == "assistant" %}
34
+ {%- set reasoning_content = '' %}
35
+ {%- if message.reasoning_content is string %}
36
+ {%- set reasoning_content = message.reasoning_content %}
37
+ {%- else %}
38
+ {%- if '</think>' in content %}
39
+ {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
40
+ {%- set content = content.split('</think>')[-1].lstrip('\n') %}
41
+ {%- endif %}
42
+ {%- endif %}
43
+ {%- if loop.index0 > ns.last_query_index %}
44
+ {%- if loop.last or (not loop.last and reasoning_content) %}
45
+ {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
46
+ {%- else %}
47
+ {{- '<|im_start|>' + message.role + '\n' + content }}
48
+ {%- endif %}
49
+ {%- else %}
50
+ {{- '<|im_start|>' + message.role + '\n' + content }}
51
+ {%- endif %}
52
+ {%- if message.tool_calls %}
53
+ {%- for tool_call in message.tool_calls %}
54
+ {%- if (loop.first and content) or (not loop.first) %}
55
+ {{- '\n' }}
56
+ {%- endif %}
57
+ {%- if tool_call.function %}
58
+ {%- set tool_call = tool_call.function %}
59
+ {%- endif %}
60
+ {{- '<tool_call>\n{"name": "' }}
61
+ {{- tool_call.name }}
62
+ {{- '", "arguments": ' }}
63
+ {%- if tool_call.arguments is string %}
64
+ {{- tool_call.arguments }}
65
+ {%- else %}
66
+ {{- tool_call.arguments | tojson }}
67
+ {%- endif %}
68
+ {{- '}\n</tool_call>' }}
69
+ {%- endfor %}
70
+ {%- endif %}
71
+ {{- '<|im_end|>\n' }}
72
+ {%- elif message.role == "tool" %}
73
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
74
+ {{- '<|im_start|>user' }}
75
+ {%- endif %}
76
+ {{- '\n<tool_response>\n' }}
77
+ {{- content }}
78
+ {{- '\n</tool_response>' }}
79
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
80
+ {{- '<|im_end|>\n' }}
81
+ {%- endif %}
82
+ {%- endif %}
83
+ {%- endfor %}
84
+ {%- if add_generation_prompt %}
85
+ {{- '<|im_start|>assistant\n' }}
86
+ {%- if enable_thinking is defined and enable_thinking is false %}
87
+ {{- '<think>\n\n</think>\n\n' }}
88
+ {%- endif %}
89
+ {%- endif %}
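
The template above branches on `enable_thinking` when building the generation prompt. A brief sketch (repo id `issai/Qolda` assumed, as elsewhere) of rendering it through the tokenizer, where extra keyword arguments are forwarded to the Jinja template:

```python
from transformers import AutoTokenizer

# Sketch: render the chat template above directly. enable_thinking is forwarded to Jinja
# and controls the <think> block appended after the assistant header.
tok = AutoTokenizer.from_pretrained("issai/Qolda", trust_remote_code=True)
messages = [{"role": "user", "content": "Сәлем!"}]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # appends an empty <think></think> block, forcing a direct answer
)
print(prompt)
```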
config.json ADDED
@@ -0,0 +1,128 @@
1
+ {
2
+ "architectures": [
3
+ "InternVLChatModel"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
7
+ "AutoModel": "modeling_internvl_chat.InternVLChatModel",
8
+ "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
9
+ },
10
+ "downsample_ratio": 0.5,
11
+ "dynamic_image_size": true,
12
+ "eos_token_id": 151645,
13
+ "force_image_size": 448,
14
+ "hidden_size": 2560,
15
+ "llm_config": {
16
+ "_attn_implementation_autoset": true,
17
+ "architectures": [
18
+ "Qwen3ForCausalLM"
19
+ ],
20
+ "attention_bias": false,
21
+ "attention_dropout": 0.0,
22
+ "eos_token_id": 151645,
23
+ "head_dim": 128,
24
+ "hidden_act": "silu",
25
+ "hidden_size": 2560,
26
+ "initializer_range": 0.02,
27
+ "intermediate_size": 9728,
28
+ "layer_types": [
29
+ "full_attention",
30
+ "full_attention",
31
+ "full_attention",
32
+ "full_attention",
33
+ "full_attention",
34
+ "full_attention",
35
+ "full_attention",
36
+ "full_attention",
37
+ "full_attention",
38
+ "full_attention",
39
+ "full_attention",
40
+ "full_attention",
41
+ "full_attention",
42
+ "full_attention",
43
+ "full_attention",
44
+ "full_attention",
45
+ "full_attention",
46
+ "full_attention",
47
+ "full_attention",
48
+ "full_attention",
49
+ "full_attention",
50
+ "full_attention",
51
+ "full_attention",
52
+ "full_attention",
53
+ "full_attention",
54
+ "full_attention",
55
+ "full_attention",
56
+ "full_attention",
57
+ "full_attention",
58
+ "full_attention",
59
+ "full_attention",
60
+ "full_attention",
61
+ "full_attention",
62
+ "full_attention",
63
+ "full_attention",
64
+ "full_attention"
65
+ ],
66
+ "max_position_embeddings": 40960,
67
+ "max_window_layers": 36,
68
+ "model_type": "qwen3",
69
+ "num_attention_heads": 32,
70
+ "num_hidden_layers": 36,
71
+ "num_key_value_heads": 8,
72
+ "rms_norm_eps": 1e-06,
73
+ "rope_scaling": null,
74
+ "rope_theta": 1000000,
75
+ "sliding_window": null,
76
+ "tie_word_embeddings": true,
77
+ "torch_dtype": "bfloat16",
78
+ "use_cache": false,
79
+ "use_sliding_window": false,
80
+ "vocab_size": 151936
81
+ },
82
+ "max_dynamic_patch": 12,
83
+ "min_dynamic_patch": 1,
84
+ "model_type": "internvl_chat",
85
+ "output_attentions": false,
86
+ "pad2square": false,
87
+ "pad_token_id": 151643,
88
+ "ps_version": "v2",
89
+ "select_layer": -1,
90
+ "template": "internvl2_5",
91
+ "tie_word_embeddings": true,
92
+ "torch_dtype": "bfloat16",
93
+ "transformers_version": null,
94
+ "use_backbone_lora": 0,
95
+ "use_llm_lora": 0,
96
+ "use_thumbnail": true,
97
+ "vision_config": {
98
+ "_attn_implementation_autoset": true,
99
+ "architectures": [
100
+ "InternVisionModel"
101
+ ],
102
+ "attention_dropout": 0.0,
103
+ "auto_map": {
104
+ "AutoConfig": "configuration_intern_vit.InternVisionConfig",
105
+ "AutoModel": "modeling_intern_vit.InternVisionModel"
106
+ },
107
+ "drop_path_rate": 0.0,
108
+ "dropout": 0.0,
109
+ "hidden_act": "gelu",
110
+ "hidden_size": 1024,
111
+ "image_size": 448,
112
+ "initializer_factor": 1.0,
113
+ "initializer_range": 0.02,
114
+ "intermediate_size": 4096,
115
+ "layer_norm_eps": 1e-06,
116
+ "model_type": "intern_vit_6b",
117
+ "norm_type": "layer_norm",
118
+ "num_attention_heads": 16,
119
+ "num_channels": 3,
120
+ "num_hidden_layers": 24,
121
+ "patch_size": 14,
122
+ "qk_normalization": false,
123
+ "qkv_bias": true,
124
+ "torch_dtype": "bfloat16",
125
+ "use_fa3": false,
126
+ "use_flash_attn": true
127
+ }
128
+ }
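
The composite configuration above can be inspected without loading any weights. A short sketch, again assuming the repo id `issai/Qolda`:

```python
from transformers import AutoConfig

# Sketch: load the composite InternVLChatConfig (via configuration_internvl_chat.py)
# and read back a few of the fields shown above.
cfg = AutoConfig.from_pretrained("issai/Qolda", trust_remote_code=True)
print(cfg.model_type)                    # internvl_chat
print(cfg.llm_config.num_hidden_layers)  # 36 (Qwen3-4B backbone)
print(cfg.vision_config.image_size)      # 448
print(cfg.downsample_ratio, cfg.max_dynamic_patch)
```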
configuration_intern_vit.py ADDED
@@ -0,0 +1,119 @@
1
+ # --------------------------------------------------------
2
+ # InternVL
3
+ # Copyright (c) 2024 OpenGVLab
4
+ # Licensed under The MIT License [see LICENSE for details]
5
+ # --------------------------------------------------------
6
+ import os
7
+ from typing import Union
8
+
9
+ from transformers.configuration_utils import PretrainedConfig
10
+ from transformers.utils import logging
11
+
12
+ logger = logging.get_logger(__name__)
13
+
14
+
15
+ class InternVisionConfig(PretrainedConfig):
16
+ r"""
17
+ This is the configuration class to store the configuration of a [`InternVisionModel`]. It is used to
18
+ instantiate a vision encoder according to the specified arguments, defining the model architecture.
19
+
20
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
21
+ documentation from [`PretrainedConfig`] for more information.
22
+
23
+ Args:
24
+ num_channels (`int`, *optional*, defaults to 3):
25
+ Number of color channels in the input images (e.g., 3 for RGB).
26
+ patch_size (`int`, *optional*, defaults to 14):
27
+ The size (resolution) of each patch.
28
+ image_size (`int`, *optional*, defaults to 224):
29
+ The size (resolution) of each image.
30
+ qkv_bias (`bool`, *optional*, defaults to `False`):
31
+ Whether to add a bias to the queries and values in the self-attention layers.
32
+ hidden_size (`int`, *optional*, defaults to 3200):
33
+ Dimensionality of the encoder layers and the pooler layer.
34
+ num_attention_heads (`int`, *optional*, defaults to 25):
35
+ Number of attention heads for each attention layer in the Transformer encoder.
36
+ intermediate_size (`int`, *optional*, defaults to 12800):
37
+ Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
38
+ qk_normalization (`bool`, *optional*, defaults to `True`):
39
+ Whether to normalize the queries and keys in the self-attention layers.
40
+ num_hidden_layers (`int`, *optional*, defaults to 48):
41
+ Number of hidden layers in the Transformer encoder.
42
+ use_flash_attn (`bool`, *optional*, defaults to `True`):
43
+ Whether to use flash attention mechanism.
44
+ hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
45
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
46
+ `"relu"`, `"selu"`, `"gelu_new"` and `"gelu"` are supported.
47
+ layer_norm_eps (`float`, *optional*, defaults to 1e-6):
48
+ The epsilon used by the layer normalization layers.
49
+ dropout (`float`, *optional*, defaults to 0.0):
50
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
51
+ drop_path_rate (`float`, *optional*, defaults to 0.0):
52
+ Dropout rate for stochastic depth.
53
+ attention_dropout (`float`, *optional*, defaults to 0.0):
54
+ The dropout ratio for the attention probabilities.
55
+ initializer_range (`float`, *optional*, defaults to 0.02):
56
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
57
+ initializer_factor (`float`, *optional*, defaults to 0.1):
58
+ A factor for layer scale.
59
+ """
60
+
61
+ model_type = 'intern_vit_6b'
62
+
63
+ def __init__(
64
+ self,
65
+ num_channels=3,
66
+ patch_size=14,
67
+ image_size=224,
68
+ qkv_bias=False,
69
+ hidden_size=3200,
70
+ num_attention_heads=25,
71
+ intermediate_size=12800,
72
+ qk_normalization=True,
73
+ num_hidden_layers=48,
74
+ use_flash_attn=True,
75
+ hidden_act='gelu',
76
+ norm_type='rms_norm',
77
+ layer_norm_eps=1e-6,
78
+ dropout=0.0,
79
+ drop_path_rate=0.0,
80
+ attention_dropout=0.0,
81
+ initializer_range=0.02,
82
+ initializer_factor=0.1,
83
+ **kwargs,
84
+ ):
85
+ super().__init__(**kwargs)
86
+
87
+ self.hidden_size = hidden_size
88
+ self.intermediate_size = intermediate_size
89
+ self.dropout = dropout
90
+ self.drop_path_rate = drop_path_rate
91
+ self.num_hidden_layers = num_hidden_layers
92
+ self.num_attention_heads = num_attention_heads
93
+ self.num_channels = num_channels
94
+ self.patch_size = patch_size
95
+ self.image_size = image_size
96
+ self.initializer_range = initializer_range
97
+ self.initializer_factor = initializer_factor
98
+ self.attention_dropout = attention_dropout
99
+ self.layer_norm_eps = layer_norm_eps
100
+ self.hidden_act = hidden_act
101
+ self.norm_type = norm_type
102
+ self.qkv_bias = qkv_bias
103
+ self.qk_normalization = qk_normalization
104
+ self.use_flash_attn = use_flash_attn
105
+
106
+ @classmethod
107
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> 'PretrainedConfig':
108
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
109
+
110
+ if 'vision_config' in config_dict:
111
+ config_dict = config_dict['vision_config']
112
+
113
+ if 'model_type' in config_dict and hasattr(cls, 'model_type') and config_dict['model_type'] != cls.model_type:
114
+ logger.warning(
115
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
116
+ f'{cls.model_type}. This is not supported for all configurations of models and can yield errors.'
117
+ )
118
+
119
+ return cls.from_dict(config_dict, **kwargs)
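
Note that `from_pretrained` above deliberately unwraps a nested `vision_config`, so the vision-encoder config can be read straight from the full chat checkpoint. An illustrative sketch, assuming the repository files are available locally for import and the repo id `issai/Qolda`:

```python
# Sketch: from_pretrained() extracts the nested "vision_config" key from config.json,
# so the vision config loads directly from the chat checkpoint.
from configuration_intern_vit import InternVisionConfig

vision_cfg = InternVisionConfig.from_pretrained("issai/Qolda")
print(vision_cfg.patch_size, vision_cfg.num_hidden_layers)  # 14, 24 per config.json above
```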
configuration_internvl_chat.py ADDED
@@ -0,0 +1,115 @@
1
+ # --------------------------------------------------------
2
+ # InternVL
3
+ # Copyright (c) 2024 OpenGVLab
4
+ # Licensed under The MIT License [see LICENSE for details]
5
+ # --------------------------------------------------------
6
+
7
+ import copy
8
+ from typing import Dict, Any, Optional
9
+
10
+ from transformers.configuration_utils import PretrainedConfig
11
+ from transformers.utils import logging
12
+
13
+ from .configuration_intern_vit import InternVisionConfig
14
+
15
+ logger = logging.get_logger(__name__)
16
+
17
+
18
+ class InternVLChatConfig(PretrainedConfig):
19
+ model_type = 'internvl_chat'
20
+ is_composition = True
21
+
22
+ def __init__(
23
+ self,
24
+ vision_config: Optional[Dict[str, Any]] = None,
25
+ llm_config: Optional[Dict[str, Any]] = None,
26
+ use_backbone_lora=0,
27
+ use_llm_lora=0,
28
+ select_layer=-1,
29
+ force_image_size=None,
30
+ downsample_ratio=0.5,
31
+ template=None,
32
+ dynamic_image_size=False,
33
+ use_thumbnail=False,
34
+ ps_version="v1",
35
+ min_dynamic_patch=1,
36
+ max_dynamic_patch=6,
37
+ **kwargs,
38
+ ):
39
+ super().__init__(**kwargs)
40
+
41
+ if vision_config is None:
42
+ vision_config = {'architectures': ['InternVisionModel']}
43
+ logger.info('vision_config is None. Initializing the InternVisionConfig with default values.')
44
+
45
+ if llm_config is None:
46
+ llm_config = {'architectures': ['Qwen2ForCausalLM']}
47
+ logger.info('llm_config is None. Initializing the LlamaConfig config with default values (`LlamaConfig`).')
48
+ assert 'architectures' in llm_config, "Should specify architecture in llm_config"
49
+
50
+ if isinstance(vision_config, dict):
51
+ self.vision_config = InternVisionConfig(**vision_config)
52
+ else:
53
+ self.vision_config = vision_config
54
+
55
+ if isinstance(llm_config, dict):
56
+ architecture: str = llm_config['architectures'][0]
57
+ if architecture == 'LlamaForCausalLM':
58
+ from transformers import LlamaConfig
59
+ self.llm_config = LlamaConfig(**llm_config)
60
+ elif architecture == 'Qwen2ForCausalLM':
61
+ from transformers import Qwen2Config
62
+ self.llm_config = Qwen2Config(**llm_config)
63
+ elif architecture == 'Qwen3MoeForCausalLM':
64
+ from transformers import Qwen3MoeConfig
65
+ self.llm_config = Qwen3MoeConfig(**llm_config)
66
+ elif architecture == 'Qwen3ForCausalLM':
67
+ from transformers import Qwen3Config
68
+ self.llm_config = Qwen3Config(**llm_config)
69
+ else:
70
+ raise ValueError('Unsupported architecture: {}'.format(architecture))
71
+ else:
72
+ self.llm_config = llm_config
73
+
74
+ self.use_backbone_lora = use_backbone_lora
75
+ self.use_llm_lora = use_llm_lora
76
+ self.select_layer = select_layer
77
+ self.force_image_size = force_image_size
78
+ self.downsample_ratio = downsample_ratio
79
+ self.template = template
80
+ self.dynamic_image_size = dynamic_image_size
81
+ self.use_thumbnail = use_thumbnail
82
+ self.ps_version = ps_version # pixel shuffle version
83
+ self.min_dynamic_patch = min_dynamic_patch
84
+ self.max_dynamic_patch = max_dynamic_patch
85
+ self.tie_word_embeddings = self.llm_config.tie_word_embeddings
86
+
87
+ logger.info(f'vision_select_layer: {self.select_layer}')
88
+ logger.info(f'ps_version: {self.ps_version}')
89
+ logger.info(f'min_dynamic_patch: {self.min_dynamic_patch}')
90
+ logger.info(f'max_dynamic_patch: {self.max_dynamic_patch}')
91
+
92
+ def to_dict(self):
93
+ """
94
+ Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
95
+
96
+ Returns:
97
+ `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
98
+ """
99
+ output = copy.deepcopy(self.__dict__)
100
+ output['vision_config'] = self.vision_config.to_dict()
101
+ output['llm_config'] = self.llm_config.to_dict()
102
+ output['model_type'] = self.__class__.model_type
103
+ output['use_backbone_lora'] = self.use_backbone_lora
104
+ output['use_llm_lora'] = self.use_llm_lora
105
+ output['select_layer'] = self.select_layer
106
+ output['force_image_size'] = self.force_image_size
107
+ output['downsample_ratio'] = self.downsample_ratio
108
+ output['template'] = self.template
109
+ output['dynamic_image_size'] = self.dynamic_image_size
110
+ output['use_thumbnail'] = self.use_thumbnail
111
+ output['ps_version'] = self.ps_version
112
+ output['min_dynamic_patch'] = self.min_dynamic_patch
113
+ output['max_dynamic_patch'] = self.max_dynamic_patch
114
+
115
+ return output
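
The constructor above dispatches on `llm_config['architectures'][0]`, so a Qwen3 backbone is selected simply by naming it. An illustrative sketch, assuming the repository files are importable locally as in the previous example:

```python
# Sketch: the architectures entry selects the LLM config class (Qwen3Config here).
from configuration_internvl_chat import InternVLChatConfig

cfg = InternVLChatConfig(
    vision_config={"architectures": ["InternVisionModel"], "image_size": 448},
    llm_config={"architectures": ["Qwen3ForCausalLM"], "hidden_size": 2560},
    force_image_size=448,
    dynamic_image_size=True,
    max_dynamic_patch=12,
    template="internvl2_5",
)
print(type(cfg.llm_config).__name__)  # Qwen3Config
```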
conversation.py ADDED
@@ -0,0 +1,391 @@
1
+ """
2
+ Conversation prompt templates.
3
+
4
+ We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
+ If you have changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
6
+
7
+ Modified from https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
8
+ """
9
+
10
+ import dataclasses
11
+ from enum import IntEnum, auto
12
+ from typing import Dict, List, Tuple, Union
13
+
14
+
15
+ class SeparatorStyle(IntEnum):
16
+ """Separator styles."""
17
+
18
+ ADD_COLON_SINGLE = auto()
19
+ ADD_COLON_TWO = auto()
20
+ ADD_COLON_SPACE_SINGLE = auto()
21
+ NO_COLON_SINGLE = auto()
22
+ NO_COLON_TWO = auto()
23
+ ADD_NEW_LINE_SINGLE = auto()
24
+ LLAMA2 = auto()
25
+ CHATGLM = auto()
26
+ CHATML = auto()
27
+ CHATINTERN = auto()
28
+ DOLLY = auto()
29
+ RWKV = auto()
30
+ PHOENIX = auto()
31
+ ROBIN = auto()
32
+ FALCON_CHAT = auto()
33
+ CHATGLM3 = auto()
34
+ INTERNVL_ZH = auto()
35
+ MPT = auto()
36
+
37
+
38
+ @dataclasses.dataclass
39
+ class Conversation:
40
+ """A class that manages prompt templates and keeps all conversation history."""
41
+
42
+ # The name of this template
43
+ name: str
44
+ # The template of the system prompt
45
+ system_template: str = '{system_message}'
46
+ # The system message
47
+ system_message: str = ''
48
+ # The names of two roles
49
+ roles: Tuple[str] = ('USER', 'ASSISTANT')
50
+ # All messages. Each item is (role, message).
51
+ messages: List[List[str]] = ()
52
+ # The number of few shot examples
53
+ offset: int = 0
54
+ # The separator style and configurations
55
+ sep_style: SeparatorStyle = SeparatorStyle.ADD_COLON_SINGLE
56
+ sep: str = '\n'
57
+ sep2: str = None
58
+ # Stop criteria (the default one is EOS token)
59
+ stop_str: Union[str, List[str]] = None
60
+ # Stops generation if meeting any token in this list
61
+ stop_token_ids: List[int] = None
62
+
63
+ def get_prompt(self) -> str:
64
+ """Get the prompt for generation."""
65
+ system_prompt = self.system_template.format(system_message=self.system_message)
66
+ if self.sep_style == SeparatorStyle.ADD_COLON_SINGLE:
67
+ ret = system_prompt + self.sep
68
+ for role, message in self.messages:
69
+ if message:
70
+ ret += role + ': ' + message + self.sep
71
+ else:
72
+ ret += role + ':'
73
+ return ret
74
+ elif self.sep_style == SeparatorStyle.ADD_COLON_TWO:
75
+ seps = [self.sep, self.sep2]
76
+ ret = system_prompt + seps[0]
77
+ for i, (role, message) in enumerate(self.messages):
78
+ if message:
79
+ ret += role + ': ' + message + seps[i % 2]
80
+ else:
81
+ ret += role + ':'
82
+ return ret
83
+ elif self.sep_style == SeparatorStyle.ADD_COLON_SPACE_SINGLE:
84
+ ret = system_prompt + self.sep
85
+ for role, message in self.messages:
86
+ if message:
87
+ ret += role + ': ' + message + self.sep
88
+ else:
89
+ ret += role + ': ' # must be end with a space
90
+ return ret
91
+ elif self.sep_style == SeparatorStyle.ADD_NEW_LINE_SINGLE:
92
+ ret = '' if system_prompt == '' else system_prompt + self.sep
93
+ for role, message in self.messages:
94
+ if message:
95
+ ret += role + '\n' + message + self.sep
96
+ else:
97
+ ret += role + '\n'
98
+ return ret
99
+ elif self.sep_style == SeparatorStyle.NO_COLON_SINGLE:
100
+ ret = system_prompt
101
+ for role, message in self.messages:
102
+ if message:
103
+ ret += role + message + self.sep
104
+ else:
105
+ ret += role
106
+ return ret
107
+ elif self.sep_style == SeparatorStyle.NO_COLON_TWO:
108
+ seps = [self.sep, self.sep2]
109
+ ret = system_prompt
110
+ for i, (role, message) in enumerate(self.messages):
111
+ if message:
112
+ ret += role + message + seps[i % 2]
113
+ else:
114
+ ret += role
115
+ return ret
116
+ elif self.sep_style == SeparatorStyle.RWKV:
117
+ ret = system_prompt
118
+ for i, (role, message) in enumerate(self.messages):
119
+ if message:
120
+ ret += (
121
+ role
122
+ + ': '
123
+ + message.replace('\r\n', '\n').replace('\n\n', '\n')
124
+ )
125
+ ret += '\n\n'
126
+ else:
127
+ ret += role + ':'
128
+ return ret
129
+ elif self.sep_style == SeparatorStyle.LLAMA2:
130
+ seps = [self.sep, self.sep2]
131
+ if self.system_message:
132
+ ret = system_prompt
133
+ else:
134
+ ret = '[INST] '
135
+ for i, (role, message) in enumerate(self.messages):
136
+ tag = self.roles[i % 2]
137
+ if message:
138
+ if i == 0:
139
+ ret += message + ' '
140
+ else:
141
+ ret += tag + ' ' + message + seps[i % 2]
142
+ else:
143
+ ret += tag
144
+ return ret
145
+ elif self.sep_style == SeparatorStyle.CHATGLM:
146
+ # source: https://huggingface.co/THUDM/chatglm-6b/blob/1d240ba371910e9282298d4592532d7f0f3e9f3e/modeling_chatglm.py#L1302-L1308
147
+ # source2: https://huggingface.co/THUDM/chatglm2-6b/blob/e186c891cf64310ac66ef10a87e6635fa6c2a579/modeling_chatglm.py#L926
148
+ round_add_n = 1 if self.name == 'chatglm2' else 0
149
+ if system_prompt:
150
+ ret = system_prompt + self.sep
151
+ else:
152
+ ret = ''
153
+
154
+ for i, (role, message) in enumerate(self.messages):
155
+ if i % 2 == 0:
156
+ ret += f'[Round {i//2 + round_add_n}]{self.sep}'
157
+
158
+ if message:
159
+ ret += f'{role}:{message}{self.sep}'
160
+ else:
161
+ ret += f'{role}:'
162
+ return ret
163
+ elif self.sep_style == SeparatorStyle.CHATML:
164
+ ret = '' if system_prompt == '' else system_prompt + self.sep + '\n'
165
+ for role, message in self.messages:
166
+ if message:
167
+ ret += role + '\n' + message + self.sep + '\n'
168
+ else:
169
+ ret += role + '\n'
170
+ return ret
171
+ elif self.sep_style == SeparatorStyle.CHATGLM3:
172
+ ret = ''
173
+ if self.system_message:
174
+ ret += system_prompt
175
+ for role, message in self.messages:
176
+ if message:
177
+ ret += role + '\n' + ' ' + message
178
+ else:
179
+ ret += role
180
+ return ret
181
+ elif self.sep_style == SeparatorStyle.CHATINTERN:
182
+ # source: https://huggingface.co/internlm/internlm-chat-7b-8k/blob/bd546fa984b4b0b86958f56bf37f94aa75ab8831/modeling_internlm.py#L771
183
+ seps = [self.sep, self.sep2]
184
+ ret = system_prompt
185
+ for i, (role, message) in enumerate(self.messages):
186
+ # if i % 2 == 0:
187
+ # ret += "<s>"
188
+ if message:
189
+ ret += role + ':' + message + seps[i % 2] + '\n'
190
+ else:
191
+ ret += role + ':'
192
+ return ret
193
+ elif self.sep_style == SeparatorStyle.DOLLY:
194
+ seps = [self.sep, self.sep2]
195
+ ret = system_prompt
196
+ for i, (role, message) in enumerate(self.messages):
197
+ if message:
198
+ ret += role + ':\n' + message + seps[i % 2]
199
+ if i % 2 == 1:
200
+ ret += '\n\n'
201
+ else:
202
+ ret += role + ':\n'
203
+ return ret
204
+ elif self.sep_style == SeparatorStyle.PHOENIX:
205
+ ret = system_prompt
206
+ for role, message in self.messages:
207
+ if message:
208
+ ret += role + ': ' + '<s>' + message + '</s>'
209
+ else:
210
+ ret += role + ': ' + '<s>'
211
+ return ret
212
+ elif self.sep_style == SeparatorStyle.ROBIN:
213
+ ret = system_prompt + self.sep
214
+ for role, message in self.messages:
215
+ if message:
216
+ ret += role + ':\n' + message + self.sep
217
+ else:
218
+ ret += role + ':\n'
219
+ return ret
220
+ elif self.sep_style == SeparatorStyle.FALCON_CHAT:
221
+ ret = ''
222
+ if self.system_message:
223
+ ret += system_prompt + self.sep
224
+ for role, message in self.messages:
225
+ if message:
226
+ ret += role + ': ' + message + self.sep
227
+ else:
228
+ ret += role + ':'
229
+
230
+ return ret
231
+ elif self.sep_style == SeparatorStyle.INTERNVL_ZH:
232
+ seps = [self.sep, self.sep2]
233
+ ret = self.system_message + seps[0]
234
+ for i, (role, message) in enumerate(self.messages):
235
+ if message:
236
+ ret += role + ': ' + message + seps[i % 2]
237
+ else:
238
+ ret += role + ':'
239
+ return ret
240
+ elif self.sep_style == SeparatorStyle.MPT:
241
+ ret = system_prompt + self.sep
242
+ for role, message in self.messages:
243
+ if message:
244
+ if type(message) is tuple:
245
+ message, _, _ = message
246
+ ret += role + message + self.sep
247
+ else:
248
+ ret += role
249
+ return ret
250
+ else:
251
+ raise ValueError(f'Invalid style: {self.sep_style}')
252
+
253
+ def set_system_message(self, system_message: str):
254
+ """Set the system message."""
255
+ self.system_message = system_message
256
+
257
+ def append_message(self, role: str, message: str):
258
+ """Append a new message."""
259
+ self.messages.append([role, message])
260
+
261
+ def update_last_message(self, message: str):
262
+ """Update the last output.
263
+
264
+ The last message is typically set to be None when constructing the prompt,
265
+ so we need to update it in-place after getting the response from a model.
266
+ """
267
+ self.messages[-1][1] = message
268
+
269
+ def to_gradio_chatbot(self):
270
+ """Convert the conversation to gradio chatbot format."""
271
+ ret = []
272
+ for i, (role, msg) in enumerate(self.messages[self.offset :]):
273
+ if i % 2 == 0:
274
+ ret.append([msg, None])
275
+ else:
276
+ ret[-1][-1] = msg
277
+ return ret
278
+
279
+ def to_openai_api_messages(self):
280
+ """Convert the conversation to OpenAI chat completion format."""
281
+ ret = [{'role': 'system', 'content': self.system_message}]
282
+
283
+ for i, (_, msg) in enumerate(self.messages[self.offset :]):
284
+ if i % 2 == 0:
285
+ ret.append({'role': 'user', 'content': msg})
286
+ else:
287
+ if msg is not None:
288
+ ret.append({'role': 'assistant', 'content': msg})
289
+ return ret
290
+
291
+ def copy(self):
292
+ return Conversation(
293
+ name=self.name,
294
+ system_template=self.system_template,
295
+ system_message=self.system_message,
296
+ roles=self.roles,
297
+ messages=[[x, y] for x, y in self.messages],
298
+ offset=self.offset,
299
+ sep_style=self.sep_style,
300
+ sep=self.sep,
301
+ sep2=self.sep2,
302
+ stop_str=self.stop_str,
303
+ stop_token_ids=self.stop_token_ids,
304
+ )
305
+
306
+ def dict(self):
307
+ return {
308
+ 'template_name': self.name,
309
+ 'system_message': self.system_message,
310
+ 'roles': self.roles,
311
+ 'messages': self.messages,
312
+ 'offset': self.offset,
313
+ }
314
+
315
+
316
+ # A global registry for all conversation templates
317
+ conv_templates: Dict[str, Conversation] = {}
318
+
319
+
320
+ def register_conv_template(template: Conversation, override: bool = False):
321
+ """Register a new conversation template."""
322
+ if not override:
323
+ assert (
324
+ template.name not in conv_templates
325
+ ), f'{template.name} has been registered.'
326
+
327
+ conv_templates[template.name] = template
328
+
329
+
330
+ def get_conv_template(name: str) -> Conversation:
331
+ """Get a conversation template."""
332
+ return conv_templates[name].copy()
333
+
334
+
335
+ # Both Hermes-2 and internlm2-chat are chatml-format conversation templates. The difference
336
+ # is that during training, the preprocessing function for the Hermes-2 template doesn't add
337
+ # <s> at the beginning of the tokenized sequence, while the internlm2-chat template does.
338
+ # Therefore, they are completely equivalent during inference.
339
+ register_conv_template(
340
+ Conversation(
341
+ name='Hermes-2',
342
+ system_template='<|im_start|>system\n{system_message}',
343
+ # note: The new system prompt was not used here to avoid changes in benchmark performance.
344
+ # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。',
345
+ system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
346
+ roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
347
+ sep_style=SeparatorStyle.MPT,
348
+ sep='<|im_end|>',
349
+ stop_str='<|endoftext|>',
350
+ )
351
+ )
352
+
353
+
354
+ register_conv_template(
355
+ Conversation(
356
+ name='internlm2-chat',
357
+ system_template='<|im_start|>system\n{system_message}',
358
+ # note: The new system prompt was not used here to avoid changes in benchmark performance.
359
+ # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。',
360
+ system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
361
+ roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
362
+ sep_style=SeparatorStyle.MPT,
363
+ sep='<|im_end|>',
364
+ )
365
+ )
366
+
367
+
368
+ register_conv_template(
369
+ Conversation(
370
+ name='phi3-chat',
371
+ system_template='<|system|>\n{system_message}',
372
+ # note: The new system prompt was not used here to avoid changes in benchmark performance.
373
+ # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。',
374
+ system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
375
+ roles=('<|user|>\n', '<|assistant|>\n'),
376
+ sep_style=SeparatorStyle.MPT,
377
+ sep='<|end|>',
378
+ )
379
+ )
380
+
381
+
382
+ register_conv_template(
383
+ Conversation(
384
+ name='internvl2_5',
385
+ system_template='<|im_start|>system\n{system_message}',
386
+ system_message='You are a useful multi-modal AI assistant.',
387
+ roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
388
+ sep_style=SeparatorStyle.MPT,
389
+ sep='<|im_end|>\n',
390
+ )
391
+ )
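
A short usage sketch of the templates registered above, exercising only the module's own API (here the `internvl2_5` template, which `config.json` selects); it assumes `conversation.py` is importable locally, e.g. from a clone of this repo:

```python
# Sketch: build a prompt with the internvl2_5 template registered above.
from conversation import get_conv_template

conv = get_conv_template("internvl2_5")
conv.append_message(conv.roles[0], "Describe the image.\n<image>")
conv.append_message(conv.roles[1], None)  # leave the assistant turn open for generation
print(conv.get_prompt())
```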
generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 151643,
4
+ "eos_token_id": 151645,
5
+ "pad_token_id": 151643,
6
+ "transformers_version": "4.51.0"
7
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5d75dc0e63a0c829641d2da6c36b1d2f7c3916d7e3453953a4b18a7e4f713da1
3
+ size 4969694112
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:67882c675f5ab2648849cdabf03ad552df3fe7ac7aa3a238fbe1b1bb9b695d3d
3
+ size 3717464248
model.safetensors.index.json ADDED
@@ -0,0 +1,752 @@
1
+ {
2
+ "metadata": {
3
+ "total_parameters": 499518976,
4
+ "total_size": 8687066112
5
+ },
6
+ "weight_map": {
7
+ "language_model.model.embed_tokens.weight": "model-00001-of-00002.safetensors",
8
+ "language_model.model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
9
+ "language_model.model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
10
+ "language_model.model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
11
+ "language_model.model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
12
+ "language_model.model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
13
+ "language_model.model.layers.0.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
14
+ "language_model.model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
15
+ "language_model.model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
16
+ "language_model.model.layers.0.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
17
+ "language_model.model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
18
+ "language_model.model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
19
+ "language_model.model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
20
+ "language_model.model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
21
+ "language_model.model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
22
+ "language_model.model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
23
+ "language_model.model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
24
+ "language_model.model.layers.1.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
25
+ "language_model.model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
26
+ "language_model.model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
27
+ "language_model.model.layers.1.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
28
+ "language_model.model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
29
+ "language_model.model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
30
+ "language_model.model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
31
+ "language_model.model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
32
+ "language_model.model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
33
+ "language_model.model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
34
+ "language_model.model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
35
+ "language_model.model.layers.10.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
36
+ "language_model.model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
37
+ "language_model.model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
38
+ "language_model.model.layers.10.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
39
+ "language_model.model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
40
+ "language_model.model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
41
+ "language_model.model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
42
+ "language_model.model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
43
+ "language_model.model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
44
+ "language_model.model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
45
+ "language_model.model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
46
+ "language_model.model.layers.11.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
47
+ "language_model.model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
48
+ "language_model.model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
49
+ "language_model.model.layers.11.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
50
+ "language_model.model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
51
+ "language_model.model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
52
+ "language_model.model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
53
+ "language_model.model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
54
+ "language_model.model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
55
+ "language_model.model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
56
+ "language_model.model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
57
+ "language_model.model.layers.12.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
58
+ "language_model.model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
59
+ "language_model.model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
60
+ "language_model.model.layers.12.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
61
+ "language_model.model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
62
+ "language_model.model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
63
+ "language_model.model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
64
+ "language_model.model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
65
+ "language_model.model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
66
+ "language_model.model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
67
+ "language_model.model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
68
+ "language_model.model.layers.13.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
69
+ "language_model.model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
70
+ "language_model.model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
71
+ "language_model.model.layers.13.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
72
+ "language_model.model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
73
+ "language_model.model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
74
+ "language_model.model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
75
+ "language_model.model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
76
+ "language_model.model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
77
+ "language_model.model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
78
+ "language_model.model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
79
+ "language_model.model.layers.14.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
80
+ "language_model.model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
81
+ "language_model.model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
82
+ "language_model.model.layers.14.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
83
+ "language_model.model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
84
+ "language_model.model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
85
+ "language_model.model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
86
+ "language_model.model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
87
+ "language_model.model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
88
+ "language_model.model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
89
+ "language_model.model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
90
+ "language_model.model.layers.15.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
91
+ "language_model.model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
92
+ "language_model.model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
93
+ "language_model.model.layers.15.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
94
+ "language_model.model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
95
+ "language_model.model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
96
+ "language_model.model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
97
+ "language_model.model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
98
+ "language_model.model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
99
+ "language_model.model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
100
+ "language_model.model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
101
+ "language_model.model.layers.16.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
102
+ "language_model.model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
103
+ "language_model.model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
104
+ "language_model.model.layers.16.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
105
+ "language_model.model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
106
+ "language_model.model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
107
+ "language_model.model.layers.17.input_layernorm.weight": "model-00002-of-00002.safetensors",
108
+ "language_model.model.layers.17.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
109
+ "language_model.model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
110
+ "language_model.model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
111
+ "language_model.model.layers.17.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
112
+ "language_model.model.layers.17.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
113
+ "language_model.model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
114
+ "language_model.model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
115
+ "language_model.model.layers.17.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
116
+ "language_model.model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
117
+ "language_model.model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
118
+ "language_model.model.layers.18.input_layernorm.weight": "model-00002-of-00002.safetensors",
119
+ "language_model.model.layers.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
120
+ "language_model.model.layers.18.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
121
+ "language_model.model.layers.18.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
122
+ "language_model.model.layers.18.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
123
+ "language_model.model.layers.18.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
124
+ "language_model.model.layers.18.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
125
+ "language_model.model.layers.18.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
126
+ "language_model.model.layers.18.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
127
+ "language_model.model.layers.18.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
128
+ "language_model.model.layers.18.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
129
+ "language_model.model.layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
130
+ "language_model.model.layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
131
+ "language_model.model.layers.19.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
132
+ "language_model.model.layers.19.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
133
+ "language_model.model.layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
134
+ "language_model.model.layers.19.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
135
+ "language_model.model.layers.19.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
136
+ "language_model.model.layers.19.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
137
+ "language_model.model.layers.19.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
138
+ "language_model.model.layers.19.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
139
+ "language_model.model.layers.19.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
140
+ "language_model.model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
141
+ "language_model.model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
142
+ "language_model.model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
143
+ "language_model.model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
144
+ "language_model.model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
145
+ "language_model.model.layers.2.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
146
+ "language_model.model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
147
+ "language_model.model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
148
+ "language_model.model.layers.2.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
149
+ "language_model.model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
150
+ "language_model.model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
151
+ "language_model.model.layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
152
+ "language_model.model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
153
+ "language_model.model.layers.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
154
+ "language_model.model.layers.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
155
+ "language_model.model.layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
156
+ "language_model.model.layers.20.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
157
+ "language_model.model.layers.20.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
158
+ "language_model.model.layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
159
+ "language_model.model.layers.20.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
160
+ "language_model.model.layers.20.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
161
+ "language_model.model.layers.20.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
162
+ "language_model.model.layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
163
+ "language_model.model.layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
164
+ "language_model.model.layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
165
+ "language_model.model.layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
166
+ "language_model.model.layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
167
+ "language_model.model.layers.21.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
168
+ "language_model.model.layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
169
+ "language_model.model.layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
170
+ "language_model.model.layers.21.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
171
+ "language_model.model.layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
172
+ "language_model.model.layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
173
+ "language_model.model.layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
174
+ "language_model.model.layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
175
+ "language_model.model.layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
176
+ "language_model.model.layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
177
+ "language_model.model.layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
178
+ "language_model.model.layers.22.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
179
+ "language_model.model.layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
180
+ "language_model.model.layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
181
+ "language_model.model.layers.22.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
182
+ "language_model.model.layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
183
+ "language_model.model.layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
184
+ "language_model.model.layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
185
+ "language_model.model.layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
186
+ "language_model.model.layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
187
+ "language_model.model.layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
188
+ "language_model.model.layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
189
+ "language_model.model.layers.23.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
190
+ "language_model.model.layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
191
+ "language_model.model.layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
192
+ "language_model.model.layers.23.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
193
+ "language_model.model.layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
194
+ "language_model.model.layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
195
+ "language_model.model.layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
196
+ "language_model.model.layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
197
+ "language_model.model.layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
198
+ "language_model.model.layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
199
+ "language_model.model.layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
200
+ "language_model.model.layers.24.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
201
+ "language_model.model.layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
202
+ "language_model.model.layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
203
+ "language_model.model.layers.24.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
204
+ "language_model.model.layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
205
+ "language_model.model.layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
206
+ "language_model.model.layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
207
+ "language_model.model.layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
208
+ "language_model.model.layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
209
+ "language_model.model.layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
210
+ "language_model.model.layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
211
+ "language_model.model.layers.25.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
212
+ "language_model.model.layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
213
+ "language_model.model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
214
+ "language_model.model.layers.25.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
215
+ "language_model.model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
216
+ "language_model.model.layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
217
+ "language_model.model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
218
+ "language_model.model.layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
219
+ "language_model.model.layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
220
+ "language_model.model.layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
221
+ "language_model.model.layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
222
+ "language_model.model.layers.26.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
223
+ "language_model.model.layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
224
+ "language_model.model.layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
225
+ "language_model.model.layers.26.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
226
+ "language_model.model.layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
227
+ "language_model.model.layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
228
+ "language_model.model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
229
+ "language_model.model.layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
230
+ "language_model.model.layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
231
+ "language_model.model.layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
232
+ "language_model.model.layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
233
+ "language_model.model.layers.27.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
234
+ "language_model.model.layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
235
+ "language_model.model.layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
236
+ "language_model.model.layers.27.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
237
+ "language_model.model.layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
238
+ "language_model.model.layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
239
+ "language_model.model.layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
240
+ "language_model.model.layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
241
+ "language_model.model.layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
242
+ "language_model.model.layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
243
+ "language_model.model.layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
244
+ "language_model.model.layers.28.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
245
+ "language_model.model.layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
246
+ "language_model.model.layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
247
+ "language_model.model.layers.28.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
248
+ "language_model.model.layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
249
+ "language_model.model.layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
250
+ "language_model.model.layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
251
+ "language_model.model.layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
252
+ "language_model.model.layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
253
+ "language_model.model.layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
254
+ "language_model.model.layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
255
+ "language_model.model.layers.29.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
256
+ "language_model.model.layers.29.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
257
+ "language_model.model.layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
258
+ "language_model.model.layers.29.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
259
+ "language_model.model.layers.29.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
260
+ "language_model.model.layers.29.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
261
+ "language_model.model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
262
+ "language_model.model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
263
+ "language_model.model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
264
+ "language_model.model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
265
+ "language_model.model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
266
+ "language_model.model.layers.3.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
267
+ "language_model.model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
268
+ "language_model.model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
269
+ "language_model.model.layers.3.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
270
+ "language_model.model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
271
+ "language_model.model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
272
+ "language_model.model.layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
273
+ "language_model.model.layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
274
+ "language_model.model.layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
275
+ "language_model.model.layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
276
+ "language_model.model.layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
277
+ "language_model.model.layers.30.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
278
+ "language_model.model.layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
279
+ "language_model.model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
280
+ "language_model.model.layers.30.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
281
+ "language_model.model.layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
282
+ "language_model.model.layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
283
+ "language_model.model.layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
284
+ "language_model.model.layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
285
+ "language_model.model.layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
286
+ "language_model.model.layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
287
+ "language_model.model.layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
288
+ "language_model.model.layers.31.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
289
+ "language_model.model.layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
290
+ "language_model.model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
291
+ "language_model.model.layers.31.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
292
+ "language_model.model.layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
293
+ "language_model.model.layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
294
+ "language_model.model.layers.32.input_layernorm.weight": "model-00002-of-00002.safetensors",
295
+ "language_model.model.layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
296
+ "language_model.model.layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
297
+ "language_model.model.layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
298
+ "language_model.model.layers.32.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
299
+ "language_model.model.layers.32.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
300
+ "language_model.model.layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
301
+ "language_model.model.layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
302
+ "language_model.model.layers.32.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
303
+ "language_model.model.layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
304
+ "language_model.model.layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
305
+ "language_model.model.layers.33.input_layernorm.weight": "model-00002-of-00002.safetensors",
306
+ "language_model.model.layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
307
+ "language_model.model.layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
308
+ "language_model.model.layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
309
+ "language_model.model.layers.33.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
310
+ "language_model.model.layers.33.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
311
+ "language_model.model.layers.33.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
312
+ "language_model.model.layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
313
+ "language_model.model.layers.33.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
314
+ "language_model.model.layers.33.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
315
+ "language_model.model.layers.33.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
316
+ "language_model.model.layers.34.input_layernorm.weight": "model-00002-of-00002.safetensors",
317
+ "language_model.model.layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
318
+ "language_model.model.layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
319
+ "language_model.model.layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
320
+ "language_model.model.layers.34.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
321
+ "language_model.model.layers.34.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
322
+ "language_model.model.layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
323
+ "language_model.model.layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
324
+ "language_model.model.layers.34.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
325
+ "language_model.model.layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
326
+ "language_model.model.layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
327
+ "language_model.model.layers.35.input_layernorm.weight": "model-00002-of-00002.safetensors",
328
+ "language_model.model.layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
329
+ "language_model.model.layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
330
+ "language_model.model.layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
331
+ "language_model.model.layers.35.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
332
+ "language_model.model.layers.35.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
333
+ "language_model.model.layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
334
+ "language_model.model.layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
335
+ "language_model.model.layers.35.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
336
+ "language_model.model.layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
337
+ "language_model.model.layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
338
+ "language_model.model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
339
+ "language_model.model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
340
+ "language_model.model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
341
+ "language_model.model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
342
+ "language_model.model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
343
+ "language_model.model.layers.4.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
344
+ "language_model.model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
345
+ "language_model.model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
346
+ "language_model.model.layers.4.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
347
+ "language_model.model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
348
+ "language_model.model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
349
+ "language_model.model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
350
+ "language_model.model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
351
+ "language_model.model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
352
+ "language_model.model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
353
+ "language_model.model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
354
+ "language_model.model.layers.5.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
355
+ "language_model.model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
356
+ "language_model.model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
357
+ "language_model.model.layers.5.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
358
+ "language_model.model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
359
+ "language_model.model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
360
+ "language_model.model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
361
+ "language_model.model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
362
+ "language_model.model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
363
+ "language_model.model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
364
+ "language_model.model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
365
+ "language_model.model.layers.6.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
366
+ "language_model.model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
367
+ "language_model.model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
368
+ "language_model.model.layers.6.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
369
+ "language_model.model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
370
+ "language_model.model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
371
+ "language_model.model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
372
+ "language_model.model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
373
+ "language_model.model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
374
+ "language_model.model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
375
+ "language_model.model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
376
+ "language_model.model.layers.7.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
377
+ "language_model.model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
378
+ "language_model.model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
379
+ "language_model.model.layers.7.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
380
+ "language_model.model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
381
+ "language_model.model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
382
+ "language_model.model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
383
+ "language_model.model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
384
+ "language_model.model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
385
+ "language_model.model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
386
+ "language_model.model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
387
+ "language_model.model.layers.8.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
388
+ "language_model.model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
389
+ "language_model.model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
390
+ "language_model.model.layers.8.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
391
+ "language_model.model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
392
+ "language_model.model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
393
+ "language_model.model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
394
+ "language_model.model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
395
+ "language_model.model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
396
+ "language_model.model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
397
+ "language_model.model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
398
+ "language_model.model.layers.9.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
399
+ "language_model.model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
400
+ "language_model.model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
401
+ "language_model.model.layers.9.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
402
+ "language_model.model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
403
+ "language_model.model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
404
+ "language_model.model.norm.weight": "model-00002-of-00002.safetensors",
405
+ "mlp1.0.bias": "model-00002-of-00002.safetensors",
406
+ "mlp1.0.weight": "model-00002-of-00002.safetensors",
407
+ "mlp1.1.bias": "model-00002-of-00002.safetensors",
408
+ "mlp1.1.weight": "model-00002-of-00002.safetensors",
409
+ "mlp1.3.bias": "model-00002-of-00002.safetensors",
410
+ "mlp1.3.weight": "model-00002-of-00002.safetensors",
411
+ "vision_model.embeddings.class_embedding": "model-00001-of-00002.safetensors",
412
+ "vision_model.embeddings.patch_embedding.bias": "model-00001-of-00002.safetensors",
413
+ "vision_model.embeddings.patch_embedding.weight": "model-00001-of-00002.safetensors",
414
+ "vision_model.embeddings.position_embedding": "model-00001-of-00002.safetensors",
415
+ "vision_model.encoder.layers.0.attn.proj.bias": "model-00001-of-00002.safetensors",
416
+ "vision_model.encoder.layers.0.attn.proj.weight": "model-00001-of-00002.safetensors",
417
+ "vision_model.encoder.layers.0.attn.qkv.bias": "model-00001-of-00002.safetensors",
418
+ "vision_model.encoder.layers.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
419
+ "vision_model.encoder.layers.0.ls1": "model-00001-of-00002.safetensors",
420
+ "vision_model.encoder.layers.0.ls2": "model-00001-of-00002.safetensors",
421
+ "vision_model.encoder.layers.0.mlp.fc1.bias": "model-00001-of-00002.safetensors",
422
+ "vision_model.encoder.layers.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
423
+ "vision_model.encoder.layers.0.mlp.fc2.bias": "model-00001-of-00002.safetensors",
424
+ "vision_model.encoder.layers.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
425
+ "vision_model.encoder.layers.0.norm1.bias": "model-00001-of-00002.safetensors",
426
+ "vision_model.encoder.layers.0.norm1.weight": "model-00001-of-00002.safetensors",
427
+ "vision_model.encoder.layers.0.norm2.bias": "model-00001-of-00002.safetensors",
428
+ "vision_model.encoder.layers.0.norm2.weight": "model-00001-of-00002.safetensors",
429
+ "vision_model.encoder.layers.1.attn.proj.bias": "model-00001-of-00002.safetensors",
430
+ "vision_model.encoder.layers.1.attn.proj.weight": "model-00001-of-00002.safetensors",
431
+ "vision_model.encoder.layers.1.attn.qkv.bias": "model-00001-of-00002.safetensors",
432
+ "vision_model.encoder.layers.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
433
+ "vision_model.encoder.layers.1.ls1": "model-00001-of-00002.safetensors",
434
+ "vision_model.encoder.layers.1.ls2": "model-00001-of-00002.safetensors",
435
+ "vision_model.encoder.layers.1.mlp.fc1.bias": "model-00001-of-00002.safetensors",
436
+ "vision_model.encoder.layers.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
437
+ "vision_model.encoder.layers.1.mlp.fc2.bias": "model-00001-of-00002.safetensors",
438
+ "vision_model.encoder.layers.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
439
+ "vision_model.encoder.layers.1.norm1.bias": "model-00001-of-00002.safetensors",
440
+ "vision_model.encoder.layers.1.norm1.weight": "model-00001-of-00002.safetensors",
441
+ "vision_model.encoder.layers.1.norm2.bias": "model-00001-of-00002.safetensors",
442
+ "vision_model.encoder.layers.1.norm2.weight": "model-00001-of-00002.safetensors",
443
+ "vision_model.encoder.layers.10.attn.proj.bias": "model-00001-of-00002.safetensors",
444
+ "vision_model.encoder.layers.10.attn.proj.weight": "model-00001-of-00002.safetensors",
445
+ "vision_model.encoder.layers.10.attn.qkv.bias": "model-00001-of-00002.safetensors",
446
+ "vision_model.encoder.layers.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
447
+ "vision_model.encoder.layers.10.ls1": "model-00001-of-00002.safetensors",
448
+ "vision_model.encoder.layers.10.ls2": "model-00001-of-00002.safetensors",
449
+ "vision_model.encoder.layers.10.mlp.fc1.bias": "model-00001-of-00002.safetensors",
450
+ "vision_model.encoder.layers.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
451
+ "vision_model.encoder.layers.10.mlp.fc2.bias": "model-00001-of-00002.safetensors",
452
+ "vision_model.encoder.layers.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
453
+ "vision_model.encoder.layers.10.norm1.bias": "model-00001-of-00002.safetensors",
454
+ "vision_model.encoder.layers.10.norm1.weight": "model-00001-of-00002.safetensors",
455
+ "vision_model.encoder.layers.10.norm2.bias": "model-00001-of-00002.safetensors",
456
+ "vision_model.encoder.layers.10.norm2.weight": "model-00001-of-00002.safetensors",
457
+ "vision_model.encoder.layers.11.attn.proj.bias": "model-00001-of-00002.safetensors",
458
+ "vision_model.encoder.layers.11.attn.proj.weight": "model-00001-of-00002.safetensors",
459
+ "vision_model.encoder.layers.11.attn.qkv.bias": "model-00001-of-00002.safetensors",
460
+ "vision_model.encoder.layers.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
461
+ "vision_model.encoder.layers.11.ls1": "model-00001-of-00002.safetensors",
462
+ "vision_model.encoder.layers.11.ls2": "model-00001-of-00002.safetensors",
463
+ "vision_model.encoder.layers.11.mlp.fc1.bias": "model-00001-of-00002.safetensors",
464
+ "vision_model.encoder.layers.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
465
+ "vision_model.encoder.layers.11.mlp.fc2.bias": "model-00001-of-00002.safetensors",
466
+ "vision_model.encoder.layers.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
467
+ "vision_model.encoder.layers.11.norm1.bias": "model-00001-of-00002.safetensors",
468
+ "vision_model.encoder.layers.11.norm1.weight": "model-00001-of-00002.safetensors",
469
+ "vision_model.encoder.layers.11.norm2.bias": "model-00001-of-00002.safetensors",
470
+ "vision_model.encoder.layers.11.norm2.weight": "model-00001-of-00002.safetensors",
471
+ "vision_model.encoder.layers.12.attn.proj.bias": "model-00001-of-00002.safetensors",
472
+ "vision_model.encoder.layers.12.attn.proj.weight": "model-00001-of-00002.safetensors",
473
+ "vision_model.encoder.layers.12.attn.qkv.bias": "model-00001-of-00002.safetensors",
474
+ "vision_model.encoder.layers.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
475
+ "vision_model.encoder.layers.12.ls1": "model-00001-of-00002.safetensors",
476
+ "vision_model.encoder.layers.12.ls2": "model-00001-of-00002.safetensors",
477
+ "vision_model.encoder.layers.12.mlp.fc1.bias": "model-00001-of-00002.safetensors",
478
+ "vision_model.encoder.layers.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
479
+ "vision_model.encoder.layers.12.mlp.fc2.bias": "model-00001-of-00002.safetensors",
480
+ "vision_model.encoder.layers.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
481
+ "vision_model.encoder.layers.12.norm1.bias": "model-00001-of-00002.safetensors",
482
+ "vision_model.encoder.layers.12.norm1.weight": "model-00001-of-00002.safetensors",
483
+ "vision_model.encoder.layers.12.norm2.bias": "model-00001-of-00002.safetensors",
484
+ "vision_model.encoder.layers.12.norm2.weight": "model-00001-of-00002.safetensors",
485
+ "vision_model.encoder.layers.13.attn.proj.bias": "model-00001-of-00002.safetensors",
486
+ "vision_model.encoder.layers.13.attn.proj.weight": "model-00001-of-00002.safetensors",
487
+ "vision_model.encoder.layers.13.attn.qkv.bias": "model-00001-of-00002.safetensors",
488
+ "vision_model.encoder.layers.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
489
+ "vision_model.encoder.layers.13.ls1": "model-00001-of-00002.safetensors",
490
+ "vision_model.encoder.layers.13.ls2": "model-00001-of-00002.safetensors",
491
+ "vision_model.encoder.layers.13.mlp.fc1.bias": "model-00001-of-00002.safetensors",
492
+ "vision_model.encoder.layers.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
493
+ "vision_model.encoder.layers.13.mlp.fc2.bias": "model-00001-of-00002.safetensors",
494
+ "vision_model.encoder.layers.13.mlp.fc2.weight": "model-00001-of-00002.safetensors",
495
+ "vision_model.encoder.layers.13.norm1.bias": "model-00001-of-00002.safetensors",
496
+ "vision_model.encoder.layers.13.norm1.weight": "model-00001-of-00002.safetensors",
497
+ "vision_model.encoder.layers.13.norm2.bias": "model-00001-of-00002.safetensors",
498
+ "vision_model.encoder.layers.13.norm2.weight": "model-00001-of-00002.safetensors",
499
+ "vision_model.encoder.layers.14.attn.proj.bias": "model-00001-of-00002.safetensors",
500
+ "vision_model.encoder.layers.14.attn.proj.weight": "model-00001-of-00002.safetensors",
501
+ "vision_model.encoder.layers.14.attn.qkv.bias": "model-00001-of-00002.safetensors",
502
+ "vision_model.encoder.layers.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
503
+ "vision_model.encoder.layers.14.ls1": "model-00001-of-00002.safetensors",
504
+ "vision_model.encoder.layers.14.ls2": "model-00001-of-00002.safetensors",
505
+ "vision_model.encoder.layers.14.mlp.fc1.bias": "model-00001-of-00002.safetensors",
506
+ "vision_model.encoder.layers.14.mlp.fc1.weight": "model-00001-of-00002.safetensors",
507
+ "vision_model.encoder.layers.14.mlp.fc2.bias": "model-00001-of-00002.safetensors",
508
+ "vision_model.encoder.layers.14.mlp.fc2.weight": "model-00001-of-00002.safetensors",
509
+ "vision_model.encoder.layers.14.norm1.bias": "model-00001-of-00002.safetensors",
510
+ "vision_model.encoder.layers.14.norm1.weight": "model-00001-of-00002.safetensors",
511
+ "vision_model.encoder.layers.14.norm2.bias": "model-00001-of-00002.safetensors",
512
+ "vision_model.encoder.layers.14.norm2.weight": "model-00001-of-00002.safetensors",
513
+ "vision_model.encoder.layers.15.attn.proj.bias": "model-00001-of-00002.safetensors",
514
+ "vision_model.encoder.layers.15.attn.proj.weight": "model-00001-of-00002.safetensors",
515
+ "vision_model.encoder.layers.15.attn.qkv.bias": "model-00001-of-00002.safetensors",
516
+ "vision_model.encoder.layers.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
517
+ "vision_model.encoder.layers.15.ls1": "model-00001-of-00002.safetensors",
518
+ "vision_model.encoder.layers.15.ls2": "model-00001-of-00002.safetensors",
519
+ "vision_model.encoder.layers.15.mlp.fc1.bias": "model-00001-of-00002.safetensors",
520
+ "vision_model.encoder.layers.15.mlp.fc1.weight": "model-00001-of-00002.safetensors",
521
+ "vision_model.encoder.layers.15.mlp.fc2.bias": "model-00001-of-00002.safetensors",
522
+ "vision_model.encoder.layers.15.mlp.fc2.weight": "model-00001-of-00002.safetensors",
523
+ "vision_model.encoder.layers.15.norm1.bias": "model-00001-of-00002.safetensors",
524
+ "vision_model.encoder.layers.15.norm1.weight": "model-00001-of-00002.safetensors",
525
+ "vision_model.encoder.layers.15.norm2.bias": "model-00001-of-00002.safetensors",
526
+ "vision_model.encoder.layers.15.norm2.weight": "model-00001-of-00002.safetensors",
527
+ "vision_model.encoder.layers.16.attn.proj.bias": "model-00001-of-00002.safetensors",
528
+ "vision_model.encoder.layers.16.attn.proj.weight": "model-00001-of-00002.safetensors",
529
+ "vision_model.encoder.layers.16.attn.qkv.bias": "model-00001-of-00002.safetensors",
530
+ "vision_model.encoder.layers.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
531
+ "vision_model.encoder.layers.16.ls1": "model-00001-of-00002.safetensors",
532
+ "vision_model.encoder.layers.16.ls2": "model-00001-of-00002.safetensors",
533
+ "vision_model.encoder.layers.16.mlp.fc1.bias": "model-00001-of-00002.safetensors",
534
+ "vision_model.encoder.layers.16.mlp.fc1.weight": "model-00001-of-00002.safetensors",
535
+ "vision_model.encoder.layers.16.mlp.fc2.bias": "model-00001-of-00002.safetensors",
536
+ "vision_model.encoder.layers.16.mlp.fc2.weight": "model-00001-of-00002.safetensors",
537
+ "vision_model.encoder.layers.16.norm1.bias": "model-00001-of-00002.safetensors",
538
+ "vision_model.encoder.layers.16.norm1.weight": "model-00001-of-00002.safetensors",
539
+ "vision_model.encoder.layers.16.norm2.bias": "model-00001-of-00002.safetensors",
540
+ "vision_model.encoder.layers.16.norm2.weight": "model-00001-of-00002.safetensors",
541
+ "vision_model.encoder.layers.17.attn.proj.bias": "model-00001-of-00002.safetensors",
542
+ "vision_model.encoder.layers.17.attn.proj.weight": "model-00001-of-00002.safetensors",
543
+ "vision_model.encoder.layers.17.attn.qkv.bias": "model-00001-of-00002.safetensors",
544
+ "vision_model.encoder.layers.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
545
+ "vision_model.encoder.layers.17.ls1": "model-00001-of-00002.safetensors",
546
+ "vision_model.encoder.layers.17.ls2": "model-00001-of-00002.safetensors",
547
+ "vision_model.encoder.layers.17.mlp.fc1.bias": "model-00001-of-00002.safetensors",
548
+ "vision_model.encoder.layers.17.mlp.fc1.weight": "model-00001-of-00002.safetensors",
549
+ "vision_model.encoder.layers.17.mlp.fc2.bias": "model-00001-of-00002.safetensors",
550
+ "vision_model.encoder.layers.17.mlp.fc2.weight": "model-00001-of-00002.safetensors",
551
+ "vision_model.encoder.layers.17.norm1.bias": "model-00001-of-00002.safetensors",
552
+ "vision_model.encoder.layers.17.norm1.weight": "model-00001-of-00002.safetensors",
553
+ "vision_model.encoder.layers.17.norm2.bias": "model-00001-of-00002.safetensors",
554
+ "vision_model.encoder.layers.17.norm2.weight": "model-00001-of-00002.safetensors",
555
+ "vision_model.encoder.layers.18.attn.proj.bias": "model-00001-of-00002.safetensors",
556
+ "vision_model.encoder.layers.18.attn.proj.weight": "model-00001-of-00002.safetensors",
557
+ "vision_model.encoder.layers.18.attn.qkv.bias": "model-00001-of-00002.safetensors",
558
+ "vision_model.encoder.layers.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
559
+ "vision_model.encoder.layers.18.ls1": "model-00001-of-00002.safetensors",
560
+ "vision_model.encoder.layers.18.ls2": "model-00001-of-00002.safetensors",
561
+ "vision_model.encoder.layers.18.mlp.fc1.bias": "model-00001-of-00002.safetensors",
562
+ "vision_model.encoder.layers.18.mlp.fc1.weight": "model-00001-of-00002.safetensors",
563
+ "vision_model.encoder.layers.18.mlp.fc2.bias": "model-00001-of-00002.safetensors",
564
+ "vision_model.encoder.layers.18.mlp.fc2.weight": "model-00001-of-00002.safetensors",
565
+ "vision_model.encoder.layers.18.norm1.bias": "model-00001-of-00002.safetensors",
566
+ "vision_model.encoder.layers.18.norm1.weight": "model-00001-of-00002.safetensors",
567
+ "vision_model.encoder.layers.18.norm2.bias": "model-00001-of-00002.safetensors",
568
+ "vision_model.encoder.layers.18.norm2.weight": "model-00001-of-00002.safetensors",
569
+ "vision_model.encoder.layers.19.attn.proj.bias": "model-00001-of-00002.safetensors",
570
+ "vision_model.encoder.layers.19.attn.proj.weight": "model-00001-of-00002.safetensors",
571
+ "vision_model.encoder.layers.19.attn.qkv.bias": "model-00001-of-00002.safetensors",
572
+ "vision_model.encoder.layers.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
573
+ "vision_model.encoder.layers.19.ls1": "model-00001-of-00002.safetensors",
574
+ "vision_model.encoder.layers.19.ls2": "model-00001-of-00002.safetensors",
575
+ "vision_model.encoder.layers.19.mlp.fc1.bias": "model-00001-of-00002.safetensors",
576
+ "vision_model.encoder.layers.19.mlp.fc1.weight": "model-00001-of-00002.safetensors",
577
+ "vision_model.encoder.layers.19.mlp.fc2.bias": "model-00001-of-00002.safetensors",
578
+ "vision_model.encoder.layers.19.mlp.fc2.weight": "model-00001-of-00002.safetensors",
579
+ "vision_model.encoder.layers.19.norm1.bias": "model-00001-of-00002.safetensors",
580
+ "vision_model.encoder.layers.19.norm1.weight": "model-00001-of-00002.safetensors",
581
+ "vision_model.encoder.layers.19.norm2.bias": "model-00001-of-00002.safetensors",
582
+ "vision_model.encoder.layers.19.norm2.weight": "model-00001-of-00002.safetensors",
583
+ "vision_model.encoder.layers.2.attn.proj.bias": "model-00001-of-00002.safetensors",
584
+ "vision_model.encoder.layers.2.attn.proj.weight": "model-00001-of-00002.safetensors",
585
+ "vision_model.encoder.layers.2.attn.qkv.bias": "model-00001-of-00002.safetensors",
586
+ "vision_model.encoder.layers.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
587
+ "vision_model.encoder.layers.2.ls1": "model-00001-of-00002.safetensors",
588
+ "vision_model.encoder.layers.2.ls2": "model-00001-of-00002.safetensors",
589
+ "vision_model.encoder.layers.2.mlp.fc1.bias": "model-00001-of-00002.safetensors",
590
+ "vision_model.encoder.layers.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
591
+ "vision_model.encoder.layers.2.mlp.fc2.bias": "model-00001-of-00002.safetensors",
592
+ "vision_model.encoder.layers.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
593
+ "vision_model.encoder.layers.2.norm1.bias": "model-00001-of-00002.safetensors",
594
+ "vision_model.encoder.layers.2.norm1.weight": "model-00001-of-00002.safetensors",
595
+ "vision_model.encoder.layers.2.norm2.bias": "model-00001-of-00002.safetensors",
596
+ "vision_model.encoder.layers.2.norm2.weight": "model-00001-of-00002.safetensors",
597
+ "vision_model.encoder.layers.20.attn.proj.bias": "model-00001-of-00002.safetensors",
598
+ "vision_model.encoder.layers.20.attn.proj.weight": "model-00001-of-00002.safetensors",
599
+ "vision_model.encoder.layers.20.attn.qkv.bias": "model-00001-of-00002.safetensors",
600
+ "vision_model.encoder.layers.20.attn.qkv.weight": "model-00001-of-00002.safetensors",
601
+ "vision_model.encoder.layers.20.ls1": "model-00001-of-00002.safetensors",
602
+ "vision_model.encoder.layers.20.ls2": "model-00001-of-00002.safetensors",
603
+ "vision_model.encoder.layers.20.mlp.fc1.bias": "model-00001-of-00002.safetensors",
604
+ "vision_model.encoder.layers.20.mlp.fc1.weight": "model-00001-of-00002.safetensors",
605
+ "vision_model.encoder.layers.20.mlp.fc2.bias": "model-00001-of-00002.safetensors",
606
+ "vision_model.encoder.layers.20.mlp.fc2.weight": "model-00001-of-00002.safetensors",
607
+ "vision_model.encoder.layers.20.norm1.bias": "model-00001-of-00002.safetensors",
608
+ "vision_model.encoder.layers.20.norm1.weight": "model-00001-of-00002.safetensors",
609
+ "vision_model.encoder.layers.20.norm2.bias": "model-00001-of-00002.safetensors",
610
+ "vision_model.encoder.layers.20.norm2.weight": "model-00001-of-00002.safetensors",
611
+ "vision_model.encoder.layers.21.attn.proj.bias": "model-00001-of-00002.safetensors",
612
+ "vision_model.encoder.layers.21.attn.proj.weight": "model-00001-of-00002.safetensors",
613
+ "vision_model.encoder.layers.21.attn.qkv.bias": "model-00001-of-00002.safetensors",
614
+ "vision_model.encoder.layers.21.attn.qkv.weight": "model-00001-of-00002.safetensors",
615
+ "vision_model.encoder.layers.21.ls1": "model-00001-of-00002.safetensors",
616
+ "vision_model.encoder.layers.21.ls2": "model-00001-of-00002.safetensors",
617
+ "vision_model.encoder.layers.21.mlp.fc1.bias": "model-00001-of-00002.safetensors",
618
+ "vision_model.encoder.layers.21.mlp.fc1.weight": "model-00001-of-00002.safetensors",
619
+ "vision_model.encoder.layers.21.mlp.fc2.bias": "model-00001-of-00002.safetensors",
620
+ "vision_model.encoder.layers.21.mlp.fc2.weight": "model-00001-of-00002.safetensors",
621
+ "vision_model.encoder.layers.21.norm1.bias": "model-00001-of-00002.safetensors",
622
+ "vision_model.encoder.layers.21.norm1.weight": "model-00001-of-00002.safetensors",
623
+ "vision_model.encoder.layers.21.norm2.bias": "model-00001-of-00002.safetensors",
624
+ "vision_model.encoder.layers.21.norm2.weight": "model-00001-of-00002.safetensors",
625
+ "vision_model.encoder.layers.22.attn.proj.bias": "model-00001-of-00002.safetensors",
626
+ "vision_model.encoder.layers.22.attn.proj.weight": "model-00001-of-00002.safetensors",
627
+ "vision_model.encoder.layers.22.attn.qkv.bias": "model-00001-of-00002.safetensors",
628
+ "vision_model.encoder.layers.22.attn.qkv.weight": "model-00001-of-00002.safetensors",
629
+ "vision_model.encoder.layers.22.ls1": "model-00001-of-00002.safetensors",
630
+ "vision_model.encoder.layers.22.ls2": "model-00001-of-00002.safetensors",
631
+ "vision_model.encoder.layers.22.mlp.fc1.bias": "model-00001-of-00002.safetensors",
632
+ "vision_model.encoder.layers.22.mlp.fc1.weight": "model-00001-of-00002.safetensors",
633
+ "vision_model.encoder.layers.22.mlp.fc2.bias": "model-00001-of-00002.safetensors",
634
+ "vision_model.encoder.layers.22.mlp.fc2.weight": "model-00001-of-00002.safetensors",
635
+ "vision_model.encoder.layers.22.norm1.bias": "model-00001-of-00002.safetensors",
636
+ "vision_model.encoder.layers.22.norm1.weight": "model-00001-of-00002.safetensors",
637
+ "vision_model.encoder.layers.22.norm2.bias": "model-00001-of-00002.safetensors",
638
+ "vision_model.encoder.layers.22.norm2.weight": "model-00001-of-00002.safetensors",
639
+ "vision_model.encoder.layers.23.attn.proj.bias": "model-00001-of-00002.safetensors",
640
+ "vision_model.encoder.layers.23.attn.proj.weight": "model-00001-of-00002.safetensors",
641
+ "vision_model.encoder.layers.23.attn.qkv.bias": "model-00001-of-00002.safetensors",
642
+ "vision_model.encoder.layers.23.attn.qkv.weight": "model-00001-of-00002.safetensors",
643
+ "vision_model.encoder.layers.23.ls1": "model-00001-of-00002.safetensors",
644
+ "vision_model.encoder.layers.23.ls2": "model-00001-of-00002.safetensors",
645
+ "vision_model.encoder.layers.23.mlp.fc1.bias": "model-00001-of-00002.safetensors",
646
+ "vision_model.encoder.layers.23.mlp.fc1.weight": "model-00001-of-00002.safetensors",
647
+ "vision_model.encoder.layers.23.mlp.fc2.bias": "model-00001-of-00002.safetensors",
648
+ "vision_model.encoder.layers.23.mlp.fc2.weight": "model-00001-of-00002.safetensors",
649
+ "vision_model.encoder.layers.23.norm1.bias": "model-00001-of-00002.safetensors",
650
+ "vision_model.encoder.layers.23.norm1.weight": "model-00001-of-00002.safetensors",
651
+ "vision_model.encoder.layers.23.norm2.bias": "model-00001-of-00002.safetensors",
652
+ "vision_model.encoder.layers.23.norm2.weight": "model-00001-of-00002.safetensors",
653
+ "vision_model.encoder.layers.3.attn.proj.bias": "model-00001-of-00002.safetensors",
654
+ "vision_model.encoder.layers.3.attn.proj.weight": "model-00001-of-00002.safetensors",
655
+ "vision_model.encoder.layers.3.attn.qkv.bias": "model-00001-of-00002.safetensors",
656
+ "vision_model.encoder.layers.3.attn.qkv.weight": "model-00001-of-00002.safetensors",
657
+ "vision_model.encoder.layers.3.ls1": "model-00001-of-00002.safetensors",
658
+ "vision_model.encoder.layers.3.ls2": "model-00001-of-00002.safetensors",
659
+ "vision_model.encoder.layers.3.mlp.fc1.bias": "model-00001-of-00002.safetensors",
660
+ "vision_model.encoder.layers.3.mlp.fc1.weight": "model-00001-of-00002.safetensors",
661
+ "vision_model.encoder.layers.3.mlp.fc2.bias": "model-00001-of-00002.safetensors",
662
+ "vision_model.encoder.layers.3.mlp.fc2.weight": "model-00001-of-00002.safetensors",
663
+ "vision_model.encoder.layers.3.norm1.bias": "model-00001-of-00002.safetensors",
664
+ "vision_model.encoder.layers.3.norm1.weight": "model-00001-of-00002.safetensors",
665
+ "vision_model.encoder.layers.3.norm2.bias": "model-00001-of-00002.safetensors",
666
+ "vision_model.encoder.layers.3.norm2.weight": "model-00001-of-00002.safetensors",
667
+ "vision_model.encoder.layers.4.attn.proj.bias": "model-00001-of-00002.safetensors",
668
+ "vision_model.encoder.layers.4.attn.proj.weight": "model-00001-of-00002.safetensors",
669
+ "vision_model.encoder.layers.4.attn.qkv.bias": "model-00001-of-00002.safetensors",
670
+ "vision_model.encoder.layers.4.attn.qkv.weight": "model-00001-of-00002.safetensors",
671
+ "vision_model.encoder.layers.4.ls1": "model-00001-of-00002.safetensors",
672
+ "vision_model.encoder.layers.4.ls2": "model-00001-of-00002.safetensors",
673
+ "vision_model.encoder.layers.4.mlp.fc1.bias": "model-00001-of-00002.safetensors",
674
+ "vision_model.encoder.layers.4.mlp.fc1.weight": "model-00001-of-00002.safetensors",
675
+ "vision_model.encoder.layers.4.mlp.fc2.bias": "model-00001-of-00002.safetensors",
676
+ "vision_model.encoder.layers.4.mlp.fc2.weight": "model-00001-of-00002.safetensors",
677
+ "vision_model.encoder.layers.4.norm1.bias": "model-00001-of-00002.safetensors",
678
+ "vision_model.encoder.layers.4.norm1.weight": "model-00001-of-00002.safetensors",
679
+ "vision_model.encoder.layers.4.norm2.bias": "model-00001-of-00002.safetensors",
680
+ "vision_model.encoder.layers.4.norm2.weight": "model-00001-of-00002.safetensors",
681
+ "vision_model.encoder.layers.5.attn.proj.bias": "model-00001-of-00002.safetensors",
682
+ "vision_model.encoder.layers.5.attn.proj.weight": "model-00001-of-00002.safetensors",
683
+ "vision_model.encoder.layers.5.attn.qkv.bias": "model-00001-of-00002.safetensors",
684
+ "vision_model.encoder.layers.5.attn.qkv.weight": "model-00001-of-00002.safetensors",
685
+ "vision_model.encoder.layers.5.ls1": "model-00001-of-00002.safetensors",
686
+ "vision_model.encoder.layers.5.ls2": "model-00001-of-00002.safetensors",
687
+ "vision_model.encoder.layers.5.mlp.fc1.bias": "model-00001-of-00002.safetensors",
688
+ "vision_model.encoder.layers.5.mlp.fc1.weight": "model-00001-of-00002.safetensors",
689
+ "vision_model.encoder.layers.5.mlp.fc2.bias": "model-00001-of-00002.safetensors",
690
+ "vision_model.encoder.layers.5.mlp.fc2.weight": "model-00001-of-00002.safetensors",
691
+ "vision_model.encoder.layers.5.norm1.bias": "model-00001-of-00002.safetensors",
692
+ "vision_model.encoder.layers.5.norm1.weight": "model-00001-of-00002.safetensors",
693
+ "vision_model.encoder.layers.5.norm2.bias": "model-00001-of-00002.safetensors",
694
+ "vision_model.encoder.layers.5.norm2.weight": "model-00001-of-00002.safetensors",
695
+ "vision_model.encoder.layers.6.attn.proj.bias": "model-00001-of-00002.safetensors",
696
+ "vision_model.encoder.layers.6.attn.proj.weight": "model-00001-of-00002.safetensors",
697
+ "vision_model.encoder.layers.6.attn.qkv.bias": "model-00001-of-00002.safetensors",
698
+ "vision_model.encoder.layers.6.attn.qkv.weight": "model-00001-of-00002.safetensors",
699
+ "vision_model.encoder.layers.6.ls1": "model-00001-of-00002.safetensors",
700
+ "vision_model.encoder.layers.6.ls2": "model-00001-of-00002.safetensors",
701
+ "vision_model.encoder.layers.6.mlp.fc1.bias": "model-00001-of-00002.safetensors",
702
+ "vision_model.encoder.layers.6.mlp.fc1.weight": "model-00001-of-00002.safetensors",
703
+ "vision_model.encoder.layers.6.mlp.fc2.bias": "model-00001-of-00002.safetensors",
704
+ "vision_model.encoder.layers.6.mlp.fc2.weight": "model-00001-of-00002.safetensors",
705
+ "vision_model.encoder.layers.6.norm1.bias": "model-00001-of-00002.safetensors",
706
+ "vision_model.encoder.layers.6.norm1.weight": "model-00001-of-00002.safetensors",
707
+ "vision_model.encoder.layers.6.norm2.bias": "model-00001-of-00002.safetensors",
708
+ "vision_model.encoder.layers.6.norm2.weight": "model-00001-of-00002.safetensors",
709
+ "vision_model.encoder.layers.7.attn.proj.bias": "model-00001-of-00002.safetensors",
710
+ "vision_model.encoder.layers.7.attn.proj.weight": "model-00001-of-00002.safetensors",
711
+ "vision_model.encoder.layers.7.attn.qkv.bias": "model-00001-of-00002.safetensors",
712
+ "vision_model.encoder.layers.7.attn.qkv.weight": "model-00001-of-00002.safetensors",
713
+ "vision_model.encoder.layers.7.ls1": "model-00001-of-00002.safetensors",
714
+ "vision_model.encoder.layers.7.ls2": "model-00001-of-00002.safetensors",
715
+ "vision_model.encoder.layers.7.mlp.fc1.bias": "model-00001-of-00002.safetensors",
716
+ "vision_model.encoder.layers.7.mlp.fc1.weight": "model-00001-of-00002.safetensors",
717
+ "vision_model.encoder.layers.7.mlp.fc2.bias": "model-00001-of-00002.safetensors",
718
+ "vision_model.encoder.layers.7.mlp.fc2.weight": "model-00001-of-00002.safetensors",
719
+ "vision_model.encoder.layers.7.norm1.bias": "model-00001-of-00002.safetensors",
720
+ "vision_model.encoder.layers.7.norm1.weight": "model-00001-of-00002.safetensors",
721
+ "vision_model.encoder.layers.7.norm2.bias": "model-00001-of-00002.safetensors",
722
+ "vision_model.encoder.layers.7.norm2.weight": "model-00001-of-00002.safetensors",
723
+ "vision_model.encoder.layers.8.attn.proj.bias": "model-00001-of-00002.safetensors",
724
+ "vision_model.encoder.layers.8.attn.proj.weight": "model-00001-of-00002.safetensors",
725
+ "vision_model.encoder.layers.8.attn.qkv.bias": "model-00001-of-00002.safetensors",
726
+ "vision_model.encoder.layers.8.attn.qkv.weight": "model-00001-of-00002.safetensors",
727
+ "vision_model.encoder.layers.8.ls1": "model-00001-of-00002.safetensors",
728
+ "vision_model.encoder.layers.8.ls2": "model-00001-of-00002.safetensors",
729
+ "vision_model.encoder.layers.8.mlp.fc1.bias": "model-00001-of-00002.safetensors",
730
+ "vision_model.encoder.layers.8.mlp.fc1.weight": "model-00001-of-00002.safetensors",
731
+ "vision_model.encoder.layers.8.mlp.fc2.bias": "model-00001-of-00002.safetensors",
732
+ "vision_model.encoder.layers.8.mlp.fc2.weight": "model-00001-of-00002.safetensors",
733
+ "vision_model.encoder.layers.8.norm1.bias": "model-00001-of-00002.safetensors",
734
+ "vision_model.encoder.layers.8.norm1.weight": "model-00001-of-00002.safetensors",
735
+ "vision_model.encoder.layers.8.norm2.bias": "model-00001-of-00002.safetensors",
736
+ "vision_model.encoder.layers.8.norm2.weight": "model-00001-of-00002.safetensors",
737
+ "vision_model.encoder.layers.9.attn.proj.bias": "model-00001-of-00002.safetensors",
738
+ "vision_model.encoder.layers.9.attn.proj.weight": "model-00001-of-00002.safetensors",
739
+ "vision_model.encoder.layers.9.attn.qkv.bias": "model-00001-of-00002.safetensors",
740
+ "vision_model.encoder.layers.9.attn.qkv.weight": "model-00001-of-00002.safetensors",
741
+ "vision_model.encoder.layers.9.ls1": "model-00001-of-00002.safetensors",
742
+ "vision_model.encoder.layers.9.ls2": "model-00001-of-00002.safetensors",
743
+ "vision_model.encoder.layers.9.mlp.fc1.bias": "model-00001-of-00002.safetensors",
744
+ "vision_model.encoder.layers.9.mlp.fc1.weight": "model-00001-of-00002.safetensors",
745
+ "vision_model.encoder.layers.9.mlp.fc2.bias": "model-00001-of-00002.safetensors",
746
+ "vision_model.encoder.layers.9.mlp.fc2.weight": "model-00001-of-00002.safetensors",
747
+ "vision_model.encoder.layers.9.norm1.bias": "model-00001-of-00002.safetensors",
748
+ "vision_model.encoder.layers.9.norm1.weight": "model-00001-of-00002.safetensors",
749
+ "vision_model.encoder.layers.9.norm2.bias": "model-00001-of-00002.safetensors",
750
+ "vision_model.encoder.layers.9.norm2.weight": "model-00001-of-00002.safetensors"
751
+ }
752
+ }
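The weight map above assigns the vision-encoder parameters to the first of the two safetensors shards. Below is a small, hedged sanity-check sketch (not part of the upload): it only assumes the standard `weight_map` layout of a safetensors index and the shard file names visible above, and is meant to be run from a local copy of the repository.

```python
# Hedged sketch: confirm every entry in the index points at a shard that exists locally.
import json
from pathlib import Path

index = json.loads(Path("model.safetensors.index.json").read_text())
shards = sorted(set(index["weight_map"].values()))
print(shards)  # expected: ['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors']

missing = [s for s in shards if not Path(s).exists()]
print("missing shards:", missing or "none")
```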
modeling_intern_vit.py ADDED
@@ -0,0 +1,433 @@
1
+ # --------------------------------------------------------
2
+ # InternVL
3
+ # Copyright (c) 2024 OpenGVLab
4
+ # Licensed under The MIT License [see LICENSE for details]
5
+ # --------------------------------------------------------
6
+
7
+ from typing import Optional, Tuple, Union
8
+
9
+ import torch
10
+ import torch.nn.functional as F
11
+ import torch.utils.checkpoint
12
+ from einops import rearrange
13
+ from timm.layers import DropPath
14
+ from torch import nn
15
+ from transformers.activations import ACT2FN
16
+ from transformers.modeling_outputs import (BaseModelOutput,
17
+ BaseModelOutputWithPooling)
18
+ from transformers.modeling_utils import PreTrainedModel
19
+ from transformers.utils import logging
20
+
21
+ from .configuration_intern_vit import InternVisionConfig
22
+
23
+ try:
24
+ from flash_attn.bert_padding import pad_input, unpad_input
25
+ from flash_attn.flash_attn_interface import \
26
+ flash_attn_varlen_qkvpacked_func
27
+ has_flash_attn = True
28
+ except:
29
+ print('FlashAttention2 is not installed.')
30
+ has_flash_attn = False
31
+
32
+ logger = logging.get_logger(__name__)
33
+
34
+
35
+ class FlashAttention(nn.Module):
36
+ """Implement the scaled dot product attention with softmax.
37
+ Arguments
38
+ ---------
39
+ softmax_scale: The temperature to use for the softmax attention.
40
+ (default: 1/sqrt(d_keys) where d_keys is computed at
41
+ runtime)
42
+ attention_dropout: The dropout rate to apply to the attention
43
+ (default: 0.0)
44
+ """
45
+
46
+ def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
47
+ super().__init__()
48
+ self.softmax_scale = softmax_scale
49
+ self.dropout_p = attention_dropout
50
+
51
+ def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
52
+ max_s=None, need_weights=False):
53
+ """Implements the multihead softmax attention.
54
+ Arguments
55
+ ---------
56
+ qkv: The tensor containing the query, key, and value. (B, S, 3, H, D) if key_padding_mask is None
57
+ if unpadded: (nnz, 3, h, d)
58
+ key_padding_mask: a bool tensor of shape (B, S)
59
+ """
60
+ assert not need_weights
61
+ assert qkv.dtype in [torch.float16, torch.bfloat16]
62
+ assert qkv.is_cuda
63
+
64
+ if cu_seqlens is None:
65
+ batch_size = qkv.shape[0]
66
+ seqlen = qkv.shape[1]
67
+ if key_padding_mask is None:
68
+ qkv = rearrange(qkv, 'b s ... -> (b s) ...')
69
+ max_s = seqlen
70
+ cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
71
+ device=qkv.device)
72
+ output = flash_attn_varlen_qkvpacked_func(
73
+ qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
74
+ softmax_scale=self.softmax_scale, causal=causal
75
+ )
76
+ output = rearrange(output, '(b s) ... -> b s ...', b=batch_size)
77
+ else:
78
+ nheads = qkv.shape[-2]
79
+ x = rearrange(qkv, 'b s three h d -> b s (three h d)')
80
+ x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
81
+ x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
82
+ output_unpad = flash_attn_varlen_qkvpacked_func(
83
+ x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
84
+ softmax_scale=self.softmax_scale, causal=causal
85
+ )
86
+ output = rearrange(pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'),
87
+ indices, batch_size, seqlen),
88
+ 'b s (h d) -> b s h d', h=nheads)
89
+ else:
90
+ assert max_s is not None
91
+ output = flash_attn_varlen_qkvpacked_func(
92
+ qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
93
+ softmax_scale=self.softmax_scale, causal=causal
94
+ )
95
+
96
+ return output, None
97
+
98
+
99
+ class InternRMSNorm(nn.Module):
100
+ def __init__(self, hidden_size, eps=1e-6):
101
+ super().__init__()
102
+ self.weight = nn.Parameter(torch.ones(hidden_size))
103
+ self.variance_epsilon = eps
104
+
105
+ def forward(self, hidden_states):
106
+ input_dtype = hidden_states.dtype
107
+ hidden_states = hidden_states.to(torch.float32)
108
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
109
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
110
+ return self.weight * hidden_states.to(input_dtype)
111
+
112
+
113
+ try:
114
+ from apex.normalization import FusedRMSNorm
115
+
116
+ InternRMSNorm = FusedRMSNorm # noqa
117
+
118
+ logger.info('Discovered apex.normalization.FusedRMSNorm - will use it instead of InternRMSNorm')
119
+ except ImportError:
120
+ # using the normal InternRMSNorm
121
+ pass
122
+ except Exception:
123
+ logger.warning('discovered apex but it failed to load, falling back to InternRMSNorm')
124
+ pass
125
+
126
+
127
+ NORM2FN = {
128
+ 'rms_norm': InternRMSNorm,
129
+ 'layer_norm': nn.LayerNorm,
130
+ }
131
+
132
+
133
+ class InternVisionEmbeddings(nn.Module):
134
+ def __init__(self, config: InternVisionConfig):
135
+ super().__init__()
136
+ self.config = config
137
+ self.embed_dim = config.hidden_size
138
+ self.image_size = config.image_size
139
+ self.patch_size = config.patch_size
140
+
141
+ self.class_embedding = nn.Parameter(
142
+ torch.randn(1, 1, self.embed_dim),
143
+ )
144
+
145
+ self.patch_embedding = nn.Conv2d(
146
+ in_channels=3, out_channels=self.embed_dim, kernel_size=self.patch_size, stride=self.patch_size
147
+ )
148
+
149
+ self.num_patches = (self.image_size // self.patch_size) ** 2
150
+ self.num_positions = self.num_patches + 1
151
+
152
+ self.position_embedding = nn.Parameter(torch.randn(1, self.num_positions, self.embed_dim))
153
+
154
+ def _get_pos_embed(self, pos_embed, H, W):
155
+ target_dtype = pos_embed.dtype
156
+ pos_embed = pos_embed.float().reshape(
157
+ 1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
158
+ pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
159
+ reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
160
+ return pos_embed
161
+
162
+ def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
163
+ target_dtype = self.patch_embedding.weight.dtype
164
+ patch_embeds = self.patch_embedding(pixel_values) # shape = [*, channel, width, height]
165
+ batch_size, _, height, width = patch_embeds.shape
166
+ patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
167
+ class_embeds = self.class_embedding.expand(batch_size, 1, -1).to(target_dtype)
168
+ embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
169
+ position_embedding = torch.cat([
170
+ self.position_embedding[:, :1, :],
171
+ self._get_pos_embed(self.position_embedding[:, 1:, :], height, width)
172
+ ], dim=1)
173
+ embeddings = embeddings + position_embedding.to(target_dtype)
174
+ return embeddings
175
+
176
+
177
+ class InternAttention(nn.Module):
178
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
179
+
180
+ def __init__(self, config: InternVisionConfig):
181
+ super().__init__()
182
+ self.config = config
183
+ self.embed_dim = config.hidden_size
184
+ self.num_heads = config.num_attention_heads
185
+ self.use_flash_attn = config.use_flash_attn and has_flash_attn
186
+ if config.use_flash_attn and not has_flash_attn:
187
+ print('Warning: Flash Attention is not available, use_flash_attn is set to False.')
188
+ self.head_dim = self.embed_dim // self.num_heads
189
+ if self.head_dim * self.num_heads != self.embed_dim:
190
+ raise ValueError(
191
+ f'embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:'
192
+ f' {self.num_heads}).'
193
+ )
194
+
195
+ self.scale = self.head_dim ** -0.5
196
+ self.qkv = nn.Linear(self.embed_dim, 3 * self.embed_dim, bias=config.qkv_bias)
197
+ self.attn_drop = nn.Dropout(config.attention_dropout)
198
+ self.proj_drop = nn.Dropout(config.dropout)
199
+
200
+ self.qk_normalization = config.qk_normalization
201
+
202
+ if self.qk_normalization:
203
+ self.q_norm = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
204
+ self.k_norm = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
205
+
206
+ if self.use_flash_attn:
207
+ self.inner_attn = FlashAttention(attention_dropout=config.attention_dropout)
208
+ self.proj = nn.Linear(self.embed_dim, self.embed_dim)
209
+
210
+ def _naive_attn(self, x):
211
+ B, N, C = x.shape
212
+ qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
213
+ q, k, v = qkv.unbind(0) # make torchscript happy (cannot use tensor as tuple)
214
+
215
+ if self.qk_normalization:
216
+ B_, H_, N_, D_ = q.shape
217
+ q = self.q_norm(q.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
218
+ k = self.k_norm(k.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
219
+
220
+ attn = ((q * self.scale) @ k.transpose(-2, -1))
221
+ attn = attn.softmax(dim=-1)
222
+ attn = self.attn_drop(attn)
223
+
224
+ x = (attn @ v).transpose(1, 2).reshape(B, N, C)
225
+ x = self.proj(x)
226
+ x = self.proj_drop(x)
227
+ return x
228
+
229
+ def _flash_attn(self, x, key_padding_mask=None, need_weights=False):
230
+ qkv = self.qkv(x)
231
+ qkv = rearrange(qkv, 'b s (three h d) -> b s three h d', three=3, h=self.num_heads)
232
+
233
+ if self.qk_normalization:
234
+ q, k, v = qkv.unbind(2)
235
+ q = self.q_norm(q.flatten(-2, -1)).view(q.shape)
236
+ k = self.k_norm(k.flatten(-2, -1)).view(k.shape)
237
+ qkv = torch.stack([q, k, v], dim=2)
238
+
239
+ context, _ = self.inner_attn(
240
+ qkv, key_padding_mask=key_padding_mask, need_weights=need_weights, causal=False
241
+ )
242
+ outs = self.proj(rearrange(context, 'b s h d -> b s (h d)'))
243
+ outs = self.proj_drop(outs)
244
+ return outs
245
+
246
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
247
+ x = self._naive_attn(hidden_states) if not self.use_flash_attn else self._flash_attn(hidden_states)
248
+ return x
249
+
250
+
251
+ class InternMLP(nn.Module):
252
+ def __init__(self, config: InternVisionConfig):
253
+ super().__init__()
254
+ self.config = config
255
+ self.act = ACT2FN[config.hidden_act]
256
+ self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
257
+ self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
258
+
259
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
260
+ hidden_states = self.fc1(hidden_states)
261
+ hidden_states = self.act(hidden_states)
262
+ hidden_states = self.fc2(hidden_states)
263
+ return hidden_states
264
+
265
+
266
+ class InternVisionEncoderLayer(nn.Module):
267
+ def __init__(self, config: InternVisionConfig, drop_path_rate: float):
268
+ super().__init__()
269
+ self.embed_dim = config.hidden_size
270
+ self.intermediate_size = config.intermediate_size
271
+ self.norm_type = config.norm_type
272
+
273
+ self.attn = InternAttention(config)
274
+ self.mlp = InternMLP(config)
275
+ self.norm1 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
276
+ self.norm2 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
277
+
278
+ self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
279
+ self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
280
+ self.drop_path1 = DropPath(drop_path_rate) if drop_path_rate > 0. else nn.Identity()
281
+ self.drop_path2 = DropPath(drop_path_rate) if drop_path_rate > 0. else nn.Identity()
282
+
283
+ def forward(
284
+ self,
285
+ hidden_states: torch.Tensor,
286
+ ) -> Tuple[torch.FloatTensor, Optional[torch.FloatTensor], Optional[Tuple[torch.FloatTensor]]]:
287
+ """
288
+ Args:
289
+ hidden_states (`Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]`): input to the layer of shape `(batch, seq_len, embed_dim)`
290
+ """
291
+ hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states).to(hidden_states.dtype)) * self.ls1)
292
+
293
+ hidden_states = hidden_states + self.drop_path2(self.mlp(self.norm2(hidden_states).to(hidden_states.dtype)) * self.ls2)
294
+
295
+ return hidden_states
296
+
297
+
298
+ class InternVisionEncoder(nn.Module):
299
+ """
300
+ Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
301
+ [`InternEncoderLayer`].
302
+
303
+ Args:
304
+ config (`InternConfig`):
305
+ The corresponding vision configuration for the `InternEncoder`.
306
+ """
307
+
308
+ def __init__(self, config: InternVisionConfig):
309
+ super().__init__()
310
+ self.config = config
311
+ # stochastic depth decay rule
312
+ dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]
313
+ self.layers = nn.ModuleList([
314
+ InternVisionEncoderLayer(config, dpr[idx]) for idx in range(config.num_hidden_layers)])
315
+ self.gradient_checkpointing = True
316
+
317
+ def forward(
318
+ self,
319
+ inputs_embeds,
320
+ output_hidden_states: Optional[bool] = None,
321
+ return_dict: Optional[bool] = None,
322
+ ) -> Union[Tuple, BaseModelOutput]:
323
+ r"""
324
+ Args:
325
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
326
+ Embedded representation of the inputs. Should be float, not int tokens.
327
+ output_hidden_states (`bool`, *optional*):
328
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
329
+ for more detail.
330
+ return_dict (`bool`, *optional*):
331
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
332
+ """
333
+ output_hidden_states = (
334
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
335
+ )
336
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
337
+
338
+ encoder_states = () if output_hidden_states else None
339
+ hidden_states = inputs_embeds
340
+
341
+ for idx, encoder_layer in enumerate(self.layers):
342
+ if output_hidden_states:
343
+ encoder_states = encoder_states + (hidden_states,)
344
+ if self.gradient_checkpointing and self.training:
345
+ layer_outputs = torch.utils.checkpoint.checkpoint(
346
+ encoder_layer,
347
+ hidden_states)
348
+ else:
349
+ layer_outputs = encoder_layer(
350
+ hidden_states,
351
+ )
352
+ hidden_states = layer_outputs
353
+
354
+ if output_hidden_states:
355
+ encoder_states = encoder_states + (hidden_states,)
356
+
357
+ if not return_dict:
358
+ return tuple(v for v in [hidden_states, encoder_states] if v is not None)
359
+ return BaseModelOutput(
360
+ last_hidden_state=hidden_states, hidden_states=encoder_states
361
+ )
362
+
363
+
364
+ class InternVisionModel(PreTrainedModel):
365
+ main_input_name = 'pixel_values'
366
+ _supports_flash_attn_2 = True
367
+ supports_gradient_checkpointing = True
368
+ config_class = InternVisionConfig
369
+ _no_split_modules = ['InternVisionEncoderLayer']
370
+ # support transformers 4.51.+
371
+ _tp_plan = ''
372
+
373
+ def __init__(self, config: InternVisionConfig):
374
+ super().__init__(config)
375
+ self.config = config
376
+
377
+ self.embeddings = InternVisionEmbeddings(config)
378
+ self.encoder = InternVisionEncoder(config)
379
+
380
+ def resize_pos_embeddings(self, old_size, new_size, patch_size):
381
+ pos_emb = self.embeddings.position_embedding
382
+ _, num_positions, embed_dim = pos_emb.shape
383
+ cls_emb = pos_emb[:, :1, :]
384
+ pos_emb = pos_emb[:, 1:, :].reshape(1, old_size // patch_size, old_size // patch_size, -1).permute(0, 3, 1, 2)
385
+ pos_emb = F.interpolate(pos_emb.float(), size=new_size // patch_size, mode='bicubic', align_corners=False)
386
+ pos_emb = pos_emb.to(cls_emb.dtype).reshape(1, embed_dim, -1).permute(0, 2, 1)
387
+ pos_emb = torch.cat([cls_emb, pos_emb], dim=1)
388
+ self.embeddings.position_embedding = nn.Parameter(pos_emb)
389
+ self.embeddings.image_size = new_size
390
+ logger.info('Resized position embeddings from {} to {}'.format(old_size, new_size))
391
+
392
+ def get_input_embeddings(self):
393
+ return self.embeddings
394
+
395
+ def forward(
396
+ self,
397
+ pixel_values: Optional[torch.FloatTensor] = None,
398
+ output_hidden_states: Optional[bool] = None,
399
+ return_dict: Optional[bool] = None,
400
+ pixel_embeds: Optional[torch.FloatTensor] = None,
401
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
402
+ output_hidden_states = (
403
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
404
+ )
405
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
406
+
407
+ if pixel_values is None and pixel_embeds is None:
408
+ raise ValueError('You have to specify pixel_values or pixel_embeds')
409
+
410
+ if pixel_embeds is not None:
411
+ hidden_states = pixel_embeds
412
+ else:
413
+ if len(pixel_values.shape) == 4:
414
+ hidden_states = self.embeddings(pixel_values)
415
+ else:
416
+ raise ValueError(f'wrong pixel_values size: {pixel_values.shape}')
417
+ encoder_outputs = self.encoder(
418
+ inputs_embeds=hidden_states,
419
+ output_hidden_states=output_hidden_states,
420
+ return_dict=return_dict,
421
+ )
422
+ last_hidden_state = encoder_outputs.last_hidden_state
423
+ pooled_output = last_hidden_state[:, 0, :]
424
+
425
+ if not return_dict:
426
+ return (last_hidden_state, pooled_output) + encoder_outputs[1:]
427
+
428
+ return BaseModelOutputWithPooling(
429
+ last_hidden_state=last_hidden_state,
430
+ pooler_output=pooled_output,
431
+ hidden_states=encoder_outputs.hidden_states,
432
+ attentions=encoder_outputs.attentions,
433
+ )
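`modeling_intern_vit.py` above implements the InternViT tower: patch and class embeddings with bicubically interpolated position embeddings, QK-normalized attention with an optional FlashAttention path, and a pre-norm encoder stack with LayerScale and DropPath. The following is a hedged sketch (not an official usage snippet) of exercising just this tower through the assembled model; "path/to/Qolda" is a placeholder repo id, and loading assumes the repository's custom code is trusted.

```python
# Hedged sketch: forward one dummy 448x448 tile through the vision tower only.
# "path/to/Qolda" is a placeholder for the actual Hugging Face repo id.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/Qolda", torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

dummy_tile = torch.randn(1, 3, 448, 448, dtype=torch.bfloat16)
with torch.no_grad():
    out = model.vision_model(pixel_values=dummy_tile, return_dict=True)

print(out.last_hidden_state.shape)  # (1, 1 + num_patches, vit_hidden_size)
print(out.pooler_output.shape)      # (1, vit_hidden_size): the CLS token
```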
modeling_internvl_chat.py ADDED
@@ -0,0 +1,376 @@
1
+ # --------------------------------------------------------
2
+ # InternVL
3
+ # Copyright (c) 2024 OpenGVLab
4
+ # Licensed under The MIT License [see LICENSE for details]
5
+ # --------------------------------------------------------
6
+
7
+ import warnings
8
+ from typing import List, Optional, Tuple, Union
9
+
10
+ import torch.utils.checkpoint
11
+ import transformers
12
+ from torch import nn
13
+ from torch.nn import CrossEntropyLoss
14
+ from transformers import GenerationConfig
15
+ from transformers.modeling_outputs import CausalLMOutputWithPast
16
+ from transformers.modeling_utils import PreTrainedModel
17
+ from transformers.utils import logging
18
+ from transformers import LlamaForCausalLM, Qwen2ForCausalLM, Qwen3ForCausalLM, Qwen3MoeForCausalLM
19
+
20
+ from .configuration_internvl_chat import InternVLChatConfig
21
+ from .conversation import get_conv_template
22
+ from .modeling_intern_vit import InternVisionModel, has_flash_attn
23
+
24
+ logger = logging.get_logger(__name__)
25
+
26
+
27
+ def version_cmp(v1, v2, op='eq'):
28
+ import operator
29
+
30
+ from packaging import version
31
+ op_func = getattr(operator, op)
32
+ return op_func(version.parse(v1), version.parse(v2))
33
+
34
+
35
+ class InternVLChatModel(PreTrainedModel):
36
+ config_class = InternVLChatConfig
37
+ main_input_name = 'pixel_values'
38
+ base_model_prefix = 'language_model'
39
+ _supports_flash_attn_2 = True
40
+ supports_gradient_checkpointing = True
41
+ _no_split_modules = [
42
+ "InternVisionModel",
43
+ "Qwen3DecoderLayer",
44
+ ]
45
+
46
+ # support transformers 4.51.+
47
+ _tp_plan = ''
48
+
49
+ def __init__(self, config: InternVLChatConfig, vision_model=None, language_model=None, use_flash_attn=True):
50
+ super().__init__(config)
51
+
52
+ assert version_cmp(transformers.__version__, '4.37.0', 'ge')
53
+ image_size = config.force_image_size or config.vision_config.image_size
54
+ patch_size = config.vision_config.patch_size
55
+ self.patch_size = patch_size
56
+ self.select_layer = config.select_layer
57
+ self.template = config.template
58
+ self.num_image_token = int((image_size // patch_size) ** 2 * (config.downsample_ratio ** 2))
59
+ self.downsample_ratio = config.downsample_ratio
60
+ self.ps_version = config.ps_version
61
+ use_flash_attn = use_flash_attn if has_flash_attn else False
62
+ config.vision_config.use_flash_attn = True if use_flash_attn else False
63
+ config.llm_config._attn_implementation = 'flash_attention_2' if use_flash_attn else 'eager'
64
+
65
+ logger.info(f'num_image_token: {self.num_image_token}')
66
+ logger.info(f'ps_version: {self.ps_version}')
67
+ if vision_model is not None:
68
+ self.vision_model = vision_model
69
+ else:
70
+ self.vision_model = InternVisionModel(config.vision_config)
71
+ if language_model is not None:
72
+ self.language_model = language_model
73
+ else:
74
+ architecture: str = config.llm_config.architectures[0]
75
+ if architecture == 'LlamaForCausalLM':
76
+ self.language_model = LlamaForCausalLM(config.llm_config)
77
+ elif architecture == 'Qwen2ForCausalLM':
78
+ self.language_model = Qwen2ForCausalLM(config.llm_config)
79
+ elif architecture == 'Qwen3MoeForCausalLM':
80
+ self.language_model = Qwen3MoeForCausalLM(config.llm_config)
81
+ elif architecture == 'Qwen3ForCausalLM':
82
+ self.language_model = Qwen3ForCausalLM(config.llm_config)
83
+ else:
84
+ raise NotImplementedError(f'{architecture} is not implemented.')
85
+
86
+ vit_hidden_size = config.vision_config.hidden_size
87
+ llm_hidden_size = config.llm_config.hidden_size
88
+
89
+ self.mlp1 = nn.Sequential(
90
+ nn.LayerNorm(vit_hidden_size * int(1 / self.downsample_ratio) ** 2),
91
+ nn.Linear(vit_hidden_size * int(1 / self.downsample_ratio) ** 2, llm_hidden_size),
92
+ nn.GELU(),
93
+ nn.Linear(llm_hidden_size, llm_hidden_size)
94
+ )
95
+
96
+ self.img_context_token_id = None
97
+ self.conv_template = get_conv_template(self.template)
98
+ self.system_message = self.conv_template.system_message
99
+
100
+ def forward(
101
+ self,
102
+ pixel_values: torch.FloatTensor,
103
+ input_ids: torch.LongTensor = None,
104
+ attention_mask: Optional[torch.Tensor] = None,
105
+ position_ids: Optional[torch.LongTensor] = None,
106
+ image_flags: Optional[torch.LongTensor] = None,
107
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
108
+ labels: Optional[torch.LongTensor] = None,
109
+ use_cache: Optional[bool] = None,
110
+ output_attentions: Optional[bool] = None,
111
+ output_hidden_states: Optional[bool] = None,
112
+ return_dict: Optional[bool] = None,
113
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
114
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
115
+
116
+ image_flags = image_flags.squeeze(-1)
117
+ input_embeds = self.language_model.get_input_embeddings()(input_ids).clone()
118
+
119
+ vit_embeds = self.extract_feature(pixel_values)
120
+ vit_embeds = vit_embeds[image_flags == 1]
121
+ vit_batch_size = pixel_values.shape[0]
122
+
123
+ B, N, C = input_embeds.shape
124
+ input_embeds = input_embeds.reshape(B * N, C)
125
+
126
+ # if torch.distributed.is_initialized() and torch.distributed.get_rank() == 0:
127
+ # print(f'dynamic ViT batch size: {vit_batch_size}, images per sample: {vit_batch_size / B}, dynamic token length: {N}')
128
+
129
+ input_ids = input_ids.reshape(B * N)
130
+ selected = (input_ids == self.img_context_token_id)
131
+ try:
132
+ input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds.reshape(-1, C)
133
+ except Exception as e:
134
+ vit_embeds = vit_embeds.reshape(-1, C)
135
+ print(f'warning: {e}, input_embeds[selected].shape={input_embeds[selected].shape}, '
136
+ f'vit_embeds.shape={vit_embeds.shape}')
137
+ n_token = min(selected.sum(), vit_embeds.size(0))
138
+ input_embeds[selected][:n_token] = input_embeds[selected][:n_token] * 0.0 + vit_embeds[:n_token]
139
+
140
+ input_embeds = input_embeds.reshape(B, N, C)
141
+
142
+ outputs = self.language_model(
143
+ inputs_embeds=input_embeds,
144
+ attention_mask=attention_mask,
145
+ position_ids=position_ids,
146
+ past_key_values=past_key_values,
147
+ use_cache=use_cache,
148
+ output_attentions=output_attentions,
149
+ output_hidden_states=output_hidden_states,
150
+ return_dict=return_dict,
151
+ )
152
+ logits = outputs.logits
153
+
154
+ loss = None
155
+ if labels is not None:
156
+ # Shift so that tokens < n predict n
157
+ shift_logits = logits[..., :-1, :].contiguous()
158
+ shift_labels = labels[..., 1:].contiguous()
159
+ # Flatten the tokens
160
+ loss_fct = CrossEntropyLoss()
161
+ shift_logits = shift_logits.view(-1, self.language_model.config.vocab_size)
162
+ shift_labels = shift_labels.view(-1)
163
+ # Enable model parallelism
164
+ shift_labels = shift_labels.to(shift_logits.device)
165
+ loss = loss_fct(shift_logits, shift_labels)
166
+
167
+ if not return_dict:
168
+ output = (logits,) + outputs[1:]
169
+ return (loss,) + output if loss is not None else output
170
+
171
+ return CausalLMOutputWithPast(
172
+ loss=loss,
173
+ logits=logits,
174
+ past_key_values=outputs.past_key_values,
175
+ hidden_states=outputs.hidden_states,
176
+ attentions=outputs.attentions,
177
+ )
178
+
179
+ def pixel_shuffle(self, x, scale_factor=0.5):
180
+ n, w, h, c = x.size()
181
+ # N, W, H, C --> N, W, H * scale, C // scale
182
+ x = x.view(n, w, int(h * scale_factor), int(c / scale_factor))
183
+ # N, W, H * scale, C // scale --> N, H * scale, W, C // scale
184
+ x = x.permute(0, 2, 1, 3).contiguous()
185
+ # N, H * scale, W, C // scale --> N, H * scale, W * scale, C // (scale ** 2)
186
+ x = x.view(n, int(h * scale_factor), int(w * scale_factor),
187
+ int(c / (scale_factor * scale_factor)))
188
+ if self.ps_version == 'v1':
189
+ warnings.warn("In ps_version 'v1', the height and width have not been swapped back, "
190
+ 'which results in a transposed image.')
191
+ else:
192
+ x = x.permute(0, 2, 1, 3).contiguous()
193
+ return x
194
+
195
+ def extract_feature(self, pixel_values):
196
+ if self.select_layer == -1:
197
+ vit_embeds = self.vision_model(
198
+ pixel_values=pixel_values,
199
+ output_hidden_states=False,
200
+ return_dict=True).last_hidden_state
201
+ else:
202
+ vit_embeds = self.vision_model(
203
+ pixel_values=pixel_values,
204
+ output_hidden_states=True,
205
+ return_dict=True).hidden_states[self.select_layer]
206
+ vit_embeds = vit_embeds[:, 1:, :]
207
+
208
+ h = w = int(vit_embeds.shape[1] ** 0.5)
209
+ vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1)
210
+ vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio)
211
+ vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], -1, vit_embeds.shape[-1])
212
+ vit_embeds = self.mlp1(vit_embeds)
213
+ return vit_embeds
214
+
215
+ def batch_chat(self, tokenizer, pixel_values, questions, generation_config, num_patches_list=None,
216
+ history=None, return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
217
+ IMG_CONTEXT_TOKEN='<IMG_CONTEXT>', verbose=False, image_counts=None):
218
+ if history is not None or return_history:
219
+ print('Now multi-turn chat is not supported in batch_chat.')
220
+ raise NotImplementedError
221
+
222
+ if image_counts is not None:
223
+ num_patches_list = image_counts
224
+ print('Warning: `image_counts` is deprecated. Please use `num_patches_list` instead.')
225
+
226
+ img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
227
+ self.img_context_token_id = img_context_token_id
228
+
229
+ if verbose and pixel_values is not None:
230
+ image_bs = pixel_values.shape[0]
231
+ print(f'dynamic ViT batch size: {image_bs}')
232
+
233
+ queries = []
234
+ for idx, num_patches in enumerate(num_patches_list):
235
+ question = questions[idx]
236
+ if pixel_values is not None and '<image>' not in question:
237
+ question = '<image>\n' + question
238
+ template = get_conv_template(self.template)
239
+ template.system_message = self.system_message
240
+ template.append_message(template.roles[0], question)
241
+ template.append_message(template.roles[1], None)
242
+ query = template.get_prompt()
243
+
244
+ image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
245
+ query = query.replace('<image>', image_tokens, 1)
246
+ queries.append(query)
247
+
248
+ tokenizer.padding_side = 'left'
249
+ model_inputs = tokenizer(queries, return_tensors='pt', padding=True)
250
+ input_ids = model_inputs['input_ids'].to(self.device)
251
+ attention_mask = model_inputs['attention_mask'].to(self.device)
252
+ eos_token_id = tokenizer.convert_tokens_to_ids(template.sep.strip())
253
+ generation_config['eos_token_id'] = eos_token_id
254
+ generation_output = self.generate(
255
+ pixel_values=pixel_values,
256
+ input_ids=input_ids,
257
+ attention_mask=attention_mask,
258
+ **generation_config
259
+ )
260
+ responses = tokenizer.batch_decode(generation_output, skip_special_tokens=True)
261
+ responses = [response.split(template.sep.strip())[0].strip() for response in responses]
262
+ return responses
263
+
264
+ def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
265
+ num_patches_list=None, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>',
266
+ verbose=False):
267
+
268
+ if history is None and pixel_values is not None and '<image>' not in question:
269
+ question = '<image>\n' + question
270
+
271
+ if num_patches_list is None:
272
+ num_patches_list = [pixel_values.shape[0]] if pixel_values is not None else []
273
+ assert pixel_values is None or len(pixel_values) == sum(num_patches_list)
274
+
275
+ img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
276
+ self.img_context_token_id = img_context_token_id
277
+
278
+ template = get_conv_template(self.template)
279
+ template.system_message = self.system_message
280
+ eos_token_id = tokenizer.convert_tokens_to_ids(template.sep.strip())
281
+
282
+ history = [] if history is None else history
283
+ for (old_question, old_answer) in history:
284
+ template.append_message(template.roles[0], old_question)
285
+ template.append_message(template.roles[1], old_answer)
286
+ template.append_message(template.roles[0], question)
287
+ template.append_message(template.roles[1], None)
288
+ query = template.get_prompt()
289
+
290
+ if verbose and pixel_values is not None:
291
+ image_bs = pixel_values.shape[0]
292
+ print(f'dynamic ViT batch size: {image_bs}')
293
+
294
+ for num_patches in num_patches_list:
295
+ image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
296
+ query = query.replace('<image>', image_tokens, 1)
297
+
298
+ model_inputs = tokenizer(query, return_tensors='pt')
299
+ input_ids = model_inputs['input_ids'].to(self.device)
300
+ attention_mask = model_inputs['attention_mask'].to(self.device)
301
+ generation_config['eos_token_id'] = eos_token_id
302
+ generation_output = self.generate(
303
+ pixel_values=pixel_values,
304
+ input_ids=input_ids,
305
+ attention_mask=attention_mask,
306
+ **generation_config
307
+ )
308
+ response = tokenizer.batch_decode(generation_output, skip_special_tokens=True)[0]
309
+ response = response.split(template.sep.strip())[0].strip()
310
+ history.append((question, response))
311
+ if return_history:
312
+ return response, history
313
+ else:
314
+ query_to_print = query.replace(IMG_CONTEXT_TOKEN, '')
315
+ query_to_print = query_to_print.replace(f'{IMG_START_TOKEN}{IMG_END_TOKEN}', '<image>')
316
+ if verbose:
317
+ print(query_to_print, response)
318
+ return response
319
+
320
+ @torch.no_grad()
321
+ def generate(
322
+ self,
323
+ pixel_values: Optional[torch.FloatTensor] = None,
324
+ input_ids: Optional[torch.FloatTensor] = None,
325
+ attention_mask: Optional[torch.LongTensor] = None,
326
+ visual_features: Optional[torch.FloatTensor] = None,
327
+ generation_config: Optional[GenerationConfig] = None,
328
+ output_hidden_states: Optional[bool] = None,
329
+ **generate_kwargs,
330
+ ) -> torch.LongTensor:
331
+
332
+ assert self.img_context_token_id is not None
333
+ if pixel_values is not None:
334
+ if visual_features is not None:
335
+ vit_embeds = visual_features
336
+ else:
337
+ vit_embeds = self.extract_feature(pixel_values)
338
+ input_embeds = self.language_model.get_input_embeddings()(input_ids)
339
+ B, N, C = input_embeds.shape
340
+ input_embeds = input_embeds.reshape(B * N, C)
341
+
342
+ input_ids = input_ids.reshape(B * N)
343
+ selected = (input_ids == self.img_context_token_id)
344
+ assert selected.sum() != 0
345
+ input_embeds[selected] = vit_embeds.reshape(-1, C).to(input_embeds.device)
346
+
347
+ input_embeds = input_embeds.reshape(B, N, C)
348
+ else:
349
+ input_embeds = self.language_model.get_input_embeddings()(input_ids)
350
+
351
+ outputs = self.language_model.generate(
352
+ inputs_embeds=input_embeds,
353
+ attention_mask=attention_mask,
354
+ generation_config=generation_config,
355
+ output_hidden_states=output_hidden_states,
356
+ use_cache=True,
357
+ **generate_kwargs,
358
+ )
359
+
360
+ return outputs
361
+
362
+ @property
363
+ def lm_head(self):
364
+ return self.language_model.get_output_embeddings()
365
+
366
+ def get_output_embeddings(self):
367
+ return self.language_model.get_output_embeddings()
368
+
369
+ def get_input_embeddings(self):
370
+ return self.language_model.get_input_embeddings()
371
+
372
+ def set_input_embeddings(self, value):
373
+ return self.language_model.set_input_embeddings(value)
374
+
375
+ def set_output_embeddings(self, value):
376
+ return self.language_model.set_output_embeddings(value)
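`modeling_internvl_chat.py` glues the pieces together: ViT features are pixel-shuffled by `downsample_ratio`, projected into the Qwen3 embedding space by `mlp1`, and spliced into the token stream wherever `<IMG_CONTEXT>` appears. The sketch below shows one-turn image chat based on the `chat(...)` signature defined above; the repo id and the minimal single-tile preprocessing are assumptions (the usual InternVL recipe adds dynamic tiling on top of this), so treat it as illustrative rather than the official usage snippet.

```python
# Hedged sketch: one-turn image chat with the InternVLChatModel defined above.
# "path/to/Qolda" is a placeholder repo id; preprocessing here is a single 448x448 tile
# normalized with the ImageNet statistics from preprocessor_config.json.
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "path/to/Qolda"
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
pixel_values = pixel_values.to(torch.bfloat16).cuda()

question = "<image>\nСуретте не бейнеленген?"  # "What is shown in the image?" (Kazakh)
generation_config = dict(max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```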
preprocessor_config.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "crop_size": null,
3
+ "crop_to_patches": false,
4
+ "data_format": "channels_first",
5
+ "default_to_square": true,
6
+ "device": null,
7
+ "do_center_crop": null,
8
+ "do_convert_rgb": true,
9
+ "do_normalize": true,
10
+ "do_rescale": true,
11
+ "do_resize": true,
12
+ "image_mean": [
13
+ 0.485,
14
+ 0.456,
15
+ 0.406
16
+ ],
17
+ "image_processor_type": "GotOcr2ImageProcessorFast",
18
+ "image_std": [
19
+ 0.229,
20
+ 0.224,
21
+ 0.225
22
+ ],
23
+ "input_data_format": null,
24
+ "max_patches": 12,
25
+ "min_patches": 1,
26
+ "processor_class": "InternVLProcessor",
27
+ "resample": 3,
28
+ "rescale_factor": 0.00392156862745098,
29
+ "return_tensors": null,
30
+ "size": {
31
+ "height": 448,
32
+ "width": 448
33
+ }
34
+ }
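`preprocessor_config.json` configures the HF-native processing path (`InternVLProcessor` with the fast GOT-OCR2 image processor): 448×448 tiles, between 1 and 12 dynamic patches per image, RGB conversion, 1/255 rescaling, and ImageNet mean/std normalization. A hedged inspection sketch follows; the repo id is a placeholder and it assumes a transformers version that ships `GotOcr2ImageProcessorFast`.

```python
# Hedged sketch: load and inspect the shipped image-preprocessing settings.
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("path/to/Qolda")  # placeholder id
print(image_processor.size)        # {'height': 448, 'width': 448}
print(image_processor.image_mean)  # [0.485, 0.456, 0.406]
```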
special_tokens_map.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<img>",
4
+ "</img>",
5
+ "<IMG_CONTEXT>",
6
+ "<quad>",
7
+ "</quad>",
8
+ "<ref>",
9
+ "</ref>",
10
+ "<box>",
11
+ "</box>"
12
+ ],
13
+ "eos_token": {
14
+ "content": "<|im_end|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false
19
+ },
20
+ "pad_token": {
21
+ "content": "<|endoftext|>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false
26
+ }
27
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,308 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_eos_token": false,
4
+ "add_prefix_space": false,
5
+ "added_tokens_decoder": {
6
+ "151643": {
7
+ "content": "<|endoftext|>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "151644": {
15
+ "content": "<|im_start|>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "151645": {
23
+ "content": "<|im_end|>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "151646": {
31
+ "content": "<|object_ref_start|>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": true
37
+ },
38
+ "151647": {
39
+ "content": "<|object_ref_end|>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": true
45
+ },
46
+ "151648": {
47
+ "content": "<|box_start|>",
48
+ "lstrip": false,
49
+ "normalized": false,
50
+ "rstrip": false,
51
+ "single_word": false,
52
+ "special": true
53
+ },
54
+ "151649": {
55
+ "content": "<|box_end|>",
56
+ "lstrip": false,
57
+ "normalized": false,
58
+ "rstrip": false,
59
+ "single_word": false,
60
+ "special": true
61
+ },
62
+ "151650": {
63
+ "content": "<|quad_start|>",
64
+ "lstrip": false,
65
+ "normalized": false,
66
+ "rstrip": false,
67
+ "single_word": false,
68
+ "special": true
69
+ },
70
+ "151651": {
71
+ "content": "<|quad_end|>",
72
+ "lstrip": false,
73
+ "normalized": false,
74
+ "rstrip": false,
75
+ "single_word": false,
76
+ "special": true
77
+ },
78
+ "151652": {
79
+ "content": "<|vision_start|>",
80
+ "lstrip": false,
81
+ "normalized": false,
82
+ "rstrip": false,
83
+ "single_word": false,
84
+ "special": true
85
+ },
86
+ "151653": {
87
+ "content": "<|vision_end|>",
88
+ "lstrip": false,
89
+ "normalized": false,
90
+ "rstrip": false,
91
+ "single_word": false,
92
+ "special": true
93
+ },
94
+ "151654": {
95
+ "content": "<|vision_pad|>",
96
+ "lstrip": false,
97
+ "normalized": false,
98
+ "rstrip": false,
99
+ "single_word": false,
100
+ "special": true
101
+ },
102
+ "151655": {
103
+ "content": "<|image_pad|>",
104
+ "lstrip": false,
105
+ "normalized": false,
106
+ "rstrip": false,
107
+ "single_word": false,
108
+ "special": true
109
+ },
110
+ "151656": {
111
+ "content": "<|video_pad|>",
112
+ "lstrip": false,
113
+ "normalized": false,
114
+ "rstrip": false,
115
+ "single_word": false,
116
+ "special": true
117
+ },
118
+ "151657": {
119
+ "content": "<tool_call>",
120
+ "lstrip": false,
121
+ "normalized": false,
122
+ "rstrip": false,
123
+ "single_word": false,
124
+ "special": false
125
+ },
126
+ "151658": {
127
+ "content": "</tool_call>",
128
+ "lstrip": false,
129
+ "normalized": false,
130
+ "rstrip": false,
131
+ "single_word": false,
132
+ "special": false
133
+ },
134
+ "151659": {
135
+ "content": "<|fim_prefix|>",
136
+ "lstrip": false,
137
+ "normalized": false,
138
+ "rstrip": false,
139
+ "single_word": false,
140
+ "special": false
141
+ },
142
+ "151660": {
143
+ "content": "<|fim_middle|>",
144
+ "lstrip": false,
145
+ "normalized": false,
146
+ "rstrip": false,
147
+ "single_word": false,
148
+ "special": false
149
+ },
150
+ "151661": {
151
+ "content": "<|fim_suffix|>",
152
+ "lstrip": false,
153
+ "normalized": false,
154
+ "rstrip": false,
155
+ "single_word": false,
156
+ "special": false
157
+ },
158
+ "151662": {
159
+ "content": "<|fim_pad|>",
160
+ "lstrip": false,
161
+ "normalized": false,
162
+ "rstrip": false,
163
+ "single_word": false,
164
+ "special": false
165
+ },
166
+ "151663": {
167
+ "content": "<|repo_name|>",
168
+ "lstrip": false,
169
+ "normalized": false,
170
+ "rstrip": false,
171
+ "single_word": false,
172
+ "special": false
173
+ },
174
+ "151664": {
175
+ "content": "<|file_sep|>",
176
+ "lstrip": false,
177
+ "normalized": false,
178
+ "rstrip": false,
179
+ "single_word": false,
180
+ "special": false
181
+ },
182
+ "151665": {
183
+ "content": "<tool_response>",
184
+ "lstrip": false,
185
+ "normalized": false,
186
+ "rstrip": false,
187
+ "single_word": false,
188
+ "special": false
189
+ },
190
+ "151666": {
191
+ "content": "</tool_response>",
192
+ "lstrip": false,
193
+ "normalized": false,
194
+ "rstrip": false,
195
+ "single_word": false,
196
+ "special": false
197
+ },
198
+ "151667": {
199
+ "content": "<think>",
200
+ "lstrip": false,
201
+ "normalized": false,
202
+ "rstrip": false,
203
+ "single_word": false,
204
+ "special": false
205
+ },
206
+ "151668": {
207
+ "content": "</think>",
208
+ "lstrip": false,
209
+ "normalized": false,
210
+ "rstrip": false,
211
+ "single_word": false,
212
+ "special": false
213
+ },
214
+ "151669": {
215
+ "content": "<img>",
216
+ "lstrip": false,
217
+ "normalized": false,
218
+ "rstrip": false,
219
+ "single_word": false,
220
+ "special": true
221
+ },
222
+ "151670": {
223
+ "content": "</img>",
224
+ "lstrip": false,
225
+ "normalized": false,
226
+ "rstrip": false,
227
+ "single_word": false,
228
+ "special": true
229
+ },
230
+ "151671": {
231
+ "content": "<IMG_CONTEXT>",
232
+ "lstrip": false,
233
+ "normalized": false,
234
+ "rstrip": false,
235
+ "single_word": false,
236
+ "special": true
237
+ },
238
+ "151672": {
239
+ "content": "<quad>",
240
+ "lstrip": false,
241
+ "normalized": false,
242
+ "rstrip": false,
243
+ "single_word": false,
244
+ "special": true
245
+ },
246
+ "151673": {
247
+ "content": "</quad>",
248
+ "lstrip": false,
249
+ "normalized": false,
250
+ "rstrip": false,
251
+ "single_word": false,
252
+ "special": true
253
+ },
254
+ "151674": {
255
+ "content": "<ref>",
256
+ "lstrip": false,
257
+ "normalized": false,
258
+ "rstrip": false,
259
+ "single_word": false,
260
+ "special": true
261
+ },
262
+ "151675": {
263
+ "content": "</ref>",
264
+ "lstrip": false,
265
+ "normalized": false,
266
+ "rstrip": false,
267
+ "single_word": false,
268
+ "special": true
269
+ },
270
+ "151676": {
271
+ "content": "<box>",
272
+ "lstrip": false,
273
+ "normalized": false,
274
+ "rstrip": false,
275
+ "single_word": false,
276
+ "special": true
277
+ },
278
+ "151677": {
279
+ "content": "</box>",
280
+ "lstrip": false,
281
+ "normalized": false,
282
+ "rstrip": false,
283
+ "single_word": false,
284
+ "special": true
285
+ }
286
+ },
287
+ "additional_special_tokens": [
288
+ "<img>",
289
+ "</img>",
290
+ "<IMG_CONTEXT>",
291
+ "<quad>",
292
+ "</quad>",
293
+ "<ref>",
294
+ "</ref>",
295
+ "<box>",
296
+ "</box>"
297
+ ],
298
+ "bos_token": null,
299
+ "clean_up_tokenization_spaces": false,
300
+ "eos_token": "<|im_end|>",
301
+ "errors": "replace",
302
+ "extra_special_tokens": {},
303
+ "model_max_length": 8192,
304
+ "pad_token": "<|endoftext|>",
305
+ "split_special_tokens": false,
306
+ "tokenizer_class": "Qwen2Tokenizer",
307
+ "unk_token": null
308
+ }
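The tokenizer is the Qwen2 BPE tokenizer extended with InternVL's image and grounding tokens (`<img>`, `</img>`, `<IMG_CONTEXT>`, `<quad>`, `<ref>`, `<box>` and their closers), with `<|im_end|>` as EOS and a `model_max_length` of 8192. A hedged sketch for checking that the image context token relied on by `modeling_internvl_chat.py` resolves correctly (placeholder repo id):

```python
# Hedged sketch: verify the image special tokens that the chat code relies on.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/Qolda", trust_remote_code=True)  # placeholder id
print(tokenizer.convert_tokens_to_ids("<IMG_CONTEXT>"))  # 151671 per tokenizer_config.json
print(tokenizer.eos_token)                               # <|im_end|>
print(tokenizer.model_max_length)                        # 8192
```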
video_preprocessor_config.json ADDED
@@ -0,0 +1,70 @@
1
+ {
2
+ "_valid_kwargs_names": [
3
+ "do_convert_rgb",
4
+ "do_resize",
5
+ "size",
6
+ "size_divisor",
7
+ "default_to_square",
8
+ "resample",
9
+ "do_rescale",
10
+ "rescale_factor",
11
+ "do_normalize",
12
+ "image_mean",
13
+ "image_std",
14
+ "do_pad",
15
+ "do_center_crop",
16
+ "crop_size",
17
+ "data_format",
18
+ "input_data_format",
19
+ "device"
20
+ ],
21
+ "crop_size": null,
22
+ "data_format": "channels_first",
23
+ "default_to_square": true,
24
+ "device": null,
25
+ "do_center_crop": null,
26
+ "do_convert_rgb": true,
27
+ "do_normalize": true,
28
+ "do_pad": null,
29
+ "do_rescale": true,
30
+ "do_resize": true,
31
+ "image_mean": [
32
+ 0.48145466,
33
+ 0.4578275,
34
+ 0.40821073
35
+ ],
36
+ "image_std": [
37
+ 0.26862954,
38
+ 0.26130258,
39
+ 0.27577711
40
+ ],
41
+ "input_data_format": null,
42
+ "model_valid_processing_keys": [
43
+ "do_convert_rgb",
44
+ "do_resize",
45
+ "size",
46
+ "size_divisor",
47
+ "default_to_square",
48
+ "resample",
49
+ "do_rescale",
50
+ "rescale_factor",
51
+ "do_normalize",
52
+ "image_mean",
53
+ "image_std",
54
+ "do_pad",
55
+ "do_center_crop",
56
+ "crop_size",
57
+ "data_format",
58
+ "input_data_format",
59
+ "device"
60
+ ],
61
+ "processor_class": "InternVLProcessor",
62
+ "resample": 3,
63
+ "rescale_factor": 0.00392156862745098,
64
+ "size": {
65
+ "height": 384,
66
+ "width": 384
67
+ },
68
+ "size_divisor": null,
69
+ "video_processor_type": "InternVLVideoProcessor"
70
+ }
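Note that the video path uses different preprocessing from the still-image path: 384×384 frames normalized with CLIP statistics, versus 448×448 tiles with ImageNet statistics above. A small, hedged sketch that only reads the two JSON files shipped in this commit (run from a local copy of the repository):

```python
# Hedged sketch: compare the video and image preprocessing statistics shipped here.
import json
from pathlib import Path

video_cfg = json.loads(Path("video_preprocessor_config.json").read_text())
image_cfg = json.loads(Path("preprocessor_config.json").read_text())
print(video_cfg["size"], video_cfg["image_mean"])  # 384x384, CLIP mean
print(image_cfg["size"], image_cfg["image_mean"])  # 448x448, ImageNet mean
```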
vocab.json ADDED
The diff for this file is too large to render. See raw diff