Prince-1 committed
Commit 4904233 · verified · 1 parent: 53f9194

Add files using upload-large-folder tool

.gitattributes CHANGED
@@ -152,3 +152,5 @@ onnx/vpm.blocks.23.mlp.linear_fc2.weight filter=lfs diff=lfs merge=lfs -text
  onnx/vpm.blocks.14.mlp.linear_fc1.weight filter=lfs diff=lfs merge=lfs -text
  onnx/vpm.blocks.1.attn.proj.weight filter=lfs diff=lfs merge=lfs -text
  chandra.rkllm filter=lfs diff=lfs merge=lfs -text
+ data/demo.jpg filter=lfs diff=lfs merge=lfs -text
+ chandra_quant_w8a8_rk3588.rkllm filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -8,6 +8,9 @@ base_model:
  - datalab-to/chandra
  ---

+ # NOTE
+ rkllm requires `setuptools`
+
  # Chandra

  Chandra is an OCR model that outputs markdown, HTML, and JSON. It is highly accurate at extracting text from images and PDFs, while preserving layout information.
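A note this terse is easy to miss; presumably the rkllm-toolkit pulls in `setuptools` at import time, and newer Python environments no longer ship it by default. A minimal, hypothetical pre-flight guard (not part of this commit) that could sit at the top of the export scripts:

# Hypothetical guard, not part of this commit: fail fast if setuptools is
# absent before the rkllm import chain needs it.
try:
    import setuptools  # noqa: F401  (presence check only)
except ImportError as exc:
    raise SystemExit("setuptools is missing; run `pip install setuptools` first") from exc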
chandra_quant_w8a8_rk3588.rkllm ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bf52d7c5cc7760680af626b9450d93a5df779d92d6ba2e592846a62a22b78224
+ size 8863852436
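The three `+` lines above are a standard Git LFS pointer (spec v1): only this stub is committed, while the ~8.9 GB weights live in LFS storage. A small sketch, assuming only the format shown, that parses such a pointer:

# Sketch: parse a Git LFS pointer file like the one above (spec v1).
def parse_lfs_pointer(text: str) -> dict:
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],
        "sha256": fields["oid"].removeprefix("sha256:"),
        "size_bytes": int(fields["size"]),
    }

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:bf52d7c5cc7760680af626b9450d93a5df779d92d6ba2e592846a62a22b78224
size 8863852436"""
print(parse_lfs_pointer(pointer))  # size_bytes == 8863852436 (~8.9 GB)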
data/datasets.json ADDED
@@ -0,0 +1,22 @@
+ [
+ {"image_path": "data/datasets", "image": "1.jpg", "input": "Question: What is correct Python code to generate the content of the image?\nOptions:\nA. for x in range(6):\n print(x)\nelse:\n print(\"Finally finished!\")\n\nB. thisdict = {\n \"brand\": \"Ford\",\n \"model\": \"Mustang\",\n \"year\": 1964\n}\n\nprint(len(thisdict))\nC. x = 1\ny = 2.8\nz = 1j\n\nprint(type(x))\nprint(type(y))\nprint(type(z))\n\nD. fruits = [\"apple\", \"banana\", \"cherry\"]\nfor x in fruits:\n print(x)\nPlease select the correct answer from the options above. \n", "target":"D"},
+ {"image_path": "data/datasets", "image": "2.jpg", "input": "Question: What is correct Python code to generate the content of the image?\nOptions:\nA. class Person:\n def __init__(self, name, age):\n self.name = name\n self.age = age\n\np1 = Person(\"John\", 36)\n\nprint(p1.name)\nprint(p1.age)\nB. fruits = [\"apple\", \"banana\", \"cherry\"]\nfor x in fruits:\n print(x)\nC. x = min(5, 10, 25)\ny = max(5, 10, 25)\n\nprint(x)\nprint(y)\nD. a = 33\nb = 200\nif b > a:\n print(\"b is greater than a\")\nPlease select the correct answer from the options above. \n", "target":"D"},
+ {"image_path": "data/datasets", "image": "21.jpg", "input": "Question: Which one is the correct caption of this image?\nOptions:\nA. A man rides a surfboard on a large wave.\nB. a young boy barefoot holding an umbrella touching the horn of a cow\nC. A giraffe standing by a stall in a field.\nD. A stop sign that has been vandalized with graffiti.\nPlease select the correct answer from the options above. \n", "target":"B"},
+ {"image_path": "data/datasets", "image": "22.jpg", "input": "Question: Which one is the correct caption of this image?\nOptions:\nA. A narrow kitchen filled with appliances and cooking utensils.\nB. A person with glasses and a tie in a room.\nC. Tray of vegetables with cucumber, carrots, broccoli and celery.\nD. A pretty young woman riding a surfboard on a wave in the ocean.\nPlease select the correct answer from the options above. \n", "target":"A"},
+ {"image_path": "data/datasets", "image": "241.jpg", "input": "Hint: The passage below describes an experiment. Read the passage and then follow the instructions below.\n\nMadelyn applied a thin layer of wax to the underside of her snowboard and rode the board straight down a hill. Then, she removed the wax and rode the snowboard straight down the hill again. She repeated the rides four more times, alternating whether she rode with a thin layer of wax on the board or not. Her friend Tucker timed each ride. Madelyn and Tucker calculated the average time it took to slide straight down the hill on the snowboard with wax compared to the average time on the snowboard without wax.\nFigure: snowboarding down a hill.\nQuestion: Identify the question that Madelyn and Tucker's experiment can best answer.\nOptions:\nA. Does Madelyn's snowboard slide down a hill in less time when it has a thin layer of wax or a thick layer of wax?\nB. Does Madelyn's snowboard slide down a hill in less time when it has a layer of wax or when it does not have a layer of wax?\nPlease select the correct answer from the options above. \n", "target":"B"},
+ {"image_path": "data/datasets", "image": "252.jpg", "input": "Hint: People can use the engineering-design process to develop solutions to problems. One step in the process is testing if a potential solution meets the requirements of the design.\nThe passage below describes how the engineering-design process was used to test a solution to a problem. Read the passage. Then answer the question below.\n\nLaura and Isabella were making batches of concrete for a construction project. To make the concrete, they mixed together dry cement powder, gravel, and water. Then, they checked if each batch was firm enough using a test called a slump test.\nThey poured some of the fresh concrete into an upside-down metal cone. They left the concrete in the metal cone for 30 seconds. Then, they lifted the cone to see if the concrete stayed in a cone shape or if it collapsed. If the concrete in a batch collapsed, they would know the batch should not be used.\nFigure: preparing a concrete slump test.\nQuestion: Which of the following could Laura and Isabella's test show?\nOptions:\nA. if the concrete from each batch took the same amount of time to dry\nB. if a new batch of concrete was firm enough to use\nPlease select the correct answer from the options above. \n", "target":"B"},
+ {"image_path": "data/datasets", "image": "362.jpg", "input": "Hint: Native copper has the following properties:\nsolid\nnot made by living things\nfound in nature\nfixed crystal structure\nmade of the metal copper\nQuestion: Is native copper a mineral?\nOptions:\nA. no\nB. yes\nPlease select the correct answer from the options above. \n", "target":"B"},
+ {"image_path": "data/datasets", "image": "364.jpg", "input": "Hint: Plastic has the following properties:\nsolid\nno fixed crystal structure\nnot a pure substance\nmade in a factory\nQuestion: Is plastic a mineral?\nOptions:\nA. yes\nB. no\nPlease select the correct answer from the options above. \n", "target":"B"},
+ {"image_path": "data/datasets", "image": "448.jpg", "input": "Hint: Read the text.\nButterflies and moths are easily mistaken for each other, but one distinction between them often appears during their pupal stage. When most butterfly caterpillars reach full size, they attach themselves to a leaf or other object and shed their skin a final time, forming a chrysalis, a hard, shell-like skin, which protects the pupa inside. The chrysalis may be dull and rough or shiny and smooth, usually blending into its surroundings. Most moth caterpillars, by contrast, create a cocoon to protect the pupa, rather than forming a chrysalis. The cocoons usually resemble hard silk pouches, but some moths also incorporate materials like hairs and twigs.\nQuestion: Which term matches the picture?\nOptions:\nA. cocoon\nB. chrysalis\nPlease select the correct answer from the options above. \n", "target":"B"},
+ {"image_path": "data/datasets", "image": "477.jpg", "input": "Hint: Read the text.\nHeat transfer can occur in different ways. Two common ways are through conduction and convection. Conduction occurs when molecules from one object collide with molecules from another object. Burning your hand by touching a hot car door on a sunny summer day is an example of conduction.\nConvection is another form of heat transfer. When a liquid or gas is heated, the heated matter rises upward, away from the heat source. Hot bubbles rising in a pot of water boiling on a stove is an example of convection.\nQuestion: Which term matches the picture?\nOptions:\nA. conduction\nB. convection\nPlease select the correct answer from the options above. \n", "target":"B"},
+ {"image_path": "data/datasets", "image": "1231.jpg", "input": "Question: Which image is more brightful?\nOptions:\nA. The first image\nB. The second image\nPlease select the correct answer from the options above. \n", "target":"A"},
+ {"image_path": "data/datasets", "image": "1232.jpg", "input": "Question: Which image is more brightful?\nOptions:\nA. The first image\nB. The second image\nPlease select the correct answer from the options above. \n", "target":"A"},
+ {"image_path": "data/datasets", "image": "1085.jpg", "input": "Question: is this place crowded?\nOptions:\nA. yes\nB. no\nPlease select the correct answer from the options above. \n", "target":"A"},
+ {"image_path": "data/datasets", "image": "1086.jpg", "input": "Question: is this place crowded?\nOptions:\nA. yes\nB. no\nPlease select the correct answer from the options above. \n", "target":"A"},
+ {"image_path": "data/datasets", "image": "1128.jpg", "input": "Question: In this picture, are the two dolphins the same size?\nOptions:\nA. same\nB. Not the same\nC. Can't judge\nPlease select the correct answer from the options above. \n", "target":"B"},
+ {"image_path": "data/datasets", "image": "1129.jpg", "input": "Question: In this picture, are the two butterfly wings the same shape?\nOptions:\nA. same\nB. Not the same\nC. Can't judge\nPlease select the correct answer from the options above. \n", "target":"B"},
+ {"image_path": "data/datasets", "image": "1200.jpg", "input": "Question: What will happen next?\nOptions:\nA. the motorcyle is gonna go forward\nB. the motorcyle is gonna crash\nC. the motorcyle is gonna go backward\nD. both A,B, and C\nPlease select the correct answer from the options above. \n", "target":"B"},
+ {"image_path": "data/datasets", "image": "1201.jpg", "input": "Question: What will happen next?\nOptions:\nA. this person is gonna stay still\nB. this person is gonna keep walking\nC. this person is gonna fall into the water\nD. both A,B, and C\nPlease select the correct answer from the options above. \n", "target":"C"},
+ {"image_path": "data/datasets", "image": "1554.jpg", "input": "Question: The object shown in this figure:\nOptions:\nA. Is a colorless, flammable liquid that is commonly used as a solvent and fuel\nB. Has a boiling point of 64.7°C\nC. Can be toxic if ingested or absorbed through the skin\nD. None of these options are correct.\nPlease select the correct answer from the options above. \n", "target":"C"},
+ {"image_path": "data/datasets", "image": "1555.jpg", "input": "Question: The object shown in this figure:\nOptions:\nA. Is a lustrous, white metal that is highly reflective and ductile\nB. Has the highest electrical and thermal conductivity of all metals\nC. Has a boiling point of 2,162°C\nD. All of these options are correct.\nPlease select the correct answer from the options above. \n", "target":"D"}
+ ]
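`data/datasets.json` is the calibration set that `export_rkllm.py` (below) passes to `llm.build(...)` for w8a8 quantization. A minimal sketch, assuming the schema shown above, to sanity-check the entries before a conversion run:

# Sketch: validate the calibration entries in data/datasets.json
# (assumes the schema shown above: image_path, image, input, target).
import json, os

with open("data/datasets.json") as f:
    entries = json.load(f)

for e in entries:
    img = os.path.join(e["image_path"], e["image"])
    assert os.path.isfile(img), f"missing calibration image: {img}"
    assert e["target"], "every entry needs a reference answer"
print(f"{len(entries)} calibration samples OK")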
data/datasets/1.jpg ADDED
data/datasets/1085.jpg ADDED
data/datasets/1086.jpg ADDED
data/datasets/1128.jpg ADDED
data/datasets/1129.jpg ADDED
data/datasets/1200.jpg ADDED
data/datasets/1201.jpg ADDED
data/datasets/1231.jpg ADDED
data/datasets/1232.jpg ADDED
data/datasets/1554.jpg ADDED
data/datasets/1555.jpg ADDED
data/datasets/2.jpg ADDED
data/datasets/21.jpg ADDED
data/datasets/22.jpg ADDED
data/datasets/241.jpg ADDED
data/datasets/252.jpg ADDED
data/datasets/362.jpg ADDED
data/datasets/364.jpg ADDED
data/datasets/448.jpg ADDED
data/datasets/477.jpg ADDED
data/demo.jpg ADDED

Git LFS Details

  • SHA256: 58c5c9898c5359bcf53797711e3d954c8ef529e141cb012ffc433376933839e7
  • Pointer size: 131 Bytes
  • Size of remote file: 245 kB
export_rkllm.py ADDED
@@ -0,0 +1,52 @@
+ import os
+ import argparse
+ from rkllm.api import RKLLM
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--path', type=str, default='Qwen/Qwen2-VL-2B-Instruct', help='model path', required=False)
+ parser.add_argument('--target-platform', type=str, default='rk3588', help='target platform', required=False)
+ parser.add_argument('--num_npu_core', type=int, default=3, help='npu core num', required=False)
+ parser.add_argument('--quantized_dtype', type=str, default='w8a8', help='quantized dtype', required=False)
+ parser.add_argument('--device', type=str, default='cpu', help='device', required=False)
+ parser.add_argument('--savepath', type=str, default='qwen2_vl_2b_instruct.rkllm', help='save path (unused: the output path is derived from --path below)', required=False)
+ args = parser.parse_args()
+
+ modelpath = args.path
+ target_platform = args.target_platform
+ num_npu_core = args.num_npu_core
+ quantized_dtype = args.quantized_dtype
+
+ # Output path: ./rkllm/<model-dir-name>_<dtype>_<platform>.rkllm
+ savepath = os.path.join("./rkllm", os.path.basename(modelpath).lower() + "_" + quantized_dtype + "_" + target_platform + ".rkllm")
+ os.makedirs(os.path.dirname(savepath), exist_ok=True)
+
+ llm = RKLLM()
+ # Load the Hugging Face model
+ # Use 'export CUDA_VISIBLE_DEVICES=2' to pin a specific GPU device
+ ret = llm.load_huggingface(model=modelpath, device=args.device)
+ if ret != 0:
+     print('Load model failed!')
+     exit(ret)
+
+ # Build (quantize) the model, calibrating on the multimodal dataset below
+ dataset = 'data/datasets.json'
+
+ qparams = None
+ ret = llm.build(do_quantization=True, optimization_level=1, quantized_dtype=quantized_dtype,
+                 quantized_algorithm='normal', target_platform=target_platform, num_npu_core=num_npu_core, extra_qparams=qparams, dataset=dataset)
+
+ if ret != 0:
+     print('Build model failed!')
+     exit(ret)
+
+ # Export the .rkllm model
+ ret = llm.export_rkllm(savepath)
+ if ret != 0:
+     print('Export model failed!')
+     exit(ret)
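The output name is derived from the model directory, not from `--savepath`. Assuming the local checkout was named `chandra_quant` (an assumption, not shown in this commit), the derivation reproduces the file added above:

# Sketch: how export_rkllm.py derives its output name. With a hypothetical
# local model directory "CKPT/chandra_quant", this matches the committed file.
import os

modelpath = "CKPT/chandra_quant"  # hypothetical local checkout
savepath = os.path.join(
    "./rkllm",
    os.path.basename(modelpath).lower() + "_" + "w8a8" + "_" + "rk3588" + ".rkllm",
)
print(savepath)  # ./rkllm/chandra_quant_w8a8_rk3588.rkllm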
export_vision.py ADDED
@@ -0,0 +1,323 @@
+ import torch
+ import os
+ import math
+ import argparse
+ from transformers import AutoModel
+
+ # Wraps the MiniCPM-V 2.6 vision tower and resampler so they export as a
+ # single ONNX graph with one 'pixel' input.
+ class minicpm_v_2_6_vision(torch.nn.Module):
+     def __init__(self, vlm, batch_size, in_h, in_w):
+         super(minicpm_v_2_6_vision, self).__init__()
+         self.vpm = vlm.vpm
+         self.resampler = vlm.resampler
+         patch_size = vlm.config.patch_size
+         num_patches_per_side = vlm.vpm.embeddings.num_patches_per_side
+         tgt_sizes = torch.Tensor([[(in_h // patch_size), math.ceil(in_w / patch_size)]]).type(torch.int32)
+         patch_attention_mask = torch.ones(
+             size=(batch_size, in_h // patch_size, in_w // patch_size),
+             dtype=torch.bool, device=vlm.device,
+         )
+         max_im_h, max_im_w = in_h, in_w
+         max_nb_patches_h, max_nb_patches_w = max_im_h // patch_size, max_im_w // patch_size
+         boundaries = torch.arange(1 / num_patches_per_side, 1.0, 1 / num_patches_per_side)
+         position_ids = torch.full(
+             size=(batch_size, max_nb_patches_h * max_nb_patches_w),
+             fill_value=0,
+         )
+         for batch_idx, p_attn_mask in enumerate(patch_attention_mask):
+             if tgt_sizes is not None:
+                 nb_patches_h = tgt_sizes[batch_idx][0]
+                 nb_patches_w = tgt_sizes[batch_idx][1]
+             else:
+                 nb_patches_h = p_attn_mask[:, 0].sum()
+                 nb_patches_w = p_attn_mask[0].sum()
+
+             fractional_coords_h = torch.arange(0, 1 - 1e-6, 1 / nb_patches_h)
+             fractional_coords_w = torch.arange(0, 1 - 1e-6, 1 / nb_patches_w)
+
+             bucket_coords_h = torch.bucketize(fractional_coords_h, boundaries, right=True)
+             bucket_coords_w = torch.bucketize(fractional_coords_w, boundaries, right=True)
+
+             pos_ids = (bucket_coords_h[:, None] * num_patches_per_side + bucket_coords_w).flatten()
+             position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids
+
+         position_ids = position_ids.to(vlm.device)
+         self.position_ids = position_ids
+
+         patch_len = tgt_sizes[:, 0] * tgt_sizes[:, 1]
+         max_patch_len = torch.max(patch_len)
+         key_padding_mask = torch.zeros((batch_size, max_patch_len), dtype=torch.bool, device=vlm.device)
+         pos_embed = []
+         for i in range(batch_size):
+             tgt_h, tgt_w = tgt_sizes[i]
+             pos_embed.append(self.resampler.pos_embed[:tgt_h, :tgt_w, :].reshape((tgt_h * tgt_w, -1)).to(torch.float32))  # patches * D
+             key_padding_mask[i, patch_len[i]:] = True
+
+         self.pos_embed = torch.nn.utils.rnn.pad_sequence(
+             pos_embed, batch_first=True, padding_value=0.0).permute(1, 0, 2)  # BLD => L * B * D
+
+     def forward(self, pixel_values):
+         batch_size = pixel_values.size(0)
+         # patch embedding
+         patch_embeds = self.vpm.embeddings.patch_embedding(pixel_values)
+         embeddings = patch_embeds.flatten(2).transpose(1, 2)
+         hidden_states = embeddings + self.vpm.embeddings.position_embedding(self.position_ids)
+         # encoder
+         encoder_outputs = self.vpm.encoder(inputs_embeds=hidden_states)
+         last_hidden_state = encoder_outputs[0]
+         last_hidden_state = self.vpm.post_layernorm(last_hidden_state)
+         # resampler
+         x = self.resampler.kv_proj(last_hidden_state)  # B * L * D
+         x = self.resampler.ln_kv(x).permute(1, 0, 2)  # L * B * D
+
+         q = self.resampler.ln_q(self.resampler.query)  # Q * D
+
+         out = self.resampler.attn(
+             self.resampler._repeat(q, batch_size),  # Q * B * D
+             x + self.pos_embed,  # L * B * D + L * B * D
+             x)[0]
+         # out: Q * B * D
+         x = out.permute(1, 0, 2)  # B * Q * D
+
+         x = self.resampler.ln_post(x)
+         x = x @ self.resampler.proj
+         return x
+
+ # Re-implements Qwen2.5-VL patch packing (temporal/patch/merge reshape) so the
+ # visual tower takes a plain NCHW tensor plus grid_thw.
+ class qwen2_5_vl_3b_vision(torch.nn.Module):
+     def __init__(self, vlm, batch_size):
+         super(qwen2_5_vl_3b_vision, self).__init__()
+         self.merge_size = 2
+         self.temporal_patch_size = 2
+         self.patch_size = 14
+         self.channel = 3
+         self.vpm = vlm.visual
+         self.batch_size = batch_size
+
+     def forward(self, pixel_value, grid_thw):
+         if self.batch_size == 1:
+             patches = pixel_value.repeat(self.temporal_patch_size, 1, 1, 1)
+         elif self.batch_size % self.temporal_patch_size == 1:
+             repeat_image = pixel_value[-1:, ...].repeat(2, 1, 1, 1)
+             patches = torch.cat((pixel_value, repeat_image), dim=0)
+         else:
+             patches = pixel_value
+         grid_t, grid_h, grid_w = grid_thw[0][0], grid_thw[0][1], grid_thw[0][2]
+         patches = patches.reshape(grid_t, self.temporal_patch_size, self.channel,
+                                   grid_h // self.merge_size, self.merge_size, self.patch_size, grid_w // self.merge_size, self.merge_size, self.patch_size)
+         patches = patches.permute(0, 3, 6, 4, 7, 2, 1, 5, 8)
+         flatten_patches = patches.reshape(grid_t * grid_h * grid_w, self.channel * self.temporal_patch_size * self.patch_size * self.patch_size)
+
+         return self.vpm(flatten_patches, grid_thw)
+
+ # Same packing as above for Qwen3-VL, which uses 16x16 patches.
+ class qwen3_vl_vision(torch.nn.Module):
+     def __init__(self, vlm, batch_size):
+         super(qwen3_vl_vision, self).__init__()
+         self.merge_size = 2
+         self.temporal_patch_size = 2
+         self.patch_size = 16
+         self.channel = 3
+         self.vpm = vlm.visual
+         self.batch_size = batch_size
+
+     def forward(self, pixel_value, grid_thw):
+         if self.batch_size == 1:
+             patches = pixel_value.repeat(self.temporal_patch_size, 1, 1, 1)
+         elif self.batch_size % self.temporal_patch_size == 1:
+             repeat_image = pixel_value[-1:, ...].repeat(2, 1, 1, 1)
+             patches = torch.cat((pixel_value, repeat_image), dim=0)
+         else:
+             patches = pixel_value
+         grid_t, grid_h, grid_w = grid_thw[0][0], grid_thw[0][1], grid_thw[0][2]
+         patches = patches.reshape(grid_t, self.temporal_patch_size, self.channel,
+                                   grid_h // self.merge_size, self.merge_size, self.patch_size, grid_w // self.merge_size, self.merge_size, self.patch_size)
+         patches = patches.permute(0, 3, 6, 4, 7, 2, 1, 5, 8)
+         flatten_patches = patches.reshape(grid_t * grid_h * grid_w, self.channel * self.temporal_patch_size * self.patch_size * self.patch_size)
+
+         return self.vpm(flatten_patches, grid_thw)
+
+ class smolvlm_vision(torch.nn.Module):
+     def __init__(self, vlm):
+         super(smolvlm_vision, self).__init__()
+         self.vpm = vlm.model.vision_model
+         self.connector = vlm.model.connector
+
+     def forward(self, pixel_values):
+         # Get sequence from the vision encoder
+         image_hidden_states = self.vpm(pixel_values).last_hidden_state
+         # Modality projection & resampling
+         image_hidden_states = self.connector(image_hidden_states)
+         print("image_features:", image_hidden_states.shape)
+         return image_hidden_states
+
+ class vila1_5_3b_vision(torch.nn.Module):
+     def __init__(self, vlm):
+         super(vila1_5_3b_vision, self).__init__()
+         self.vlm = vlm
+
+     def forward(self, pixel_values):
+         # Get sequence from the vision encoder
+         out = self.vlm.encode_images(pixel_values)
+         return out
+
+ # DeepSeek-OCR: SAM features and vision-tower features are concatenated,
+ # projected, then flattened with newline/separator embeddings appended.
+ class deepseekocr_vision(torch.nn.Module):
+     def __init__(self, model):
+         super(deepseekocr_vision, self).__init__()
+         self.sam_model = model.sam_model
+         self.vision_model = model.vision_model
+         self.view_seperator = model.view_seperator
+         self.image_newline = model.image_newline
+         self.projector = model.projector
+
+     def forward(self, pixel_value):
+         global_features_1 = self.sam_model(pixel_value)
+         global_features_2 = self.vision_model(pixel_value, global_features_1)
+         global_features = torch.cat((global_features_2[:, 1:], global_features_1.flatten(2).permute(0, 2, 1)), dim=-1)
+         global_features = self.projector(global_features)
+         print('=====================')
+         print('BASE: ', global_features.shape)
+         print('NO PATCHES')
+         print('=====================')
+         _, hw, n_dim = global_features.shape
+         h = w = int(hw ** 0.5)
+         global_features = global_features.view(h, w, n_dim)
+         global_features = torch.cat(
+             [global_features, self.image_newline[None, None, :].expand(h, 1, n_dim)], dim=1
+         )
+         global_features = global_features.view(-1, n_dim)
+         global_local_features = torch.cat([global_features, self.view_seperator[None, :]], dim=0)
+         return global_local_features
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--path', type=str, default='CKPT/MiniCPM-V-2_6', help='model path', required=False)
+     parser.add_argument('--model_name', type=str, default='minicpm-v-2_6', help='model name', required=False)
+     parser.add_argument('--batch_size', type=int, default=1, help='batch size', required=False)
+     parser.add_argument('--height', type=int, default=448, help='image height', required=False)
+     parser.add_argument('--width', type=int, default=448, help='image width', required=False)
+     parser.add_argument('--device', type=str, default="cpu", help='cpu or cuda', required=False)
+
+     args = parser.parse_args()
+
+     path = args.path
+     model_name = args.model_name
+     savepath = os.path.join("./onnx", model_name + "_vision.onnx")
+     device_type = args.device
+     os.makedirs(os.path.dirname(savepath), exist_ok=True)
+
+     if model_name == 'minicpm-v-2_6':
+         model = AutoModel.from_pretrained(
+             path, trust_remote_code=True, dtype=torch.float32,
+         )
+         model = model.to(device=device_type, dtype=torch.float32)
+         model.eval()
+         model = minicpm_v_2_6_vision(model, args.batch_size, args.height, args.width)
+         # The wrapper is a plain nn.Module with no .device attribute, so place
+         # the dummy input on the CLI-selected device instead.
+         pixel_values = torch.randn(args.batch_size, 3, args.height, args.width, device=device_type, dtype=torch.float32)
+         out = model(pixel_values)
+         print("Output shape:", out.shape)
+         torch.onnx.export(model,
+                           pixel_values,
+                           savepath,
+                           input_names=['pixel'],
+                           opset_version=18)
+     elif model_name == 'qwen2_5-vl-3b':
+         from transformers import Qwen2_5_VLForConditionalGeneration
+         model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+             path,
+             dtype=torch.float32,  # rknn currently only supports float32, so set the dtype explicitly; if the dtype was already fixed when loading the weights, set "use_flash_attn" to false in config.json yourself
+             low_cpu_mem_usage=True, _attn_implementation="eager",
+             trust_remote_code=True).eval().to(device_type)
+         pixel_values = torch.randn(args.batch_size, 3, args.height, args.width, device=model.device, dtype=torch.float32)
+         grid_thw = torch.tensor([[args.batch_size // 2 if args.batch_size % 2 == 0 else args.batch_size // 2 + 1, args.height // 14, args.width // 14]], dtype=torch.int64)
+         model.eval()
+         model = qwen2_5_vl_3b_vision(model, args.batch_size)
+         out = model(pixel_values, grid_thw)
+         print("Output shape:", out.shape)
+         torch.onnx.export(model,
+                           (pixel_values, grid_thw),
+                           savepath,
+                           input_names=['pixel', 'grid_thw'],
+                           dynamic_axes={'pixel': {2: 'height', 3: 'width'}},
+                           opset_version=15)
+     elif model_name == 'qwen3-vl':
+         from transformers import Qwen3VLForConditionalGeneration
+         model = Qwen3VLForConditionalGeneration.from_pretrained(
+             path,
+             dtype=torch.float32,  # rknn currently only supports float32, so set the dtype explicitly; if the dtype was already fixed when loading the weights, set "use_flash_attn" to false in config.json yourself
+             low_cpu_mem_usage=True, _attn_implementation="eager",
+             trust_remote_code=True).eval().to(device_type)
+
+         # Fixed resolution and grid
+         HEIGHT = 224
+         WIDTH = 224
+         BATCH = 1
+
+         pixel_values = torch.randn(
+             BATCH, 3, HEIGHT, WIDTH,
+             device=model.device,
+             dtype=torch.float32
+         )
+
+         grid_thw = torch.tensor(
+             [[1, HEIGHT // 16, WIDTH // 16]],
+             dtype=torch.int64
+         )
+
+         # pixel_values = torch.randn(args.batch_size, 3, args.height, args.width, device=model.device, dtype=torch.float32)
+         # grid_thw = torch.tensor([[args.batch_size // 2 if args.batch_size % 2 == 0 else args.batch_size // 2 + 1, args.height // 16, args.width // 16]], dtype=torch.int64)
+         model.eval()
+         model = qwen3_vl_vision(model, args.batch_size)
+         out = model(pixel_values, grid_thw)
+         print("Output shape:", out[0].shape)
+         torch.onnx.export(model,
+                           (pixel_values, grid_thw),
+                           savepath,
+                           input_names=['pixel', 'grid_thw'],
+                           # dynamic_axes={'pixel': {2: 'height', 3: 'width'}},
+                           opset_version=18
+                           )
+     elif model_name == 'smolvlm':
+         from transformers import SmolVLMForConditionalGeneration
+         model = SmolVLMForConditionalGeneration.from_pretrained(
+             path,
+             dtype=torch.float32,
+             _attn_implementation="eager",
+         ).to(device_type)
+         pixel_values = torch.randn(args.batch_size, 3, args.height, args.width, device=model.device, dtype=torch.float32)
+         print("pixel_values:", pixel_values.shape)
+         model = smolvlm_vision(model)
+         model = model.to(torch.float32).eval()
+         out = model(pixel_values)
+         torch.onnx.export(model,
+                           pixel_values,
+                           savepath,
+                           input_names=['pixel'],
+                           dynamic_axes={'pixel': {2: 'height', 3: 'width'}},
+                           opset_version=18)
+     elif model_name == 'internvl3-1b':
+         model = AutoModel.from_pretrained(
+             path,
+             torch_dtype=torch.float32,
+             low_cpu_mem_usage=True,
+             trust_remote_code=True).eval().to(device_type)
+         pixel_values = torch.randn(args.batch_size, 3, args.height, args.width, device=model.device, dtype=torch.float32)
+         model.forward = model.extract_feature
+         model = model.to(torch.float32).eval()
+         torch.onnx.export(model, pixel_values, savepath, input_names=['pixel'])
+     elif model_name == 'deepseekocr':
+         model = AutoModel.from_pretrained(
+             path,
+             _attn_implementation='eager',
+             torch_dtype=torch.float32,
+             low_cpu_mem_usage=True,
+             trust_remote_code=True).eval().to(device_type)
+         pixel_values = torch.randn(args.batch_size, 3, args.height, args.width, device=model.device, dtype=torch.float32)
+         model = deepseekocr_vision(model.model)
+         model = model.to(torch.float32).eval()
+         torch.onnx.export(model, pixel_values, savepath, input_names=['pixel'], opset_version=18)
+     else:
+         raise ValueError(f"Unsupported model name: {model_name}")
+
+     print(f"Exported to {savepath}")