Prince-1 committed on
Commit b7dca43 · verified · 1 parent: 8f6bd14

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ ex1.png filter=lfs diff=lfs merge=lfs -text
+ *.rkllm filter=lfs diff=lfs merge=lfs -text
NuMarkdown-8B-Thinking.rkllm ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:558b96520c8ca9037fc7695c26886cffcf7498aacc47f16642be7854aa05ae1a
+ size 15272523868
README.md ADDED
@@ -0,0 +1,266 @@
+ ---
+ license: mit
+ base_model: NuMarkdown-8B-Thinking
+ tags:
+ - OCR
+ - vision-language
+ - VLM
+ - Reasoning
+ - document-to-markdown
+ - qwen2.5
+ - markdown
+ - extraction
+ - RAG
+ - rkllm
+ - onnx
+ model_name: NuMarkdown-8B-Thinking
+ library_name: transformers
+ pipeline_tag: image-to-text
+ ---
+
+ <p align="center">
+ <a href="https://nuextract.ai/">
+ <img src="numind.svg" width="400" height="400"/>
+ </a>
+ </p>
+ <p align="center">
+ 🖥️ <a href="https://nuextract.ai/">API / Platform</a>&nbsp;&nbsp; | &nbsp;&nbsp;🗣️ <a href="https://discord.gg/3tsEtJNCDe">Discord</a>&nbsp;&nbsp; | &nbsp;&nbsp;🔗 <a href="https://github.com/numindai/NuMarkdown">GitHub</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤗 <a href="https://huggingface.co/spaces/numind/NuMarkdown-8b-Thinking">Demo</a>
+ </p>
+
+ ---
+
+ # Reasoning comes to OCR 🧠✨📄🤘
+
+ **NuMarkdown-8B-Thinking** is the first reasoning OCR VLM. It is trained specifically to convert documents into clean Markdown files, well suited for RAG applications. Before generating the Markdown, it generates thinking tokens to work out the layout of the document.
+ It is particularly good at understanding documents with unusual layouts and complex tables. The number of thinking tokens can range from 20% to 500% of the length of the final answer, depending on task difficulty.
+
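Since the model wraps its reasoning in `<think>…</think>` and the result in `<answer>…</answer>`, the thinking-to-answer ratio mentioned above can be estimated from a raw generation. A rough sketch (whitespace word counts stand in for real token counts):

```python
import re

def think_ratio(generation: str) -> float:
    """Rough thinking-to-answer size ratio, using word counts as a token proxy."""
    think = re.search(r"<think>(.*?)</think>", generation, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", generation, re.S)
    if not think or not answer:
        raise ValueError("missing <think> or <answer> section")
    return len(think.group(1).split()) / len(answer.group(1).split())

# A 150% thinking overhead would look like:
print(think_ratio("<think>one two three</think><answer>a b</answer>"))  # 1.5
```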
+ **NuMarkdown-8B-Thinking** is a fine-tune of **Qwen 2.5-VL-7B** on synthetic Doc &rarr; Reasoning &rarr; Markdown examples, followed by an RL phase (GRPO) with a layout-centric reward.
+
+ Try it out in [the 🤗 space!](https://huggingface.co/spaces/numind/NuMarkdown-8b-Thinking)
+
+ ## Results
+
+ **NuMarkdown-8B-Thinking** outperforms generic non-reasoning models like GPT-4o and specialized OCR models like OCRFlux.
+ It is competitive against large closed-source reasoning models like Gemini 2.5.
+
+ ### Arena ranking against popular alternatives (using the TrueSkill-2 rating system, with around 500 model-anonymized votes):
+ <p align="center">
+
+ | Rank | Model | μ | σ | μ − 3σ |
+ | ---- | --------------------------------------- | ----- | ---- | ------ |
+ | 🥇 1 | **gemini-flash-reasoning** | 26.75 | 0.80 | 24.35 |
+ | 🥈 2 | **NuMarkdown-reasoning** | 26.10 | 0.79 | 23.72 |
+ | 🥉 3 | **NuMarkdown-reasoning-w/o\_grpo** | 25.32 | 0.80 | 22.93 |
+ | 4 | **OCRFlux-3B** | 24.63 | 0.80 | 22.22 |
+ | 5 | **gpt-4o** | 24.48 | 0.80 | 22.08 |
+ | 6 | **gemini-flash-w/o\_reasoning** | 24.11 | 0.79 | 21.74 |
+ | 7 | **RolmoOCR** | 23.53 | 0.82 | 21.07 |
+
+ </p>
+
+ *We plan to release a Markdown arena, similar to LMArena, for complex document-to-Markdown tasks, to provide a tool for evaluating different solutions.*
+
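The ranking uses the conservative TrueSkill-style score μ − 3σ (a lower confidence bound on skill), which is what the last column of the table reports. A quick check against the table's values (small rounding differences of ±0.01 are expected):

```python
# (mu, sigma) pairs copied from the arena table above.
ratings = {
    "gemini-flash-reasoning": (26.75, 0.80),
    "NuMarkdown-reasoning": (26.10, 0.79),
    "NuMarkdown-reasoning-w/o_grpo": (25.32, 0.80),
    "OCRFlux-3B": (24.63, 0.80),
    "gpt-4o": (24.48, 0.80),
    "gemini-flash-w/o_reasoning": (24.11, 0.79),
    "RolmoOCR": (23.53, 0.82),
}

# Conservative skill estimate: mu - 3 * sigma.
conservative = {name: mu - 3 * sigma for name, (mu, sigma) in ratings.items()}
for name, score in sorted(conservative.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s} {score:.2f}")
```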
+ ### Win/Draw/Lose rate against other models (image-only):
+ <p align="center">
+ <img src="bar plot.png" width="700"/>
+ </p>
+
+
+ ## Training
+
+ 1. **SFT**: single-epoch supervised fine-tuning on synthetic reasoning traces generated from public PDFs.
+ 2. **RL (GRPO)**: an RL phase using a layout-centric reward on difficult image examples.
+
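The layout-centric reward is not published in detail. As a toy illustration of the idea only (an assumption, not the actual reward used in training), one could score a candidate Markdown against a reference by comparing table structure rather than raw text:

```python
def layout_reward(candidate: str, reference: str) -> float:
    """Toy layout reward (NOT the actual training reward): fraction of table
    rows whose column count matches the reference at the same position."""
    def table_shape(md: str):
        # Cells per table row, for every Markdown table line in the document.
        rows = [line for line in md.splitlines() if line.strip().startswith("|")]
        return [row.count("|") - 1 for row in rows]

    cand, ref = table_shape(candidate), table_shape(reference)
    if not ref:
        return 1.0 if not cand else 0.0
    matches = sum(1 for a, b in zip(cand, ref) if a == b)
    return matches / max(len(cand), len(ref))

print(layout_reward("| a | b |\n|---|---|", "| a | b |\n|---|---|"))  # 1.0
```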
+ ## Example:
+
+ <p align="center">
+ <img src="ex1.png" width="500"/>
+ </p>
+
+ ```
+ <think>
+ 1. **Analyze the overall structure:** The document consists of two main sections, each containing a header, a set of bullet points, a title, a table, and a footer. The layout is consistent within each section but different between the two.
+
+ 2. **Handle the first section header and bullet points:**
+    - The top left corner has "Generalitat de Catalunya Departament d'Educació Institut Gal·lecs". This is a consistent header and should be transcribed as a level 3 heading.
+    - The top right corner has "Curs 2021-22". This is a distinct header and should be transcribed as a level 3 heading.
+    - Below the header, there are four bullet points. These are standard list items. I will use Markdown's unordered list syntax (`- `).
+
+ 3. **Process the first section's main content - Table 1:**
+    - **Identify table structure:** This is a complex table with a clear header row: "MATERIAL", "TÍTOL", "AUTOR", "EDITORIAL", "ISBN".
+    - **Determine columns:** There are 5 columns.
+    - **Handle merged cells:**
+      - "Llengua Catalana" spans two rows.
+      - "Llengua Castellana" spans three rows.
+      - "Anglès" spans two rows.
+    - **Populate table data:** Transcribe all text, including multi-line content within cells (e.g., "Babulinka Books\nEditorialCol·lecció: Pettes\nJoies per a Grans Lectors.\n2013").
+    - **Markdown table creation:** Use standard Markdown table syntax, ensuring correct alignment and handling of multi-line content within cells.
+
+ 4. **Handle the first section's footer:**
+    - The footer contains "Codí: 04mp02", "Responsable: Coordinador Qualitat", "Versió: 5", "Full d'Informació a l'alumnat i famílies", "Aquest document pot quedar obsolet una vegada imprès", and "Pàgina 1 de 2". This is a simple text block at the bottom. I will transcribe it as a separate paragraph or list, maintaining the original layout if possible, but a simple paragraph is sufficient for a footer.
+
+ 5. **Handle the second section header and bullet points:**
+    - The top left corner has "Generalitat de Catalunya Departament d'Educació Institut Gal·lecs". This is a consistent header and should be transcribed as a level 3 heading.
+    - The top right corner has "Curs 2021-22". This is a distinct header and should be transcribed as a level 3 heading.
+    - Below the header, there are no bullet points, but there is a table.
+
+ 6. **Process the second section's main content - Table 2:**
+    - **Identify table structure:** This is a simpler table with a clear header row: "MATERIAL", "TÍTOL", "AUTOR", "EDITORIAL", "ISBN".
+    - **Determine columns:** There are 5 columns.
+    - **Populate table data:** Transcribe all text, including multi-line content within cells (e.g., "Projecte MOTIMATS-ONMAT (llicència digital) Tekman Books\nES GESTIONA A TRAVÉS DE L'AMPA AL SETEMBRE").
+    - **Markdown table creation:** Use standard Markdown table syntax.
+
+ 7. **Handle the second section's footer:**
+    - The footer contains "Codí: 04mp02", "Responsable: Coordinador Qualitat", "Versió: 5", "Full d'Informació a l'alumnat i famílies", "Aquest document pot quedar obsolet una vegada imprès", and "Pàgina 2 de 2". This is a simple text block at the bottom. I will transcribe it as a separate paragraph or list, maintaining the original layout if possible, but a simple paragraph is sufficient for a footer.
+
+ 8. **Final review:** Check for any missed text, incorrect formatting, or issues with table alignment. Ensure all original content is present and accurately represented in GitHub-Flavored Markdown. The image placeholders are not needed as there are no images in the document.
+ </think>
+ <answer>
+ ### Generalitat de Catalunya
+ ### Departament d'Educació
+ ### Institut Gal·lecs
+
+ ### Curs 2021-22
+
+ - Els llibres de color blau indiquen que es manté respecte al curs anterior.
+ - Els llibres de color groc indiquen que es tracta d'un canvi per a aquest curs.
+ - Els llibres de color vermell indiquen que no s'han de comprar perquè van a càrrec del centre.
+ - Matèries sense llibre: Dossiers, llicències i el Kit de tecnologia s'hauran de pagar i de recollir a través de l'AMPA a partir de setembre.
+
+ # 1 ESO
+
+ | MATERIAL | TÍTOL | AUTOR | EDITORIAL | ISBN |
+ |---|---|---|---|---|
+ | Llengua Catalana | Punt Volat | | Castellnou (Didacta +) | 9788417803124 |
+ | | Duna, Diari d'un estiu. | Muriel Villanueva | Babulinka Books<br>EditorialCol·lecció: Pettes<br>Joies per a Grans Lectors.<br>2013 | 9788494159077 |
+ | | El nen que xatejava amb Jack Sparrow. | Francesc Puigpelat | Bromera<br>Col·lecció: L'Elefant. 2015 | 9788490264072 |
+ | Llengua Castellana | Proyecto Asterisco | | Castellnou (Didacta +) | 9788417803186 |
+ | | Manzanas rojas | Luis Matilla | Ed. Anaya | 978846673989 |
+ | | Fàbulas de Esopo | Jerry Pinkney | Vicens Vives | 978843671648 |
+ | Anglès | Think Ahead ESO 1. Student's book.<br>Think Ahead ESO 1. Workbook (cat). | | Burlington Books<br>Burlington Books | 9788925300662<br>9789925300686 |
+
+ Codí: 04mp02
+ Responsable: Coordinador Qualitat
+ Versió: 5
+ Full d'Informació a l'alumnat i famílies
+ Aquest document pot quedar obsolet una vegada imprès
+ Pàgina 1 de 2
+
+ ### Generalitat de Catalunya
+ ### Departament d'Educació
+ ### Institut Gal·lecs
+
+ ### Curs 2021-22
+
+ | MATERIAL | TÍTOL | AUTOR | EDITORIAL | ISBN |
+ |---|---|---|---|---|
+ | FRANCÈS | Nouvelle Génération A1-A2 | | Santillana | 9788490494745 |
+ | CIÈNCIES EXPERIMENTALS | Science Bits<br>ES GESTIONA A TRAVÉS DE L'AMPA AL SETEMBRE | | | 9788412213485 (llicència digital) |
+ | MATEMÀTIQUES | Projecte MOTIMATS-ONMAT (llicència digital) Tekman Books<br>ES GESTIONA A TRAVÉS DE L'AMPA AL SETEMBRE | | | |
+ | TECNOLOGIA | Tecnologia 1 ESO | TEIDE | | 9788430783175 |
+ | VISUAL I PLÀSTICA | SENSE LLIBRE-KIT DE MATERIAL | | | |
+ | CIÈNCIES SOCIALS | SENSE LLIBRE-dossier | | | |
+
+ Codí: 04mp02
+ Responsable: Coordinador Qualitat
+ Versió: 5
+ Full d'Informació a l'alumnat i famílies
+ Aquest document pot quedar obsolet una vegada imprès
+ Pàgina 2 de 2
+ </answer>
+ ```
+
+ ## Quick start:
+
+ ### vLLM:
+ ```
+ vllm serve numind/NuMarkdown-8B-Thinking --trust_remote_code --limit-mm-per-prompt image=1
+ ```
+
+ ```python
+ from openai import OpenAI
+ import base64
+
+ openai_api_key = "EMPTY"
+ openai_api_base = "http://localhost:8000/v1"
+
+ client = OpenAI(
+     api_key=openai_api_key,
+     base_url=openai_api_base,
+ )
+
+ def encode_image(image_path):
+     """Encode an image file as a base64 string."""
+     with open(image_path, "rb") as image_file:
+         return base64.b64encode(image_file.read()).decode("utf-8")
+
+ base64_image = encode_image("image.png")
+ data_url = f"data:image/png;base64,{base64_image}"
+
+ chat_response = client.chat.completions.create(
+     model="numind/NuMarkdown-8B-Thinking",
+     temperature=0.7,
+     messages=[
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "image_url",
+                     "image_url": {"url": data_url},
+                     "min_pixels": 100 * 28 * 28,
+                     "max_pixels": 5000 * 28 * 28,
+                 },
+             ],
+         },
+     ],
+ )
+
+ result = chat_response.choices[0].message.content
+ reasoning = result.split("<think>")[1].split("</think>")[0]
+ answer = result.split("<answer>")[1].split("</answer>")[0]
+ print(answer)
+ ```
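The split-based parsing above raises `IndexError` if the generation stops before a closing tag (for example when the token limit is hit mid-reasoning). A more defensive variant (a hypothetical helper, not part of the model's API):

```python
import re

def split_thinking(generation: str) -> tuple[str, str]:
    """Return (reasoning, answer); fall back gracefully if tags are absent."""
    think = re.search(r"<think>(.*?)</think>", generation, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", generation, re.S)
    return (
        think.group(1).strip() if think else "",
        # If the model never emitted <answer>, return the raw text as-is.
        answer.group(1).strip() if answer else generation.strip(),
    )

reasoning, markdown = split_thinking("<think>two tables</think><answer># Title</answer>")
print(markdown)  # # Title
```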
+
+
+ ### 🤗 Transformers:
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+
+ model_id = "numind/NuMarkdown-8B-Thinking"
+
+ processor = AutoProcessor.from_pretrained(
+     model_id,
+     trust_remote_code=True,
+     min_pixels=100 * 28 * 28,
+     max_pixels=5000 * 28 * 28,
+ )
+
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ img = Image.open("image.png").convert("RGB")
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image"},
+     ],
+ }]
+ prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ model_input = processor(text=prompt, images=[img], return_tensors="pt").to(model.device)
+
+ with torch.no_grad():
+     model_output = model.generate(**model_input, do_sample=True, temperature=0.7, max_new_tokens=5000)
+
+ result = processor.decode(model_output[0])
+ reasoning = result.split("<think>")[1].split("</think>")[0]
+ answer = result.split("<answer>")[1].split("</answer>")[0]
+ print(answer)
+ ```
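The `min_pixels`/`max_pixels` settings bound the resolution the processor feeds the vision tower. In Qwen2.5-VL, each 28×28 pixel block corresponds to one visual token (14-pixel patches merged 2×2), so the values above cap the image at roughly 100 to 5000 visual tokens:

```python
def visual_token_bounds(min_pixels: int, max_pixels: int, block: int = 28) -> tuple[int, int]:
    """Approximate visual-token range implied by Qwen2.5-VL pixel limits.

    One visual token covers a `block` x `block` pixel area (14-px patches,
    2x2 spatial merge), so tokens ~= pixels / block^2.
    """
    return min_pixels // (block * block), max_pixels // (block * block)

lo, hi = visual_token_bounds(100 * 28 * 28, 5000 * 28 * 28)
print(lo, hi)  # 100 5000
```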
bar plot.png ADDED
ex1.png ADDED

Git LFS Details

  • SHA256: 9ab65794a94eae69f761e65fa4731829251b907b60488bab62b47b5c16cc7000
  • Pointer size: 131 Bytes
  • Size of remote file: 163 kB
export_vision.py ADDED
@@ -0,0 +1,263 @@
+ import torch
+ import numpy as np
+ import os
+ import math
+ import argparse
+ import torch.nn.functional as F
+ from transformers import AutoModel
+
+ class minicpm_v_2_6_vision(torch.nn.Module):
+     def __init__(self, vlm, batch_size, in_h, in_w):
+         super(minicpm_v_2_6_vision, self).__init__()
+         self.vpm = vlm.vpm
+         self.resampler = vlm.resampler
+         patch_size = vlm.config.patch_size
+         num_patches_per_side = vlm.vpm.embeddings.num_patches_per_side
+         tgt_sizes = torch.Tensor([[(in_h // patch_size), math.ceil(in_w / patch_size)]]).type(torch.int32)
+         patch_attention_mask = torch.ones(
+             size=(batch_size, in_h // patch_size, in_w // patch_size),
+             dtype=torch.bool, device=vlm.device,
+         )
+         max_im_h, max_im_w = in_h, in_w
+         max_nb_patches_h, max_nb_patches_w = max_im_h // patch_size, max_im_w // patch_size
+         boundaries = torch.arange(1 / num_patches_per_side, 1.0, 1 / num_patches_per_side)
+         position_ids = torch.full(
+             size=(batch_size, max_nb_patches_h * max_nb_patches_w),
+             fill_value=0,
+         )
+         for batch_idx, p_attn_mask in enumerate(patch_attention_mask):
+             if tgt_sizes is not None:
+                 nb_patches_h = tgt_sizes[batch_idx][0]
+                 nb_patches_w = tgt_sizes[batch_idx][1]
+             else:
+                 nb_patches_h = p_attn_mask[:, 0].sum()
+                 nb_patches_w = p_attn_mask[0].sum()
+
+             fractional_coords_h = torch.arange(0, 1 - 1e-6, 1 / nb_patches_h)
+             fractional_coords_w = torch.arange(0, 1 - 1e-6, 1 / nb_patches_w)
+
+             bucket_coords_h = torch.bucketize(fractional_coords_h, boundaries, right=True)
+             bucket_coords_w = torch.bucketize(fractional_coords_w, boundaries, right=True)
+
+             pos_ids = (bucket_coords_h[:, None] * num_patches_per_side + bucket_coords_w).flatten()
+             position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids
+
+         position_ids = position_ids.to(vlm.device)
+         self.position_ids = position_ids
+
+         patch_len = tgt_sizes[:, 0] * tgt_sizes[:, 1]
+         max_patch_len = torch.max(patch_len)
+         key_padding_mask = torch.zeros((batch_size, max_patch_len), dtype=torch.bool, device=vlm.device)
+         pos_embed = []
+         for i in range(batch_size):
+             tgt_h, tgt_w = tgt_sizes[i]
+             pos_embed.append(self.resampler.pos_embed[:tgt_h, :tgt_w, :].reshape((tgt_h * tgt_w, -1)).to(torch.float32))  # patches * D
+             key_padding_mask[i, patch_len[i]:] = True
+
+         self.pos_embed = torch.nn.utils.rnn.pad_sequence(
+             pos_embed, batch_first=True, padding_value=0.0).permute(1, 0, 2)  # BLD => L * B * D
+
+     def forward(self, pixel_values):
+         batch_size = pixel_values.size(0)
+         # patch embedding
+         patch_embeds = self.vpm.embeddings.patch_embedding(pixel_values)
+         embeddings = patch_embeds.flatten(2).transpose(1, 2)
+         hidden_states = embeddings + self.vpm.embeddings.position_embedding(self.position_ids)
+         # encoder
+         encoder_outputs = self.vpm.encoder(inputs_embeds=hidden_states)
+         last_hidden_state = encoder_outputs[0]
+         last_hidden_state = self.vpm.post_layernorm(last_hidden_state)
+         # resampler
+         x = self.resampler.kv_proj(last_hidden_state)  # B * L * D
+         x = self.resampler.ln_kv(x).permute(1, 0, 2)  # L * B * D
+
+         q = self.resampler.ln_q(self.resampler.query)  # Q * D
+
+         out = self.resampler.attn(
+             self.resampler._repeat(q, batch_size),  # Q * B * D
+             x + self.pos_embed,  # L * B * D + L * B * D
+             x)[0]
+         # out: Q * B * D
+         x = out.permute(1, 0, 2)  # B * Q * D
+
+         x = self.resampler.ln_post(x)
+         x = x @ self.resampler.proj
+         return x
+
+ class qwen2_5_vl_3b_vision(torch.nn.Module):
+     def __init__(self, vlm, batch_size):
+         super(qwen2_5_vl_3b_vision, self).__init__()
+         self.merge_size = 2
+         self.temporal_patch_size = 2
+         self.patch_size = 14
+         self.channel = 3
+         self.vpm = vlm.visual
+         self.batch_size = batch_size
+
+     def forward(self, pixel_value, grid_thw):
+         if self.batch_size == 1:
+             patches = pixel_value.repeat(self.temporal_patch_size, 1, 1, 1)
+         elif self.batch_size % self.temporal_patch_size == 1:
+             repeat_image = pixel_value[-1:, ...].repeat(2, 1, 1, 1)
+             patches = torch.cat((pixel_value, repeat_image), dim=0)
+         else:
+             patches = pixel_value
+         grid_t, grid_h, grid_w = grid_thw[0][0], grid_thw[0][1], grid_thw[0][2]
+         patches = patches.reshape(grid_t, self.temporal_patch_size, self.channel,
+                                   grid_h // self.merge_size, self.merge_size, self.patch_size,
+                                   grid_w // self.merge_size, self.merge_size, self.patch_size)
+         patches = patches.permute(0, 3, 6, 4, 7, 2, 1, 5, 8)
+         flatten_patches = patches.reshape(grid_t * grid_h * grid_w,
+                                           self.channel * self.temporal_patch_size * self.patch_size * self.patch_size)
+
+         return self.vpm(flatten_patches, grid_thw)
+
+ class smolvlm_vision(torch.nn.Module):
+     def __init__(self, vlm):
+         super(smolvlm_vision, self).__init__()
+         self.vpm = vlm.model.vision_model
+         self.connector = vlm.model.connector
+
+     def forward(self, pixel_values):
+         # Get sequence from the vision encoder
+         image_hidden_states = self.vpm(pixel_values).last_hidden_state
+         # Modality projection & resampling
+         image_hidden_states = self.connector(image_hidden_states)
+         print("image_features:", image_hidden_states.shape)
+         return image_hidden_states
+
+ class vila1_5_3b_vision(torch.nn.Module):
+     def __init__(self, vlm):
+         super(vila1_5_3b_vision, self).__init__()
+         self.vlm = vlm
+
+     def forward(self, pixel_values):
+         # Get sequence from the vision encoder
+         out = self.vlm.encode_images(pixel_values)
+         return out
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--path', type=str, default='CKPT/MiniCPM-V-2_6', help='model path', required=False)
+     parser.add_argument('--model_name', type=str, default='minicpm-v-2_6', help='model name', required=False)
+     parser.add_argument('--batch_size', type=int, default=1, help='batch size', required=False)
+     parser.add_argument('--height', type=int, default=448, help='image height', required=False)
+     parser.add_argument('--width', type=int, default=448, help='image width', required=False)
+     parser.add_argument('--device', type=str, default="cpu", help='cpu or cuda', required=False)
+     args = parser.parse_args()
+
+     path = args.path
+     model_name = args.model_name
+     savepath = os.path.join("./onnx", model_name + "_vision.onnx")
+     device_type = args.device
+     os.makedirs(os.path.dirname(savepath), exist_ok=True)
+     if model_name == 'minicpm-v-2_6':
+         model = AutoModel.from_pretrained(
+             path, trust_remote_code=True, torch_dtype=torch.float32,
+         )
+         model = model.to(device=device_type, dtype=torch.float32)
+         model.eval()
+         model = minicpm_v_2_6_vision(model, args.batch_size, args.height, args.width)
+         # Note: the wrapper is a plain nn.Module with no .device attribute,
+         # so create the dummy input on the requested device directly.
+         pixel_values = torch.randn(args.batch_size, 3, args.height, args.width, device=device_type, dtype=torch.float32)
+         out = model(pixel_values)
+         print("Output shape:", out.shape)
+         torch.onnx.export(model,
+                           pixel_values,
+                           savepath,
+                           input_names=['pixel'],
+                           opset_version=15)
+     elif model_name == 'qwen2_5-vl-3b':
+         from transformers import Qwen2_5_VLForConditionalGeneration
+         model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+             path,
+             low_cpu_mem_usage=True,
+             _attn_implementation="eager",
+             trust_remote_code=True,
+         )
+
+         model = model.to(device=device_type, dtype=torch.float32).eval()
+
+         model = qwen2_5_vl_3b_vision(model, args.batch_size)
+
+         def get_window_index_static(self, grid_thw):
+             # grid_thw: [1, 3] holding (T, H, W) (int64, static)
+             device = grid_thw.device
+             T, H, W = grid_thw[0]
+
+             total = T * H * W
+
+             # window_index: [total]
+             window_index = torch.arange(total, device=device)
+
+             # cu_window_seqlens: [0, total]
+             cu_window_seqlens = torch.tensor([0, total], device=device)
+
+             return window_index, cu_window_seqlens
+
+         # Patch the window-index computation with a static version so the
+         # graph exports with fixed shapes. The wrapper exposes the vision
+         # tower as `model.vpm`, so the patch is bound there.
+         model.vpm.get_window_index = get_window_index_static.__get__(
+             model.vpm, type(model.vpm)
+         )
+
+         print(model.vpm.get_window_index)
+
+         pixel_values = torch.randn(args.batch_size, 3, args.height, args.width, device=device_type, dtype=torch.float32)
+         # Static grid derived from the CLI arguments (14-px patches); must be
+         # defined before the forward pass that traces the graph.
+         grid_t = args.batch_size // 2 if args.batch_size % 2 == 0 else args.batch_size // 2 + 1
+         grid_thw = torch.tensor([[grid_t, args.height // 14, args.width // 14]], dtype=torch.int64)
+
+         out = model(pixel_values, grid_thw)
+         print("Output shape:", out.shape)
+
+         torch.onnx.export(
+             model,
+             (pixel_values, grid_thw),
+             savepath,
+             input_names=["pixel", "grid_thw"],
+             opset_version=18,
+         )
+     elif model_name == 'smolvlm':
+         from transformers import SmolVLMForConditionalGeneration
+         model = SmolVLMForConditionalGeneration.from_pretrained(
+             path,
+             torch_dtype=torch.float32,
+             _attn_implementation="eager",
+         ).to(device_type)
+         pixel_values = torch.randn(args.batch_size, 3, args.height, args.width, device=model.device, dtype=torch.float32)
+         print("pixel_values:", pixel_values.shape)
+         model = smolvlm_vision(model)
+         model = model.to(torch.float32).eval()
+         out = model(pixel_values)
+         torch.onnx.export(model,
+                           pixel_values,
+                           savepath,
+                           input_names=['pixel'],
+                           dynamic_axes={'pixel': {2: 'height', 3: 'width'}},
+                           opset_version=15)
+     elif model_name == 'internvl3-1b':
+         model = AutoModel.from_pretrained(
+             path,
+             torch_dtype=torch.float32,
+             low_cpu_mem_usage=True,
+             trust_remote_code=True).eval().to(device_type)
+         pixel_values = torch.randn(args.batch_size, 3, args.height, args.width, device=model.device, dtype=torch.float32)
+         model.forward = model.extract_feature
+         model = model.to(torch.float32).eval()
+         torch.onnx.export(model, pixel_values, savepath)
+     else:
+         raise ValueError(f"Unsupported model name: {model_name}")
+
+     print(f"Exported to {savepath}")
export_vision_rknn.py ADDED
@@ -0,0 +1,58 @@
+ from rknn.api import RKNN
+ import numpy as np
+ import os
+ import argparse
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--path', type=str, default='./onnx/qwen2_5-vl-3b_vision.onnx', help='model path', required=False)
+ parser.add_argument('--model_name', type=str, default='qwen2_5-vl-3b', help='model name', required=False)
+ parser.add_argument('--target-platform', type=str, default='rk3588', help='target platform', required=False)
+ parser.add_argument('--batch_size', type=int, default=1, help='batch size', required=False)
+ parser.add_argument('--height', type=int, default=448, help='image height', required=False)
+ parser.add_argument('--width', type=int, default=448, help='image width', required=False)
+
+ args = parser.parse_args()
+
+ model_path = args.path
+ target_platform = args.target_platform
+ modelname = args.model_name
+
+ if 'qwen2' in model_path.lower():
+     mean_value = [[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255]]
+     std_value = [[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255]]
+ elif 'internvl3' in model_path.lower():
+     mean_value = [[0.485 * 255, 0.456 * 255, 0.406 * 255]]
+     std_value = [[0.229 * 255, 0.224 * 255, 0.225 * 255]]
+ else:
+     mean_value = [[0.5 * 255, 0.5 * 255, 0.5 * 255]]
+     std_value = [[0.5 * 255, 0.5 * 255, 0.5 * 255]]
+
+ if modelname == 'qwen2_5-vl-3b':
+     inputs = ['pixel', 'grid_thw']
+     input_size_list = [[args.batch_size, 3, args.height, args.width], [1, 3]]
+     grid_t = args.batch_size // 2 if args.batch_size % 2 == 0 else (args.batch_size + 1) // 2
+     input_initial_val = [None, np.array([[grid_t, args.height // 14, args.width // 14]], dtype=np.int64)]
+     op_target = {"/vpm/patch_embed/proj/Conv_output_0_conv_tp_sw": 'cpu'}
+ elif modelname == 'qwen3-vl':
+     inputs = ['pixel', 'grid_thw']
+     input_size_list = [[args.batch_size, 3, args.height, args.width], [1, 3]]
+     grid_t = args.batch_size // 2 if args.batch_size % 2 == 0 else (args.batch_size + 1) // 2
+     input_initial_val = [None, np.array([[grid_t, args.height // 16, args.width // 16]], dtype=np.int64)]
+     op_target = None
+ else:
+     inputs = ['pixel']
+     input_size_list = [[args.batch_size, 3, args.height, args.width]]
+     input_initial_val = None
+     op_target = None
+
+ if modelname == 'deepseekocr':
+     disable_rules = ['convert_rs_add_rs_to_rs_gather_elements']
+ else:
+     disable_rules = []
+
+ rknn = RKNN(verbose=False)
+ rknn.config(disable_rules=disable_rules, target_platform=target_platform, mean_values=mean_value, std_values=std_value, op_target=op_target)
+ rknn.load_onnx(model_path, inputs=inputs, input_size_list=input_size_list, input_initial_val=input_initial_val)
+ rknn.build(do_quantization=False, dataset=None)
+ os.makedirs("rknn", exist_ok=True)
+ rknn.export_rknn("./rknn/" + os.path.splitext(os.path.basename(model_path))[0] + "_{}.rknn".format(target_platform))
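The Qwen branch's `mean_value`/`std_value` above are the standard CLIP image-normalization constants rescaled to the 0–255 input range RKNN applies `(pixel - mean) / std` over. A quick consistency check:

```python
# Standard CLIP normalization constants (0-1 range).
CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
CLIP_STD = [0.26862954, 0.26130258, 0.27577711]

# RKNN normalizes raw 0-255 pixels, so the constants are scaled by 255
# before being handed to rknn.config(mean_values=..., std_values=...).
mean_value = [[m * 255 for m in CLIP_MEAN]]
std_value = [[s * 255 for s in CLIP_STD]]

print([round(v, 3) for v in mean_value[0]])
```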
matrix.png ADDED