happyme531 committed
Commit 7fc4eb4 · verified · 1 Parent(s): 21d4ecf

Upload 11 files
.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ language_model_w8a8.rkllm filter=lfs diff=lfs merge=lfs -text
+ librkllmrt.so filter=lfs diff=lfs merge=lfs -text
+ test.jpg filter=lfs diff=lfs merge=lfs -text
+ vision_encoder.rknn filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,333 @@
- ---
- license: agpl-3.0
- ---
---
base_model:
- OpenGVLab/InternVL3_5-2B-HF
tags:
- rknn
- rkllm
- internvl
---
# InternVL3_5-2B-RKLLM

## (English README see below)

在RK3588上运行强大的InternVL3.5-2B视觉大模型!

- 推理速度(RK3588): 视觉编码器 2.1s(三核并行) + LLM 填充 1s (265 tokens / 261 tps) + 解码 12.1 tps
- 内存占用(RK3588, 上下文长度1024): 3.9GB

## 使用方法

1. 克隆或者下载此仓库到本地. 模型较大, 请确保有足够的磁盘空间.

2. 开发板的RKNPU2内核驱动版本必须>=0.9.6才能运行这么大的模型.
使用root权限运行以下命令检查驱动版本:
```bash
> cat /sys/kernel/debug/rknpu/version
RKNPU driver: v0.9.8
```
如果版本过低, 请更新驱动. 你可能需要更新内核, 或查找官方文档以获取帮助.

3. 安装依赖

```bash
pip install "numpy<2" opencv-python rknn-toolkit-lite2
```

4. 运行

```bash
python ./run_rkllm.py ./test.jpg ./vision_encoder.rknn ./language_model_w8a8.rkllm 512 1024 3
```

参数说明:
- `512`: max_new_tokens, 最大生成token数.
- `1024`: max_context_len, 最大上下文长度.
- `3`: npu_core_num, 使用的NPU核心数.

如果实测性能不理想, 可以调整CPU调度器让CPU始终运行在最高频率, 并把推理程序绑定到大核(`taskset -c 4-7 python ...`)

test.jpg:
![test.jpg](./test.jpg)

```
Initializing ONNX Runtime for vision encoder...
I rknn-toolkit2 version: 2.3.2
I target set by user is: rk3588
Vision encoder loaded successfully.
ONNX Input: pixel_values, ONNX Output: projected_features
Initializing RKLLM Runtime...
I rkllm: rkllm-runtime version: 1.2.2, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./language_model_w8a8.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: W8A8
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
RKLLM initialized successfully.
Preprocessing image...
Running vision encoder...
视觉编码器推理耗时: 2.0876 秒
Image encoded successfully.

**********************可输入以下问题对应序号获取回答/或自定义输入********************

[0] <image>What is in the image?
[1] <image>这张图片中有什么?

*************************************************************************


user: 0
<image>What is in the image?
robot: n_image_tokens: 256


This image depicts a cozy bedroom with a large window, several pieces of furniture, and various decorative items. The room has a vintage feel due to the wallpaper pattern and the wooden furniture.

The bed occupies the left side of the image, covered with a blue comforter or quilt. Next to the bed is a dresser with a round mirror above it. On top of the dresser are several small objects, including what appears to be a water bottle and some decorative items like plants.

In front of the window on the right side of the image, there is a chair with a checkered cushion. Behind this chair, there is a bookshelf filled with books and various other items, such as baskets and possibly some knick-knacks. The bookshelf has multiple levels, each holding an assortment of books and decorative objects.

The window allows natural light to enter the room, illuminating the space and highlighting the greenery outside. There are also potted plants placed around the room, adding a touch of nature and freshness to the interior decor.

Overall, this bedroom exudes a sense of comfort and personal style, with elements that suggest it is used regularly by someone who values both aesthetics and functionality in their living space.


I rkllm: --------------------------------------------------------------------------------------
I rkllm: Model init time (ms)  4314.30
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Stage      Total Time (ms)   Tokens   Time per Token (ms)   Tokens per Second
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Prefill    1013.32           265      3.82                  261.52
I rkllm: Generate   20155.65          244      82.61                 12.11
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Peak Memory Usage (GB)
I rkllm: 3.45
I rkllm: --------------------------------------------------------------------------------------

user: 1
<image>这张图片中有什么?
robot: n_image_tokens: 256


这是一间温馨的卧室,房间内有一扇大窗户、几件家具和各种装饰物品。房间因壁纸图案和木质家具而显得复古。

床位于图像左侧,覆盖着蓝色被套或毯子。床旁边是一个带有圆形镜子的抽屉柜。在抽屉柜上摆放着一些小物件,包括水瓶和一些装饰品,如植物。

窗户右侧前方有一把带格子坐垫的椅子。椅子后面是一排书架,上面摆满了书籍和其他物品,如篮子和可能的一些小饰品。书架有多层,每层都放着各种书籍和装饰物。

窗外可以看到绿树,自然光透过窗户照进房间,照亮了空间,并突出了外面的绿色植物。房间里还摆放了一些盆栽植物,为室内增添了自然的气息和清新感。

总体而言,这间卧室给人一种舒适和个性的感觉,表明它经常被居住者使用,居住者重视生活空间中的美学和功能性。

I rkllm: --------------------------------------------------------------------------------------
I rkllm: Stage      Total Time (ms)   Tokens   Time per Token (ms)   Tokens per Second
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Prefill    1287.65           264      4.88                  205.03
I rkllm: Generate   19852.10          204      97.31                 10.28
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Peak Memory Usage (GB)
I rkllm: 3.45
I rkllm: --------------------------------------------------------------------------------------

user: ^C
Exiting...
Releasing resources...
RKLLM instance destroyed.
```

## 模型转换

#### 准备工作

1. 安装rknn-toolkit2以及rkllm-toolkit:
```bash
pip install -U rknn-toolkit2
```
rkllm-toolkit需要在这里手动下载: https://github.com/airockchip/rknn-llm/tree/main/rkllm-toolkit

2. 下载此仓库到本地, 但不需要下载`.rkllm`和`.rknn`结尾的模型文件.
3. 下载InternVL3.5-2B的huggingface模型仓库到本地. ( https://huggingface.co/OpenGVLab/InternVL3_5-2B-HF )

#### 转换LLM

将`rkllm-convert.py`拷贝到InternVL3_5-2B-HF的模型文件夹中,执行:
```bash
python rkllm-convert.py
```
默认是w8a8量化的,你可以自行打开脚本修改量化方式等。

#### 转换视觉编码器

1. 导出ONNX

将`export_vision_onnx.py`拷贝到InternVL3_5-2B-HF的模型文件夹根目录中,然后**在该根目录**下执行:
```bash
python ./export_vision_onnx.py
```
视觉编码器会导出到`vision_encoder.onnx`.

2. 转换rknn

```bash
python ./convert_vision_encoder.py
```

## 已知问题

- 由于RKLLM的多模态输入的限制, 在整个对话中只能加载一张图片.
- 没有实现多轮对话.
- RKLLM的w8a8量化貌似存在不小的精度损失.
- 没有实现原模型中的高清图像分块输入与视频输入功能. 原因是我懒得做了,以后可以考虑加上.

## 参考

- [OpenGVLab/InternVL3_5-2B-HF](https://huggingface.co/OpenGVLab/InternVL3_5-2B-HF)

----

# English README

Run the powerful InternVL3.5-2B large vision model on RK3588!

- Inference Speed (RK3588): Vision Encoder 2.1s (3-core parallel) + LLM Prefill 1s (265 tokens / 261 tps) + Decode 12.1 tps
- Memory Usage (RK3588, context length 1024): 3.9GB

## How to Use

1. Clone or download this repository locally. The model is large, so ensure you have enough disk space.

2. The RKNPU2 kernel driver version on your development board must be >=0.9.6 to run a model this large. Run the following command with root privileges to check the driver version:
```bash
> cat /sys/kernel/debug/rknpu/version
RKNPU driver: v0.9.8
```
If the version is too low, please update the driver. You may need to update the kernel or refer to the official documentation for help.
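As an illustration (this helper is not part of the repo's scripts), the driver version string can also be checked programmatically before attempting to load the model:

```python
# Parse "RKNPU driver: vX.Y.Z" and compare against the 0.9.6 minimum.
def parse_version(text: str) -> tuple:
    # take everything after the last 'v', e.g. "0.9.8" -> (0, 9, 8)
    ver = text.strip().rsplit("v", 1)[-1]
    return tuple(int(p) for p in ver.split("."))

def driver_ok(text: str, minimum=(0, 9, 6)) -> bool:
    return parse_version(text) >= minimum

print(driver_ok("RKNPU driver: v0.9.8"))  # True
```

On the board itself, the input would come from reading `/sys/kernel/debug/rknpu/version` as root.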

3. Install dependencies:

```bash
pip install "numpy<2" opencv-python rknn-toolkit-lite2
```

4. Run:

```bash
python ./run_rkllm.py ./test.jpg ./vision_encoder.rknn ./language_model_w8a8.rkllm 512 1024 3
```

Parameter description:
- `512`: `max_new_tokens`, the maximum number of tokens to generate.
- `1024`: `max_context_len`, the maximum context length.
- `3`: `npu_core_num`, the number of NPU cores to use.
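The positional-argument layout of the command above can be sketched as follows; note this is a hypothetical reconstruction for illustration, and the real `run_rkllm.py` may parse its arguments differently:

```python
import argparse

# Hypothetical mirror of the CLI shown above: three paths followed by three integers.
parser = argparse.ArgumentParser()
parser.add_argument("image_path")
parser.add_argument("vision_encoder_path")
parser.add_argument("rkllm_model_path")
parser.add_argument("max_new_tokens", type=int)
parser.add_argument("max_context_len", type=int)
parser.add_argument("npu_core_num", type=int)

args = parser.parse_args([
    "./test.jpg", "./vision_encoder.rknn", "./language_model_w8a8.rkllm",
    "512", "1024", "3",
])
print(args.max_new_tokens, args.max_context_len, args.npu_core_num)  # 512 1024 3
```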

If the performance is not ideal, you can adjust the CPU scheduler to keep the CPU at its highest frequency and bind the inference program to the big cores (`taskset -c 4-7 python ...`).
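The same pinning can be done from inside Python with `os.sched_setaffinity` (Linux only). A small sketch, not part of `run_rkllm.py`; it pins to the upper half of whatever CPUs are available instead of hard-coding cores 4-7:

```python
import os

# On RK3588 the big cores are 4-7, i.e. the upper half of the 8-core system;
# this mirrors what `taskset -c 4-7` does from the shell.
available = sorted(os.sched_getaffinity(0))
big_cores = set(available[len(available) // 2:]) or set(available)
os.sched_setaffinity(0, big_cores)  # restrict this process to the chosen cores
print(sorted(os.sched_getaffinity(0)))
```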

Example with `test.jpg`:
![test.jpg](./test.jpg)

```
Initializing ONNX Runtime for vision encoder...
I rknn-toolkit2 version: 2.3.2
I target set by user is: rk3588
Vision encoder loaded successfully.
ONNX Input: pixel_values, ONNX Output: projected_features
Initializing RKLLM Runtime...
I rkllm: rkllm-runtime version: 1.2.2, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./language_model_w8a8.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: W8A8
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
RKLLM initialized successfully.
Preprocessing image...
Running vision encoder...
视觉编码器推理耗时: 2.0876 秒
Image encoded successfully.

**********************可输入以下问题对应序号获取回答/或自定义输入********************

[0] <image>What is in the image?
[1] <image>这张图片中有什么?

*************************************************************************


user: 0
<image>What is in the image?
robot: n_image_tokens: 256


This image depicts a cozy bedroom with a large window, several pieces of furniture, and various decorative items. The room has a vintage feel due to the wallpaper pattern and the wooden furniture.

The bed occupies the left side of the image, covered with a blue comforter or quilt. Next to the bed is a dresser with a round mirror above it. On top of the dresser are several small objects, including what appears to be a water bottle and some decorative items like plants.

In front of the window on the right side of the image, there is a chair with a checkered cushion. Behind this chair, there is a bookshelf filled with books and various other items, such as baskets and possibly some knick-knacks. The bookshelf has multiple levels, each holding an assortment of books and decorative objects.

The window allows natural light to enter the room, illuminating the space and highlighting the greenery outside. There are also potted plants placed around the room, adding a touch of nature and freshness to the interior decor.

Overall, this bedroom exudes a sense of comfort and personal style, with elements that suggest it is used regularly by someone who values both aesthetics and functionality in their living space.


I rkllm: --------------------------------------------------------------------------------------
I rkllm: Model init time (ms)  4314.30
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Stage      Total Time (ms)   Tokens   Time per Token (ms)   Tokens per Second
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Prefill    1013.32           265      3.82                  261.52
I rkllm: Generate   20155.65          244      82.61                 12.11
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Peak Memory Usage (GB)
I rkllm: 3.45
I rkllm: --------------------------------------------------------------------------------------

user: ^C
Exiting...
Releasing resources...
RKLLM instance destroyed.
```
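As a sanity check on the performance table above, the tokens-per-second figures follow directly from the total time and token count:

```python
# Reproduce the throughput numbers RKLLM reports (tokens / seconds).
def tokens_per_second(total_ms: float, tokens: int) -> float:
    return tokens / (total_ms / 1000.0)

print(round(tokens_per_second(1013.32, 265), 2))   # prefill  -> 261.52
print(round(tokens_per_second(20155.65, 244), 2))  # generate -> 12.11
```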

## Model Conversion

#### Prerequisites

1. Install `rknn-toolkit2` and `rkllm-toolkit`:
```bash
pip install -U rknn-toolkit2
```
`rkllm-toolkit` needs to be downloaded manually from here: https://github.com/airockchip/rknn-llm/tree/main/rkllm-toolkit

2. Download this repository locally, but you don't need the `.rkllm` and `.rknn` model files.
3. Download the InternVL3.5-2B huggingface model repository locally. ( https://huggingface.co/OpenGVLab/InternVL3_5-2B-HF )

#### Convert LLM

Copy `rkllm-convert.py` to the InternVL3_5-2B-HF model folder and run:
```bash
python rkllm-convert.py
```
The default quantization is w8a8; you can edit the script to change the quantization method and other settings.

#### Convert Vision Encoder

1. Export ONNX

Copy `export_vision_onnx.py` to the root directory of the InternVL3_5-2B-HF model folder, and then execute it **in that root directory**:
```bash
python ./export_vision_onnx.py
```
The vision encoder will be exported to `vision_encoder.onnx`.

2. Convert to RKNN

```bash
python ./convert_vision_encoder.py
```

## Known Issues

- Due to limitations in RKLLM's multimodal input, only one image can be loaded per conversation.
- Multi-turn conversation is not implemented.
- RKLLM's w8a8 quantization appears to introduce a noticeable accuracy loss.
- The high-resolution image tiling and video input features of the original model are not implemented; I haven't gotten around to them and may add them later.

## References

- [OpenGVLab/InternVL3_5-2B-HF](https://huggingface.co/OpenGVLab/InternVL3_5-2B-HF)
convert_vision_encoder.py ADDED
@@ -0,0 +1,52 @@
#!/usr/bin/env python3
# ztu_somemodelruntime_rknn2: vision_encoder

from rknn.api import RKNN

def main():
    # Create an RKNN instance
    rknn = RKNN(verbose=True)

    # Input ONNX model path
    ONNX_MODEL = "vision_encoder.onnx"
    # Output RKNN model path
    RKNN_MODEL = "vision_encoder.rknn"

    # Configure conversion parameters
    print("--> Config model")
    ret = rknn.config(target_platform="rk3588",
                      dynamic_input=None)
    if ret != 0:
        print('Config model failed!')
        exit(ret)

    # Load the ONNX model
    print("--> Loading model")
    ret = rknn.load_onnx(model=ONNX_MODEL,
                         inputs=['pixel_values'],
                         input_size_list=[[1, 3, 448, 448]])
    if ret != 0:
        print('Load model failed!')
        exit(ret)

    # Build the model
    print("--> Building model")
    ret = rknn.build(do_quantization=False)
    if ret != 0:
        print('Build model failed!')
        exit(ret)

    # Export the RKNN model
    print("--> Export RKNN model")
    ret = rknn.export_rknn(RKNN_MODEL)
    if ret != 0:
        print('Export RKNN model failed!')
        exit(ret)

    print(f'Done! The converted RKNN model has been saved to: {RKNN_MODEL}')
    rknn.release()

if __name__ == '__main__':
    main()
export_vision_onnx.py ADDED
@@ -0,0 +1,203 @@
import numpy as np
import torch
import torch.nn as nn
from transformers import AutoModel
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms import InterpolationMode
from transformers.modeling_utils import PreTrainedModel

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate the candidate tiling grids
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Paths for the local model and outputs
path = '.'
save_path = 'vision_encoder.onnx'
image_file = 'test.jpg'

def export_vision_InternVL(model_path: str, save_path: str):
    """
    Export the vision encoder and projector of the InternVL3_5-2B-HF model to ONNX format
    """
    # Use float32 as the default dtype
    torch.set_default_dtype(torch.float32)

    vl_gpt = AutoModel.from_pretrained(model_path, torch_dtype=torch.float32, trust_remote_code=True)

    # Move model to CPU and convert to float32
    vl_gpt = vl_gpt.cpu().eval().float()  # make sure the model is float32

    # Create a wrapper class for vision encoder + projector
    class VisionWrapper(nn.Module):
        def __init__(self, model: PreTrainedModel):
            super().__init__()
            self.vision_model = model

        def forward(self, pixel_values: torch.FloatTensor) -> torch.FloatTensor:
            # Delegate to the built-in helper so we stay consistent with Transformers' implementation.
            return self.vision_model.get_image_features(pixel_values=pixel_values)

    # Create wrapper instance and convert to float32
    vision_wrapper = VisionWrapper(vl_gpt)
    vision_wrapper.eval().float()  # make sure the wrapper is float32 as well

    # Create dummy input with float32
    batch_size = 1
    num_channels = 3
    height = 448  # InternVL default image size
    width = 448
    # dummy_input = load_image(image_file=image_file, max_num=12).to(torch.float32).cpu()
    dummy_input = torch.randn(batch_size, num_channels, height, width, dtype=torch.float32)
    # Export to ONNX with a recent opset version
    torch.onnx.export(
        vision_wrapper,
        dummy_input,
        save_path,
        export_params=True,
        opset_version=17,  # a recent opset is required for scaled_dot_product_attention
        do_constant_folding=True,
        input_names=['pixel_values'],
        output_names=['projected_features'],
        dynamic_axes={
            'pixel_values': {0: 'batch_size'},
            'projected_features': {0: 'batch_size'}
        },
        # extra export options
        # operator_export_type=torch.onnx.OperatorExportTypes.ONNX,
        # training=torch.onnx.TrainingMode.EVAL,
        dynamo=True,
        verbose=False
    )

    print(f"Successfully exported vision components to {save_path}")

    # Verify the exported model
    import onnxruntime

    # Create inference session
    ort_session = onnxruntime.InferenceSession(save_path)

    # Run inference with dummy input
    ort_inputs = {
        'pixel_values': dummy_input.numpy()
    }
    ort_outputs = ort_session.run(None, ort_inputs)

    # Compare with PyTorch output
    torch_output = vision_wrapper(dummy_input)

    # Check numerical accuracy with relaxed tolerances
    np.testing.assert_allclose(
        torch_output.detach().numpy(),
        ort_outputs[0],
        rtol=1e-1,  # relaxed relative tolerance
        atol=1e-2   # relaxed absolute tolerance
    )

    print("ONNX model verification successful!")

    # Print some validation statistics
    torch_output_np = torch_output.detach().numpy()
    onnx_output_np = ort_outputs[0]

    abs_diff = np.abs(torch_output_np - onnx_output_np)
    rel_diff = np.abs((torch_output_np - onnx_output_np) / (torch_output_np + 1e-7))

    print("\nValidation Statistics:")
    print(f"Max absolute difference: {np.max(abs_diff):.6f}")
    print(f"Mean absolute difference: {np.mean(abs_diff):.6f}")
    print(f"Max relative difference: {np.max(rel_diff):.6f}")
    print(f"Mean relative difference: {np.mean(rel_diff):.6f}")

if __name__ == "__main__":
    try:
        import onnx
        try:
            onnx_version = onnx.__version__
        except AttributeError:
            try:
                onnx_version = onnx.version.version
            except AttributeError:
                onnx_version = "Unknown"
        print(f"ONNX version: {onnx_version}")
    except ImportError:
        print("ONNX not installed")

    import onnxruntime
    print(f"ONNX Runtime version: {onnxruntime.__version__}")

    export_vision_InternVL(path, save_path)
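The tile-count behaviour of `dynamic_preprocess` above can be sanity-checked without PIL. The following standalone replica of its ratio search (a sketch for illustration, not part of the repo) computes how many 448x448 tiles a given image size would produce:

```python
# Standalone replica of the tiling logic in dynamic_preprocess: count the
# number of image_size x image_size tiles (plus the optional thumbnail).
def count_tiles(width, height, min_num=1, max_num=12, image_size=448, use_thumbnail=True):
    aspect_ratio = width / height
    target_ratios = sorted(
        {(i, j) for n in range(min_num, max_num + 1)
                for i in range(1, n + 1) for j in range(1, n + 1)
                if min_num <= i * j <= max_num},
        key=lambda x: x[0] * x[1])
    best_diff, best = float("inf"), (1, 1)
    area = width * height
    for ratio in target_ratios:
        diff = abs(aspect_ratio - ratio[0] / ratio[1])
        if diff < best_diff:
            best_diff, best = diff, ratio
        elif diff == best_diff and area > 0.5 * image_size ** 2 * ratio[0] * ratio[1]:
            best = ratio
    blocks = best[0] * best[1]
    return blocks + 1 if use_thumbnail and blocks != 1 else blocks

print(count_tiles(800, 600))   # 4:3 image -> 4x3 grid plus thumbnail
print(count_tiles(448, 448))   # square image -> single tile, no thumbnail
```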
language_model_w8a8.rkllm ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8ea6dbafda740b717233228a91cbb2377d0905a8b00edba9e17489da65e9834e
size 2375017292
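Note that the block above is a Git LFS pointer, not the model weights themselves; its `key value` lines can be parsed trivially, e.g.:

```python
# Parse the Git LFS pointer text shown above into a dict.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:8ea6dbafda740b717233228a91cbb2377d0905a8b00edba9e17489da65e9834e
size 2375017292
"""

fields = dict(line.split(" ", 1) for line in pointer_text.splitlines())
print(fields["size"])  # size in bytes of the real .rkllm file (~2.4 GB)
```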
librkllmrt.so ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:39d01912e67027de32c527be04684bf813e2a49c2d09ab8f6bcf47b34a43789d
size 7486400
rkllm-convert.py ADDED
@@ -0,0 +1,141 @@
import argparse
import shutil
from pathlib import Path
from typing import Dict

import torch
from safetensors.torch import load_file
from transformers import AutoConfig, AutoModelForCausalLM

from rkllm.api import RKLLM

TOKENIZER_FILES = [
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "added_tokens.json",
    "vocab.json",
    "merges.txt",
    "chat_template.jinja",
]


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--source",
        type=Path,
        default=".",
        help="Path to the InternVL (HF-format) checkpoint directory, e.g. /path/to/InternVL3_5-2B-HF",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default="llm/",
        help="Directory where the extracted Qwen3 checkpoint will be written",
    )
    parser.add_argument(
        "--safe-serialization",
        action="store_true",
        default=True,
        help="Save the exported model using safetensors instead of PyTorch binaries.",
    )
    return parser.parse_args()


def extract_text_state_dict(full_state: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    prefix = "language_model.model."
    lm_head_prefix = "language_model.lm_head."
    text_state: Dict[str, torch.Tensor] = {}

    for key, tensor in full_state.items():
        if key.startswith(prefix):
            text_key = "model." + key[len(prefix):]
        elif key.startswith(lm_head_prefix):
            text_key = "lm_head." + key[len(lm_head_prefix):]
        else:
            continue
        text_state[text_key] = tensor

    if not text_state:
        raise ValueError("Did not find any language_model weights in checkpoint; is this an InternVL model?")

    return text_state


def copy_tokenizer_files(source_dir: Path, output_dir: Path) -> None:
    for filename in TOKENIZER_FILES:
        src = source_dir / filename
        if src.exists():
            dst = output_dir / filename
            shutil.copyfile(src, dst)


def main() -> None:
    args = parse_args()
    source_dir = args.source.expanduser().resolve()
    output_dir = args.output.expanduser().resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    config = AutoConfig.from_pretrained(source_dir, trust_remote_code=True)
    text_config = config.text_config

    weights_path = source_dir / "model.safetensors"
    if not weights_path.exists():
        raise FileNotFoundError(f"Could not find {weights_path}; expected a safetensors checkpoint")

    all_weights = load_file(weights_path)
    text_state = extract_text_state_dict(all_weights)

    sample_tensor = next(iter(text_state.values()))
    target_dtype = sample_tensor.dtype

    text_model = AutoModelForCausalLM.from_config(text_config)
    text_model = text_model.to(dtype=target_dtype, device=torch.device("cpu"))
    missing, unexpected = text_model.load_state_dict(text_state, strict=False)
    if missing or unexpected:
        raise RuntimeError(
            "State dict mismatch when loading text weights: "
            f"missing={missing}, unexpected={unexpected}"
        )

    text_config.save_pretrained(output_dir)
    text_model.generation_config.save_pretrained(output_dir)
    text_model.save_pretrained(output_dir, safe_serialization=args.safe_serialization)

    copy_tokenizer_files(source_dir, output_dir)
    print(f"Exported Qwen3 model saved to {output_dir}")

    modelpath = output_dir
    llm = RKLLM()

    ret = llm.load_huggingface(model=modelpath, model_lora=None, device='cpu')
    if ret != 0:
        print('Load model failed!')
        exit(ret)

    qparams = None
    ret = llm.build(do_quantization=True, optimization_level=1, quantized_dtype='w8a8',
                    quantized_algorithm='normal', target_platform='rk3588', num_npu_core=3, extra_qparams=qparams)

    if ret != 0:
        print('Build model failed!')
        exit(ret)

    # Export the rkllm model
    ret = llm.export_rkllm("./language_model_w8a8.rkllm")
    if ret != 0:
        print('Export model failed!')
        exit(ret)


if __name__ == "__main__":
    main()
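The key remapping performed by `extract_text_state_dict` above can be demonstrated on dummy entries (a standalone sketch; the values stand in for tensors, since only the key handling matters):

```python
# The same prefix rewriting used by extract_text_state_dict, shown on dummy keys.
def remap_key(key):
    if key.startswith("language_model.model."):
        return "model." + key[len("language_model.model."):]
    if key.startswith("language_model.lm_head."):
        return "lm_head." + key[len("language_model.lm_head."):]
    return None  # non-LLM weight (e.g. vision tower): skipped

state = {
    "language_model.model.layers.0.self_attn.q_proj.weight": 1,
    "language_model.lm_head.weight": 2,
    "vision_tower.embeddings.patch_embedding.weight": 3,
}
text_state = {remap_key(k): v for k, v in state.items() if remap_key(k)}
print(sorted(text_state))  # ['lm_head.weight', 'model.layers.0.self_attn.q_proj.weight']
```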
rkllm_binding.py ADDED
@@ -0,0 +1,873 @@
import ctypes
import enum
import os

# Define constants from the header
CPU0 = (1 << 0)  # 0x01
CPU1 = (1 << 1)  # 0x02
CPU2 = (1 << 2)  # 0x04
CPU3 = (1 << 3)  # 0x08
CPU4 = (1 << 4)  # 0x10
CPU5 = (1 << 5)  # 0x20
CPU6 = (1 << 6)  # 0x40
CPU7 = (1 << 7)  # 0x80

# --- Enums ---
class LLMCallState(enum.IntEnum):
    RKLLM_RUN_NORMAL = 0
    RKLLM_RUN_WAITING = 1
    RKLLM_RUN_FINISH = 2
    RKLLM_RUN_ERROR = 3

class RKLLMInputType(enum.IntEnum):
    RKLLM_INPUT_PROMPT = 0
    RKLLM_INPUT_TOKEN = 1
    RKLLM_INPUT_EMBED = 2
    RKLLM_INPUT_MULTIMODAL = 3

class RKLLMInferMode(enum.IntEnum):
    RKLLM_INFER_GENERATE = 0
    RKLLM_INFER_GET_LAST_HIDDEN_LAYER = 1
    RKLLM_INFER_GET_LOGITS = 2

# --- Structures ---
class RKLLMExtendParam(ctypes.Structure):
    base_domain_id: ctypes.c_int32
    embed_flash: ctypes.c_int8
    enabled_cpus_num: ctypes.c_int8
    enabled_cpus_mask: ctypes.c_uint32
    n_batch: ctypes.c_uint8
    use_cross_attn: ctypes.c_int8
    reserved: ctypes.c_uint8 * 104

    _fields_ = [
        ("base_domain_id", ctypes.c_int32),      # Base domain ID
        ("embed_flash", ctypes.c_int8),          # Whether to look up word embeddings from flash (1 enable, 0 disable)
        ("enabled_cpus_num", ctypes.c_int8),     # Number of CPUs enabled for inference
        ("enabled_cpus_mask", ctypes.c_uint32),  # Bitmask indicating which CPUs are enabled
        ("n_batch", ctypes.c_uint8),             # Input samples processed concurrently in one forward pass; set > 1 to enable batch inference, default 1
        ("use_cross_attn", ctypes.c_int8),       # Whether cross-attention is enabled (non-zero enable, 0 disable)
        ("reserved", ctypes.c_uint8 * 104)       # Reserved field
    ]

class RKLLMParam(ctypes.Structure):
    model_path: ctypes.c_char_p
    max_context_len: ctypes.c_int32
    max_new_tokens: ctypes.c_int32
    top_k: ctypes.c_int32
    n_keep: ctypes.c_int32
    top_p: ctypes.c_float
    temperature: ctypes.c_float
    repeat_penalty: ctypes.c_float
    frequency_penalty: ctypes.c_float
    presence_penalty: ctypes.c_float
    mirostat: ctypes.c_int32
    mirostat_tau: ctypes.c_float
    mirostat_eta: ctypes.c_float
    skip_special_token: ctypes.c_bool
    is_async: ctypes.c_bool
    img_start: ctypes.c_char_p
    img_end: ctypes.c_char_p
    img_content: ctypes.c_char_p
    extend_param: RKLLMExtendParam

    _fields_ = [
        ("model_path", ctypes.c_char_p),        # Path to the model file
        ("max_context_len", ctypes.c_int32),    # Maximum number of tokens in the context window
        ("max_new_tokens", ctypes.c_int32),     # Maximum number of new tokens to generate
        ("top_k", ctypes.c_int32),              # Top-K sampling parameter
        ("n_keep", ctypes.c_int32),             # Number of KV-cache entries kept when the context window slides
        ("top_p", ctypes.c_float),              # Top-P (nucleus) sampling parameter
        ("temperature", ctypes.c_float),        # Sampling temperature; controls randomness of token selection
        ("repeat_penalty", ctypes.c_float),     # Penalty for repeated tokens
        ("frequency_penalty", ctypes.c_float),  # Penalty for frequent tokens
        ("presence_penalty", ctypes.c_float),   # Penalty for tokens already present in the input
        ("mirostat", ctypes.c_int32),           # Mirostat sampling strategy flag (0 disables it)
        ("mirostat_tau", ctypes.c_float),       # Mirostat sampling Tau parameter
        ("mirostat_eta", ctypes.c_float),       # Mirostat sampling Eta parameter
        ("skip_special_token", ctypes.c_bool),  # Whether to skip special tokens
        ("is_async", ctypes.c_bool),            # Whether inference runs asynchronously
        ("img_start", ctypes.c_char_p),         # Marker for the start of an image in multimodal input
        ("img_end", ctypes.c_char_p),           # Marker for the end of an image in multimodal input
        ("img_content", ctypes.c_char_p),       # Pointer to the image content
        ("extend_param", RKLLMExtendParam)      # Extended parameters
    ]
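As a quick illustration of how `enabled_cpus_mask` and `enabled_cpus_num` relate to the `CPU0`..`CPU7` bit constants defined at the top of this file, a mask pinning inference to four cores can be built like this. The assumption that CPU4–CPU7 are the RK3588's big (Cortex-A76) cores is conventional, not stated by the library itself:

```python
# Sketch: build a CPU affinity mask from per-core bit flags,
# as expected by RKLLMExtendParam.enabled_cpus_mask.
CPU4 = 1 << 4
CPU5 = 1 << 5
CPU6 = 1 << 6
CPU7 = 1 << 7

# On RK3588 the big cores are commonly CPU4-CPU7 (assumption).
big_core_mask = CPU4 | CPU5 | CPU6 | CPU7
enabled_num = bin(big_core_mask).count("1")  # number of set bits

print(hex(big_core_mask), enabled_num)
```

`big_core_mask` would then be assigned to `enabled_cpus_mask` and `enabled_num` to `enabled_cpus_num`.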

class RKLLMLoraAdapter(ctypes.Structure):
    lora_adapter_path: ctypes.c_char_p
    lora_adapter_name: ctypes.c_char_p
    scale: ctypes.c_float

    _fields_ = [
        ("lora_adapter_path", ctypes.c_char_p),
        ("lora_adapter_name", ctypes.c_char_p),
        ("scale", ctypes.c_float)
    ]

class RKLLMEmbedInput(ctypes.Structure):
    embed: ctypes.POINTER(ctypes.c_float)
    n_tokens: ctypes.c_size_t

    _fields_ = [
        ("embed", ctypes.POINTER(ctypes.c_float)),
        ("n_tokens", ctypes.c_size_t)
    ]

class RKLLMTokenInput(ctypes.Structure):
    input_ids: ctypes.POINTER(ctypes.c_int32)
    n_tokens: ctypes.c_size_t

    _fields_ = [
        ("input_ids", ctypes.POINTER(ctypes.c_int32)),
        ("n_tokens", ctypes.c_size_t)
    ]

class RKLLMMultiModelInput(ctypes.Structure):
    prompt: ctypes.c_char_p
    image_embed: ctypes.POINTER(ctypes.c_float)
    n_image_tokens: ctypes.c_size_t
    n_image: ctypes.c_size_t
    image_width: ctypes.c_size_t
    image_height: ctypes.c_size_t

    _fields_ = [
        ("prompt", ctypes.c_char_p),
        ("image_embed", ctypes.POINTER(ctypes.c_float)),
        ("n_image_tokens", ctypes.c_size_t),
        ("n_image", ctypes.c_size_t),
        ("image_width", ctypes.c_size_t),
        ("image_height", ctypes.c_size_t)
    ]

class RKLLMCrossAttnParam(ctypes.Structure):
    """
    Cross-attention parameter structure.

    Used when performing cross-attention in the decoder. It provides the
    encoder outputs (key/value caches), position indices, and the attention
    mask.

    - encoder_k_cache must be stored in contiguous memory with layout:
      [num_layers][num_tokens][num_kv_heads][head_dim]
    - encoder_v_cache must be stored in contiguous memory with layout:
      [num_layers][num_kv_heads][head_dim][num_tokens]
    """
    encoder_k_cache: ctypes.POINTER(ctypes.c_float)
    encoder_v_cache: ctypes.POINTER(ctypes.c_float)
    encoder_mask: ctypes.POINTER(ctypes.c_float)
    encoder_pos: ctypes.POINTER(ctypes.c_int32)
    num_tokens: ctypes.c_int

    _fields_ = [
        ("encoder_k_cache", ctypes.POINTER(ctypes.c_float)),  # Encoder key cache pointer (size: num_layers * num_tokens * num_kv_heads * head_dim)
        ("encoder_v_cache", ctypes.POINTER(ctypes.c_float)),  # Encoder value cache pointer (size: num_layers * num_kv_heads * head_dim * num_tokens)
        ("encoder_mask", ctypes.POINTER(ctypes.c_float)),     # Encoder attention mask pointer (array of size num_tokens)
        ("encoder_pos", ctypes.POINTER(ctypes.c_int32)),      # Encoder token position pointer (array of size num_tokens)
        ("num_tokens", ctypes.c_int)                          # Number of tokens in the encoder sequence
    ]

class RKLLMPerfStat(ctypes.Structure):
    """
    Performance statistics structure.

    Holds performance statistics for the prefill and generation stages.
    """
    prefill_time_ms: ctypes.c_float
    prefill_tokens: ctypes.c_int
    generate_time_ms: ctypes.c_float
    generate_tokens: ctypes.c_int
    memory_usage_mb: ctypes.c_float

    _fields_ = [
        ("prefill_time_ms", ctypes.c_float),   # Total prefill time (ms)
        ("prefill_tokens", ctypes.c_int),      # Number of tokens processed during prefill
        ("generate_time_ms", ctypes.c_float),  # Total generation time (ms)
        ("generate_tokens", ctypes.c_int),     # Number of tokens generated
        ("memory_usage_mb", ctypes.c_float)    # VmHWM resident memory usage during inference (MB)
    ]
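The perf fields above convert directly into throughput figures. A small helper along these lines (hypothetical, not part of the C API) turns a token count and elapsed milliseconds into tokens per second:

```python
def tokens_per_second(tokens: int, time_ms: float) -> float:
    """Convert a token count and elapsed milliseconds into tokens/s."""
    if time_ms <= 0:
        return 0.0
    return tokens * 1000.0 / time_ms

# 121 generated tokens over 10 s works out to 12.1 tokens/s,
# the same kind of decode figure quoted in this repo's README.
decode_tps = tokens_per_second(121, 10000.0)
```

In a real callback this would be fed `perf.generate_tokens` and `perf.generate_time_ms` from an `RKLLMPerfStat`.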

class _RKLLMInputUnion(ctypes.Union):
    prompt_input: ctypes.c_char_p
    embed_input: RKLLMEmbedInput
    token_input: RKLLMTokenInput
    multimodal_input: RKLLMMultiModelInput

    _fields_ = [
        ("prompt_input", ctypes.c_char_p),
        ("embed_input", RKLLMEmbedInput),
        ("token_input", RKLLMTokenInput),
        ("multimodal_input", RKLLMMultiModelInput)
    ]

class RKLLMInput(ctypes.Structure):
    """
    LLM input structure.

    Represents the different kinds of LLM input through a union.
    """
    role: ctypes.c_char_p
    enable_thinking: ctypes.c_bool
    input_type: ctypes.c_int
    _union_data: _RKLLMInputUnion

    _fields_ = [
        ("role", ctypes.c_char_p),           # Message role: "user" (user input) or "tool" (function result)
        ("enable_thinking", ctypes.c_bool),  # Controls whether "thinking mode" is enabled for Qwen3 models
        ("input_type", ctypes.c_int),        # Enum specifying the input type (prompt, token, embed, multimodal)
        ("_union_data", _RKLLMInputUnion)    # Union payload
    ]

    # Properties to make accessing union members easier
    @property
    def prompt_input(self) -> bytes:  # c_char_p maps to bytes
        if self.input_type == RKLLMInputType.RKLLM_INPUT_PROMPT:
            return self._union_data.prompt_input
        raise AttributeError("Not a prompt input")

    @prompt_input.setter
    def prompt_input(self, value: bytes):  # c_char_p maps to bytes
        if self.input_type == RKLLMInputType.RKLLM_INPUT_PROMPT:
            self._union_data.prompt_input = value
        else:
            raise AttributeError("Not a prompt input")

    @property
    def embed_input(self) -> RKLLMEmbedInput:
        if self.input_type == RKLLMInputType.RKLLM_INPUT_EMBED:
            return self._union_data.embed_input
        raise AttributeError("Not an embed input")

    @embed_input.setter
    def embed_input(self, value: RKLLMEmbedInput):
        if self.input_type == RKLLMInputType.RKLLM_INPUT_EMBED:
            self._union_data.embed_input = value
        else:
            raise AttributeError("Not an embed input")

    @property
    def token_input(self) -> RKLLMTokenInput:
        if self.input_type == RKLLMInputType.RKLLM_INPUT_TOKEN:
            return self._union_data.token_input
        raise AttributeError("Not a token input")

    @token_input.setter
    def token_input(self, value: RKLLMTokenInput):
        if self.input_type == RKLLMInputType.RKLLM_INPUT_TOKEN:
            self._union_data.token_input = value
        else:
            raise AttributeError("Not a token input")

    @property
    def multimodal_input(self) -> RKLLMMultiModelInput:
        if self.input_type == RKLLMInputType.RKLLM_INPUT_MULTIMODAL:
            return self._union_data.multimodal_input
        raise AttributeError("Not a multimodal input")

    @multimodal_input.setter
    def multimodal_input(self, value: RKLLMMultiModelInput):
        if self.input_type == RKLLMInputType.RKLLM_INPUT_MULTIMODAL:
            self._union_data.multimodal_input = value
        else:
            raise AttributeError("Not a multimodal input")

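The `_union_data` pattern above relies on how `ctypes.Union` lays out memory: all members share the same storage, so the union occupies only as much space as its largest member, and only one input type can be carried at a time. A minimal standalone demonstration (simplified member types, not the real RKLLM structs):

```python
import ctypes

class Small(ctypes.Structure):
    _fields_ = [("a", ctypes.c_int32)]

class Big(ctypes.Structure):
    _fields_ = [("p", ctypes.POINTER(ctypes.c_float)),
                ("n", ctypes.c_size_t)]

class Payload(ctypes.Union):
    # All members overlap in memory; writing one overwrites the others.
    _fields_ = [("small", Small), ("big", Big)]

# The union is exactly as large as its largest member.
print(ctypes.sizeof(Small), ctypes.sizeof(Big), ctypes.sizeof(Payload))
```

This is why `RKLLMInput` pairs the union with an `input_type` discriminant: the union itself cannot tell you which member is currently valid.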
class RKLLMLoraParam(ctypes.Structure):  # For inference
    lora_adapter_name: ctypes.c_char_p

    _fields_ = [
        ("lora_adapter_name", ctypes.c_char_p)
    ]

class RKLLMPromptCacheParam(ctypes.Structure):  # For inference
    save_prompt_cache: ctypes.c_int  # bool-like
    prompt_cache_path: ctypes.c_char_p

    _fields_ = [
        ("save_prompt_cache", ctypes.c_int),  # bool-like
        ("prompt_cache_path", ctypes.c_char_p)
    ]

class RKLLMInferParam(ctypes.Structure):
    mode: ctypes.c_int
    lora_params: ctypes.POINTER(RKLLMLoraParam)
    prompt_cache_params: ctypes.POINTER(RKLLMPromptCacheParam)
    keep_history: ctypes.c_int  # bool-like

    _fields_ = [
        ("mode", ctypes.c_int),  # RKLLMInferMode enum value, passed as a C int
        ("lora_params", ctypes.POINTER(RKLLMLoraParam)),
        ("prompt_cache_params", ctypes.POINTER(RKLLMPromptCacheParam)),
        ("keep_history", ctypes.c_int)  # bool-like
    ]

class RKLLMResultLastHiddenLayer(ctypes.Structure):
    hidden_states: ctypes.POINTER(ctypes.c_float)
    embd_size: ctypes.c_int
    num_tokens: ctypes.c_int

    _fields_ = [
        ("hidden_states", ctypes.POINTER(ctypes.c_float)),
        ("embd_size", ctypes.c_int),
        ("num_tokens", ctypes.c_int)
    ]

class RKLLMResultLogits(ctypes.Structure):
    logits: ctypes.POINTER(ctypes.c_float)
    vocab_size: ctypes.c_int
    num_tokens: ctypes.c_int

    _fields_ = [
        ("logits", ctypes.POINTER(ctypes.c_float)),
        ("vocab_size", ctypes.c_int),
        ("num_tokens", ctypes.c_int)
    ]

class RKLLMResult(ctypes.Structure):
    """
    LLM inference result structure.

    Represents the result of an LLM inference call: the generated text,
    token ID, hidden states, logits, and performance statistics.
    """
    text: ctypes.c_char_p
    token_id: ctypes.c_int32
    last_hidden_layer: RKLLMResultLastHiddenLayer
    logits: RKLLMResultLogits
    perf: RKLLMPerfStat

    _fields_ = [
        ("text", ctypes.c_char_p),                          # Generated text result
        ("token_id", ctypes.c_int32),                       # Generated token ID
        ("last_hidden_layer", RKLLMResultLastHiddenLayer),  # Hidden states of the last layer (if requested)
        ("logits", RKLLMResultLogits),                      # Logits output by the model
        ("perf", RKLLMPerfStat)                             # Performance statistics (prefill and generation)
    ]

# --- Typedefs ---
LLMHandle = ctypes.c_void_p

# --- Callback Function Type ---
LLMResultCallback = ctypes.CFUNCTYPE(
    ctypes.c_int,                 # Return type: int, indicating the handling status
    ctypes.POINTER(RKLLMResult),  # Pointer to the LLM result
    ctypes.c_void_p,              # User data pointer
    ctypes.c_int                  # LLM call state (LLMCallState enum value)
)
"""
Callback function type.

Callback used to handle LLM results.

Parameters:
- result: pointer to the LLM result
- userdata: user data pointer for the callback
- state: LLM call state (e.g. finished, error)

Return value:
- 0: continue inference normally
- 1: pause inference. If the user wants to modify or intervene in the result
  (e.g. edit the output, inject a new prompt), return 1 to pause the current
  inference. Later, call rkllm_run with the updated content to resume.
"""
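Because `LLMResultCallback` objects are C function pointers created by `ctypes.CFUNCTYPE`, the Python-side wrapper must stay referenced for as long as the C library might invoke it (the `RKLLMRuntime` class below stores it in `self._c_callback` for exactly this reason). A CFUNCTYPE wrapper is also directly callable from Python, which makes it easy to smoke-test. A self-contained sketch with a simplified signature (plain `void*` instead of `POINTER(RKLLMResult)`):

```python
import ctypes

# Simplified signature: int cb(void* result, void* userdata, int state)
SimpleCallback = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p,
                                  ctypes.c_void_p, ctypes.c_int)

def py_cb(result, userdata, state):
    # Per the contract above: return 0 to continue, 1 to pause inference.
    return 0

c_cb = SimpleCallback(py_cb)  # keep this object alive while the C side may call it
rc = c_cb(None, None, 2)      # also callable from Python for testing
```

If the wrapper object is garbage-collected while the native side still holds the pointer, the process typically crashes, so holding a reference is not optional.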

class RKLLMRuntime:
    def __init__(self, library_path="./librkllmrt.so"):
        try:
            self.lib = ctypes.CDLL(library_path)
        except OSError as e:
            raise OSError(f"Failed to load RKLLM library from {library_path}. "
                          f"Ensure it's in your LD_LIBRARY_PATH or provide the full path. Error: {e}")
        self._setup_functions()
        self.llm_handle = LLMHandle()
        self._c_callback = None  # To keep the callback object alive

    def _setup_functions(self):
        # RKLLMParam rkllm_createDefaultParam();
        self.lib.rkllm_createDefaultParam.restype = RKLLMParam
        self.lib.rkllm_createDefaultParam.argtypes = []

        # int rkllm_init(LLMHandle* handle, RKLLMParam* param, LLMResultCallback callback);
        self.lib.rkllm_init.restype = ctypes.c_int
        self.lib.rkllm_init.argtypes = [
            ctypes.POINTER(LLMHandle),
            ctypes.POINTER(RKLLMParam),
            LLMResultCallback
        ]

        # int rkllm_load_lora(LLMHandle handle, RKLLMLoraAdapter* lora_adapter);
        self.lib.rkllm_load_lora.restype = ctypes.c_int
        self.lib.rkllm_load_lora.argtypes = [LLMHandle, ctypes.POINTER(RKLLMLoraAdapter)]

        # int rkllm_load_prompt_cache(LLMHandle handle, const char* prompt_cache_path);
        self.lib.rkllm_load_prompt_cache.restype = ctypes.c_int
        self.lib.rkllm_load_prompt_cache.argtypes = [LLMHandle, ctypes.c_char_p]

        # int rkllm_release_prompt_cache(LLMHandle handle);
        self.lib.rkllm_release_prompt_cache.restype = ctypes.c_int
        self.lib.rkllm_release_prompt_cache.argtypes = [LLMHandle]

        # int rkllm_destroy(LLMHandle handle);
        self.lib.rkllm_destroy.restype = ctypes.c_int
        self.lib.rkllm_destroy.argtypes = [LLMHandle]

        # int rkllm_run(LLMHandle handle, RKLLMInput* rkllm_input, RKLLMInferParam* rkllm_infer_params, void* userdata);
        self.lib.rkllm_run.restype = ctypes.c_int
        self.lib.rkllm_run.argtypes = [
            LLMHandle,
            ctypes.POINTER(RKLLMInput),
            ctypes.POINTER(RKLLMInferParam),
            ctypes.c_void_p  # userdata
        ]

        # int rkllm_run_async(LLMHandle handle, RKLLMInput* rkllm_input, RKLLMInferParam* rkllm_infer_params, void* userdata);
        # Assuming async also takes userdata for the callback context
        self.lib.rkllm_run_async.restype = ctypes.c_int
        self.lib.rkllm_run_async.argtypes = [
            LLMHandle,
            ctypes.POINTER(RKLLMInput),
            ctypes.POINTER(RKLLMInferParam),
            ctypes.c_void_p  # userdata
        ]

        # int rkllm_abort(LLMHandle handle);
        self.lib.rkllm_abort.restype = ctypes.c_int
        self.lib.rkllm_abort.argtypes = [LLMHandle]

        # int rkllm_is_running(LLMHandle handle);
        self.lib.rkllm_is_running.restype = ctypes.c_int  # 0 if running, non-zero otherwise
        self.lib.rkllm_is_running.argtypes = [LLMHandle]

        # int rkllm_clear_kv_cache(LLMHandle handle, int keep_system_prompt, int* start_pos, int* end_pos);
        self.lib.rkllm_clear_kv_cache.restype = ctypes.c_int
        self.lib.rkllm_clear_kv_cache.argtypes = [
            LLMHandle,
            ctypes.c_int,
            ctypes.POINTER(ctypes.c_int),  # start_pos
            ctypes.POINTER(ctypes.c_int)   # end_pos
        ]

        # int rkllm_get_kv_cache_size(LLMHandle handle, int* cache_sizes);
        self.lib.rkllm_get_kv_cache_size.restype = ctypes.c_int
        self.lib.rkllm_get_kv_cache_size.argtypes = [LLMHandle, ctypes.POINTER(ctypes.c_int)]

        # int rkllm_set_chat_template(LLMHandle handle, const char* system_prompt, const char* prompt_prefix, const char* prompt_postfix);
        self.lib.rkllm_set_chat_template.restype = ctypes.c_int
        self.lib.rkllm_set_chat_template.argtypes = [
            LLMHandle,
            ctypes.c_char_p,
            ctypes.c_char_p,
            ctypes.c_char_p
        ]

        # int rkllm_set_function_tools(LLMHandle handle, const char* system_prompt, const char* tools, const char* tool_response_str);
        self.lib.rkllm_set_function_tools.restype = ctypes.c_int
        self.lib.rkllm_set_function_tools.argtypes = [
            LLMHandle,
            ctypes.c_char_p,  # system_prompt
            ctypes.c_char_p,  # tools
            ctypes.c_char_p   # tool_response_str
        ]

        # int rkllm_set_cross_attn_params(LLMHandle handle, RKLLMCrossAttnParam* cross_attn_params);
        self.lib.rkllm_set_cross_attn_params.restype = ctypes.c_int
        self.lib.rkllm_set_cross_attn_params.argtypes = [LLMHandle, ctypes.POINTER(RKLLMCrossAttnParam)]

    def create_default_param(self) -> RKLLMParam:
        """Creates a default RKLLMParam structure."""
        return self.lib.rkllm_createDefaultParam()

    def init(self, param: RKLLMParam, callback_func) -> int:
        """
        Initializes the LLM.
        :param param: RKLLMParam structure.
        :param callback_func: A Python function that matches the signature:
                              def my_callback(result_ptr, userdata_ptr, state_enum):
                                  result = result_ptr.contents  # RKLLMResult
                                  # Process result
                                  # userdata can be retrieved if passed during run, or ignored
                                  # state = LLMCallState(state_enum)
        :return: 0 for success, non-zero for failure.
        """
        if not callable(callback_func):
            raise ValueError("callback_func must be a callable Python function.")

        # Keep a reference to the ctypes callback object to prevent it from being garbage collected
        self._c_callback = LLMResultCallback(callback_func)

        ret = self.lib.rkllm_init(ctypes.byref(self.llm_handle), ctypes.byref(param), self._c_callback)
        if ret != 0:
            raise RuntimeError(f"rkllm_init failed with error code {ret}")
        return ret

    def load_lora(self, lora_adapter: RKLLMLoraAdapter) -> int:
        """Loads a Lora adapter."""
        ret = self.lib.rkllm_load_lora(self.llm_handle, ctypes.byref(lora_adapter))
        if ret != 0:
            raise RuntimeError(f"rkllm_load_lora failed with error code {ret}")
        return ret

    def load_prompt_cache(self, prompt_cache_path: str) -> int:
        """Loads a prompt cache from a file."""
        c_path = prompt_cache_path.encode('utf-8')
        ret = self.lib.rkllm_load_prompt_cache(self.llm_handle, c_path)
        if ret != 0:
            raise RuntimeError(f"rkllm_load_prompt_cache failed for {prompt_cache_path} with error code {ret}")
        return ret

    def release_prompt_cache(self) -> int:
        """Releases the prompt cache from memory."""
        ret = self.lib.rkllm_release_prompt_cache(self.llm_handle)
        if ret != 0:
            raise RuntimeError(f"rkllm_release_prompt_cache failed with error code {ret}")
        return ret

    def destroy(self) -> int:
        """Destroys the LLM instance and releases resources."""
        if self.llm_handle and self.llm_handle.value:  # Check if handle is not NULL
            ret = self.lib.rkllm_destroy(self.llm_handle)
            self.llm_handle = LLMHandle()  # Reset handle
            if ret != 0:
                # Don't raise here as it might be called in __del__
                print(f"Warning: rkllm_destroy failed with error code {ret}")
            return ret
        return 0  # Already destroyed or not initialized

    def run(self, rkllm_input: RKLLMInput, rkllm_infer_params: RKLLMInferParam, userdata=None) -> int:
        """Runs an LLM inference task synchronously."""
        # userdata can be a ctypes.py_object if you want to pass Python objects,
        # then cast to c_void_p. Or simply None.
        if userdata is not None:
            # Store the userdata object to keep it alive during the call
            self._userdata_ref = userdata
            c_userdata = ctypes.cast(ctypes.pointer(ctypes.py_object(userdata)), ctypes.c_void_p)
        else:
            c_userdata = None
        ret = self.lib.rkllm_run(self.llm_handle, ctypes.byref(rkllm_input), ctypes.byref(rkllm_infer_params), c_userdata)
        if ret != 0:
            raise RuntimeError(f"rkllm_run failed with error code {ret}")
        return ret

    def run_async(self, rkllm_input: RKLLMInput, rkllm_infer_params: RKLLMInferParam, userdata=None) -> int:
        """Runs an LLM inference task asynchronously."""
        if userdata is not None:
            # Store the userdata object to keep it alive during the call
            self._userdata_ref = userdata
            c_userdata = ctypes.cast(ctypes.pointer(ctypes.py_object(userdata)), ctypes.c_void_p)
        else:
            c_userdata = None
        ret = self.lib.rkllm_run_async(self.llm_handle, ctypes.byref(rkllm_input), ctypes.byref(rkllm_infer_params), c_userdata)
        if ret != 0:
            raise RuntimeError(f"rkllm_run_async failed with error code {ret}")
        return ret

    def abort(self) -> int:
        """Aborts an ongoing LLM task."""
        ret = self.lib.rkllm_abort(self.llm_handle)
        if ret != 0:
            raise RuntimeError(f"rkllm_abort failed with error code {ret}")
        return ret

    def is_running(self) -> bool:
        """Checks if an LLM task is currently running. Returns True if running."""
        # The C API returns 0 if running, non-zero otherwise.
        # This is a bit counter-intuitive for a boolean "is_running".
        return self.lib.rkllm_is_running(self.llm_handle) == 0

    def clear_kv_cache(self, keep_system_prompt: bool, start_pos: list = None, end_pos: list = None) -> int:
        """
        Clears part or all of the key/value cache.

        Parameters:
        - keep_system_prompt: whether to keep the system prompt in the cache
          (True keeps it, False clears it). Ignored when an explicit range
          [start_pos, end_pos) is provided.
        - start_pos: array of start positions (inclusive) of the KV-cache
          ranges to clear, one per batch.
        - end_pos: array of end positions (exclusive) of the KV-cache
          ranges to clear, one per batch.
          If both start_pos and end_pos are None, the entire cache is cleared
          and keep_system_prompt takes effect. If start_pos[i] < end_pos[i],
          only the specified range is cleared and keep_system_prompt is ignored.

        Note: start_pos/end_pos only take effect when keep_history == 0 and
        generation has been paused by returning 1 from the callback.

        Returns 0 if the cache was cleared successfully, non-zero on failure.
        """
        # Prepare the C array arguments
        c_start_pos = None
        c_end_pos = None

        if start_pos is not None and end_pos is not None:
            if len(start_pos) != len(end_pos):
                raise ValueError("start_pos and end_pos must have the same length")

            # Create the C arrays
            c_start_pos = (ctypes.c_int * len(start_pos))(*start_pos)
            c_end_pos = (ctypes.c_int * len(end_pos))(*end_pos)

        ret = self.lib.rkllm_clear_kv_cache(
            self.llm_handle,
            ctypes.c_int(1 if keep_system_prompt else 0),
            c_start_pos,
            c_end_pos
        )
        if ret != 0:
            raise RuntimeError(f"rkllm_clear_kv_cache failed with error code {ret}")
        return ret

    def set_chat_template(self, system_prompt: str, prompt_prefix: str, prompt_postfix: str) -> int:
        """Sets the chat template for the LLM."""
        c_system = system_prompt.encode('utf-8') if system_prompt else b""
        c_prefix = prompt_prefix.encode('utf-8') if prompt_prefix else b""
        c_postfix = prompt_postfix.encode('utf-8') if prompt_postfix else b""

        ret = self.lib.rkllm_set_chat_template(self.llm_handle, c_system, c_prefix, c_postfix)
        if ret != 0:
            raise RuntimeError(f"rkllm_set_chat_template failed with error code {ret}")
        return ret

    def get_kv_cache_size(self, n_batch: int) -> list:
        """
        Gets the current size of the key/value cache for this LLM handle.

        Returns the total number of positions currently stored in the model's
        KV cache.

        Parameters:
        - n_batch: number of batches; determines the size of the returned list.

        Returns:
        - list: cache size for each batch.
        """
        # Pre-allocate an array to hold the cache size for each batch
        cache_sizes = (ctypes.c_int * n_batch)()

        ret = self.lib.rkllm_get_kv_cache_size(self.llm_handle, cache_sizes)
        if ret != 0:
            raise RuntimeError(f"rkllm_get_kv_cache_size failed with error code {ret}")

        # Convert to a Python list
        return [cache_sizes[i] for i in range(n_batch)]

    def set_function_tools(self, system_prompt: str, tools: str, tool_response_str: str) -> int:
        """
        Sets the function-calling configuration for the LLM: the system prompt,
        the tool definitions, and the tool-response token.

        Parameters:
        - system_prompt: system prompt defining the model's context or behavior.
        - tools: JSON-formatted string defining the available functions,
          including their names, descriptions, and parameters.
        - tool_response_str: unique tag identifying function-call results in the
          conversation. It acts as a marker tag that lets the tokenizer
          recognize tool output separately from normal conversation turns.

        Returns 0 if the configuration was set successfully, non-zero on error.
        """
        c_system = system_prompt.encode('utf-8') if system_prompt else b""
        c_tools = tools.encode('utf-8') if tools else b""
        c_tool_response = tool_response_str.encode('utf-8') if tool_response_str else b""

        ret = self.lib.rkllm_set_function_tools(self.llm_handle, c_system, c_tools, c_tool_response)
        if ret != 0:
            raise RuntimeError(f"rkllm_set_function_tools failed with error code {ret}")
        return ret

    def set_cross_attn_params(self, cross_attn_params: RKLLMCrossAttnParam) -> int:
        """
        Sets the cross-attention parameters for the LLM decoder.

        Parameters:
        - cross_attn_params: structure containing the encoder-side input data
          used for cross-attention (see RKLLMCrossAttnParam for details).

        Returns 0 if the parameters were set successfully, non-zero on error.
        """
        ret = self.lib.rkllm_set_cross_attn_params(self.llm_handle, ctypes.byref(cross_attn_params))
        if ret != 0:
            raise RuntimeError(f"rkllm_set_cross_attn_params failed with error code {ret}")
        return ret

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.destroy()

    def __del__(self):
        self.destroy()  # Ensure resources are freed if object is garbage collected

# --- Example Usage (Illustrative) ---
if __name__ == "__main__":
    # This is a placeholder for how you might use it.
    # You'll need a valid .rkllm model and librkllmrt.so in your path.

    # Global list to store results from callback for demonstration
    results_buffer = []

    def my_python_callback(result_ptr, userdata_ptr, state_enum):
        """
        Callback invoked by the C library to handle LLM results.

        Parameters:
        - result_ptr: pointer to the LLM result
        - userdata_ptr: user data pointer
        - state_enum: LLM call state enum value

        Returns:
        - 0: continue inference
        - 1: pause inference
        """
        global results_buffer
        state = LLMCallState(state_enum)
        result = result_ptr.contents

        current_text = ""
        if result.text:  # Check that the char_p is not NULL
            current_text = result.text.decode('utf-8', errors='ignore')

        print(f"Callback: State={state.name}, TokenID={result.token_id}, Text='{current_text}'")

        # Show performance statistics
        if result.perf.prefill_tokens > 0 or result.perf.generate_tokens > 0:
            print(f"  Perf stats: prefill={result.perf.prefill_tokens}tokens/{result.perf.prefill_time_ms:.1f}ms, "
                  f"generate={result.perf.generate_tokens}tokens/{result.perf.generate_time_ms:.1f}ms, "
                  f"memory={result.perf.memory_usage_mb:.1f}MB")

        results_buffer.append(current_text)

        if state == LLMCallState.RKLLM_RUN_FINISH:
            print("Inference finished.")
        elif state == LLMCallState.RKLLM_RUN_ERROR:
            print("Inference error.")

        # Return 0 to continue inference, 1 to pause it
        return 0

    # --- Attempt to use the wrapper ---
    try:
        print("Initializing RKLLMRuntime...")
        # Adjust library_path if librkllmrt.so is not in default search paths
        # e.g., library_path="./path/to/librkllmrt.so"
        rk_llm = RKLLMRuntime()

        print("Creating default parameters...")
        params = rk_llm.create_default_param()

        # --- Configure parameters ---
        # THIS IS CRITICAL: model_path must point to an actual .rkllm file
        # For this example to run, you need a model file.
        # Let's assume a dummy path for now; this will fail at init if not valid.
        model_file = "dummy_model.rkllm"
        if not os.path.exists(model_file):
            print(f"Warning: Model file '{model_file}' does not exist. Init will likely fail.")
            # Create a dummy file for the example to proceed further, though init will still fail
            # with a real library unless it's a valid model.
            with open(model_file, "w") as f:
                f.write("dummy content")

        params.model_path = model_file.encode('utf-8')
        params.max_context_len = 512
        params.max_new_tokens = 128
        params.top_k = 1  # Greedy
        params.temperature = 0.7
        params.repeat_penalty = 1.1
        # ... set other params as needed

        print(f"Initializing LLM with model: {params.model_path.decode()}...")
        # This will likely fail if dummy_model.rkllm is not a valid model recognized by the library
        try:
            rk_llm.init(params, my_python_callback)
            print("LLM Initialized.")
        except RuntimeError as e:
            print(f"Error during LLM initialization: {e}")
            print("This is expected if 'dummy_model.rkllm' is not a valid model.")
            print("Replace 'dummy_model.rkllm' with a real model path to test further.")
            exit()

        # --- Prepare input ---
        print("Preparing input...")
        rk_input = RKLLMInput()
        rk_input.role = b"user"  # Set the message role to user input
        rk_input.enable_thinking = False  # Disable thinking mode (for Qwen3 models)
        rk_input.input_type = RKLLMInputType.RKLLM_INPUT_PROMPT

        # Demo prompt (in Chinese): "Translate the following English text into Chinese: 'Hello, world!'"
        prompt_text = "将以下英文文本翻译成中文:'Hello, world!'"
        c_prompt = prompt_text.encode('utf-8')
        rk_input._union_data.prompt_input = c_prompt  # Access the union member directly

        # --- Prepare inference parameters ---
        print("Preparing inference parameters...")
        infer_params = RKLLMInferParam()
        infer_params.mode = RKLLMInferMode.RKLLM_INFER_GENERATE
        infer_params.keep_history = 1  # True
        # infer_params.lora_params = None  # or set up RKLLMLoraParam if using LoRA
        # infer_params.prompt_cache_params = None  # or set up RKLLMPromptCacheParam

        # --- Run inference ---
        print(f"Running inference with prompt: '{prompt_text}'")
        results_buffer.clear()
        try:
            rk_llm.run(rk_input, infer_params)  # Userdata is None by default
            print("\n--- Full Response ---")
            print("".join(results_buffer))
            print("---------------------\n")
        except RuntimeError as e:
            print(f"Error during LLM run: {e}")

        # --- Example: Set chat template (if model supports it) ---
        # print("Setting chat template...")
        # try:
        #     rk_llm.set_chat_template("You are a helpful assistant.", "<user>: ", "<assistant>: ")
        #     print("Chat template set.")
        # except RuntimeError as e:
        #     print(f"Error setting chat template: {e}")

        # --- Example: Clear KV Cache ---
        # print("Clearing KV cache (keeping system prompt if any)...")
        # try:
        #     rk_llm.clear_kv_cache(keep_system_prompt=True)
        #     print("KV cache cleared.")
        # except RuntimeError as e:
        #     print(f"Error clearing KV cache: {e}")

        # --- Example: Get KV cache size ---
        # print("Getting KV cache size...")
        # try:
        #     cache_sizes = rk_llm.get_kv_cache_size(n_batch=1)  # Assuming batch size 1
        #     print(f"Current KV cache size: {cache_sizes}")
        # except RuntimeError as e:
        #     print(f"Error getting KV cache size: {e}")

        # --- Example: Set function tools ---
        # print("Setting function-calling tools...")
        # try:
        #     system_prompt = "You are a helpful assistant that can call the provided functions to help the user."
        #     tools = '''[{
        #         "name": "get_weather",
        #         "description": "Get weather information for the specified city",
        #         "parameters": {
        #             "type": "object",
        #             "properties": {
        #                 "city": {"type": "string", "description": "City name"}
        #             },
        #             "required": ["city"]
        #         }
        #     }]'''
        #     tool_response_str = "<tool_response>"
        #     rk_llm.set_function_tools(system_prompt, tools, tool_response_str)
        #     print("Function tools set successfully.")
        # except RuntimeError as e:
        #     print(f"Error setting function tools: {e}")

        # --- Example: Clear KV cache with range parameters ---
        # print("Clearing KV cache with range parameters...")
        # try:
        #     # Clear cache positions 10 to 20
        #     start_positions = [10]  # Start position for batch 0
        #     end_positions = [20]    # End position for batch 0
        #     rk_llm.clear_kv_cache(keep_system_prompt=True, start_pos=start_positions, end_pos=end_positions)
        #     print("Ranged KV cache clear completed.")
        # except RuntimeError as e:
        #     print(f"Error clearing ranged KV cache: {e}")

    except OSError as e:
        print(f"OSError: {e}. Could not load the RKLLM library.")
        print("Please ensure 'librkllmrt.so' is in your LD_LIBRARY_PATH or provide the full path.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
    finally:
        if 'rk_llm' in locals() and rk_llm.llm_handle and rk_llm.llm_handle.value:
            print("Destroying LLM instance...")
            rk_llm.destroy()
            print("LLM instance destroyed.")
        if os.path.exists(model_file) and model_file == "dummy_model.rkllm":
            os.remove(model_file)  # Clean up dummy file

    print("Example finished.")
run_rkllm.py ADDED
@@ -0,0 +1,243 @@
+ import faulthandler
+ faulthandler.enable()
+ import sys
+ import os
+ os.environ["RKLLM_LOG_LEVEL"] = "1"
+ import time
+ import ctypes
+ import argparse
+ import cv2
+ import numpy as np
+ import ztu_somemodelruntime_rknnlite2 as ort
+ from rkllm_binding import (
+     RKLLMRuntime,
+     RKLLMParam,
+     RKLLMInput,
+     RKLLMInferParam,
+     LLMCallState,
+     RKLLMInputType,
+     RKLLMInferMode,
+     RKLLMResult
+ )
+ 
+ # Constants aligned with the InternVL config
+ IMAGE_HEIGHT = 448
+ IMAGE_WIDTH = 448
+ IMAGE_SEQ_LENGTH = 256
+ MULTIMODAL_HIDDEN_DIM = 2048
+ IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
+ IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)
+ 
+ def expand2square(img, background_color):
+     """
+     Expand the image into a square filled with the specified background color.
+     """
+     height, width, _ = img.shape
+     if width == height:
+         return img.copy()
+ 
+     size = max(width, height)
+     square_img = np.full((size, size, 3), background_color, dtype=np.uint8)
+ 
+     x_offset = (size - width) // 2
+     y_offset = (size - height) // 2
+ 
+     square_img[y_offset:y_offset+height, x_offset:x_offset+width] = img
+     return square_img
+ 
+ def llm_callback(result_ptr, userdata_ptr, state_enum):
+     """
+     Callback function that streams LLM results.
+     """
+     state = LLMCallState(state_enum)
+     result = result_ptr.contents
+ 
+     if state == LLMCallState.RKLLM_RUN_NORMAL:
+         if result.text:
+             print(result.text.decode('utf-8', errors='ignore'), end='', flush=True)
+     elif state == LLMCallState.RKLLM_RUN_FINISH:
+         print("\n", flush=True)
+     elif state == LLMCallState.RKLLM_RUN_ERROR:
+         print("\nrun error", flush=True)
+ 
+     return 0
+ 
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Run RKLLM visual language model inference based on the C++ example."
+     )
+     parser.add_argument("image_path", type=str, help="Path to the input image.")
+     parser.add_argument("encoder_model_path", type=str, help="Path to the ONNX vision encoder model.")
+     parser.add_argument("llm_model_path", type=str, help="Path to the .rkllm language model.")
+     parser.add_argument("max_new_tokens", type=int, help="Maximum number of new tokens to generate.")
+     parser.add_argument("max_context_len", type=int, help="Maximum context length.")
+     # rknn_core_num is not used by onnxruntime in the same way, but we keep it
+     # for API consistency with the C++ example. ONNX Runtime manages its own
+     # threading and execution providers.
+     parser.add_argument("rknn_core_num", type=int, help="Number of NPU cores used by the vision encoder.")
+ 
+     args = parser.parse_args()
+ 
+     # --- 1. Initialize Image Encoder (ONNX Runtime) ---
+     print("Initializing ONNX Runtime for vision encoder...")
+     try:
+         sess_options = ort.SessionOptions()
+         sess_options.intra_op_num_threads = args.rknn_core_num
+         ort_session = ort.InferenceSession(args.encoder_model_path, sess_options=sess_options)
+     except Exception as e:
+         print(f"Failed to load ONNX model: {e}")
+         sys.exit(1)
+     print("Vision encoder loaded successfully.")
+ 
+     input_name = ort_session.get_inputs()[0].name
+     output_name = ort_session.get_outputs()[0].name
+     print(f"ONNX Input: {input_name}, ONNX Output: {output_name}")
+ 
+     # --- 2. Initialize LLM ---
+     print("Initializing RKLLM Runtime...")
+     rk_llm = RKLLMRuntime()
+     param = rk_llm.create_default_param()
+ 
+     param.model_path = args.llm_model_path.encode('utf-8')
+     param.top_k = 1
+     param.max_new_tokens = args.max_new_tokens
+     param.max_context_len = args.max_context_len
+     param.skip_special_token = True
+     param.img_start = b"<img>"
+     param.img_end = b"</img>\n"
+     param.img_content = b""
+     param.extend_param.base_domain_id = 1
+ 
+     try:
+         rk_llm.init(param, llm_callback)
+         print("RKLLM initialized successfully.")
+     except RuntimeError as e:
+         print(f"RKLLM init failed: {e}")
+         sys.exit(1)
+ 
+     # --- 3. Image Preprocessing ---
+     print("Preprocessing image...")
+     img = cv2.imread(args.image_path)
+     if img is None:
+         print(f"Failed to read image from {args.image_path}")
+         sys.exit(1)
+ 
+     img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
+ 
+     background_color = (127.5, 127.5, 127.5)  # Keep close to the official preprocessing
+     square_img = expand2square(img, background_color)
+     resized_img = cv2.resize(square_img, (IMAGE_WIDTH, IMAGE_HEIGHT), interpolation=cv2.INTER_LINEAR)
+ 
+     # Normalize and prepare for the ONNX model
+     input_tensor = resized_img.astype(np.float32)
+     # Normalize using InternVL vision config statistics
+     input_tensor = (input_tensor / 255.0 - IMAGENET_MEAN) / IMAGENET_STD
+     # Convert to NCHW format
+     input_tensor = np.transpose(input_tensor, (2, 0, 1))  # HWC -> CHW
+     input_tensor = np.expand_dims(input_tensor, axis=0)   # Add batch dimension -> (1, 3, 448, 448)
+ 
+     # --- 4. Run Image Encoder ---
+     print("Running vision encoder...")
+     start_time = time.time()
+     try:
+         img_vec_output = ort_session.run([output_name], {input_name: input_tensor.astype(np.float32)})[0]
+         if img_vec_output.ndim != 3:
+             raise RuntimeError(f"Unexpected encoder output shape {img_vec_output.shape}, expected (batch, tokens, hidden)")
+         if img_vec_output.shape[-1] != MULTIMODAL_HIDDEN_DIM:
+             print(f"Warning: hidden dim {img_vec_output.shape[-1]} differs from expected {MULTIMODAL_HIDDEN_DIM}")
+         if img_vec_output.shape[1] != IMAGE_SEQ_LENGTH:
+             print(f"Warning: token count {img_vec_output.shape[1]} differs from expected {IMAGE_SEQ_LENGTH}")
+         elapsed_time = time.time() - start_time
+         print(f"Vision encoder inference time: {elapsed_time:.4f} s")
+         # The C++ example passes a flat float array, so flatten the ONNX output.
+         img_vec = img_vec_output.flatten().astype(np.float32)
+ 
+     except Exception as e:
+         print(f"Failed to run vision encoder inference: {e}")
+         rk_llm.destroy()
+         sys.exit(1)
+ 
+     print("Image encoded successfully.")
+ 
+     # --- 5. Interactive Chat Loop ---
+     rkllm_infer_params = RKLLMInferParam()
+     rkllm_infer_params.mode = RKLLMInferMode.RKLLM_INFER_GENERATE
+     rkllm_infer_params.keep_history = 1
+ 
+     # Set chat template.
+     # The default template parsed by RKLLM seems to give better results than
+     # this one; the reason is unclear.
+     # rk_llm.set_chat_template(
+     #     system_prompt="<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n",
+     #     prompt_prefix="<|im_start|>user\n",
+     #     prompt_postfix="<|im_end|>\n<|im_start|>assistant\n"
+     # )
+ 
+     pre_input = [
+         "<image>What is in the image?",
+         "<image>这张图片中有什么?"
+     ]
+     print("\n********** Enter the index of a preset question below, or type a custom prompt **********\n")
+     for i, p in enumerate(pre_input):
+         print(f"[{i}] {p}")
+     print("\n*****************************************************************************************\n")
+ 
+     try:
+         while True:
+             print("\nuser: ", end="", flush=True)
+             input_str = sys.stdin.readline().strip()
+ 
+             if not input_str:
+                 continue
+             if input_str == "exit":
+                 break
+             if input_str == "clear":
+                 try:
+                     rk_llm.clear_kv_cache(keep_system_prompt=True)
+                     print("KV cache cleared.")
+                 except RuntimeError as e:
+                     print(f"Failed to clear KV cache: {e}")
+                 continue
+ 
+             try:
+                 idx = int(input_str)
+                 if 0 <= idx < len(pre_input):
+                     input_str = pre_input[idx]
+                     print(input_str)
+             except (ValueError, IndexError):
+                 pass  # Use the raw string if not a valid index
+ 
+             rkllm_input = RKLLMInput()
+             rkllm_input.role = b"user"
+ 
+             print("robot: ", end="", flush=True)
+ 
+             if "<image>" in input_str:
+                 rkllm_input.input_type = RKLLMInputType.RKLLM_INPUT_MULTIMODAL
+ 
+                 # Set up multimodal input
+                 rkllm_input.multimodal_input.prompt = input_str.encode('utf-8')
+                 rkllm_input.multimodal_input.image_embed = img_vec.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
+                 rkllm_input.multimodal_input.n_image_tokens = img_vec_output.shape[1]
+                 print("n_image_tokens: ", rkllm_input.multimodal_input.n_image_tokens)
+                 rkllm_input.multimodal_input.n_image = 1
+                 rkllm_input.multimodal_input.image_height = IMAGE_HEIGHT
+                 rkllm_input.multimodal_input.image_width = IMAGE_WIDTH
+             else:
+                 rkllm_input.input_type = RKLLMInputType.RKLLM_INPUT_PROMPT
+                 rkllm_input.prompt_input = input_str.encode('utf-8')
+ 
+             try:
+                 rk_llm.run(rkllm_input, rkllm_infer_params)
+             except RuntimeError as e:
+                 print(f"\nError during rkllm_run: {e}")
+ 
+     except KeyboardInterrupt:
+         print("\nExiting...")
+     finally:
+         print("Releasing resources...")
+         rk_llm.destroy()
+         print("RKLLM instance destroyed.")
+ 
+ if __name__ == "__main__":
+     main()
test.jpg ADDED

Git LFS Details

  • SHA256: a4cd7f45ac1ce27eaafb254b23af7c0b18a064be08870ceaaf03b2147f2ce550
  • Pointer size: 131 Bytes
  • Size of remote file: 156 kB
vision_encoder.rknn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:69670e8e48938fcd2543c5cae22e7789d5783b525a34a7013edeba724744c461
+ size 674706120
ztu_somemodelruntime_rknnlite2.py ADDED
@@ -0,0 +1,1195 @@
+ # Module-level constants and functions
+ import os
+ import re
+ import ctypes
+ import warnings
+ import logging
+ from typing import List, Dict, Union, Optional
+ 
+ import numpy as np
+ from rknnlite.api import RKNNLite
+ 
+ try:
+     import onnxruntime as ort
+     HAS_ORT = True
+ except ImportError:
+     HAS_ORT = False
+     warnings.warn("onnxruntime is not installed; only the RKNN backend is available", ImportWarning)
+ 
+ # Configure logging
+ logger = logging.getLogger("somemodelruntime_rknnlite2")
+ logger.setLevel(logging.ERROR)  # Only log errors by default
+ if not logger.handlers:
+     handler = logging.StreamHandler()
+     handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
+     logger.addHandler(handler)
+ 
+ # Map ONNX Runtime log levels to Python logging levels
+ _LOGGING_LEVEL_MAP = {
+     0: logging.DEBUG,     # Verbose
+     1: logging.INFO,      # Info
+     2: logging.WARNING,   # Warning
+     3: logging.ERROR,     # Error
+     4: logging.CRITICAL   # Fatal
+ }
+ 
+ # Honor the log level set via environment variable, if any
+ try:
+     env_log_level = os.getenv('ZTU_MODELRT_RKNNL2_LOG_LEVEL')
+     if env_log_level is not None:
+         log_level = int(env_log_level)
+         if log_level in _LOGGING_LEVEL_MAP:
+             logger.setLevel(_LOGGING_LEVEL_MAP[log_level])
+             logger.info(f"Log level set from environment variable: {log_level}")
+         else:
+             logger.warning(f"Invalid value for ZTU_MODELRT_RKNNL2_LOG_LEVEL: {log_level}, expected an integer from 0 to 4")
+ except ValueError:
+     logger.warning(f"Invalid value for ZTU_MODELRT_RKNNL2_LOG_LEVEL: {env_log_level}, expected an integer from 0 to 4")
+ 
+ 
+ def set_default_logger_severity(level: int) -> None:
+     """
+     Sets the default logging severity. 0:Verbose, 1:Info, 2:Warning, 3:Error, 4:Fatal
+ 
+     Args:
+         level: log level (0-4)
+     """
+     if level not in _LOGGING_LEVEL_MAP:
+         raise ValueError(f"Invalid log level: {level}, expected an integer from 0 to 4")
+     logger.setLevel(_LOGGING_LEVEL_MAP[level])
+ 
+ def set_default_logger_verbosity(level: int) -> None:
+     """
+     Sets the default logging verbosity level. To activate the verbose log,
+     you need to set the default logging severity to 0:Verbose level.
+ 
+     Args:
+         level: log level (0-4)
+     """
+     set_default_logger_severity(level)
+ 
+ # Map RKNN tensor types to numpy dtypes
+ RKNN_DTYPE_MAP = {
+     0: np.float32,  # RKNN_TENSOR_FLOAT32
+     1: np.float16,  # RKNN_TENSOR_FLOAT16
+     2: np.int8,     # RKNN_TENSOR_INT8
+     3: np.uint8,    # RKNN_TENSOR_UINT8
+     4: np.int16,    # RKNN_TENSOR_INT16
+     5: np.uint16,   # RKNN_TENSOR_UINT16
+     6: np.int32,    # RKNN_TENSOR_INT32
+     7: np.uint32,   # RKNN_TENSOR_UINT32
+     8: np.int64,    # RKNN_TENSOR_INT64
+     9: bool,        # RKNN_TENSOR_BOOL
+     10: np.int8,    # RKNN_TENSOR_INT4 (represented as int8)
+ }
+ 
+ def get_available_providers() -> List[str]:
+     """
+     Get the list of available execution providers (placeholder kept for interface compatibility).
+ 
+     Returns:
+         list: always ["CPUExecutionProvider", "somemodelruntime_rknnlite2_ExecutionProvider"]
+     """
+     return ["CPUExecutionProvider", "somemodelruntime_rknnlite2_ExecutionProvider"]
+ 
+ 
+ def get_device() -> str:
+     """
+     Get the current device.
+ 
+     Returns:
+         str: the current device
+     """
+     return "RKNN2"
+ 
+ def get_version_info() -> Dict[str, str]:
+     """
+     Get version information.
+ 
+     Returns:
+         dict: API and driver version information
+     """
+     runtime = RKNNLite()
+     version = runtime.get_sdk_version()
+     return {
+         "api_version": version.split('\n')[2].split(': ')[1].split(' ')[0],
+         "driver_version": version.split('\n')[3].split(': ')[1]
+     }
+ 
+ class IOTensor:
+     """Wrapper holding input/output tensor information"""
+     def __init__(self, name, shape, type=None):
+         self.name = name.decode() if isinstance(name, bytes) else name
+         self.shape = shape
+         self.type = type
+ 
+     def __str__(self):
+         return f"IOTensor(name='{self.name}', shape={self.shape}, type={self.type})"
+ 
+ class SessionOptions:
+     """Session options"""
+     def __init__(self):
+         self.enable_profiling = False   # Enable performance profiling
+         self.intra_op_num_threads = 1   # Number of RKNN NPU cores to use (maps to RKNN core_mask)
+         self.log_severity_level = -1    # Alternative way to set the log level
+         self.log_verbosity_level = -1   # Alternative way to set the log level
+ 
+ class InferenceSession:
+     """
+     RKNNLite runtime wrapper with an ONNX Runtime-like API
+     """
+ 
+     def __new__(cls, model_path: str, sess_options: Optional[SessionOptions] = None, **kwargs):
+         processed_path = InferenceSession._process_model_path(model_path, sess_options)
+         if isinstance(processed_path, str) and processed_path.lower().endswith('.onnx'):
+             logger.info("Loading the model with ONNX Runtime")
+             if not HAS_ORT:
+                 raise RuntimeError("onnxruntime is not installed; cannot load an ONNX model")
+             return ort.InferenceSession(processed_path, sess_options=sess_options, **kwargs)
+         else:
+             # Not an ONNX model: create an InferenceSession instance via the parent __new__
+             instance = super().__new__(cls)
+             # Remember the processed path
+             instance._processed_path = processed_path
+             return instance
+ 
+     def __init__(self, model_path: str, sess_options: Optional[SessionOptions] = None, **kwargs):
+         """
+         Initialize the runtime and load the model.
+ 
+         Args:
+             model_path: model file path (.rknn or .onnx)
+             sess_options: session options
+             **kwargs: other initialization arguments
+         """
+         options = sess_options or SessionOptions()
+ 
+         # Only use the log level from SessionOptions when the environment variable is unset
+         if os.getenv('ZTU_MODELRT_RKNNL2_LOG_LEVEL') is None:
+             if options.log_severity_level != -1:
+                 set_default_logger_severity(options.log_severity_level)
+             if options.log_verbosity_level != -1:
+                 set_default_logger_verbosity(options.log_verbosity_level)
+ 
+         # Use the path processed in __new__
+         model_path = getattr(self, '_processed_path', model_path)
+         if isinstance(model_path, str) and model_path.lower().endswith('.onnx'):
+             # Avoid re-loading an ONNX model
+             return
+ 
+         # RKNN model loading and initialization
+         self.model_path = model_path
+         if not os.path.exists(self.model_path):
+             logger.error(f"Model file does not exist: {self.model_path}")
+             raise FileNotFoundError(f"Model file does not exist: {self.model_path}")
+ 
+         self.runtime = RKNNLite(verbose=options.enable_profiling)
+ 
+         logger.debug(f"Loading model: {self.model_path}")
+         ret = self.runtime.load_rknn(self.model_path)
+         if ret != 0:
+             logger.error(f"Failed to load RKNN model: {self.model_path}")
+             raise RuntimeError(f'Failed to load RKNN model: {self.model_path}')
+         logger.debug("Model loaded successfully")
+ 
+         if options.intra_op_num_threads == 1:
+             core_mask = RKNNLite.NPU_CORE_AUTO
+         elif options.intra_op_num_threads == 2:
+             core_mask = RKNNLite.NPU_CORE_0_1
+         elif options.intra_op_num_threads == 3:
+             core_mask = RKNNLite.NPU_CORE_0_1_2
+         else:
+             raise ValueError(f"Invalid value for intra_op_num_threads: {options.intra_op_num_threads}, must be 1, 2 or 3")
+ 
+         logger.debug("Initializing the runtime environment")
+         ret = self.runtime.init_runtime(core_mask=core_mask)
+         if ret != 0:
+             logger.error("Failed to initialize the runtime environment")
+             raise RuntimeError('Failed to initialize the runtime environment')
+ 
+         logger.debug("Runtime environment initialized successfully")
+ 
+         # After runtime init, auto-register custom op plugin libraries from environment variables
+         try:
+             # Register user-specified plugins (comma/semicolon/colon separated)
+             env_custom = os.getenv('ZTU_MODELRT_RKNN2_REG_CUSTOM_OP_LIB', '').strip()
+             if env_custom:
+                 paths = [seg.strip() for seg in re.split(r"[,;:]", env_custom) if seg.strip()]
+                 ok = 0
+                 for p in paths:
+                     if self.register_custom_op_lib(p):
+                         ok += 1
+                 if ok > 0:
+                     logger.info(f"Registered {ok}/{len(paths)} custom op plugins")
+             # Register plugins found in the system directory
+             if os.getenv('ZTU_MODELRT_RKNN2_REG_SYSTEM_CUSTOM_OP_LIB', '1') == '1':
+                 cnt = self.register_system_custom_op_lib()
+                 if cnt > 0:
+                     logger.info(f"Registered {cnt} custom op plugins from the system directory")
+         except Exception as e:
+             logger.warning(f"Automatic registration of custom op plugins failed: {e}")
+ 
+         # Optional: register the built-in (Python-based) bundled ops via environment variable
+         if os.getenv('ZTU_MODELRT_RKNN2_REG_BUNDLED_OPS', '0') == '1':
+             logger.info("Registering bundled ops as requested by the environment variable")
+             self.register_bundled_ops()
+ 
+         self._init_io_info()
+         self.options = options
+ 
+     def get_performance_info(self) -> Dict[str, float]:
+         """
+         Get performance information.
+ 
+         Returns:
+             dict: performance information
+         """
+         if not self.options.enable_profiling:
+             raise RuntimeError("Profiling is not enabled; set enable_profiling=True in SessionOptions")
+ 
+         perf = self.runtime.rknn_runtime.get_run_perf()
+         return {
+             "run_duration": perf.run_duration / 1000.0  # convert to milliseconds
+         }
+ 
+     def set_core_mask(self, core_mask: int) -> None:
+         """
+         Set the NPU core usage mode.
+ 
+         Args:
+             core_mask: NPU core mask, use the NPU_CORE_* constants
+         """
+         ret = self.runtime.rknn_runtime.set_core_mask(core_mask)
+         if ret != 0:
+             raise RuntimeError("Failed to set the NPU core mode")
+ 
+     @staticmethod
+     def _process_model_path(model_path, sess_options):
+         """
+         Process the model path; supports .onnx and .rknn files.
+ 
+         Args:
+             model_path: model file path
+         """
+         # For ONNX files, check whether a matching RKNN model should be loaded instead
+         if model_path.lower().endswith('.onnx'):
+             logger.info("Detected an ONNX model file")
+ 
+             # Get the list of models that should skip automatic RKNN loading
+             skip_models = os.getenv('ZTU_MODELRT_RKNNL2_SKIP', '').strip()
+             if skip_models:
+                 skip_list = [m.strip() for m in skip_models.split(',')]
+                 # Match on the file name (without the path)
+                 model_name = os.path.basename(model_path)
+                 if model_name.lower() in [m.lower() for m in skip_list]:
+                     logger.info(f"Model {model_name} is in the skip list; using ONNX Runtime")
+                     return model_path
+ 
+             # Build the RKNN file path
+             rknn_path = os.path.splitext(model_path)[0] + '.rknn'
+             if os.path.exists(rknn_path):
+                 logger.info(f"Found a matching RKNN model, using RKNN: {rknn_path}")
+                 return rknn_path
+             else:
+                 logger.info("No matching RKNN model found; using ONNX Runtime")
+                 return model_path
+ 
+         return model_path
+ 
+     def _convert_nhwc_to_nchw(self, shape):
+         """Convert an NHWC-format shape to NCHW"""
+         if len(shape) == 4:
+             # NHWC -> NCHW
+             n, h, w, c = shape
+             return [n, c, h, w]
+         return shape
+ 
+     def _init_io_info(self):
+         """Initialize the model's input/output information"""
+         runtime = self.runtime.rknn_runtime
+ 
+         # Get the number of inputs and outputs
+         n_input, n_output = runtime.get_in_out_num()
+ 
+         # Collect input information
+         self.input_tensors = []
+         for i in range(n_input):
+             attr = runtime.get_tensor_attr(i)
+             shape = [attr.dims[j] for j in range(attr.n_dims)]
+             # Convert 4-D inputs from NHWC to NCHW
+             shape = self._convert_nhwc_to_nchw(shape)
+             # Look up the dtype
+             dtype = RKNN_DTYPE_MAP.get(attr.type, None)
+             tensor = IOTensor(attr.name, shape, dtype)
+             self.input_tensors.append(tensor)
+ 
+         # Collect output information
+         self.output_tensors = []
+         for i in range(n_output):
+             attr = runtime.get_tensor_attr(i, is_output=True)
+             shape = runtime.get_output_shape(i)
+             # Look up the dtype
+             dtype = RKNN_DTYPE_MAP.get(attr.type, None)
+             tensor = IOTensor(attr.name, shape, dtype)
+             self.output_tensors.append(tensor)
+ 
+     def get_inputs(self):
+         """
+         Get the model's input information.
+ 
+         Returns:
+             list: input information list
+         """
+         return self.input_tensors
+ 
+     def get_outputs(self):
+         """
+         Get the model's output information.
+ 
+         Returns:
+             list: output information list
+         """
+         return self.output_tensors
+ 
+     def run(self, output_names=None, input_feed=None, data_format="nchw", **kwargs):
+         """
+         Run model inference.
+ 
+         Args:
+             output_names: output node names specifying which outputs to return
+             input_feed: input data dict or list
+             data_format: input data format, "nchw" or "nhwc"
+             **kwargs: other runtime arguments
+ 
+         Returns:
+             list: model outputs; only the named outputs if output_names is given
+         """
+         if input_feed is None:
+             logger.error("input_feed must not be None")
+             raise ValueError("input_feed must not be None")
+ 
+         # Prepare the input data
+         if isinstance(input_feed, dict):
+             # For a dict, order the inputs to match the model's input order
+             inputs = []
+             for tensor in self.input_tensors:
+                 if tensor.name not in input_feed:
+                     raise ValueError(f"Missing input: {tensor.name}")
+                 inputs.append(input_feed[tensor.name])
+         elif isinstance(input_feed, (list, tuple)):
+             # For a list, make sure the length matches
+             if len(input_feed) != len(self.input_tensors):
+                 raise ValueError(f"Input count mismatch: expected {len(self.input_tensors)}, got {len(input_feed)}")
+             inputs = list(input_feed)
+         else:
+             logger.error("input_feed must be a dict or a list")
+             raise ValueError("input_feed must be a dict or a list")
+ 
+         # Run inference
+         try:
+             logger.debug("Starting inference")
+             all_outputs = self.runtime.inference(inputs=inputs, data_format=data_format)
+ 
+             # Return all outputs when output_names is not given
+             if output_names is None:
+                 return all_outputs
+ 
+             # Select the requested outputs
+             output_map = {tensor.name: i for i, tensor in enumerate(self.output_tensors)}
+             selected_outputs = []
+             for name in output_names:
+                 if name not in output_map:
+                     raise ValueError(f"Output node not found: {name}")
+                 selected_outputs.append(all_outputs[output_map[name]])
+ 
+             return selected_outputs
+ 
+         except Exception as e:
+             logger.error(f"Inference failed: {str(e)}")
+             raise RuntimeError(f"Inference failed: {str(e)}")
+ 
+     def close(self):
+         """
+         Close the session and release resources.
+         """
+         if self.runtime is not None:
+             logger.info("Releasing runtime resources")
+             self.runtime.release()
+             self.runtime = None
+ 
+     def __enter__(self):
+         return self
+ 
+     def __exit__(self, exc_type, exc_val, exc_tb):
+         self.close()
+ 
+     def end_profiling(self) -> Optional[str]:
+         """
+         Stub for ending profiling.
+ 
+         Returns:
+             Optional[str]: None
+         """
+         warnings.warn("end_profiling() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+         return None
+ 
+     def get_profiling_start_time_ns(self) -> int:
+         """
+         Stub for getting the profiling start time.
+ 
+         Returns:
+             int: 0
+         """
+         warnings.warn("get_profiling_start_time_ns() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+         return 0
+ 
+     def get_modelmeta(self) -> Dict[str, str]:
+         """
+         Stub for getting model metadata.
+ 
+         Returns:
+             Dict[str, str]: empty dict
+         """
+         warnings.warn("get_modelmeta() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+         return {}
+ 
+     def get_session_options(self) -> SessionOptions:
+         """
+         Get the session options.
+ 
+         Returns:
+             SessionOptions: current session options
+         """
+         return self.options
+ 
+     def get_providers(self) -> List[str]:
+         """
+         Stub for getting the providers in use.
+ 
+         Returns:
+             List[str]: ["CPUExecutionProvider"]
+         """
+         warnings.warn("get_providers() is a stub and always returns CPUExecutionProvider", RuntimeWarning, stacklevel=2)
+         return ["CPUExecutionProvider"]
+ 
+     def get_provider_options(self) -> Dict[str, Dict[str, str]]:
+         """
+         Stub for getting provider options.
+ 
+         Returns:
+             Dict[str, Dict[str, str]]: empty dict
+         """
+         warnings.warn("get_provider_options() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+         return {}
+ 
+     def get_session_config(self) -> Dict[str, str]:
+         """
+         Stub for getting the session config.
+ 
+         Returns:
+             Dict[str, str]: empty dict
+         """
+         warnings.warn("get_session_config() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+         return {}
+ 
+     def get_session_state(self) -> Dict[str, str]:
+         """
+         Stub for getting the session state.
+ 
+         Returns:
+             Dict[str, str]: empty dict
+         """
+         warnings.warn("get_session_state() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+         return {}
+ 
+     def set_session_config(self, config: Dict[str, str]) -> None:
+         """
+         Stub for setting the session config.
+ 
+         Args:
+             config: session config dict
+         """
+         warnings.warn("set_session_config() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+ 
+     def get_memory_info(self) -> Dict[str, int]:
+         """
+         Stub for getting memory usage information.
+ 
+         Returns:
+             Dict[str, int]: empty dict
+         """
+         warnings.warn("get_memory_info() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+         return {}
+ 
+     def set_memory_pattern(self, enable: bool) -> None:
+         """
+         Stub for setting the memory pattern.
+ 
+         Args:
+             enable: whether to enable the memory pattern
+         """
+         warnings.warn("set_memory_pattern() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+ 
+     def disable_memory_pattern(self) -> None:
+         """
+         Stub for disabling the memory pattern.
+         """
+         warnings.warn("disable_memory_pattern() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+ 
+     def get_optimization_level(self) -> int:
+         """
+         Stub for getting the optimization level.
+ 
+         Returns:
+             int: 0
+         """
+         warnings.warn("get_optimization_level() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+         return 0
+ 
+     def set_optimization_level(self, level: int) -> None:
+         """
+         Stub for setting the optimization level.
+ 
+         Args:
+             level: optimization level
+         """
+         warnings.warn("set_optimization_level() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+ 
+     def get_model_metadata(self) -> Dict[str, str]:
+         """
+         Stub for getting model metadata (a different interface from get_modelmeta).
+ 
+         Returns:
+             Dict[str, str]: empty dict
+         """
+         warnings.warn("get_model_metadata() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+         return {}
+ 
+     def get_model_path(self) -> str:
+         """
+         Get the model path.
+ 
+         Returns:
+             str: model file path
+         """
+         return self.model_path
+ 
+     def get_input_type_info(self) -> List[Dict[str, str]]:
+         """
+         Stub for getting input type information.
+ 
+         Returns:
+             List[Dict[str, str]]: empty list
+         """
+         warnings.warn("get_input_type_info() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+         return []
+ 
+     def get_output_type_info(self) -> List[Dict[str, str]]:
+         """
+         Stub for getting output type information.
+ 
+         Returns:
+             List[Dict[str, str]]: empty list
+         """
+         warnings.warn("get_output_type_info() is a stub and does nothing", RuntimeWarning, stacklevel=2)
+         return []
+ 
+ ################### 自定义算子 ###################
598
+
599
+ def _init_custom_op_types(self):
600
+ """初始化自定义算子的类型定义"""
601
+ # 常量
602
+ self._RKNN_TENSOR_FLOAT32 = 0
603
+ self._RKNN_TENSOR_UINT8 = 3
604
+ self._RKNN_TENSOR_INT64 = 8
605
+ self._RKNN_TARGET_TYPE_CPU = 1
606
+
607
+ # 结构体定义
608
+ class RKNN_TensorAttr(ctypes.Structure):
609
+ _fields_ = [
610
+ ("index", ctypes.c_uint32),
611
+ ("n_dims", ctypes.c_uint32),
612
+ ("dims", ctypes.c_uint32 * RKNN_MAX_DIMS),
613
+ ("name", ctypes.c_char * RKNN_MAX_NAME_LEN),
614
+ ("n_elems", ctypes.c_uint32),
615
+ ("size", ctypes.c_uint32),
616
+                ("fmt", ctypes.c_int),
+                ("type", ctypes.c_int),
+                ("qnt_type", ctypes.c_int),
+                ("fl", ctypes.c_int8),
+                ("zp", ctypes.c_int32),
+                ("scale", ctypes.c_float),
+                ("w_stride", ctypes.c_uint32),
+                ("size_with_stride", ctypes.c_uint32),
+                ("pass_through", ctypes.c_uint8),
+                ("h_stride", ctypes.c_uint32),
+            ]
+
+        class RKNN_TensorMem(ctypes.Structure):
+            _fields_ = [
+                ("virt_addr", ctypes.c_void_p),
+                ("phys_addr", ctypes.c_uint64),
+                ("fd", ctypes.c_int32),
+                ("offset", ctypes.c_int32),
+                ("size", ctypes.c_uint32),
+                ("flags", ctypes.c_uint32),
+                ("priv_data", ctypes.c_void_p),
+            ]
+
+        class RKNN_CustomOpTensor(ctypes.Structure):
+            _fields_ = [
+                ("attr", RKNN_TensorAttr),
+                ("mem", RKNN_TensorMem),
+            ]
+
+        class RKNN_GPUOpContext(ctypes.Structure):
+            _fields_ = [
+                ("cl_context", ctypes.c_void_p),
+                ("cl_command_queue", ctypes.c_void_p),
+                ("cl_kernel", ctypes.c_void_p),
+            ]
+
+        InternalCtxType = (
+            ctypes.c_uint64 if ctypes.sizeof(ctypes.c_void_p) == 8 else ctypes.c_uint32
+        )
+
+        class RKNN_CustomOpContext(ctypes.Structure):
+            _fields_ = [
+                ("target", ctypes.c_int),
+                ("internal_ctx", InternalCtxType),
+                ("gpu_ctx", RKNN_GPUOpContext),
+                ("priv_data", ctypes.c_void_p),
+            ]
+
+        class RKNN_CustomOpAttr(ctypes.Structure):
+            _fields_ = [
+                ("name", ctypes.c_char * RKNN_MAX_NAME_LEN),
+                ("dtype", ctypes.c_int),
+                ("n_elems", ctypes.c_uint32),
+                ("data", ctypes.c_void_p),
+            ]
+
+        CB_SIG = ctypes.CFUNCTYPE(
+            ctypes.c_int,
+            ctypes.POINTER(RKNN_CustomOpContext),
+            ctypes.POINTER(RKNN_CustomOpTensor),
+            ctypes.c_uint32,
+            ctypes.POINTER(RKNN_CustomOpTensor),
+            ctypes.c_uint32,
+        )
+
+        DESTROY_SIG = ctypes.CFUNCTYPE(
+            ctypes.c_int, ctypes.POINTER(RKNN_CustomOpContext)
+        )
+
+        class RKNN_CustomOp(ctypes.Structure):
+            _fields_ = [
+                ("version", ctypes.c_uint32),
+                ("target", ctypes.c_int),
+                ("op_type", ctypes.c_char * RKNN_MAX_NAME_LEN),
+                ("cl_kernel_name", ctypes.c_char * RKNN_MAX_NAME_LEN),
+                ("cl_kernel_source", ctypes.c_char_p),
+                ("cl_source_size", ctypes.c_uint64),
+                ("cl_build_options", ctypes.c_char * RKNN_MAX_NAME_LEN),
+                ("init", CB_SIG),
+                ("prepare", CB_SIG),
+                ("compute", CB_SIG),
+                ("compute_native", CB_SIG),
+                ("destroy", DESTROY_SIG),
+            ]
+
+        # Save the type definitions for later use
+        self._RKNN_TensorAttr = RKNN_TensorAttr
+        self._RKNN_TensorMem = RKNN_TensorMem
+        self._RKNN_CustomOpTensor = RKNN_CustomOpTensor
+        self._RKNN_CustomOpContext = RKNN_CustomOpContext
+        self._RKNN_CustomOpAttr = RKNN_CustomOpAttr
+        self._RKNN_CustomOp = RKNN_CustomOp
+        self._CB_SIG = CB_SIG
+        self._DESTROY_SIG = DESTROY_SIG
+
+    def _create_attr_readers(self, get_op_attr):
+        """Create helpers that read typed attributes from a custom-op context."""
+        def read_attr_int64(op_ctx_ptr, key: str, default: int = 0) -> int:
+            attr = self._RKNN_CustomOpAttr()
+            get_op_attr(op_ctx_ptr, key.encode("utf-8"), ctypes.byref(attr))
+            if attr.n_elems == 1 and attr.dtype == self._RKNN_TENSOR_INT64 and attr.data:
+                return ctypes.c_int64.from_address(attr.data).value
+            return default
+
+        def read_attr_float32(op_ctx_ptr, key: str, default: float = 0.0) -> float:
+            attr = self._RKNN_CustomOpAttr()
+            get_op_attr(op_ctx_ptr, key.encode("utf-8"), ctypes.byref(attr))
+            if attr.n_elems == 1 and attr.dtype == self._RKNN_TENSOR_FLOAT32 and attr.data:
+                return ctypes.c_float.from_address(attr.data).value
+            return default
+
+        def read_attr_str(op_ctx_ptr, key: str, default: str = "") -> str:
+            attr = self._RKNN_CustomOpAttr()
+            get_op_attr(op_ctx_ptr, key.encode("utf-8"), ctypes.byref(attr))
+            if attr.n_elems > 0 and attr.dtype == self._RKNN_TENSOR_UINT8 and attr.data:
+                buf = (ctypes.c_ubyte * attr.n_elems).from_address(attr.data)
+                try:
+                    return bytes(buf).decode("utf-8", errors="ignore").strip('"')
+                except Exception:
+                    return default
+            return default
+
+        return read_attr_int64, read_attr_str, read_attr_float32
+
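The `from_address` trick these readers rely on can be seen in isolation. The following standalone sketch (the variable names are illustrative, not part of this module) reinterprets a raw address as a typed scalar and decodes a UINT8 buffer the way `read_attr_str` does:

```python
import ctypes

# A C-side value whose raw address stands in for attr.data.
val = ctypes.c_int64(7)
addr = ctypes.addressof(val)

# Reinterpret the bare address as a typed scalar, as read_attr_int64 does.
out = ctypes.c_int64.from_address(addr).value
print(out)  # 7

# String attributes arrive as UINT8 buffers; decode as read_attr_str does.
raw = b'"bilinear"'
buf = (ctypes.c_ubyte * len(raw)).from_buffer_copy(raw)
text = bytes(buf).decode("utf-8", errors="ignore").strip('"')
print(text)  # bilinear
```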
+    def _build_py_custom_op(self,
+                            op_type: str,
+                            n_inputs: int,
+                            n_outputs: int,
+                            on_init,
+                            on_compute):
+        """Generic builder for Python-implemented custom ops.
+
+        Args:
+            op_type: operator type name (string)
+            n_inputs: number of inputs
+            n_outputs: number of outputs
+            on_init: callback, signature on_init(op_ctx_p, read_attr_int64, read_attr_str, read_attr_float32) -> state
+            on_compute: callback, signature on_compute(op_ctx_p, inputs_p, outputs_p, state) -> int (0 on success)
+        Returns:
+            (RKNN_CustomOp instance, tuple of callbacks)
+        """
+        @self._CB_SIG
+        def _py_init(op_ctx_p, inputs_p, n_inputs_p, outputs_p, n_outputs_p):
+            try:
+                # Attributes are read lazily via the reader helpers, so no
+                # up-front attribute parsing is required here.
+                runtime = self.runtime.rknn_base.rknn_runtime
+                read_attr_int64, read_attr_str, read_attr_float32 = self._create_attr_readers(runtime.lib.rknn_custom_op_get_op_attr)
+                user_state = on_init(op_ctx_p, read_attr_int64, read_attr_str, read_attr_float32)
+                # Allocate a unique ID for this instance and record it in priv_data
+                if not hasattr(self, "_custom_op_states"):
+                    self._custom_op_states = {}
+                if not hasattr(self, "_next_custom_op_id"):
+                    self._next_custom_op_id = 1
+                inst_id = int(self._next_custom_op_id)
+                self._next_custom_op_id += 1
+                # Keep the Python-side state keyed by instance ID
+                self._custom_op_states[inst_id] = user_state
+                # Write the instance ID into priv_data
+                try:
+                    op_ctx_p.contents.priv_data = ctypes.c_void_p(inst_id)
+                except Exception:
+                    # Fallback: assign the plain integer
+                    op_ctx_p.contents.priv_data = inst_id
+                return 0
+            except Exception as e:
+                logger.error(f"{op_type} init failed: {e}")
+                return -1
+
+        @self._CB_SIG
+        def _py_prepare(op_ctx_p, inputs_p, n_inputs_p, outputs_p, n_outputs_p):
+            return 0
+
+        @self._CB_SIG
+        def _py_compute(op_ctx_p, inputs_p, n_inputs_p, outputs_p, n_outputs_p):
+            try:
+                if n_inputs_p != n_inputs or n_outputs_p != n_outputs:
+                    return -1
+                # Recover this instance's state via priv_data
+                try:
+                    inst_id = int(op_ctx_p.contents.priv_data) if op_ctx_p.contents.priv_data else 0
+                except Exception:
+                    inst_id = 0
+                user_state = None
+                if hasattr(self, "_custom_op_states") and inst_id in self._custom_op_states:
+                    user_state = self._custom_op_states.get(inst_id)
+                else:
+                    logger.error(f"{op_type} compute failed: no state found for inst_id={inst_id}")
+                    return -1
+                return on_compute(op_ctx_p, inputs_p, outputs_p, user_state)
+            except Exception:
+                import traceback
+                logger.error(f"{op_type} compute failed: {traceback.format_exc()}")
+                return -1
+
+        @self._DESTROY_SIG
+        def _py_destroy(op_ctx_p):
+            try:
+                # Drop this instance's state
+                try:
+                    inst_id = int(op_ctx_p.contents.priv_data) if op_ctx_p.contents.priv_data else 0
+                except Exception:
+                    inst_id = 0
+                if hasattr(self, "_custom_op_states") and inst_id in self._custom_op_states:
+                    del self._custom_op_states[inst_id]
+                # Clear priv_data
+                try:
+                    op_ctx_p.contents.priv_data = ctypes.c_void_p(0)
+                except Exception:
+                    op_ctx_p.contents.priv_data = 0
+                return 0
+            except Exception:
+                return -1
+
+        op = self._RKNN_CustomOp()
+        op.version = 1
+        op.target = self._RKNN_TARGET_TYPE_CPU
+        op.op_type = op_type.encode("utf-8")
+        op.cl_kernel_name = b""
+        op.cl_kernel_source = None
+        op.cl_source_size = 0
+        op.cl_build_options = b""
+        op.init = _py_init
+        op.prepare = _py_prepare
+        op.compute = _py_compute
+        op.compute_native = self._CB_SIG()  # NULL callback slot
+        op.destroy = _py_destroy
+
+        return op, (_py_init, _py_prepare, _py_compute, _py_destroy)
+
+    def _tensor_to_numpy(self, rknn_tensor):
+        """Wrap an RKNN_CustomOpTensor as a zero-copy numpy array view."""
+        # Map the RKNN tensor type to a ctypes/numpy dtype pair.
+        # Extend this mapping as additional types are needed.
+        dtype_map = {
+            self._RKNN_TENSOR_FLOAT32: (ctypes.c_float, np.float32),
+            self._RKNN_TENSOR_UINT8: (ctypes.c_uint8, np.uint8),
+            self._RKNN_TENSOR_INT64: (ctypes.c_int64, np.int64),
+        }
+        c_type, np_dtype = dtype_map.get(rknn_tensor.attr.type, (None, None))
+        if c_type is None:
+            raise TypeError(f"Unsupported RKNN tensor type: {rknn_tensor.attr.type}")
+
+        # Compute the memory address and shape
+        addr = (rknn_tensor.mem.virt_addr or 0) + int(rknn_tensor.mem.offset)
+        ptr = ctypes.cast(addr, ctypes.POINTER(c_type))
+        shape = tuple(rknn_tensor.attr.dims[i] for i in range(rknn_tensor.attr.n_dims))
+
+        # Build the numpy view (no copy is made)
+        return np.ctypeslib.as_array(ptr, shape=shape)
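The zero-copy view technique used by `_tensor_to_numpy` can be exercised without any RKNN hardware. A minimal standalone sketch (the buffer here stands in for a tensor's `virt_addr`; names are illustrative) of the same `ctypes.cast` + `np.ctypeslib.as_array` pattern:

```python
import ctypes
import numpy as np

# Simulate a C-owned buffer, such as a tensor's virt_addr region.
buf = (ctypes.c_float * 6)(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
addr = ctypes.addressof(buf)

# Cast the raw address to a typed pointer, then wrap it as a numpy
# view with the desired shape -- no data is copied.
ptr = ctypes.cast(addr, ctypes.POINTER(ctypes.c_float))
view = np.ctypeslib.as_array(ptr, shape=(2, 3))

# Writes through the view land in the original C buffer.
view[1, 2] = 42.0
print(buf[5])  # 42.0
```

Because the view aliases the C memory, output tensors can be filled in place with `output_nps[i][...] = ...` as the compute callbacks above do.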
+
+    def _create_onnxscript_op_creator(self,
+                                      op_type: str,
+                                      # receives a "function template builder"
+                                      onnxscript_func_builder,
+                                      n_inputs: int,
+                                      n_outputs: int,
+                                      attributes: dict = {},
+                                      constants: dict = {}):
+        """
+        Higher-order factory that creates ONNXScript-backed custom-op constructors.
+        The final onnxscript compute function is generated dynamically during on_init.
+
+        Args:
+            op_type (str): operator type name.
+            onnxscript_func_builder: a function that takes all attributes and
+                constants as keyword arguments and returns a compiled
+                onnxscript function. For example:
+                    def builder(mean, scale):
+                        @onnxscript.script()
+                        def compute(like):
+                            return opset.RandomNormalLike(like, mean=mean, scale=scale)
+                        return compute
+            attributes (dict): attributes read from the model.
+            constants (dict): compile-time constants.
+            n_inputs (int): number of inputs.
+            n_outputs (int): number of outputs.
+        """
+
+        def creator_func():
+            def on_init(op_ctx_p, read_i64, read_s, read_f32):
+                # 1. Read all dynamic attributes
+                attr_values = {}
+                for name, (attr_type, default) in attributes.items():
+                    if attr_type == 'int64':
+                        attr_values[name] = read_i64(op_ctx_p, name, default)
+                    elif attr_type == 'str':
+                        attr_values[name] = read_s(op_ctx_p, name, default)
+                    elif attr_type == 'float32':
+                        attr_values[name] = read_f32(op_ctx_p, name, default)
+                    else:
+                        raise ValueError(f"Unsupported attribute type: {attr_type}")
+
+                # 2. Merge constants and attributes
+                final_kwargs = {**constants, **attr_values}
+
+                # 3. Build the onnxscript function dynamically; this ensures all
+                #    attribute values are captured by the closure as constants
+                compute_func = onnxscript_func_builder(**final_kwargs)
+
+                # 4. Store the compiled function in the per-instance state
+                return {"compute_func": compute_func}
+
+            def on_compute(op_ctx_p, inputs_p, outputs_p, state):
+                compute_func = state["compute_func"]
+
+                input_nps = [self._tensor_to_numpy(inputs_p[i]) for i in range(n_inputs)]
+                output_nps = [self._tensor_to_numpy(outputs_p[i]) for i in range(n_outputs)]
+
+                results = compute_func(*input_nps)
+
+                if n_outputs == 1:
+                    result_val = results[0] if isinstance(results, tuple) else results
+                    output_nps[0][...] = result_val
+                else:
+                    for i in range(n_outputs):
+                        output_nps[i][...] = results[i]
+
+                return 0
+
+            return self._build_py_custom_op(
+                op_type=op_type,
+                n_inputs=n_inputs,
+                n_outputs=n_outputs,
+                on_init=on_init,
+                on_compute=on_compute
+            )
+
+        return creator_func
+
+    def _create_gridsample_op(self):
+        import onnxscript
+        from onnxscript import opset17 as opset
+
+        def grid_sample_builder(align_corners, mode, padding_mode):
+            @onnxscript.script()
+            def grid_sample_compute(X, G):
+                return opset.GridSample(X, G, align_corners=align_corners, mode=mode, padding_mode=padding_mode)
+            return grid_sample_compute
+
+        grid_sample_creator = self._create_onnxscript_op_creator(
+            op_type="GridSample",
+            onnxscript_func_builder=grid_sample_builder,  # pass in the builder
+            attributes={
+                "align_corners": ("int64", 0),
+                "mode": ("str", "bilinear"),
+                "padding_mode": ("str", "zeros"),
+            },
+            n_inputs=2,
+            n_outputs=1
+        )
+        return grid_sample_creator
+
+    def _create_scatterelements_op(self):
+        import onnxscript
+        from onnxscript import opset17 as opset
+
+        @onnxscript.script()
+        def scatter_elements_compute(data, indices, updates):
+            indices_i64 = opset.Cast(indices, to=onnxscript.INT64.dtype)
+            return opset.ScatterElements(data, indices_i64, updates)
+
+        scatter_elements_creator = self._create_onnxscript_op_creator(
+            op_type="ScatterElements",
+            onnxscript_func_builder=lambda: scatter_elements_compute,
+            n_inputs=3,
+            n_outputs=1
+        )
+        return scatter_elements_creator
+
+    def _create_randomnormallike_op(self):
+        import onnxscript
+        from onnxscript import opset17 as opset
+
+        def random_normal_like_builder(mean, scale):
+            @onnxscript.script()
+            def random_normal_like_compute(like):
+                return opset.RandomNormalLike(like, mean=mean, scale=scale)
+            return random_normal_like_compute
+
+        # Use the generic onnxscript factory
+        random_normal_like_creator = self._create_onnxscript_op_creator(
+            op_type="RandomNormalLike",
+            onnxscript_func_builder=random_normal_like_builder,  # pass in the builder
+            attributes={
+                "mean": ("float32", 0.0),
+                "scale": ("float32", 1.0),
+            },
+            n_inputs=1,
+            n_outputs=1
+        )
+        return random_normal_like_creator
+
+    def _create_einsum_op(self):
+        import onnxscript
+        from onnxscript import opset17 as opset
+
+        def einsum_builder(equation):
+            @onnxscript.script()
+            def einsum_compute(in1, in2):
+                return opset.Einsum(in1, in2, equation=equation)
+            return einsum_compute
+
+        # Use the generic onnxscript factory
+        einsum_creator = self._create_onnxscript_op_creator(
+            op_type="Einsum",
+            onnxscript_func_builder=einsum_builder,  # pass in the builder
+            attributes={
+                "equation": ("str", ""),
+            },
+            n_inputs=2,
+            n_outputs=1
+        )
+        return einsum_creator
+
+    def register_bundled_ops(self) -> None:
+        """Register the bundled Python custom ops."""
+        if getattr(self, "_custom_ops_registered", False):
+            return
+
+        runtime = self.runtime.rknn_base.rknn_runtime
+        lib = runtime.lib
+        ctx = runtime.context
+
+        try:
+            _ = lib.rknn_register_custom_ops
+            _ = lib.rknn_custom_op_get_op_attr
+        except AttributeError as e:
+            logger.debug(f"SDK does not support custom op registration: {e}")
+            return
+
+        self._init_custom_op_types()
+
+        # Note: plugin-library registration is driven by environment variables
+        # after model load and is intentionally not triggered again here.
+
+        # List of op creator factories
+        op_creator_factories = [
+            self._create_gridsample_op,
+            self._create_scatterelements_op,
+            self._create_randomnormallike_op,
+            self._create_einsum_op,
+            # self._create_my_custom_add_op,  # adding a new op is this simple
+        ]
+
+        ops_to_register = []
+        all_callbacks = []
+
+        for factory in op_creator_factories:
+            try:
+                # Call the factory to obtain the actual constructor
+                creator_func = factory()
+                # Call the constructor to build the op instance
+                op, callbacks = creator_func()
+                ops_to_register.append(op)
+                all_callbacks.extend(callbacks)
+                logger.debug(f"Created custom op: {op.op_type.decode()}")
+            except Exception as e:
+                logger.warning(f"Failed to create custom op: {e}", exc_info=True)
+
+        if not ops_to_register:
+            logger.debug("No custom ops to register")
+            return
+
+        # Pack all ops into a ctypes array and register them in one call
+        num_ops = len(ops_to_register)
+        op_array = (self._RKNN_CustomOp * num_ops)(*ops_to_register)
+        ret = lib.rknn_register_custom_ops(ctx, op_array, num_ops)
+        if ret != 0:
+            logger.error(f"rknn_register_custom_ops returned ret={ret} (possibly spurious, continuing...)")
+            # raise RuntimeError(f"rknn_register_custom_ops failed, ret={ret}")
+
+        logger.info(f"Registered {len(ops_to_register)} custom ops")
+
+        self._custom_ops_registered = True
+        self._registered_ops = ops_to_register
+        self._op_callbacks = all_callbacks
+
+    def _load_and_register_plugin_op(self, so_path: str) -> bool:
+        """Load a single plugin library and register the custom op it exports.
+
+        The plugin must implement get_rknn_custom_op(), returning an
+        rknn_custom_op*. That C pointer is passed straight to
+        rknn_register_custom_ops, avoiding a copy.
+        """
+        if not os.path.isfile(so_path):
+            logger.warning(f"Plugin library not found: {so_path}")
+            return False
+
+        runtime = self.runtime.rknn_base.rknn_runtime
+        lib = runtime.lib
+        ctx = runtime.context
+
+        # Pick the ctypes type of rknn_context based on platform pointer width
+        ContextCType = ctypes.c_uint64 if ctypes.sizeof(ctypes.c_void_p) == 8 else ctypes.c_uint32
+        # Declare rknn_register_custom_ops(ctx, op_ptr, num). The second argument
+        # is passed as void* to sidestep struct-layout mismatches.
+        try:
+            lib.rknn_register_custom_ops.argtypes = [ContextCType, ctypes.c_void_p, ctypes.c_uint32]
+            lib.rknn_register_custom_ops.restype = ctypes.c_int
+        except Exception:
+            pass
+
+        # Load the plugin
+        try:
+            handle = ctypes.CDLL(so_path)
+        except Exception as e:
+            logger.error(f"dlopen failed: {so_path}, err={e}")
+            return False
+
+        # Resolve the get_rknn_custom_op symbol
+        try:
+            get_sym = getattr(handle, "get_rknn_custom_op")
+        except AttributeError:
+            logger.error(f"Plugin is missing symbol get_rknn_custom_op: {so_path}")
+            return False
+
+        # Use void* as the return type so Python never has to parse the
+        # third-party struct layout
+        try:
+            get_sym.argtypes = []
+        except Exception:
+            pass
+        get_sym.restype = ctypes.c_void_p
+
+        op_void_ptr = get_sym()
+        if not op_void_ptr:
+            logger.error(f"get_rknn_custom_op returned a NULL pointer: {so_path}")
+            return False
+
+        # Register with the native pointer directly (zero copy)
+        ctx_val = ContextCType(runtime.context)
+        ret = lib.rknn_register_custom_ops(ctx_val, ctypes.c_void_p(op_void_ptr), 1)
+        if ret != 0:
+            logger.error(f"rknn_register_custom_ops returned ret={ret}, so={so_path} (possibly spurious, continuing...)")
+            # return False
+
+        # Keep the handle alive so the library is not unloaded by GC
+        if not hasattr(self, "_plugin_handles"):
+            self._plugin_handles = []
+        self._plugin_handles.append(handle)
+        logger.info(f"Registered plugin custom op: {so_path}")
+        return True
+
+    def register_plugin_ops(self, plugin_paths: List[str]) -> int:
+        """Register custom ops from the given plugin library paths. Returns the success count."""
+        if not plugin_paths:
+            return 0
+        success = 0
+        for path in plugin_paths:
+            try:
+                if self._load_and_register_plugin_op(path):
+                    success += 1
+            except Exception as e:
+                logger.error(f"Failed to register plugin: {path}, err={e}")
+        return success
+
+    # Public API: register a single custom-op plugin library
+    def register_custom_op_lib(self, path: str) -> bool:
+        return self._load_and_register_plugin_op(path)
+
+    # Public API: scan the Linux system directory and register every plugin
+    # library found there (Android is not handled)
+    def register_system_custom_op_lib(self) -> int:
+        if os.name != 'posix':
+            return 0
+        # Linux only: the official RKNN default directory
+        system_dir = "/usr/lib/rknpu/op_plugins/"
+        if not os.path.isdir(system_dir):
+            return 0
+        try:
+            entries = os.listdir(system_dir)
+        except Exception:
+            return 0
+        so_list = []
+        for name in entries:
+            # The official convention requires filenames to start with librkcst_
+            if name.startswith("librkcst_") and name.endswith('.so'):
+                so_list.append(os.path.join(system_dir, name))
+        return self.register_plugin_ops(so_list)
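The directory-scan filter above can be tested in isolation. A minimal standalone sketch (the function name `find_plugin_libs` is illustrative, not part of this module) of the same prefix/suffix filtering against a throwaway directory:

```python
import os
import tempfile

def find_plugin_libs(system_dir):
    """Collect RKNN custom-op plugin libraries (librkcst_*.so) from a directory."""
    if not os.path.isdir(system_dir):
        return []
    return sorted(
        os.path.join(system_dir, name)
        for name in os.listdir(system_dir)
        if name.startswith("librkcst_") and name.endswith(".so")
    )

# Exercise the filter: only librkcst_*.so files should survive.
with tempfile.TemporaryDirectory() as d:
    for name in ("librkcst_gridsample.so", "libother.so", "librkcst_einsum.txt"):
        open(os.path.join(d, name), "w").close()
    libs = find_plugin_libs(d)
    print([os.path.basename(p) for p in libs])  # ['librkcst_gridsample.so']
```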