---
license: mit
language:
- en
base_model:
- google/siglip2-base-patch16-224
pipeline_tag: zero-shot-classification
---

# SigLIP2

SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, improving semantic understanding, localization, and dense features. You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs and other vision tasks.

The original repo is https://huggingface.co/google/siglip2-base-patch16-224.

This SigLIP2 model has been converted to run on the Axera NPU using **w8a16** quantization.

Compatible with Pulsar2 version: 5.1

## Convert tools links:

For those who are interested in model conversion, you can try to export the axmodel through

- [The repo of AXera Platform](https://github.com/AXERA-TECH/SigLIP2.axera), where you can find the detailed guide
- [Pulsar2 Link, How to Convert ONNX to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/pulsar2/introduction.html)

## Support Platform

- AX650
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://docs.m5stack.com/zh_CN/ai_hardware/LLM-8850_Card)

| Models        | Latency |
| ------------- | ------- |
| Image Encoder | 11.1 ms |
| Text Encoder  | 4.56 ms |

## How to use

Download all files from this repository to the device.

```
root@ax650:~/siglip2-base-patch16-224# tree -L 2
.
├── 000000039769.jpg
├── README.md
├── ax650
│   ├── siglip2-base-patch16-224_text.axmodel
│   └── siglip2-base-patch16-224_vision.axmodel
├── config.json
├── model_convert
│   ├── imagenet-calib.tar
│   ├── siglip2-base-patch16-224_text.json
│   └── siglip2-base-patch16-224_vision.json
├── onnx
│   ├── siglip2-base-patch16-224_text.onnx
│   └── siglip2-base-patch16-224_vision.onnx
├── python
│   ├── axmodel_infer.py
│   ├── export_onnx.py
│   ├── onnx_infer.py
│   ├── requirements.txt
│   └── test.py
└── tokenizer
    ├── config.json
    ├── preprocessor_config.json
    ├── special_tokens_map.json
    ├── tokenizer.json
    └── tokenizer_config.json

5 directories, 20 files
```

### python env requirement

#### pyaxengine

https://github.com/AXERA-TECH/pyaxengine

```
wget https://github.com/AXERA-TECH/pyaxengine/releases/download/0.1.3rc0/axengine-0.1.3-py3-none-any.whl
pip install axengine-0.1.3-py3-none-any.whl
```

#### others

```
pip install -r python/requirements.txt
```

## Inputs

**Text**

```
"a photo of 2 cats", "a photo of 2 dogs"
```

**Image**

![](000000039769.jpg)

## Inference with AX650 Host, such as M4N-Dock(爱芯派Pro)

```
root@ax650:~/siglip2-base-patch16-224# python3 python/axmodel_infer.py
[INFO] Available providers: ['AxEngineExecutionProvider']
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.12.0s
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1 430ee3be
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1 430ee3be
[[1.0596762e-01 1.9978019e-05]]
10.6% that image 0 is 'a photo of 2 cats'
```
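Unlike CLIP, which normalizes scores across all candidate texts with a softmax, SigLIP scores each image-text pair independently with a sigmoid, which is why the probabilities printed above need not sum to 1. Below is a minimal sketch of this scoring step, assuming the two axmodels produce embeddings that can be L2-normalized; `logit_scale` and `logit_bias` here are illustrative placeholder values, not the learned parameters of this checkpoint:

```python
import numpy as np

def siglip_scores(image_emb, text_embs, logit_scale=100.0, logit_bias=-10.0):
    """Pairwise sigmoid scoring in the style of SigLIP.

    image_emb: (D,) image embedding; text_embs: (N, D) text embeddings.
    logit_scale and logit_bias are placeholders, not the real learned values.
    """
    # Normalize so the dot product becomes a cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    logits = image_emb @ text_embs.T * logit_scale + logit_bias
    # Each pair gets an independent probability; the row need not sum to 1
    return 1.0 / (1.0 + np.exp(-logits))

# Toy example with random embeddings in place of the encoder outputs
rng = np.random.default_rng(0)
img = rng.normal(size=768)
txt = rng.normal(size=(2, 768))
probs = siglip_scores(img, txt)
print(probs)  # two independent probabilities, each in [0, 1]
```

In `axmodel_infer.py`, the equivalent of `img` and `txt` would come from the vision and text axmodels respectively, producing the `[[1.0596762e-01 1.9978019e-05]]` style output shown above.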