---
license: mit
language:
- en
base_model:
- google/siglip2-base-patch16-224
pipeline_tag: zero-shot-classification
---

# SigLIP2

SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, improving semantic understanding, localization, and dense features. You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs and other vision tasks.

The original repo is https://huggingface.co/google/siglip2-base-patch16-224.

This SigLIP2 model has been converted to run on the Axera NPU using **w8a16** quantization.

Compatible with Pulsar2 version: 5.1

## Convert tools links:

For those who are interested in model conversion, you can try to export the axmodel through

- [The repo of AXera Platform](https://github.com/AXERA-TECH/SigLIP2.axera), where you can find the detailed guide
- [Pulsar2 Link, How to Convert ONNX to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/pulsar2/introduction.html)

## Support Platform

- AX650
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://docs.m5stack.com/zh_CN/ai_hardware/LLM-8850_Card)

| Models        | Latency |
| ------------- | ------- |
| Image Encoder | 11.1 ms |
| Text Encoder  | 4.56 ms |

## How to use

Download all files from this repository to the device.

```
root@ax650:~/siglip2-base-patch16-224# tree -L 2
.
├── 000000039769.jpg
├── README.md
├── ax650
│   ├── siglip2-base-patch16-224_text.axmodel
│   └── siglip2-base-patch16-224_vision.axmodel
├── config.json
├── model_convert
│   ├── imagenet-calib.tar
│   ├── siglip2-base-patch16-224_text.json
│   └── siglip2-base-patch16-224_vision.json
├── onnx
│   ├── siglip2-base-patch16-224_text.onnx
│   └── siglip2-base-patch16-224_vision.onnx
├── python
│   ├── axmodel_infer.py
│   ├── export_onnx.py
│   ├── onnx_infer.py
│   ├── requirements.txt
│   └── test.py
└── tokenizer
    ├── config.json
    ├── preprocessor_config.json
    ├── special_tokens_map.json
    ├── tokenizer.json
    └── tokenizer_config.json

5 directories, 20 files
```

### python env requirement

#### pyaxengine

https://github.com/AXERA-TECH/pyaxengine

```
wget https://github.com/AXERA-TECH/pyaxengine/releases/download/0.1.3rc0/axengine-0.1.3-py3-none-any.whl
pip install axengine-0.1.3-py3-none-any.whl
```

#### others

```
pip install -r python/requirements.txt
```

## Inputs

**Text**

```
"a photo of 2 cats", "a photo of 2 dogs"
```

**Image**

![](000000039769.jpg)

## Inference with AX650 Host, such as M4N-Dock(爱芯派Pro)

```
root@ax650:~/siglip2-base-patch16-224# python3 python/axmodel_infer.py
[INFO] Available providers: ['AxEngineExecutionProvider']
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.12.0s
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1 430ee3be
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1 430ee3be
[[1.0596762e-01 1.9978019e-05]]
10.6% that image 0 is 'a photo of 2 cats'
```
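Unlike CLIP, which normalizes scores across all candidate texts with a softmax, SigLIP scores each image-text pair independently with a sigmoid, which is why the probabilities printed above need not sum to 1. Below is a minimal sketch of this scoring step, assuming the two axmodels produce embeddings that can be L2-normalized; `logit_scale` and `logit_bias` here are illustrative placeholder values, not the learned parameters of this checkpoint:

```python
import numpy as np

def siglip_scores(image_emb, text_embs, logit_scale=100.0, logit_bias=-10.0):
    """Pairwise sigmoid scoring in the style of SigLIP.

    image_emb: (D,) image embedding; text_embs: (N, D) text embeddings.
    logit_scale and logit_bias are placeholders, not the real learned values.
    """
    # Normalize so the dot product becomes a cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    logits = image_emb @ text_embs.T * logit_scale + logit_bias
    # Each pair gets an independent probability; the row need not sum to 1
    return 1.0 / (1.0 + np.exp(-logits))

# Toy example with random embeddings in place of the encoder outputs
rng = np.random.default_rng(0)
img = rng.normal(size=768)
txt = rng.normal(size=(2, 768))
probs = siglip_scores(img, txt)
print(probs)  # two independent probabilities, each in [0, 1]
```

In `axmodel_infer.py`, the equivalent of `img` and `txt` would come from the vision and text axmodels respectively, producing the `[[1.0596762e-01 1.9978019e-05]]` style output shown above.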