---
license: apache-2.0
base_model: google/siglip2-base-patch16-224
tags:
- onnx
- vision
- image-text-matching
- nebula
---

# siglip2-base-patch16-224 (ONNX)

This is [Google's SigLIP 2 base/224](https://huggingface.co/google/siglip2-base-patch16-224) exported to ONNX format for CPU inference. It is used by [Nebula](https://github.com/diegohh0411/nebula) for local, offline image search.

## What's inside

| File | Description |
|---|---|
| `model.onnx` | Combined vision + text encoder (~110 MB) |
| `tokenizer.json` | SigLIP tokenizer |

## Model inputs & outputs

The single `model.onnx` file contains both encoders. To run one encoder on its own, pass a dummy tensor for the other encoder's input and ignore that branch's outputs.

**Inputs**

| Name | Shape | dtype |
|---|---|---|
| `pixel_values` | `[image_batch, 3, 224, 224]` | float32 |
| `input_ids` | `[text_batch, seq_len]` | int64 |

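As a rough illustration of the dummy-tensor approach described above, the sketch below builds an input feed for one branch and zero-fills the other. The helper names `build_feeds` and `run_branch`, the dummy sequence length of 64, and the use of zeros as filler are assumptions and not part of this repository. `session` is expected to behave like an `onnxruntime.InferenceSession` loaded from `model.onnx`.

```python
import numpy as np

# Assumed dummy shapes: one blank 224x224 image and one all-zero token row.
# The sequence length of 64 is an assumption; adjust it if the export expects another length.
DUMMY_IMAGE_SHAPE = (1, 3, 224, 224)
DUMMY_TEXT_SHAPE = (1, 64)


def build_feeds(pixel_values=None, input_ids=None):
    """Return the input dict for model.onnx, zero-filling the unused branch."""
    if pixel_values is None and input_ids is None:
        raise ValueError("provide pixel_values, input_ids, or both")
    if pixel_values is None:
        pixel_values = np.zeros(DUMMY_IMAGE_SHAPE, dtype=np.float32)
    if input_ids is None:
        input_ids = np.zeros(DUMMY_TEXT_SHAPE, dtype=np.int64)
    return {
        "pixel_values": np.asarray(pixel_values, dtype=np.float32),
        "input_ids": np.asarray(input_ids, dtype=np.int64),
    }


def run_branch(session, output_name, **inputs):
    """Run the session and return only the requested output array."""
    feeds = build_feeds(**inputs)
    return session.run([output_name], feeds)[0]
```

With `onnxruntime` installed, something like `session = onnxruntime.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])` followed by `run_branch(session, "text_embeds", input_ids=ids)` should return text embeddings while the zero-filled image is ignored.
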
**Outputs**

| Name | Shape | dtype | Description |
|---|---|---|---|
| `image_embeds` | `[image_batch, 768]` | float32 | Image embedding (can be L2-normalized for cosine similarity) |
| `text_embeds` | `[text_batch, 768]` | float32 | Text embedding (can be L2-normalized for cosine similarity) |
| `logits_per_image` | `[image_batch, text_batch]` | float32 | Image-text similarity logits (cosine similarity with SigLIP's learned scale and bias applied) |
| `logits_per_text` | `[text_batch, image_batch]` | float32 | The same logits, transposed |

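For search-style use, one option is to compare embeddings directly: normalize each row to unit length and take dot products, which gives plain cosine similarity. The NumPy sketch below uses toy 3-dimensional vectors in place of real 768-dimensional embeddings; `l2_normalize` and `rank_images` are hypothetical helper names, not part of this repository.

```python
import numpy as np


def l2_normalize(embeds, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against division by zero."""
    embeds = np.asarray(embeds, dtype=np.float32)
    norms = np.linalg.norm(embeds, axis=-1, keepdims=True)
    return embeds / np.maximum(norms, eps)


def rank_images(text_embed, image_embeds):
    """Return (indices, scores) of images sorted by cosine similarity, best first."""
    text = l2_normalize(np.asarray(text_embed).reshape(1, -1))
    images = l2_normalize(image_embeds)
    scores = (images @ text.T).ravel()  # one cosine similarity per image
    order = np.argsort(-scores)         # descending by score
    return order, scores[order]


# Toy vectors stand in for real text and image embeddings.
query = [1.0, 0.0, 0.0]
gallery = [[0.0, 2.0, 0.0], [3.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
order, scores = rank_images(query, gallery)
print(order.tolist())  # [1, 2, 0]
```

Scores computed this way are raw cosine similarities, so they may not match the values in the model's logits outputs if those include additional scaling.
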
## How it was exported

```bash
optimum-cli export onnx \
  --model google/siglip2-base-patch16-224 \
  --task zero-shot-image-classification \
  --opset 18 \
  ./models/
```

Requires `optimum[onnxruntime]` and `transformers`.

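After exporting, it can be worth checking that the graph exposes the input and output names documented in this card. The sketch below assumes an object that behaves like an `onnxruntime.InferenceSession`, whose `get_inputs()` and `get_outputs()` return entries with `name`, `shape`, and `type` attributes. `describe_io` and `missing_names` are hypothetical helper names.

```python
# Names taken from the inputs and outputs tables in this card.
EXPECTED_INPUTS = {"pixel_values", "input_ids"}
EXPECTED_OUTPUTS = {"image_embeds", "text_embeds", "logits_per_image", "logits_per_text"}


def describe_io(session):
    """Map each input/output name to its (shape, type) as reported by the session."""
    inputs = {arg.name: (arg.shape, arg.type) for arg in session.get_inputs()}
    outputs = {arg.name: (arg.shape, arg.type) for arg in session.get_outputs()}
    return inputs, outputs


def missing_names(session):
    """Return the documented names that the session does not expose, sorted."""
    inputs, outputs = describe_io(session)
    missing = (EXPECTED_INPUTS - set(inputs)) | (EXPECTED_OUTPUTS - set(outputs))
    return sorted(missing)
```

Assuming the export wrote `./models/model.onnx`, calling `missing_names(onnxruntime.InferenceSession("./models/model.onnx"))` should return an empty list when the exported graph matches the tables above.
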
## License

Inherits [Apache 2.0](https://huggingface.co/google/siglip2-base-patch16-224) from the original Google SigLIP 2 model.