---
license: apache-2.0
tags:
- vision
---

# SigLIP 2 Base

[SigLIP 2](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107) extends the pretraining objective of [SigLIP](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba) with prior, independently developed techniques into a unified recipe, for improved semantic understanding, localization, and dense features.

## Intended uses

You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).

Here is how to use this model to perform zero-shot image classification:

```python
from transformers import pipeline
from PIL import Image
import requests

# load pipeline
ckpt = "google/siglip2-base-patch16-naflex"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# load image and candidate labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["2 cats", "a plane", "a remote"]

# run inference; returns a list of {"score", "label"} dicts
outputs = image_classifier(image, candidate_labels=candidate_labels)
print(outputs)
```

You can encode an image using the Vision Tower like so:

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load the model and processor
ckpt = "google/siglip2-base-patch16-naflex"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# run inference
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
```
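
The intended uses above also mention image-text retrieval. Below is a minimal sketch of how you could score candidate captions against the image embedding. It reuses `model`, `processor`, and `image_embeddings` from the previous example; the candidate captions and the `padding="max_length", max_length=64` tokenizer settings are assumptions made for illustration, not something prescribed by this card.

```python
import torch

# hypothetical candidate captions to score against the image
texts = ["a photo of a bear", "a photo of a city street"]

# tokenize the captions (padding settings are an assumption, see note above)
text_inputs = processor(text=texts, padding="max_length", max_length=64, return_tensors="pt").to(model.device)

# encode the captions with the text tower
with torch.no_grad():
    text_embeddings = model.get_text_features(**text_inputs)

# cosine similarity between the image and each caption
image_norm = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)
text_norm = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
scores = image_norm @ text_norm.T
print(scores)  # one row per image, one column per caption; higher is a better match
```

If you want calibrated scores rather than raw similarities, the zero-shot classification pipeline shown earlier applies the model's learned temperature and bias for you.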

For more code examples, we refer to the [siglip2 documentation](https://huggingface.co/transformers/main/model_doc/siglip2.html#).

## Training procedure

SigLIP 2 adds some clever training objectives on top of SigLIP (a sketch of the base sigmoid loss they build on follows the list):

1. Decoder loss
2. Global-local and masked prediction loss
3. Aspect ratio and resolution adaptability
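
These additions sit on top of SigLIP's pairwise sigmoid loss, which replaces the usual softmax contrastive objective. The snippet below is a minimal, illustrative sketch of that base loss as described in the SigLIP paper, not the training code used for this model; the function and variable names are made up for the example.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(image_embeds, text_embeds, logit_scale, logit_bias):
    """Pairwise sigmoid loss over all image-text pairs in a batch (illustrative sketch).

    image_embeds, text_embeds: (N, D) L2-normalized embeddings, where pair i is a match.
    logit_scale, logit_bias: learnable scalars (temperature and bias).
    """
    n = image_embeds.shape[0]
    # (N, N) similarity matrix: entry (i, j) scores image i against text j
    logits = image_embeds @ text_embeds.T * logit_scale + logit_bias
    # +1 on the diagonal (matched pairs), -1 everywhere else
    labels = 2 * torch.eye(n, device=logits.device) - 1
    # every image-text pair is treated as an independent binary classification problem
    return -F.logsigmoid(labels * logits).sum() / n
```

SigLIP 2 then adds the decoder, global-local, and masked-prediction objectives from the list above on top of this base recipe.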

### Training data

SigLIP 2 is pre-trained on the WebLI dataset [(Chen et al., 2023)](https://arxiv.org/abs/2209.06794).

### Compute

The model was trained on up to 2048 TPU-v5e chips.

## Evaluation results

Evaluation of SigLIP 2 is shown below (taken from the paper).

[Evaluation Table](TODO)

### BibTeX entry and citation info

```bibtex
TODO
```