Instructions for using google/siglip2-so400m-patch14-384 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/siglip2-so400m-patch14-384 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("zero-shot-image-classification", model="google/siglip2-so400m-patch14-384")
pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png",
    candidate_labels=["animals", "humans", "landscape"],
)

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("google/siglip2-so400m-patch14-384", dtype="auto")
- Notebooks
- Google Colab
- Kaggle
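Complementing the pipeline snippet above, the processor and model can also be called directly. A minimal sketch, assuming the standard SigLIP-style API in Transformers (the sigmoid readout and the padding/max_length values mirror the usage shown in the Transformers SigLIP2 documentation):

import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["animals", "humans", "landscape"]

inputs = processor(text=candidate_labels, images=image, padding="max_length", max_length=64, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP-style models score image-text pairs with a sigmoid rather than a softmax
probs = torch.sigmoid(outputs.logits_per_image)
print(dict(zip(candidate_labels, probs[0].tolist())))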
Difference in architecture between the inference-time model and the implementation in the transformers code
#7 opened by DavidNguyen
I checked the code and noticed a difference between the Transformers source implementation and the model that is actually loaded at inference time.
(1) HuggingFace’s Siglip2VisionEmbeddings in the source code:
self.patch_embedding = nn.Linear(
in_features=config.num_channels * self.patch_size * self.patch_size,
out_features=self.embed_dim,
)
This suggests that the model expects flattened patches and projects them with a Linear layer.
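For intuition, here is a minimal sketch (not the library's code; the sizes simply mirror the layer above) of projecting pre-flattened patches with a Linear layer:

import torch
import torch.nn as nn

num_channels, patch_size, embed_dim = 3, 14, 1152   # values implied by the layer above
num_patches = 256                                    # arbitrary, just for illustration

patch_embedding = nn.Linear(num_channels * patch_size * patch_size, embed_dim)

# Each patch arrives already flattened to a vector of length 3 * 14 * 14 = 588
flat_patches = torch.randn(1, num_patches, num_channels * patch_size * patch_size)
embeddings = patch_embedding(flat_patches)           # -> (1, 256, 1152)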
(2) However, the actual loaded model shows:
(patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14))
This is very different: it's using a Conv2d layer instead of Linear.
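For comparison, a minimal sketch of what that printed Conv2d layer does to an image batch (the 384x384 input size is taken from the checkpoint name; the shapes are otherwise illustrative):

import torch
import torch.nn as nn

# Mirrors the printed layer: Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14))
patch_embedding = nn.Conv2d(3, 1152, kernel_size=14, stride=14)

pixel_values = torch.randn(1, 3, 384, 384)        # one 384x384 RGB image
features = patch_embedding(pixel_values)          # -> (1, 1152, 27, 27)
embeddings = features.flatten(2).transpose(1, 2)  # -> (1, 729, 1152): one vector per patch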
Explanation of the difference
HuggingFace’s codebase contains two vision embedding classes:
- Siglip2VisionEmbeddings — used for the SigLIP2 architecture
- SiglipVisionEmbeddings (older SigLIP) — used in the actual checkpoint you loaded
Although the model name is siglip2-so400m-patch14-384, the checkpoint internally uses:
SiglipVisionTransformer
└── SiglipVisionEmbeddings (Conv2d)
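This can be confirmed by loading the checkpoint and printing the embedding module (the attribute path assumes the standard SiglipModel layout shown above):

from transformers import AutoModel

model = AutoModel.from_pretrained("google/siglip2-so400m-patch14-384")

# Inspect which embedding class and patch projection the checkpoint actually instantiates
print(type(model.vision_model.embeddings).__name__)   # e.g. SiglipVisionEmbeddings
print(model.vision_model.embeddings.patch_embedding)  # e.g. Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14))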
This means:
- The public model card says SigLIP2,
- but the released checkpoint uses the SigLIP v1 patch embedding design,
- which relies on Conv2d with kernel = patch size.
I do not understand why these differences appear.
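For context, a Conv2d whose kernel size equals its stride is the same linear map as flattening each patch and applying a Linear layer with reshaped weights, so the two designs can produce identical embeddings; a minimal sketch of that equivalence, with illustrative sizes:

import torch
import torch.nn as nn

channels, patch, dim = 3, 14, 1152

conv = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)

# Build a Linear layer that reuses the Conv2d weights, reshaped to (dim, channels * patch * patch)
linear = nn.Linear(channels * patch * patch, dim)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(dim, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(1, channels, 2 * patch, 2 * patch)             # a 2x2 grid of patches

via_conv = conv(x).flatten(2).transpose(1, 2)                  # (1, 4, dim)

patches = x.unfold(2, patch, patch).unfold(3, patch, patch)    # (1, C, 2, 2, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 4, -1)  # (1, 4, C * patch * patch)
via_linear = linear(patches)

print(torch.allclose(via_conv, via_linear, atol=1e-4))         # True (up to float tolerance)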