
Difference in architecture between the inference-time model and the implementation in the transformers code


I checked the code and noticed a difference between the implementation in the transformers source and the model that is actually loaded at inference time.

(1) HuggingFace’s Siglip2VisionEmbeddings in the source code

# inside Siglip2VisionEmbeddings.__init__ in the transformers source
self.patch_embedding = nn.Linear(
    in_features=config.num_channels * self.patch_size * self.patch_size,
    out_features=self.embed_dim,
)

This suggests that the model expects each patch to arrive already flattened to a vector of length num_channels * patch_size * patch_size, which is then projected to embed_dim by a Linear layer.
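
For reference, this is roughly how such a Linear-based patch embedding would consume an image: cut it into non-overlapping patches, flatten each patch, and project it. The nn.Linear usage comes from the snippet above; the 224x224 input size, the flattening code, and the variable names are only illustrative, not taken from the transformers source.

import torch
import torch.nn as nn

# Illustrative sizes (patch_size 14, embed_dim 1152 as in the so400m checkpoint).
patch_size, num_channels, embed_dim = 14, 3, 1152

image = torch.randn(1, num_channels, 224, 224)            # (batch, C, H, W)

# Cut the image into non-overlapping patches and flatten each one.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5)               # (batch, H/ps, W/ps, C, ps, ps)
patches = patches.reshape(1, -1, num_channels * patch_size * patch_size)

patch_embedding = nn.Linear(num_channels * patch_size * patch_size, embed_dim)
embeddings = patch_embedding(patches)
print(embeddings.shape)                                    # torch.Size([1, 256, 1152])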

(2) However, the actual loaded model shows:

(patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14))

At first glance this looks very different: the loaded model uses a Conv2d layer instead of a Linear layer.
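
Despite the different module types, the two layers compute the same projection: a Conv2d whose kernel size and stride both equal the patch size applies one linear map per patch, just with its weight stored as (embed_dim, C, ps, ps) instead of (embed_dim, C * ps * ps). A small sketch (sizes and tensor layout are illustrative) that checks the numerical equivalence:

import torch
import torch.nn as nn

torch.manual_seed(0)
ps, C, D = 14, 3, 1152   # patch size, channels, embed dim (illustrative values)

# Conv2d patch embedding, as in the loaded checkpoint.
conv = nn.Conv2d(C, D, kernel_size=ps, stride=ps)

# Linear patch embedding with the same weights: the Conv2d weight of shape
# (D, C, ps, ps) flattened to (D, C * ps * ps) is exactly the Linear weight.
linear = nn.Linear(C * ps * ps, D)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(D, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(1, C, 2 * ps, 2 * ps)   # a tiny image holding a 2x2 grid of patches

# Conv2d path: (1, D, 2, 2) -> (1, 4, D)
out_conv = conv(x).flatten(2).transpose(1, 2)

# Linear path: flatten each patch to C * ps * ps, then project.
patches = x.unfold(2, ps, ps).unfold(3, ps, ps)                # (1, C, 2, 2, ps, ps)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 4, -1)  # (1, 4, C * ps * ps)
out_linear = linear(patches)

print(torch.allclose(out_conv, out_linear, atol=1e-5))         # True

So the two parameterizations hold the same weights in different shapes; the real difference is which class loads the checkpoint, not what the layer computes.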

Explanation of the difference

HuggingFace’s transformers codebase contains two vision embedding classes:

  1. Siglip2VisionEmbeddings — the newer SigLIP2 implementation, which projects flattened patches with nn.Linear
  2. SiglipVisionEmbeddings — the older SigLIP implementation, which is what the loaded checkpoint actually instantiates (Conv2d)

Although the model name is siglip2-so400m-patch14-384, the checkpoint internally uses:

SiglipVisionTransformer
    └── SiglipVisionEmbeddings (Conv2d)

This means:

  • The public model card says SigLIP2,
  • but the released checkpoint uses the SigLIP v1 patch embedding design,
  • which relies on a Conv2d whose kernel size and stride both equal the patch size (the loading sketch below shows how to confirm this).
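
One way to see which implementation a given checkpoint resolves to is to load it and print the class names. A sketch, assuming the checkpoint lives under the google organization on the Hub (the exact repo id is an assumption) and that the weights can be downloaded:

from transformers import AutoConfig, AutoModel

# Assumed repo id on the Hub; adjust if the checkpoint lives elsewhere.
ckpt = "google/siglip2-so400m-patch14-384"

config = AutoConfig.from_pretrained(ckpt)
print(config.architectures)                           # model class declared by the checkpoint

model = AutoModel.from_pretrained(ckpt)
print(type(model).__name__)                           # SiglipModel vs. Siglip2Model
print(type(model.vision_model.embeddings).__name__)   # SiglipVisionEmbeddings vs. Siglip2VisionEmbeddings
print(model.vision_model.embeddings.patch_embedding)  # Conv2d(...) vs. Linear(...)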

I do not understand why these differences appear.
