
Difference in architecture between the inference-time model and the implementation in the transformers code


I checked the code and noticed a difference between the implementation in the transformers source and the model that is actually loaded at inference time.

(1) HuggingFace’s Siglip2VisionEmbeddings in the source code

# inside Siglip2VisionEmbeddings.__init__ in the transformers source
self.patch_embedding = nn.Linear(
    in_features=config.num_channels * self.patch_size * self.patch_size,
    out_features=self.embed_dim,
)

This suggests that the model expects each patch to arrive already flattened to a vector of length num_channels * patch_size * patch_size, which is then projected to embed_dim by a Linear layer.
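
For reference, this is roughly how such a Linear-based patch embedding would consume an image: cut it into non-overlapping patches, flatten each patch, and project it. The nn.Linear usage comes from the snippet above; the 224x224 input size, the flattening code, and the variable names are only illustrative, not taken from the transformers source.

import torch
import torch.nn as nn

# Illustrative sizes (patch_size 14, embed_dim 1152 as in the so400m checkpoint).
patch_size, num_channels, embed_dim = 14, 3, 1152

image = torch.randn(1, num_channels, 224, 224)            # (batch, C, H, W)

# Cut the image into non-overlapping patches and flatten each one.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5)               # (batch, H/ps, W/ps, C, ps, ps)
patches = patches.reshape(1, -1, num_channels * patch_size * patch_size)

patch_embedding = nn.Linear(num_channels * patch_size * patch_size, embed_dim)
embeddings = patch_embedding(patches)
print(embeddings.shape)                                    # torch.Size([1, 256, 1152])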

(2) However, the actual loaded model shows:

(patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14))

At first glance this looks very different: the loaded model uses a Conv2d layer instead of a Linear layer.
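
Despite the different module types, the two layers compute the same projection: a Conv2d whose kernel size and stride both equal the patch size applies one linear map per patch, just with its weight stored as (embed_dim, C, ps, ps) instead of (embed_dim, C * ps * ps). A small sketch (sizes and tensor layout are illustrative) that checks the numerical equivalence:

import torch
import torch.nn as nn

torch.manual_seed(0)
ps, C, D = 14, 3, 1152   # patch size, channels, embed dim (illustrative values)

# Conv2d patch embedding, as in the loaded checkpoint.
conv = nn.Conv2d(C, D, kernel_size=ps, stride=ps)

# Linear patch embedding with the same weights: the Conv2d weight of shape
# (D, C, ps, ps) flattened to (D, C * ps * ps) is exactly the Linear weight.
linear = nn.Linear(C * ps * ps, D)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(D, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(1, C, 2 * ps, 2 * ps)   # a tiny image holding a 2x2 grid of patches

# Conv2d path: (1, D, 2, 2) -> (1, 4, D)
out_conv = conv(x).flatten(2).transpose(1, 2)

# Linear path: flatten each patch to C * ps * ps, then project.
patches = x.unfold(2, ps, ps).unfold(3, ps, ps)                # (1, C, 2, 2, ps, ps)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 4, -1)  # (1, 4, C * ps * ps)
out_linear = linear(patches)

print(torch.allclose(out_conv, out_linear, atol=1e-5))         # True

So the two parameterizations hold the same weights in different shapes; the real difference is which class loads the checkpoint, not what the layer computes.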

Explanation of the difference

HuggingFace’s transformers codebase contains two vision embedding classes:

  1. Siglip2VisionEmbeddings — the newer SigLIP2 implementation, which projects flattened patches with nn.Linear
  2. SiglipVisionEmbeddings — the older SigLIP implementation, which is what the loaded checkpoint actually instantiates (Conv2d)

Although the model name is siglip2-so400m-patch14-384, the checkpoint internally uses:

SiglipVisionTransformer
    └── SiglipVisionEmbeddings (Conv2d)

This means:

  • The public model card says SigLIP2,
  • but the released checkpoint uses the SigLIP v1 patch embedding design,
  • which relies on a Conv2d whose kernel size and stride both equal the patch size (the loading sketch below shows how to confirm this).
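
One way to see which implementation a given checkpoint resolves to is to load it and print the class names. A sketch, assuming the checkpoint lives under the google organization on the Hub (the exact repo id is an assumption) and that the weights can be downloaded:

from transformers import AutoConfig, AutoModel

# Assumed repo id on the Hub; adjust if the checkpoint lives elsewhere.
ckpt = "google/siglip2-so400m-patch14-384"

config = AutoConfig.from_pretrained(ckpt)
print(config.architectures)                           # model class declared by the checkpoint

model = AutoModel.from_pretrained(ckpt)
print(type(model).__name__)                           # SiglipModel vs. Siglip2Model
print(type(model.vision_model.embeddings).__name__)   # SiglipVisionEmbeddings vs. Siglip2VisionEmbeddings
print(model.vision_model.embeddings.patch_embedding)  # Conv2d(...) vs. Linear(...)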

I do not understand why these differences appear.
