Difference in architecture between the inference-time model and the implementation in the transformers code
#7, opened by DavidNguyen
I checked the code and noticed a difference between the two implementations.
(1) Siglip2VisionEmbeddings in the Hugging Face transformers source code:
self.patch_embedding = nn.Linear(
    in_features=config.num_channels * self.patch_size * self.patch_size,
    out_features=self.embed_dim,
)
This suggests that the model expects flattened patches and projects them with a Linear layer.
(2) However, the actual loaded model shows:
(patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14))
This is very different: it's using a Conv2d layer instead of Linear.
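The two layers look different but can express the same projection: a Conv2d whose kernel size and stride both equal the patch size computes one dot product per non-overlapping patch, which is what a Linear layer computes on a flattened patch. A minimal plain-Python sketch with toy sizes (all names and values below are illustrative, not taken from transformers) shows the two giving the same tokens when each kernel is flattened in the same (channel, row, column) order as the patches. It says nothing about whether the two transformers classes flatten patches in that order or treat position embeddings the same way.

```python
# Toy sizes; the loaded layer above uses 3 channels, 14x14 patches, 1152 outputs.
CHANNELS, PATCH, EMBED_DIM, SIZE = 2, 2, 3, 4

# Deterministic toy image [c][y][x] and conv kernel [out][c][ky][kx] (made-up values).
image = [[[(c + 1) * (y * SIZE + x) for x in range(SIZE)] for y in range(SIZE)]
         for c in range(CHANNELS)]
kernel = [[[[(o + 1) + c - ky + 2 * kx for kx in range(PATCH)] for ky in range(PATCH)]
           for c in range(CHANNELS)] for o in range(EMBED_DIM)]
bias = [0.5 * o for o in range(EMBED_DIM)]


def conv_patch_embed(image, kernel, bias):
    """Convolution with kernel size == stride == PATCH and no padding."""
    tokens = []
    for py in range(0, SIZE, PATCH):
        for px in range(0, SIZE, PATCH):
            token = []
            for o in range(EMBED_DIM):
                acc = bias[o]
                for c in range(CHANNELS):
                    for ky in range(PATCH):
                        for kx in range(PATCH):
                            acc += kernel[o][c][ky][kx] * image[c][py + ky][px + kx]
                token.append(acc)
            tokens.append(token)
    return tokens


def flatten_patches(image):
    """Cut the image into non-overlapping patches, each flattened in (c, y, x) order."""
    return [
        [image[c][py + ky][px + kx]
         for c in range(CHANNELS) for ky in range(PATCH) for kx in range(PATCH)]
        for py in range(0, SIZE, PATCH) for px in range(0, SIZE, PATCH)
    ]


def linear_patch_embed(patches, weight, bias):
    """Plain linear layer: one dot product per output feature per patch."""
    return [[sum(w * v for w, v in zip(row, patch)) + b for row, b in zip(weight, bias)]
            for patch in patches]


# Flatten each conv kernel in the same (c, y, x) order to get the linear weight matrix.
weight = [[kernel[o][c][ky][kx]
           for c in range(CHANNELS) for ky in range(PATCH) for kx in range(PATCH)]
          for o in range(EMBED_DIM)]

conv_tokens = conv_patch_embed(image, kernel, bias)
linear_tokens = linear_patch_embed(flatten_patches(image), weight, bias)
```

In PyTorch terms, the matching step would be reshaping the Conv2d weight from (out_channels, in_channels, kH, kW) to (out_channels, in_channels * kH * kW). So the Conv2d versus Linear difference by itself does not imply a different projection, only a different input layout.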
Explanation of the difference
HuggingFace’s codebase contains two vision embedding classes:
- Siglip2VisionEmbeddings: used for the SigLIP2 architecture
- SiglipVisionEmbeddings (older SigLIP): used in the actual checkpoint you loaded
Although the model name is siglip2-so400m-patch14-384, the checkpoint internally uses:
SiglipVisionTransformer
└── SiglipVisionEmbeddings (Conv2d)
This means:
- The public model card says SigLIP2,
- but the released checkpoint uses the SigLIP v1 patch embedding design,
- which relies on Conv2d with kernel = patch size.
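Which class gets instantiated is decided by the checkpoint's config.json rather than by the repository name, since the Auto classes in transformers dispatch on its model_type field. A minimal sketch of reading those fields follows; the JSON string is a hypothetical stand-in written for illustration, not the actual file from the checkpoint, and summarize_config is a made-up helper name.

```python
import json

# Hypothetical config.json contents, for illustration only (not copied from the checkpoint).
SAMPLE_CONFIG = """
{
  "architectures": ["SiglipModel"],
  "model_type": "siglip",
  "vision_config": {"model_type": "siglip_vision_model", "patch_size": 14}
}
"""


def summarize_config(config_text):
    """Pull out the fields that determine which model classes are built."""
    config = json.loads(config_text)
    return {
        "model_type": config.get("model_type"),
        "architectures": config.get("architectures", []),
        "vision_model_type": config.get("vision_config", {}).get("model_type"),
    }


summary = summarize_config(SAMPLE_CONFIG)
print(summary)
```

Running the same helper on the real config.json downloaded from the model repository would show which architecture the checkpoint declares.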
I do not understand why the released checkpoint and the Siglip2 code differ in this way.