Error loading the SigLIP2 Vision model
Description
Running the code snippet provided in the documentation for Siglip2VisionModel yields a RuntimeError due to shape mismatch when loading the model checkpoint.
Steps to Reproduce
- Install transformers library
- Run the following code:
from transformers import Siglip2VisionModel
model = Siglip2VisionModel.from_pretrained("google/siglip2-base-patch16-224")
Expected Behavior
The model should load successfully without errors.
Actual Behavior
The following error is raised:
You are using a model of type siglip_vision_model to instantiate a model of type siglip2_vision_model.
This is not supported for all configurations of models and can yield errors.
[...]
RuntimeError: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([768, 3, 16, 16])
from checkpoint, the shape in current model is torch.Size([768, 768]).
Additional Investigation
Loading a SigLIP2 checkpoint with AutoModel silently falls back to the wrong classes, instantiating SigLIP classes instead of SigLIP2 ones:
from transformers import AutoModel
model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
model.vision_model.__class__
# Output: transformers.models.siglip.modeling_siglip.SiglipVisionTransformer
Root Cause Analysis
I believe this is because this checkpoint, like most other SigLIP2 checkpoints, is defined in src/transformers/models/siglip/convert_siglip_to_hf.py rather than in src/transformers/models/siglip2/convert_siglip2_to_hf.py.
Proposed Solution
Porting the SigLIP2 checkpoints to the SigLIP2 conversion file (src/transformers/models/siglip2/convert_siglip2_to_hf.py) may fix the error.
Environment
- transformers version: 4.52.4
- PyTorch version: 2.6.0
- Python version: 3.11.11
- Operating System: linux
Additional Context
This affects the google/siglip2-base-patch16-224 checkpoint and all other SigLIP2 checkpoints except google/siglip2-base-patch16-naflex and google/siglip2-so400m-patch16-naflex, which are correctly defined in src/transformers/models/siglip2/convert_siglip2_to_hf.py.
Hello, I've also encountered the same problem. Could you please tell me how you solved it?
Unfortunately, I haven't. I'm currently using checkpoints from outside the Hugging Face Hub transformers integration. The following works for me with open_clip_torch==3.2.0 and timm==1.0.15:
from open_clip import create_model_from_pretrained # works on open-clip-torch >= 2.31.0, timm >= 1.0.15
siglip2, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-L-16-SigLIP2-512')
Thanks, I just saw this discussion. It says that Siglip2Model works only for the -naflex checkpoints, and the other checkpoints should be loaded with SiglipModel (the same applies to SiglipVisionModel / Siglip2VisionModel).
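Based on that discussion, the rule can be sketched as a small helper that picks the class name from the checkpoint id. This helper is hypothetical (not part of transformers) and assumes the behavior described above: only the two -naflex SigLIP2 checkpoints use the Siglip2 classes, and everything else needs the original Siglip classes.

```python
def pick_vision_class_name(checkpoint: str) -> str:
    """Return the transformers vision class name expected to load `checkpoint`.

    Hypothetical helper illustrating the rule from the linked discussion:
    only -naflex SigLIP2 checkpoints load with the Siglip2 classes.
    """
    if "siglip2" in checkpoint and checkpoint.endswith("-naflex"):
        return "Siglip2VisionModel"
    return "SiglipVisionModel"

print(pick_vision_class_name("google/siglip2-base-patch16-naflex"))  # Siglip2VisionModel
print(pick_vision_class_name("google/siglip2-base-patch16-224"))     # SiglipVisionModel
```

So for google/siglip2-base-patch16-224 the workaround is simply to use SiglipVisionModel.from_pretrained(...) instead of Siglip2VisionModel.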