Model Card for Gemma3 4B with 448x448 SigLIP Image Encoder

Gemma3 4B with 448x448 SigLIP Vision Tower for edge deployment.

Model Details

This model is based on google/gemma-3-4b-it, with the 896x896 SigLIP vision encoder of the original Gemma3 architecture replaced by the 448x448 SigLIP model found in Efficient-Large-Model/paligemma-siglip-so400m-patch14-448. To keep the 256 image tokens, the average pooling layer at the output of the vision tower is adjusted for the 448x448 SigLIP model, which outputs fewer patch embeddings (1024 patches instead of the 4096 produced by the 896x896 version). Details about the implementation can be found in the original Gemma3 Technical Report (Chapter 5.5, Vision encoder) [https://arxiv.org/pdf/2503.19786].
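The patch and pooling arithmetic described above can be sketched as follows. This is a minimal illustration, not the model's actual code: it assumes a patch size of 14 (taken from "patch14" in the encoder name), square inputs, and non-overlapping average pooling with stride equal to the kernel size. The function names `pooling_plan` and `avg_pool_2d` are hypothetical.

```python
import math


def pooling_plan(image_size, patch_size=14, target_tokens=256):
    """Return (patches_per_side, total_patches, pool_kernel) for a square input.

    Assumption: the pooling is non-overlapping (stride == kernel), so the
    kernel is simply patches_per_side / tokens_per_side.
    """
    patches_per_side = image_size // patch_size
    tokens_per_side = math.isqrt(target_tokens)
    if tokens_per_side * tokens_per_side != target_tokens:
        raise ValueError("target_tokens must be a perfect square")
    if patches_per_side % tokens_per_side != 0:
        raise ValueError("patch grid is not divisible by the token grid")
    kernel = patches_per_side // tokens_per_side
    return patches_per_side, patches_per_side ** 2, kernel


def avg_pool_2d(grid, kernel):
    """Non-overlapping average pooling over a square grid of scalars."""
    out = len(grid) // kernel
    area = kernel * kernel
    return [
        [
            sum(
                grid[r * kernel + dr][c * kernel + dc]
                for dr in range(kernel)
                for dc in range(kernel)
            ) / area
            for c in range(out)
        ]
        for r in range(out)
    ]


# Original encoder: 896 / 14 = 64 patches per side, pooled 4x4 down to 16x16.
plan_896 = pooling_plan(896)
# This model: 448 / 14 = 32 patches per side, pooled 2x2 down to 16x16.
plan_448 = pooling_plan(448)

# Toy stand-in for the 32x32 patch grid (one scalar per patch, not real embeddings).
toy_grid = [[1.0] * 32 for _ in range(32)]
pooled = avg_pool_2d(toy_grid, plan_448[2])
```

Under these assumptions both encoders end up with a 16x16 grid, that is, 256 image tokens; only the pooling kernel changes (4 for 896x896, 2 for 448x448).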

Safetensors
Model size: 4B params
Tensor type: BF16

Model tree for simaai/gemma3-siglip448