Model Card for Gemma3 4B with 448x448 SigLIP Image Encoder

Gemma3 4B with a 448x448 SigLIP vision tower, intended for edge deployment.

Model Details

This model is based on google/gemma-3-4b-it, with the 896x896 SigLIP vision encoder of the original Gemma3 architecture replaced by the 448x448 SigLIP model from Efficient-Large-Model/paligemma-siglip-so400m-patch14-448. The 448x448 encoder produces fewer output patch embeddings (1024 instead of the 4096 of the 896x896 version), so the average pooling layer at the output of the vision tower is adjusted to still yield 256 image tokens. Details about the implementation can be found in the original Gemma3 Technical Report (Chapter 5.5, Vision encoder) [https://arxiv.org/pdf/2503.19786].
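The patch and token counts above can be checked with a little arithmetic. The following is a minimal sketch, not the model's actual implementation: the function names `pooling_plan` and `avg_pool_grid` are hypothetical, the patch size of 14 is taken from the "patch14" in the encoder name, and the pooling is assumed to use square, non-overlapping windows (the model card does not state the kernel size).

```python
import math


def pooling_plan(image_size, patch_size=14, target_tokens=256):
    """Derive the pooling window needed to reach `target_tokens` image tokens.

    Assumption: square, non-overlapping average pooling over the patch grid.
    """
    patches_per_side = image_size // patch_size
    tokens_per_side = math.isqrt(target_tokens)
    if tokens_per_side * tokens_per_side != target_tokens:
        raise ValueError("target_tokens must be a perfect square")
    if patches_per_side % tokens_per_side != 0:
        raise ValueError("patch grid is not divisible by the token grid")
    return {
        "patches_per_side": patches_per_side,
        "num_patches": patches_per_side * patches_per_side,
        "kernel": patches_per_side // tokens_per_side,
    }


def avg_pool_grid(grid, kernel):
    """Average-pool a square 2D grid of numbers with a kernel x kernel window.

    Each number stands in for one patch embedding; a real vision tower would
    pool whole embedding vectors in the same way.
    """
    side = len(grid)
    pooled = []
    for r in range(0, side, kernel):
        row = []
        for c in range(0, side, kernel):
            window = [grid[r + i][c + j] for i in range(kernel) for j in range(kernel)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    for size in (896, 448):
        print(size, pooling_plan(size))

    # Toy 32x32 grid standing in for the 1024 patch embeddings of the 448x448 encoder.
    plan = pooling_plan(448)
    side = plan["patches_per_side"]
    toy = [[float(r * side + c) for c in range(side)] for r in range(side)]
    pooled = avg_pool_grid(toy, plan["kernel"])
    print(len(pooled) * len(pooled[0]), "tokens after pooling")
```

Under these assumptions, the 896x896 encoder needs a 4x4 pooling window (64x64 patches down to 16x16), while the 448x448 encoder needs only a 2x2 window (32x32 patches down to 16x16), so both end up with the same 256 image tokens.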

Safetensors

Model size: 4B params

Tensor type: BF16

Model tree for simaai/gemma3-siglip448

Base model: Efficient-Large-Model/paligemma-siglip-so400m-patch14-448