---
license: gemma
tags:
- image-feature-extraction
- siglip
base_model: google/gemma-3-4b-pt
library_name: transformers
---

# Gemma 3 Vision Encoder (extracted)

This repository contains the SigLIP-family vision encoder extracted from **google/gemma-3-4b-pt**. It also includes the Gemma multimodal projector weights (state_dict) and a small metadata file.

## Contents

- `config.json`, `model.safetensors`: the SigLIP vision encoder
- `preprocessor_config.json`: the image processor settings used by Gemma 3
- `projector_state_dict.pt`: PyTorch state dict for the Gemma projector
- `projector_config.json`: metadata (class, dims, token count if detected)
- `NOTICE`: pointer to the Gemma Terms

## Basic usage (encoder as feature extractor)

```python
from transformers import SiglipVisionModel, AutoImageProcessor
from PIL import Image
import torch

repo_id = "<user>/<repo>"  # this repository's Hub id

encoder = SiglipVisionModel.from_pretrained(repo_id).eval()
processor = AutoImageProcessor.from_pretrained(repo_id)

img = Image.open("test.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    feats = encoder(**inputs).last_hidden_state  # (B, Tv, Dv)
print(feats.shape)
```

## Using the projector (Gemma-style multimodal path)

The projector here is provided as a **state dict** plus metadata. It is intended for users who are wiring up a Gemma-style VLM, where the projector maps the vision sequence to a fixed number of image tokens at the LLM hidden size.

Two common paths:

1) **Use with Transformers' Gemma 3 model**: load the full VLM, then load this projector's state dict into the model's `multi_modal_projector` module.

```python
import torch
from transformers import Gemma3ForConditionalGeneration

vlm = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-pt", device_map="cpu")
sd = torch.load("projector_state_dict.pt", map_location="cpu")  # or from the repo checkout
vlm.multi_modal_projector.load_state_dict(sd, strict=False)
vlm.eval()
```

2) **Recreate the projector module from the class name**: instantiate it, then load the state dict. The metadata file records the fully qualified class name (FQN).

```python
import importlib, json, torch

with open("projector_config.json", "r") as f:
    meta = json.load(f)

fqn = meta.get("projector_fqn")  # e.g., 'transformers.models.gemma3.modeling_gemma3.Gemma3MultiModalProjector'
mod_name, cls_name = fqn.rsplit(".", 1)
cls = getattr(importlib.import_module(mod_name), cls_name)

# Only works if the recorded class accepts dims/token counts as keyword arguments.
projector = cls(**{k: v for k, v in meta.items() if k.endswith("_dim") or k.endswith("_tokens")})
sd = torch.load("projector_state_dict.pt", map_location="cpu")
projector.load_state_dict(sd, strict=False)
projector.eval()
```

Note that Transformers' own `Gemma3MultiModalProjector` takes a model config object rather than keyword dims, so path 1 (or the config-based sketch at the end of this card) is the more robust route when using the stock class.

## Shapes (for reference)

- Vision hidden size Dv: 1152
- Projector output tokens Ti: 256
- Projector class: `transformers.models.gemma3.modeling_gemma3.Gemma3MultiModalProjector`

## License / Terms

See the `NOTICE` file. Gemma is provided under and subject to the Gemma Terms of Use: https://ai.google.dev/gemma/terms
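
## Appendix: end-to-end sanity check

The sketch below is a minimal, non-authoritative way to run the encoder and projector together and confirm the shapes listed above, without loading the full 4B VLM. It assumes the exported state dict matches Transformers' stock `Gemma3MultiModalProjector`; the repo id and `test.jpg` are placeholders to adjust for your setup.

```python
# Minimal sketch: build just the projector from the base model's config,
# fetch the exported weights from the Hub, and run one image end to end.
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoConfig, AutoImageProcessor, SiglipVisionModel
from transformers.models.gemma3.modeling_gemma3 import Gemma3MultiModalProjector

repo_id = "<user>/<repo>"  # this repository's Hub id (placeholder)

encoder = SiglipVisionModel.from_pretrained(repo_id).eval()
processor = AutoImageProcessor.from_pretrained(repo_id)

# Instantiate the projector from the base model's config (no 4B weights needed),
# then load the exported state dict downloaded from this repo.
cfg = AutoConfig.from_pretrained("google/gemma-3-4b-pt")
projector = Gemma3MultiModalProjector(cfg)
sd_path = hf_hub_download(repo_id=repo_id, filename="projector_state_dict.pt")
# strict=False tolerates key-name drift between the export and the stock class.
projector.load_state_dict(torch.load(sd_path, map_location="cpu"), strict=False)
projector.eval()

img = Image.open("test.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    feats = encoder(**inputs).last_hidden_state  # (B, Tv, 1152)
    image_tokens = projector(feats)              # (B, 256, D_llm)
print(feats.shape, image_tokens.shape)
```

If the printout shows 1152 as the encoder hidden size and 256 projected tokens at the LLM hidden size, the extracted weights line up with the shapes documented above.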