rnagabh committed on
Commit 3a8691c · verified · 1 Parent(s): 721fa09

Initial upload: Gemma 4 vision encoder (569.6M, 27-layer ViT with 2D RoPE)

Files changed (2)
  1. README.md +6 -6
  2. preprocessor_config.json +23 -0
README.md CHANGED
@@ -86,7 +86,7 @@ Unlike the audio encoder (which is identical across E2B and E4B), the vision enc
 
 ```python
 import torch
-from transformers import Gemma4VisionModel, AutoProcessor
+from transformers import Gemma4VisionModel, Gemma4ImageProcessor
 from PIL import Image
 
 # Load vision encoder directly from this repo
@@ -97,9 +97,8 @@ vision_model = Gemma4VisionModel.from_pretrained(
 vision_model.to("cuda")
 vision_model.eval()
 
-# Use the parent model's image processor for correct preprocessing
-processor = AutoProcessor.from_pretrained("google/gemma-4-31B-it")
-image_processor = processor.image_processor
+# Load image processor (saved in this repo)
+image_processor = Gemma4ImageProcessor.from_pretrained("rnagabh/gemma4-vision-encoder")
 
 # Process an image
 img = Image.open("your_image.jpg")
@@ -117,7 +116,7 @@ with torch.no_grad():
 image_embedding = embeddings.float().mean(dim=0) # (1152,)
 ```
 
-> **Important:** Always use `Gemma4ImageProcessor` from the parent model for preprocessing.
+> **Important:** Always use the `Gemma4ImageProcessor` included in this repo for preprocessing.
 > It handles resizing, patchification, position ID generation, and pixel normalization.
 > Manual patchification without this processor will produce significantly degraded results.
 
@@ -141,12 +140,13 @@ Strong performance across all classes: airplane (0.98 F1), ship (0.98 F1), truck
 |---|---|---|
 | `config.json` | Vision encoder config (Gemma4VisionConfig) | <1 KB |
 | `model.safetensors` | Vision encoder weights (569.6M params, BF16) | 1,139 MB |
+| `preprocessor_config.json` | Image processor config (Gemma4ImageProcessor) | <1 KB |
 | `embed_vision.safetensors` | Vision→text embedding projection (1152→5376) | 12.4 MB |
 
 ## Limitations
 
 - **End-to-end trained for LLM decoding:** The encoder was trained to produce features for Gemma 4's text decoder. The 1152-dim output is the pure vision representation; the `embed_vision` projection maps to the 31B's text hidden space (5376-dim).
-- **Requires parent model's image processor:** Use `Gemma4ImageProcessor` from `google/gemma-4-31B-it` for preprocessing. The model expects pre-patchified `(B, num_patches, 768)` tensors with explicit 2D position IDs — the processor handles this automatically.
+- **Requires image processor:** Use the `Gemma4ImageProcessor` included in this repo for preprocessing. The model expects pre-patchified `(B, num_patches, 768)` tensors with explicit 2D position IDs — the processor handles this automatically.
 - **Variable aspect ratio support:** The 2D position embeddings enable non-square images. The processor generates correct position IDs for any aspect ratio.
 - **Output shape note:** The pooler strips padding and collapses the batch dimension, returning `(num_valid_tokens, 1152)`. For batched inference, use `num_soft_tokens_per_image` from the processor to split the output back into per-image embeddings.
 
 
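The "Output shape note" in the README's Limitations section describes splitting the pooler's flat `(num_valid_tokens, 1152)` output back into per-image chunks using `num_soft_tokens_per_image`. A minimal sketch of that bookkeeping, assuming `num_soft_tokens_per_image` is a list of per-image token counts in batch order (function name is illustrative; on real tensors, `torch.split(pooled, num_soft_tokens_per_image, dim=0)` does the same thing):

```python
def split_per_image(pooled, num_soft_tokens_per_image):
    """Split a flat (num_valid_tokens, dim) sequence of pooled tokens
    into one chunk per image, in batch order.

    Works on any row-sliceable sequence; mirrors what
    torch.split(pooled, num_soft_tokens_per_image, dim=0) would do.
    """
    assert len(pooled) == sum(num_soft_tokens_per_image), "token count mismatch"
    chunks, start = [], 0
    for n in num_soft_tokens_per_image:
        chunks.append(pooled[start:start + n])
        start += n
    return chunks

# Toy example: pooled output for 2 images contributing 3 and 2 tokens (dim 4
# stands in for 1152). Each per-image chunk can then be mean-pooled to a
# single embedding, as in the README example.
pooled = [[float(i)] * 4 for i in range(5)]
per_image = split_per_image(pooled, [3, 2])
```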
preprocessor_config.json ADDED
@@ -0,0 +1,23 @@
+{
+  "do_convert_rgb": true,
+  "do_normalize": false,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.0,
+    0.0,
+    0.0
+  ],
+  "image_processor_type": "Gemma4ImageProcessor",
+  "image_seq_length": 280,
+  "image_std": [
+    1.0,
+    1.0,
+    1.0
+  ],
+  "max_soft_tokens": 280,
+  "patch_size": 16,
+  "pooling_kernel_size": 3,
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098
+}
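A few numbers in this config are worth decoding: `rescale_factor` is 1/255, and since `image_mean` is all zeros, `image_std` all ones, and `do_normalize` is false, the only pixel transform is scaling uint8 values into [0, 1]. `patch_size: 16` is also where the 768-dim patch vectors mentioned in the README's Limitations section come from (16 · 16 · 3). A quick check of that arithmetic (helper name is illustrative):

```python
# Values copied from preprocessor_config.json above.
rescale_factor = 0.00392156862745098
patch_size = 16

# rescale_factor is the shortest round-trip decimal repr of 1/255.
assert rescale_factor == 1 / 255

def to_model_pixel(v: int) -> float:
    """Map a uint8 pixel value to what the model sees.

    With image_mean = [0, 0, 0], image_std = [1, 1, 1] and
    do_normalize = false, rescaling to [0, 1] is the only transform.
    """
    return v * rescale_factor

# Each 16x16 RGB patch flattens to a 768-dim vector, matching the
# pre-patchified (B, num_patches, 768) input shape the encoder expects.
patch_dim = patch_size * patch_size * 3
```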