rnagabh committed on
Commit 3a8691c · verified · 1 Parent(s): 721fa09

Initial upload: Gemma 4 vision encoder (569.6M, 27-layer ViT with 2D RoPE)

Files changed (2)
  1. README.md +6 -6
  2. preprocessor_config.json +23 -0
README.md CHANGED
@@ -86,7 +86,7 @@ Unlike the audio encoder (which is identical across E2B and E4B), the vision enc
 
 ```python
 import torch
-from transformers import Gemma4VisionModel, AutoProcessor
+from transformers import Gemma4VisionModel, Gemma4ImageProcessor
 from PIL import Image
 
 # Load vision encoder directly from this repo
@@ -97,9 +97,8 @@ vision_model = Gemma4VisionModel.from_pretrained(
 vision_model.to("cuda")
 vision_model.eval()
 
-# Use the parent model's image processor for correct preprocessing
-processor = AutoProcessor.from_pretrained("google/gemma-4-31B-it")
-image_processor = processor.image_processor
+# Load image processor (saved in this repo)
+image_processor = Gemma4ImageProcessor.from_pretrained("rnagabh/gemma4-vision-encoder")
 
 # Process an image
 img = Image.open("your_image.jpg")
@@ -117,7 +116,7 @@ with torch.no_grad():
 image_embedding = embeddings.float().mean(dim=0) # (1152,)
 ```
 
-> **Important:** Always use `Gemma4ImageProcessor` from the parent model for preprocessing.
+> **Important:** Always use the `Gemma4ImageProcessor` included in this repo for preprocessing.
 > It handles resizing, patchification, position ID generation, and pixel normalization.
 > Manual patchification without this processor will produce significantly degraded results.
 
@@ -141,12 +140,13 @@ Strong performance across all classes: airplane (0.98 F1), ship (0.98 F1), truck
 |---|---|---|
 | `config.json` | Vision encoder config (Gemma4VisionConfig) | <1 KB |
 | `model.safetensors` | Vision encoder weights (569.6M params, BF16) | 1,139 MB |
+| `preprocessor_config.json` | Image processor config (Gemma4ImageProcessor) | <1 KB |
 | `embed_vision.safetensors` | Vision→text embedding projection (1152→5376) | 12.4 MB |
 
 ## Limitations
 
 - **End-to-end trained for LLM decoding:** The encoder was trained to produce features for Gemma 4's text decoder. The 1152-dim output is the pure vision representation; the `embed_vision` projection maps to the 31B's text hidden space (5376-dim).
-- **Requires parent model's image processor:** Use `Gemma4ImageProcessor` from `google/gemma-4-31B-it` for preprocessing. The model expects pre-patchified `(B, num_patches, 768)` tensors with explicit 2D position IDs — the processor handles this automatically.
+- **Requires image processor:** Use the `Gemma4ImageProcessor` included in this repo for preprocessing. The model expects pre-patchified `(B, num_patches, 768)` tensors with explicit 2D position IDs — the processor handles this automatically.
 - **Variable aspect ratio support:** The 2D position embeddings enable non-square images. The processor generates correct position IDs for any aspect ratio.
 - **Output shape note:** The pooler strips padding and collapses the batch dimension, returning `(num_valid_tokens, 1152)`. For batched inference, use `num_soft_tokens_per_image` from the processor to split the output back into per-image embeddings.
 
 
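The "Output shape note" in the README's Limitations section describes splitting the pooler's flat `(num_valid_tokens, 1152)` output back into per-image chunks using `num_soft_tokens_per_image`. A minimal sketch of that bookkeeping, assuming `num_soft_tokens_per_image` is a list of per-image token counts in batch order (function name is illustrative; on real tensors, `torch.split(pooled, num_soft_tokens_per_image, dim=0)` does the same thing):

```python
def split_per_image(pooled, num_soft_tokens_per_image):
    """Split a flat (num_valid_tokens, dim) sequence of pooled tokens
    into one chunk per image, in batch order.

    Works on any row-sliceable sequence; mirrors what
    torch.split(pooled, num_soft_tokens_per_image, dim=0) would do.
    """
    assert len(pooled) == sum(num_soft_tokens_per_image), "token count mismatch"
    chunks, start = [], 0
    for n in num_soft_tokens_per_image:
        chunks.append(pooled[start:start + n])
        start += n
    return chunks

# Toy example: pooled output for 2 images contributing 3 and 2 tokens (dim 4
# stands in for 1152). Each per-image chunk can then be mean-pooled to a
# single embedding, as in the README example.
pooled = [[float(i)] * 4 for i in range(5)]
per_image = split_per_image(pooled, [3, 2])
```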
preprocessor_config.json ADDED
@@ -0,0 +1,23 @@
+{
+  "do_convert_rgb": true,
+  "do_normalize": false,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.0,
+    0.0,
+    0.0
+  ],
+  "image_processor_type": "Gemma4ImageProcessor",
+  "image_seq_length": 280,
+  "image_std": [
+    1.0,
+    1.0,
+    1.0
+  ],
+  "max_soft_tokens": 280,
+  "patch_size": 16,
+  "pooling_kernel_size": 3,
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098
+}
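A few numbers in this config are worth decoding: `rescale_factor` is 1/255, and since `image_mean` is all zeros, `image_std` all ones, and `do_normalize` is false, the only pixel transform is scaling uint8 values into [0, 1]. `patch_size: 16` is also where the 768-dim patch vectors mentioned in the README's Limitations section come from (16 · 16 · 3). A quick check of that arithmetic (helper name is illustrative):

```python
# Values copied from preprocessor_config.json above.
rescale_factor = 0.00392156862745098
patch_size = 16

# rescale_factor is the shortest round-trip decimal repr of 1/255.
assert rescale_factor == 1 / 255

def to_model_pixel(v: int) -> float:
    """Map a uint8 pixel value to what the model sees.

    With image_mean = [0, 0, 0], image_std = [1, 1, 1] and
    do_normalize = false, rescaling to [0, 1] is the only transform.
    """
    return v * rescale_factor

# Each 16x16 RGB patch flattens to a 768-dim vector, matching the
# pre-patchified (B, num_patches, 768) input shape the encoder expects.
patch_dim = patch_size * patch_size * 3
```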