Initial upload: Gemma 4 vision encoder (569.6M, 27-layer ViT with 2D RoPE)
- README.md +6 -6
- preprocessor_config.json +23 -0
README.md
CHANGED

````diff
@@ -86,7 +86,7 @@ Unlike the audio encoder (which is identical across E2B and E4B), the vision enc
 
 ```python
 import torch
-from transformers import Gemma4VisionModel,
+from transformers import Gemma4VisionModel, Gemma4ImageProcessor
 from PIL import Image
 
 # Load vision encoder directly from this repo
@@ -97,9 +97,8 @@ vision_model = Gemma4VisionModel.from_pretrained(
 vision_model.to("cuda")
 vision_model.eval()
 
-#
-
-image_processor = processor.image_processor
+# Load image processor (saved in this repo)
+image_processor = Gemma4ImageProcessor.from_pretrained("rnagabh/gemma4-vision-encoder")
 
 # Process an image
 img = Image.open("your_image.jpg")
@@ -117,7 +116,7 @@ with torch.no_grad():
 image_embedding = embeddings.float().mean(dim=0)  # (1152,)
 ```
 
-> **Important:** Always use `Gemma4ImageProcessor`
+> **Important:** Always use the `Gemma4ImageProcessor` included in this repo for preprocessing.
 > It handles resizing, patchification, position ID generation, and pixel normalization.
 > Manual patchification without this processor will produce significantly degraded results.
 
@@ -141,12 +140,13 @@ Strong performance across all classes: airplane (0.98 F1), ship (0.98 F1), truck
 |---|---|---|
 | `config.json` | Vision encoder config (Gemma4VisionConfig) | <1 KB |
 | `model.safetensors` | Vision encoder weights (569.6M params, BF16) | 1,139 MB |
+| `preprocessor_config.json` | Image processor config (Gemma4ImageProcessor) | <1 KB |
 | `embed_vision.safetensors` | Vision→text embedding projection (1152→5376) | 12.4 MB |
 
 ## Limitations
 
 - **End-to-end trained for LLM decoding:** The encoder was trained to produce features for Gemma 4's text decoder. The 1152-dim output is the pure vision representation; the `embed_vision` projection maps to the 31B's text hidden space (5376-dim).
-- **Requires
+- **Requires image processor:** Use the `Gemma4ImageProcessor` included in this repo for preprocessing. The model expects pre-patchified `(B, num_patches, 768)` tensors with explicit 2D position IDs — the processor handles this automatically.
 - **Variable aspect ratio support:** The 2D position embeddings enable non-square images. The processor generates correct position IDs for any aspect ratio.
 - **Output shape note:** The pooler strips padding and collapses the batch dimension, returning `(num_valid_tokens, 1152)`. For batched inference, use `num_soft_tokens_per_image` from the processor to split the output back into per-image embeddings.
 
````
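The README's "Output shape note" (splitting the flat `(num_valid_tokens, 1152)` pooler output back into per-image embeddings) can be sketched with plain `torch.split`. The token counts and random tensor below are illustrative stand-ins, not output from a real processor run:

```python
import torch

# Pooled encoder output: padding stripped, batch collapsed -> (num_valid_tokens, 1152).
# Hypothetical per-image soft-token counts, as the processor would report them.
num_soft_tokens_per_image = [280, 196]
embeddings = torch.randn(sum(num_soft_tokens_per_image), 1152)

# Split the flat token stream back into one embedding matrix per image.
per_image = torch.split(embeddings, num_soft_tokens_per_image, dim=0)
print([tuple(t.shape) for t in per_image])  # [(280, 1152), (196, 1152)]

# One vector per image, matching the README's mean-pooling step.
image_vectors = [t.float().mean(dim=0) for t in per_image]
print(image_vectors[0].shape)  # torch.Size([1152])
```

`torch.split` with a list of sizes consumes the tensor along `dim=0` in order, so the splits line up with the images exactly as the processor emitted them.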
preprocessor_config.json
ADDED

```diff
@@ -0,0 +1,23 @@
+{
+  "do_convert_rgb": true,
+  "do_normalize": false,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.0,
+    0.0,
+    0.0
+  ],
+  "image_processor_type": "Gemma4ImageProcessor",
+  "image_seq_length": 280,
+  "image_std": [
+    1.0,
+    1.0,
+    1.0
+  ],
+  "max_soft_tokens": 280,
+  "patch_size": 16,
+  "pooling_kernel_size": 3,
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098
+}
```
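Assuming these fields follow the usual Hugging Face image-processor semantics (rescale, then optional mean/std normalization), the config implies that pixels are only mapped from uint8 into [0, 1] with no further normalization, and that `patch_size: 16` matches the `(B, num_patches, 768)` input shape mentioned in the README's Limitations section. A quick arithmetic check:

```python
# Field values copied from preprocessor_config.json above.
rescale_factor = 0.00392156862745098  # 1/255
image_mean, image_std = 0.0, 1.0      # identity normalization (do_normalize is false)
patch_size = 16

# do_rescale maps uint8 pixels into [0, 1]; with mean 0 / std 1 nothing else changes.
pixel = 255
value = (pixel * rescale_factor - image_mean) / image_std
print(abs(value - 1.0) < 1e-9)  # True

# A 16x16 RGB patch flattens to 16*16*3 = 768 values per patch.
flat_patch_dim = patch_size * patch_size * 3
print(flat_patch_dim)  # 768
```

The identity mean/std plus `do_normalize: false` suggests any further normalization happens inside the encoder itself rather than in the processor.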