Initial upload: Gemma 4 vision encoder (569.6M, 27-layer ViT with 2D RoPE)
- README.md +153 -0
- config.json +49 -0
- embed_vision.safetensors +3 -0
- model.safetensors +3 -0
README.md
ADDED
@@ -0,0 +1,153 @@
---
language:
- en
- multilingual
license: apache-2.0
library_name: transformers
tags:
- feature-extraction
- image-feature-extraction
- vision
- vit
- gemma4
- google
- safetensors
pipeline_tag: image-feature-extraction
base_model: google/gemma-4-31B-it
model-index:
- name: gemma4-vision-encoder
  results: []
---

# Gemma 4 Vision Encoder (27-layer ViT with 2D RoPE)

Standalone extraction of the vision encoder from Google's [Gemma 4 31B](https://huggingface.co/google/gemma-4-31B-it) multimodal model. It is a 569.6M-parameter Vision Transformer with a learned 2D positional-embedding table, 2D RoPE, QK-norms, and a gated MLP, a significant upgrade over the SigLIP encoder used in Gemma 3.

**License:** Apache 2.0 (inherited from Gemma 4; no restrictions)

## Architecture

| Property | Value |
|---|---|
| Total parameters | 569.6M |
| Architecture | ViT with 2D RoPE + learned positional embeddings |
| Hidden dimension | 1152 |
| Encoder layers | 27 |
| Attention heads | 16 (72 dims per head) |
| KV heads | 16 (full MHA, no GQA) |
| MLP | Gated (gate_proj + up_proj + down_proj) |
| MLP intermediate size | 4304 |
| Activation | GELU (`gelu_pytorch_tanh` variant) |
| Normalization | RMSNorm (eps=1e-6) |
| Patch size | 16×16 |
| Pooling | 3×3 kernel (reduces token count by 9×) |
| Position embeddings | Learned 2D table (2, 10240, 1152) + RoPE (theta=100) |
| Q/K norms | Yes |
| Default output tokens | 280 |
| Configurable token budgets | 70, 140, 280, 560, 1120 |
| Input | Pre-patchified: `(batch, num_patches, 768)`, where 768 = 3×16×16 |
| Output | `(num_valid_tokens, 1152)` after pooling + standardization |

### What's New vs Gemma 3 (SigLIP)

| | Gemma 3 Vision | Gemma 4 Vision (this model) |
|---|---|---|
| Architecture | SigLIP (ViT-SO400M) | Custom ViT with 2D RoPE |
| Layers | 27 | 27 |
| Hidden dim | 1152 | 1152 |
| Position encoding | Learned 1D | **Learned 2D + RoPE** |
| Attention | Standard | **QK-normed** |
| MLP | Standard (fc1 + fc2) | **Gated (gate + up + down)** |
| Aspect ratio | Fixed square (896×896) | **Variable aspect ratio** |
| Token budget | Fixed 256 | **Configurable (70–1120)** |
| Pooling | 4×4 average | **3×3** |

### Not Shared with E2B/E4B

Unlike the audio encoder (which is identical across E2B and E4B), the vision encoders differ:

| | E2B/E4B | 31B (this extraction) |
|---|---|---|
| Layers | 16 | **27** |
| Parameters | ~340M | **569.6M** |

## Usage

```python
import torch
from transformers import Gemma4VisionModel, Gemma4VisionConfig
from safetensors.torch import load_file

# Load the vision encoder from this repo
cfg = Gemma4VisionConfig.from_pretrained("rnagabh/gemma4-vision-encoder")
vision_model = Gemma4VisionModel(cfg)
state_dict = load_file("path/to/model.safetensors")  # or download from this repo
vision_model.load_state_dict(state_dict, strict=True)
vision_model = vision_model.to(dtype=torch.bfloat16, device="cuda")
vision_model.eval()

# Prepare an image: patchify and create position IDs.
# Image sides must be divisible by patch_size (16) AND
# the number of patches must be divisible by pooling_kernel^2 (9).
# Good sizes: 864 (54 patches/side), 768 (48), 576 (36)
P = 16
img_size = 864
patches_per_side = img_size // P  # 54

# Patchify: (B, C, H, W) → (B, num_patches, C*P*P)
img = torch.randn(1, 3, img_size, img_size, dtype=torch.bfloat16, device="cuda")
patches = img.unfold(2, P, P).unfold(3, P, P)        # (1, 3, 54, 54, 16, 16)
patches = patches.contiguous().view(1, 3, -1, P, P)  # (1, 3, 2916, 16, 16)
patches = patches.permute(0, 2, 1, 3, 4)             # (1, 2916, 3, 16, 16)
patches = patches.reshape(1, -1, 3 * P * P)          # (1, 2916, 768)

# Position IDs: (batch, num_patches, 2) as (x, y) coordinates
ys, xs = torch.meshgrid(
    torch.arange(patches_per_side),
    torch.arange(patches_per_side),
    indexing="ij",
)
position_ids = torch.stack([xs.flatten(), ys.flatten()], dim=-1)
position_ids = position_ids.unsqueeze(0).to(device="cuda")  # (1, 2916, 2)

with torch.no_grad():
    output = vision_model(pixel_values=patches, pixel_position_ids=position_ids)
embeddings = output.last_hidden_state  # (324, 1152): pooled tokens
```

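If the weights are not on disk yet, one way to fetch them from the Hub, using the same repo id as in the example above:

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download model.safetensors from this repo, then load it as shown above.
weights_path = hf_hub_download("rnagabh/gemma4-vision-encoder", "model.safetensors")
state_dict = load_file(weights_path)
```
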
> **Image size constraints:** After patchifying, the number of patches along each image dimension must be divisible by the pooling kernel size (3).
> Since the patch size is 16, each image side must be a multiple of 16 × 3 = 48.
> Valid side lengths include 576, 768, 864, 960, 1152, etc.

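A quick way to check a candidate size and see how many pooled tokens it yields (a minimal sketch; these helper names are illustrative and not part of this repo):

```python
def is_valid_side(side: int, patch: int = 16, pool: int = 3) -> bool:
    # A side is usable when it splits into whole patches and the patch grid
    # splits into whole pooling windows, i.e. side is a multiple of patch * pool.
    return side % (patch * pool) == 0

def pooled_token_count(height: int, width: int, patch: int = 16, pool: int = 3) -> int:
    # Pooled tokens = (patch rows / pool) * (patch cols / pool)
    assert is_valid_side(height) and is_valid_side(width)
    return (height // patch // pool) * (width // patch // pool)

print(is_valid_side(864))             # True
print(pooled_token_count(864, 864))   # 324
print(pooled_token_count(768, 576))   # 16 * 12 = 192
```
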
> **Output shape:** The batch dimension is collapsed: the pooler strips padding and returns
> a flat `(num_valid_tokens, hidden_dim)` tensor. For a single 864×864 image you get
> `(324, 1152)`, i.e. (54/3)² = 324 pooled visual tokens at 1152 dimensions.

## Files in This Repo

| File | Description | Size |
|---|---|---|
| `config.json` | Vision encoder config (`Gemma4VisionConfig`) | <1 KB |
| `model.safetensors` | Vision encoder weights (569.6M params, BF16) | 1,139 MB |
| `embed_vision.safetensors` | Vision→text embedding projection (1152→5376) | 12.4 MB |

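To map encoder outputs into the 31B text decoder's hidden space, `embed_vision.safetensors` holds the 1152→5376 projection. A minimal sketch of applying it; the tensor key and weight orientation below are assumptions, so inspect the file's keys before relying on this:

```python
import torch
from safetensors.torch import load_file

proj_sd = load_file("embed_vision.safetensors")
print(list(proj_sd.keys()))  # inspect the actual key name(s) first

# ASSUMPTION: a single weight matrix mapping 1152-dim vision features to the
# 5376-dim text hidden size; transpose if the stored shape is (5376, 1152).
w = next(iter(proj_sd.values())).to(torch.float32)
if w.shape == (5376, 1152):
    w = w.T  # make it (1152, 5376)

vision_feats = torch.randn(324, 1152)  # e.g. the encoder output from above
text_space = vision_feats @ w          # (324, 5376)
print(text_space.shape)
```
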
## Limitations

- **Trained end-to-end for LLM decoding:** The encoder was trained to produce features for Gemma 4's text decoder. The 1152-dim output is the pure vision representation; the `embed_vision` projection maps it into the 31B model's text hidden space (5376-dim).
- **Requires pre-patchified input:** Unlike standard ViT models that accept raw `(B, C, H, W)` images, this model expects pre-patchified `(B, num_patches, 768)` tensors with explicit position IDs.
- **Variable aspect ratio support:** The 2D position embeddings allow non-square images, but you must supply correct `pixel_position_ids` for every patch (see the sketch after this list).
- **No built-in image preprocessing:** You need to handle resizing, normalization (the model applies `2*(x-0.5)` internally), and patchification yourself, or use the parent model's processor (see the sketch after this list).

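As a concrete illustration of the last two points, here is a minimal preprocessing sketch for a non-square image. It follows the patch layout from the usage example above; the resize target, PIL-based loading, and helper name are illustrative assumptions, not the parent processor's exact pipeline:

```python
import torch
from PIL import Image

P, POOL = 16, 3  # patch size and pooling kernel

def preprocess(path: str, target_hw=(576, 768)):
    """Resize to a valid (multiple-of-48) size, scale to [0, 1], patchify,
    and build (x, y) position IDs for a possibly non-square image."""
    h, w = target_hw
    assert h % (P * POOL) == 0 and w % (P * POOL) == 0
    img = Image.open(path).convert("RGB").resize((w, h))  # PIL takes (width, height)
    x = torch.tensor(list(img.getdata()), dtype=torch.float32).view(h, w, 3)
    x = (x / 255.0).permute(2, 0, 1).unsqueeze(0)  # (1, 3, H, W) in [0, 1]
    # The model rescales to [-1, 1] internally via 2*(x - 0.5).

    # Patchify exactly as in the usage example above.
    patches = x.unfold(2, P, P).unfold(3, P, P)
    patches = patches.contiguous().view(1, 3, -1, P, P).permute(0, 2, 1, 3, 4)
    patches = patches.reshape(1, -1, 3 * P * P)  # (1, (H/16)*(W/16), 768)

    # Per-patch (x, y) coordinates on the rectangular patch grid.
    ys, xs = torch.meshgrid(torch.arange(h // P), torch.arange(w // P), indexing="ij")
    position_ids = torch.stack([xs.flatten(), ys.flatten()], dim=-1).unsqueeze(0)
    # Cast/move to the model's dtype and device before the forward pass.
    return patches, position_ids
```
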
## Extraction Details

- Extracted from `google/gemma-4-31B-it` by downloading only the shard containing the vision tower weights (`model-00001-of-00002.safetensors`)
- No full model load required: targeted tensor extraction (a sketch of the approach follows this list)
- Weights load with `strict=True`, a perfect key match
- Forward pass verified: 864×864 image → (324, 1152) output
- All architecture specs verified against the live model config

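For reference, a minimal sketch of that targeted extraction, assuming the vision tower tensors share a common key prefix inside the shard (the prefix below is an assumption; list the keys first to confirm it):

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file, save_file

# Download only the shard that holds the vision tower (no full model load).
shard = hf_hub_download("google/gemma-4-31B-it", "model-00001-of-00002.safetensors")
tensors = load_file(shard)

# ASSUMPTION: vision tower keys share a common prefix such as "model.vision_tower.".
# Print a few keys to find the real prefix before filtering.
prefix = "model.vision_tower."
vision = {k[len(prefix):]: v for k, v in tensors.items() if k.startswith(prefix)}

print(sum(v.numel() for v in vision.values()))  # expect roughly 569.6M
save_file(vision, "model.safetensors")
```
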
## References

- [Gemma 4 on Hugging Face](https://huggingface.co/google/gemma-4-31B-it)
- [Gemma 4 Blog Post](https://huggingface.co/blog/gemma4)
- [Gemma 4 Architecture Comparison](https://g4.si5.pl/)
config.json
ADDED
@@ -0,0 +1,49 @@
```json
{
  "_name_or_path": "",
  "architectures": [
    "Gemma4VisionModel"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "chunk_size_feed_forward": 0,
  "default_output_length": 280,
  "dtype": "bfloat16",
  "global_head_dim": 72,
  "head_dim": 72,
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 1152,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4304,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "max_position_embeddings": 131072,
  "model_type": "gemma4_vision",
  "num_attention_heads": 16,
  "num_hidden_layers": 27,
  "num_key_value_heads": 16,
  "output_attentions": false,
  "output_hidden_states": false,
  "patch_size": 16,
  "pooling_kernel_size": 3,
  "position_embedding_size": 10240,
  "problem_type": null,
  "return_dict": true,
  "rms_norm_eps": 1e-06,
  "rope_parameters": {
    "rope_theta": 100.0,
    "rope_type": "default"
  },
  "standardize": true,
  "use_clipped_linears": false,
  "torch_dtype": "bfloat16",
  "_source_model": "google/gemma-4-31B-it",
  "_extraction_note": "Vision tower extracted from 31B model",
  "_verified_total_params": 569550384
}
```
embed_vision.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7de065bb74191a84c11c7a436eba29278ae1934b92c386f6fcfa515b2d5c0c7f
size 12386408
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:408421234ecaab33c9b641efe86d369d8398252d812c23faccfbd1cb1d744ccb
size 1139143360