MultiGemma3-270m: Multimodal Vision-Language Model

A Gemma-3-based multimodal model that combines vision and language understanding. Its vision side is designed to accept any TIMM vision encoder.

Architecture Overview

MultiGemma3 is an experimental multimodal model that extends Google's Gemma-3-270m (the instruction-tuned Gemma-3-270m-it checkpoint) with vision capabilities. It shows how to build a multimodal model by connecting a vision encoder from the TIMM library to a language model through a trainable projector.

Key Components

1. VisionEncoder (multigemma3.py:11-30)

  • Wraps any TIMM vision model (default: TinyViT-21M)
  • Frozen pretrained weights for stability
  • Outputs vision features of configurable dimension
  • Compatible with any TIMM architecture
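The frozen-encoder idea can be sketched as follows. This is a minimal illustration, not the code in multigemma3.py: the class name `FrozenVisionEncoder`, the helper `build_timm_encoder`, and the model name `tiny_vit_21m_224` are assumptions chosen for the example.

```python
import torch
import torch.nn as nn


class FrozenVisionEncoder(nn.Module):
    """Hypothetical wrapper: holds a feature backbone and keeps it frozen."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone
        # Freeze every backbone weight so the optimizer never updates it.
        for param in self.backbone.parameters():
            param.requires_grad = False
        self.backbone.eval()

    def train(self, mode: bool = True):
        # Keep the backbone in eval mode (fixed dropout / norm statistics)
        # even when the surrounding model is switched to training mode.
        super().train(mode)
        self.backbone.eval()
        return self

    @torch.no_grad()
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.backbone(images)


def build_timm_encoder(model_name: str = "tiny_vit_21m_224", pretrained: bool = True):
    """Assumed factory: num_classes=0 asks TIMM for pooled features, no classifier."""
    import timm  # imported lazily so the wrapper itself only needs torch

    backbone = timm.create_model(model_name, pretrained=pretrained, num_classes=0)
    return FrozenVisionEncoder(backbone), backbone.num_features
```

Swapping the encoder then amounts to passing a different TIMM model name; the returned feature size is what the projector needs as its input dimension.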

2. VisionProjector (multigemma3.py:33-47)

  • Maps vision features to language model embedding space
  • Architecture: Linear → GELU → Dropout → Linear → LayerNorm
  • Trainable component that learns vision-language alignment
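The layer sequence listed above can be written as a small PyTorch module. This is a sketch under assumptions: the class name, the hidden width, the dropout rate, and the example dimensions (576 and 640) are illustrative and are not taken from multigemma3.py.

```python
import torch
import torch.nn as nn


class VisionProjectorSketch(nn.Module):
    """Linear -> GELU -> Dropout -> Linear -> LayerNorm, as listed above."""

    def __init__(self, vision_dim: int, text_dim: int, hidden_dim=None, dropout: float = 0.1):
        super().__init__()
        hidden_dim = hidden_dim or text_dim  # assumed default for the hidden width
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, text_dim),
            nn.LayerNorm(text_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, vision_dim) -> (batch, text_dim)
        return self.net(vision_features)


# Illustrative sizes only: a 576-d vision feature mapped to a 640-d embedding.
projector = VisionProjectorSketch(vision_dim=576, text_dim=640)
projected = projector(torch.randn(2, 576))
print(projected.shape)
```

The final LayerNorm keeps the projected features on a scale comparable to normalized embeddings, which is one plausible reason for ending the stack with it.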

3. MultimodalGemma3 (multigemma3.py:50-188)

  • Core multimodal model combining Gemma-3 with vision
  • Handles both vision+text and text-only generation
  • Replaces the embeddings of special "IMG" placeholder tokens in the prompt with projected vision features
  • Supports PEFT/LoRA fine-tuning
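The token replacement step can be illustrated with a small tensor operation. This is a sketch of one way to do it, assuming one image vector per sample and a hypothetical `image_token_id`; the actual logic in multigemma3.py may differ.

```python
import torch


def splice_image_features(token_embeds, input_ids, image_embeds, image_token_id):
    """Overwrite embeddings at image-token positions with vision features.

    token_embeds: (batch, seq_len, dim) text embeddings from the language model
    input_ids:    (batch, seq_len) token ids containing the image placeholder
    image_embeds: (batch, dim) projected vision features, one vector per sample
    """
    mask = input_ids == image_token_id                       # (batch, seq_len)
    expanded = image_embeds.unsqueeze(1).expand_as(token_embeds)
    expanded = expanded.to(token_embeds.dtype)
    # Where the mask is True take the image vector, elsewhere keep the text embedding.
    return torch.where(mask.unsqueeze(-1), expanded, token_embeds)


# Toy example: id 99 stands in for the "IMG" placeholder (hypothetical id).
ids = torch.tensor([[5, 99, 7]])
text = torch.zeros(1, 3, 4)
image = torch.ones(1, 4)
merged = splice_image_features(text, ids, image, image_token_id=99)
print(merged[0, 1])  # the placeholder position now holds the image vector
```

The merged tensor would then be passed to the language model as input embeddings instead of token ids. A prompt without the placeholder leaves the embeddings unchanged, which matches the text-only path.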

Usage

Training

python multigemma3trainer.py \
  --batch_size 4 \
  --lr 1e-4 \
  --epochs 3 \
  --train_samples 5000 \
  --test_samples 1000
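For readers adapting the trainer, the flags in the command above could be parsed roughly like this. The sketch is hypothetical: the argument types and the defaults (which simply mirror the example command) are assumptions, not the actual contents of multigemma3trainer.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names come from the command above; types and defaults are assumed.
    parser = argparse.ArgumentParser(description="Hypothetical trainer CLI sketch")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_samples", type=int, default=5000)
    parser.add_argument("--test_samples", type=int, default=1000)
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--batch_size", "8", "--lr", "3e-4"])
print(args)
```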

Inference

from inference_example import MultimodalGemma3Inference

# Initialize model
model = MultimodalGemma3Inference(device='cuda')

# Process image with text
response = model.predict("path/to/image.jpg", prompt="IMG", max_new_tokens=10)
print(response)

# Text-only generation
text_response = model.generate_text("Hello, how are you?", max_new_tokens=50)
print(text_response)

Command Line Usage

python inference_example.py path/to/image.jpg --prompt "IMG" --max_tokens 10

Technical Details

  • Base Model: Google Gemma-3-270m-it
  • Vision Encoder: TinyViT-21M (configurable to any TIMM model)
  • Training Strategy: LoRA/PEFT fine-tuning with frozen vision encoder
  • Precision: bfloat16 for efficient training and inference
  • Dataset: CIFAR-10 (demonstration dataset)

Architecture Benefits

  1. Modularity: Vision encoder can be swapped with any TIMM model
  2. Efficiency: Only the projector and LoRA weights are trained; the vision encoder stays frozen
  3. Flexibility: Supports both multimodal and text-only inference
  4. Scalability: Can be extended to larger vision encoders and language models
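The efficiency point can be checked by counting which parameters remain trainable. The helper below is a sketch: it selects parameters by name, assuming the projector's parameter names contain "projector" and that LoRA adapter weights contain "lora_" (the naming PEFT uses for its adapter matrices). The `ToyModel` class is purely illustrative.

```python
import torch.nn as nn


def mark_trainable(model: nn.Module, trainable_keywords=("projector", "lora_")):
    """Freeze everything except parameters whose name matches a keyword.

    Returns (trainable_param_count, total_param_count).
    """
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keywords)
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable, total


class ToyModel(nn.Module):
    """Stand-in with a 'vision' part to freeze and a 'projector' part to train."""

    def __init__(self):
        super().__init__()
        self.vision = nn.Linear(4, 4)      # 4*4 weights + 4 biases = 20 parameters
        self.projector = nn.Linear(4, 2)   # 4*2 weights + 2 biases = 10 parameters


trainable, total = mark_trainable(ToyModel())
print(f"trainable: {trainable} / {total}")  # trainable: 10 / 30
```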

Files

  • multigemma3.py: Core model components and architecture
  • multigemma3trainer.py: Training script with CIFAR-10 example
  • inference_example.py: Inference wrapper (MultimodalGemma3Inference) and command-line entry point
  • config.json, *.safetensors: Saved model weights and configuration