MultiGemma3-270m: Multimodal Vision-Language Model

A Gemma-3-based multimodal model that combines vision and language understanding through a flexible architecture that accepts any TIMM vision encoder.

Architecture Overview

MultiGemma3 is an experimental multimodal model that extends Google's Gemma-3-270m with vision capabilities. The architecture demonstrates how to create multimodal models by integrating any vision encoder from the TIMM library with a language model.

Key Components

1. VisionEncoder (`multigemma3.py:11-30`)

Wraps any TIMM vision model (default: TinyViT-21M), so the encoder can be swapped for another TIMM architecture
Keeps the pretrained weights frozen for training stability
Outputs vision features of configurable dimension
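
The frozen-encoder idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in `multigemma3.py`: the class name `FrozenVisionEncoder`, the helper `build_timm_backbone`, and the model name `tiny_vit_21m_224` are assumptions made for this example, and a toy backbone stands in for the TIMM model so the snippet runs without downloading weights.

```python
import torch
import torch.nn as nn


class FrozenVisionEncoder(nn.Module):
    """Wraps a feature-extracting backbone and keeps its weights frozen."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone
        # Freeze every backbone parameter so only downstream layers train.
        for param in self.backbone.parameters():
            param.requires_grad = False
        self.backbone.eval()

    def train(self, mode: bool = True):
        # Keep the frozen backbone in eval mode even when the parent trains,
        # so dropout and normalization statistics stay fixed.
        super().train(mode)
        self.backbone.eval()
        return self

    @torch.no_grad()
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # The backbone is expected to return pooled features [batch, feature_dim].
        return self.backbone(images)


def build_timm_backbone(name: str = "tiny_vit_21m_224") -> nn.Module:
    # Assumed model name. num_classes=0 asks timm for pooled features
    # instead of classification logits.
    import timm  # imported lazily so the wrapper itself only needs torch

    return timm.create_model(name, pretrained=True, num_classes=0)


# Toy stand-in backbone (hypothetical) for a quick shape check.
toy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16))
encoder = FrozenVisionEncoder(toy_backbone)
features = encoder(torch.randn(2, 3, 8, 8))
```

With a real TIMM model, `FrozenVisionEncoder(build_timm_backbone())` would take the place of the toy backbone, and the feature dimension would follow from the chosen architecture.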

2. VisionProjector (`multigemma3.py:33-47`)

Maps vision features to language model embedding space
Architecture: Linear → GELU → Dropout → Linear → LayerNorm
Trainable component that learns vision-language alignment
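
The layer sequence listed above maps directly onto a small PyTorch module. The sketch below follows that order. The class name, the hidden width, the dropout rate, and the input and output dimensions (576 and 640) are illustrative assumptions and are not taken from the repository.

```python
import torch
import torch.nn as nn


class VisionProjectorSketch(nn.Module):
    """Linear -> GELU -> Dropout -> Linear -> LayerNorm, as listed above."""

    def __init__(self, vision_dim: int, text_dim: int,
                 hidden_dim: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),  # expand vision features
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, text_dim),    # map to the text embedding width
            nn.LayerNorm(text_dim),             # normalize the projected features
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        return self.net(vision_features)


# Assumed dimensions, for illustration only.
projector = VisionProjectorSketch(vision_dim=576, text_dim=640)
projected = projector(torch.randn(2, 576))  # shape: [2, 640]
```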

3. MultimodalGemma3 (`multigemma3.py:50-188`)

Core multimodal model combining Gemma-3 with vision
Handles both vision+text and text-only generation
Replaces special "IMG" tokens with projected vision features
Supports PEFT/LoRA fine-tuning
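
One way to picture the "IMG" token replacement is as a masked overwrite of the text embeddings. The sketch below assumes one projected vision vector per sample and a hypothetical token id for "IMG". The function name `splice_vision_features` is invented for this example, and the actual model may handle multiple image tokens differently.

```python
import torch


def splice_vision_features(input_ids, text_embeds, vision_embeds, img_token_id):
    """Replace embeddings at IMG token positions with the sample's vision vector.

    input_ids:     [batch, seq_len] token ids
    text_embeds:   [batch, seq_len, dim] embeddings from the language model
    vision_embeds: [batch, dim] one projected vision vector per sample
    """
    is_img = (input_ids == img_token_id).unsqueeze(-1)           # [batch, seq_len, 1]
    vision = vision_embeds.unsqueeze(1).expand_as(text_embeds)   # [batch, seq_len, dim]
    # Take the vision vector where the token is IMG, the text embedding elsewhere.
    return torch.where(is_img, vision.to(text_embeds.dtype), text_embeds)


IMG_ID = 99  # hypothetical token id for "IMG"
input_ids = torch.tensor([[5, IMG_ID, 7], [IMG_ID, 3, 4]])
text_embeds = torch.zeros(2, 3, 4)
vision_embeds = torch.ones(2, 4)
merged = splice_vision_features(input_ids, text_embeds, vision_embeds, IMG_ID)
```

The merged tensor has the same shape as the text embeddings, so it could be handed to a Hugging Face causal language model through its `inputs_embeds` argument in place of `input_ids`.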

Usage

Training

python multigemma3trainer.py \
  --batch_size 4 \
  --lr 1e-4 \
  --epochs 3 \
  --train_samples 5000 \
  --test_samples 1000
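
For readers who want to see how these flags might be consumed, here is a small `argparse` sketch. Only the flag names come from the command above. The defaults and the parser itself are assumptions, and the real `multigemma3trainer.py` may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the flags used in the training command.
    parser = argparse.ArgumentParser(description="Trainer CLI sketch")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_samples", type=int, default=5000)
    parser.add_argument("--test_samples", type=int, default=1000)
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--batch_size", "8", "--lr", "5e-5"])
```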

Inference

from inference_example import MultimodalGemma3Inference

# Initialize model
model = MultimodalGemma3Inference(device='cuda')

# Process image with text
response = model.predict("path/to/image.jpg", prompt="IMG", max_new_tokens=10)
print(response)

# Text-only generation
text_response = model.generate_text("Hello, how are you?", max_new_tokens=50)
print(text_response)

Command Line Usage

python inference_example.py path/to/image.jpg --prompt "IMG" --max_tokens 10

Technical Details

Base Model: Google Gemma-3-270m-it
Vision Encoder: TinyViT-21M (configurable to any TIMM model)
Training Strategy: LoRA/PEFT fine-tuning with frozen vision encoder
Precision: bfloat16 for efficient training and inference
Dataset: CIFAR-10 (demonstration dataset)
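
To illustrate why LoRA keeps the trainable parameter count small, the sketch below hand-rolls a single LoRA-style linear layer in plain PyTorch. It is not the PEFT library's implementation, which the training strategy above refers to. The class name, rank, scaling factor, and layer width are assumptions chosen for the example.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # pretrained weights stay fixed
        # Low-rank factors: A is small random, B starts at zero so the
        # adapted layer initially matches the base layer exactly.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        update = x @ self.lora_a.T @ self.lora_b.T
        return self.base(x) + self.scale * update


def count_parameters(module: nn.Module):
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    total = sum(p.numel() for p in module.parameters())
    return trainable, total


layer = LoRALinear(nn.Linear(640, 640), rank=8)
trainable, total = count_parameters(layer)
```

In this toy layer, only the two low-rank factors receive gradients, which is a small fraction of the layer's total parameters.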

Architecture Benefits

Modularity: Vision encoder can be swapped with any TIMM model
Efficiency: Only projector and LoRA weights are trained
Flexibility: Supports both multimodal and text-only inference
Scalability: Can be extended to larger vision encoders and language models

Files

multigemma3.py: Core model components and architecture
multigemma3trainer.py: Training script with CIFAR-10 example
inference_example.py: Inference wrapper (MultimodalGemma3Inference) with a command-line entry point
config.json, *.safetensors: Saved model weights and configuration

Model size: 0.3B params (Safetensors)
Tensor type: BF16
