A Gemma-3 based multimodal model that combines vision and language understanding through a flexible architecture compatible with any TIMM vision encoder.
MultiGemma3 is an experimental multimodal model that extends Google's Gemma-3-270m with vision capabilities. The architecture demonstrates how to create multimodal models by integrating any vision encoder from the TIMM library with a language model.
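One common way to connect a vision encoder to a language model is to project the encoder's per-patch features into the language model's embedding space and place them in the token sequence. The sketch below illustrates only that idea and is not taken from multigemma3.py: the dimensions, the function names `linear_project` and `splice_image_tokens`, and the insertion position are all hypothetical, and plain Python lists stand in for tensors so the example runs without dependencies.

```python
import random


def linear_project(features, weight, bias):
    """Map each feature vector from vision_dim to text_dim: y = W x + b."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]


def splice_image_tokens(text_embeds, image_embeds, position):
    """Insert the projected image embeddings into the text sequence at `position`."""
    return text_embeds[:position] + image_embeds + text_embeds[position:]


# Hypothetical sizes, chosen only for illustration (not the real model's dimensions).
VISION_DIM, TEXT_DIM, NUM_PATCHES, NUM_TEXT_TOKENS = 8, 4, 3, 5

rng = random.Random(0)
# Stand-in for a vision encoder's output: one feature vector per image patch.
patch_features = [
    [rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)
]
# Stand-in for the language model's token embeddings.
text_embeds = [
    [rng.uniform(-1, 1) for _ in range(TEXT_DIM)] for _ in range(NUM_TEXT_TOKENS)
]
# Projector parameters: learned in a real model, random here.
weight = [[rng.uniform(-0.1, 0.1) for _ in range(VISION_DIM)] for _ in range(TEXT_DIM)]
bias = [0.0] * TEXT_DIM

image_embeds = linear_project(patch_features, weight, bias)
sequence = splice_image_tokens(text_embeds, image_embeds, position=1)
print(len(sequence), len(sequence[0]))
```

With these sizes the combined sequence has 8 positions (5 text tokens plus 3 image patches), each of width 4, which is the shape a language model would expect for its input embeddings.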
multigemma3.py:11-30)
multigemma3.py:33-47)
multigemma3.py:50-188)
python multigemma3trainer.py \
--batch_size 4 \
--lr 1e-4 \
--epochs 3 \
--train_samples 5000 \
--test_samples 1000
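As a rough guide to how long the run above lasts, the number of optimizer steps can be estimated from the flags. This is a back-of-the-envelope sketch that assumes one optimizer step per batch and no gradient accumulation; whether multigemma3trainer.py drops the last partial batch is not stated, so the helper `optimizer_steps` (a hypothetical name) exposes that as a parameter.

```python
import math


def optimizer_steps(train_samples, batch_size, epochs, drop_last=False):
    """Estimate total optimizer steps, assuming one step per batch."""
    if drop_last:
        per_epoch = train_samples // batch_size
    else:
        per_epoch = math.ceil(train_samples / batch_size)
    return per_epoch * epochs


# Values taken from the command above.
steps = optimizer_steps(train_samples=5000, batch_size=4, epochs=3)
print(steps)  # 3750
```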
from inference_example import MultimodalGemma3Inference
# Initialize model
model = MultimodalGemma3Inference(device='cuda')
# Process image with text
response = model.predict("path/to/image.jpg", prompt="IMG", max_new_tokens=10)
print(response)
# Text-only generation
text_response = model.generate_text("Hello, how are you?", max_new_tokens=50)
print(text_response)
python inference_example.py path/to/image.jpg --prompt "IMG" --max_tokens 10
multigemma3.py: Core model components and architecture
multigemma3trainer.py: Training script with CIFAR-10 example
inference_example.py: Clean inference implementation
config.json, *.safetensors: Saved model weights and configuration