---
license: apache-2.0
language:
  - en
tags:
  - multimodal
  - vision-language
  - gemma
  - clip
  - llava
  - pytorch
  - lightning
datasets:
  - liuhaotian/LLaVA-Instruct-150K
pipeline_tag: image-to-text
---

# Multimodal Gemma-270M

A multimodal vision-language model that combines Google's Gemma-270M language model with a CLIP vision encoder, trained on the full LLaVA-Instruct-150K dataset.

## 🎯 Model Inference Examples

Here are real inference results from the trained model (full images in `inference_results/`):

### 🐱 Animal Detection

*Images: cats on a couch; a white cat sleeping (with model predictions).*

### 🐕 Dog Recognition

*Image: a golden retriever in a park (with model prediction).*

### 🏠 Room & Scene Understanding

*Images: a modern kitchen; a clean kitchen.*

### 🍕 Food & Objects

*Images: a food scene; an apple on a table.*

### 🛹 Activity & People

*Images: a skate park; a family dining.*

## 📊 Training Details

| Parameter | Value |
|-----------|-------|
| Training Samples | 157,712 (full LLaVA dataset) |
| Epochs | 3 |
| Final Training Loss | 1.333 |
| Final Validation Loss | 1.430 |
| Total Parameters | 539M |
| Trainable Parameters | 18.6M (3.4%) |
| GPU | NVIDIA A100 40GB |
| Training Time | ~9 hours |
| Batch Size | 20 (effective: 40) |
| Precision | bf16-mixed |
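
The `pytorch` and `lightning` tags suggest the run was driven by PyTorch Lightning. Here is a minimal sketch of a `Trainer` configured to match the table above; the names and structure are assumptions, not the actual training script:

```python
import lightning as L

# Hypothetical Trainer mirroring the table above; the real training
# script may configure things differently.
trainer = L.Trainer(
    max_epochs=3,                # 3 epochs over the full LLaVA dataset
    precision="bf16-mixed",      # bf16 mixed precision, as listed
    accelerator="gpu",
    devices=1,                   # single NVIDIA A100 40GB
    accumulate_grad_batches=2,   # per-step batch of 20 -> effective 40
)
# trainer.fit(model, train_dataloader, val_dataloader)
```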

## 📈 Benchmark Results

| Benchmark | Score |
|-----------|-------|
| Basic VQA | 53.8% (7/13 correct) |
| POPE Hallucination | 20.0% |
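
POPE (Polling-based Object Probing Evaluation) probes hallucination with yes/no questions about object presence. A minimal sketch of how such a yes/no score can be computed, reusing the `model.generate` interface from the Usage section below; this is illustrative, not the exact harness behind the number above:

```python
# Hypothetical POPE-style scorer: `samples` holds (image, question, answer)
# triples where answer is "yes" or "no".
def yes_no_accuracy(model, samples):
    correct = 0
    for image, question, answer in samples:
        response = model.generate(image, question).strip().lower()
        predicted = "yes" if response.startswith("yes") else "no"
        correct += predicted == answer
    return correct / len(samples)
```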

### VQA Breakdown

- ✅ Animal identification (cats, dogs)
- ✅ Room identification (kitchen, living room)
- ✅ Object presence detection
- ⚠️ Color identification (moderate)
- ⚠️ Detailed attributes (needs improvement)

πŸ—οΈ Architecture

Component Details
Language Model Google Gemma-3-270M with LoRA adapters
Vision Encoder OpenAI CLIP ViT-Large/14 (frozen, 428M params)
Vision Projector MLP (3.4M params)
LoRA r=16, alpha=32, dropout=0.1
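
As a rough sketch of how these components connect in a LLaVA-style design: CLIP features pass through the MLP projector into Gemma's embedding space, while LoRA adapts the language model. The dimensions, module names, and LoRA target modules below are illustrative assumptions, not the repository's actual code:

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

clip_dim = 1024     # CLIP ViT-Large/14 feature width
hidden_dim = 2048   # assumed projector hidden size (~3.4M params total)
gemma_dim = 640     # assumed Gemma-3-270M embedding width

# MLP projector mapping vision features into the language model's space
projector = nn.Sequential(
    nn.Linear(clip_dim, hidden_dim),
    nn.GELU(),
    nn.Linear(hidden_dim, gemma_dim),
)

# LoRA adapters matching the r/alpha/dropout row above;
# target_modules is an assumption
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)
language_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
language_model = get_peft_model(language_model, lora_config)
```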

## 🚀 Usage

```python
from src.models.multimodal_gemma import MultimodalGemma
import torch
from PIL import Image

# Load model (`config` is the model configuration object from this repository)
model = MultimodalGemma(config)
checkpoint = torch.load("final_model.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Inference
image = Image.open("your_image.jpg")
prompt = "What do you see in this image?"
response = model.generate(image, prompt)
print(response)
```

πŸ“ Files

File Size Description
final_model.ckpt 1.2GB Full model checkpoint
inference_results/ 13.8MB Example predictions with images


## 📄 License

Apache 2.0

πŸ™ Acknowledgments

  • Google for Gemma models
  • OpenAI for CLIP
  • LLaVA team for multimodal architecture inspiration
  • PyTorch Lightning team