---
license: apache-2.0
language:
  - en
tags:
  - multimodal
  - vision-language
  - gemma
  - clip
  - llava
  - pytorch
  - lightning
datasets:
  - liuhaotian/LLaVA-Instruct-150K
pipeline_tag: image-to-text
---

# Multimodal Gemma-270M

A multimodal vision-language model that combines Google's Gemma-270M language model with a CLIP vision encoder, trained on the full LLaVA-Instruct-150K dataset.

## 🎯 Model Inference Examples

Here are real inference results from the trained model (full images in `inference_results/`):

### 🐱 Animal Detection

*Images: cats on a couch; a white cat sleeping (with model predictions).*

### 🐕 Dog Recognition

*Image: a golden retriever in a park (with model prediction).*

### 🏠 Room & Scene Understanding

*Images: a modern kitchen; a clean kitchen.*

### 🍕 Food & Objects

*Images: a food scene; an apple on a table.*

### 🛹 Activity & People

*Images: a skate park; a family dining.*

## 📊 Training Details

| Parameter | Value |
|-----------|-------|
| Training Samples | 157,712 (full LLaVA dataset) |
| Epochs | 3 |
| Final Training Loss | 1.333 |
| Final Validation Loss | 1.430 |
| Total Parameters | 539M |
| Trainable Parameters | 18.6M (3.4%) |
| GPU | NVIDIA A100 40GB |
| Training Time | ~9 hours |
| Batch Size | 20 (effective: 40) |
| Precision | bf16-mixed |
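
The `pytorch` and `lightning` tags suggest the run was driven by PyTorch Lightning. Here is a minimal sketch of a `Trainer` configured to match the table above; the names and structure are assumptions, not the actual training script:

```python
import lightning as L

# Hypothetical Trainer mirroring the table above; the real training
# script may configure things differently.
trainer = L.Trainer(
    max_epochs=3,                # 3 epochs over the full LLaVA dataset
    precision="bf16-mixed",      # bf16 mixed precision, as listed
    accelerator="gpu",
    devices=1,                   # single NVIDIA A100 40GB
    accumulate_grad_batches=2,   # per-step batch of 20 -> effective 40
)
# trainer.fit(model, train_dataloader, val_dataloader)
```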

## 📈 Benchmark Results

| Benchmark | Score |
|-----------|-------|
| Basic VQA | 53.8% (7/13 correct) |
| POPE Hallucination | 20.0% |
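
POPE (Polling-based Object Probing Evaluation) probes hallucination with yes/no questions about object presence. A minimal sketch of how such a yes/no score can be computed, reusing the `model.generate` interface from the Usage section below; this is illustrative, not the exact harness behind the number above:

```python
# Hypothetical POPE-style scorer: `samples` holds (image, question, answer)
# triples where answer is "yes" or "no".
def yes_no_accuracy(model, samples):
    correct = 0
    for image, question, answer in samples:
        response = model.generate(image, question).strip().lower()
        predicted = "yes" if response.startswith("yes") else "no"
        correct += predicted == answer
    return correct / len(samples)
```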

### VQA Breakdown

- ✅ Animal identification (cats, dogs)
- ✅ Room identification (kitchen, living room)
- ✅ Object presence detection
- ⚠️ Color identification (moderate)
- ⚠️ Detailed attributes (needs improvement)

πŸ—οΈ Architecture

Component Details
Language Model Google Gemma-3-270M with LoRA adapters
Vision Encoder OpenAI CLIP ViT-Large/14 (frozen, 428M params)
Vision Projector MLP (3.4M params)
LoRA r=16, alpha=32, dropout=0.1
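
As a rough sketch of how these components connect in a LLaVA-style design: CLIP features pass through the MLP projector into Gemma's embedding space, while LoRA adapts the language model. The dimensions, module names, and LoRA target modules below are illustrative assumptions, not the repository's actual code:

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

clip_dim = 1024     # CLIP ViT-Large/14 feature width
hidden_dim = 2048   # assumed projector hidden size (~3.4M params total)
gemma_dim = 640     # assumed Gemma-3-270M embedding width

# MLP projector mapping vision features into the language model's space
projector = nn.Sequential(
    nn.Linear(clip_dim, hidden_dim),
    nn.GELU(),
    nn.Linear(hidden_dim, gemma_dim),
)

# LoRA adapters matching the r/alpha/dropout row above;
# target_modules is an assumption
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)
language_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
language_model = get_peft_model(language_model, lora_config)
```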

## 🚀 Usage

```python
from src.models.multimodal_gemma import MultimodalGemma
import torch
from PIL import Image

# Load model (`config` is the model configuration object from this repository)
model = MultimodalGemma(config)
checkpoint = torch.load("final_model.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Inference
image = Image.open("your_image.jpg")
prompt = "What do you see in this image?"
response = model.generate(image, prompt)
print(response)
```

πŸ“ Files

File Size Description
final_model.ckpt 1.2GB Full model checkpoint
inference_results/ 13.8MB Example predictions with images


## 📄 License

Apache 2.0

πŸ™ Acknowledgments

  • Google for Gemma models
  • OpenAI for CLIP
  • LLaVA team for multimodal architecture inspiration
  • PyTorch Lightning team