---
license: apache-2.0
language:
- en
tags:
- multimodal
- vision-language
- gemma
- clip
- llava
- pytorch
- lightning
datasets:
- liuhaotian/LLaVA-Instruct-150K
pipeline_tag: image-to-text
---
# Multimodal Gemma-270M
A **multimodal vision-language model** that combines Google Gemma-3-270M with a frozen CLIP ViT-Large/14 vision encoder, trained on the full LLaVA-Instruct-150K dataset.
## 🎯 Model Inference Examples
Here are real inference results from our trained model:
### 🐱 Animal Detection
| Cats on Couch | White Cat Sleeping |
|---------------|-------------------|
| ![Cat Prediction](inference_results/sample_001_prediction.png) | ![White Cat](inference_results/sample_009_prediction.png) |
### πŸ• Dog Recognition
| Golden Retriever in Park |
|-------------------------|
| ![Dog Prediction](inference_results/sample_007_prediction.png) |
### 🏠 Room & Scene Understanding
| Modern Kitchen | Clean Kitchen |
|---------------|---------------|
| ![Kitchen 1](inference_results/sample_003_prediction.png) | ![Kitchen 2](inference_results/sample_004_prediction.png) |
### πŸ• Food & Objects
| Food Scene | Apple on Table |
|------------|----------------|
| ![Food](inference_results/sample_002_prediction.png) | ![Apple](inference_results/sample_008_prediction.png) |
### 🛹 Activity & People
| Skate Park | Family Dining |
|------------|---------------|
| ![Skate Park](inference_results/sample_005_prediction.png) | ![Family](inference_results/sample_006_prediction.png) |
---
## 📊 Training Details
| Parameter | Value |
|-----------|-------|
| **Training Samples** | 157,712 (full LLaVA-Instruct-150K dataset) |
| **Epochs** | 3 |
| **Final Training Loss** | 1.333 |
| **Final Validation Loss** | 1.430 |
| **Total Parameters** | 539M |
| **Trainable Parameters** | 18.6M (3.4%) |
| **GPU** | NVIDIA A100 40GB |
| **Training Time** | ~9 hours |
| **Batch Size** | 20 (effective: 40) |
| **Precision** | bf16-mixed |
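
The figures in the table can be cross-checked with some quick arithmetic. The sketch below assumes the effective batch size of 40 comes from gradient accumulation on a single GPU and that every sample is seen once per epoch; neither detail is stated on this card, so treat the derived step counts as estimates.

```python
import math

# Values copied from the Training Details table.
train_samples = 157_712
epochs = 3
micro_batch = 20
effective_batch = 40
total_params = 539e6
trainable_params = 18.6e6

# Assumption: effective batch = micro batch x gradient accumulation steps.
accumulation_steps = effective_batch // micro_batch

# Assumption: the last partial batch is kept, so round up.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

# Share of parameters that receive gradients (LoRA adapters + projector).
trainable_pct = 100 * trainable_params / total_params

print(f"accumulation steps: {accumulation_steps}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"optimizer steps in total: {total_steps}")
print(f"trainable share: {trainable_pct:.2f}%")
```

Because the parameter counts in the table are rounded, the computed trainable share lands between 3.4% and 3.5% rather than matching the table to the last digit.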
## 📈 Benchmark Results
| Benchmark | Score |
|-----------|-------|
| **Basic VQA** | 53.8% (7/13 correct) |
| **POPE Hallucination** | 20.0% |
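
For readers who want to reproduce this kind of score, here is a minimal sketch of how yes/no object-presence answers are commonly scored in POPE-style evaluations. It is not the evaluation script used for this model: the function name `pope_style_metrics` and the toy answers are hypothetical, and the card does not say which of these metrics the 20.0% figure refers to.

```python
def pope_style_metrics(predictions, labels):
    """Score yes/no answers, treating "yes" as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)
    tn = sum(p == "no" and y == "no" for p, y in pairs)
    total = tp + fp + fn + tn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A high yes-ratio suggests the model tends to answer "yes" by default.
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }


# Toy answers for illustration only; these are not results from this model.
toy_predictions = ["yes", "yes", "no", "yes"]
toy_labels = ["yes", "no", "no", "no"]
metrics = pope_style_metrics(toy_predictions, toy_labels)
print(metrics)

# The Basic VQA row is plain accuracy: 7 correct answers out of 13.
basic_vqa_accuracy = 7 / 13
print(f"{basic_vqa_accuracy:.1%}")
```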
### VQA Breakdown
- ✅ Animal identification (cats, dogs)
- ✅ Room identification (kitchen, living room)
- ✅ Object presence detection
- ⚠️ Color identification (moderate)
- ⚠️ Detailed attributes (needs improvement)
## πŸ—οΈ Architecture
| Component | Details |
|-----------|---------|
| **Language Model** | Google Gemma-3-270M with LoRA adapters |
| **Vision Encoder** | OpenAI CLIP ViT-Large/14 (frozen, 428M params) |
| **Vision Projector** | MLP (3.4M params) |
| **LoRA** | r=16, alpha=32, dropout=0.1 |
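
To make the parameter counts in the table more concrete, the sketch below does the arithmetic for a two-layer MLP projector and for a single LoRA-adapted linear layer. The dimensions are assumptions, not values taken from this card: 1024 for the CLIP ViT-L/14 feature width, 640 for the Gemma-3-270M hidden size, and a hypothetical projector hidden width of 2048. Under those assumptions the projector comes out near the 3.4M listed above, but the actual layer layout may differ.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of a dense layer mapping d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)


def mlp_projector_params(d_vision, d_hidden, d_text):
    """Two-layer MLP: vision features -> hidden -> language model width."""
    return linear_params(d_vision, d_hidden) + linear_params(d_hidden, d_text)


def lora_params(d_in, d_out, r):
    """LoRA adds two low-rank matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Assumed dimensions (not stated on this card).
CLIP_WIDTH = 1024      # CLIP ViT-L/14 hidden width
GEMMA_WIDTH = 640      # Gemma-3-270M hidden size
PROJECTOR_HIDDEN = 2048  # hypothetical projector hidden width

projector_total = mlp_projector_params(CLIP_WIDTH, PROJECTOR_HIDDEN, GEMMA_WIDTH)

# LoRA settings from the table: r=16, alpha=32.
LORA_R = 16
LORA_ALPHA = 32
lora_scale = LORA_ALPHA / LORA_R  # update is scaled by alpha / r
per_layer_lora = lora_params(GEMMA_WIDTH, GEMMA_WIDTH, LORA_R)

print(f"projector parameters: {projector_total:,}")
print(f"LoRA scale (alpha / r): {lora_scale}")
print(f"LoRA parameters for one {GEMMA_WIDTH}x{GEMMA_WIDTH} layer: {per_layer_lora:,}")
```

With alpha=32 and r=16, each low-rank update is scaled by 2.0 before it is added to the frozen weight, which is why changing `r` without adjusting `alpha` also changes the effective update magnitude.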
## 🚀 Usage
```python
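import torch
from PIL import Image

# MultimodalGemma is defined in the project's GitHub repository (see Links).
from src.models.multimodal_gemma import MultimodalGemma

# `config` is the model configuration used for training. This card does not
# show how it is built, so construct it as described in the repository.
model = MultimodalGemma(config)

# Load the checkpoint on CPU first; move the model to a GPU afterwards if needed.
# Note: PyTorch 2.6+ defaults to weights_only=True in torch.load, which may need
# to be disabled for checkpoints that store more than tensors.
checkpoint = torch.load("final_model.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Run inference without tracking gradients
image = Image.open("your_image.jpg").convert("RGB")
prompt = "What do you see in this image?"
with torch.no_grad():
    response = model.generate(image, prompt)
print(response)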
```
## πŸ“ Files
| File | Size | Description |
|------|------|-------------|
| `final_model.ckpt` | 1.2GB | Full model checkpoint |
| `inference_results/` | 13.8MB | Example predictions with images |
## 🔗 Links
- **GitHub**: [sagar431/multimodal-gemma-270m](https://github.com/sagar431/multimodal-gemma-270m)
- **Demo**: [HuggingFace Space](https://huggingface.co/spaces/sagar007/Multimodal-Gemma)
## 📚 References
- [LLaVA Paper](https://arxiv.org/abs/2304.08485)
- [Gemma Technical Report](https://arxiv.org/abs/2403.08295)
## 📄 License
Apache 2.0
## πŸ™ Acknowledgments
- Google for Gemma models
- OpenAI for CLIP
- LLaVA team for multimodal architecture inspiration
- PyTorch Lightning team