---
license: apache-2.0
language:
- en
tags:
- multimodal
- vision-language
- gemma
- clip
- llava
- pytorch
- lightning
datasets:
- liuhaotian/LLaVA-Instruct-150K
pipeline_tag: image-to-text
---

# Multimodal Gemma-270M

A **multimodal vision-language model** that pairs Google's Gemma-3-270M language model with a CLIP vision encoder, trained on the full LLaVA-Instruct-150K dataset.

## Model Inference Examples

Here are real inference results from the trained model, grouped by scene; the images themselves are in `inference_results/` (see Files below).

### Animal Detection

*(examples: cats on a couch; a white cat sleeping)*

### Dog Recognition

*(example: a golden retriever in a park)*

### Room & Scene Understanding

*(examples: a modern kitchen; a clean kitchen)*

### Food & Objects

*(examples: a food scene; an apple on a table)*

### Activity & People

*(examples: a skate park; family dining)*

---

## Training Details

| Parameter | Value |
|-----------|-------|
| **Training Samples** | 157,712 (full LLaVA dataset) |
| **Epochs** | 3 |
| **Final Training Loss** | 1.333 |
| **Final Validation Loss** | 1.430 |
| **Total Parameters** | 539M |
| **Trainable Parameters** | 18.6M (3.4%) |
| **GPU** | NVIDIA A100 40GB |
| **Training Time** | ~9 hours |
| **Batch Size** | 20 (effective: 40) |
| **Precision** | bf16-mixed |
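
These settings map onto a PyTorch Lightning `Trainer` roughly as sketched below. This is a minimal illustration, not the repo's actual training script; in particular, the effective batch size of 40 is assumed to come from accumulating gradients over two batches of 20.

```python
import pytorch_lightning as pl

# Minimal sketch of a Trainer matching the table above (assumed, not the
# repo's actual configuration).
trainer = pl.Trainer(
    max_epochs=3,               # 3 epochs over 157,712 samples
    precision="bf16-mixed",     # bf16 mixed precision on the A100
    accelerator="gpu",
    devices=1,                  # single NVIDIA A100 40GB
    accumulate_grad_batches=2,  # per-step batch 20 -> effective batch 40
)
# trainer.fit(model, datamodule)  # model and datamodule come from the repo
```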

## Benchmark Results

| Benchmark | Score |
|-----------|-------|
| **Basic VQA** | 53.8% (7/13 correct) |
| **POPE Hallucination** | 20.0% |

### VQA Breakdown

- ✅ Animal identification (cats, dogs)
- ✅ Room identification (kitchen, living room)
- ✅ Object presence detection
- ⚠️ Color identification (moderate)
- ⚠️ Detailed attributes (needs improvement)
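
POPE probes object hallucination with binary questions such as "Is there a cat in the image?". A minimal sketch of how such a score can be computed, assuming the `model.generate(image, prompt)` API shown in the Usage section below; the question list and answer parsing here are illustrative, not the repo's actual evaluation code:

```python
from PIL import Image

# Illustrative POPE-style scoring loop (hypothetical file paths and
# questions; not the repo's evaluation script).
pope_questions = [
    ("images/cats.jpg", "Is there a cat in the image?", "yes"),
    ("images/kitchen.jpg", "Is there a surfboard in the image?", "no"),
]

correct = 0
for path, question, gold in pope_questions:
    image = Image.open(path)
    response = model.generate(image, question).strip().lower()
    prediction = "yes" if response.startswith("yes") else "no"
    correct += int(prediction == gold)

print(f"POPE accuracy: {correct / len(pope_questions):.1%}")
```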

## Architecture

| Component | Details |
|-----------|---------|
| **Language Model** | Google Gemma-3-270M with LoRA adapters |
| **Vision Encoder** | OpenAI CLIP ViT-Large/14 (frozen, 428M params) |
| **Vision Projector** | MLP (3.4M params) |
| **LoRA** | r=16, alpha=32, dropout=0.1 |
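
The LoRA row corresponds to a standard `peft` configuration along the following lines; the target modules are an assumption (typical attention projections) rather than something confirmed by the repo:

```python
from peft import LoraConfig

# Hypothetical PEFT config matching the table's hyperparameters; the
# target_modules list is an assumption, not taken from the repo.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```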

## Usage

```python
import torch
from PIL import Image

from src.models.multimodal_gemma import MultimodalGemma

# Build the model; `config` must be constructed as in the repo's scripts
model = MultimodalGemma(config)

# Load the Lightning checkpoint (weights are stored under "state_dict")
checkpoint = torch.load("final_model.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Run inference on a single image
image = Image.open("your_image.jpg")
prompt = "What do you see in this image?"
response = model.generate(image, prompt)
print(response)
```

## Files

| File | Size | Description |
|------|------|-------------|
| `final_model.ckpt` | 1.2GB | Full model checkpoint |
| `inference_results/` | 13.8MB | Example predictions with images |

## Links

- **GitHub**: [sagar431/multimodal-gemma-270m](https://github.com/sagar431/multimodal-gemma-270m)
- **Demo**: [HuggingFace Space](https://huggingface.co/spaces/sagar007/Multimodal-Gemma)

## References

- [LLaVA Paper](https://arxiv.org/abs/2304.08485)
- [Gemma Technical Report](https://arxiv.org/abs/2403.08295)

## License

Apache 2.0

## Acknowledgments

- Google for the Gemma models
- OpenAI for CLIP
- The LLaVA team for the multimodal architecture inspiration
- The PyTorch Lightning team