---
license: apache-2.0
language:
- en
tags:
- multimodal
- vision-language
- gemma
- clip
- llava
- pytorch
- lightning
datasets:
- liuhaotian/LLaVA-Instruct-150K
pipeline_tag: image-to-text
---

# Multimodal Gemma-270M

A **multimodal vision-language model** that pairs Google's Gemma-3-270M language model with a CLIP vision encoder, trained on the full LLaVA-Instruct-150K dataset.

## Model Inference Examples

Here are real inference results from the trained model, grouped by scene; the images themselves are in `inference_results/` (see Files below).

### Animal Detection

*(examples: cats on a couch; a white cat sleeping)*

### Dog Recognition

*(example: a golden retriever in a park)*

### Room & Scene Understanding

*(examples: a modern kitchen; a clean kitchen)*

### Food & Objects

*(examples: a food scene; an apple on a table)*

### Activity & People

*(examples: a skate park; family dining)*

---

## Training Details

| Parameter | Value |
|-----------|-------|
| **Training Samples** | 157,712 (full LLaVA dataset) |
| **Epochs** | 3 |
| **Final Training Loss** | 1.333 |
| **Final Validation Loss** | 1.430 |
| **Total Parameters** | 539M |
| **Trainable Parameters** | 18.6M (3.4%) |
| **GPU** | NVIDIA A100 40GB |
| **Training Time** | ~9 hours |
| **Batch Size** | 20 (effective: 40) |
| **Precision** | bf16-mixed |
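
These settings map onto a PyTorch Lightning `Trainer` roughly as sketched below. This is a minimal illustration, not the repo's actual training script; in particular, the effective batch size of 40 is assumed to come from accumulating gradients over two batches of 20.

```python
import pytorch_lightning as pl

# Minimal sketch of a Trainer matching the table above (assumed, not the
# repo's actual configuration).
trainer = pl.Trainer(
    max_epochs=3,               # 3 epochs over 157,712 samples
    precision="bf16-mixed",     # bf16 mixed precision on the A100
    accelerator="gpu",
    devices=1,                  # single NVIDIA A100 40GB
    accumulate_grad_batches=2,  # per-step batch 20 -> effective batch 40
)
# trainer.fit(model, datamodule)  # model and datamodule come from the repo
```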

## Benchmark Results

| Benchmark | Score |
|-----------|-------|
| **Basic VQA** | 53.8% (7/13 correct) |
| **POPE Hallucination** | 20.0% |

### VQA Breakdown

- ✅ Animal identification (cats, dogs)
- ✅ Room identification (kitchen, living room)
- ✅ Object presence detection
- ⚠️ Color identification (moderate)
- ⚠️ Detailed attributes (needs improvement)
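
POPE probes object hallucination with binary questions such as "Is there a cat in the image?". A minimal sketch of how such a score can be computed, assuming the `model.generate(image, prompt)` API shown in the Usage section below; the question list and answer parsing here are illustrative, not the repo's actual evaluation code:

```python
from PIL import Image

# Illustrative POPE-style scoring loop (hypothetical file paths and
# questions; not the repo's evaluation script).
pope_questions = [
    ("images/cats.jpg", "Is there a cat in the image?", "yes"),
    ("images/kitchen.jpg", "Is there a surfboard in the image?", "no"),
]

correct = 0
for path, question, gold in pope_questions:
    image = Image.open(path)
    response = model.generate(image, question).strip().lower()
    prediction = "yes" if response.startswith("yes") else "no"
    correct += int(prediction == gold)

print(f"POPE accuracy: {correct / len(pope_questions):.1%}")
```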

## Architecture

| Component | Details |
|-----------|---------|
| **Language Model** | Google Gemma-3-270M with LoRA adapters |
| **Vision Encoder** | OpenAI CLIP ViT-Large/14 (frozen, 428M params) |
| **Vision Projector** | MLP (3.4M params) |
| **LoRA** | r=16, alpha=32, dropout=0.1 |
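
The LoRA row corresponds to a standard `peft` configuration along the following lines; the target modules are an assumption (typical attention projections) rather than something confirmed by the repo:

```python
from peft import LoraConfig

# Hypothetical PEFT config matching the table's hyperparameters; the
# target_modules list is an assumption, not taken from the repo.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```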

## Usage

```python
import torch
from PIL import Image

from src.models.multimodal_gemma import MultimodalGemma

# Build the model; `config` must be constructed as in the repo's scripts
model = MultimodalGemma(config)

# Load the Lightning checkpoint (weights are stored under "state_dict")
checkpoint = torch.load("final_model.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Run inference on a single image
image = Image.open("your_image.jpg")
prompt = "What do you see in this image?"
response = model.generate(image, prompt)
print(response)
```

## Files

| File | Size | Description |
|------|------|-------------|
| `final_model.ckpt` | 1.2GB | Full model checkpoint |
| `inference_results/` | 13.8MB | Example predictions with images |

## Links

- **GitHub**: [sagar431/multimodal-gemma-270m](https://github.com/sagar431/multimodal-gemma-270m)
- **Demo**: [HuggingFace Space](https://huggingface.co/spaces/sagar007/Multimodal-Gemma)

## References

- [LLaVA Paper](https://arxiv.org/abs/2304.08485)
- [Gemma Technical Report](https://arxiv.org/abs/2403.08295)

## License

Apache 2.0

## Acknowledgments

- Google for the Gemma models
- OpenAI for CLIP
- The LLaVA team for the multimodal architecture inspiration
- The PyTorch Lightning team