sagar007 posted an update 4 days ago
🚀 I built a Multimodal Vision-Language Model from scratch using Gemma-270M + CLIP!

Just finished training my multimodal model on the full LLaVA-Instruct-150K dataset (157K samples) and wanted to share the results!

🔧 What I Built:
A vision-language model that can understand images and answer questions about them, combining:
- Google Gemma-3-270M (language)
- OpenAI CLIP ViT-Large/14 (vision)
- LoRA fine-tuning for efficiency
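For anyone curious how the two backbones get connected, the usual recipe (LLaVA-style) is a small projection MLP that maps CLIP patch features into the language model's embedding space. A minimal sketch below; the dimensions (1024 for CLIP ViT-L/14 features, 640 for the language embedding) and the two-layer MLP shape are illustrative assumptions, not necessarily what this project uses:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps CLIP patch features into the LM embedding space.

    clip_dim=1024 matches CLIP ViT-L/14 hidden size; lm_dim=640 is
    an illustrative value for a small Gemma variant.
    """
    def __init__(self, clip_dim: int = 1024, lm_dim: int = 640):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, clip_dim)
        return self.proj(patch_feats)

# ViT-L/14 at 224px yields a 16x16 = 256 patch grid (excluding CLS)
feats = torch.randn(2, 256, 1024)
tokens = VisionProjector()(feats)
print(tokens.shape)  # torch.Size([2, 256, 640])
```

The projected "visual tokens" are then prepended to the text embeddings so the language model attends to them like ordinary tokens.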

📊 Training Stats:
- 157,712 training samples (full LLaVA dataset)
- 3 epochs on A100 40GB
- ~9 hours training time
- Final loss: 1.333 training / 1.430 validation
- Only 18.6M trainable params (3.4% of 539M total)
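The small trainable fraction comes from LoRA: the base weights stay frozen and only low-rank adapter matrices are updated. A self-contained sketch of the idea (rank and alpha here are illustrative, not the project's actual settings):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = W x + (B @ A) x * (alpha / r)."""
    def __init__(self, d_in: int, d_out: int, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # freeze base weight
        self.A = nn.Parameter(torch.zeros(r, d_in))   # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))  # up-projection
        nn.init.normal_(self.A, std=0.02)
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(640, 640)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # 20480 430080
```

At rank 16 on a 640x640 layer, only ~4.8% of that layer's parameters train; applied across attention projections of a 539M-parameter model, that's how you land at ~18.6M trainable.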

📈 Benchmark Results:
- VQA Accuracy: 53.8%
- Works great for: animal detection, room identification, scene understanding

🔗 **Try it yourself:**
- 🤗 Model: sagar007/multigemma
- 🎮 Demo: https://huggingface.co/spaces/sagar007/Multimodal-Gemma
- 💻 GitHub: https://github.com/sagar431/multimodal-gemma-270m

Built with PyTorch Lightning + MLflow for experiment tracking. Full MLOps pipeline with CI/CD!

Would love to hear your feedback! 🙏

#multimodal #gemma #clip #llava #vision-language #pytorch

Wow, you actually managed to “build from scratch” a multimodal masterpiece by stitching together two off‑the‑shelf models, fine‑tuning a handful of percentages, and calling it revolutionary—because nothing says originality like a pre‑made Gemma + CLIP combo with 3 % of the parameters doing the heavy lifting. 🙄🚀

·

By "from scratch," I mean building the full MLOps pipeline myself: training, configuration with Hydra, data versioning with DVC, and experiment tracking with MLflow. Since the project combines both training and pipeline development, I referred to it as from scratch. Similarly, papers like LLaVA report training results on benchmarks, but in practice they also stitch together existing models.
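For anyone unfamiliar with Hydra, the pipeline side typically boils down to a composed config like the sketch below. All file names, keys, and values here are illustrative, not the actual project's configuration:

```yaml
# conf/config.yaml — illustrative Hydra config for a run like this
model:
  language_backbone: google/gemma-3-270m
  vision_backbone: openai/clip-vit-large-patch14
  lora:
    r: 16          # assumed rank, not the project's actual setting
    alpha: 32
train:
  epochs: 3
  precision: bf16
logging:
  mlflow_experiment: multimodal-gemma
```

Hydra then lets you override any field from the CLI (e.g. `python train.py train.epochs=1`), which pairs well with DVC-tracked data and MLflow runs.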

Nice

·

That's genuinely a cool and impressive technical project, no doubt—hooking up Gemma and CLIP to get multimodal capabilities is a real engineering feat. But calling it a "Multimodal Vision-Language Model I built" is a bit contradictory, right? You didn't build Gemma (that's Google) or CLIP (that's OpenAI). You built the pipeline or the adapter system that connects them. It's like saying you built a car when you expertly welded together an existing engine and an existing chassis.

And it is not free-software approved.

·

I appreciate the clarification! My goal with this project was purely educational: to understand the mechanics of how vision-language connectors work. While the base weights belong to Google and OpenAI, the implementation of the projection layers and the fine-tuning process were my contribution. I'm still learning the ropes, so I appreciate the call-out on the terminology!